This is part 2 in a two-part series of blogs on large-scale and explainable forecasting using APL. In part 1 I have outlined a way to utilize the APL library for in-database training of a regression model in HANA in order to be used together with an external Node.js inference script.
In this part of the blog I will dive deeper into built-in functionality to retrieve insights into a trained model which is called the ‘model debrief’.
Being able to explain the inner workings of a model (also “explainable AI” or “XAI”) is an important topic in applying machine learning in enterprise environments and is recently gaining much attention. The APL regression modeling algorithm uses a gradient boosting tree under the hood which is extremely well suited for this purpose.
Now what is exactly meant by this? A common perception is that a machine learning model is a “black box” generating predictions from which it is very difficult to obtain the root cause telling why a specific prediction was made. For instance if a prediction is off by 300% it is valuable to learn which specific variable was the root cause for the wrong prediction. The Advanced Predictive Library allows to view the exact influencers of each model and allows to decompose a prediction into the components of the various variables. It is therefore extremely helpful in communicating its forecast results to business users.
Installing example data
For the debrief example I will be using one of the bundled APL examples from the package you can download from https://support.sap.com. Go to ‘Software downloads’ and type ‘APL’ to download a recent version. After extracting the package you should be able to see a ‘samples/data’ folder which contains the samples to be imported into HANA.
Now open your Eclipse / HANA Studio and go to File -> Import and choose Delivery Unit. Navigate to the ‘samples/data’ folder to import the sample tables into your system. By default these will be loaded in their own schema named ‘APL_SAMPLES’. After importing you should see about 10 sample tables installed which cater to different use cases.
I will use the ADULT01 table for this example which is a dataset of persons with some properties on their education, marital status, relationship, etc. There are multiple numeric fields which are predictacle, for this example we will try to predict a person’s age based on the other input characteristics. An excerpt of the table is shown below (not all columns are included).
For training the model I will present an example using the new “any” procedure syntax which requires HANA 2.0 SPS03. For developers who have prior experience with APL or PAL you may have seen the former syntax where you needed to create a wrapper procedure first before you were able to execute an APL function. This is called the ‘direct’ method which also works fine but just required a bit more boilerplate code and physical table creation. Therefore I find the below setup preferable for experimentation purposes.
The new APL procedure syntax uses functions which are supplied in the SAP_PA_APL schema and can be called from anywhere in the database.
Also note that the above code is not creating physical tables but uses objects to reference internal tables instead. This is possible due to wrapping the code in a DO BEGIN … END block which in fact handles the inner code as if it were an anonymous stored procedure and at the same time enables stored procedure constructs like internal tables.
The program trains a regression model with the ‘age’ variable as target and the others as independent variables. After training it calls the GET_MODEL_DEBRIEF function to extract the model statistics and stores these in a set of tables called the DEBRIEF_PROPERTY and DEBRIEF_METRIC tables. Of course these could be physically stored in HANA for later use as well.
To extract the debrief information you’ll require access to the DEBRIEF_PROPERTY and DEBRIEF_METRIC tables which have been filled using GET_MODEL_DEBRIEF. SAP recommends not to query directly on these two tables as their internal structure or the way they store information may change in the future. The preferred way to extract information is to use the supplied functions which live in the SAP_PA_APL schema by using the following syntax:
The example shows the ‘ContinuousVariables’ report which looks as below. It lists the continuous variables in the dataset with their descriptive statistics split into the estimation and validation datasets (eg. the train and test sets APL uses internally).
There is a number of other functions you can use, depending on the type of model you have selected for your problem. Please refer to the APL documentation under topic ‘Statistical Reports’ for the full list by modeling category. For a regression model as we are doing in this example the following functions are available:
If you just want to the output for each of these functions just call all the corresponding SELECTs sequentially and Eclipse / HANA Studio will create tabs for the output of each separate query. This allows you to quickly navigate all statistical debrief information from a single screen.
One of the most interesting overviews is the ‘ClassificationRegression_VariablesContribution’ report which displays all independent variables in the model and their contribution towards explaining the target. Remember we are using these variables to predict an person’s age, so apparently the martial status and relationship variables explain this for about 52%.
Now the question arises how these variables are exactly explaining the age target. For this information you should look at the ‘ContinuousTarget_GroupCrossStatistics’ report as shown below. As you can see a Widowed person has a mean age of 59 whereas a person who has never been married has a mean age of 28. This all seems quite logical but is now supported by the statistical algorithm.
Model apply details
I have now shown you how to get more insights into trained models based on describing the model internals using various built-in debrief functions. One of the most helpful functions is the model influencers overview which lists the influencing variables based on their contribution. It would also be interesting not only to retrieve the statistics of the trained models but also get more details on the model apply phase to see how each of these influencers is guiding the prediction.
The first step is that you require to create a table with the correct structure to store both the apply result and the influencers together. Because this is dependent on the apply settings you need to call a function to get the correct table type first:
This code will first train the model (again) and then retrieve the table structure to be used for storing the “Advanced Apply Functions”. This setting configures the extra information you want to store together with the actual prediction from the apply function, in this case set by the ‘APL/ApplyReasonCode/TopCount’ parameter which gives the top 5 reasons explaining the prediction. There are many other settings available here which you can find in the APL manual.
This result resembles the following table structure:
Make sure to create to execute this CREATE statement to create the proper output table. Now we will create a new table containing some unseen records resembling the ADULT01 structure to do the apply on. Use the following statement to create an empty copy of the ADULT01 table:
Now insert a few records:
Now for the final step is to run the apply on these new records and have a look at the influencers:
This should give you the below output.
As you can see the model predicts an age of 32 for the first person and 45 for the second. The top 5 reasons for this prediction are listed to the right and are split in these categories:
- Reason name: name of the influencing variable
- Reason value: value which led to the prediction
- Strength indicator: tells if the variable is giving an uplift (positive) or downlift (negative) to the prediction and its strength (e.g. strong, meaningful, weak)
For our example records it means that the first record was mostly influenced because the marital status was “Never-married” which is negatively strong, meaning that age is pushed down. As you can see a relatively low age of 32 is predicted. The second record has positive influencers on workclass which is Self employed and marital-status is Married, which both give a positive uplift to the age leading to a prediction of 45.
Note that this shows that the model influencers as described in the model debrief will be applied with different priorities and with different weights dependent on each individual forecast!
In this blog I have shown you an approach to train a regression model in APL and extract its influencers from the model by using debrief functions. I have also shown in what way each of these influencers are contributing to the target by looking into the cross-statistics. In a second step I have shown how to get the influencers for each individual apply together with the apply results.
Using an approach as described here is very useful in debugging forecasts made by APL and can assist in explaining why certain forecasts which are too far off have been made. This will also allow to better surface data quality issues.
Note: For viewing the code, click here.