Last year I became involved in a project for a retailer based in The Netherlands who had finished construction of a new distribution center. This new innovative DC is fully mechanized and is operated with a minimal amount of personnel, which is different from conventional distribution centers where sudden large order spikes are fulfilled by having more staff pick these orders in parallel. The automated nature of the machinery in this new DC means that the order picking process has a fixed capacity and cannot be scaled up easily by adding more personnel on the fly. In order to optimize its capacity, the DC will therefore also be performing tasks during quiet hours.
These tasks are derived from a forecast the client must supply to the DC controlling software. In practice a 7-day’s forecast needs to be provided to the DC on a daily basis.
Forecasting setup and project approach
For generating these forecasts an SAP HANA system was loaded with a data mart of previous daily goods movements from the DC to all stores on an individual article level. The data mart contains about 5 years of history for each of the 70 stores with an assortment of about 10.000 articles. As you can tell this is a quite large dataset but it the large history is very helpful in determining seasonality patterns occurring yearly.
During the project we are loosely applying the CRISP-DM project approach in an iterative fashion. This approach makes sure phases as data understanding, modeling and evaluation are conducted in iteratively, improving upon the setup each time round.
The technology platform was based on the Advanced Predictive Library (APL) on a HANA 2.0 SPS05 system and an SAP Data Services system for the triggering of the database stored procedures and performing ETL on the resulting forecast.
Models are being trained on a store/article level, which means that 700.000 models are trained during a single run. As the underlying hardware has a large number of available cores all stores can be trained in parallel. Every model has about 150 input variables, ranging from data about the articles themselves, the stores, banking holidays, the weather and the past goods movements of that article for the last days. All of these variables are fed to APL which handles the selection of most relevant variables, removes the variables which have correlations amongst each other, etc. The modeling process takes about 12 hours to complete, running in parallel for all stores.
The main technical challenge of this setup was the optimization of the performance of both the train and apply procedures. As indicated the system manages to train the 700.000 models in about 12 hours, which was acceptable because this process is scheduled to run weekly. It is assumed that the models retain their value during the week as they have been trained on multiple years of data and the external circumstances will not alter that quickly. However the in-database APL apply procedure ran for about 9 hours which would not fit in the appointed time slot between other tasks the system was performing daily.
The high-level steps to make this setup work are as follows:
- Train a model using the in-database APL train functions as normal.
- Export the model to JSON format using the EXPORT_APPLY_CODE built-in function.
- Extract the JSON model and a JSON-formatted extract of the future days the model needs to be applied on from the database into a filesystem.
- Apply the model onto the future days to get the forecast.
As you can see this is a few extra steps when compared to simply calling an in-database procedure to perform a forecast within HANA. But even when taking into account the additional exports of the model and forecast data the total runtime is several orders of magnitude faster than the in-database apply. In fact this process now runs in about 45 minutes including the time taken to move data to the file system.
Detailed technical explanation
I will now give a detailed explanation of the high level steps so you can try a similar setup for you own forecasting scenario. You will need access to an HANA environment with APL >= version 2018.2 using Eclipse (or HANA studio).
Training the APL model
For the brevity of this blog I am assuming a bit prior knowledge of training a model using APL. There are good examples available in the APL documentation on https://help.sap.com under the CREATE_MODEL_AND_TRAIN function reference. A set of good examples is also available by downloading the APL package from the SAP support site and looking into the ‘samples’-folder.
In this setup I will be training a regression model with a target called TARGET_GOODS_MOVEMENTS based on a set of independent variables. For the sake of this example I will just be using two: DISCOUNT_INDICATOR which is a boolean (0 or 1) variable indicating if the article is discounted and a DISCOUNT_VALUE which a continuous value indicating the applied discount. A real-world example may possibly use hundreds of independent variables.
This makes the following table which we call MOVEMENT_HISTORY_TABLE:
Note that this is the structure of the historical dataset. Obviously the future dataset will not contain values for the TARGET_GOODS_MOVEMENTS column, however you do need future information on whether the article is discounted. This needs to be available in the same time horizon on which you want to perform the forecast.
When calling CREATE_MODEL_AND_TRAIN make sure to output the trained model into a table called OUT_MODEL_REGRESSION.
Export the model to JSON format
At this point table OUT_EXPORT_JSON_REGRESSION will contain your model in JSON format. This is a human-readable format which may expose some details about the model you haven’t noticed earlier. To get this using HANA studio make sure to convert the model payload to a VARCHAR:
SELECT TO_VARCHAR(VALUE) FROM OUT_EXPORT_JSON_REGRESSION
Now right-click on the output cell and select “Copy cells” and paste it into your favorite text editor. This will display the JSON in a formatted way. Note for instance that the DATE-variable is automatically transformed into others like “MonthOfYear”, “DayOfWeek” and many more to make a single date more descriptive. This is one of the automated activities that APL offloads for you.
Extract the JSON format of the model and future days
Please refer to https://nodejs.org for instructions on installing Node.js, there are native binaries for Windows and MacOS available. When using Linux you should look into using your package manager.
Node.js comes with a package manager called ‘npm’ which is useful for installing and managing libraries or add-ons you require in your program. You can check the full list of available packages at https://www.npmjs.com/. For this example we require the SAP HANA client package to natively and easily connect Node.js to HANA. Use command ‘npm install @sap/hana-client’ from your working directory where the scripts are saved.
Save the below code to a file (eg. extract-models.js) and run it using ‘node extract-models.js’. The program will set up a connection to the HANA system and will write the model from the table to a single file. Note: do not forget to modify the connection parameters to your HANA system.
To extract the future days you will be forecasting create a new file called ‘extract-forecast-days.js’ and use the below code. I have highlighted the lines that have been changed compared to the previous extraction program.
After running the modified code you should have an additional file called ‘forecast-days-output.json’ together with the ‘model-output.json’ file created in the previous program.
Apply the model onto future days to get the forecast
The both files created in the previous steps need to be applied to one another to generate the forecast. The model encodes information about the expected target based on the independent variables DATE (and its derived variables), DISCOUNT_INDICATOR and DISCOUNT_AMOUNT. These dependent variables are part of the future days in the forecasting horizon and are expected to be known upfront.
Now save the below code into a file called ‘predict.js’:
After running the code you will see a statement logging the prediction for the future day you want to get the forecast for. These steps will need to be repeated for each day in the forecasting horizon, so you will possibly need to create a ‘forecast-days-output.json’ file for each of those days. The prediction program should now loop over those files to get a full prediction, possibly logging them into a local file or writing them back into the HANA database.
I have shown you an approach to use a model which was trained in HANA using APL in a standalone environment based on Node.js by exporting the model to a specific format and applying this to the forecasting horizon. The end result in our solution was a huge runtime improvement over the in-database apply functions that APL provides in HANA itself which went down from hours to minutes. This makes the solution very useful for large-scale processing of forecasts.
Of course there are many improvements that could be made to this setup which require a more in-depth knowledge of Node.js engine and programming model, like managing a larger number of models or enabling multithreading to allow for parallel model scoring. If interested you should search the internet for resources on this as the current approach gives you a baseline setup.
In the next part of this blog I will present an approach for explainable forecasting using the APL which allows a business user to look “under the hood” in case of forecasting anomalies.
Note: For viewing the code, click here.