What is backtesting?

This article explains what backtesting is in the AI & Analytics Engine.

Backtesting is the standard recommended method for evaluating machine-learning models built for forecasting. It is designed to give the best indication of how a model will perform on future data. Backtesting involves testing a model on recent periods of historical data after training it on data that is even older. The model’s predictions for these historical periods are compared with the actual values observed in them (such as demand for products) to generate metrics that quantify forecasting quality.

Why backtesting is needed

To build machine-learning models for use cases that are not time sensitive (e.g., detecting the presence of a disease from a patient’s symptoms), data is typically split randomly into train and test portions. Such methods are not appropriate for time-sensitive use cases like forecasting: random splitting causes temporal data leakage, where data from the past and the future are mixed into both the training and test portions. This can produce a misleadingly good model performance report, only for the model to perform poorly later in production. Instead, we want models to be trained on data from earlier periods and tested on data from later periods, as the sketch below illustrates.
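To make the distinction concrete, here is a minimal Python sketch (using pandas; the column names, dates, and cutoff are hypothetical) contrasting a random split with a time-based split. The Engine handles this for you; the snippet only illustrates the leakage problem:

```python
import pandas as pd

# Hypothetical daily demand data; rows are shuffled to mimic arbitrary order.
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=730, freq="D"),
    "demand": range(730),
}).sample(frac=1.0, random_state=0)

# A purely random split would place future rows in the training portion
# and past rows in the test portion -- temporal data leakage.
# For forecasting, split on a time cutoff instead, so the test portion
# is strictly later than everything the model is trained on:
cutoff = pd.Timestamp("2023-07-01")
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]

assert train["date"].max() < test["date"].min()  # no future data in training
```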

In addition, for use cases that are not time sensitive, a single sufficiently large pair of train and test portions generated by random splitting is often enough. For time-sensitive applications such as demand forecasting, however, a single split may not be sufficient due to the dynamic nature of the data. For example, if the single test portion were to fall in a period with many holidays, such as the year end, the model’s performance in that period may not represent its performance in the rest of the year.

How the data is split in backtesting

For the above reasons, backtesting is used to evaluate model performance in time-sensitive applications such as demand forecasting. In backtesting, multiple train and test splits are created based on the timestamps of the observations. The following diagram shows how the data can be split for 5 backtesting periods.

Splitting data during backtesting

In the above diagram, data older than a certain point in time is not used in each split. Data from the very remote past may not be relevant for forecasting future observations, as it may contain patterns or behaviours that are out of date; excluding it keeps the model focused on the most recent patterns in the data. It is also more computationally efficient, as the training size remains consistent across the splits. This method is called sliding-window (or rolling-window) walk-forward validation. For these reasons, the AI & Analytics Engine uses the sliding-window method in its forecasting applications.
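For intuition, one common way to generate such sliding-window splits outside the Engine is scikit-learn’s TimeSeriesSplit, whose max_train_size option caps the training window so it slides forward rather than expands. This is an illustrative sketch with made-up sizes, not the Engine’s internal implementation:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 60 time-ordered observations (e.g. monthly demand), oldest first.
y = np.arange(60)

# max_train_size fixes the training window length (sliding window);
# test_size fixes the length of each test period.
tscv = TimeSeriesSplit(n_splits=5, max_train_size=36, test_size=4)

for i, (train_idx, test_idx) in enumerate(tscv.split(y), start=1):
    print(f"split {i}: train indices {train_idx[0]}-{train_idx[-1]}, "
          f"test indices {test_idx[0]}-{test_idx[-1]}")
```

Each successive split drops the oldest training observations as it adds newer ones, so every model is trained on a window of the same length.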

💡For more details about sliding-window backtesting vs other backtesting methods, refer to this article.

For example, let’s assume that we have 6 years of data from 2018-01-01 to 2023-12-31, that up to 3 years of data are relevant for training, and that the test portion size is 6 months. The 5 splits would then be:

  • split 1: train 2018-07-01 to 2021-06-30, test 2021-07-01 to 2021-12-31

  • split 2: train 2019-01-01 to 2021-12-31, test 2022-01-01 to 2022-06-30

  • split 3: train 2019-07-01 to 2022-06-30, test 2022-07-01 to 2022-12-31

  • split 4: train 2020-01-01 to 2022-12-31, test 2023-01-01 to 2023-06-30

  • split 5: train 2020-07-01 to 2023-06-30, test 2023-07-01 to 2023-12-31

Note that the data between 2018-01-01 and 2018-06-30 is not used in any split, as it is too old to be relevant. These split boundaries follow from simple date arithmetic, as shown in the sketch below.
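The following Python sketch (an illustration only, not the Engine’s code) reproduces the five splits listed above from the data end date, the training window, and the test portion size:

```python
import pandas as pd

data_end = pd.Timestamp("2023-12-31")   # last available observation
n_splits = 5
horizon = pd.DateOffset(months=6)       # test portion size
train_window = pd.DateOffset(years=3)   # relevant training history
one_day = pd.DateOffset(days=1)

# The last test window ends at data_end; earlier test windows tile
# backwards from it in 6-month steps.
first_test_start = data_end + one_day - pd.DateOffset(months=6 * n_splits)

for i in range(n_splits):
    test_start = first_test_start + pd.DateOffset(months=6 * i)
    test_end = test_start + horizon - one_day
    train_end = test_start - one_day
    train_start = train_end - train_window + one_day
    print(f"split {i + 1}: train {train_start.date()} to {train_end.date()}, "
          f"test {test_start.date()} to {test_end.date()}")
```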

The AI & Analytics Engine automatically conducts backtesting to give you an understanding of your model’s performance, creating an appropriate number of backtesting periods to assess the quality of its predictions. The test portion size in the Engine is the same as the forecasting horizon, which ensures that models are evaluated in a way that mirrors real forecasting scenarios. The Engine also provides an in-depth backtesting evaluation report with different kinds of insights.
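As a rough illustration of how per-split results can be summarized into a single quality figure (the Engine’s report computes its own set of metrics; the choice of MAPE and all numbers here are purely hypothetical):

```python
import numpy as np

def mape(actual, predicted):
    # Mean absolute percentage error for one backtesting period.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

# Hypothetical (actual, forecast) pairs for each backtesting split.
splits = [
    ([100, 120, 90], [95, 130, 92]),
    ([110, 105, 98], [118, 100, 94]),
    ([130, 125, 140], [122, 128, 133]),
]

per_split = [mape(a, p) for a, p in splits]
print("per-split MAPE:", [round(m, 1) for m in per_split])
print("overall MAPE:", round(float(np.mean(per_split)), 1))
```

Evaluating over several splits, rather than a single one, gives a more stable picture of quality across different periods.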

Backtesting evaluation reports from the Engine