Hey folks, welcome to the final blog post of the four-part series, “Evaluating Deep Learning Models with Abacus.AI”. In this part, we will deal with arguably one of the hardest problem types in terms of model evaluation: Forecasting. If you haven’t trained any models on our platform yet, I’d encourage you to check out the Forecasting workshop on our YouTube channel.
Forecasting is the process of training models on historical data and then using them to predict future observations. It is known as time-series forecasting because it falls under time series analysis: every data point has a timestamp associated with it. As I mentioned earlier, it is arguably one of the hardest problem types because the future is unknown and we only have the past to estimate it from. So, as you might imagine, the quality of a time series forecasting model is determined by how well it predicts the future. Before laying out our quick evaluation recipe for this problem type, I want to highlight again that Machine Learning (ML) is for everyone: you don’t need a solid statistical background to be an ML practitioner and benefit from the predictions it provides. With that, here is the recipe:
Quick Evaluation Recipe
1. Compare every metric score with the baseline score to see if there’s an improvement or not
2. Start with cNRMSE; if it’s less than 1.3, it’s good to go
3. Check the COV-NRMSE graph (the left one); if the points look dispersed, your COV will be a bit high (0.2 to 0.5 in general), otherwise it should be low (<0.15 generally)
4. Check the breakdown of items by quartile of NRMSE (the left one again) and make sure the prediction graphs (on the prediction dashboard) for item IDs in the top 25% are close to the actual values but not entirely overlapping
5. Check the worst 25% as well to see if the results are making sense and not very far from the actual values. If they are, you might need to consider adding more data for those item ids
If all of the above things look good then the model is good to go and you have trained a world-class deep learning-based forecasting model in just a few minutes with only a few clicks.
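If you export your metric scores, the first two steps of the recipe can be roughly automated. The sketch below is only an illustration: the function name, the dictionary keys, and the assumption that every metric passed in is an error measure (lower is better) are mine, not part of the platform.

```python
def quick_check(model, baseline, cnrmse_threshold=1.3):
    """Return a list of notes for steps 1 and 2 of the quick recipe.

    `model` and `baseline` map hypothetical metric names to scores.
    Only pass error metrics (lower is better), not COV.
    """
    notes = []
    # Step 1: every metric should improve on (be lower than) the baseline.
    for name, score in model.items():
        if name in baseline and score >= baseline[name]:
            notes.append(f"{name}: no improvement over baseline "
                         f"({score} vs {baseline[name]})")
    # Step 2: cNRMSE should be below the rule-of-thumb threshold.
    # A missing cNRMSE is treated as failing the check.
    if model.get("cnrmse", float("inf")) >= cnrmse_threshold:
        notes.append(f"cnrmse is missing or not below {cnrmse_threshold}")
    return notes


model_scores = {"nrmse": 0.8, "cnrmse": 0.4, "smape": 35.0}
baseline_scores = {"nrmse": 1.1, "cnrmse": 0.9, "smape": 33.0}
notes = quick_check(model_scores, baseline_scores)
for note in notes:
    print(note)
```

Steps 3 to 5 are visual checks on the dashboards, so they are left out here on purpose.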
Diving Deeper into Forecasting Model Evaluation
For people who have fallen in love with ML and are willing to dig deeper into forecasting, or who need to understand their forecasting models more deeply for any number of reasons, we have the remaining part of this tutorial, so read on :). The metric dashboard for Forecasting use-cases has five accuracy measures to evaluate the forecasting model: NRMSE, SMAPE, NRMSE of Prediction Total, SMAPE of Prediction Total, and Coefficient of Variation. But a strong intuition for NRMSE (along with cNRMSE, a variant that is very helpful for evaluating forecasting models) and the Coefficient of Variation is sufficient to understand the model performance. Another thing that comes in handy for evaluating prediction quality is how NRMSE and cNRMSE change with respect to the Coefficient of Variation, which gives an understanding of the model’s behavior and helps determine whether the predictions make sense. If this doesn’t make sense yet, don’t worry, let’s take it one step at a time.
Let’s start with the most common one – NRMSE:
Normalized Root Mean Square Error (NRMSE) is an accuracy measure whose value is obtained by calculating the square root of the average of squares of all the differences between the predicted and actual result values and then normalizing the output using the mean value of the actual results. In general, an NRMSE score of less than 1.3 is considered good.
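To make the definition concrete, here is a minimal sketch of NRMSE in plain Python, following the description above (RMSE divided by the mean of the actual values). The function name and the sample numbers are made up for illustration; the platform’s exact implementation may differ.

```python
import math


def nrmse(actual, predicted):
    """Root mean square error, normalized by the mean of the actual values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    # Average of the squared differences between prediction and actual.
    mse = sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    mean_actual = sum(actual) / len(actual)
    if mean_actual == 0:
        raise ValueError("mean of actual values is zero, so NRMSE is undefined")
    return math.sqrt(mse) / mean_actual


actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
score = nrmse(actual, predicted)
print(f"NRMSE = {score:.3f}")
```

Because the error is divided by the mean of the actuals, the score can be compared across items that sell in very different volumes.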
In a lot of practical forecasting scenarios, cNRMSE makes more sense, so let’s understand that too:
NRMSE OF PREDICTION TOTAL (cNRMSE) – This accuracy measure assumes a mathematical form similar to Mean Absolute Percentage Error (MAPE) such that the sum of all the actual values is subtracted from the sum of all the forecasted values and then divided by the absolute sum of actual values. In general, similar to NRMSE, a cNRMSE score of less than 1.3 is considered good.
This metric is particularly useful in scenarios where the error with respect to the overall sum of actual values and the sum of forecasted values is more useful or relevant as compared to calculating error by taking one prediction at a time. For example, let’s say that there is a store XYZ that sells goods in bulk all around the year. The store does sales forecasting and is interested in knowing if all of its goods could be sold in the coming six months. In other words, the monthly sales forecast errors are irrelevant for the store as long as the semi-annual forecast is somewhat accurate.
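The store example can be sketched in a few lines. The function below follows the description above (difference of the totals divided by the absolute total of the actuals); taking the absolute value of the difference so that the score is non-negative is my assumption, and the monthly numbers are invented.

```python
def error_of_prediction_total(actual, predicted):
    """Error of the summed forecast relative to the summed actual values."""
    total_actual = sum(actual)
    if total_actual == 0:
        raise ValueError("sum of actual values is zero, so the error is undefined")
    # Assumption: report the magnitude of the error, not its sign.
    return abs(sum(predicted) - total_actual) / abs(total_actual)


# Six months of hypothetical sales for store XYZ.
monthly_actual = [100, 140, 90, 120, 150, 100]
monthly_forecast = [130, 100, 120, 90, 170, 104]

total_error = error_of_prediction_total(monthly_actual, monthly_forecast)
print(f"error of prediction total = {total_error:.3f}")
```

Every single month in this example is off by a noticeable amount, yet the six-month totals (700 actual, 714 forecast) are close, which is exactly the situation where this metric is the more relevant one.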
The final metric we need some intuition on, to make sure our model is indeed a great one, is COV:
Coefficient of Variation (COV): It is defined as the ratio of the standard deviation to the mean for the data distribution. In simple words, it shows the extent of variation with respect to the mean of the data. We compute it on zero-centered data and thus its value lies between 0 and 1. As a general rule, the lower the value of this metric, the better.
The coefficient of variation should be computed only for data measured on a ratio scale, that is, scales that have a meaningful zero and hence allow relative comparison of two measurements. Unlike the standard deviation, which must always be considered in the context of the mean of the data, the coefficient of variation provides a relatively simple and quick way to compare different data series. Abacus.AI takes care of computing the COV for you on the right scale; you only need to check that your COV is high when your data has a lot of variation. For example, let’s say you have been selling TV sets every month and your sales data ranges from 0 to 100 per month with an average of 50. But the variation in sales is high, which means that in some months you end up selling 20 TVs while in others you sell 90 TVs. In such a case, you don’t want your model to predict something close to the average for each month (e.g., 40 to 60 TVs), which is what low COV scores (generally <0.15) go along with. Therefore, a slightly higher coefficient of variation (0.2 to 0.5 in general), depending upon the variation in the data, would be optimal in such cases.
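Here is a small sketch of the TV example using the textbook definition of COV (standard deviation divided by mean). Note that this is the plain definition, not the zero-centered variant the platform computes, and the sales figures are invented for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("mean is zero, so the coefficient of variation is undefined")
    return statistics.pstdev(values) / mean


# Hypothetical monthly TV sales with high variation (average of 50).
actual_sales = [20, 90, 40, 60, 35, 55]
# A forecast that hugs the average every month.
flat_forecast = [48, 52, 50, 49, 51, 50]

cov_actual = coefficient_of_variation(actual_sales)
cov_flat = coefficient_of_variation(flat_forecast)
print(f"COV of actual sales:  {cov_actual:.2f}")
print(f"COV of flat forecast: {cov_flat:.2f}")
```

The two series have the same mean, but the flat forecast has a far lower COV than the sales it is supposed to track, which is the mismatch to watch out for.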
Now that you have a good understanding of the metrics, the next step is to observe the improvements achieved in the scores as compared to the corresponding baseline model scores. As a quick recap, if your NRMSE is improved and COV looks optimal, you are good to go to the next step.
Breakdown of Items by Quartiles of NRMSE
The histogram below should be easy to read now that you have a grip on NRMSE and cNRMSE. Still, there are a few simple but important points to note:
Top 25%: The 25% of items with the best (lowest) NRMSE scores. These items might have a lot of data rows corresponding to their item IDs in the dataset and/or the model was able to decipher the patterns for these items very well. The forecasts for these items are likely to be very accurate as the model is most confident about them.
Top 25 – 50%: Items whose NRMSE scores rank between the best 25% and the best 50%.
Top 50 – 75%: Items whose NRMSE scores rank between the best 50% and the best 75%.
Worst 25%: Items with the worst 25% NRMSE scores. These items either don’t have a lot of data records within the dataset used for training or the records might have a high variation within them to be able to figure out patterns for accurate predictions. Please note that the worst 25% doesn’t necessarily mean a bad forecast. It just means that relatively the model doesn’t feel as confident for these items as it does for the others.
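If you download the per-item scores, the same breakdown can be reproduced offline. This is a rough sketch that ranks items from lowest (best) to highest (worst) NRMSE and cuts the ranking into four equal groups; the item IDs and scores are hypothetical, and the dashboard may bucket ties or uneven counts differently.

```python
def quartile_buckets(nrmse_by_item):
    """Split items into four groups by NRMSE rank, best (lowest) first."""
    labels = ["Top 25%", "Top 25-50%", "Top 50-75%", "Worst 25%"]
    buckets = {label: [] for label in labels}
    ranked = sorted(nrmse_by_item, key=nrmse_by_item.get)
    n = len(ranked)
    for rank, item in enumerate(ranked):
        # Map rank 0..n-1 onto bucket index 0..3.
        buckets[labels[min(3, rank * 4 // n)]].append(item)
    return buckets


scores = {"A": 0.12, "B": 0.95, "C": 0.30, "D": 1.60,
          "E": 0.45, "F": 0.22, "G": 0.70, "H": 2.10}
buckets = quartile_buckets(scores)
for label, items in buckets.items():
    print(label, items)
```

The items in the last bucket are the ones worth opening on the prediction dashboard first.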
Breakdown of Items by Quartiles of Volume
Like the previous one, this histogram has a few points that improve our understanding of the trained forecasting model.
Top 25%: The items with the largest 25% of the total volume (demand/sold) of items in the dataset(s)
Top 25-50%: The items with the largest 25% to 50% of the total volume of items in the dataset(s)
Top 50-75%: The items with the largest 50 to 75% of the total volume of items in the dataset(s)
Worst 25%: The items with the smallest 25% of the total volume (demand/sold) of items in the dataset(s)
The items with relatively larger volumes are likely to have better NRMSE and SMAPE scores. You should verify this by clicking on the “Item Ids” button for the Top and Worst 25% items on the left as well as the right histogram graphs. Additionally, we also provide an option to download this data into a CSV or a JSON format for analyzing the items and their corresponding scores.
COV – NRMSE and COV – SMAPE Graphs
As briefly mentioned earlier, the graphs of NRMSE and SMAPE against the Coefficient of Variation give an understanding of the model’s behavior and help determine whether the predictions make sense. If you know that your data has a lot of variance, the points in the graphs will look dispersed. If ~50% of the items end up with cNRMSE > 0.5, chances are that the forecast will not be actionable. When your metric scores are not falling in the optimal ranges, there’s no improvement over the baseline scores, or a large part of your data gets poor metric scores, you could verify the following points:
- Check if the columns are marked as future wherever applicable.
- See if the item ids were grouped or split by additional id columns.
- Confirm if you are working with the right frequency of forecast.
- In general, a history of ~5x the length of the forecast horizon, for all items, is a reasonable length to train on. Make sure your items have enough history, especially the low-scoring item IDs.
- If >50% of the points in the history data are zero, the data is likely not forecastable.
- Currently, we support positive-valued timeseries, so make sure that there are not a lot of negative points in the data.
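The last three points are easy to check on your own data before retraining. The sketch below assumes each item’s history is a plain list of numbers at the forecast frequency; the function name, its parameters, and the warning texts are all hypothetical.

```python
def history_checks(history, horizon, min_history_ratio=5, max_zero_share=0.5):
    """Return warnings for one item's history, based on the rules of thumb above."""
    warnings = []
    # Rule of thumb: history of about 5x the forecast horizon.
    if len(history) < min_history_ratio * horizon:
        warnings.append("history shorter than ~5x the forecast horizon")
    # Mostly-zero histories are unlikely to be forecastable.
    if history:
        zero_share = sum(1 for v in history if v == 0) / len(history)
        if zero_share > max_zero_share:
            warnings.append("more than 50% of the history is zero")
    # Only positive-valued series are supported.
    if any(v < 0 for v in history):
        warnings.append("history contains negative values")
    return warnings


sparse_item = [0, 0, 3, 0, -1, 0, 2, 0]
healthy_item = list(range(1, 31))

sparse_warnings = history_checks(sparse_item, horizon=4)
healthy_warnings = history_checks(healthy_item, horizon=6)
print(sparse_warnings)
print(healthy_warnings)
```

Running this over the low-scoring item IDs usually tells you quickly whether the problem is the data rather than the model.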
The prediction dashboard for forecasting use cases is designed to give you a high-level (overall) picture as well as an in-depth view of the model’s performance. You can select any item/store from the “Store (NRMSE)” dropdown menu and get an overall picture of the forecast for that item/store. You can hover over the Date Ranges tooltip to check the start and end date for the store within the train set and the test set. The prediction start date can be set using the calendar dropdown depending upon your requirements. We recommend checking 4-6 store forecasts and comparing them with the actual values to make sure that the differences between them are consistent with the SMAPE and NRMSE scores.
If the forecast plots coincide with the actual values for most of the timeseries, it is a sign of overfitting which means that the model is fitting to the training dataset(s) and not learning useful patterns to generalize well (produce good forecasts) on unseen data. In such cases, you’d probably need to add more data for the specific stores it is overfitting for, or you might need to manually set some of the advanced training options to train a new model (here’s where we would be happy to help you).
If you are experienced with timeseries forecasting or if you have a requirement where you are concerned with the overall forecasting error for a long duration of time rather than short term forecasting errors then you might be interested in gaining intuition on the following metrics in addition to the ones we already discussed:
SMAPE: Symmetric Mean Absolute Percentage Error (SMAPE or sMAPE) is an accuracy measure based on percentage errors, similar to another accuracy measure called MAPE (Mean Absolute Percentage Error). MAPE is the average, over all data points, of the absolute difference between the predicted and observed values, expressed as a percentage of the observed value. SMAPE is similar: for each time period it takes the absolute difference between the forecasted value and the actual value and divides it by the average of the absolute actual value and the absolute forecasted value, then averages the result over all time periods.
In contrast to MAPE, this metric has both a lower bound and an upper bound. The result ranges from 0% to 200%, where the former corresponds to a perfect model and the latter to the worst model. A SMAPE score of less than 40% is generally considered good. A limitation of SMAPE is that if either the actual value or the forecast value is 0 (and the other is not), the error for that point jumps to the upper limit (200%).
SMAPE OF PREDICTION TOTAL (cSMAPE): It assumes a mathematical form such that the total prediction error is calculated by subtracting the sum of all the actual values from the sum of all the forecasted values. Then the prediction error obtained is normalized by dividing it by the average of the absolute sum of actual values and the absolute sum of forecasted values. Finally, the result can be multiplied by 100 to get a percentage value (where the value ranges from 0 to 200%). Similar to SMAPE, a cSMAPE score of less than 40% is generally considered good.
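Both metrics can be sketched side by side, using the 0 to 200% form described above (the denominator is the average of the absolute actual and absolute forecasted values). Treating a point where both values are 0 as zero error is my assumption, and the numbers are invented.

```python
def smape(actual, predicted):
    """Mean of per-period symmetric percentage errors, in percent (0 to 200)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    terms = []
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        # Assumption: a 0-vs-0 period counts as a perfect prediction.
        terms.append(0.0 if denom == 0 else abs(p - a) / denom)
    return 100 * sum(terms) / len(terms)


def smape_of_prediction_total(actual, predicted):
    """Symmetric percentage error of the summed forecast, in percent (0 to 200)."""
    total_a, total_p = sum(actual), sum(predicted)
    denom = (abs(total_a) + abs(total_p)) / 2
    return 0.0 if denom == 0 else 100 * abs(total_p - total_a) / denom


actual = [100, 0, 50]
predicted = [110, 5, 50]

per_period = smape(actual, predicted)
on_totals = smape_of_prediction_total(actual, predicted)
print(f"SMAPE  = {per_period:.1f}%")
print(f"cSMAPE = {on_totals:.1f}%")
```

The second period has an actual value of 0, so it alone contributes the maximum 200% and drags the per-period SMAPE up, while the score on the totals stays low. This is the zero-value limitation mentioned above in action.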
Now that you have a decent amount of knowledge on model evaluation for the four major problem types (classification, regression, recommendation, and forecasting), I can say goodbye for now and mark the end of this four-part series, happy to have added a few dishes to your ML buffet.
Take care you all and stay safe!