Training Parameters And Accuracy Measures

Our platform provides the flexibility to adjust a set of training parameters. Both the general training parameters and the advanced training options can influence the model's predictions. The predictions are evaluated against a set of accuracy measures, or metrics, which are also discussed in this section.


Training Options

Once you have fulfilled all the feature group requirements for the use case, you can set the following general and advanced training configuration options to train your ML model:

Training Option Name Description Possible Values
Prediction Length Number of timestamps in the future to predict. 1 to 100
Probability Quantiles Quantile of the forecast distribution. The usual point forecast is often the mean or the median of the forecast distribution. To get a sense of quantile forecasts, it helps to first understand what a prediction interval is. A prediction interval is an interval within which forecasts may lie, with a certain probability. For example, the 90% prediction interval (P90) is defined by the 5% and 95% quantiles of the forecast distribution. Similarly, P10 is defined by the 45% and 55% quantiles of the forecast distribution. You can therefore think of a prediction interval as a confidence level for the model's predictions: for P90, the model would be 90% sure that the actual value lies in the corresponding prediction interval. Ideally, include one probability quantile for a lower bound (such as P10) and another for an upper bound (such as P90) when training a forecasting model. The point forecast (P50) should lie between the P10 and P90 values for most of the timestamps (see the sketch after this table). 10 / 25 / 50 / 75 / 90
Forecast Frequency It sets how often to make forecasts. Hourly / Daily / Weekly (Monday to next Monday) / Weekly / Monthly / Yearly
Name The name you would like to give to the model that is going to be trained. The system generates a default name based on the name of the project the model belongs to. The name can consist of any alphanumeric characters and must be between 5 and 60 characters long.
Set Refresh Schedule (UTC) The refresh schedule determines when your dataset is replaced by an updated copy of that dataset from your storage bucket location. The value is a CRON time string that describes the schedule in the UTC time zone. A string in CRON format. If you're unfamiliar with cron syntax, Crontab Guru can help translate it into natural language.
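To make the Probability Quantiles option above concrete, here is a minimal NumPy sketch (not Abacus.AI code) that derives P10, P50, and P90 forecasts from a set of hypothetical sampled forecast trajectories; the distribution, horizon, and sample count are made up for illustration.

```python
import numpy as np

# Hypothetical forecast samples: 1000 simulated trajectories over a 7-step horizon.
# In practice these would come from the trained probabilistic forecasting model.
rng = np.random.default_rng(0)
samples = rng.lognormal(mean=3.0, sigma=0.4, size=(1000, 7))

# Quantile forecasts: P10 (lower bound), P50 (point forecast), P90 (upper bound).
p10, p50, p90 = np.quantile(samples, [0.10, 0.50, 0.90], axis=0)

# The point forecast should sit between the lower and upper bounds at every step.
assert np.all((p10 <= p50) & (p50 <= p90))

for step, (lo, mid, hi) in enumerate(zip(p10, p50, p90), start=1):
    print(f"t+{step}: P10={lo:.1f}  P50={mid:.1f}  P90={hi:.1f}")
```

Here P10, P50, and P90 are read as the 10th, 50th, and 90th quantiles of the forecast distribution, matching the lower-bound / point-forecast / upper-bound usage described in the table.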

Advanced Training Options

For Advanced Options, our AI engine will automatically set the optimum values. We recommend overriding these options only if you are familiar with deep learning. Overview of Advanced Options:

Training Option Name API Configuration Name Description Possible Values
Type of split TYPE_OF_SPLIT Defines the underlying method that will be used to split data into train & test set. We support the following ways to do this in forecasting:

(i) Automatic Time Based: Time based split where the actual splits are decided automatically

(ii) Timestamp Based: A timestamp can be set so that the model is trained on the data before that timestamp and then evaluated on the data after it, i.e., data points before the selected timestamp go into the train set and data points after it go into the test set.

(iii) Item Based: Partition train/test data by item instead of time.

(iv) Force Prediction Length: Force length of the test window to be the same as the prediction length. This flag overrides all other related options. Please note that when this flag is set, the start date for the test period is computed as the maximum date in the dataset (after filtering) minus the prediction length of frequency periods.
Automatic Time Based / Timestamp Based / Item Based / Force Prediction Length
Test Split TEST_SPLIT Percentage of the dataset to use as test data. A range from 5% to 20% of the dataset as test data is recommended. The test data is used to estimate the accuracy and generalization capability of the model on unseen (new) data.
Dropout DROPOUT Dropout percentage in deep neural networks. It is a regularization method used for better generalization on new data. The key idea is to randomly drop units (along with their connections) from the neural network during training, which prevents units from co-adapting too much and forces each unit to learn more robust features on its own. 0 to 90
Batch Size BATCH_SIZE The number of data points provided to the model at once (in one batch) during training. The batch size impacts how quickly a model learns and the stability of the learning process. It is an important hyperparameter that should be well-tuned. For more details, please visit our Blog. 16 / 32 / 64 / 128
History Length HISTORY_LENGTH The extent of history (last n number of data points) to consider while training the model. 1 to 200
Test Start TEST_START Limit training data to dates before the provided test start. datetime
Test By Item TEST_BY_ITEM Partition train/test data by item instead of time. true or false
Force prediction length FORCE_PREDICTION_LENGTH Force length of the test window to be the same as the prediction length. This flag overrides all other related options. Please note that when this flag is set, the start date for the test period is computed as the maximum date in the dataset (after filtering) minus the prediction length of frequency periods. true or false
Prediction Offset PREDICTION_OFFSET Offset range for the prediction 0 to 365
No Validation Set NO_VALIDATION_SET Do not generate a validation set; the test set will be used instead. true or false
Filter Items FILTER_ITEMS The quality of the dataset plays a key role in data science projects. If there is a lot of noise in the dataset, it will be difficult to create a good model ("garbage in, garbage out"). Hence, the product implements filtering heuristics that look at different statistics of each time series and decide whether the item is forecastable or not. Filtered items are not used for training and evaluation of models (but they are still available for inspection in the prediction dashboard and for generating batch predictions). This flag controls whether such filtering heuristics should be used, or whether as many items as possible should be used regardless of their quality. true or false
Enable Multiple Backtests ENABLE_MULTIPLE_BACKTESTS Use this option if you want to use backtesting for your model. Backtesting is essentially cross-validation for time series: it uses time-aware partitions instead of random shuffles. The more backtests used, the better we can validate model stability; the main drawback is that more historical data is needed. true or false
Number of backtesting windows NUM_BACKTESTING_WINDOWS This option sets up the number of backtests Abacus.AI will use to validate your model. 1 to 5
Backtesting Window Step Size BACKTESTING_WINDOW_STEP_SIZE When the number of backtesting windows is >= 2, the backtesting window step size defines the time difference between the test start dates of subsequent backtesting windows. For a weekly forecast model with a test start date of 29-Aug-2022 and a backtesting window step size of 2, the subsequent backtests would have test start dates of 15-Aug-2022, 01-Aug-2022, and so on (see the sketch after this table). The window's step size unit of time is always the same as the Forecast Frequency parameter. 1 to 10
Full Data Retraining FULL_DATA_RETRAINING If Full Data Retraining is turned on, a final model will be retrained on the complete data, including the test set. This model will also serve as the default deployment. true, false, or auto
Use Statistical Model For Filtered Items USE_STATISTICAL_MODEL_FOR_FILTERED_ITEMS Since filtered items are usually not well suited for training a deep learning model, it is often better to use a simpler, traditional statistical technique with a small number of parameters to make predictions for them. This parameter controls which model is used for filtered items. true or false
Use Item ID USE_ALL_ITEM_TOTALS Whether or not to treat the item ID as a prediction input. Avoid this input if the model will be used to predict items not seen at training time. true or false
Lags LAGS Target column lags. 7 / 13 / 28 / 30 / 91 / 364 / 365
Prediction Step Size PREDICTION_STEP_SIZE Number of future periods to include in objective for each training point. 1 to 90
Training Point Overlap TRAINING_POINT_OVERLAP Amount of overlap to allow between training points. This controls how much, on average, individual training points can overlap. When the dataset is small it is hard to avoid overlap, and this parameter can be used to tune how much overlap to allow. For time series data with a lot of structure, more overlap is generally fine. 0.1 to 0.5
Max Scale Context MAX_SCALE_CONTEXT When local scaling is performed on the target this option controls the maximum amount of history that will be used for computing the local scale. 2 to 200
Quantiles Extension Method QUANTILES_EXTENSION_METHOD Method to use for expanding quantiles to cover full prediction length. direct / quadratic / simulation
Number Of Samples NUMBER_OF_SAMPLES Number of samples for ancestral simulation. 10 to 2000
Use Logarithmic Transformations USE_LOG_TRANSFORMS This option enables logarithmic transformations of all input data for neural networks. If enabled, neural network outputs are automatically exponentiated to bring them back to the original scale. Logarithmic transformations are useful when the data has a large dynamic range (e.g., some time windows of your time series are in the hundreds while others are in the millions). true or false
Skip Local Scale Target SKIP_LOCAL_SCALE_TARGET Scales the input based on some fixed amount of recent history if set to False (No), and skips local scaling in favor of a constant scaling factor across all inputs if set to True (Yes). true or false
Symmetrize Quantiles SYMMETRIZE_QUANTILES Force symmetric quantiles (like in Gaussian distribution). true or false
Loss Function LOSS_FUNCTION This option provides a choice of objective (loss) function, that is being optimized during training. Choice of loss function could be important in achieving good performance on desired metric. Automatic / Custom / MAE - Mean Absolute Error / NMAE - Mean Absolute Error Normalized by Output Mean / MAPE - Mean Absolute Percentage Error / Point-wise Accuracy / RMSE - Root Mean Squared Error / NRMSE - Root Mean Squared Error Normalized by Output Mean / Asymmetric MAPE - MAPE variant to reduce negative bias / SSMC - Stable Standardized MAPE or CMAPE / MLE - Gaussian / MLE - Gaussian (Full Covariance) / MLE - Mixture of Gaussian and Exponential / MLE - Mixture of Gaussians / MLE - Weibull / MLE - Negative Binomial
Custom Loss Functions CUSTOM_LOSS_FUNCTIONS Select any registered custom loss functions that you want to use as the objective function during training. When a loss function of a certain type is selected, it will be applied to all algorithms that support that loss type. At most one loss function of a given type can be added. Unused selections will be ignored during training, and algorithms that don't support any of the selected loss functions will not be trained.

The following lists the loss types and the algorithms compatible with each.
1. Forecasting - Deep Learning (TensorFlow)
-> Abacus Deep Learning 1 - AutoML
-> Abacus Deep Learning 6 - AutoML+HPO
-> Abacus Deep Learning 12 - AutoML Custom Loss
-> Abacus Deep Learning 13 - Transformer
Registered custom loss functions eligible for training
Custom Metrics CUSTOM_METRICS Metrics are used to evaluate the performance of the trained model. The platform already calculates a number of metrics for each problem, which are shown on the metrics page. Use this option to select and evaluate any additional custom metrics that are registered.
Registered custom metrics eligible for the model
Underprediction Weight UNDERPREDICTION_WEIGHT Weight for underpredictions. 0 to 100
Initial Learning Rate INITIAL_LEARNING_RATE Initial learning rate to set. 0.00001 to 0.01
Disable Networks Without Analytic Quantiles DISABLE_NETWORKS_WITHOUT_ANALYTIC_QUANTILES Disable neural networks whose quantile functions do not have analytic expressions (e.g., mixture models). true or false
L2 Regularization Factor L2_REGULARIZATION_FACTOR L2 regularization factor. In order to create high-performing models that generalize well to new data, regularization techniques are used to address over-fitting (over-complex models fitted to a particular dataset). L2 regularization adds the squared magnitude of the coefficients as a penalty term to the loss function. 0 to 0.1
Recurrent Layers RECURRENT_LAYERS Number of recurrent layers to stack in network. 1 to 5
Recurrent Units RECURRENT_UNITS Number of units in each recurrent layer. 10 to 100
Convolution Layers CONVOLUTIONAL_LAYERS Number of convolution layers to stack on top of recurrent layers in network. 0 to 20
Convolution Filters CONVOLUTION_FILTERS Number of filters to use in each convolution. 1 to 10
Zero Predictor ZERO_PREDICTOR Include subnetwork to classify points where target equals zero. The subnetwork specifically learns to classify whether a point is zero or not. If it is sufficiently confident then a zero is predicted regardless of the continuous output. true or false
Batch Renormalization BATCH_RENORMALIZATION Enable batch renormalization between layers. A batch normalization layer is introduced after each RNN layer in the stack if this option is set to yes. true or false
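As a concrete illustration of the Backtesting Window Step Size option above, here is a minimal sketch (plain Python, not Abacus.AI code) that computes the test start dates of successive backtests for a weekly model; the dates, window count, and step size simply mirror the hypothetical example in the table.

```python
from datetime import date, timedelta

# Hypothetical settings mirroring the example above.
forecast_frequency = timedelta(weeks=1)   # weekly forecast frequency
first_test_start = date(2022, 8, 29)      # test start date of the first backtest
num_backtesting_windows = 3               # NUM_BACKTESTING_WINDOWS
step_size = 2                             # BACKTESTING_WINDOW_STEP_SIZE (in frequency units)

# Each subsequent backtest starts step_size frequency periods earlier than the previous one.
test_starts = [
    first_test_start - i * step_size * forecast_frequency
    for i in range(num_backtesting_windows)
]
print([d.isoformat() for d in test_starts])
# ['2022-08-29', '2022-08-15', '2022-08-01']
```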

Metrics

Our AI engine will calculate the following metrics for this use case:

Metric Name Description
Accuracy This is a measure of how close the forecast predictions are to the actual values in the test period. Suppose we actually made 100 sales of an item in a day but the system predicted 50; the accuracy of the system would be 50/100 * 100 = 50%. Similarly, if the system had predicted 200, the accuracy would be 100/200 * 100, which again comes out to 50%. Accuracy ranges from 0 to 100, and higher values are better. We weight the accuracy by the volume of each item: the higher the volume, the greater the penalty when the forecast diverges from the actual values. Further Details: For each test sample, we first calculate the two ratios, sum of actuals / sum of predictions and sum of predictions / sum of actuals, and take the minimum of the two. Let's call this ratio the accuracy ratio of the test sample. Finally, we average this accuracy ratio over all test samples, weighted by volume (the sum of all the target values for a specific item ID), and multiply it by 100 to get a percentage accuracy value for the forecast (see the sketch after this table). You could call it a volume-weighted cumulative accuracy for the entire forecast on the test set.
Point-wise Accuracy It is the accuracy of the entire forecast calculated by averaging the accuracies of all the test data points, which is why it is known as point-wise accuracy. It ranges from 0 to 100, and higher values are better. First, we calculate the accuracy (the minimum of actual/prediction and prediction/actual) for each test point within a test sample. Next, we average these accuracies across all test samples and multiply by 100 to obtain the point-wise accuracy for the entire forecast.
Percentage Items Forecasted The percentage of items in the catalog for which forecasts were generated by the trained model. You can click the "filtered item IDs" button to get: (i) the total number of items forecasted, (ii) the list of all forecasted item IDs, and (iii) options to download the item list in CSV and JSON formats.
Percentage of Filtered Out Items Forecasted Abacus.AI splits your feature group into items that have signal and are forecastable and items that are too noisy and are filtered out. Neural network techniques are more effective on forecastable items, and statistical techniques work better when there is too much noise in the data. Abacus.AI splits the data into Forecastable items and Filtered out Items and performs 2 model auctions and picks the best model that fits each feature group.
Mean Absolute Error (MAE) MAE is the average absolute difference between the predicted and actual values. The lower the value of this metric the better; a score of 0 means that the model has perfect results. MAE uses the same scale as the data being measured, so it is a scale-dependent accuracy measure and therefore cannot be used to compare models that use different scales. It measures the average magnitude of the errors in a set of predictions, without considering their direction. It is a common measure of forecast error in time series analysis and is relatively easy to interpret (compared to Root Mean Square Error). For further details, please visit this link.
Mean Absolute Percentage Error (MAPE) This metric is defined as the average, over all data points, of the absolute difference between the prediction and the observed value, expressed as a percentage of the observed value. A metric score of 0% means that the model forecasted with perfect results, whereas a metric score of 100% means that the model was completely inaccurate. In other words, it is a statistical measure of how accurate a forecast system is: it measures accuracy as the average absolute percent error for each time period, where the actual value is subtracted from the forecasted value and the difference is divided by the actual value. MAPE is commonly used as a loss function for regression problems and in model evaluation because of its very intuitive interpretation in terms of relative error. The interpretation of MAPE is quite simple, which makes it a very common measure of forecast error, but it has drawbacks in practical application. It works best if there are no extremes in the data (and no zeros, because of the division-by-zero problem). For further details, please visit this link.
Symmetric Mean Absolute Percentage Error (SMAPE or sMAPE) It is an accuracy measure based on percentage errors, similar to MAPE. For MAPE, the average of the difference between the prediction and observed results is calculated as a percentage of the observed results over all the data points. SMAPE instead measures accuracy as the average absolute percent error for each time period, where the absolute difference between the forecasted and actual value is divided by half the sum of the absolute actual value and the absolute forecasted value. In contrast to MAPE, this metric has both a lower bound and an upper bound: the result ranges from 0% to 200%, where the former corresponds to a perfect model and the latter to the worst model. A limitation of SMAPE is that if either the actual value or the forecast value is 0, the error for that point jumps to the upper limit of 200%. For further details, please visit this link.
SMAPE of Prediction Total This accuracy measure takes a form in which the total prediction error is calculated by subtracting the sum of all the actual values from the sum of all the forecasted values. The prediction error obtained is then normalized by dividing it by the average of the absolute sum of actual values and the absolute sum of forecasted values. Finally, the result can be multiplied by 100 to get a percentage value (ranging from 0% to 200%). This metric is particularly useful in scenarios where the error with respect to the overall sum of actual values and the sum of forecasted values is more relevant than calculating the error one prediction at a time. For example, say a store XYZ sells goods in bulk all year round. The store does sales forecasting and is interested in knowing whether all of its goods could be sold in the coming six months. In other words, the monthly sales forecast errors are irrelevant for the store as long as the semi-annual forecast is reasonably accurate. In such cases, SMAPE of Prediction Total becomes valuable.
Root Mean Square Error (RMSE) It is the square root of the average of the squared differences between the predicted and actual values. In other words, the difference between the predicted and actual values is calculated and squared, an average of all the squared differences is calculated, and finally the square root of that average is taken as the RMSE. This makes RMSE a non-negative value, and a score of 0 (almost never achieved in practice) would indicate a perfect fit to the data. In general, a lower RMSE score is better than a higher one. However, comparisons across different types of data would be invalid because the metric is dependent on the scale of the numbers used. The errors are squared before they are averaged, so RMSE gives a relatively high weight to large errors. This means that it is more useful when large errors are particularly undesirable, for example, camera calibration where being off by 5 degrees is more than twice as bad as being off by 2.5 degrees. This also makes RMSE sensitive to outliers. RMSE is harder to understand and interpret than the Mean Absolute Error (MAE): each error influences MAE in direct proportion to the absolute value of the error, which is not the case for RMSE. For further details, please visit this link.
Normalized Root Mean Square Error (NRMSE) It is an accuracy measure obtained by calculating the square root of the average of the squared differences between the predicted and actual values and then normalizing the result (we use the mean of the actual values for normalization). There is no consistent means of normalization in the literature; the common choices are the mean and the range (defined as the maximum value minus the minimum value) of the measured data. For further details, please visit this link.
Cumulative MAPE (C-MAPE) This metric measures the accuracy of predicting cumulatives (totals) and takes the form of a Mean Absolute Percentage Error (MAPE): the sum of all the actual values is subtracted from the sum of all the forecasted values, and the difference is divided by the absolute sum of actual values. This metric is particularly useful in scenarios where the error with respect to the overall sum of actual values and the sum of forecasted values is more relevant than calculating the error one prediction at a time. For example, say a store XYZ sells goods in bulk all year round. The store does sales forecasting and is interested in knowing whether all of its goods could be sold in the coming six months. In other words, the monthly sales forecast errors are irrelevant for the store as long as the semi-annual forecast is reasonably accurate. In such cases, Cumulative MAPE becomes valuable.
Coefficient of variation The coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean for the data distribution. In simple words, it shows the extent of variation with respect to the mean of the data. We compute it on zero-centered data and thus its value lies between 0 and 1. As a general rule, the lower the value of this metric, the better. The coefficient of variation should be computed only for data measured on a ratio scale, that is, scales that have a meaningful zero and hence allow relative comparison of two measurements. Unlike the standard deviation that must always be considered in the context of the mean of the data, the coefficient of variation provides a relatively simple and quick way to compare different data series. For further details, please visit this link.
Statistical bias with respect to the sample mean In statistics, an estimator is a rule for calculating an estimate of a quantity based on observed data. For example, you might have a rule to calculate a population mean. The result of using the rule is an estimate (a statistic) that hopefully is a true reflection of the population. The bias of an estimator is the difference between the statistic's expected value and the true value of the population parameter. If the statistic is a true reflection of a population parameter it is an unbiased estimator. If it is not a true reflection of a population parameter it is a biased estimator. Therefore, the lower the value of this metric the better.
Validation-Test Difference We define this metric as the mean of the absolute differences in NRMSE and C-MAPE between the validation and test sets. This metric tries to capture the ability of the model to generalize; a lower value is generally better. A high value (relative to NRMSE/C-MAPE) might indicate that the model was not able to generalize, for a variety of reasons: for example, there is a significant regime/distribution shift in the time series between the validation and test time periods, or the model is overfitting. The difference between validation and test errors is usually not zero due to randomness, but it should not be high.
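As a rough illustration of how several of these metrics can be computed from actual and predicted values, here is a minimal NumPy sketch (not the platform's exact implementation; the formulas simply follow the descriptions above, and the sample arrays are made up).

```python
import numpy as np

# Hypothetical actuals and forecasts for a single item over a 5-step test period.
actual = np.array([120.0, 95.0, 130.0, 0.0, 110.0])
pred = np.array([100.0, 90.0, 150.0, 10.0, 95.0])

# MAE: average absolute difference between predictions and actuals.
mae = np.mean(np.abs(pred - actual))

# RMSE and NRMSE (normalized by the mean of the actual values).
rmse = np.sqrt(np.mean((pred - actual) ** 2))
nrmse = rmse / np.mean(actual)

# SMAPE (0-200% variant): |F - A| divided by the average of |A| and |F|.
smape = 100 * np.mean(np.abs(pred - actual) / ((np.abs(actual) + np.abs(pred)) / 2))

# Accuracy ratio for this item: min(sum of actuals / sum of predictions,
# sum of predictions / sum of actuals), expressed as a percentage.
accuracy = 100 * min(actual.sum() / pred.sum(), pred.sum() / actual.sum())

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  NRMSE={nrmse:.3f}  "
      f"SMAPE={smape:.1f}%  accuracy={accuracy:.1f}%")
```

The platform's volume-weighted Accuracy additionally averages this accuracy ratio across items, weighted by each item's volume, as described in the table.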

Note: In addition to the above metrics, our engine will train a baseline model and generate metrics for the baseline model. Typically the metrics for your custom deep learning model should be better than the baseline model.