Hi, all. Welcome to the first post in the series, “Evaluating Deep Learning Models with Abacus.AI”. There will be four parts to this series, covering four types of problems: classification, regression, forecasting, and recommendation. In this post, we will start with classification. There are no prerequisites for this post beyond some familiarity with machine learning. If you don’t have that yet, you could check out my last webinar here, which covers some basic machine learning concepts and also contains a demo on how to train models using the Abacus.AI platform.
As the name suggests, a classification model takes an input and assigns it a class label. For example, a binary classifier trained on a set of dog images would classify an image as dog or not dog. With that, let me introduce a quick evaluation recipe for those who might have a company to run and just want a high-level idea of the quality of the model in a few seconds:
Quick Evaluation Recipe
1. Compare every metric score with the baseline score to see if there’s an improvement or not
2. Start with AUC and make sure it’s between 0.6 and 0.95 (the higher the AUC, the better)
3. Check accuracy, anything above 60% means something useful is being learned
4. Check precision and recall, one of them should be more than 0.5 for useful results
5. Finally, go to the prediction dashboard and check whether the predictions generated by the model match the actual class labels across the different classes
If all of the above looks good, then the model is good to go, and you have trained a world-class deep learning model in just a few minutes with only a few clicks.
Now, Let’s Dive Deeper
If you are interested in machine learning and willing to learn more about it, then you are in the right place. This post covers arguably the simplest and most common problem type in machine learning – classification. Once you complete training your classification model using our easy-to-use self-serve console, the most critical metric to observe for classification problems is AUC. It describes a model’s capability to distinguish between two or more classes. AUC values lie between 0 and 1. Generally speaking, the higher the AUC, the better the model. However, AUC values greater than 0.95 are generally suspicious (too good to be true). This usually means that some column(s)/feature(s) in your data are directly correlated to your target column. An AUC value of less than 0.5 implies that the model is unusable; you might need to verify that the data is clean (and correct) and/or add more data. Ideally, good AUC values lie between 0.5 and 0.95.
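To build intuition for what AUC measures, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. Here is a minimal pure-Python sketch of that pairwise definition; the labels and scores below are made up purely for illustration:

```python
# Pairwise interpretation of AUC: the fraction of (positive, negative)
# pairs where the positive example is scored higher. Ties count as half.
def auc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up actual labels and predicted probabilities for class 1
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]

print(auc(y_true, y_score))  # 0.9375 -- inside the "good" 0.5-0.95 band
```

A model scoring at random would hover around 0.5 under this definition, which is why 0.5 is the usability floor mentioned above.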
After determining the usability of the model using the AUC metric, the next logical step is to look at the confusion matrix to visualize the accuracy of the classifier (classification model) by comparing the actual and predicted classes.
What is a Confusion Matrix?
A confusion matrix is a way to measure the performance of a classification model: it tabulates, over the test set, how the model’s predicted classes compare with the actual classes.
Terminology in Confusion Matrix
- True Positive: Model predicted positive and it’s true. E.g., the model predicted ‘Car’ and it actually is a car.
- True Negative: Model predicted negative and it’s true. E.g., the model predicted ‘Not a car’ and it is not a car.
- False Positive: Model predicted positive and it’s false. E.g., the model predicted ‘Car’ and it actually isn’t.
- False Negative: Model predicted negative and it’s false. E.g., the model predicted ‘Not a car’ and it actually is a car.
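The four cells above can be tallied directly from paired actual/predicted labels. The following toy sketch uses an invented six-point car/not-car test set just to show how each pair falls into one cell:

```python
from collections import Counter

# Made-up actual and predicted labels for a tiny binary test set
actual    = ["car", "car", "not-car", "not-car", "car", "not-car"]
predicted = ["car", "not-car", "not-car", "car", "car", "not-car"]

cells = Counter()
for a, p in zip(actual, predicted):
    if p == "car" and a == "car":
        cells["TP"] += 1   # predicted car, actually a car
    elif p == "not-car" and a == "not-car":
        cells["TN"] += 1   # predicted not-car, actually not a car
    elif p == "car" and a == "not-car":
        cells["FP"] += 1   # predicted car, actually not a car
    else:
        cells["FN"] += 1   # predicted not-car, actually a car

print(dict(cells))  # {'TP': 2, 'FN': 1, 'TN': 2, 'FP': 1}
```

Every metric discussed below (accuracy, precision, recall) is just a different ratio over these four counts.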
Accuracy, Precision, and Recall – Illustrated with examples
Accuracy – It answers how often the classifier is correct overall. In other words, it is the percentage of data points correctly classified out of the total number of data points. Accuracy is easy to interpret but can be misleading if used in isolation. For example, suppose a dataset has 1,000 data points, of which 950 have the label “Car” (Support = 950 for the Car class label) and 50 have the label “Bus” (Support = 50 for the Bus class label). The accuracy will come out to 95% even if the model doesn’t learn anything and predicts “Car” every time.
Accuracy = (TP+TN) / total
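The accuracy trap from the Car/Bus example above is easy to reproduce. This sketch scores a degenerate "model" that always predicts "Car" against that same made-up 950/50 label distribution:

```python
# 950 "Car" points and 50 "Bus" points, as in the example above
actual = ["Car"] * 950 + ["Bus"] * 50

# A degenerate model that learned nothing: it always predicts "Car"
predicted = ["Car"] * 1000

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"{accuracy:.0%}")  # 95% despite the model being useless
```

This is exactly why accuracy should be read alongside precision and recall, especially on imbalanced datasets.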
Precision – The number of times a particular class label is predicted correctly, out of the total number of times that label was predicted.
When it predicts yes, how often is it correct?
Precision = TP / (TP + FP)
Recall – The ratio of the number of times a particular class label was predicted correctly to the total number of actual instances of that class label in the test set.
When it’s actually yes, how often does it predict yes?
Recall = TP / (TP + FN)
Let’s say the test data has 100 data points of which 20 data points belong to the ‘Dog’ class label.
Now let’s say the model predicted 25 of the data points as ‘Dog’ class of which 18 points were correctly classified and the remaining 7 points were incorrectly classified as ‘Dog’.
The model would then have a precision of 18/25 = 0.72 and a recall of 18/20 = 0.9.
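The Dog example above maps onto the TP/FP/FN counts like this (the numbers are the ones given in the example, not real data):

```python
# 100 test points, 20 actual 'Dog' points, 25 predicted as 'Dog',
# 18 of those 25 predictions correct
tp = 18        # predicted 'Dog' and actually 'Dog'
fp = 25 - 18   # predicted 'Dog' but actually something else
fn = 20 - 18   # actual 'Dog' points the model missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)  # 0.72 0.9
```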
Precision vs Recall – The relative significance of precision and recall is driven by the relative cost of a false positive and a false negative in the scenario you are dealing with. For example, let’s say you are in the business of selling cars and you have a list of potential leads. There is a set of luxury cars that make up a lot of your profit, and the production of these cars takes a lot of time, so you need to preorder a specific number of units in advance. You would be very interested in classifying the leads as “luxury buyer or normal buyer” and securing most of these cars through a preorder. In such a case, you will be more interested in recall (the fraction of actual luxury buyers that the model correctly identifies) than in precision (the fraction of predicted luxury buyers who actually are luxury buyers), because every missed luxury buyer is lost high-margin revenue.
Now, let’s tweak this scenario a bit and add another clause: you have another showroom in a southern state where the demand for normal cars is high. Normal cars sell easily without much risk, whereas luxury cars risk sitting in your warehouse for a long time, taking up a lot of space without selling. In this case, precision becomes more significant for you, as you are now interested in filling your limited warehouse space with cars that actually sell.
Finally, let’s add one more tweak to the same story and assume that you have been running your business for a long time, so you have a lot of data with which to classify buyers better. In that case, you might expect both relatively high precision and high recall, letting you fill your warehouse with a balanced mix of luxury and normal cars for minimum risk and maximum profit.
Prediction Dashboard – Using Test Points To Evaluate Models
Another awesome tool that you have in your toolbox for evaluating models on Abacus.AI is the prediction dashboard. It provides all the details you need about the predictions generated by the model, including explanations of why those predictions were made. Isn’t it cool how far we have come in machine learning? We now have reliable ways to white-box a model to a good extent.
The prediction dashboard has a button called ‘Experiment With Test Data’ – this data is from the test data split and is kept exclusively for model evaluation. In other words, the trained model HAS NOT SEEN the test data yet. So if you find the actual class labels to be the same as the predicted class labels with corresponding high probabilities, you can safely conclude that the model is doing well. Otherwise, know that the model is performing badly, and changes need to be made.
We typically advise picking 50 to 100 test data points and repeating the process to verify the predictions generated by the trained model using the prediction dashboard.
Prediction Explanation – Analyzing Feature Contribution Scores to understand the reasons behind the model’s predictions
Another awesome feature of the Abacus.AI platform is Machine Learning Model Explanation. You can get a sense of the most important features contributing to the model’s predictions with just a quick glance at the Feature Importance Score graph on the Metric Dashboard. We show the top 20 features that contribute the most towards the generated predictions. Further, you can use the Prediction Dashboard to observe the feature contribution scores for individual test data points and understand the reasons behind each prediction the model makes.
For example, in the screenshot above, the model says that the person’s spending level is high mostly because the person is affluent. Similarly, for another test data point, the top reason comes out to be the luxurious brand the person buys. On the other hand, for a data point where the prediction comes out to be a low spending level, the model provides top reasons such as less monthly expenses, occupation, etc., which makes a lot of sense. Therefore, after analyzing a few test data points, you would be confident that the model is in fact learning something meaningful and provides valuable inferences. This way, the model’s prediction doesn’t remain a complete black box and useful insights into the reasons for the model’s behavior are gained.
Another important thing to keep in mind is that these feature scores are local: they indicate the importance of each feature for that specific prediction, so the importance changes from one data point to another (the “val” field shows the current data point’s value for that column). The scores are not directional; we report the absolute magnitude of each contribution score to avoid confusion and keep them easy to understand.
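The “absolute magnitude” convention above means a strongly negative contribution ranks just as high as a strongly positive one. A small sketch of that ranking step, with feature names and signed scores invented purely for illustration:

```python
# Hypothetical signed per-prediction contribution scores for one data point.
# Negative means the feature pushed the prediction away from the class.
signed_contributions = {
    "affluence_level": 0.42,
    "monthly_expenses": -0.31,
    "occupation": 0.08,
}

# Rank by absolute magnitude, as described above, and report |score|
ranked = sorted(signed_contributions.items(),
                key=lambda kv: abs(kv[1]), reverse=True)
for name, score in ranked:
    print(name, abs(score))
```

Note that "monthly_expenses" outranks "occupation" here even though its raw score is negative, because only the magnitude is used.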
Using Advanced Training Options
We have an array of advanced training options on our platform that provide a ton of flexibility to customize your deep learning-based model for your specific needs. If you are a data scientist or machine learning engineer, you are going to really appreciate this flexibility (don’t forget to shoot me an email at firstname.lastname@example.org sharing how you used any of these features). Further, we automate the training process in many ways to get the optimal configuration for your model. Let me bring out one of those automated features: correlation with the target.
Correlation with Target
As mentioned earlier, in many cases a strong correlation exists between some columns of the dataset and the target column. This leads to inflated accuracy scores, but the model is essentially exploiting target leakage rather than learning, and it would not generalize well. Based on the recommendation from our system and your own judgment about the importance of the columns, you should decide which column(s) to keep and which to ignore. Additionally, you can click the “Ignore correlated columns and train model” button to start training a new model with all the suggested columns ignored. Once the new model is trained, analyze the accuracy measures again to evaluate it and compare it against the earlier model. You should also use the prediction dashboard to further check the quality of the new model’s predictions.
Note: Depending upon your data, you might not get any recommendation to ignore a column which is absolutely fine.
Solving Class Imbalance Problem
Let’s define the metric “Support” to start discussing the class imbalance problem. The Support for a class is the number of data points belonging to that class in the dataset. In cases where the ratio of the Supports of the class labels is not close to 1:1 (unbalanced classes), you should train two models with different configurations: one with the advanced training option “Rebalance Classes” set to ‘Yes’, and another with the same option set to ‘No’:
- High Recall Model – REBALANCE_CLASSES: true
- High Precision Model – REBALANCE_CLASSES: false
You can give your models any names, but intuitive naming conventions go a long way toward avoiding confusion and making the models easy to evaluate. For the model trained with “REBALANCE_CLASSES: true”, you are likely to achieve higher recall values (compared to the second model) at the cost of precision. Let’s name the model with high recall values the High Recall Model and the one with high precision values the High Precision Model. You can use the drop-down at the top of the metric dashboard and prediction dashboard screens to switch between the two models and compare them, following the same process we used for evaluating the model trained with the default configuration. The only additional step is to use the same set of 50-100 test points from the prediction dashboard when comparing the predictions from the two models, so you can decide which one is best suited to your requirements.
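To get a rough feel for why rebalancing shifts the precision/recall trade-off, here is a sketch of one common rebalancing scheme: weighting each class inversely to its Support. This is an assumption for illustration, not necessarily what Abacus.AI does internally, and the 950/50 label split is made up:

```python
from collections import Counter

# Made-up imbalanced dataset: 950 normal buyers, 50 luxury buyers
labels = ["normal"] * 950 + ["luxury"] * 50

support = Counter(labels)          # Support per class, as defined above
n, k = len(labels), len(support)   # total points, number of classes

# weight = n / (k * support[c]): rare classes get much larger weights,
# so mistakes on them cost more during training. This pushes the model
# to miss fewer rare-class points (higher recall, typically lower precision).
weights = {c: n / (k * s) for c, s in support.items()}
print(weights)  # the luxury class gets a far larger weight than normal
```

With these weights, each misclassified luxury buyer counts roughly 19 times as much as a misclassified normal buyer during training, which is the mechanism behind the High Recall Model above.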
Now that you have become a classification champ, stay tuned for part 2 of the series where we will take on the second challenge: Regression.