Hi guys, welcome to the third post in the four-part series, “Evaluating Deep Learning Models with Abacus.AI”. In this part, we will cover most of the latest methods used to evaluate deep recommender systems. Other than basic familiarity with machine learning, there are no prerequisites for understanding this post. If you are interested in learning the basics of recommendation systems and seeing how we use our Python client to generate great recommendations, I’d also encourage you to check out the workshop on deep recommender systems on our YouTube channel in addition to reading this blog post.
Recommender systems need little introduction these days; they are everywhere. Be it Amazon, Netflix, or YouTube, you will find yourself utilizing the goodness provided by an AI-powered recommender system. I’d define a recommender system as a type of information filtering system that draws on huge data sets to build a profile of each user and generate recommendations for them. With this, we are now in a good position to share the quick evaluation recipe that you can rely upon to evaluate the deep learning-based recommendation system you train on the Abacus.AI platform:
Quick Evaluation Recipe
1. Compare every metric score with the baseline score to see if there’s an improvement or not
2. Check NDCG, Personalization, and Coverage; they should lie in the optimal ranges (NDCG – 0.15 to 0.40, Personalization – 0.4 or higher, and Coverage – 0.001 to 0.2)
3. Using the prediction dashboard, compare the recommendations generated (on the right side) to the user’s purchase history (list on the left side) for resemblance
4. Repeat step 3 for 5 to 10 users to make sure that the recommendations make sense
If all of the above looks good, then the model is good to go: you have trained a world-class deep learning-based recommendation model in just a few minutes with only a few clicks.
Deep Dive into Recommender System Evaluation
Similar to previous posts, you will have two tools to evaluate your recommender system trained on Abacus.AI: the metric dashboard and the prediction dashboard. The metric dashboard for the Personalized Recommendations use case offers five different accuracy measures based on different principles: NDCG, MAP, MRR, Personalization, and Coverage. Together they can provide a thorough understanding of model performance. That said, it is sufficient to have an intuition for the three major metrics that represent three different dimensions of the recommendations generated by the model: NDCG, Personalization, and Coverage. Together, these allow you to understand the model’s behavior and determine prediction quality.
Let me start with NDCG (Normalized Discounted Cumulative Gain). It measures the relative relevance of a ranked list of items compared to an ideal ranked list of the same items. In our system, we recommend 50 items per user, so the NDCG score is computed over the list of 50 recommended items. It ranges from 0 to 1, and the closer it gets to 1, the better. The ideal range of NDCG values for a recommender system generally lies anywhere from 0.15 to 0.40.
The basic principle behind this metric is that some items are ‘more’ relevant than others. Highly relevant items should come before less relevant items, which in turn should come before irrelevant items.
Another variation of NDCG is NDCG@N, defined as the relevance of the list of top N items compared to an ideal ranked list of the same number of items. It also ranges from 0 to 1, and the closer it gets to 1, the better. The ideal range of NDCG@5 and NDCG@10 for a recommender system generally increases with N and lies anywhere from 0.15 to 0.40. Often, the top 5 or 10 items need special attention, and their accuracy matters the most. Our system internally takes care of getting the topmost predictions right for you and provides you with NDCG@5 and NDCG@10 metric scores:
- NDCG@5 – The relative relevance of the list of top 5 items compared to an ideal ranked list of the same number of items.
- NDCG@10 – The relative relevance of the list of top 10 items compared to an ideal ranked list of the same number of items.
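To make the principle concrete, here is a minimal Python sketch of NDCG@N using graded relevances (illustrative only; not necessarily the exact formulation our platform uses internally):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each relevance discounted by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_n(relevances, n):
    """NDCG@N: DCG of the top-N list divided by the DCG of an ideal ordering."""
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:n])
    return dcg(relevances[:n]) / ideal_dcg if ideal_dcg else 0.0

# Relevance grades of recommended items, in the order the model ranked them
# (toy data for illustration)
ranked_relevances = [3, 2, 3, 0, 1, 2]
print(round(ndcg_at_n(ranked_relevances, 5), 3))  # 0.861
```

Notice how swapping a highly relevant item into an earlier rank raises the score: the logarithmic discount is exactly what encodes “highly relevant items should come first.”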
Another useful metric that focuses on a different evaluation aspect is Personalization @10. This metric measures the level of personalization achieved for the users using different sets of top 10 items recommended to different users. The average pairwise overlap between the sets of 10 items recommended to different users is computed. It ranges from 0 to 1, where 0 means all users are recommended the same set of 10 items and 1 means no two users have overlapping sets of recommendations. In general, a score of around 0.4 or higher indicates a high level of user personalization and is considered good.
In recommender systems, the level of personalization plays a key role in understanding how well the system has learned the preferences of individual users. Thus, this metric is important for evaluating the personalization aspect of model performance.
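Based on the description above, Personalization@10 can be sketched as one minus the average pairwise overlap between users’ top-10 sets (a simplified illustration of the idea, not our exact production code):

```python
from itertools import combinations

def personalization_at_10(recommendation_lists):
    """1 minus the average pairwise overlap of users' top-10 recommendation sets."""
    top10_sets = [set(recs[:10]) for recs in recommendation_lists]
    overlaps = [len(a & b) / 10 for a, b in combinations(top10_sets, 2)]
    return 1 - sum(overlaps) / len(overlaps)

# Toy example: three users, two of whom share half their recommendations
recs = [
    list(range(0, 10)),    # user A
    list(range(5, 15)),    # user B: 5 items shared with A
    list(range(20, 30)),   # user C: no overlap with A or B
]
print(round(personalization_at_10(recs), 3))  # 0.833
```

Identical lists for every user would score 0, and fully disjoint lists would score 1, matching the interpretation given above.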
Finally, we need to understand Coverage. This metric measures the ratio of distinct items the system recommends to the total number of items in the catalog. It ranges from 0 to 1. Generally, high coverage is considered good, but beyond a certain point, further increases may lead to less relevant recommendations among the top items.
The ideal range of coverage for a recommender system generally lies anywhere from 0.001 to 0.2. If the coverage value is lower than the ideal range, it means the model is unable to recommend many of the items in the catalog (in some cases recommending only the popular items). This is usually caused by an insufficient number of ratings and is popularly known as the cold start problem.
It is a great idea to show your users a diverse set of products recommended according to their preferences and also based on the similarities between the products. Coverage shows the amount of diversity in the recommendations generated by the system.
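Coverage is the simplest of the three to compute; a minimal sketch, assuming item IDs and a known catalog size, looks like this:

```python
def coverage(recommendation_lists, catalog_size):
    """Fraction of the catalog that appears in at least one user's recommendations."""
    recommended = set()
    for recs in recommendation_lists:
        recommended.update(recs)  # distinct items recommended overall
    return len(recommended) / catalog_size

# Toy example: 3 users, a 1000-item catalog, 12 distinct items recommended
recs = [[1, 2, 3, 4, 5], [3, 4, 5, 6, 7, 8], [7, 8, 9, 10, 11, 12]]
print(coverage(recs, 1000))  # 0.012
```

A value like 0.012 sits inside the ideal range quoted above, whereas a value near 0 would suggest the model is cycling through only a handful of popular items.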
Now that you have a good understanding of the metrics, the next step is to observe the improvements achieved in the scores as compared to the corresponding baseline model scores. As a quick recap, if your NDCG, Personalization@10, and Coverage are improved and lie within the optimal ranges as discussed earlier, you are good to go to the final step and utilize the prediction dashboard to verify if the recommendations make sense.
Note that metric values above the optimal ranges, combined with high coverage, might be an indication of overfitting, i.e., the model is fitting too closely to the data provided and is not generating useful recommendations (or it may just be recommending the most popular items). In such a case, you would either need to add more data or try out our advanced training options to train a new model.
The prediction dashboard has a UserID dropdown menu containing the set of user IDs for which the model generates recommendations. You can select any ID, or type a user ID, to get the list of recommendations for that user. On the left, you’d find the list of items the user interacted with in the past; on the right, the recommendations generated by the model. If you find a resemblance between the items the user has interacted with and the recommended items, you can safely conclude that the model is learning user preferences and item similarities to generate useful recommendations. Otherwise, know that the model is performing badly and changes need to be made.
We typically advise picking 20 to 40 user IDs and repeating the process to verify the recommendations generated by the trained model using the prediction dashboard.
If you are experienced with recommender systems or have a lot of interest in them, you could gain a deeper understanding of the model’s prediction quality by gaining intuition on the following metrics in addition to the ones we already discussed:
MAP (Mean Average Precision) – This metric measures the quality of a ranked list of recommendations, rewarding rankings that place relevant recommendations higher. In our system, we recommend 50 items per user, so the MAP score is computed over the list of 50 recommended items. It ranges from 0 to 1. The closer it gets to 1, the better.
The principle behind this metric is that there should be a reward for having relevant recommendations in the list and even higher rewards for putting the most relevant recommendations at the top ranks. There’s a penalty whenever incorrect guesses are higher up in the rankings.
Another variation of MAP is MAP@N. It is the same as MAP except that it considers only the top N items in the list of recommendations instead of all of them. It ranges from 0 to 1. The closer it gets to 1, the better.
Here too, the idea is the same: the top 5 or 10 items need special focus and their accuracy matters the most. Thus, we provide MAP@5 and MAP@10 metric scores to help you evaluate your model better:
- MAP@5 – It is the same as MAP but here the relative relevance of the top 5 items in the list of recommendations is considered.
- MAP@10 – It is the same as MAP but here the relative relevance of the top 10 items in the list of recommendations is considered.
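The reward-and-penalty principle described above can be sketched in a few lines of Python: precision is accumulated at every rank where a relevant item appears, so early hits are worth more (a common textbook formulation, not necessarily the exact one our platform uses):

```python
def average_precision_at_n(recommended, relevant, n):
    """AP@N: average of precision at each rank where a relevant item appears."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    denom = min(len(relevant), n)
    return score / denom if denom else 0.0

def map_at_n(all_recommended, all_relevant, n):
    """MAP@N: mean of AP@N over all users."""
    scores = [average_precision_at_n(recs, rel, n)
              for recs, rel in zip(all_recommended, all_relevant)]
    return sum(scores) / len(scores)

# Toy example: one user whose relevant items are 'a' and 'c'
print(round(average_precision_at_n(['a', 'b', 'c', 'd', 'e'], {'a', 'c'}, 5), 3))  # 0.833
```

Moving the second relevant item `'c'` up to rank 2 would push the score to 1.0, which is exactly the “higher rewards for top ranks” behavior described above.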
MRR – This metric measures the quality of recommendations (50 recommendations per user in our case) by verifying the position of the most relevant item. It rewards the most when the most relevant item is at the top. It ranges from 0 to 1. The closer it gets to 1, the better.
To understand MRR better, reciprocal rank needs to be defined first. Reciprocal rank is the multiplicative inverse (1/rank) of the rank of the first relevant item in a user’s list of recommendations. MRR is calculated by taking the mean of these reciprocal ranks across all users.
The ideal range of MRR values for a recommender system generally lies anywhere from 0.1 to 0.3.
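Putting the definition above into code makes it clear why MRR rewards getting the very first hit right (a minimal sketch over toy data):

```python
def mean_reciprocal_rank(all_recommended, all_relevant):
    """Mean over users of 1 / rank of the first relevant recommended item."""
    total = 0.0
    for recs, relevant in zip(all_recommended, all_relevant):
        for rank, item in enumerate(recs, start=1):
            if item in relevant:
                total += 1 / rank  # reciprocal rank of the first hit
                break  # only the first relevant item counts
    return total / len(all_recommended)

# Toy example: first user's first hit is at rank 2, second user's at rank 1
print(mean_reciprocal_rank([['a', 'b', 'c'], ['x', 'y']],
                           [{'b'}, {'x'}]))  # 0.75
```

Note the `break`: unlike MAP, MRR only looks at the position of the first relevant item per user, which is why it is a good check on whether the topmost recommendation is on target.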
This concludes our recommender system evaluation guide. I hope you enjoyed it, and I hope to see you again in the next part, where we will take on time-series forecasting.
Stay tuned for part 4!