Use ChatLLM to create your custom ChatGPT and answer questions based on your specialized document corpus. ChatLLM models combine the strengths of fine-tuning and Retrieval Augmented Generation (RAG) to build a custom chatbot on your knowledge base. The model can be trained on documents in a variety of formats and will then answer questions quickly and accurately. After training your first model, use an evaluation set to assess its performance on the questions that matter most to you. The quick evaluation recipe below gives you a high-level understanding of your model's performance.
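As a rough illustration, an evaluation set can be as simple as a list of question and ground-truth answer pairs. The field names and contents below are placeholders, not a required schema:

```python
# A minimal evaluation set: each entry pairs a question with the
# human-written "ground truth" answer that LLM responses are scored against.
# Field names and example text are illustrative only.
evaluation_set = [
    {
        "question": "How long is the warranty period?",
        "ground_truth": "All products carry a two-year limited warranty.",
    },
    {
        "question": "What is the maximum coverage under the standard policy?",
        "ground_truth": "The standard policy covers up to $50,000 per incident.",
    },
]
```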
Compare Metric Scores Across Models
Compare metric scores across models to determine which model best fits your use case. For every metric, the best possible value is 1.0; the higher the score, the more closely the LLM response matches the ground truth. If the evaluation questions are well matched to the documents you provided, you should expect a BERT F1 score between 0.7 and 1.0.
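If you want to reproduce the BERT F1 metric offline, a minimal sketch using the open-source bert-score package (not part of ChatLLM itself) looks like the following; the candidate and reference lists are placeholders for your model's answers and your ground truth:

```python
from bert_score import score

# Hypothetical lists: one LLM-generated answer and its ground-truth answer
# per evaluation question, in the same order.
candidates = ["The standard policy covers up to $50,000 per incident."]
references = ["The standard policy covers a maximum of $50,000 per incident."]

# BERTScore compares contextual embeddings of the two texts; an F1 near 1.0
# means the response is semantically very close to the ground truth.
precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERT F1: {f1.mean().item():.3f}")  # roughly 0.7-1.0 for well-matched questions
```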
Check the ROUGE Score
Examine the ROUGE score to measure the word-level overlap between the LLM response and the ground truth. A ROUGE score above 0.3 is usually considered good, given the natural variation in how responses are worded. ROUGE depends heavily on the specific ground truth you supply and can be skewed by responses that are much more or less verbose than the reference.
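As a sketch of how this overlap is computed, the open-source rouge-score package can score a single response against its ground truth; the example texts below are placeholders:

```python
from rouge_score import rouge_scorer

# Hypothetical ground-truth / response pair; replace with your own data.
ground_truth = "All products carry a two-year limited warranty."
llm_response = "Every product includes a limited warranty lasting two years."

# ROUGE-1 counts overlapping single words; ROUGE-L rewards the longest
# common subsequence, so it is less sensitive to word order.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(ground_truth, llm_response)
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```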
Examine Individual Questions
Review individual questions and compare the answers from different LLMs. Verify that the LLM-generated answers are accurate and align with the ground truth.
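One simple way to do this spot check outside the UI is a side-by-side printout. The function below is a hypothetical helper; `evaluation_set` follows the placeholder format shown earlier, and `answers_by_model` is assumed to be a dict mapping each model's name to its list of answers in the same order as the questions:

```python
def review_answers(evaluation_set, answers_by_model):
    """Print each question with its ground truth and every model's answer."""
    for i, item in enumerate(evaluation_set):
        print(f"Q{i + 1}: {item['question']}")
        print(f"  ground truth: {item['ground_truth']}")
        for model_name, answers in answers_by_model.items():
            print(f"  {model_name}: {answers[i]}")
        print()
```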
Once you have verified that the LLM responses are consistent with the ground truth, you have officially trained your own world-class ChatLLM to answer questions about your specialized documents. You can now deploy the model and continue asking it specific questions in the Predictions Dashboard.