Use ChatLLM to create your custom ChatGPT and answer questions based on your specialized document corpus. ChatLLM models combine the strengths of fine-tuning and Retrieval Augmented Generation (RAG) to build a custom chatbot on your knowledge base. The model can be trained on documents in a variety of formats and will then answer questions quickly and accurately. After training your first model, use an evaluation set to assess its performance on the questions that matter most to you. The quick evaluation recipe below gives you a high-level understanding of your model's performance.
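As a rough illustration, an evaluation set can be as simple as a list of question and ground-truth answer pairs. The field names and contents below are placeholders, not a required schema:

```python
# A minimal evaluation set: each entry pairs a question with the
# human-written "ground truth" answer that LLM responses are scored against.
# Field names and example text are illustrative only.
evaluation_set = [
    {
        "question": "How long is the warranty period?",
        "ground_truth": "All products carry a two-year limited warranty.",
    },
    {
        "question": "What is the maximum coverage under the standard policy?",
        "ground_truth": "The standard policy covers up to $50,000 per incident.",
    },
]
```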
Compare Metric Scores Across Models
Compare metric scores across models to determine which model best fits your use case. For every metric, the best possible value is 1.0; the higher the score, the more closely the LLM response matches the ground truth. If the evaluation questions are well matched to the documents you provided, you should expect a BERT F1 score between 0.7 and 1.0.
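If you want to reproduce the BERT F1 metric offline, a minimal sketch using the open-source bert-score package (not part of ChatLLM itself) looks like the following; the candidate and reference lists are placeholders for your model's answers and your ground truth:

```python
from bert_score import score

# Hypothetical lists: one LLM-generated answer and its ground-truth answer
# per evaluation question, in the same order.
candidates = ["The standard policy covers up to $50,000 per incident."]
references = ["The standard policy covers a maximum of $50,000 per incident."]

# BERTScore compares contextual embeddings of the two texts; an F1 near 1.0
# means the response is semantically very close to the ground truth.
precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERT F1: {f1.mean().item():.3f}")  # roughly 0.7-1.0 for well-matched questions
```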
Check the ROUGE Score
Examine the ROUGE score to measure the word-level overlap between the LLM response and the ground truth. A ROUGE score above 0.3 is usually considered good, given the natural variation in how responses are worded. ROUGE depends heavily on the specific ground truth you supply and can be skewed by responses that are much more or less verbose than the reference.
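As a sketch of how this overlap is computed, the open-source rouge-score package can score a single response against its ground truth; the example texts below are placeholders:

```python
from rouge_score import rouge_scorer

# Hypothetical ground-truth / response pair; replace with your own data.
ground_truth = "All products carry a two-year limited warranty."
llm_response = "Every product includes a limited warranty lasting two years."

# ROUGE-1 counts overlapping single words; ROUGE-L rewards the longest
# common subsequence, so it is less sensitive to word order.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(ground_truth, llm_response)
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```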
Examine Individual Questions
Review individual questions and compare the answers from different LLMs. Verify that the LLM-generated answers are accurate and align with the ground truth.
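One simple way to do this spot check outside the UI is a side-by-side printout. The function below is a hypothetical helper; `evaluation_set` follows the placeholder format shown earlier, and `answers_by_model` is assumed to be a dict mapping each model's name to its list of answers in the same order as the questions:

```python
def review_answers(evaluation_set, answers_by_model):
    """Print each question with its ground truth and every model's answer."""
    for i, item in enumerate(evaluation_set):
        print(f"Q{i + 1}: {item['question']}")
        print(f"  ground truth: {item['ground_truth']}")
        for model_name, answers in answers_by_model.items():
            print(f"  {model_name}: {answers[i]}")
        print()
```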
Once you have verified that the LLM responses are consistent with the ground truth, you have officially trained your own world-class ChatLLM to answer questions about your specialized documents. You can now deploy the model and continue asking it specific questions in the Predictions Dashboard.