Our platform provides the flexibility to adjust a set of training parameters. There are general training parameters and advanced training options that can influence the model's predictions. Prediction quality is measured using a set of accuracy measures, or metrics, which are also discussed in this section.
Once you have fulfilled all the feature group requirements for the use case, you can set the following general and advanced training configuration options to train your ML model:
Training Option Name | Description | Possible Values |
---|---|---|
Name | The name you would like to give to the model to be trained. The system generates a default name based on the name of the project the model belongs to. | Any combination of alphanumeric characters, 5 to 60 characters long. |
List of documents | A collection of text-based documents, such as articles, reviews, and other written material, used for training or evaluating NLP models. | Documents can be in formats such as plain text, CSV, JSON, PDF, DOCX, or ZIP. |
Evaluation | Pairs of search queries and ground truth answers designed to assess a model's performance in a project. While optional, it is highly recommended to include an evaluation set for evaluating the model's effectiveness. | Pairs of questions and answers stored in JSON or CSV. |
Set Refresh Schedule (UTC) | The schedule on which your dataset is replaced by an updated copy from your storage bucket location, specified as a CRON time string in the UTC time zone. | A string in CRON format, e.g. `0 6 * * *` for every day at 06:00 UTC. If you're unfamiliar with CRON syntax, Crontab Guru can help translate it into natural language. |
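As noted above, the evaluation set is a collection of question–answer pairs in JSON or CSV. Below is a minimal sketch of producing such a file in JSON; the field names `query` and `answer` are illustrative assumptions, not a documented schema, so consult the platform's upload requirements for the exact format:

```python
import json

# Hypothetical field names -- check the platform docs for the exact schema.
eval_pairs = [
    {"query": "What is the refund policy?",
     "answer": "Refunds are issued within 30 days of purchase."},
    {"query": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the login page."},
]

# Serialize the pairs to a JSON file ready for upload.
with open("evaluation_set.json", "w") as f:
    json.dump(eval_pairs, f, indent=2)
```

The same pairs could equally be stored as a two-column CSV; JSON is shown here only because it keeps each pair self-describing.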
For Advanced Options, our AI engine automatically sets optimal values. We recommend overriding these options only if you are familiar with deep learning. Overview of Advanced Options:
Training Option Name | API Configuration Name | Description | Possible Values |
---|---|---|---|
Larger Embeddings | LARGER_EMBEDDINGS | Use a higher-dimensional embedding model. This can improve the model's quality but may also increase computational requirements. | yes/no |
Chunk Size | CHUNK_SIZE | Dictates the size of individual data chunks for processing. Finding the right size is crucial for optimizing computational resources and ensuring effective information handling. | Integers ranging from 20 to 2000, specifying the desired size of individual data chunks within the processing pipeline, measured in tokens. |
Index Fraction | INDEX_FRACTION | Determines the fraction of a chunk used for indexing during document processing. A higher value increases the portion of the chunk used for indexing, potentially improving search accuracy at the expense of higher computational requirements. | Floating-point numbers from 0.05 to 1.00 |
Chunk Overlap Fraction | CHUNK_OVERLAP_FRACTION | Controls the overlap between consecutive chunks, ensuring a smooth transition and avoiding abrupt breaks during document breakdown. The goal is to capture all the context needed for answering questions, striking a balance between seamless flow and unnecessary repetition. | Floating-point numbers from 0.05 to 0.90. A value of 0.05 implies minimal overlap, while 0.90 indicates a significant overlap between consecutive data chunks. |
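To make the interaction between Chunk Size, Chunk Overlap Fraction, and Index Fraction concrete, here is a minimal sketch of overlapping chunking over a token list. This illustrates the concepts only and is not the platform's actual implementation:

```python
def chunk_tokens(tokens, chunk_size, overlap_fraction):
    """Split a token list into overlapping chunks.

    The window advances by chunk_size * (1 - overlap_fraction) tokens,
    so a higher overlap fraction means consecutive chunks share more tokens.
    """
    stride = max(1, int(chunk_size * (1 - overlap_fraction)))
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

def index_portion(chunk, index_fraction):
    """Return the leading slice of a chunk used for indexing."""
    return chunk[:max(1, int(len(chunk) * index_fraction))]

tokens = list(range(10))  # stand-in for a tokenized document
chunks = chunk_tokens(tokens, chunk_size=4, overlap_fraction=0.5)
# With chunk_size=4 and a 0.5 overlap fraction, the window advances
# 2 tokens at a time: [0,1,2,3], [2,3,4,5], [4,5,6,7], [6,7,8,9]
```

With an Index Fraction of 0.25, only the first token of each 4-token chunk would be indexed for search, while the full chunk remains available for answering.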