| REQUIRED | KEY | TYPE | DESCRIPTION |
|---|---|---|---|
| No | docIds | List[str] | A list of document store IDs to retrieve the snippets from. |
| No | blobs | dict | A dictionary mapping document names to the blob data. |
| No | query | str | Query string to find relevant snippets in the documents. |
| No | documentRetrieverConfig | DocumentRetrieverConfig | If provided, used to configure the retrieval steps like chunking for embeddings. |


`documentRetrieverConfig` fields:

| KEY | TYPE | DESCRIPTION |
|---|---|---|
| indexMetadataColumns | bool | If True, metadata columns of the feature group (FG) will also be used for indexing and querying. |
| chunkSize | int | The size of text chunks in the vector store. |
| standaloneDeployment | bool | If True, the document retriever will be deployed as a standalone deployment. |
| scoreMultiplierColumn | str | If provided, the values in this metadata column are used to modify the relevance score of returned chunks for all queries. |
| chunkOverlapFraction | float | The fraction of overlap between chunks. |
| pruneVectors | bool | Transform vectors using SVD so that the average component of vectors in the corpus is removed. |
| chunkSizeFactors | list | Chunk data with multiple sizes. The specified list of factors is used to calculate more sizes, in addition to `chunkSize`. |
| textEncoder | VectorStoreTextEncoder | Encoder used to index texts from the documents. |
| summaryInstructions | str | Instructions for the LLM to generate the document summary. |
| useDocumentSummary | bool | If True, uses the summary of the document in addition to chunks of the document for indexing and querying. |


| REQUIRED | KEY | TYPE | DESCRIPTION |
|---|---|---|---|
| No | honorSentenceBoundary | bool | If provided, will honor sentence boundaries when returning the snippets. |
| No | numRetrievalMarginWords | int | If provided, will add this number of words to the left and right of the returned snippets. |
| No | maxWordsPerSnippet | int | If provided, will limit the number of words in each snippet to the value specified. |
| No | maxSnippetsPerDocument | int | If provided, will limit the number of snippets retrieved from each document to the value specified. |
| No | startWordIndex | int | If provided, will start the snippet at the index (of words in the document) specified. |
| No | endWordIndex | int | If provided, will end the snippet at the index (of words in the document) specified. |
| No | includingBoundingBoxes | bool | If True, will include the bounding boxes of the snippets if they are available. |
| No | text | str | Plain text from which to retrieve snippets. |
| No | documentProcessingConfig | DocumentProcessingConfig | The document processing configuration used to extract text when `docIds` or `blobs` are provided. If provided, this will override the `includingBoundingBoxes` parameter. |


`documentProcessingConfig` fields:

| KEY | TYPE | DESCRIPTION |
|---|---|---|
| removeWatermarks | bool | Whether to remove watermarks. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when `extractBoundingBoxes` is True. |
| convertToMarkdown | bool | Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when `extractBoundingBoxes` is True. |
| extractBoundingBoxes | bool | Whether to perform OCR and extract bounding boxes. If False, no OCR is done and only the embedded text from digital documents is extracted. Defaults to False. |
| ocrMode | OcrMode | OCR mode. There are different OCR modes available for different kinds of documents and use cases. This option only takes effect when `extractBoundingBoxes` is True. |
| removeHeaderFooter | bool | Whether to remove headers and footers. Defaults to False. This option only takes effect when `extractBoundingBoxes` is True. |
| documentType | DocumentType | Type of document. Can be one of Text, Tables and Forms, Embedded Images, etc. If not specified, the type will be decided automatically. |
| extractImages | bool | Whether to extract images from the document, e.g. diagrams in a PDF page. Defaults to False. |
| highlightRelevantText | bool | Whether to extract bounding boxes and highlight relevant text in search results. Defaults to False. |
| useFullOcr | bool | Whether to perform full OCR. If True, OCR will be performed on the full page. If False, OCR will be performed on the non-text regions only. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when `extractBoundingBoxes` is True. |
| maskPii | bool | Whether to mask personally identifiable information (PII) in the document text/tokens. Defaults to False. |

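As a rough illustration of how the keys above fit together, the sketch below builds a request body as a plain Python dictionary and serializes it to JSON. Only the key names come from the table; the document ID, the query, and every numeric value are made-up placeholders. The snippet does not call any endpoint, it only shows the camelCase layout, with the two configuration objects written as nested dictionaries (an assumption about how they would be serialized).

```python
import json

# Hypothetical values throughout; only the key names are taken from the table.
request_body = {
    "docIds": ["example-doc-id"],       # placeholder document store ID
    "query": "termination clause",      # placeholder query string
    "maxSnippetsPerDocument": 3,
    "maxWordsPerSnippet": 120,
    "numRetrievalMarginWords": 10,
    "honorSentenceBoundary": True,
    # Nested configuration objects, assumed here to serialize as plain JSON objects.
    "documentRetrieverConfig": {
        "chunkSize": 512,
        "chunkOverlapFraction": 0.1,
    },
    "documentProcessingConfig": {
        "extractBoundingBoxes": True,
        "removeHeaderFooter": True,
    },
}

# Serialize to a JSON string that could be sent as a request body.
payload = json.dumps(request_body, indent=2)
print(payload)
```

Every key is optional according to the table, so a real request would include only the ones that matter for the call at hand.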
Note: The arguments for the API methods follow camelCase but for Python SDK underscore_case is followed.
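
The naming rule in the note can be applied mechanically. The helper below is a minimal sketch, not part of any SDK: `camel_to_snake` and `to_sdk_kwargs` are hypothetical names, and the conversion only renames top-level keys, leaving nested configuration values untouched.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase key such as 'maxWordsPerSnippet' to snake_case."""
    # Put an underscore before each uppercase letter that follows a
    # lowercase letter or digit, then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def to_sdk_kwargs(api_args: dict) -> dict:
    """Rename the top-level camelCase keys of a dict to snake_case."""
    return {camel_to_snake(key): value for key, value in api_args.items()}


print(camel_to_snake("maxSnippetsPerDocument"))  # max_snippets_per_document
print(to_sdk_kwargs({"docIds": ["example-doc-id"], "maskPii": False}))
```

The result is a purely textual rename; whether a given SDK method accepts each resulting keyword should still be checked against the SDK reference.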