Method
getDocstoreDocumentData POST

Returns the extracted data for a document.

Arguments:

REQUIRED KEY TYPE DESCRIPTION
Yes docId str A unique Docstore string identifier for the document.
No documentProcessingConfig DocumentProcessingConfig The document processing configuration to use for returning the data when the document is processed via the EXTRACT_DOCUMENT_DATA Feature Group Operator. If the Feature Group Operator is not used, keep this parameter as None. If the Feature Group Operator is used but this parameter is not provided, the latest available data or the default configuration will be used.
KEY TYPE Description
highlightRelevantText bool Whether to extract bounding boxes and highlight relevant text in search results. Defaults to False.
removeHeaderFooter bool Whether to remove headers and footers. Defaults to False. This option only takes effect when extractBoundingBoxes is True.
convertToMarkdown bool Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when extractBoundingBoxes is True.
ocrMode OcrMode OCR mode. Different OCR modes are available for different kinds of documents and use cases. This option only takes effect when extractBoundingBoxes is True.
maskPii bool Whether to mask personally identifiable information (PII) in the document text/tokens. Defaults to False.
documentType DocumentType Type of document. Can be one of Text, Tables and Forms, Embedded Images, etc. If not specified, type will be decided automatically.
useFullOcr bool Whether to perform full OCR. If True, OCR will be performed on the full page. If False, OCR will be performed on the non-text regions only. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extractBoundingBoxes is True.
extractBoundingBoxes bool Whether to perform OCR and extract bounding boxes. If False, no OCR will be done but only the embedded text from digital documents will be extracted. Defaults to False.
removeWatermarks bool Whether to remove watermarks. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extractBoundingBoxes is True.
No documentProcessingVersion str The document processing version to use for returning the data when the document is processed via the EXTRACT_DOCUMENT_DATA Feature Group Operator. If the Feature Group Operator is not used, keep this parameter as None. If the Feature Group Operator is used but this parameter is not provided, the latest version will be used.
No returnExtractedPageText bool Specifies whether to include a list of extracted text for each page in the response. Defaults to false if not provided.
Note: Arguments for the API methods use camelCase; the Python SDK uses underscore_case for the same arguments.
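
As a rough illustration of the argument table and the naming note, the sketch below builds a camelCase request body from underscore_case Python arguments. The helper names (`to_camel_case`, `build_request_body`) are hypothetical and not part of any SDK, and passing `documentProcessingConfig` as a plain dictionary is an assumption. The sketch also warns when an option that depends on extractBoundingBoxes is set without it.

```python
import warnings

# Options that, per the argument table, only take effect
# when extractBoundingBoxes is True.
NEEDS_BOUNDING_BOXES = {
    "removeHeaderFooter",
    "convertToMarkdown",
    "ocrMode",
    "useFullOcr",
    "removeWatermarks",
}


def to_camel_case(name: str) -> str:
    """Convert an underscore_case name to camelCase (doc_id -> docId)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def build_request_body(
    doc_id,
    document_processing_config=None,
    document_processing_version=None,
    return_extracted_page_text=None,
):
    """Hypothetical helper: assemble the JSON body for getDocstoreDocumentData.

    Optional arguments left as None are omitted from the body.
    """
    body = {"docId": doc_id}
    if document_processing_config is not None:
        # Assumption: the config is sent as a plain dict with camelCase keys.
        config = {
            to_camel_case(key): value
            for key, value in document_processing_config.items()
        }
        if not config.get("extractBoundingBoxes", False):
            ignored = sorted(NEEDS_BOUNDING_BOXES.intersection(config))
            if ignored:
                warnings.warn(
                    "These options have no effect unless "
                    "extractBoundingBoxes is True: " + ", ".join(ignored)
                )
        body["documentProcessingConfig"] = config
    if document_processing_version is not None:
        body["documentProcessingVersion"] = document_processing_version
    if return_extracted_page_text is not None:
        body["returnExtractedPageText"] = return_extracted_page_text
    return body
```

The resulting dictionary could then be sent as the POST body with whatever HTTP client and authentication the deployment uses; those details are outside what this page specifies.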

Response:

KEY TYPE DESCRIPTION
success Boolean true if the call succeeded, false if there was an error
result DocumentData
KEY TYPE Description
docId str Unique Docstore string identifier for the document.
mimeType str The mime type of the document.
pageCount int The number of pages for which data is available. This is generally the same as the total number of pages, but may be lower if only selected pages were processed.
totalPageCount int The total number of pages in the document.
extractedText str The extracted text in the document obtained from OCR.
embeddedText str The embedded text in the document. Only available for digital documents.
pages list List of embedded text for each page in the document. Only available for digital documents.
tokens list List of extracted tokens in the document obtained from OCR.
metadata list List of metadata for each page in the document.
pageMarkdown list List of markdown text for each page in the document.
extractedPageText list List of extracted text for each page in the document obtained from OCR. Available when the returnExtractedPageText parameter is set to True in the document data retrieval API.
augmentedPageText list List of extracted text for each page in the document obtained from OCR augmented with embedded links in the document.
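
The response fields above can be consumed along these lines. This is a minimal sketch that assumes the response has already been decoded into a Python dictionary; `summarize_document_data` is a hypothetical helper, and preferring OCR text over embedded text is an illustrative choice rather than documented behavior.

```python
def summarize_document_data(response: dict) -> dict:
    """Hypothetical helper: pull a few fields out of a decoded response."""
    if not response.get("success"):
        # The page only says success is false on error; the exact error
        # payload is not described here, so no further detail is assumed.
        raise RuntimeError("getDocstoreDocumentData call did not succeed")

    result = response["result"]
    page_count = result.get("pageCount")
    total_page_count = result.get("totalPageCount")

    # pageCount may be lower than totalPageCount when only selected
    # pages were processed.
    partial = (
        page_count is not None
        and total_page_count is not None
        and page_count < total_page_count
    )

    # Illustrative choice: use OCR text when present, else embedded text.
    text = result.get("extractedText") or result.get("embeddedText") or ""

    return {"docId": result.get("docId"), "partial": partial, "text": text}


# Made-up response used only to exercise the helper.
sample = {
    "success": True,
    "result": {
        "docId": "doc-123",
        "pageCount": 2,
        "totalPageCount": 5,
        "embeddedText": "hello",
    },
}
summary = summarize_document_data(sample)
```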

Exceptions:

TYPE WHEN
DataNotFoundError docId is not found.
