Method

extractDocumentData POST

Extracts data from a document using either OCR (for scanned documents and images) or embedded text extraction (for digital documents such as .docx). Configure the extraction method through DocumentProcessingConfig.

Arguments:

KEY	TYPE	DESCRIPTION
document	bytes	The document to extract data from. One of document or docId must be provided.
docId	str	A unique Docstore string identifier for the document. One of document or docId must be provided.
documentProcessingConfig	DocumentProcessingConfig	The document processing configuration, with the following fields:

KEY	TYPE	Description
highlightRelevantText	bool	Whether to extract bounding boxes and highlight relevant text in search results. Defaults to False.
extractBoundingBoxes	bool	Whether to perform OCR and extract bounding boxes. If False, no OCR is performed and only the embedded text of digital documents is extracted. Defaults to False.
useFullOcr	bool	Whether to perform full OCR. If True, OCR is performed on the full page. If False, OCR is performed on non-text regions only. By default, this is decided automatically based on the OCR mode and the document type. Only takes effect when extractBoundingBoxes is True.
removeHeaderFooter	bool	Whether to remove headers and footers. Defaults to False. Only takes effect when extractBoundingBoxes is True.
ocrMode	OcrMode	OCR mode. Different OCR modes are available for different kinds of documents and use cases. Only takes effect when extractBoundingBoxes is True.
removeWatermarks	bool	Whether to remove watermarks. By default, this is decided automatically based on the OCR mode and the document type. Only takes effect when extractBoundingBoxes is True.
maskPii	bool	Whether to mask personally identifiable information (PII) in the document text and tokens. Defaults to False.
extractImages	bool	Whether to extract images from the document, for example diagrams in a PDF page. Defaults to False.
convertToMarkdown	bool	Whether to convert extracted text to Markdown. Defaults to False. Only takes effect when extractBoundingBoxes is True.
documentType	DocumentType	Type of document. Can be one of Text, Tables and Forms, Embedded Images, etc. If not specified, the type is decided automatically.

startPage	int	The starting page to extract data from. Pages are indexed starting from 0. If not provided, the first page will be used.
endPage	int	The last page to extract data from. Pages are indexed starting from 0. If not provided, the last page will be used.
returnExtractedPageText	bool	Specifies whether to include a list of extracted text for each page in the response. Defaults to false if not provided.
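Several documentProcessingConfig fields listed above are documented as taking effect only when extractBoundingBoxes is True. The following is a minimal sketch of a client-side check for that dependency. The helper name `ignored_options` and the sample configuration are hypothetical and not part of the API; the option names come from the table.

```python
from typing import Any, Dict, List

# Options documented as taking effect only when extractBoundingBoxes is True.
OCR_ONLY_OPTIONS = (
    "useFullOcr",
    "removeHeaderFooter",
    "ocrMode",
    "removeWatermarks",
    "convertToMarkdown",
)


def ignored_options(config: Dict[str, Any]) -> List[str]:
    """Return OCR-only options that are set while extractBoundingBoxes is off.

    Hypothetical helper: the service itself may or may not report this.
    """
    if config.get("extractBoundingBoxes", False):
        return []
    return [key for key in OCR_ONLY_OPTIONS if config.get(key) is not None]


# Hypothetical configuration that forgets to enable extractBoundingBoxes.
config = {"convertToMarkdown": True, "removeHeaderFooter": True, "maskPii": True}
print(ignored_options(config))
```

Running a check like this before sending a request makes it easier to notice a configuration whose OCR-related settings would have no effect.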

Note: API method arguments use camelCase, while the Python SDK uses the equivalent underscore_case names (for example, docId becomes doc_id).
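The naming rule in the note above can be illustrated with a small conversion sketch. It assumes the mapping between the two styles is purely mechanical, which holds for every argument listed in this section. The helpers `snake_to_camel` and `to_api_arguments` and the sample values are hypothetical and not part of the SDK.

```python
from typing import Any, Dict


def snake_to_camel(name: str) -> str:
    """Convert an underscore_case name such as doc_id to camelCase (docId)."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def to_api_arguments(sdk_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Rename keys to camelCase, recursing into nested dicts and dropping None."""
    api_args: Dict[str, Any] = {}
    for key, value in sdk_kwargs.items():
        if value is None:
            continue  # omitted arguments fall back to the documented defaults
        if isinstance(value, dict):
            value = to_api_arguments(value)
        api_args[snake_to_camel(key)] = value
    return api_args


# Hypothetical SDK-style arguments; the document ID is a placeholder.
sdk_style = {
    "doc_id": "example-doc-id",
    "document_processing_config": {
        "extract_bounding_boxes": True,
        "convert_to_markdown": True,
    },
    "start_page": 0,  # pages are indexed from 0
    "end_page": 2,
    "return_extracted_page_text": True,
}
api_args = to_api_arguments(sdk_style)
print(api_args)
```

The resulting dictionary uses the camelCase keys from the Arguments table. How it is sent (endpoint, authentication, encoding of the document bytes) is not covered by this section and is left out of the sketch.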

Response:

KEY	TYPE	DESCRIPTION
success	Boolean	true if the call succeeded, false if there was an error.
result	DocumentData	The extracted document data, with the following fields:

KEY	TYPE	Description
docId	str	Unique Docstore string identifier for the document.
mimeType	str	The MIME type of the document.
pageCount	int	The number of pages for which data is available. This is generally the same as the total number of pages, but may be lower if only selected pages were processed.
totalPageCount	int	The total number of pages in the document.
extractedText	str	The extracted text in the document obtained from OCR.
embeddedText	str	The embedded text in the document. Only available for digital documents.
pages	list	List of embedded text for each page in the document. Only available for digital documents.
tokens	list	List of extracted tokens in the document obtained from OCR.
metadata	list	List of metadata for each page in the document.
pageMarkdown	list	The Markdown text for each page in the document.
extractedPageText	list	List of extracted text for each page in the document obtained from OCR. Available when the returnExtractedPageText parameter is set to True in the document data retrieval API.
augmentedPageText	list	List of extracted text for each page in the document obtained from OCR augmented with embedded links in the document.
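As a rough illustration of how a caller might consume the response fields above, the sketch below picks the OCR text when present and falls back to the embedded text otherwise, and flags partially processed documents. It assumes the response has already been decoded into a Python dict and that fields which do not apply are missing or empty. The helper names and the sample response are hypothetical.

```python
from typing import Any, Dict


def document_text(response: Dict[str, Any]) -> str:
    """Return OCR text if available, otherwise embedded text, otherwise ''."""
    if not response.get("success"):
        raise RuntimeError("extractDocumentData call did not succeed")
    result = response.get("result") or {}
    return result.get("extractedText") or result.get("embeddedText") or ""


def is_partial(response: Dict[str, Any]) -> bool:
    """True when fewer pages were processed than the document contains."""
    result = response.get("result") or {}
    return result.get("pageCount", 0) < result.get("totalPageCount", 0)


# Hypothetical decoded response for a digital PDF processed without OCR.
sample = {
    "success": True,
    "result": {
        "docId": "example-doc-id",
        "mimeType": "application/pdf",
        "pageCount": 3,
        "totalPageCount": 10,
        "extractedText": "",
        "embeddedText": "Hello world",
    },
}
print(document_text(sample))
print(is_partial(sample))
```

Preferring extractedText over embeddedText is a design choice for this sketch only; a caller that only handles digital documents could read embeddedText directly.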

Exceptions:

TYPE	WHEN
DataNotFoundError	`docId` is not found.
