Arguments:

REQUIRED	KEY	TYPE	DESCRIPTION
Yes	docId	str	A unique Docstore string identifier for the document.
Yes	page	int	The page number to retrieve. Page numbers start from 0.
No	documentProcessingConfig	DocumentProcessingConfig	The document processing configuration to use for returning the data when the document is processed via the EXTRACT_DOCUMENT_DATA Feature Group Operator. If the Feature Group Operator is not used, this parameter should be kept as None. If the Feature Group Operator is used but this parameter is not provided, the latest available data or the default configuration will be used.

KEY	TYPE	DESCRIPTION
removeHeaderFooter	bool	Whether to remove headers and footers. Defaults to False. This option only takes effect when extractBoundingBoxes is True.
extractBoundingBoxes	bool	Whether to perform OCR and extract bounding boxes. If False, no OCR is performed and only the embedded text from digital documents is extracted. Defaults to False.
documentType	DocumentType	Type of document. Can be one of Text, Tables and Forms, Embedded Images, etc. If not specified, the type is decided automatically.
ocrMode	OcrMode	OCR mode. Different OCR modes are available for different kinds of documents and use cases. This option only takes effect when extractBoundingBoxes is True.
convertToMarkdown	bool	Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when extractBoundingBoxes is True.
extractImages	bool	Whether to extract images from the document, e.g. diagrams in a PDF page. Defaults to False.
removeWatermarks	bool	Whether to remove watermarks. By default, this is decided automatically based on the OCR mode and the document type. This option only takes effect when extractBoundingBoxes is True.
highlightRelevantText	bool	Whether to extract bounding boxes and highlight relevant text in search results. Defaults to False.
useFullOcr	bool	Whether to perform full OCR. If True, OCR is performed on the full page. If False, OCR is performed on the non-text regions only. By default, this is decided automatically based on the OCR mode and the document type. This option only takes effect when extractBoundingBoxes is True.
maskPii	bool	Whether to mask personally identifiable information (PII) in the document text/tokens. Defaults to False.

No	documentProcessingVersion	str	The document processing version to use for returning the data when the document is processed via the EXTRACT_DOCUMENT_DATA Feature Group Operator. If the Feature Group Operator is not used, this parameter should be kept as None. If the Feature Group Operator is used but this parameter is not provided, the latest version will be used.
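
A minimal sketch of how these arguments might be assembled before a call, and how to spot configuration options that would have no effect. The helper names `build_arguments` and `ineffective_options` are hypothetical and not part of the API. The set of dependent keys is taken from the option descriptions above that say they only take effect when bounding-box extraction is enabled.

```python
# Config keys whose descriptions say they only take effect when
# bounding-box extraction (extractBoundingBoxes) is True.
OCR_DEPENDENT_KEYS = {
    "removeHeaderFooter",
    "ocrMode",
    "convertToMarkdown",
    "removeWatermarks",
    "useFullOcr",
}


def ineffective_options(config):
    """Return the keys set in `config` that have no effect because
    extractBoundingBoxes is missing or False."""
    if config.get("extractBoundingBoxes", False):
        return []
    return sorted(key for key in config if key in OCR_DEPENDENT_KEYS)


def build_arguments(doc_id, page, config=None, version=None):
    """Assemble the argument dict (hypothetical helper).

    The two optional arguments are left out when they are None.
    """
    if page < 0:
        raise ValueError("page numbers start from 0")
    args = {"docId": doc_id, "page": page}
    if config is not None:
        args["documentProcessingConfig"] = config
    if version is not None:
        args["documentProcessingVersion"] = version
    return args


if __name__ == "__main__":
    # Illustrative values only; "doc-123" is a made-up identifier.
    config = {"convertToMarkdown": True, "maskPii": True}
    print(build_arguments("doc-123", 0, config=config))
    print("Ignored without extractBoundingBoxes:", ineffective_options(config))
```

Checking the configuration up front is a design choice for this sketch: it surfaces a likely mistake (requesting markdown without enabling bounding-box extraction) before any request is made.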

Note: The arguments for the API methods follow camelCase but for Python SDK underscore_case is followed.
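
The naming note above can be illustrated with a small conversion helper. This is only a sketch: `camel_to_snake` and `to_sdk_kwargs` are hypothetical names, and it assumes the Python SDK argument names are a purely mechanical underscore_case version of the camelCase API names.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase name to underscore_case, e.g. docId -> doc_id."""
    # Insert an underscore before every uppercase letter except at the
    # start of the string, then lowercase the result.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_sdk_kwargs(api_args):
    """Rename the top-level keys of an API argument dict to underscore_case.

    Values (including any nested configuration dict) are left untouched.
    """
    return {camel_to_snake(key): value for key, value in api_args.items()}


if __name__ == "__main__":
    for key in ("docId", "page", "documentProcessingConfig", "documentProcessingVersion"):
        print(key, "->", camel_to_snake(key))
```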

Response:

KEY	TYPE	DESCRIPTION
success	Boolean	true if the call succeeded, false if there was an error
result	PageData	The data for the requested page, with the following fields:

KEY	TYPE	Description
docId	str	Unique Docstore string identifier for the document.
page	int	The page number. Starts from 0.
height	int	The height of the page in pixels.
width	int	The width of the page in pixels.
pageCount	int	The total number of pages in the document.
pageText	str	The text extracted from the page.
pageTokenStartOffset	int	The offset of the first token in the page.
tokenCount	int	The number of tokens in the page.
tokens	list	The tokens in the page.
extractedText	str	The extracted text in the page obtained from OCR.
rotationAngle	float	The detected rotation angle of the page in degrees. Positive values indicate clockwise and negative values indicate anti-clockwise rotation from the original orientation.
pageMarkdown	str	The markdown text for the page.
embeddedText	str	The embedded text in the page. Only available for digital documents.
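
A minimal sketch of how a response with the fields above might be consumed once it has been decoded into a Python dict. The function names are hypothetical, the sample values are made up for illustration, and two choices here are assumptions rather than documented behavior: the order in which the text fields are preferred, and the reading of `pageTokenStartOffset` as an index into a document-wide token sequence.

```python
def unwrap(response):
    """Return the PageData dict, or raise if the call did not succeed."""
    if not response.get("success"):
        raise RuntimeError("call did not succeed")
    return response["result"]


def best_page_text(page_data):
    """Return the first non-empty text field.

    Preference order (an assumption of this sketch): markdown, OCR text,
    embedded text, then the plain page text.
    """
    for key in ("pageMarkdown", "extractedText", "embeddedText", "pageText"):
        text = page_data.get(key)
        if text:
            return text
    return ""


def token_range(page_data):
    """Token indices covered by this page, assuming the offset counts
    tokens from the start of the document."""
    start = page_data["pageTokenStartOffset"]
    return range(start, start + page_data["tokenCount"])


# Made-up sample response, for illustration only.
SAMPLE_RESPONSE = {
    "success": True,
    "result": {
        "docId": "doc-123",
        "page": 1,
        "pageCount": 3,
        "pageText": "Hello world",
        "pageTokenStartOffset": 40,
        "tokenCount": 2,
        "pageMarkdown": None,
        "extractedText": "",
        "embeddedText": "Hello world",
    },
}

if __name__ == "__main__":
    page = unwrap(SAMPLE_RESPONSE)
    print(best_page_text(page))
    print(list(token_range(page)))
```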

Exceptions:

TYPE	WHEN
DataNotFoundError	`docId` is not found.
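
A minimal sketch of handling the error listed above. To keep the example self-contained, `DataNotFoundError` is defined here as a local stand-in; in real code the class would come from whichever client library is in use. `get_page_or_none` and `fake_fetch` are hypothetical names, and `fake_fetch` only simulates a lookup against a made-up set of known documents.

```python
class DataNotFoundError(Exception):
    """Local stand-in for the error type listed in the table above."""


def get_page_or_none(fetch_page, doc_id, page):
    """Call `fetch_page(doc_id, page)`; return None if the docId is unknown."""
    try:
        return fetch_page(doc_id, page)
    except DataNotFoundError:
        return None


def fake_fetch(doc_id, page):
    """Simulated fetch used only for this demonstration."""
    known_docs = {"doc-123"}  # made-up identifier
    if doc_id not in known_docs:
        raise DataNotFoundError(f"docId {doc_id!r} is not found")
    return {"docId": doc_id, "page": page}


if __name__ == "__main__":
    print(get_page_or_none(fake_fetch, "doc-123", 0))
    print(get_page_or_none(fake_fetch, "missing-doc", 0))
```

Passing the fetch function in as a parameter keeps the wrapper independent of any particular client, which also makes it easy to exercise without network access.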

Language: