OCR

When documents are uploaded to the platform (for example, as a zip file of PDFs), a feature group is created automatically, and the text of each document is extracted and saved.

It is also useful to extract text from a document on demand, which lets you build AI Agent use cases.

Extract embedded text from a local document

The BlobInput class is optional here. When documents are uploaded to AI Workflows (Agents), they are automatically converted to this class, so it is worth knowing about. Either way, you can pass the bytes of the file directly to extract_document_data.

from abacusai.client import BlobInput
import abacusai
client = abacusai.ApiClient('YOUR_API_KEY')
document = BlobInput.from_local_file("YOUR_DOCUMENT.pdf/word/etc")

extracted_doc_data = client.extract_document_data(document.contents)
# print first 100 characters of page 0
print(extracted_doc_data.pages[0][0:100])
print()
# print first 100 characters of all embedded text
print(extracted_doc_data.embedded_text[0:100]) 
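
Since BlobInput is optional, the same call can be made with bytes read straight from disk. The following is a minimal sketch, assuming extract_document_data accepts the raw file bytes as its first argument, as described above. The helper name extract_from_path is hypothetical and not part of the Abacus.AI client.

```python
from pathlib import Path


def extract_from_path(client, path):
    """Read a local file and pass its raw bytes to extract_document_data.

    `client` is assumed to be an abacusai.ApiClient (or any object that
    exposes an extract_document_data method). Hypothetical helper.
    """
    file_bytes = Path(path).read_bytes()  # no BlobInput wrapper needed
    return client.extract_document_data(file_bytes)


# Usage, with the client created as shown above:
# extracted_doc_data = extract_from_path(client, "YOUR_DOCUMENT.pdf")
```

Passing the client in as a parameter keeps the helper easy to try out with a stand-in object before pointing it at a real API key.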

Extract text using OCR from a local document

Now, let's extract data using OCR:

extracted_doc_data = client.extract_document_data(
    document.contents,
    document_processing_config={'extract_bounding_boxes': True, 'ocr_mode': 'DEFAULT', 'use_full_ocr': True},
)

# print first 100 characters of the OCR-extracted text
print(extracted_doc_data.extracted_text[0:100])
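
Scanned documents often carry no embedded text layer, so an agent may want to prefer embedded text and fall back to the OCR result only when the embedded text is blank. The sketch below assumes the result object exposes embedded_text and extracted_text as in the examples above, and that either one may be empty or missing. The helper name best_available_text is hypothetical.

```python
def best_available_text(doc_data):
    """Return the embedded text if it is non-blank, otherwise the OCR text.

    Assumes `doc_data` has `embedded_text` and `extracted_text` attributes,
    either of which may be None or empty. Hypothetical helper.
    """
    embedded = getattr(doc_data, "embedded_text", None) or ""
    if embedded.strip():
        return embedded
    # Fall back to the OCR result; empty string if that is missing too.
    return getattr(doc_data, "extracted_text", None) or ""


# Usage:
# print(best_available_text(extracted_doc_data)[0:100])
```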

Get the text of a document uploaded into the platform

If you have already uploaded the file into the platform, its feature group contains a doc_id for each document. Use that doc_id to retrieve the text of the document:

doc_data = client.get_docstore_document_data('DOC_ID')
# print first 100 characters of the embedded text
print('------------------------------')
print('Embedded Text:\n')
print(doc_data.embedded_text[0:100])
print('------------------------------')
# print first 100 characters of the OCR-detected text
print('Extracted (OCR) Text:\n')
print(doc_data.extracted_text[0:100])

Load a documents feature group as a Pandas DataFrame

You can also load the feature group as a Pandas DataFrame: look it up with the describe_feature_group_by_table_name method of the Abacus.AI client, then call load_as_pandas_documents on the result.

feature_group = client.describe_feature_group_by_table_name('YOUR_FEATURE_GROUP_NAME')
df = feature_group.load_as_pandas_documents(doc_id_column='doc_id', document_column='page_infos')
# inspect the keys available for the first document
df['page_infos'][0].keys()
# dict_keys(['pages', 'tokens', 'metadata', 'extracted_text'])

pages: the embedded text of the document, on a per-page level
extracted_text: the OCR-extracted text of the document
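
To inspect what was loaded, it can help to preview each page of a document. The sketch below assumes that each page_infos entry is a dict like the one shown above and that its pages value is a list with one string per page. That shape is an assumption based on the key descriptions, and page_previews is a hypothetical helper name.

```python
def page_previews(page_info, n_chars=100):
    """Return the first n_chars of each page's embedded text.

    Assumes `page_info` is a dict with a 'pages' list of per-page strings
    (entries may be None). Hypothetical helper.
    """
    pages = page_info.get("pages") or []
    return [(page or "")[:n_chars] for page in pages]


# Usage, with df loaded as shown above:
# for i, preview in enumerate(page_previews(df['page_infos'][0])):
#     print(i, preview)
```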
