OCR
When documents are uploaded into the platform (for example, as a zip file of PDFs), we automatically create a feature group, and the text of each document is extracted and saved.
It is also useful to extract text from a document on demand, for example when building AI Agent use cases.
Extract embedded text from a local document
The BlobInput class is optional here. It is still worth knowing, because documents uploaded to AI Workflows (Agents) are automatically converted to this class. Alternatively, you can pass the bytes of the file directly into extract_document_data.
from abacusai.client import BlobInput
import abacusai
client = abacusai.ApiClient('YOUR_API_KEY')
document = BlobInput.from_local_file("YOUR_DOCUMENT.pdf/word/etc")
extracted_doc_data = client.extract_document_data(document.contents)
# print first 100 characters of page 0
print(extracted_doc_data.pages[0][0:100])
print()
# print first 100 characters of all embedded text
print(extracted_doc_data.embedded_text[0:100])
Extract text using OCR from a local document
Now, let's extract data using OCR:
extracted_doc_data = client.extract_document_data(
    document.contents,
    document_processing_config={'extract_bounding_boxes': True, 'ocr_mode': 'DEFAULT', 'use_full_ocr': True},
)
# print first 100 characters of the OCR-extracted text
print(extracted_doc_data.extracted_text[0:100])
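One possible pattern is to request OCR only when the embedded text looks empty, which is common for scanned PDFs. The sketch below is an assumption and not part of the Abacus.AI client: needs_ocr and its min_chars threshold are hypothetical names chosen for illustration, and the right threshold depends on your documents.

```python
def needs_ocr(embedded_text, min_chars=20):
    """Return True when the embedded text layer looks empty or too short.

    Hypothetical helper: `min_chars` is an arbitrary illustrative threshold.
    """
    # Treat None the same as an empty string, and ignore surrounding whitespace.
    text = (embedded_text or "").strip()
    return len(text) < min_chars


# Stand-in strings instead of real API results, so the sketch runs offline.
scanned_pdf_text = "   \n"
digital_pdf_text = "Quarterly report: revenue grew in all regions."

print(needs_ocr(scanned_pdf_text))  # whitespace only, so OCR is suggested
print(needs_ocr(digital_pdf_text))  # long enough, so embedded text suffices
```

With the objects from the examples above, the check could be applied to the embedded_text of the first extract_document_data result before deciding whether to make the OCR call.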
Get the text of a document uploaded into the platform
If you have already uploaded the file into the platform, the feature group will show a doc_id for it. You can use that doc_id to get the text of the document:
doc_data = client.get_docstore_document_data('DOC_ID')
# print first 100 characters from embedded text
print('------------------------------')
print('Embedded Text:\n')
print(doc_data.embedded_text[0:100])
print('------------------------------')
# print first 100 characters from OCR detected text
print('Extracted (OCR) Text:\n')
print(doc_data.extracted_text[0:100])
Load a Documents Feature group as a Pandas DataFrame
You can also load the feature group as a Pandas DataFrame: fetch it with the describe_feature_group_by_table_name method of the Abacus.AI client, then call load_as_pandas_documents on the result.
feature_group = client.describe_feature_group_by_table_name('YOUR_FEATURE_GROUP_NAME')
df = feature_group.load_as_pandas_documents(doc_id_column='doc_id', document_column='page_infos')
df['page_infos'][0].keys()
# dict_keys(['pages', 'tokens', 'metadata', 'extracted_text'])
- pages: This is the embedded text from the document on a per-page level
- extracted_text: This is the OCR-extracted text from the document
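To see how the two text fields compare for a document, a small summary over one page_infos entry can help. This is a minimal sketch under stated assumptions: summarize_page_infos is a hypothetical helper, the sample dictionary is made up, and pages is assumed to be a list of per-page strings, as the keys and indexing shown above suggest.

```python
def summarize_page_infos(page_infos):
    """Count pages and characters in one page_infos dictionary.

    Hypothetical helper: assumes 'pages' is a list of per-page strings
    and 'extracted_text' is a single string; missing values count as empty.
    """
    pages = page_infos.get("pages") or []
    extracted = page_infos.get("extracted_text") or ""
    return {
        "num_pages": len(pages),
        "embedded_chars": sum(len(page or "") for page in pages),
        "ocr_chars": len(extracted),
    }


# Made-up sample with the same keys as df['page_infos'][0]; the second page
# has no embedded text, as a scanned page might.
sample = {
    "pages": ["Hello world", ""],
    "tokens": [],
    "metadata": {},
    "extracted_text": "Hello world\nScanned page",
}

summary = summarize_page_infos(sample)
print(summary)
```

On the DataFrame loaded above, the same function could be applied row by row with df['page_infos'].apply(summarize_page_infos). A large gap between embedded_chars and ocr_chars would suggest pages whose text is only available through OCR.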