Vector Store

Abacus uses its own proprietary vector store called the Document Retriever. Below are some basic client methods for creating vector stores on the fly, as well as for using already-created vector stores (document retrievers).

Creating a Vector Store on the Fly Using a Local Document

This is useful, for example, when building an AI Workflow (Agent) process.

First, we load the file using the BlobInput class.

from abacusai.client import BlobInput
import abacusai

client = abacusai.ApiClient('YOUR_API_KEY')
# Load a local file (e.g., a PDF or Word document) into memory
document = BlobInput.from_local_file("YOUR_DOCUMENT.pdf")

Then, we can index and query the contents on the fly:

# Returns chunks of documents that are relevant to the query and can be used to feed an LLM
# Example for a blob held in the notebook's memory

relevant_snippets = client.get_relevant_snippets(
    blobs={"document": document.contents},
    query="What are the key terms")

If you have already uploaded the document to the Abacus platform and it has a doc_id value, you can also use the following command directly:

# Returns chunks of documents that are relevant to the query and can be used to feed an LLM
# Example for documents in the docstore

relevant_snippets = client.get_relevant_snippets(
    doc_ids=['YOUR_DOC_ID_1', 'YOUR_DOC_ID_2'],
    query="What are the key terms")

relevant_snippets  # Inspect the returned snippets
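
The returned snippets can then be stitched into an LLM prompt. The sketch below is illustrative only: it assumes each snippet exposes a document attribute holding the chunk text (as the retriever results later on this page do) and uses the client's evaluate_prompt method.

# A minimal sketch: assemble the snippets into a context block and query an LLM.
# Assumes each snippet has a `document` attribute with the chunk text.
context = "\n\n".join(snippet.document for snippet in relevant_snippets)

llm_response = client.evaluate_prompt(
    prompt=f"Answer using only the context below.\n\nQuestion: What are the key terms?\n\nContext:\n{context}",
    system_message="Answer strictly from the provided context.")
print(llm_response.content)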

Creating a Standalone Document Retriever (Vector Store)

The first step is to add a feature group of type DOCUMENTS to the project:

client.add_feature_group_to_project(
    feature_group_id='YOUR_FEATURE_GROUP_ID_WITH_DOCUMENTS',
    project_id='YOUR_PROJECT_ID',
    feature_group_type='DOCUMENTS'  # Mandatory
)

# Optionally, infer the feature mappings to check how the platform interprets the columns
ifm = client.infer_feature_mappings(project_id='YOUR_PROJECT_ID', feature_group_id='YOUR_FEATURE_GROUP_ID')


# This block of code might be useful for fixing a feature group for docstore usage by document retrievers

# client.set_feature_group_type(project_id='YOUR_PROJECT_ID', feature_group_id='YOUR_FEATURE_GROUP_ID', feature_group_type='DOCUMENTS')
# client.set_feature_mapping(project_id='YOUR_PROJECT_ID', feature_group_id='YOUR_FEATURE_GROUP_ID', feature_name='doc_id', feature_mapping='DOCUMENT_ID')
# client.set_feature_mapping(project_id='YOUR_PROJECT_ID', feature_group_id='YOUR_FEATURE_GROUP_ID', feature_name='page_infos', feature_mapping='DOCUMENT')
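
To sanity-check the configuration, you can inspect the inferred mappings before creating the retriever:

# Inspect the inferred mappings to confirm the DOCUMENT_ID and DOCUMENT columns were detected
print(ifm)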

Now you can create the document retriever using the code below:

# Creating a document retriever

document_retriever = client.create_document_retriever(
    project_id='YOUR_PROJECT_ID',
    name='NAME_OF_YOUR_DOCUMENT_RETRIEVER',
    feature_group_id='YOUR_FEATURE_GROUP_ID'
)
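
Indexing starts when the retriever is created and can take a while. As a minimal sketch, and assuming the returned object exposes the wait_until_ready helper that other Abacus resource objects provide, you can block until it is ready:

# Block until the document retriever has finished indexing
# (wait_until_ready is assumed here, mirroring other Abacus resource objects)
document_retriever.wait_until_ready()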

Or, if you used the UI to create a document retriever, you can access it by its name:

dr = client.describe_document_retriever_by_name('DOCUMENT_RETRIEVER_NAME')

or by its ID:

dr = client.describe_document_retriever('YOUR_DOCUMENT_RETRIEVER_ID')

Once you have loaded the document retriever object, you can query it using get_matching_documents:

# Filters let you restrict the search to documents whose metadata matches; the metadata
# comes from the feature group (in essence, feature group columns).

response = client.get_matching_documents(
    document_retriever_id='DOCUMENT_RETRIEVER_ID',
    query='WHATEVER_YOU_NEED_TO_ASK',
    limit=10,
    filters={"state": ["MICHIGAN", "NATIONAL"]})

print(response[0].document)  # The first search result's text
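
Each result carries its matched text in the document attribute, so you can, for example, walk through all returned matches:

# Print every match returned by the retriever
for i, result in enumerate(response):
    print(f"--- Match {i + 1} ---")
    print(result.document)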

The above examples assume that you either:

  • Upload a document on the fly (in an AI workflow, for instance), or
  • Have a file locally in your environment/IDE.

In some scenarios, you may need to preprocess documents after they have been uploaded as a DOCUMENTSET. Note that this only applies to List of Documents datasets. Each uploaded document receives a unique doc_id, which you can use to load the file's actual bytes:

import io
import abacusai

client = abacusai.ApiClient('YOUR_API_KEY')

# Load the raw bytes of a docstore document into an in-memory buffer
doc_bytes = io.BytesIO(client.get_docstore_document('YOUR_DOC_ID').read())

You can find the doc_id in the feature group within the Abacus platform UI. You can also set up a programmatic process to load and process all documents from a feature group. This is useful in Python feature groups for extracting specific data:

df = client.describe_feature_group_by_table_name('YOUR_FG_NAME').load_as_pandas()
for doc_id in df['doc_id']:
    doc_bytes = io.BytesIO(client.get_docstore_document(doc_id).read())
    # Do any preprocessing here
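
As an illustration of that preprocessing step, the sketch below extracts plain text from each document. It assumes the files are PDFs and that the third-party pypdf package is installed; neither assumption is implied by the docstore itself.

from pypdf import PdfReader  # assumed dependency; any PDF parser works here

df = client.describe_feature_group_by_table_name('YOUR_FG_NAME').load_as_pandas()
for doc_id in df['doc_id']:
    doc_bytes = io.BytesIO(client.get_docstore_document(doc_id).read())
    reader = PdfReader(doc_bytes)
    # Extract plain text page by page for downstream processing
    text = "\n".join(page.extract_text() or "" for page in reader.pages)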