
Feature Group & Dataset Operations

The objective of this tutorial is to familiarize you with Dataset and Feature Group manipulation using the Abacus client.

  • Datasets: Raw data uploaded to the platform.
  • Feature Groups: Processed data derived from Datasets, which can be used for model training and evaluation.

To start with, let's initialize our client:

import abacusai
client = abacusai.ApiClient("YOUR API KEY")
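Rather than hardcoding the key in the script, you can read it from an environment variable. A minimal sketch, assuming a variable named `ABACUS_API_KEY` (the variable name and helper function are illustrative, not part of the Abacus client):

```python
import os

def get_api_key(env_var="ABACUS_API_KEY"):
    # Hypothetical helper: fetch the API key from the environment
    # so it never ends up committed in source code.
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before creating the client")
    return key

# client = abacusai.ApiClient(get_api_key())
```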

Manipulating a Project

Now, let's describe a project:

# Gets information about the project based on its ID.
project = client.describe_project(project_id="YOUR_PROJECT_ID")

# A list of all models trained under the project
models = client.list_models(project_id="YOUR_PROJECT_ID")

We can also load a feature group locally on our machine:

# Loads a specific version of a FeatureGroup
fg1 = client.describe_feature_group_version("FEATURE_GROUP_VERSION")
df1 = fg1.load_as_pandas()

# Loads the latest version of FeatureGroup based on a name
fg2 = client.describe_feature_group_by_table_name("FEATURE_GROUP_NAME")

# Loads the FeatureGroup as a pandas dataframe
df2 = fg2.load_as_pandas()
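Once loaded, the feature group is an ordinary pandas DataFrame. A quick inspection sketch, using a made-up stand-in DataFrame (the column names here are assumptions for illustration, not from the actual feature group):

```python
import pandas as pd

# Stand-in for the DataFrame returned by load_as_pandas()
df2 = pd.DataFrame({"user_id": [1, 2, 3], "amount": [9.5, 3.2, 7.8]})

# Typical first checks after loading a feature group locally
print(df2.shape)              # (rows, columns)
print(df2.columns.tolist())   # column names
print(df2.head(2))            # first two rows
```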

You can also add a feature group to a project programmatically:

client.add_feature_group_to_project(
    feature_group_id='FEATURE_GROUP_ID',
    project_id='PROJECT_ID',
    feature_group_type='CUSTOM_TABLE'  # Set to DOCUMENTS if this is a document set
)

Uploading a Dataset from a Local File

To create a Dataset from a local file:

import io

zip_filename = 'sample_data_folder.zip'

with open(zip_filename, 'rb') as f:
    zip_file_content = f.read()

zip_file_io = io.BytesIO(zip_file_content)

# If the ZIP archive contains unstructured text documents (PDF, Word, etc.), set `is_documentset=True`
upload = client.create_dataset_from_upload(table_name='MY_SAMPLE_DATA', file_format='ZIP', is_documentset=False)
upload.upload_file(zip_file_io)
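The upload payload does not have to come from a file on disk. A sketch of assembling the ZIP archive in memory with the standard library, so the same `upload_file` call can be fed programmatically generated data (the file name and CSV content are examples only):

```python
import io
import zipfile

# Build a ZIP archive entirely in memory
zip_file_io = io.BytesIO()
with zipfile.ZipFile(zip_file_io, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data/sample.csv", "id,value\n1,a\n2,b\n")
zip_file_io.seek(0)  # Rewind so the upload reads from the start

# upload.upload_file(zip_file_io)  # Same call as above
```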

Following up on the above, you can also update this dataset from a local file; a new version will be created with the same ID and name:

upload = client.create_dataset_version_from_upload(dataset_id='YOUR_DATASET_ID', file_format='ZIP')
upload.upload_file(zip_file_io)

When a dataset is connected through a database connector, you can easily access and query the data using the provided methods:

connector_id = "YOUR_CONNECTOR_ID"
sql_query = "SELECT * FROM MY_TABLE LIMIT 5"

result = client.query_database_connector(connector_id, sql_query)

Feature groups are normally generated using SQL unless they are kept in their raw format. Here is how to update that SQL definition:

client.update_feature_group_sql_definition('YOUR_FG_ID', 'SQL')
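For illustration, the second argument is the full SQL definition as a string. A hypothetical definition might look like the following; the table and column names are assumptions, not part of the original example:

```python
# Hypothetical SQL definition aggregating a source table
new_sql = """
SELECT user_id, SUM(amount) AS total_amount
FROM MY_SAMPLE_DATA
GROUP BY user_id
"""

# client.update_feature_group_sql_definition('YOUR_FG_ID', new_sql)
```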

Uploading a Dataset from a Connector

`doc_processing_config` is optional, depending on whether you are loading a document set. Use the code below and adjust it for your application.

# doc_processing_config = abacusai.DatasetDocumentProcessingConfig(
#     extract_bounding_boxes=True,
#     use_full_ocr=False,
#     remove_header_footer=False,
#     remove_watermarks=True,
#     convert_to_markdown=False,
# )

dataset = client.create_dataset_from_file_connector(
    table_name="MY_TABLE_NAME",
    location="azure://my-location:share/whatever/*",
    # refresh_schedule="0 0 * * *",  # Daily refresh at midnight UTC
    # is_documentset=True,  # Only if this is an actual document set (Word documents, PDF files, etc.)
    # extract_bounding_boxes=True,
    # document_processing_config=doc_processing_config,
    # reference_only_documentset=False,
)

Updating a dataset's version using a connector:

client.create_dataset_version_from_file_connector('DATASET_ID')      # For file connectors
client.create_dataset_version_from_database_connector('DATASET_ID')  # For database connectors

Exporting Data to a Connector

You can also export a feature group directly to a connector:

WRITEBACK = 'TABLE_NAME'
MAPPING = {
    'COLUMN_1': 'COLUMN_1',
    'COLUMN_2': 'COLUMN_2',
}

feature_group = client.describe_feature_group_by_table_name("FEATURE_GROUP_NAME")
feature_group.materialize()  # To make sure we have the latest version
feature_group_version = feature_group.latest_feature_group_version.feature_group_version
client.export_feature_group_version_to_database_connector(
    feature_group_version,
    database_connector_id='connector_id',
    object_name=WRITEBACK,
    database_feature_mapping=MAPPING,
    write_mode='insert'
)