Feature Group & Dataset Operations
The objective of this tutorial is to familiarize you with Dataset and Feature Group manipulation using the Abacus client.
- Datasets: raw data uploaded to the platform.
- Feature Groups: processed data derived from Datasets, which can be used for model training and evaluation.
To start with, let's initialize our client:
import abacusai
client = abacusai.ApiClient("YOUR_API_KEY")
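If you don't know your project ID, you can list the projects available to your API key and pick it from there:
# Look up the IDs and names of projects accessible with this API key
for project in client.list_projects():
    print(project.project_id, project.name)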
Manipulating a Project
Now, let's describe our project and list the models trained under it:
# Gets information about the project based on the ID.
project = client.describe_project(project_id="YOUR_PROJECT_ID")
# A list of all models trained under the project
models = client.list_models(project_id="YOUR_PROJECT_ID")
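To inspect what came back, you can print a few attributes (a minimal sketch; `name` and `model_id` are assumed attribute names on the returned objects):
print(project.name)
for model in models:
    print(model.model_id, model.name)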
We can also load a feature group locally on our machine:
# Loads a specific version of a feature group
fg1 = client.describe_feature_group_version("FEATURE_GROUP_VERSION")
df1 = fg1.load_as_pandas()

# Fetches the feature group by its table name
fg2 = client.describe_feature_group_by_table_name("FEATURE_GROUP_NAME")
# Loads the latest version as a pandas dataframe
df2 = fg2.load_as_pandas()
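Both calls return ordinary pandas dataframes, so the usual inspection tools apply:
# Standard pandas inspection of the loaded feature group
print(df2.shape)
print(df2.dtypes)
print(df2.head())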
You can also add a feature group to a project programmatically:
client.add_feature_group_to_project(
    feature_group_id='FEATURE_GROUP_ID',
    project_id='PROJECT_ID',
    feature_group_type='CUSTOM_TABLE'  # You can set this to 'DOCUMENTS' if this is a document set
)
Uploading a Dataset From a Local File
To create a Dataset from a local file:
import io

zip_filename = 'sample_data_folder.zip'
with open(zip_filename, 'rb') as f:
    zip_file_content = f.read()
zip_file_io = io.BytesIO(zip_file_content)

# If the ZIP contains unstructured text documents (PDF, Word, etc.), set is_documentset=True
upload = client.create_dataset_from_upload(table_name='MY_SAMPLE_DATA', file_format='ZIP', is_documentset=False)
upload.upload_file(zip_file_io)
Following up on the above, you can also update this dataset from a local file; a new version will be created with the same ID and name:
upload = client.create_dataset_version_from_upload(dataset_id='YOUR_DATASET_ID', file_format='ZIP')
upload.upload_file(zip_file_io)
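Uploads are processed asynchronously, so you may want to block until the new version is ready. A minimal polling sketch, assuming the latest dataset version exposes a `status` field that ends up as 'COMPLETE' or 'FAILED' (verify the field names against your client version):
import time

dataset = client.describe_dataset('YOUR_DATASET_ID')
while dataset.latest_dataset_version.status not in ('COMPLETE', 'FAILED'):
    time.sleep(15)  # Re-poll every 15 seconds
    dataset = client.describe_dataset('YOUR_DATASET_ID')
print(dataset.latest_dataset_version.status)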
When a dataset is connected through a database connector, you can query the underlying data directly:
connector_id = "YOUR_CONNECTOR_ID"
sql_query = "SELECT * FROM YOUR_TABLE LIMIT 5"
result = client.query_database_connector(connector_id, sql_query)
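The exact shape of `result` depends on the connector, but assuming it comes back as a list of row records, you can load it into pandas for further work:
import pandas as pd

# Assumes `result` is a list of dict-like rows; verify for your connector
df = pd.DataFrame(result)
print(df.head())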
Feature groups are normally generated using SQL unless they are kept in their raw format. Here is how to alter that SQL:
client.update_feature_group_sql_definition('YOUR_FG_ID', 'YOUR_NEW_SQL')
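For example, you could redefine a feature group as an aggregation (the table and column names below are made up for illustration):
new_sql = """
SELECT user_id, SUM(amount) AS total_spend
FROM TRANSACTIONS
GROUP BY user_id
"""
client.update_feature_group_sql_definition('YOUR_FG_ID', new_sql)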
Uploading a Dataset From a Connector
`doc_processing_config` is optional, depending on whether you want to load a document set. Use the code below and adjust it for your application.
# doc_processing_config = abacusai.DatasetDocumentProcessingConfig(
#     extract_bounding_boxes=True,
#     use_full_ocr=False,
#     remove_header_footer=False,
#     remove_watermarks=True,
#     convert_to_markdown=False,
# )
dataset = client.create_dataset_from_file_connector(
    table_name="MY_TABLE_NAME",
    location="azure://my-location:share/whatever/*",
    # refresh_schedule="0 0 * * *",  # Daily refresh at midnight UTC
    # is_documentset=True,  # Only if this is an actual document set (Word documents, PDF files, etc.)
    # extract_bounding_boxes=True,
    # document_processing_config=doc_processing_config,
    # reference_only_documentset=False,
)
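Database connectors work analogously; here is a sketch assuming `create_dataset_from_database_connector` takes the connector ID and the source object name (check the exact parameters in your client version):
dataset = client.create_dataset_from_database_connector(
    table_name="MY_DB_TABLE",  # Hypothetical name for illustration
    database_connector_id="YOUR_CONNECTOR_ID",
    object_name="SOURCE_TABLE",  # Table inside the connected database
    # refresh_schedule="0 0 * * *",  # Optional, same cron format as above
)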
Updating the dataset version using a connector:
client.create_dataset_version_from_file_connector('DATASET_ID')  # For file connectors
client.create_dataset_version_from_database_connector('DATASET_ID')  # For database connectors
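Once a new dataset version lands, downstream feature groups need to be re-materialized to pick up the fresh data; `materialize()` (also used in the export section below) triggers that:
feature_group = client.describe_feature_group_by_table_name('FEATURE_GROUP_NAME')
feature_group.materialize()  # Rebuilds the feature group on top of the new dataset version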
Exporting Data to a Connector
You can also export a feature group directly to a connector:
WRITEBACK = 'TABLE_NAME'
MAPPING = {
    'COLUMN_1': 'COLUMN_1',
    'COLUMN_2': 'COLUMN_2',
}
feature_group = client.describe_feature_group_by_table_name("FEATURE_GROUP_NAME")
feature_group.materialize()  # To make sure we have the latest version
feature_group_version = feature_group.latest_feature_group_version.feature_group_version
client.export_feature_group_version_to_database_connector(
    feature_group_version,
    database_connector_id='connector_id',
    object_name=WRITEBACK,
    database_feature_mapping=MAPPING,
    write_mode='insert'
)
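Exporting to a file connector follows the same pattern; a sketch assuming the `export_feature_group_version_to_file_connector` method, with an illustrative destination and format:
client.export_feature_group_version_to_file_connector(
    feature_group_version,
    location='s3://my-bucket/exports/',  # Hypothetical destination path
    export_file_format='CSV'
)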