This guide will help you get started with the Abacus.AI Python SDK. We create a project under a specific use case, "Personalized Recommendations", and use sample datasets compiled from the publicly available MovieLens dataset to demonstrate how easy it is to get started with the Abacus.AI platform. The same flow applies to any of the use cases we offer: you upload your data, map it into the system to create machine learning features, train your AI/ML model(s), deploy the trained model(s), and generate predictions using the Predict API.
The following steps will allow you to set up your development environment and start using the Abacus.AI platform:
python3 -m pip install abacusai
api_key = 'API_KEY' # replace API_KEY with the generated key
from abacusai import ApiClient
client = ApiClient(api_key)
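To quickly verify that the client is configured correctly, you can make a simple read-only call, for example listing the projects in your organization (a minimal sanity check; consult the API documentation if list_projects differs in your SDK version):
# Sanity check: a read-only call that succeeds only if the API key is valid.
print(client.list_projects())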
For more detailed information on the Abacus.AI Python SDK, you can refer to the API documentation.
Additionally, you can explore various API examples on GitHub to see practical implementations and use cases.
The following steps are required to create a project on the Abacus.AI platform:
client.list_use_cases()
Output:
UseCase(use_case='USER_RECOMMENDATIONS', pretty_name='Personalized Recommendations', description='Increase user engagement and revenue with personalized recommendations on your app/website. Our unique blend of reinforcement learning and deep learning based technology works even when you have little historical data and have to deal with a fast-changing catalog or multiple new users.'),
UseCase(use_case='ENERGY', pretty_name='Real-Time Forecasting', description='Accurately forecast energy or computation usage in real-time. Make downstream planning decisions based on your predictions. We use generative modeling (GANs) to augment your dataset with synthetic data. This unique approach allows us to make accurate predictions in real-time, even when you have little historical data.'),
UseCase(use_case='SALES_FORECASTING', pretty_name='Sales and Revenue Forecasting', description='Forecast sales and revenue across your sales reps, products, business units, and locations. Use deep learning to forecast your sales across multiple dimensions. Make better planning discussions and anticipate future problems so you can mitigate them.'),
...
recommendations_project = client.create_project(name = 'Movie Recommendations', use_case = 'USER_RECOMMENDATIONS')
For this guide, we will use a few small sample datasets that will help us better understand the Python API client and the terminology we use. These datasets are accessible to you, and you can simply download them if you have met all the prerequisites defined earlier:
Movies Dataset (CATALOG_ATTRIBUTES): Download
This dataset contains information about each movie.
Users Dataset (USER_ATTRIBUTES): Download
This dataset contains information about each user.
User-Movies Ratings Dataset (USER_ITEM_INTERACTIONS): Download
This dataset contains all of the user-movie ratings.
Let's quickly check the datasets required for the use case in context:
client.describe_use_case_requirements('USER_RECOMMENDATIONS')
[
UseCaseRequirements(
dataset_type="USER_ITEM_INTERACTIONS",
name="User-Item Interactions",
description="This dataset corresponds to all the user-item interactions on your website or application. For example, all the actions (e.g. click, purchase, view) taken by a particular user on a particular item (e.g product, video. article) recorded as a time-based log.",
required=True,
allowed_feature_mappings={
"ITEM_ID": {
"description": "This is the unique identifier of each item in your catalog. This is typically your product id, article id, or the video id.",
"allowed_feature_types": ["CATEGORICAL"],
"required": True,
},
"USER_ID": {
"description": "This is a unique identifier of each user in your user base.",
"allowed_feature_types": ["CATEGORICAL"],
"required": True,
},
"ACTION_TYPE": {
"description": "This is an optional column that specifies the type of action the user took. This could include any action that is specific to you (e.g., view, click, purchase, rating, comment, like, etc). You can always upload a dataset that has no action_type column if all the actions in the dataset are the same (e.g., a dataset of only purchases or clicks).",
"allowed_feature_types": ["CATEGORICAL"],
"required": False,
},
"TIMESTAMP": {
"description": "The timestamp when a particular action occurred.",
"allowed_feature_types": ["TIMESTAMP"],
"required": False,
},
"ACTION_WEIGHT": {
"description": "This is an optional column that specifies the weight of the action (e.g., video watch time, price of item purchased). This is used to optimize the the model to maximize actions with this value.",
"allowed_feature_types": ["NUMERICAL"],
"required": False,
},
"IGNORE": {
"description": "Ignore this column in training",
"multiple": True,
"required": False,
},
},
allowed_nested_feature_mappings=None,
),
UseCaseRequirements(
dataset_type="CATALOG_ATTRIBUTES",
name="Catalog Attributes",
description="This dataset corresponds to all the information you have in your catalog. If you want to recommend actions instead of items to users, you are welcome to upload an action catalog.",
required=None,
allowed_feature_mappings={
"ITEM_ID": {
"description": "This is a unique identifier of each item in your catalog. This is typically your product id, article id, or video id.",
"allowed_feature_types": ["CATEGORICAL"],
"required": True,
},
"PREDICTION_RESTRICT": {
"description": "This is an optional column that is used to restrict predictions to items matching a specific value of this column. If this is set, then the prediction api call will require that a includeFilter specifying a value for this column be included.",
"allowed_feature_types": ["CATEGORICAL"],
"required": False,
},
"SNAPSHOT_TIME": {
"description": "This is an optional column that is used to indicate when the record was updated. This allows us to provide multiple rows for a single item id and during training, we pick the row with most recent value for this column compared to the interaction timestamp.",
"allowed_feature_types": ["TIMESTAMP"],
"required": False,
},
"ACTION_WEIGHT": {
"description": "This is an optional column that specifies the weight of the item (e.g., average video watch time, average price of item purchased). This is used to do optimization weights at an item level or as a fallback score for unknown items.",
"allowed_feature_types": ["NUMERICAL"],
"required": False,
},
"IGNORE": {
"description": "Ignore this column in training",
"multiple": True,
"required": False,
},
},
allowed_nested_feature_mappings=None,
),
UseCaseRequirements(
dataset_type="USER_ATTRIBUTES",
name="User Attributes",
description="This dataset corresponds to all the attributes or meta-data that you have about your user base. Any user profile information will be relevant here.",
required=None,
allowed_feature_mappings={
"USER_ID": {
"description": "The unique identifier for the user.",
"allowed_feature_types": ["CATEGORICAL"],
"required": True,
},
"SNAPSHOT_TIME": {
"description": "This is an optional column that is used to indicate when the record was updated. This allows us to provide multiple rows for a single user id and during training, we pick the row with most recent value for this column compared to the interaction timestamp.",
"allowed_feature_types": ["TIMESTAMP"],
"required": False,
},
"IGNORE": {
"description": "Ignore this column in training",
"multiple": True,
"required": False,
},
},
allowed_nested_feature_mappings=None,
),
]
As shown in the output block above, for the Personalized Recommendations use case the User-Item Interactions dataset is required, while the other two are optional (although recommended, since they help train a better-scoring model). Each use case comes with a set of use case requirements. Let's break down the output above to understand them properly:
User-Item Interactions Dataset - Required
The associated feature group type is USER_ITEM_INTERACTIONS. This type corresponds to all the user-item interactions on your website or app, for example, all the actions (e.g., click, purchase, view) taken by a particular user on a particular item (e.g., product, video, article), recorded as a time-based log. For each dataset_type we have allowed_feature_mappings. Feature mappings are system-recognizable machine learning features that tell the system how to interpret the data as a trainable ML feature.
In other words, we have created specific feature mappings that help the system map features in your dataset to a standard, meaningful type with respect to the use case in consideration. The following feature mappings are defined for the User-Item Interactions dataset (USER_ITEM_INTERACTIONS) within the Personalized Recommendations use case:
ITEM_ID - Required
This is the unique identifier of each item in your catalog. This is typically your product id, article id or video id.
USER_ID - Required
This is a unique identifier of each user in your user base.
TIMESTAMP - Optional
The timestamp when a particular action associated with an item and a user occurred.
IGNORE - Optional
This will tell the AI engine to ignore this feature during training.
ACTION_TYPE - Optional
This specifies the type of action the user took. This could include any action, e.g., view, click, purchase, rating, comment, like, etc.
ACTION_WEIGHT - Optional
This is an optional column that specifies the weight of the action (e.g., video watch time, price of item purchased). It is used to optimize the model to maximize actions with this value.
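These feature mappings can also be inspected programmatically. Here is a small sketch that iterates over the use case requirements returned by describe_use_case_requirements and prints, for each dataset type, which mappings are required and which are optional (it relies only on the fields shown in the output earlier):
# Print the required/optional feature mappings per dataset type for this use case.
for req in client.describe_use_case_requirements('USER_RECOMMENDATIONS'):
    print(f'{req.dataset_type} (dataset required: {req.required})')
    for mapping, details in (req.allowed_feature_mappings or {}).items():
        status = 'required' if details.get('required') else 'optional'
        print(f'  {mapping}: {status}')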
Example of a User-Item Interactions Dataset:
user_id | movie_id | rating | timestamp |
---|---|---|---|
1 | 1193 | 5 | 978300760 |
1 | 3408 | 4 | 978300275 |
2 | 2268 | 5 | 978299297 |
2 | 3468 | 5 | 978298542 |
The next step is to make sure that the data types for each column in the dataset are what we intend them to be. The system does its best to populate the correct data types, but it is recommended to verify them once. The following API method is used to get the schema of the dataset:
client.get_dataset_schema(dataset_id="14f98413e6")
[
[DatasetColumn(name='user_id',
data_type='STRING',
feature_type='CATEGORICAL',
original_name=None),
DatasetColumn(name='movie_id',
data_type='STRING',
feature_type='CATEGORICAL',
original_name=None),
DatasetColumn(name='rating',
data_type='STRING',
feature_type='CATEGORICAL',
original_name=None),
DatasetColumn(name='timestamp',
data_type='DATETIME',
feature_type='TIMESTAMP',
original_name=None)]
]
For the Personalized Recommendations use case, the USER_ITEM_INTERACTIONS feature group type is the only requirement, but it is recommended to provide more data, for example in the form of catalog attributes and user attributes:
Catalog Data Dataset - Recommended
This dataset corresponds to all the metadata about each item you have in your catalog.
Feature Mappings:
ITEM_ID - Required
This is the unique identifier of each item in your catalog, e.g., product id, article id, video id, etc.
PREDICTION_RESTRICT - Optional
This is an optional column that is used to restrict predictions to items matching a specific value of this column. If this is set, then the prediction API call will require that an includeFilter specifying a value for this column be included.
SNAPSHOT_TIME - Optional
This is an optional column that is used to indicate when the record was updated. This allows us to provide multiple rows for a single item id and during training, we pick the row with most recent value for this column compared to the interaction timestamp.
ACTION_WEIGHT - Optional
This is an optional column that specifies the weight of the item (e.g., average video watch time, average price of item purchased). This is used to do optimization weights at an item level or as a fallback score for unknown items.
IGNORE - Optional
This mapping will tell the AI engine to ignore this feature during training.
Example of a Catalog Dataset:
movie_id | movie | genres |
---|---|---|
1 | Toy Story (1995) | Animation|Children's|Comedy |
2 | Jumanji (1995) | Adventure| Children's|Fantasy |
3 | Grumpier Old Men (1995) | Comedy|Romance |
User Attribute Dataset - Recommended
This dataset corresponds to all the metadata about each user you have in your dataset.
Feature Mappings:
USER_ID - Required
This is the unique identifier for each user.
SNAPSHOT_TIME - Optional
This is an optional column that is used to indicate when the record was updated. This allows us to provide multiple rows for a single user id, and during training we pick the row with the most recent value for this column compared to the interaction timestamp.
IGNORE - Optional
This mapping tells the AI engine to ignore this feature during training.
Example of a User Attribute Dataset:
user_id | gender | age | occupation | zip_code |
---|---|---|---|---|
1 | F | Under 18 | K-12 student | 48067 |
2 | M | 56+ | self-employed | 70072 |
3 | F | 25-34 | scientist | 55117 |
Now that you are clear about the data requirements, you can add all these datasets to a project by telling Abacus.AI where to find the data and then creating datasets from it. We have uploaded the same sample datasets we provided earlier to our AWS S3 bucket. We will point to our cloud storage location and create datasets as follows:
user_item_dataset = client.create_dataset_from_file_connector(
table_name='User_Item_Recommendations',
location='s3://abacusai-exampledatasets/user_recommendations/user_movie_ratings.csv',
refresh_schedule='0 12 * * *'
)
movie_attributes_dataset = client.create_dataset_from_file_connector(
table_name='Movie_Attributes',
location='s3://abacusai-exampledatasets/user_recommendations/movies_metadata.csv',
refresh_schedule='0 12 * * *'
)
user_attributes_dataset = client.create_dataset_from_file_connector(
table_name='User_Attributes',
location='s3://abacusai-exampledatasets/user_recommendations/users_metadata.csv',
refresh_schedule='0 12 * * *'
)
Using the Create Dataset API method, you tell Abacus.AI the public S3 URI where the datasets can be found. You can also give each dataset a Refresh Schedule, which tells Abacus.AI when it should refresh the dataset (take an updated/latest copy of the data). The Refresh Schedule is given as a cron string. For example, with "0 12 * * *" the dataset is re-read from S3 at 12pm UTC every day, so that no updates are missed. If you're unfamiliar with cron syntax, Crontab Guru can help translate the syntax back into natural language.
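For example, if you only wanted to refresh a dataset once a week, you could pass a weekly cron expression instead. The following is a minimal sketch reusing the same call as above (the table name is hypothetical, used only for illustration):
weekly_dataset = client.create_dataset_from_file_connector(
    table_name='User_Item_Recommendations_Weekly',  # hypothetical table name for illustration
    location='s3://abacusai-exampledatasets/user_recommendations/user_movie_ratings.csv',
    refresh_schedule='0 6 * * 1'  # re-read the file every Monday at 06:00 UTC
)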
If you would like to output the schema of the attached datasets, you could use the following API method:
datasets = [user_item_dataset, movie_attributes_dataset, user_attributes_dataset]
for dataset in datasets:
    print(f'{dataset.name} Schema:')
    print(client.get_dataset_schema(dataset.dataset_id))
User Item Recommendations Schema:
[DatasetColumn(name='user_id', data_type='STRING', feature_type='CATEGORICAL', original_name=None), DatasetColumn(name='movie_id', data_type='STRING', feature_type='CATEGORICAL', original_name=None), DatasetColumn(name='rating', data_type='STRING', feature_type='CATEGORICAL', original_name=None), DatasetColumn(name='timestamp', data_type='DATETIME', feature_type='TIMESTAMP', original_name=None)]
Movie Attributes Schema:
[DatasetColumn(name='movie_id', data_type='STRING', feature_type='CATEGORICAL', original_name=None), DatasetColumn(name='movie', data_type='STRING', feature_type='CATEGORICAL', original_name=None), DatasetColumn(name='genres', data_type='STRING', feature_type='CATEGORICAL_LIST', original_name=None)]
User Attributes Schema:
[DatasetColumn(name='user_id', data_type='STRING', feature_type='CATEGORICAL', original_name=None), DatasetColumn(name='gender', data_type='STRING', feature_type='CATEGORICAL', original_name=None), DatasetColumn(name='age', data_type='STRING', feature_type='CATEGORICAL', original_name=None), DatasetColumn(name='occupation', data_type='STRING', feature_type='CATEGORICAL', original_name=None), DatasetColumn(name='zip_code', data_type='STRING', feature_type='CATEGORICAL', original_name=None)]
Once the required and recommended datasets are attached, we can start mapping the dataset columns to system-recognizable feature mappings, a process we can call ML feature creation. These machine learning features are then used to train your machine learning model. In other words, every column in the dataset can be viewed as an ML feature, and the dataset itself is termed a Feature Group. Therefore, a feature group is a collection of ML features that is used to train ML models. Note that we have predefined feature group types for each use case. You can create a feature group from the uploaded dataset as follows:
user_item_iteration_fg = client.create_feature_group(table_name='personalized_recommendations', sql='SELECT * from User_Item_Recommendations')
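Because feature groups are defined with SQL, you are not limited to a simple SELECT *. The following is a hedged sketch (the join and the column selection are illustrative only, not required for this use case) showing how you could enrich the interactions with catalog attributes in a single feature group:
# Illustrative only: join the interactions with the movie catalog to expose genres as a feature.
enriched_fg = client.create_feature_group(
    table_name='personalized_recommendations_enriched',  # hypothetical table name
    sql='''
        SELECT r.user_id, r.movie_id, r.rating, r.timestamp, m.genres
        FROM User_Item_Recommendations r
        JOIN Movie_Attributes m ON r.movie_id = m.movie_id
    '''
)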
The datasets you upload to Abacus.AI exist at the organization level. When you use them to create feature groups, the feature groups also exist at the organization level. So, you need to attach the feature group to your project and then set the feature group type to one of the predefined feature group types required or recommended for the use case under which the project was created:
client.add_feature_group_to_project(feature_group_id = user_item_iteration_fg.feature_group_id, project_id = recommendations_project.project_id)
client.set_feature_group_type(feature_group_id = user_item_iteration_fg.feature_group_id, project_id = recommendations_project.project_id, feature_group_type= "USER_ITEM_INTERACTIONS")
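If you also created the catalog and user attribute datasets, you can repeat the same two steps for them. This is an optional, hedged sketch; the feature group table names below are hypothetical:
# Optional: create, attach, and type the catalog and user attribute feature groups.
movie_attributes_fg = client.create_feature_group(table_name='movie_catalog_attributes', sql='SELECT * FROM Movie_Attributes')
client.add_feature_group_to_project(feature_group_id=movie_attributes_fg.feature_group_id, project_id=recommendations_project.project_id)
client.set_feature_group_type(feature_group_id=movie_attributes_fg.feature_group_id, project_id=recommendations_project.project_id, feature_group_type='CATALOG_ATTRIBUTES')
user_attributes_fg = client.create_feature_group(table_name='user_profile_attributes', sql='SELECT * FROM User_Attributes')
client.add_feature_group_to_project(feature_group_id=user_attributes_fg.feature_group_id, project_id=recommendations_project.project_id)
client.set_feature_group_type(feature_group_id=user_attributes_fg.feature_group_id, project_id=recommendations_project.project_id, feature_group_type='USER_ATTRIBUTES')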
We used the set_feature_group_type method to set the feature group type to the required type "USER_ITEM_INTERACTIONS" for our Personalized Recommendations use case project. Similarly, you can set the USER_ATTRIBUTES and CATALOG_ATTRIBUTES feature group types if you have the required data (as sketched above). The next step is to map the features within these feature groups to the predefined feature mappings that are specific to the selected use case. Using the describe_use_case_requirements method in our Python SDK, you can find the list of available feature mappings for the selected use case:
client.describe_use_case_requirements('USER_RECOMMENDATIONS')[0].allowed_feature_mappings
{
"ACTION_TYPE":{
"allowed_feature_types":[
"CATEGORICAL"
],
"description":"This is an optional column that specifies the type of action the user took. This could include any action that is specific to you (e.g., view, click, purchase, rating, comment, like, etc). You can always upload a dataset that has no action_type column if all the actions in the dataset are the same (e.g., a dataset of only purchases or clicks).",
"required":false
},
"ACTION_WEIGHT":{
"allowed_feature_types":[
"NUMERICAL"
],
"description":"This is an optional column that specifies the weight of the action (e.g., video watch time, price of item purchased). This is used to optimize the the model to maximize actions with this value.",
"required":false
},
"IGNORE":{
"description":"Ignore this column in training",
"multiple":true,
"required":false
},
"ITEM_ID":{
"allowed_feature_types":[
"CATEGORICAL"
],
"description":"This is the unique identifier of each item in your catalog. This is typically your product id, article id, or the video id.",
"required":true
},
"TIMESTAMP":{
"allowed_feature_types":[
"TIMESTAMP"
],
"description":"The timestamp when a particular action occurred.",
"required":false
},
"USER_ID":{
"allowed_feature_types":[
"CATEGORICAL"
],
"description":"This is a unique identifier of each user in your user base.",
"required":true
}
}
Setting the system required feature mappings:
client.set_feature_mapping(project_id = recommendations_project.project_id, feature_group_id= user_item_iteration_fg.feature_group_id, feature_name='movie_id', feature_mapping='ITEM_ID')
client.set_feature_mapping(project_id = recommendations_project.project_id, feature_group_id= user_item_iteration_fg.feature_group_id, feature_name='user_id', feature_mapping='USER_ID')
client.set_feature_mapping(project_id = recommendations_project.project_id, feature_group_id= user_item_iteration_fg.feature_group_id, feature_name='timestamp', feature_mapping='TIMESTAMP')
Output:
[
Feature(name='user_id', select_clause=None, feature_mapping='USER_ID', source_table='User_Item_Recommendations', original_name=None, using_clause=None, order_clause=None, where_clause=None, feature_type='CATEGORICAL', data_type='STRING', columns=None, point_in_time_info=None),
Feature(name='movie_id', select_clause=None, feature_mapping='ITEM_ID', source_table='User_Item_Recommendations', original_name=None, using_clause=None, order_clause=None, where_clause=None, feature_type='CATEGORICAL', data_type='STRING', columns=None, point_in_time_info=None),
Feature(name='rating', select_clause=None, feature_mapping=None, source_table='User_Item_Recommendations', original_name=None, using_clause=None, order_clause=None, where_clause=None, feature_type='CATEGORICAL', data_type='STRING', columns=None, point_in_time_info=None),
Feature(name='timestamp', select_clause=None, feature_mapping='TIMESTAMP', source_table='User_Item_Recommendations', original_name=None, using_clause=None, order_clause=None, where_clause=None, feature_type='TIMESTAMP', data_type='DATETIME', columns=None, point_in_time_info=None)
]
For each required Feature Group Type within the use case, you must assign the feature group to be used for training the model:
client.use_feature_group_for_training(project_id=recommendations_project.project_id, feature_group_id=user_item_iteration_fg.feature_group_id)
This marks the end of our feature engineering phase, and we are ready to move forward to training our ML model.
To make sure that we have met all the feature group requirements, let's call the validate method:
recommendations_project.validate()
ProjectValidation(valid=True, dataset_errors=[], column_hints={})
Let's kick off the training process:
recommendations_model = recommendations_project.train_model()
After training has started, you can make this blocking call, which continually checks the status of the model until it is done training and evaluating. The call returns once training completes:
recommendations_model.wait_for_evaluation()
The next step after model training is model evaluation. In this step, you can utilize the metric scores to get a solid idea of the quality of the trained model:
recommendations_model.get_metrics()
ModelMetrics(model_id='d71099eb6', model_version='168b96aebe', metrics={'ndcg': 0.3294146047860978, 'ndcg@5': 0.2521274992804733, 'ndcg@10': 0.28182587426741196, 'map': 0.06031499420301521, 'map@5': 0.08525364298724954, 'map@10': 0.06937885376524622, 'mrr': 0.246175726745602, 'personalization@10': 0.9656656221842672, 'coverage': 0.4647608264930654}, baseline_metrics=None, target_column=None)
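Since the metrics attribute is a plain dictionary, you can pull out the scores you care about, for example the ranking metrics at different cutoffs (a small illustrative snippet based on the output above):
# Print a few selected metrics from the trained model.
metrics = recommendations_model.get_metrics().metrics
for name in ('ndcg', 'ndcg@5', 'ndcg@10', 'map@10', 'coverage'):
    print(f'{name}: {metrics[name]:.4f}')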
Now that we have a trained model, let's create a deployment to be able to use the model to perform predictions on the test data the system automatically kept aside for you:
recommendations_deployment = client.create_deployment(name='Personalized Recommendations Deployment', model_id=recommendations_model.model_id)
recommendations_deployment.wait_for_deployment()
The wait_for_deployment call finishes executing once the model is successfully deployed. Each deployment created for a trained model makes it available for prediction requests through the specified deployment variable. Next, a deployment token is created to authenticate requests against the created deployment(s). This token is only authorized to predict on deployments in this project, so it's safe to embed it inside an application or website:
deployment_token = recommendations_project.create_deployment_token().deployment_token
Now that you have an active deployment authorized through a deployment token, you can use the use-case-specific Prediction API method to request predictions. In this example, the system recommends a list of movies for a user with a given user id, using the Prediction API method get_recommendations as follows:
client.get_recommendations(deployment_token=deployment_token, deployment_id=recommendations_deployment.deployment_id, query_data={"user_id": "1107"})
[
{'movie_id': '1208'},
{'movie_id': '2858'},
{'movie_id': '1196'},
{'movie_id': '150'},
{'movie_id': '1230'},
{'movie_id': '2020'}
]
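Depending on your SDK version, get_recommendations also accepts additional arguments to control the response, such as the number of items to return and the result page. The parameter names below (num_items, page) are assumptions, so verify them against the Prediction API documentation:
# Hedged sketch: request a longer, paginated list of recommendations for the same user.
recs = client.get_recommendations(
    deployment_token=deployment_token,
    deployment_id=recommendations_deployment.deployment_id,
    query_data={'user_id': '1107'},
    num_items=10,  # assumed parameter name
    page=1         # assumed parameter name
)
print(recs)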