Feature Groups
Learning Objectives
- Feature Groups Overview: Understanding the concept of feature groups.
- Data Preprocessing: Techniques for preprocessing data using feature groups.
What is a Feature Group?
A feature group is a versioned and managed data abstraction in Abacus.AI that is automatically created when you:
- Upload data to the platform
- Apply data transformations
- Generate predictions using a model
Feature groups provide a structured way to manage and version your data throughout the machine learning lifecycle.
How Feature Groups are Created
Automatic Creation from Datasets
When you manually upload a dataset (e.g., "bike_sharing"), the platform automatically creates a feature group with an identical name. This feature group serves as the foundation for further data transformations.
Steps:
- Upload your dataset to Abacus.AI
- The platform automatically creates a corresponding feature group
- The feature group inherits the name and structure of your dataset
Creating New Feature Groups
You can create new feature groups using either SQL or Python transformations on existing feature groups.
Using SQL
Steps:
- Navigate to the Feature Groups screen
- Click "Add Feature Group" in the top right corner
- Select "SQL" as your transformation method
- Provide a name for your new feature group
- Write your SQL query (e.g., SELECT * FROM bike_sharing_new)
- Save and materialize the feature group
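As a rough illustration of what a SQL-defined feature group does, the sketch below runs a transformation query against an in-memory SQLite table that stands in for a bike_sharing feature group. SQLite is only a local stand-in, not the platform: the column names (season, temp, cnt) and the derived table name bike_sharing_warm are assumptions made for this example, and the SQL dialect Abacus.AI accepts may differ.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for the platform (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bike_sharing (season INTEGER, temp REAL, cnt INTEGER)")
conn.executemany(
    "INSERT INTO bike_sharing VALUES (?, ?, ?)",
    [(1, 0.24, 16), (1, 0.22, 40), (3, 0.72, 210), (3, 0.70, 185)],
)

# The SELECT statement plays the role of the SQL entered for the new
# feature group. Table and column names are hypothetical.
TRANSFORM_SQL = """
    SELECT season, temp, cnt
    FROM bike_sharing
    WHERE temp > 0.5
"""

# "Materializing" in this sketch just means running the query and storing the result.
conn.execute(f"CREATE TABLE bike_sharing_warm AS {TRANSFORM_SQL}")
warm_rows = conn.execute("SELECT * FROM bike_sharing_warm ORDER BY cnt").fetchall()
print(warm_rows)
```

The derived table can itself be queried by a later transformation, which is the same pattern as one feature group building on another.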
Using Python
Steps:
- Navigate to the Feature Groups screen
- Click "Add Feature Group" in the top right corner
- Select "Python" as your transformation method
- Write your Python transformation code
- Save and materialize the feature group
Considerations:
- SQL can be as complex as needed
- You can reference multiple feature groups simultaneously
- Feature groups can depend on one another, with one feature group's code building on the output of another
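The considerations above (several inputs, one transformation building on another) can be sketched in plain Python. This section does not show the function signature Abacus.AI expects for Python feature groups, so the sketch models each table as a list of dicts instead of calling any platform API. The function names, the stations table, and all column names are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def add_temp_bucket(bike_sharing: List[Row]) -> List[Row]:
    """First transformation: derive a categorical column from a numeric one."""
    out = []
    for row in bike_sharing:
        new_row = dict(row)  # copy so the input table is left unchanged
        new_row["temp_bucket"] = "warm" if row["temp"] > 0.5 else "cold"
        out.append(new_row)
    return out


def join_station_info(bucketed: List[Row], stations: List[Row]) -> List[Row]:
    """Second transformation: builds on the first and reads a second input table."""
    city_by_station = {s["station_id"]: s["city"] for s in stations}
    return [
        {**row, "city": city_by_station.get(row["station_id"])}
        for row in bucketed
    ]


# Hypothetical sample data for the two input tables.
bike_sharing = [
    {"station_id": 1, "temp": 0.24, "cnt": 16},
    {"station_id": 2, "temp": 0.72, "cnt": 210},
]
stations = [{"station_id": 1, "city": "Oslo"}, {"station_id": 2, "city": "Lima"}]

bucketed = add_temp_bucket(bike_sharing)          # depends on bike_sharing
enriched = join_station_info(bucketed, stations)  # depends on bucketed and stations
print(enriched)
```

Keeping each step as a small function that returns a new table mirrors the dependency structure: the second step only needs the output of the first, not its code.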
Understanding Feature Group Versions
Feature groups maintain multiple versions for tracking changes over time. You can view all versions by scrolling to the bottom of the feature group details page.
Why Multiple Versions Exist
There are two main reasons for version changes:
- Code Changes: When users modify the SQL or Python transformation code
- Upstream Data Changes: When the source data that feeds into the feature group is updated
How Versioning Works:
- Abacus.AI automatically detects changes in code or upstream data
- The platform prompts you to rematerialize the feature group
- Rematerialization uses the latest version of both your code and data
- You don't need to manually manage versioning; just keep your code and data up to date
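To make the two triggers concrete, here is a toy model of the versioning rule described above: a new version is warranted when either the transformation code or the upstream data version differs from what was last materialized. This is a conceptual sketch only. The class, its fields, and the hashing scheme are invented for illustration and say nothing about how Abacus.AI detects changes internally.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


def fingerprint(code: str, upstream_version: int) -> str:
    """Combine code and upstream data version into a single comparable value."""
    payload = f"{upstream_version}\n{code}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class FeatureGroupModel:
    """Hypothetical stand-in for a feature group and its version history."""
    code: str
    upstream_version: int
    versions: List[str] = field(default_factory=list)

    def needs_rematerialization(self) -> bool:
        # Stale if never materialized, or if code or upstream data changed since.
        current = fingerprint(self.code, self.upstream_version)
        return not self.versions or self.versions[-1] != current

    def materialize(self) -> int:
        # Record a new version built from the latest code and data.
        self.versions.append(fingerprint(self.code, self.upstream_version))
        return len(self.versions)


fg = FeatureGroupModel(code="SELECT * FROM bike_sharing", upstream_version=1)
fg.materialize()
stale_when_unchanged = fg.needs_rematerialization()

fg.code = "SELECT season, cnt FROM bike_sharing"  # code change
stale_after_code_change = fg.needs_rematerialization()
fg.materialize()

fg.upstream_version = 2                           # upstream data change
stale_after_data_change = fg.needs_rematerialization()
fg.materialize()

print(stale_when_unchanged, stale_after_code_change, stale_after_data_change)
print(len(fg.versions))
```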
Feature Group Pages and Functionality
Features Tab
The Features tab displays all columns in your feature group table.
What You'll See:
- Column names
- Feature types (numerical, categorical, text, etc.)
- Feature mappings (discussed in separate documentation)
Purpose:
- Understand the structure of your data
- Verify column types are correct
- Review feature mappings for model training
Explore Tab
The Explore tab provides aggregate statistics for each feature in your feature group.
What You'll See:
- Statistical summaries for each column
- Distribution information
- Data quality metrics
Use Cases:
- Find outliers in your data
- Verify data loaded correctly
- Perform basic sanity checks
- Understand data distributions
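For readers who want to reproduce this kind of check outside the UI, the sketch below computes a few per-column summaries and flags outliers with the common 1.5 x IQR rule. The choice of statistics and the outlier rule are assumptions made for this example; the section does not specify which metrics the Explore tab computes. The sample values are made up.

```python
import statistics
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, object]:
    """Basic aggregate statistics plus IQR-based outlier detection for one column."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": median,
        # Values outside the 1.5 * IQR fences are treated as outliers.
        "outliers": [v for v in values if v < lower or v > upper],
    }


# Hypothetical "cnt" column with one suspicious value.
cnt_column = [16, 40, 160, 170, 185, 195, 210, 5000]
summary = summarize(cnt_column)
print(summary)
```

A value flagged this way is a prompt to look at the source data, not proof of an error; legitimate extreme values also fall outside the fences.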
Materialized Data Tab
The Materialized Data tab allows you to view and query the actual data in your feature group.
Capabilities:
- View data row by row
- Run custom SQL queries
- Use the text interface for query assistance
How to Query:
- Navigate to the Materialized Data tab
- Type your SQL query in the query editor
- Alternatively, use the text interface for help constructing queries
- Execute and view results
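A typical ad hoc query is a per-group sanity check. The sketch below runs one against an in-memory SQLite table so it can be tried locally. SQLite stands in for the platform's query editor, and the table contents and column names (season, temp, cnt) are assumptions for illustration; the query editor's SQL dialect may not match SQLite's.

```python
import sqlite3

# Local stand-in for materialized feature group data (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bike_sharing (season INTEGER, temp REAL, cnt INTEGER)")
conn.executemany(
    "INSERT INTO bike_sharing VALUES (?, ?, ?)",
    [(1, 0.24, 16), (1, None, 40), (3, 0.72, 210), (3, 0.70, 185)],
)

# Row counts, missing values, and an average per season.
QUERY = """
    SELECT season,
           COUNT(*)            AS row_count,
           SUM(temp IS NULL)   AS missing_temp,
           AVG(cnt)            AS avg_cnt
    FROM bike_sharing
    GROUP BY season
    ORDER BY season
"""
results = conn.execute(QUERY).fetchall()
for row in results:
    print(row)
```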
Feature Group Lineage
Feature groups can have complex dependencies where one feature group builds upon another. The platform visualizes these relationships through feature group lineage.
Example Dependency Chain:
- Original dataset → Feature Group A
- Feature Group A → Feature Group B (with transformations)
- Feature Group B → Feature Group C (with additional transformations)
Benefits:
- Track data flow through your pipeline
- Understand dependencies between feature groups
- Debug issues by tracing back through the lineage
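The example chain above can be modeled as a small dependency graph. The sketch below uses Python's standard graphlib module to derive an order in which the nodes could be materialized, and a helper that traces everything upstream of a given feature group, which is the operation you would perform when debugging. The node names and the upstream_of helper are hypothetical; this is a local model, not the platform's lineage API.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Map each node to the set of nodes it directly depends on.
lineage: Dict[str, Set[str]] = {
    "feature_group_a": {"original_dataset"},
    "feature_group_b": {"feature_group_a"},
    "feature_group_c": {"feature_group_b"},
}


def upstream_of(node: str, graph: Dict[str, Set[str]]) -> List[str]:
    """Return every direct and indirect dependency of `node`."""
    seen: Set[str] = set()
    order: List[str] = []
    stack = list(graph.get(node, ()))
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        order.append(parent)
        stack.extend(graph.get(parent, ()))
    return order


# Dependencies come before the nodes that use them.
materialization_order = list(TopologicalSorter(lineage).static_order())
ancestors_of_c = upstream_of("feature_group_c", lineage)

print(materialization_order)
print(ancestors_of_c)
```

If Feature Group C looks wrong, the ancestor list gives the set of places to inspect, starting with its immediate input.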