Abacus.AI allows you to connect your storage buckets (from GCP, AWS, or Azure) and train models using data read directly from them. Once the storage is verified, the following steps can be used to read data from your bucket or blob:
After creating a new project and selecting the use case, proceed to the "Datasets" tab and click on the "Create Dataset" button.
Then, click on the "Create New" button.
Enter a name for the dataset, select the appropriate type of data, and click on "Continue".
Select "Import from External Service" as shown in the picture below and choose "Google Cloud Storage", "AWS S3", "Azure", or "SFTP" as the File Services option. Enter the URI of the dataset under the "Cloud Location" option. If you want to upload multiple dataset files with the same schema as a single combined dataset, you can use wildcard expansion; for example, s3://example/bucket/*.csv will read all of the CSV files under s3://example/bucket and upload them as one combined CSV file. Finally, click on the "Add Dataset" button to upload the dataset into your project.
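The wildcard behavior can be pictured with a small Python sketch. This is not Abacus.AI's implementation; the object keys and CSV contents below are made up for illustration, and only standard-library modules are used:

```python
import csv
import io
from fnmatch import fnmatch

# Simulated bucket listing: object key -> CSV text (illustrative data only).
objects = {
    "s3://example/bucket/jan.csv": "id,amount\n1,10\n2,20\n",
    "s3://example/bucket/feb.csv": "id,amount\n3,30\n",
    "s3://example/bucket/readme.txt": "not a csv",
}

pattern = "s3://example/bucket/*.csv"
combined = []
for key, text in sorted(objects.items()):
    if not fnmatch(key, pattern):
        continue  # skip objects that don't match the wildcard
    rows = list(csv.reader(io.StringIO(text)))
    # Keep the header row only once; the files are assumed to share a schema.
    combined.extend(rows if not combined else rows[1:])

print(combined)
```

Only the `.csv` objects are matched, and their rows are concatenated under a single header, mirroring how the matched files end up as one combined dataset file.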
In some cases you may want to add the filename as the value of a new column for every record in the dataset. To do this, simply enter a column name under the "Filename Column" option; a new column with that name will be created, with the file name as the value in every row. For example, if the file is named example_filename and the entered column name is example_column, then the column "example_column" will have the value "example_filename" for every record in the dataset.
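The effect of the "Filename Column" option can be sketched in a few lines of Python. This is an illustration of the resulting table shape, not the platform's code; the filename and CSV contents are invented:

```python
import csv
import io

# Illustrative input: a file named "example_filename" with made-up contents.
filename = "example_filename"
data = "id,amount\n1,10\n2,20\n"

reader = csv.DictReader(io.StringIO(data))
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["example_column"])
writer.writeheader()
for row in reader:
    # Every record gets the source file's name in the new column.
    row["example_column"] = filename
    writer.writerow(row)

print(out.getvalue())
```

Each output row carries "example_filename" in the "example_column" field, which is useful for tracing records back to their source file after several files are combined.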
If you know when your data gets updated, you can set a refresh schedule to update your data automatically from the provided bucket location at a specific time. For example, the cron expression 5 4 * * 6 under the "Set Refresh Schedule UTC (optional)" option will update the dataset at 04:05 UTC every Saturday. To generate cron strings, please visit https://crontab.guru/. Finally, click on the "Add Dataset" button to finish adding the data to the project.
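To see why 5 4 * * 6 means 04:05 every Saturday, here is a minimal checker for the five cron fields (minute, hour, day of month, month, day of week). It is a simplified sketch that handles only `*` and single numbers, not ranges, lists, or steps:

```python
from datetime import datetime

def field_matches(field: str, value: int) -> bool:
    """True if a single cron field ('*' or a plain number) matches value."""
    return field == "*" or int(field) == value

def cron_matches(expr: str, dt: datetime) -> bool:
    """Check a five-field cron expression against a datetime.
    Cron weekdays run 0=Sunday .. 6=Saturday."""
    minute, hour, dom, month, dow = expr.split()
    # Python's weekday() has Monday=0, so shift to cron's Sunday=0 convention.
    cron_dow = (dt.weekday() + 1) % 7
    return (field_matches(minute, dt.minute)
            and field_matches(hour, dt.hour)
            and field_matches(dom, dt.day)
            and field_matches(month, dt.month)
            and field_matches(dow, cron_dow))

# 2024-01-06 was a Saturday; 2024-01-07 was a Sunday.
print(cron_matches("5 4 * * 6", datetime(2024, 1, 6, 4, 5)))  # True
print(cron_matches("5 4 * * 6", datetime(2024, 1, 7, 4, 5)))  # False
```

Real cron implementations (and sites like crontab.guru) also support ranges such as 1-5, lists such as 1,15, and steps such as */10 in each field.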
Once the upload completes, the dataset will be shown as "Active" under the Datasets tab.