Creates a dataset from a file located in cloud storage, such as Amazon S3, using the specified dataset name and location.
Arguments:
| REQUIRED | KEY | TYPE | DESCRIPTION |
| --- | --- | --- | --- |
| Yes | tableName | str | Organization-unique table name, or the name of the feature group table to create using the source table. |
| Yes | location | str | The URI location of the dataset source. When `location_date_format` is specified, the location must match it; for example, `s3://bucket1/dir1/dir2/event_date=YYYY-MM-DD/*`. When both `start_prefix` and `until_prefix` are specified, the location must include both; for example, `s3://bucket1/dir1/*` includes both `s3://bucket1/dir1/dir2/event_date=2021-08-02/*` and `s3://bucket1/dir1/dir2/event_date=2021-08-08/*`. |
| No | fileFormat | str | The file format of the dataset. |
| No | refreshSchedule | str | A cron-style string that describes the schedule for retrieving the latest version of the imported dataset. The time is specified in UTC. |
| No | csvDelimiter | str | If the file format is CSV, use this specific delimiter. |
| No | filenameColumn | str | Adds a new column to the dataset with the external URI path. |
| No | startPrefix | str | The start prefix (inclusive) for a range-based search on a cloud storage location URI. |
| No | untilPrefix | str | The end prefix (exclusive) for a range-based search on a cloud storage location URI. |
| No | sqlQuery | str | The SQL query to use when fetching data from the specified location. Use `__TABLE__` as a placeholder for the table name; for example, `SELECT * FROM __TABLE__ WHERE event_date > '2021-01-01'`. If not provided, the entire dataset from the specified location is imported. |
| No | locationDateFormat | str | The date format in which the data is partitioned in the cloud storage location. For example, if the data is partitioned as `s3://bucket1/dir1/dir2/event_date=YYYY-MM-DD/dir4/filename.parquet`, the `location_date_format` is `YYYY-MM-DD`. This format must be consistent across all files within the specified location. |
| No | dateFormatLookbackDays | int | The number of days to look back from the current day for date-partitioned import locations. For example, import date 2021-06-04 with `date_format_lookback_days` = 3 retrieves data for all dates in the range [2021-06-02, 2021-06-04]. |
| No | incremental | bool | Signifies whether the dataset is an incremental dataset. |
| No | isDocumentset | bool | Signifies whether the dataset is a docstore dataset. A docstore dataset contains documents such as images, PDFs, and audio files, or tabular data with links to such files. |
| No | extractBoundingBoxes | bool | Signifies whether to extract bounding boxes from the documents. Only valid if `is_documentset` is True. |
| No | versionLimit | int | The number of recent versions to preserve for the dataset (minimum 30). |
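The interaction between the date-partition arguments can be sketched with a small helper. This is illustrative only: the helper and its expansion logic are assumptions about how a date-partitioned location combines with the documented lookback example, not part of the API.

```python
from datetime import date, timedelta

def expand_date_prefixes(location, lookback_days, today):
    """Expand a date-partitioned location into one concrete prefix per day,
    covering the inclusive lookback window ending on `today`."""
    prefixes = []
    for offset in range(lookback_days - 1, -1, -1):
        day = today - timedelta(days=offset)
        # The literal YYYY-MM-DD token mirrors the documented
        # `location_date_format` example.
        prefixes.append(location.replace("YYYY-MM-DD", day.isoformat()))
    return prefixes

# Documented example: import date 2021-06-04 with a 3-day lookback
# covers the date range [2021-06-02, 2021-06-04].
prefixes = expand_date_prefixes(
    "s3://bucket1/dir1/dir2/event_date=YYYY-MM-DD/*", 3, date(2021, 6, 4)
)
```

Note that the window is inclusive on both ends, which is why a lookback of 3 yields exactly three daily prefixes.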
Note: The arguments for the API methods follow camelCase, but the Python SDK uses underscore_case.
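The naming note above can be illustrated with a small conversion helper (a sketch; this function is not part of either the API or the SDK):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each interior capital, then lowercase:
    # tableName -> table_name, locationDateFormat -> location_date_format
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# camelCase API keys map to the Python SDK's keyword arguments:
api_keys = ["tableName", "location", "locationDateFormat", "dateFormatLookbackDays"]
sdk_kwargs = [camel_to_snake(k) for k in api_keys]
```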
Response:
| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| success | Boolean | `true` if the call succeeded, `false` if there was an error. |
| result | Dataset | The dataset, described by the fields below. |
| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| datasetId | str | The unique identifier of the dataset. |
| sourceType | str | The source of the dataset: EXTERNAL_SERVICE, UPLOAD, or STREAMING. |
| dataSource | str | Location of the data. It may be a URI, such as an S3 bucket, or a database table. |
| createdAt | str | The timestamp at which this dataset was created. |
| ignoreBefore | str | The timestamp before which all events are ignored when training. |
| ephemeral | bool | Whether the dataset is ephemeral and not used for training. |
| lookbackDays | int | Specific to streaming datasets; how many days' worth of data to include when generating a snapshot. A value of 0 leaves this selection to the system. |
| databaseConnectorId | str | The database connector used. |
| databaseConnectorConfig | dict | The database connector query used to retrieve data. |
| connectorType | str | The type of connector used to get this dataset: FILE or DATABASE. |
| featureGroupTableName | str | The table name of the dataset's feature group. |
| applicationConnectorId | str | The application connector used. |
| applicationConnectorConfig | dict | The application connector query used to retrieve data. |
| incremental | bool | Whether the dataset is an incremental dataset. |
| isDocumentset | bool | Whether the dataset is a documentset. |
| extractBoundingBoxes | bool | Whether bounding boxes are extracted from the documents. Only valid if `is_documentset` is True. |
| mergeFileSchemas | bool | Whether the merge file schemas policy is enabled. |
| referenceOnlyDocumentset | bool | Whether only the data reference is saved. Only valid if `is_documentset` is True. |
| versionLimit | int | Version limit for the dataset. |
`latestDatasetVersion` (DatasetVersion): The latest version of this dataset.

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| datasetVersion | str | The unique identifier of the dataset version. |
| status | str | The current status of the dataset version. |
| datasetId | str | A reference to the Dataset this dataset version belongs to. |
| size | int | The size in bytes of the file. |
| rowCount | int | Number of rows in the dataset version. |
| fileInspectMetadata | dict | Metadata from the file's inspection; for example, the detected delimiter for CSV files. |
| createdAt | str | The timestamp at which this dataset version was created. |
| error | str | If status is FAILED, this field is populated with an error message. |
| incrementalQueriedAt | str | If the dataset version is from an incremental dataset, the last entry of the timestamp column when the dataset version was created. |
| uploadId | str | If the dataset version is being uploaded, a reference to the Upload. |
| mergeFileSchemas | bool | Whether the merge file schemas policy is enabled. |
| databaseConnectorConfig | dict | The database connector query used to retrieve data for this version. |
| applicationConnectorConfig | dict | The application connector config used to retrieve data for this version. |
| invalidRecords | str | Invalid records in the dataset version. |
`schema` (list of DatasetColumn): List of resolved columns.

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| name | str | The unique name of the column. |
| dataType | str | The underlying data type of the column. |
| detectedDataType | str | The detected data type of the column. |
| featureType | str | The feature type of the column. |
| detectedFeatureType | str | The detected feature type of the column. |
| originalName | str | The original name of the column. |
| validDataTypes | List[str] | The valid data type options for this column. |
| timeFormat | str | The detected time format of the column. |
| timestampFrequency | str | The detected frequency of the timestamps in the dataset. |
`refreshSchedules` (list of RefreshSchedule): List of schedules that determine when the next version of the dataset will be created.

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| refreshPolicyId | str | The unique identifier of the refresh policy. |
| nextRunTime | str | The next run time of the refresh policy. If null, the policy is paused. |
| cron | str | A cron-style string, in UTC, that describes when this refresh policy is to be executed. |
| refreshType | str | The type of refresh that will be run. |
| error | str | An error message from the last pipeline run of the policy. |
`parsingConfig` (ParsingConfig): The parsing config used for the dataset.

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| csvDelimiter | str | Delimiter for CSV files. Defaults to None. |
| escape | str | Escape character for CSV files. Defaults to `"`. |
| filePathWithSchema | str | Path to the file with the schema. Defaults to None. |
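The effect of a custom CSV delimiter can be previewed locally with Python's standard `csv` module. This sketches the general parsing behavior, not the service's actual parser:

```python
import csv
import io

# A pipe-delimited row where the quoted middle field itself
# contains the delimiter character.
raw = 'id|"a|b"|3\n'
reader = csv.reader(io.StringIO(raw), delimiter="|", quotechar='"')
row = next(reader)
# The quoted "a|b" survives as a single field rather than being split.
```

Choosing a delimiter that does not appear in unquoted data avoids relying on quoting behavior entirely.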
`documentProcessingConfig` (DocumentProcessingConfig): The document processing config used for the dataset (when `is_documentset` is True).

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| documentType | DocumentType | Type of document. Can be one of Text, Tables and Forms, Embedded Images, etc. If not specified, the type is decided automatically. |
| convertToMarkdown | bool | Whether to convert extracted text to Markdown. Defaults to False. Only takes effect when `extract_bounding_boxes` is True. |
| removeWatermarks | bool | Whether to remove watermarks. By default, decided automatically based on the OCR mode and the document type. Only takes effect when `extract_bounding_boxes` is True. |
| highlightRelevantText | bool | Whether to extract bounding boxes and highlight relevant text in search results. Defaults to False. |
| maskPii | bool | Whether to mask personally identifiable information (PII) in the document text/tokens. Defaults to False. |
| useFullOcr | bool | Whether to perform full OCR. If True, OCR is performed on the full page; if False, only on the non-text regions. By default, decided automatically based on the OCR mode and the document type. Only takes effect when `extract_bounding_boxes` is True. |
| extractBoundingBoxes | bool | Whether to perform OCR and extract bounding boxes. If False, no OCR is done; only embedded text from digital documents is extracted. Defaults to False. |
| ocrMode | OcrMode | OCR mode. Different OCR modes are available for different kinds of documents and use cases. Only takes effect when `extract_bounding_boxes` is True. |
| removeHeaderFooter | bool | Whether to remove headers and footers. Defaults to False. Only takes effect when `extract_bounding_boxes` is True. |
`attachmentParsingConfig` (AttachmentParsingConfig): The attachment parsing config used for the dataset (e.g. for Salesforce attachment parsing).

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| columnName | str | Column name. |
| featureGroupName | str | Feature group name. |
| urls | str | List of URLs. |
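Putting the response schema together, a caller might unpack it as follows. The payload below is a hypothetical example shaped after the documented fields, not real service output:

```python
# Hypothetical response shaped after the documented schema;
# all values are illustrative placeholders.
response = {
    "success": True,
    "result": {
        "datasetId": "example-id",
        "sourceType": "EXTERNAL_SERVICE",
        "dataSource": "s3://bucket1/dir1/*",
        "latestDatasetVersion": {"status": "PENDING", "rowCount": None},
    },
}

if response["success"]:
    dataset = response["result"]
    # The latest version's status indicates import progress.
    summary = f'{dataset["datasetId"]}: {dataset["latestDatasetVersion"]["status"]}'
```

When `success` is false, the `result` fields should not be relied upon; check the error information instead.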
Exceptions:
| TYPE | WHEN |
| --- | --- |
| InvalidEnumParameterError | An invalid value is passed for `fileFormat`. |
| InvalidParameterError | The location is not a valid cloud location URI, or the start and end prefixes are invalid. |
| PermissionDeniedError | The location has not been verified with Abacus.AI. |