Creates a dataset from a file located in cloud storage, such as Amazon S3, using the specified dataset name and location.
Arguments:
| REQUIRED | KEY | TYPE | DESCRIPTION |
| --- | --- | --- | --- |
| Yes | tableName | str | Organization-unique table name, or the name of the feature group table to create using the source table. |
| Yes | location | str | The URI location of the dataset source. When `location_date_format` is specified, the location must match it; for example, `s3://bucket1/dir1/dir2/event_date=YYYY-MM-DD/*`. When both `start_prefix` and `until_prefix` are specified, the location must include both; for example, `s3://bucket1/dir1/*` includes both `s3://bucket1/dir1/dir2/event_date=2021-08-02/*` and `s3://bucket1/dir1/dir2/event_date=2021-08-08/*`. |
| No | fileFormat | str | The file format of the dataset. |
| No | refreshSchedule | str | A cron-style string that describes the schedule for retrieving the latest version of the imported dataset. The time is specified in UTC. |
| No | csvDelimiter | str | If the file format is CSV, use this specific delimiter. |
| No | filenameColumn | str | Adds a new column to the dataset with the external URI path. |
| No | startPrefix | str | The start prefix (inclusive) for a range-based search on a cloud storage location URI. |
| No | untilPrefix | str | The end prefix (exclusive) for a range-based search on a cloud storage location URI. |
| No | sqlQuery | str | The SQL query to use when fetching data from the specified location. Use `__TABLE__` as a placeholder for the table name; for example, `SELECT * FROM __TABLE__ WHERE event_date > '2021-01-01'`. If not provided, the entire dataset from the specified location is imported. |
| No | locationDateFormat | str | The date format in which the data is partitioned in the cloud storage location. For example, if the data is partitioned as `s3://bucket1/dir1/dir2/event_date=YYYY-MM-DD/dir4/filename.parquet`, the `location_date_format` is `YYYY-MM-DD`. This format must be consistent across all files within the specified location. |
| No | dateFormatLookbackDays | int | The number of days to look back from the current day for date-partitioned import locations. For example, import date 2021-06-04 with `date_format_lookback_days` = 3 retrieves data for all dates in the range [2021-06-02, 2021-06-04]. |
| No | incremental | bool | Signifies whether the dataset is an incremental dataset. |
| No | isDocumentset | bool | Signifies whether the dataset is a docstore dataset. A docstore dataset contains documents such as images, PDFs, and audio files, or tabular data with links to such files. |
| No | extractBoundingBoxes | bool | Signifies whether to extract bounding boxes from the documents. Only valid if `is_documentset` is True. |
| No | versionLimit | int | The number of recent versions to preserve for the dataset (minimum 30). |
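The interaction between the date-partition arguments can be sketched with a small helper. This is illustrative only: the helper and its expansion logic are assumptions about how a date-partitioned location combines with the documented lookback example, not part of the API.

```python
from datetime import date, timedelta

def expand_date_prefixes(location, lookback_days, today):
    """Expand a date-partitioned location into one concrete prefix per day,
    covering the inclusive lookback window ending on `today`."""
    prefixes = []
    for offset in range(lookback_days - 1, -1, -1):
        day = today - timedelta(days=offset)
        # The literal YYYY-MM-DD token mirrors the documented
        # `location_date_format` example.
        prefixes.append(location.replace("YYYY-MM-DD", day.isoformat()))
    return prefixes

# Documented example: import date 2021-06-04 with a 3-day lookback
# covers the date range [2021-06-02, 2021-06-04].
prefixes = expand_date_prefixes(
    "s3://bucket1/dir1/dir2/event_date=YYYY-MM-DD/*", 3, date(2021, 6, 4)
)
```

Note that the window is inclusive on both ends, which is why a lookback of 3 yields exactly three daily prefixes.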
Note: The arguments for the API methods follow camelCase, but the Python SDK uses underscore_case.
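The naming note above can be illustrated with a small conversion helper (a sketch; this function is not part of either the API or the SDK):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each interior capital, then lowercase:
    # tableName -> table_name, locationDateFormat -> location_date_format
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# camelCase API keys map to the Python SDK's keyword arguments:
api_keys = ["tableName", "location", "locationDateFormat", "dateFormatLookbackDays"]
sdk_kwargs = [camel_to_snake(k) for k in api_keys]
```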
Response:
| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| success | Boolean | `true` if the call succeeded, `false` if there was an error. |
| result | Dataset | The dataset, described by the fields below. |
| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| datasetId | str | The unique identifier of the dataset. |
| sourceType | str | The source of the dataset: EXTERNAL_SERVICE, UPLOAD, or STREAMING. |
| dataSource | str | Location of the data. It may be a URI, such as an S3 bucket, or a database table. |
| createdAt | str | The timestamp at which this dataset was created. |
| ignoreBefore | str | The timestamp before which all events are ignored when training. |
| ephemeral | bool | Whether the dataset is ephemeral and not used for training. |
| lookbackDays | int | Specific to streaming datasets; how many days' worth of data to include when generating a snapshot. A value of 0 leaves this selection to the system. |
| databaseConnectorId | str | The database connector used. |
| databaseConnectorConfig | dict | The database connector query used to retrieve data. |
| connectorType | str | The type of connector used to get this dataset: FILE or DATABASE. |
| featureGroupTableName | str | The table name of the dataset's feature group. |
| applicationConnectorId | str | The application connector used. |
| applicationConnectorConfig | dict | The application connector query used to retrieve data. |
| incremental | bool | Whether the dataset is an incremental dataset. |
| isDocumentset | bool | Whether the dataset is a documentset. |
| extractBoundingBoxes | bool | Whether bounding boxes are extracted from the documents. Only valid if `is_documentset` is True. |
| mergeFileSchemas | bool | Whether the merge file schemas policy is enabled. |
| referenceOnlyDocumentset | bool | Whether only the data reference is saved. Only valid if `is_documentset` is True. |
| versionLimit | int | Version limit for the dataset. |
`latestDatasetVersion` (DatasetVersion): The latest version of this dataset.

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| datasetVersion | str | The unique identifier of the dataset version. |
| status | str | The current status of the dataset version. |
| datasetId | str | A reference to the Dataset this dataset version belongs to. |
| size | int | The size in bytes of the file. |
| rowCount | int | Number of rows in the dataset version. |
| fileInspectMetadata | dict | Metadata from the file's inspection; for example, the detected delimiter for CSV files. |
| createdAt | str | The timestamp at which this dataset version was created. |
| error | str | If status is FAILED, this field is populated with an error message. |
| incrementalQueriedAt | str | If the dataset version is from an incremental dataset, the last entry of the timestamp column when the dataset version was created. |
| uploadId | str | If the dataset version is being uploaded, a reference to the Upload. |
| mergeFileSchemas | bool | Whether the merge file schemas policy is enabled. |
| databaseConnectorConfig | dict | The database connector query used to retrieve data for this version. |
| applicationConnectorConfig | dict | The application connector config used to retrieve data for this version. |
| invalidRecords | str | Invalid records in the dataset version. |
`schema` (list of DatasetColumn): List of resolved columns.

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| name | str | The unique name of the column. |
| dataType | str | The underlying data type of the column. |
| detectedDataType | str | The detected data type of the column. |
| featureType | str | The feature type of the column. |
| detectedFeatureType | str | The detected feature type of the column. |
| originalName | str | The original name of the column. |
| validDataTypes | List[str] | The valid data type options for this column. |
| timeFormat | str | The detected time format of the column. |
| timestampFrequency | str | The detected frequency of the timestamps in the dataset. |
`refreshSchedules` (list of RefreshSchedule): List of schedules that determine when the next version of the dataset will be created.

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| refreshPolicyId | str | The unique identifier of the refresh policy. |
| nextRunTime | str | The next run time of the refresh policy. If null, the policy is paused. |
| cron | str | A cron-style string, in UTC, that describes when this refresh policy is to be executed. |
| refreshType | str | The type of refresh that will be run. |
| error | str | An error message from the last pipeline run of the policy. |
`parsingConfig` (ParsingConfig): The parsing config used for the dataset.

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| csvDelimiter | str | Delimiter for CSV files. Defaults to None. |
| escape | str | Escape character for CSV files. Defaults to `"`. |
| filePathWithSchema | str | Path to the file with the schema. Defaults to None. |
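The effect of a custom CSV delimiter can be previewed locally with Python's standard `csv` module. This sketches the general parsing behavior, not the service's actual parser:

```python
import csv
import io

# A pipe-delimited row where the quoted middle field itself
# contains the delimiter character.
raw = 'id|"a|b"|3\n'
reader = csv.reader(io.StringIO(raw), delimiter="|", quotechar='"')
row = next(reader)
# The quoted "a|b" survives as a single field rather than being split.
```

Choosing a delimiter that does not appear in unquoted data avoids relying on quoting behavior entirely.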
`documentProcessingConfig` (DocumentProcessingConfig): The document processing config used for the dataset (when `is_documentset` is True).

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| documentType | DocumentType | Type of document. Can be one of Text, Tables and Forms, Embedded Images, etc. If not specified, the type is decided automatically. |
| convertToMarkdown | bool | Whether to convert extracted text to Markdown. Defaults to False. Only takes effect when `extract_bounding_boxes` is True. |
| removeWatermarks | bool | Whether to remove watermarks. By default, decided automatically based on the OCR mode and the document type. Only takes effect when `extract_bounding_boxes` is True. |
| highlightRelevantText | bool | Whether to extract bounding boxes and highlight relevant text in search results. Defaults to False. |
| maskPii | bool | Whether to mask personally identifiable information (PII) in the document text/tokens. Defaults to False. |
| useFullOcr | bool | Whether to perform full OCR. If True, OCR is performed on the full page; if False, only on the non-text regions. By default, decided automatically based on the OCR mode and the document type. Only takes effect when `extract_bounding_boxes` is True. |
| extractBoundingBoxes | bool | Whether to perform OCR and extract bounding boxes. If False, no OCR is done; only embedded text from digital documents is extracted. Defaults to False. |
| ocrMode | OcrMode | OCR mode. Different OCR modes are available for different kinds of documents and use cases. Only takes effect when `extract_bounding_boxes` is True. |
| removeHeaderFooter | bool | Whether to remove headers and footers. Defaults to False. Only takes effect when `extract_bounding_boxes` is True. |
`attachmentParsingConfig` (AttachmentParsingConfig): The attachment parsing config used for the dataset (e.g. for Salesforce attachment parsing).

| KEY | TYPE | DESCRIPTION |
| --- | --- | --- |
| columnName | str | Column name. |
| featureGroupName | str | Feature group name. |
| urls | str | List of URLs. |
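Putting the response schema together, a caller might unpack it as follows. The payload below is a hypothetical example shaped after the documented fields, not real service output:

```python
# Hypothetical response shaped after the documented schema;
# all values are illustrative placeholders.
response = {
    "success": True,
    "result": {
        "datasetId": "example-id",
        "sourceType": "EXTERNAL_SERVICE",
        "dataSource": "s3://bucket1/dir1/*",
        "latestDatasetVersion": {"status": "PENDING", "rowCount": None},
    },
}

if response["success"]:
    dataset = response["result"]
    # The latest version's status indicates import progress.
    summary = f'{dataset["datasetId"]}: {dataset["latestDatasetVersion"]["status"]}'
```

When `success` is false, the `result` fields should not be relied upon; check the error information instead.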
Exceptions:
| TYPE | WHEN |
| --- | --- |
| InvalidEnumParameterError | An invalid value is passed for `fileFormat`. |
| InvalidParameterError | The location is not a valid cloud location URI, or the start and end prefixes are invalid. |
| PermissionDeniedError | The location has not been verified with Abacus.AI. |