DuckLake

Connect & ingest data to and from a DuckLake database

DuckLake is a data lake format specification that combines the power of DuckDB with flexible catalog backends and scalable data storage. It provides versioned, ACID-compliant tables with support for multiple catalog databases and various storage backends. See https://ducklake.select/ for more details.

Setup

The following credentials keys are accepted:

  • catalog_type (required) -> The catalog database type: duckdb, sqlite, postgres, or mysql. Default is duckdb.

  • catalog_conn_string (required) -> The connection string for the catalog database. See the DuckLake documentation for details. Examples:

    • postgres: dbname=ducklake_catalog host=localhost

    • sqlite: metadata.sqlite

    • duckdb: metadata.ducklake

    • mysql: db=ducklake_catalog host=localhost

  • catalog_schema (optional) -> The schema to use to store catalog tables (e.g. public).

  • data_path (optional) -> Path to data files (local, S3, Azure, GCS). e.g. /local/path, s3://bucket/data, r2://bucket/data, az://container/data, gs://bucket/data.

  • data_inlining_limit (optional) -> The row limit to use for DuckLake's DATA_INLINING_ROW_LIMIT option.

  • copy_method (optional, v1.4.20+) -> The method used to copy data from Sling into DuckDB. Acceptable values: csv_http, arrow_http. Default is arrow_http. If you face issues with Arrow, try setting csv_http.

  • schema (optional) -> The default schema to use to read/write data. Default is main.

  • url_style (optional) -> Set to path to use path-style URLs (e.g. when using MinIO).

  • use_ssl (optional) -> Set to false to disable HTTPS (e.g. when using MinIO).

  • duckdb_version (optional) -> The DuckDB CLI version to use. You can also set the DUCKDB_VERSION environment variable.

  • max_buffer_size (optional, v1.5+) -> The maximum buffer size to use when piping data from the DuckDB CLI. Set this if your dataset contains extremely large single-line text values. Default is 10485760 (10MB).
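The catalog keys above can also be set directly from the command line. A minimal sketch using the sling conns set command (the connection name, catalog file, and data path are illustrative):

```shell
# define a DuckLake connection with a local SQLite catalog
# and a local data directory (values are examples only)
sling conns set DUCKLAKE type=ducklake catalog_type=sqlite catalog_conn_string=metadata.sqlite data_path=./ducklake_data
```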

Storage Configuration

For S3/S3-compatible storage:

  • s3_access_key_id (optional) -> AWS access key ID

  • s3_secret_access_key (optional) -> AWS secret access key

  • s3_session_token (optional) -> AWS session token

  • s3_region (optional) -> AWS region

  • s3_profile (optional) -> AWS profile to use

  • s3_endpoint (optional) -> S3-compatible endpoint URL (e.g. localhost:9000 for MinIO)

For Azure Blob Storage:

  • azure_account_name (optional) -> Azure storage account name

  • azure_account_key (optional) -> Azure storage account key

  • azure_sas_token (optional) -> Azure SAS token

  • azure_tenant_id (optional) -> Azure tenant ID

  • azure_client_id (optional) -> Azure client ID

  • azure_client_secret (optional) -> Azure client secret

  • azure_connection_string (optional) -> Azure connection string

For Google Cloud Storage:

  • gcs_key_file (optional) -> Path to GCS service account key file

  • gcs_project_id (optional) -> GCS project ID

  • gcs_access_key_id (optional) -> GCS HMAC access key ID

  • gcs_secret_access_key (optional) -> GCS HMAC secret access key
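The storage keys combine with the catalog keys in a single connection entry. A sketch of an env.yaml entry for a Postgres catalog with data on a MinIO (S3-compatible) bucket; all names, hosts, and credentials below are illustrative:

```yaml
# env.yaml sketch (illustrative values): DuckLake catalog in Postgres,
# data files on a MinIO bucket, using path-style URLs without HTTPS
connections:
  DUCKLAKE_MINIO:
    type: ducklake
    catalog_type: postgres
    catalog_conn_string: dbname=ducklake_catalog host=localhost
    data_path: s3://ducklake-bucket/data
    s3_access_key_id: minioadmin
    s3_secret_access_key: minioadmin
    s3_endpoint: localhost:9000
    url_style: path
    use_ssl: false
```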

Using sling conns

Here are examples of setting a connection named DUCKLAKE. We must provide the type=ducklake property:

Environment Variable

See here to learn more about the .env.sling file.
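A sketch of defining the connection inline as an environment variable; the YAML payload mirrors the credential keys documented above, and the catalog file and data path are illustrative:

```shell
# define the DUCKLAKE connection as an env variable
# (illustrative values; any of the keys above may be included)
export DUCKLAKE='{type: ducklake, catalog_type: duckdb, catalog_conn_string: metadata.ducklake, data_path: /local/path}'
```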

Sling Env File YAML

See here to learn more about the sling env.yaml file.
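A minimal env.yaml sketch with a DuckDB catalog and a local data path (values are illustrative):

```yaml
connections:
  DUCKLAKE:
    type: ducklake
    catalog_type: duckdb
    catalog_conn_string: metadata.ducklake
    data_path: /local/path
```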

Common Usage Examples

Basic Operations
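Once the connection is defined, it can be verified and explored from the CLI. A sketch (the connection name DUCKLAKE is from the examples above):

```shell
# test connectivity to the DuckLake catalog and storage
sling conns test DUCKLAKE

# list available schemas and tables
sling conns discover DUCKLAKE
```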

Data Import/Export
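A sketch of loading data into, and exporting data out of, a DuckLake table. The file paths, table name, and source connection are illustrative:

```shell
# load a local CSV file into a DuckLake table
sling run --src-stream file://./users.csv --tgt-conn DUCKLAKE --tgt-object main.users

# export a DuckLake table back to a local CSV file
sling run --src-conn DUCKLAKE --src-stream main.users --tgt-object file://./users_export.csv
```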

If you are facing issues connecting, please reach out to us at [email protected], on Discord, or open a GitHub Issue.
