DuckLake

Connect & ingest data from/to a DuckLake database

DuckLake is a data lake format specification that combines the power of DuckDB with flexible catalog backends and scalable data storage. It provides versioned, ACID-compliant tables with support for multiple catalog databases and various storage backends. See https://ducklake.select/ for more details.

Setup

The following credentials keys are accepted:

  • catalog_type (required) -> The catalog database type: duckdb, sqlite, postgres, or mysql. Default is duckdb.

  • catalog_conn_string (required) -> The connection string for the catalog database. Examples:

    • postgres: dbname=ducklake_catalog host=localhost

    • sqlite: metadata.sqlite

    • duckdb: metadata.ducklake

    • mysql: db=ducklake_catalog host=localhost

  • catalog_schema (optional) -> The schema in which to store catalog tables (e.g. public).

  • data_path (optional) -> Path to data files (local, S3, Azure, GCS). e.g. /local/path, s3://bucket/data, r2://bucket/data, az://container/data, gs://bucket/data.

  • data_inlining_limit (optional) -> The row limit to use for DATA_INLINING_ROW_LIMIT.

  • copy_method (optional v1.4.20+) -> The method used to copy data from Sling into DuckDB. Accepted values: csv_http, arrow_http. Default is arrow_http. If you encounter issues with arrow, try setting csv_http.

  • schema (optional) -> The default schema to use when reading/writing data. Default is main.

  • url_style (optional) -> Set to path to use path-style URLs (e.g. when using MinIO).

  • use_ssl (optional) -> Set to false to disable HTTPS (e.g. when using MinIO).

  • duckdb_version (optional) -> The DuckDB CLI version to use. You can also set the DUCKDB_VERSION environment variable.

  • max_buffer_size (optional v1.5) -> The maximum buffer size to use when piping data from the DuckDB CLI. Specify this if your dataset contains extremely large single-line text values. Default is 10485760 (10 MB).
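For instance, a minimal local setup (DuckDB catalog, local data path) might combine these keys as follows; the connection name and paths are illustrative placeholders:

```yaml
connections:
  DUCKLAKE:
    type: ducklake
    catalog_type: duckdb                    # default; catalog stored in a local DuckDB file
    catalog_conn_string: metadata.ducklake
    data_path: /local/path/data             # where table data files are written
```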

Storage Configuration

For S3/S3-compatible storage:

  • s3_access_key_id (optional) -> AWS access key ID

  • s3_secret_access_key (optional) -> AWS secret access key

  • s3_session_token (optional) -> AWS session token

  • s3_region (optional) -> AWS region

  • s3_profile (optional) -> AWS profile to use

  • s3_endpoint (optional) -> S3-compatible endpoint URL (e.g. localhost:9000 for MinIO)
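As a sketch, a DuckLake connection backed by a local MinIO instance might combine these keys as follows (bucket, credentials, and endpoint are placeholders):

```yaml
connections:
  DUCKLAKE_MINIO:
    type: ducklake
    catalog_type: sqlite
    catalog_conn_string: metadata.sqlite
    data_path: s3://my-bucket/data
    s3_access_key_id: minioadmin       # placeholder credentials
    s3_secret_access_key: minioadmin
    s3_endpoint: localhost:9000        # MinIO endpoint
    url_style: path                    # MinIO requires path-style URLs
    use_ssl: false                     # plain HTTP for a local MinIO
```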

For Azure Blob Storage:

  • azure_account_name (optional) -> Azure storage account name

  • azure_account_key (optional) -> Azure storage account key

  • azure_sas_token (optional) -> Azure SAS token

  • azure_tenant_id (optional) -> Azure tenant ID

  • azure_client_id (optional) -> Azure client ID

  • azure_client_secret (optional) -> Azure client secret

  • azure_connection_string (optional) -> Azure connection string
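For example, a connection storing data in Azure Blob Storage might look like this sketch (account name, key, and container are placeholders):

```yaml
connections:
  DUCKLAKE_AZURE:
    type: ducklake
    catalog_type: postgres
    catalog_conn_string: dbname=ducklake_catalog host=localhost
    data_path: az://my-container/data
    azure_account_name: mystorageaccount   # placeholder
    azure_account_key: <account-key>       # placeholder
```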

For Google Cloud Storage:

  • gcs_key_file (optional) -> Path to GCS service account key file

  • gcs_project_id (optional) -> GCS project ID

  • gcs_access_key_id (optional) -> GCS HMAC access key ID

  • gcs_secret_access_key (optional) -> GCS HMAC secret access key
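Similarly, a GCS-backed connection might be sketched as follows (key file path, project, and bucket are placeholders):

```yaml
connections:
  DUCKLAKE_GCS:
    type: ducklake
    catalog_type: sqlite
    catalog_conn_string: metadata.sqlite
    data_path: gs://my-bucket/data
    gcs_key_file: /path/to/service-account.json   # placeholder
    gcs_project_id: my-project                    # placeholder
```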

Using sling conns

Here are examples of setting up a connection named DUCKLAKE. We must provide the type=ducklake property:

Environment Variable
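For example, the connection can be declared as an environment variable holding the connection properties as an inline YAML/JSON payload (catalog and path values are placeholders):

```bash
export DUCKLAKE='{type: ducklake, catalog_type: sqlite, catalog_conn_string: metadata.sqlite, data_path: s3://my-bucket/data}'
```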

Sling Env File YAML
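Alternatively, the same connection can be declared under the connections key of the env.yaml file; here is a sketch using a PostgreSQL catalog (host and database names are placeholders):

```yaml
connections:
  DUCKLAKE:
    type: ducklake
    catalog_type: postgres
    catalog_conn_string: dbname=ducklake_catalog host=localhost
    catalog_schema: public
    data_path: s3://my-bucket/data
```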

Refer to the Sling environment documentation to learn more about the env.yaml file.

Common Usage Examples

Basic Operations
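Once the connection is defined, it can be verified and explored with the sling conns subcommands (the connection name assumes the setup above):

```bash
# list defined connections
sling conns list

# test connectivity to the DuckLake catalog and storage
sling conns test DUCKLAKE

# discover available schemas and tables
sling conns discover DUCKLAKE
```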

Data Import/Export
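As a sketch, typical load and export runs might look like the following (source connection, stream, and object names are placeholders):

```bash
# load a local CSV file into a DuckLake table
sling run --src-stream file://./users.csv --tgt-conn DUCKLAKE --tgt-object main.users

# copy a table from Postgres into DuckLake (full refresh)
sling run --src-conn POSTGRES --src-stream public.users --tgt-conn DUCKLAKE --tgt-object main.users --mode full-refresh

# export a DuckLake table to a local CSV file
sling run --src-conn DUCKLAKE --src-stream main.users --tgt-object file://./users_export.csv
```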

If you are facing issues connecting, please reach out to us at [email protected], on Discord, or open a GitHub Issue.
