DuckLake

Connect & Ingest data from / to a DuckLake database

DuckLake is a data lake format specification that combines the power of DuckDB with flexible catalog backends and scalable data storage. It provides versioned, ACID-compliant tables with support for multiple catalog databases and various storage backends. See https://ducklake.select/ for more details.

Setup

The following credentials keys are accepted:

  • catalog_type (required) -> The catalog database type: duckdb, sqlite, postgres, or mysql. Default is duckdb.

  • catalog_conn_string (required) -> The connection string for the catalog database. See here for details. Examples:

    • postgres: dbname=ducklake_catalog host=localhost

    • sqlite: metadata.sqlite

    • duckdb: metadata.ducklake

    • mysql: db=ducklake_catalog host=localhost

  • catalog_schema (optional) -> The schema in which to store the catalog tables (e.g. public).

  • data_path (optional) -> Path to data files (local, S3, Azure, GCS). e.g. /local/path, s3://bucket/data, r2://bucket/data, az://container/data, gs://bucket/data.

  • data_inlining_limit (optional) -> The row limit to use for DATA_INLINING_ROW_LIMIT.

  • copy_method (optional v1.4.20+) -> The method used to copy data from Sling into DuckDB. Acceptable values: csv_http, arrow_http. Default is arrow_http. If you face issues with arrow, try setting csv_http.

  • schema (optional) -> The default schema used to read/write data. Default is main.

  • url_style (optional) -> specify path to use path-style URLs (e.g. when using MinIO).

  • use_ssl (optional) -> specify false to disable HTTPS (e.g. when using MinIO).

  • duckdb_version (optional) -> The CLI version of DuckDB to use. You can also set the environment variable DUCKDB_VERSION.

  • max_buffer_size (optional v1.5) -> The maximum buffer size to use when piping data from the DuckDB CLI. Specify this if your dataset contains extremely large single-line text values. Default is 10485760 (10 MB).
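
As an illustration, optional keys are passed the same way as the required ones when defining a connection. Below is a minimal sketch; the schema, copy_method and data_inlining_limit values are placeholders:

# hypothetical example setting a few optional keys
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake schema=analytics copy_method=csv_http data_inlining_limit=100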

Storage Configuration

For S3/S3-compatible storage:

  • s3_access_key_id (optional) -> AWS access key ID

  • s3_secret_access_key (optional) -> AWS secret access key

  • s3_session_token (optional) -> AWS session token

  • s3_region (optional) -> AWS region

  • s3_profile (optional) -> AWS profile to use

  • s3_endpoint (optional) -> S3-compatible endpoint URL (e.g. localhost:9000 for MinIO)
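
For instance, a MinIO-backed connection might look like the following (a sketch only; the bucket, endpoint and credentials are placeholders):

# hypothetical MinIO example (path-style URLs, no HTTPS)
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake data_path=s3://my-bucket/data s3_access_key_id=minioadmin s3_secret_access_key=minioadmin s3_endpoint=localhost:9000 url_style=path use_ssl=false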

For Azure Blob Storage:

  • azure_account_name (optional) -> Azure storage account name

  • azure_account_key (optional) -> Azure storage account key

  • azure_sas_token (optional) -> Azure SAS token

  • azure_tenant_id (optional) -> Azure tenant ID

  • azure_client_id (optional) -> Azure client ID

  • azure_client_secret (optional) -> Azure client secret

  • azure_connection_string (optional) -> Azure connection string
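
For instance, an Azure Blob Storage setup might look like this (a sketch only; the container, account name and key are placeholders):

# hypothetical Azure Blob Storage example
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake data_path=az://my-container/data azure_account_name=myaccount azure_account_key='<account_key>'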

For Google Cloud Storage:

  • gcs_key_file (optional) -> Path to GCS service account key file

  • gcs_project_id (optional) -> GCS project ID

  • gcs_access_key_id (optional) -> GCS HMAC access key ID

  • gcs_secret_access_key (optional) -> GCS HMAC secret access key
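
For instance, a GCS-backed setup might look like this (a sketch only; the bucket, project and key file path are placeholders):

# hypothetical GCS example using a service account key file
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake data_path=gs://my-bucket/data gcs_key_file=/path/to/service-account.json gcs_project_id=my-project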

Using sling conns

Here are examples of setting a connection named DUCKLAKE. We must provide the type=ducklake property:

# Default DuckDB catalog with local data
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake

# DuckDB catalog with S3 data storage
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake data_path=s3://bucket/data

# SQLite catalog for multi-client access
$ sling conns set DUCKLAKE type=ducklake catalog_type=sqlite catalog_conn_string=metadata.sqlite

# PostgreSQL catalog for multi-user access
$ sling conns set DUCKLAKE type=ducklake catalog_type=postgres catalog_conn_string="host=localhost port=5432 user=myuser password=mypass dbname=ducklake_catalog"

# MySQL catalog with Azure storage
$ sling conns set DUCKLAKE type=ducklake catalog_type=mysql catalog_conn_string="host=localhost user=root password=pass db=ducklake_catalog" data_path=az://container/data
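
Once a connection is defined, you can verify that Sling can reach the catalog and data path:

# test the connection
$ sling conns test DUCKLAKE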

Environment Variable


# configuration with s3 properties
export DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "s3://bucket/path/to/data/folder/",
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  s3_region: "us-east-1"
}'

# configuration with Non-HTTPS MinIO properties
export DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "s3://bucket/path/to/data/folder/",
  s3_endpoint: "machine:9000",
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  s3_region: "minio",
  url_style: "path",
  use_ssl: false
}'

# configuration with r2 properties
export DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "r2://bucket/path/to/data/folder/",
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  s3_endpoint: "<account-id>.r2.cloudflarestorage.com"
}'

# Windows PowerShell
$env:DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "az://my_container/path/to/data/folder/",
  azure_account_name: "<azure_account_name>",
  azure_account_key: "<azure_account_key>"
}'

Sling Env File YAML

See here to learn more about the sling env.yaml file.

connections:
  DUCKLAKE:
    type: ducklake
    catalog_type: <catalog_type>
    catalog_conn_string: <catalog_conn_string>
    data_path: <data_path>
    database: <database>
    schema: <schema>
    
    # S3 Configuration (if using S3 storage)
    s3_access_key_id: <s3_access_key_id>
    s3_secret_access_key: <s3_secret_access_key>
    s3_region: <s3_region>
    s3_endpoint: localhost:9000
    
    # Azure Configuration (if using Azure storage)
    azure_account_name: <azure_account_name>
    azure_account_key: <azure_account_key>
    
    # GCS Configuration (if using GCS storage)
    gcs_key_file: <gcs_key_file>
    gcs_project_id: <gcs_project_id>

Common Usage Examples

Basic Operations

# List tables
sling conns discover DUCKLAKE

# Query data
sling run --src-conn DUCKLAKE --src-stream "SELECT * FROM sales.orders LIMIT 10" --stdout

# Export to CSV
sling run --src-conn DUCKLAKE --src-stream sales.orders --tgt-object file://./orders.csv

Data Import/Export

# Import CSV to DuckLake
sling run --src-stream file://./data.csv --tgt-conn DUCKLAKE --tgt-object sales.new_data

# Import from PostgreSQL
sling run --src-conn POSTGRES_DB --src-stream public.customers --tgt-conn DUCKLAKE --tgt-object sales.customers

# Export to Parquet files
sling run --src-conn DUCKLAKE --src-stream sales.orders --tgt-conn AWS_S3 --tgt-object s3://bucket/exports/orders.parquet
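
The same loads can be expressed as a replication file and run with sling run -r replication.yaml. Below is a minimal sketch; the POSTGRES_DB connection and the stream/object names are illustrative:

# replication.yaml -- hypothetical example loading a PostgreSQL table into DuckLake
source: POSTGRES_DB
target: DUCKLAKE

defaults:
  mode: full-refresh

streams:
  public.customers:
    object: sales.customers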

If you are facing issues connecting, please reach out to us at [email protected], on Discord, or open a GitHub issue here.
