DuckLake
Connect & Ingest data from / to a DuckLake database
DuckLake is a data lake format specification that combines the power of DuckDB with flexible catalog backends and scalable data storage. It provides versioned, ACID-compliant tables with support for multiple catalog databases and various storage backends. See https://ducklake.select/ for more details.
Setup
The following credential keys are accepted:
- catalog_type (required) -> The catalog database type: duckdb, sqlite, postgres, or mysql. Default is duckdb.
- catalog_conn_string (required) -> The connection string for the catalog database. See here for details. Examples:
  - postgres: dbname=ducklake_catalog host=localhost
  - sqlite: metadata.sqlite
  - duckdb: metadata.ducklake
  - mysql: db=ducklake_catalog host=localhost
- catalog_schema (optional) -> The schema to use to store catalog tables (e.g. public).
- data_path (optional) -> Path to data files (local, S3, Azure, GCS), e.g. /local/path, s3://bucket/data, r2://bucket/data, az://container/data, gs://bucket/data.
- data_inlining_limit (optional) -> The row limit to use for DATA_INLINING_ROW_LIMIT.
- copy_method (optional, v1.4.20+) -> The method used to copy data from Sling into DuckDB. Acceptable values: csv_http, arrow_http. Default is arrow_http. If facing issues with Arrow, try setting csv_http.
- schema (optional) -> The default schema to use to read/write data. Default is main.
- url_style (optional) -> Specify path to use path-style URLs (e.g. when using MinIO).
- use_ssl (optional) -> Specify false to not use HTTPS (e.g. when using MinIO).
- duckdb_version (optional) -> The CLI version of DuckDB to use. You can also specify the env. variable DUCKDB_VERSION.
- max_buffer_size (optional, v1.5) -> The max buffer size to use when piping data from the DuckDB CLI. Specify this if you have extremely large one-line text values in your dataset. Default is 10485760 (10MB).
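To illustrate how a few of the less common keys fit together, here is a minimal sketch of a connection entry using the Sling Env File YAML format described further down; all values are placeholders to adjust for your setup.
connections:
  DUCKLAKE:
    type: ducklake
    catalog_type: duckdb
    catalog_conn_string: metadata.ducklake
    data_path: s3://bucket/data
    copy_method: csv_http      # fall back to CSV if Arrow causes issues
    max_buffer_size: 20971520  # 20MB, for very large one-line text values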
Storage Configuration
For S3/S3-compatible storage:
- s3_access_key_id (optional) -> AWS access key ID
- s3_secret_access_key (optional) -> AWS secret access key
- s3_session_token (optional) -> AWS session token
- s3_region (optional) -> AWS region
- s3_profile (optional) -> AWS profile to use
- s3_endpoint (optional) -> S3-compatible endpoint URL (e.g. localhost:9000 for MinIO)
For Azure Blob Storage:
- azure_account_name (optional) -> Azure storage account name
- azure_account_key (optional) -> Azure storage account key
- azure_sas_token (optional) -> Azure SAS token
- azure_tenant_id (optional) -> Azure tenant ID
- azure_client_id (optional) -> Azure client ID
- azure_client_secret (optional) -> Azure client secret
- azure_connection_string (optional) -> Azure connection string
For Google Cloud Storage:
- gcs_key_file (optional) -> Path to GCS service account key file
- gcs_project_id (optional) -> GCS project ID
- gcs_access_key_id (optional) -> GCS HMAC access key ID
- gcs_secret_access_key (optional) -> GCS HMAC secret access key
Using sling conns
Here are examples of setting a connection named DUCKLAKE. We must provide the type=ducklake property:
# Default DuckDB catalog with local data
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake
# DuckDB catalog with S3 data storage
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake data_path=s3://bucket/data
# SQLite catalog for multi-client access
$ sling conns set DUCKLAKE type=ducklake catalog_type=sqlite catalog_conn_string=metadata.sqlite
# PostgreSQL catalog for multi-user access
$ sling conns set DUCKLAKE type=ducklake catalog_type=postgres catalog_conn_string="host=localhost port=5432 user=myuser password=mypass dbname=ducklake_catalog"
# MySQL catalog with Azure storage
$ sling conns set DUCKLAKE type=ducklake catalog_type=mysql catalog_conn_string="host=localhost user=root password=pass db=ducklake_catalog" data_path=az://container/data
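Once a connection is set, it can be checked before running any loads; sling conns test validates connectivity, and sling conns discover (also used in the examples further down) lists the available schemas and tables.
# validate connectivity
$ sling conns test DUCKLAKE
# list schemas and tables
$ sling conns discover DUCKLAKE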
Environment Variable
# configuration with s3 properties
export DUCKLAKE='{
type: ducklake,
catalog_type: postgres,
catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
data_path: "s3://bucket/path/to/data/folder/",
s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
s3_region: "us-east-1"
}'
# configuration with Non-HTTPS MinIO properties
export DUCKLAKE='{
type: ducklake,
catalog_type: postgres,
catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
data_path: "s3://bucket/path/to/data/folder/",
s3_endpoint: "machine:9000",
s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
s3_region: "minio",
url_style: "path",
use_ssl: false
}'
# configuration with r2 properties
export DUCKLAKE='{
type: ducklake,
catalog_type: postgres,
catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
data_path: "r2://bucket/path/to/data/folder/",
s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
s3_endpoint: "<account-id>.r2.cloudflarestorage.com"
}'
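For Google Cloud Storage the same pattern applies; the following is a sketch with placeholder values (key file path, project ID and bucket), using the gcs_* keys listed under Storage Configuration.
# configuration with GCS properties
export DUCKLAKE='{
type: ducklake,
catalog_type: postgres,
catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
data_path: "gs://bucket/path/to/data/folder/",
gcs_key_file: "/path/to/service-account-key.json",
gcs_project_id: "my-project-id"
}'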
# Windows PowerShell
$env:DUCKLAKE='{
type: ducklake,
catalog_type: postgres,
catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
data_path: "az://my_container/path/to/data/folder/",
azure_account_name: "<azure_account_name>",
azure_account_key: "<azure_account_key>"
}'
Sling Env File YAML
See here to learn more about the sling env.yaml file.
connections:
  DUCKLAKE:
    type: ducklake
    catalog_type: <catalog_type>
    catalog_conn_string: <catalog_conn_string>
    data_path: <data_path>
    database: <database>
    schema: <schema>

    # S3 Configuration (if using S3 storage)
    s3_access_key_id: <s3_access_key_id>
    s3_secret_access_key: <s3_secret_access_key>
    s3_region: <s3_region>
    s3_endpoint: http://localhost:9000

    # Azure Configuration (if using Azure storage)
    azure_account_name: <azure_account_name>
    azure_account_key: <azure_account_key>

    # GCS Configuration (if using GCS storage)
    gcs_key_file: <gcs_key_file>
    gcs_project_id: <gcs_project_id>
Common Usage Examples
Basic Operations
# List tables
sling conns discover DUCKLAKE
# Query data
sling run --src-conn DUCKLAKE --src-stream "SELECT * FROM sales.orders LIMIT 10" --stdout
# Export to CSV
sling run --src-conn DUCKLAKE --src-stream sales.orders --tgt-object file://./orders.csv
Data Import/Export
# Import CSV to DuckLake
sling run --src-stream file://./data.csv --tgt-conn DUCKLAKE --tgt-object sales.new_data
# Import from PostgreSQL
sling run --src-conn POSTGRES_DB --src-stream public.customers --tgt-conn DUCKLAKE --tgt-object sales.customers
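The same PostgreSQL-to-DuckLake load can also be expressed as a replication file, which is convenient when syncing many tables at once. This is a minimal sketch; the connection names and stream/object names are placeholders.
# replication.yaml
source: POSTGRES_DB
target: DUCKLAKE
defaults:
  mode: full-refresh
streams:
  public.customers:
    object: sales.customers
Run it with: sling run -r replication.yaml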
# Export to Parquet files
sling run --src-conn DUCKLAKE --src-stream sales.orders --tgt-conn AWS_S3 --tgt-object s3://bucket/exports/orders.parquet
If you are facing issues connecting, please reach out to us at [email protected], on Discord, or open a GitHub Issue here.