DuckLake
Connect & Ingest data from / to a DuckLake database
DuckLake is a data lake format specification that combines the power of DuckDB with flexible catalog backends and scalable data storage. It provides versioned, ACID-compliant tables with support for multiple catalog databases and various storage backends. See https://ducklake.select/ for more details.
Setup
The following credentials keys are accepted:
- `catalog_type` (required) -> The catalog database type: `duckdb`, `sqlite`, `postgres`, or `mysql`. Default is `duckdb`.
- `catalog_conn_string` (required) -> The connection string for the catalog database. See here for details. Examples:
  - `postgres`: `dbname=ducklake_catalog host=localhost`
  - `sqlite`: `metadata.sqlite`
  - `duckdb`: `metadata.ducklake`
  - `mysql`: `db=ducklake_catalog host=localhost`
- `catalog_schema` (optional) -> The schema to use to store catalog tables (e.g. `public`).
- `data_path` (optional) -> Path to data files (local, S3, Azure, GCS), e.g. `/local/path`, `s3://bucket/data`, `r2://bucket/data`, `az://container/data`, `gs://bucket/data`.
- `data_inlining_limit` (optional) -> The row limit to use for `DATA_INLINING_ROW_LIMIT`.
- `copy_method` (optional, v1.4.20+) -> The method used to copy data from Sling into DuckDB. Acceptable values: `csv_http`, `arrow_http`. Default is `arrow_http`. If facing issues with arrow, try setting `csv_http`.
- `schema` (optional) -> The default schema to use to read/write data. Default is `main`.
- `url_style` (optional) -> Specify `path` to use path-style URLs (e.g. when using MinIO).
- `use_ssl` (optional) -> Specify `false` to not use HTTPS (e.g. when using MinIO).
- `duckdb_version` (optional) -> The CLI version of DuckDB to use. You can also specify the env variable `DUCKDB_VERSION`.
- `max_buffer_size` (optional, v1.5+) -> The max buffer size to use when piping data from the DuckDB CLI. Specify this if you have extremely large one-line text values in your dataset. Default is `10485760` (10MB).
Storage Configuration
For S3/S3-compatible storage:
- `s3_access_key_id` (optional) -> AWS access key ID
- `s3_secret_access_key` (optional) -> AWS secret access key
- `s3_session_token` (optional) -> AWS session token
- `s3_region` (optional) -> AWS region
- `s3_profile` (optional) -> AWS profile to use
- `s3_endpoint` (optional) -> S3-compatible endpoint URL (e.g. `localhost:9000` for MinIO)
For Azure Blob Storage:
- `azure_account_name` (optional) -> Azure storage account name
- `azure_account_key` (optional) -> Azure storage account key
- `azure_sas_token` (optional) -> Azure SAS token
- `azure_tenant_id` (optional) -> Azure tenant ID
- `azure_client_id` (optional) -> Azure client ID
- `azure_client_secret` (optional) -> Azure client secret
- `azure_connection_string` (optional) -> Azure connection string
For Google Cloud Storage:
- `gcs_key_file` (optional) -> Path to GCS service account key file
- `gcs_project_id` (optional) -> GCS project ID
- `gcs_access_key_id` (optional) -> GCS HMAC access key ID
- `gcs_secret_access_key` (optional) -> GCS HMAC secret access key
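As a sketch of how the storage keys combine with the connection keys, here is what a DuckLake connection backed by a local MinIO instance might look like in a Sling env file. The bucket name, credentials, and endpoint below are placeholders, not real values:

```yaml
connections:
  DUCKLAKE:
    type: ducklake
    catalog_type: sqlite
    catalog_conn_string: metadata.sqlite
    data_path: s3://my-bucket/ducklake      # hypothetical bucket
    s3_access_key_id: minioadmin            # placeholder credentials
    s3_secret_access_key: minioadmin
    s3_endpoint: localhost:9000             # MinIO endpoint
    url_style: path                         # MinIO needs path-style URLs
    use_ssl: false                          # MinIO without TLS
```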
Using sling conns
Here are examples of setting a connection named `DUCKLAKE` using `sling conns`. We must provide the `type=ducklake` property:
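A minimal sketch using the `sling conns set` command (the catalog file and data path are placeholders):

```shell
# Set a connection named DUCKLAKE via the CLI (values are placeholders)
sling conns set DUCKLAKE type=ducklake catalog_type=duckdb \
  catalog_conn_string=metadata.ducklake data_path=/local/path

# Verify the connection works
sling conns test DUCKLAKE
```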
Environment Variable
See here to learn more about the .env.sling file.
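For instance, assuming a local DuckDB catalog file named `metadata.ducklake` and a local data path (both placeholders), the connection can be declared as a YAML payload in an environment variable:

```shell
# Define the connection as a YAML object in an environment variable
export DUCKLAKE='{type: ducklake, catalog_type: duckdb, catalog_conn_string: metadata.ducklake, data_path: /local/path}'

# Verify the connection works
sling conns test DUCKLAKE
```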
Sling Env File YAML
See here to learn more about the sling env.yaml file.
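A sketch of an `env.yaml` entry, here assuming a Postgres catalog database named `ducklake_catalog` and a hypothetical S3 bucket for data files:

```yaml
connections:
  DUCKLAKE:
    type: ducklake
    catalog_type: postgres
    catalog_conn_string: "dbname=ducklake_catalog host=localhost"
    catalog_schema: public
    data_path: s3://my-bucket/ducklake-data   # hypothetical bucket
```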
Common Usage Examples
Basic Operations
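With the connection defined, basic operations can be performed with the `sling conns` subcommands (connection name as configured above):

```shell
# Test connectivity to the DuckLake database
sling conns test DUCKLAKE

# List the schemas and tables available in the connection
sling conns discover DUCKLAKE
```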
Data Import/Export
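A sketch of importing and exporting data with `sling run`; the file paths and table name below are placeholders:

```shell
# Load a local CSV file into a DuckLake table
sling run --src-stream file:///tmp/orders.csv \
  --tgt-conn DUCKLAKE --tgt-object main.orders

# Export a DuckLake table to a local CSV file
sling run --src-conn DUCKLAKE --src-stream main.orders \
  --tgt-object file:///tmp/orders_export.csv
```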
If you are facing issues connecting, please reach out to us at [email protected], on Discord, or open a GitHub Issue here.