# DuckLake

DuckLake is a data lake format specification that combines the power of DuckDB with flexible catalog backends and scalable data storage. It provides versioned, ACID-compliant tables with support for multiple catalog databases and various storage backends. See <https://ducklake.select/> for more details.

## Setup

The following credentials keys are accepted:

* `catalog_type` (optional) -> The catalog database type: `duckdb`, `sqlite`, `postgres`, or `mysql`. Default is `duckdb`.
* `catalog_conn_string` **(required)** -> The connection string for the catalog database. See [here](https://ducklake.select/docs/stable/duckdb/usage/choosing_a_catalog_database.html) for details. Examples:
  * postgres: `dbname=ducklake_catalog host=localhost`
  * sqlite: `metadata.sqlite`
  * duckdb: `metadata.ducklake`
  * mysql: `db=ducklake_catalog host=localhost`
* `catalog_schema` (optional) -> The schema to use to store catalog tables (e.g. `public`).
* `data_path` (optional) -> Path to data files (local, S3, Azure, GCS). e.g. `/local/path`, `s3://bucket/data`, `r2://bucket/data`, `az://container/data`, `gs://bucket/data`.
* `data_inlining_limit` (optional) -> The row limit to use for [`DATA_INLINING_ROW_LIMIT`](https://ducklake.select/docs/stable/duckdb/advanced_features/data_inlining.html).
* `copy_method` (optional *v1.4.20+*) -> The method used to copy data from Sling into DuckDB. Acceptable values: `csv_http`, `arrow_http`. Default is `arrow_http`. **If you face issues with Arrow, try setting `csv_http`**.
* `schema` (optional) -> The default schema to use to read/write data. Default is `main`.
* `url_style` (optional) -> Specify `path` to use path-style URLs (e.g. when using MinIO).
* `use_ssl` (optional) -> Specify `false` to disable HTTPS (e.g. when using MinIO).
* `duckdb_version` (optional) -> The CLI version of DuckDB to use. You can also set the `DUCKDB_VERSION` environment variable.
* `max_buffer_size` (optional *v1.5*) -> The max buffer size to use when piping data from the DuckDB CLI. Specify this if your dataset contains extremely large one-line text values. Default is `10485760` (10MB).
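As a quick sanity check, the keys above can be combined into a single JSON-style value and exported before running Sling. The catalog file and data path below are hypothetical placeholders:

```shell
# Minimal DuckLake connection: DuckDB catalog with a local data path
# (file names and paths are hypothetical examples)
export DUCKLAKE='{
  type: ducklake,
  catalog_conn_string: "metadata.ducklake",
  data_path: "/tmp/ducklake_data",
  copy_method: "arrow_http"
}'

# Confirm the variable is set and contains the required type key
echo "$DUCKLAKE" | grep -q 'type: ducklake' && echo "DUCKLAKE configured"
```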

### Storage Configuration

For **S3/S3-compatible storage**:

* `s3_access_key_id` (optional) -> AWS access key ID
* `s3_secret_access_key` (optional) -> AWS secret access key
* `s3_session_token` (optional) -> AWS session token
* `s3_region` (optional) -> AWS region
* `s3_profile` (optional) -> AWS profile to use
* `s3_endpoint` (optional) -> S3-compatible endpoint URL (e.g. `localhost:9000` for MinIO)

For **Azure Blob Storage**:

* `azure_account_name` (optional) -> Azure storage account name
* `azure_account_key` (optional) -> Azure storage account key
* `azure_sas_token` (optional) -> Azure SAS token
* `azure_tenant_id` (optional) -> Azure tenant ID
* `azure_client_id` (optional) -> Azure client ID
* `azure_client_secret` (optional) -> Azure client secret
* `azure_connection_string` (optional) -> Azure connection string

For **Google Cloud Storage**:

* `gcs_key_file` (optional) -> Path to GCS service account key file
* `gcs_project_id` (optional) -> GCS project ID
* `gcs_access_key_id` (optional) -> GCS HMAC access key ID
* `gcs_secret_access_key` (optional) -> GCS HMAC secret access key
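Putting the storage keys together, a GCS-backed DuckLake entry in `env.yaml` might look like the sketch below. The bucket name, project ID, and key-file path are hypothetical:

```yaml
connections:
  DUCKLAKE_GCS:
    type: ducklake
    catalog_type: sqlite
    catalog_conn_string: metadata.sqlite
    data_path: gs://my-bucket/ducklake
    gcs_key_file: /path/to/service-account.json
    gcs_project_id: my-project
```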

### Using `sling conns`

Here are examples of setting a connection named `DUCKLAKE`. We must provide the `type=ducklake` property:

{% code overflow="wrap" %}

```bash
# Default DuckDB catalog with local data
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake

# DuckDB catalog with S3 data storage
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake data_path=s3://bucket/data

# SQLite catalog for multi-client access
$ sling conns set DUCKLAKE type=ducklake catalog_type=sqlite catalog_conn_string=metadata.sqlite

# PostgreSQL catalog for multi-user access
$ sling conns set DUCKLAKE type=ducklake catalog_type=postgres catalog_conn_string="host=localhost port=5432 user=myuser password=mypass dbname=ducklake_catalog"

# MySQL catalog with Azure storage
$ sling conns set DUCKLAKE type=ducklake catalog_type=mysql catalog_conn_string="host=localhost user=root password=pass db=ducklake_catalog" data_path=az://container/data

```

{% endcode %}

### Environment Variable

See [here](https://docs.slingdata.io/connections/datalake-connections/pages/eAdVs2BHCgdr6RS8GoJC#dot-env-file-.env.sling) to learn more about the `.env.sling` file.

{% code overflow="wrap" %}

```bash

# configuration with s3 properties
export DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "s3://bucket/path/to/data/folder/",
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  s3_region: "us-east-1"
}'

# configuration with Non-HTTPS MinIO properties
export DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "s3://bucket/path/to/data/folder/",
  s3_endpoint: machine:9000,
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  s3_region: "minio",
  url_style: "path",
  use_ssl: false
}'

# configuration with r2 properties
export DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "r2://bucket/path/to/data/folder/",
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  s3_endpoint: "<account-id>.r2.cloudflarestorage.com"
}'

# Windows PowerShell
$env:DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "az://my_container/path/to/data/folder/",
  azure_account_name: "<azure_account_name>",
  azure_account_key: "<azure_account_key>"
}'
```

{% endcode %}
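If a connection fails with a parsing error, a quoting mistake in the exported value is a common cause. One convenience (not a Sling feature) is to write the value as strict JSON with quoted keys, which can then be validated with an ordinary JSON parser before handing it to Sling:

```shell
# Strict-JSON variant of the connection payload (quoted keys),
# so it can be validated with a standard JSON parser.
export DUCKLAKE='{ "type": "ducklake", "catalog_type": "duckdb", "catalog_conn_string": "metadata.ducklake" }'

# Quick syntax check -- catches unbalanced quotes or braces early
echo "$DUCKLAKE" | python3 -c 'import json, sys; json.load(sys.stdin); print("valid JSON")'
```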

### Sling Env File YAML

See [here](https://docs.slingdata.io/connections/datalake-connections/pages/eAdVs2BHCgdr6RS8GoJC#sling-env-file-env.yaml) to learn more about the sling `env.yaml` file.

```yaml
connections:
  DUCKLAKE:
    type: ducklake
    catalog_type: <catalog_type>
    catalog_conn_string: <catalog_conn_string>
    data_path: <data_path>
    database: <database>
    schema: <schema>
    
    # S3 Configuration (if using S3 storage)
    s3_access_key_id: <s3_access_key_id>
    s3_secret_access_key: <s3_secret_access_key>
    s3_region: <s3_region>
    s3_endpoint: localhost:9000
    
    # Azure Configuration (if using Azure storage)
    azure_account_name: <azure_account_name>
    azure_account_key: <azure_account_key>
    
    # GCS Configuration (if using GCS storage)
    gcs_key_file: <gcs_key_file>
    gcs_project_id: <gcs_project_id>
```

## Common Usage Examples

### Basic Operations

```bash
# List tables
sling conns discover DUCKLAKE

# Query data
sling run --src-conn DUCKLAKE --src-stream "SELECT * FROM sales.orders LIMIT 10" --stdout

# Export to CSV
sling run --src-conn DUCKLAKE --src-stream sales.orders --tgt-object file://./orders.csv
```

### Data Import/Export

```bash
# Import CSV to DuckLake
sling run --src-stream file://./data.csv --tgt-conn DUCKLAKE --tgt-object sales.new_data

# Import from PostgreSQL
sling run --src-conn POSTGRES_DB --src-stream public.customers --tgt-conn DUCKLAKE --tgt-object sales.customers

# Export to Parquet files
sling run --src-conn DUCKLAKE --src-stream sales.orders --tgt-conn AWS_S3 --tgt-object s3://bucket/exports/orders.parquet
```
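The PostgreSQL import above can also be expressed as a replication file, which is handy for recurring loads. The connection names and stream below are the hypothetical ones from the examples:

```yaml
# replication.yaml -- connection names are hypothetical examples
source: POSTGRES_DB
target: DUCKLAKE

defaults:
  mode: full-refresh

streams:
  public.customers:
    object: sales.customers
```

Run it with `sling run -r replication.yaml`.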

If you are facing issues connecting, please reach out to us at <support@slingdata.io>, on [Discord](https://discord.gg/q5xtaSNDvp), or open a GitHub issue [here](https://github.com/slingdata-io/sling-cli/issues).

