# DuckLake

DuckLake is a data lake format specification that combines the power of DuckDB with flexible catalog backends and scalable data storage. It provides versioned, ACID-compliant tables with support for multiple catalog databases and various storage backends. See <https://ducklake.select/> for more details.

## Setup

The following credentials keys are accepted:

* `catalog_type` (optional) -> The catalog database type: `duckdb`, `sqlite`, `postgres`, or `mysql`. Default is `duckdb`.
* `catalog_conn_string` **(required)** -> The connection string for the catalog database. See [here](https://ducklake.select/docs/stable/duckdb/usage/choosing_a_catalog_database) for details. Examples:
  * postgres: `dbname=ducklake_catalog host=localhost`
  * sqlite: `metadata.sqlite`
  * duckdb: `metadata.ducklake`
  * mysql: `db=ducklake_catalog host=localhost`
* `catalog_schema` (optional) -> The schema to use to store catalog tables (e.g. `public`).
* `data_path` (optional) -> Path to data files (local, S3, Azure, GCS). e.g. `/local/path`, `s3://bucket/data`, `r2://bucket/data`, `az://container/data`, `gs://bucket/data`.
* `data_inlining_limit` (optional) -> The row limit to use for [`DATA_INLINING_ROW_LIMIT`](https://ducklake.select/docs/stable/duckdb/advanced_features/data_inlining.html).
* `copy_method` (optional *v1.4.20+*) -> The method used to copy data from Sling into DuckDB. Accepted values: `csv_http`, `arrow_http`. Default is `arrow_http`. **If you run into issues with Arrow, try setting `csv_http`**.
* `schema` (optional) -> The default schema to use to read/write data. Default is `main`.
* `url_style` (optional) -> Specify `path` to use path-style URLs (e.g. when using MinIO).
* `use_ssl` (optional) -> Specify `false` to disable HTTPS (e.g. when using MinIO).
* `duckdb_version` (optional) -> The DuckDB CLI version to use. You can also set the environment variable `DUCKDB_VERSION`.
* `max_buffer_size` (optional *v1.5*) -> The maximum buffer size to use when piping data from the DuckDB CLI. Set this if your dataset contains extremely large single-line text values. Default is `10485760` (10 MB).

### Storage Configuration

For **S3/S3-compatible storage**:

* `s3_access_key_id` (optional) -> AWS access key ID
* `s3_secret_access_key` (optional) -> AWS secret access key
* `s3_session_token` (optional) -> AWS session token
* `s3_region` (optional) -> AWS region
* `s3_profile` (optional) -> AWS profile to use
* `s3_endpoint` (optional) -> S3-compatible endpoint URL (e.g. `localhost:9000` for MinIO)

For **Azure Blob Storage**:

* `azure_account_name` (optional) -> Azure storage account name
* `azure_account_key` (optional) -> Azure storage account key
* `azure_sas_token` (optional) -> Azure SAS token
* `azure_tenant_id` (optional) -> Azure tenant ID
* `azure_client_id` (optional) -> Azure client ID
* `azure_client_secret` (optional) -> Azure client secret
* `azure_connection_string` (optional) -> Azure connection string

For **Google Cloud Storage**:

* `gcs_key_file` (optional) -> Path to GCS service account key file
* `gcs_project_id` (optional) -> GCS project ID
* `gcs_access_key_id` (optional) -> GCS HMAC access key ID
* `gcs_secret_access_key` (optional) -> GCS HMAC secret access key
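
As a sketch of how these keys fit together, a DuckLake connection backed by GCS might look like this in a Sling `env.yaml` file. The connection name, bucket, project ID, and key-file path below are placeholders, not defaults:

```yaml
connections:
  DUCKLAKE_GCS:
    type: ducklake
    catalog_type: sqlite
    catalog_conn_string: metadata.sqlite
    data_path: gs://my-bucket/ducklake/       # hypothetical bucket
    gcs_key_file: /path/to/service-account.json
    gcs_project_id: my-project
```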

### Using `sling conns`

Here are examples of setting a connection named `DUCKLAKE`. We must provide the `type=ducklake` property:

{% code overflow="wrap" %}

```bash
# Default DuckDB catalog with local data
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake

# DuckDB catalog with S3 data storage
$ sling conns set DUCKLAKE type=ducklake catalog_conn_string=metadata.ducklake data_path=s3://bucket/data

# SQLite catalog for multi-client access
$ sling conns set DUCKLAKE type=ducklake catalog_type=sqlite catalog_conn_string=metadata.sqlite

# PostgreSQL catalog for multi-user access
$ sling conns set DUCKLAKE type=ducklake catalog_type=postgres catalog_conn_string="host=localhost port=5432 user=myuser password=mypass dbname=ducklake_catalog"

# MySQL catalog with Azure storage
$ sling conns set DUCKLAKE type=ducklake catalog_type=mysql catalog_conn_string="host=localhost user=root password=pass db=ducklake_catalog" data_path=az://container/data

```

{% endcode %}
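
Once a connection is defined, it can be checked before use. The commands below assume the `DUCKLAKE` connection set above:

```bash
# Verify that the catalog database and data path are reachable
sling conns test DUCKLAKE

# List the available streams (tables)
sling conns discover DUCKLAKE
```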

### Environment Variable

See [here](https://docs.slingdata.io/sling-cli/environment#dot-env-file-.env.sling) to learn more about the `.env.sling` file.

{% code overflow="wrap" %}

```bash
# configuration with s3 properties
export DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "s3://bucket/path/to/data/folder/",
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  s3_region: "us-east-1"
}'

# configuration with Non-HTTPS MinIO properties
export DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "s3://bucket/path/to/data/folder/",
  s3_endpoint: "machine:9000",
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  s3_region: "minio",
  url_style: "path",
  use_ssl: false
}'

# configuration with r2 properties
export DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "r2://bucket/path/to/data/folder/",
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  s3_endpoint: "<account-id>.r2.cloudflarestorage.com"
}'

# Windows PowerShell
$env:DUCKLAKE='{ 
  type: ducklake, 
  catalog_type: postgres,
  catalog_conn_string: "host=db.example.com port=5432 user=ducklake password=secret dbname=ducklake_catalog",
  data_path: "az://my_container/path/to/data/folder/",
  azure_account_name: "<azure_account_name>",
  azure_account_key: "<azure_account_key>"
}'
```

{% endcode %}

### Sling Env File YAML

See [here](https://docs.slingdata.io/sling-cli/environment#sling-env-file-env.yaml) to learn more about the sling `env.yaml` file.

```yaml
connections:
  DUCKLAKE:
    type: ducklake
    catalog_type: <catalog_type>
    catalog_conn_string: <catalog_conn_string>
    data_path: <data_path>
    database: <database>
    schema: <schema>
    
    # S3 Configuration (if using S3 storage)
    s3_access_key_id: <s3_access_key_id>
    s3_secret_access_key: <s3_secret_access_key>
    s3_region: <s3_region>
    s3_endpoint: localhost:9000
    
    # Azure Configuration (if using Azure storage)
    azure_account_name: <azure_account_name>
    azure_account_key: <azure_account_key>
    
    # GCS Configuration (if using GCS storage)
    gcs_key_file: <gcs_key_file>
    gcs_project_id: <gcs_project_id>
```

## Common Usage Examples

### Basic Operations

```bash
# List tables
sling conns discover DUCKLAKE

# Query data
sling run --src-conn DUCKLAKE --src-stream "SELECT * FROM sales.orders LIMIT 10" --stdout

# Export to CSV
sling run --src-conn DUCKLAKE --src-stream sales.orders --tgt-object file://./orders.csv
```

### Data Import/Export

```bash
# Import CSV to DuckLake
sling run --src-stream file://./data.csv --tgt-conn DUCKLAKE --tgt-object sales.new_data

# Import from PostgreSQL
sling run --src-conn POSTGRES_DB --src-stream public.customers --tgt-conn DUCKLAKE --tgt-object sales.customers

# Export to Parquet files
sling run --src-conn DUCKLAKE --src-stream sales.orders --tgt-conn AWS_S3 --tgt-object s3://bucket/exports/orders.parquet
```
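
The one-off commands above can also be captured in a replication file. A minimal sketch, reusing the `POSTGRES_DB` source and stream names from the examples above:

```yaml
# replication.yaml
source: POSTGRES_DB
target: DUCKLAKE

defaults:
  mode: full-refresh

streams:
  public.customers:
    object: sales.customers
```

Run it with `sling run -r replication.yaml`.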

If you are facing issues connecting, please reach out to us at <support@slingdata.io>, on [Discord](https://discord.gg/q5xtaSNDvp), or open a GitHub issue [here](https://github.com/slingdata-io/sling-cli/issues).
