Iceberg

Connect & Ingest data from / to Apache Iceberg tables

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, Hive and Impala using a high-performance table format that works just like a SQL table. See https://iceberg.apache.org/ for more details.

Sling supports connecting to Iceberg tables through catalog backends including REST catalogs, AWS Glue, and SQL catalogs.

Setup

The following credentials keys are accepted:

Common Properties

catalog_type (required) -> The catalog type: rest, glue, or sql. Default is rest.
schema (optional) -> The default schema to use to read/write data. Default is main.

REST Catalog Configuration

rest_uri (required for REST) -> The REST catalog endpoint URI (e.g., https://s3tables.us-east-1.amazonaws.com/iceberg, https://catalog.cloudflarestorage.com/xxxxxxxxx/warehouse, http://localhost:8181).
rest_warehouse (optional) -> Warehouse location for the catalog (e.g., s3://bucket/warehouse, arn:aws:s3tables:region:account-id:bucket/namespace).
rest_token (optional) -> OAuth token for authentication.
rest_oauth_client_id (optional) -> OAuth client ID for authentication.
rest_oauth_client_secret (optional) -> OAuth client secret for authentication.
rest_oauth_scope (optional) -> OAuth scope for authentication.
rest_oauth_server_uri (optional) -> OAuth server URI for token requests.
rest_prefix (optional) -> API prefix for the REST catalog.
rest_metadata_location (optional) -> Custom metadata location.
rest_extra_props (optional) -> Additional properties as JSON string.
rest_sigv4_enable (optional) -> Enable AWS SigV4 authentication (true/false).
rest_sigv4_region (optional) -> AWS region for SigV4 authentication (with rest_sigv4_enable=true).
rest_sigv4_service (optional) -> AWS service name for SigV4 authentication (with rest_sigv4_enable=true).

Glue Catalog Configuration

glue_warehouse (required for Glue) -> Warehouse location in S3. e.g. s3://my-bucket/warehouse
glue_account_id (optional) -> AWS account ID for Glue catalog.
glue_namespace (optional) -> Namespace in the Glue catalog.
glue_extra_props (optional) -> Extra Glue Properties (object).

SQL Catalog Configuration

sql_catalog_name (optional) -> Name of the SQL catalog. Default is sql.
sql_catalog_conn (required for SQL) -> Name of a Sling database connection to use as catalog backend.
sql_catalog_init (optional) -> Whether to initialize catalog tables if they don't exist. Default is true.

Storage Configuration for Glue Catalog or S3Tables or SQL Catalog

For S3/S3-compatible storage:

s3_access_key_id (optional) -> AWS access key ID
s3_secret_access_key (optional) -> AWS secret access key
s3_session_token (optional) -> AWS session token
s3_region (optional) -> AWS region
s3_profile (optional) -> AWS profile to use
s3_endpoint (optional) -> S3-compatible endpoint URL (e.g. http://localhost:9000 for MinIO)

Using `sling conns`

Here are examples of setting a connection named ICEBERG. We must provide the type=iceberg property:

# REST catalog with local warehouse
$ sling conns set ICEBERG type=iceberg catalog_type=rest rest_uri=http://localhost:8181

# AWS S3 Tables via REST
$ sling conns set ICEBERG type=iceberg catalog_type=rest rest_warehouse="arn:aws:s3tables:us-east-1:123456789012:bucket/my-namespace" s3_profile=my-profile

# Cloudflare R2 Data Catalog via REST
$ sling conns set ICEBERG type=iceberg catalog_type=rest rest_uri="https://catalog.cloudflarestorage.com/xxxxxxxxx/warehouse" rest_warehouse="3fff6d86c73fcxxxxxxx4cb125f34927_warehouse" rest_token="<rest_token>"

# Glue catalog
$ sling conns set ICEBERG type=iceberg catalog_type=glue glue_warehouse=s3://my-bucket/glue-warehouse s3_access_key_id=AKIAIOSFODNN7EXAMPLE s3_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY s3_region=us-east-1

Environment Variable

See here to learn more about the .env.sling file.

# REST catalog local
export ICEBERG='{ 
  type: iceberg, 
  catalog_type: rest,
  rest_uri: "http://localhost:8181",
  rest_token: "<rest_token>",
  rest_warehouse: "<rest_warehouse>"
}'

# REST catalog with OAuth
export ICEBERG='{ 
  type: iceberg, 
  catalog_type: rest,
  rest_uri: "http://localhost:8181",
  rest_oauth_client_id: "my-client-id",
  rest_oauth_client_secret: "my-client-secret",
  rest_oauth_scope: "catalog:write",
  rest_oauth_server_uri: "https://auth.example.com/oauth2/token"
}'

# AWS S3 Tables
export ICEBERG_S3='{ 
  type: iceberg, 
  catalog_type: rest,
  rest_warehouse: "arn:aws:s3tables:us-east-1:123456789012:bucket/my-namespace",
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
}'

# Iceberg with GCP backend (Apache Gravitino)
export ICEBERG_GCP='{ 
  type: iceberg
  catalog_name: iceberg_catalog
  catalog_type: rest
  rest_uri: http://endpoint.com:9001/iceberg
  rest_warehouse: s3://my-bucket/gravitino/gcs_as_s3
  s3_access_key_id: <s3_access_key_id>
  s3_secret_access_key: <s3_secret_access_key>
  s3_endpoint: https://myhost.storage.googleapis.com
  s3_region: auto
  schema: iceberg_schema
}'

# Glue catalog
export ICEBERG_GLUE='{ 
  type: iceberg, 
  catalog_type: glue,
  glue_warehouse: "s3://my-bucket/glue-warehouse",
  s3_region: "us-east-1",
  s3_profile: "default"
}'

# SQL catalog with PostgreSQL
export ICEBERG='{ 
  type: iceberg
  catalog_type: sql
  sql_catalog_conn: "POSTGRES_DB"
  sql_catalog_name: "iceberg_catalog"
  sql_catalog_init: true
  sql_warehouse: s3://my-bucket/iceberg/warehouse
  s3_access_key_id: AKIAIOSFODNN7EXAMPLE
  s3_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  s3_region: us-east-1
}'

Sling Env File YAML

See here to learn more about the sling env.yaml file.

connections:
  ICEBERG:
    type: iceberg
    catalog_type: rest  # or glue, or sql
    catalog_name: iceberg
    schema: main
    
    # REST Catalog Configuration
    rest_uri: http://localhost:8181
    rest_warehouse: s3://my-bucket/warehouse
    rest_token: my-bearer-token
    rest_oauth_client_id: my-client-id
    rest_oauth_client_secret: my-client-secret
    rest_oauth_scope: catalog:write
    rest_oauth_server_uri: https://auth.example.com/oauth2/token
    
    # Glue Catalog Configuration
    catalog_type: glue
    glue_warehouse: s3://my-bucket/glue-warehouse
    s3_access_key_id: AKIAIOSFODNN7EXAMPLE
    s3_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    s3_region: us-east-1
    
    # SQL Catalog Configuration (alternative to REST/Glue)
    # sql_catalog_conn: POSTGRES_DB
    # sql_catalog_name: iceberg_catalog
    # sql_catalog_init: true
    
    # S3 Configuration (if using S3 storage)
    s3_access_key_id: AKIAIOSFODNN7EXAMPLE
    s3_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    s3_region: us-east-1
    s3_endpoint: http://localhost:9000  # for MinIO or other S3-compatible

  # Example SQL catalog setup with PostgreSQL backend
  ICEBERG_SQL:
    type: iceberg
    catalog_type: sql
    sql_catalog_conn: POSTGRES_CATALOG  # defined below
    sql_catalog_name: iceberg_catalog
    sql_catalog_init: true
    schema: main

    # S3 Configuration for storage
    sql_warehouse: s3://my-bucket/iceberg/warehouse
    s3_access_key_id: AKIAIOSFODNN7EXAMPLE
    s3_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    s3_region: us-east-1

  # PostgreSQL connection for SQL catalog backend
  POSTGRES_CATALOG:
    type: postgres
    host: localhost
    port: 5432
    user: postgres
    password: mypassword
    database: iceberg_catalog

Common Usage Examples

Basic Operations

# List namespaces (schemas)
sling conns discover ICEBERG

# List tables in a namespace
sling conns discover ICEBERG --schema my_namespace

# Query data
sling run --src-conn ICEBERG --src-stream "SELECT * FROM my_namespace.orders LIMIT 10" --stdout

# Export to CSV
sling run --src-conn ICEBERG --src-stream my_namespace.orders --tgt-object file://./orders.csv

Data Import/Export

# Import CSV to Iceberg
sling run --src-stream file://./data.csv --tgt-conn ICEBERG --tgt-object my_namespace.new_data

# Import from PostgreSQL
sling run --src-conn POSTGRES_DB --src-stream public.customers --tgt-conn ICEBERG --tgt-object sales.customers

# Export to Parquet files
sling run --src-conn ICEBERG --src-stream sales.orders --tgt-conn AWS_S3 --tgt-object s3://bucket/exports/orders.parquet

# Incremental sync with timestamp
sling run --src-conn POSTGRES_DB --src-stream public.events --tgt-conn ICEBERG --tgt-object events.raw --mode incremental --primary-key id --update-key updated_at

Advanced Queries with DuckDB Integration

# For complex SQL queries, Sling uses DuckDB with Iceberg extension
# Tables must be qualified with "iceberg_catalog" prefix
# Note: Custom SQL queries via DuckDB are only supported with REST and Glue catalog types
sling run --src-conn ICEBERG --src-stream "SELECT count(*) FROM iceberg_catalog.my_namespace.my_table" --stdout

# Join multiple Iceberg tables
sling run --src-conn ICEBERG --src-stream "
  SELECT o.*, c.customer_name 
  FROM iceberg_catalog.sales.orders o 
  JOIN iceberg_catalog.sales.customers c ON o.customer_id = c.id
" --stdout

# Export with custom SQL
sling run --src-conn ICEBERG --src-stream "
  SELECT date_trunc('month', order_date) as month, sum(amount) as total
  FROM iceberg_catalog.sales.orders 
  GROUP BY 1 ORDER BY 1
" --tgt-object file://./monthly_sales.csv

If you are facing issues connecting, please reach out to us at [email protected], on discord or open a Github Issue here.

PreviousDuckLake NextStorage Connections

Last updated 15 days ago

Was this helpful?

hashtagSetup

hashtagCommon Properties

hashtagREST Catalog Configuration

hashtagGlue Catalog Configuration

hashtagSQL Catalog Configuration

hashtagStorage Configuration for Glue Catalog or S3Tables or SQL Catalog

hashtagUsing sling conns

hashtagEnvironment Variable

hashtagSling Env File YAML

hashtagCommon Usage Examples

hashtagBasic Operations

hashtagData Import/Export

hashtagAdvanced Queries with DuckDB Integration