Iceberg

Connect & Ingest data from / to Apache Iceberg tables

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, Hive and Impala using a high-performance table format that works just like a SQL table. See https://iceberg.apache.org/ for more details.

Sling supports connecting to Iceberg tables through catalog backends including REST catalogs, AWS Glue, and SQL catalogs.

Setup

The following credentials keys are accepted:

Common Properties

  • catalog_type (required) -> The catalog type: rest, glue, or sql. Default is rest.

  • schema (optional) -> The default schema to use to read/write data. Default is main.

REST Catalog Configuration

  • rest_uri (required for REST) -> The REST catalog endpoint URI (e.g., https://s3tables.us-east-1.amazonaws.com/iceberg, https://catalog.cloudflarestorage.com/xxxxxxxxx/warehouse, http://localhost:8181).

  • rest_warehouse (optional) -> Warehouse location for the catalog (e.g., s3://bucket/warehouse, arn:aws:s3tables:region:account-id:bucket/namespace).

  • rest_token (optional) -> OAuth token for authentication.

  • rest_oauth_client_id (optional) -> OAuth client ID for authentication.

  • rest_oauth_client_secret (optional) -> OAuth client secret for authentication.

  • rest_oauth_scope (optional) -> OAuth scope for authentication.

  • rest_oauth_server_uri (optional) -> OAuth server URI for token requests.

  • rest_prefix (optional) -> API prefix for the REST catalog.

  • rest_metadata_location (optional) -> Custom metadata location.

  • rest_extra_props (optional) -> Additional properties as JSON string.

  • rest_sigv4_enable (optional) -> Enable AWS SigV4 authentication (true/false).

  • rest_sigv4_region (optional) -> AWS region for SigV4 authentication (with rest_sigv4_enable=true).

  • rest_sigv4_service (optional) -> AWS service name for SigV4 authentication (with rest_sigv4_enable=true).

Glue Catalog Configuration

  • glue_warehouse (required for Glue) -> Warehouse location in S3. e.g. s3://my-bucket/warehouse

  • glue_account_id (optional) -> AWS account ID for Glue catalog.

  • glue_namespace (optional) -> Namespace in the Glue catalog.

  • glue_extra_props (optional) -> Extra Glue Properties (object).

SQL Catalog Configuration

  • sql_catalog_name (optional) -> Name of the SQL catalog. Default is sql.

  • sql_catalog_conn (required for SQL) -> Name of a Sling database connection to use as catalog backend.

  • sql_catalog_init (optional) -> Whether to initialize catalog tables if they don't exist. Default is true.

Storage Configuration for Glue Catalog or S3Tables or SQL Catalog

For S3/S3-compatible storage:

  • s3_access_key_id (optional) -> AWS access key ID

  • s3_secret_access_key (optional) -> AWS secret access key

  • s3_session_token (optional) -> AWS session token

  • s3_region (optional) -> AWS region

  • s3_profile (optional) -> AWS profile to use

  • s3_endpoint (optional) -> S3-compatible endpoint URL (e.g. http://localhost:9000 for MinIO)

Using sling conns

Here are examples of setting a connection named ICEBERG. We must provide the type=iceberg property:

# REST catalog with local warehouse
$ sling conns set ICEBERG type=iceberg catalog_type=rest rest_uri=http://localhost:8181

# AWS S3 Tables via REST
$ sling conns set ICEBERG type=iceberg catalog_type=rest rest_warehouse="arn:aws:s3tables:us-east-1:123456789012:bucket/my-namespace" s3_profile=my-profile

# Cloudflare R2 Data Catalog via REST
$ sling conns set ICEBERG type=iceberg catalog_type=rest rest_uri="https://catalog.cloudflarestorage.com/xxxxxxxxx/warehouse" rest_warehouse="3fff6d86c73fcxxxxxxx4cb125f34927_warehouse" rest_token="<rest_token>"

# Glue catalog
$ sling conns set ICEBERG type=iceberg catalog_type=glue glue_warehouse=s3://my-bucket/glue-warehouse s3_access_key_id=AKIAIOSFODNN7EXAMPLE s3_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY s3_region=us-east-1

Environment Variable

# REST catalog local
export ICEBERG='{ 
  type: iceberg, 
  catalog_type: rest,
  rest_uri: "http://localhost:8181",
  rest_token: "<rest_token>",
  rest_warehouse: "<rest_warehouse>"
}'

# REST catalog with OAuth
export ICEBERG='{ 
  type: iceberg, 
  catalog_type: rest,
  rest_uri: "http://localhost:8181",
  rest_oauth_client_id: "my-client-id",
  rest_oauth_client_secret: "my-client-secret",
  rest_oauth_scope: "catalog:write",
  rest_oauth_server_uri: "https://auth.example.com/oauth2/token"
}'

# AWS S3 Tables
export ICEBERG_S3='{ 
  type: iceberg, 
  catalog_type: rest,
  rest_warehouse: "arn:aws:s3tables:us-east-1:123456789012:bucket/my-namespace",
  s3_access_key_id: "AKIAIOSFODNN7EXAMPLE",
  s3_secret_access_key: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
}'

# Iceberg with GCP backend (Apache Gravitino)
export ICEBERG_GCP='{ 
  type: iceberg
  catalog_name: iceberg_catalog
  catalog_type: rest
  rest_uri: http://endpoint.com:9001/iceberg
  rest_warehouse: s3://my-bucket/gravitino/gcs_as_s3
  s3_access_key_id: <s3_access_key_id>
  s3_secret_access_key: <s3_secret_access_key>
  s3_endpoint: https://myhost.storage.googleapis.com
  s3_region: auto
  schema: iceberg_schema
}'

# Glue catalog
export ICEBERG_GLUE='{ 
  type: iceberg, 
  catalog_type: glue,
  glue_warehouse: "s3://my-bucket/glue-warehouse",
  s3_region: "us-east-1",
  s3_profile: "default"
}'

# SQL catalog with PostgreSQL
export ICEBERG='{ 
  type: iceberg
  catalog_type: sql
  sql_catalog_conn: "POSTGRES_DB"
  sql_catalog_name: "iceberg_catalog"
  sql_catalog_init: true
  sql_warehouse: s3://my-bucket/iceberg/warehouse
  s3_access_key_id: AKIAIOSFODNN7EXAMPLE
  s3_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  s3_region: us-east-1
}'

Sling Env File YAML

See here to learn more about the sling env.yaml file.

connections:
  ICEBERG:
    type: iceberg
    catalog_type: rest  # or glue, or sql
    catalog_name: iceberg
    schema: main
    
    # REST Catalog Configuration
    rest_uri: http://localhost:8181
    rest_warehouse: s3://my-bucket/warehouse
    rest_token: my-bearer-token
    rest_oauth_client_id: my-client-id
    rest_oauth_client_secret: my-client-secret
    rest_oauth_scope: catalog:write
    rest_oauth_server_uri: https://auth.example.com/oauth2/token
    
    # Glue Catalog Configuration
    catalog_type: glue
    glue_warehouse: s3://my-bucket/glue-warehouse
    s3_access_key_id: AKIAIOSFODNN7EXAMPLE
    s3_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    s3_region: us-east-1
    
    # SQL Catalog Configuration (alternative to REST/Glue)
    # sql_catalog_conn: POSTGRES_DB
    # sql_catalog_name: iceberg_catalog
    # sql_catalog_init: true
    
    # S3 Configuration (if using S3 storage)
    s3_access_key_id: AKIAIOSFODNN7EXAMPLE
    s3_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    s3_region: us-east-1
    s3_endpoint: http://localhost:9000  # for MinIO or other S3-compatible

  # Example SQL catalog setup with PostgreSQL backend
  ICEBERG_SQL:
    type: iceberg
    catalog_type: sql
    sql_catalog_conn: POSTGRES_CATALOG  # defined below
    sql_catalog_name: iceberg_catalog
    sql_catalog_init: true
    schema: main

    # S3 Configuration for storage
    sql_warehouse: s3://my-bucket/iceberg/warehouse
    s3_access_key_id: AKIAIOSFODNN7EXAMPLE
    s3_secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    s3_region: us-east-1

  # PostgreSQL connection for SQL catalog backend
  POSTGRES_CATALOG:
    type: postgres
    host: localhost
    port: 5432
    user: postgres
    password: mypassword
    database: iceberg_catalog

Common Usage Examples

Basic Operations

# List namespaces (schemas)
sling conns discover ICEBERG

# List tables in a namespace
sling conns discover ICEBERG --schema my_namespace

# Query data
sling run --src-conn ICEBERG --src-stream "SELECT * FROM my_namespace.orders LIMIT 10" --stdout

# Export to CSV
sling run --src-conn ICEBERG --src-stream my_namespace.orders --tgt-object file://./orders.csv

Data Import/Export

# Import CSV to Iceberg
sling run --src-stream file://./data.csv --tgt-conn ICEBERG --tgt-object my_namespace.new_data

# Import from PostgreSQL
sling run --src-conn POSTGRES_DB --src-stream public.customers --tgt-conn ICEBERG --tgt-object sales.customers

# Export to Parquet files
sling run --src-conn ICEBERG --src-stream sales.orders --tgt-conn AWS_S3 --tgt-object s3://bucket/exports/orders.parquet

# Incremental sync with timestamp
sling run --src-conn POSTGRES_DB --src-stream public.events --tgt-conn ICEBERG --tgt-object events.raw --mode incremental --primary-key id --update-key updated_at

Advanced Queries with DuckDB Integration

# For complex SQL queries, Sling uses DuckDB with Iceberg extension
# Tables must be qualified with "iceberg_catalog" prefix
# Note: Custom SQL queries via DuckDB are only supported with REST and Glue catalog types
sling run --src-conn ICEBERG --src-stream "SELECT count(*) FROM iceberg_catalog.my_namespace.my_table" --stdout

# Join multiple Iceberg tables
sling run --src-conn ICEBERG --src-stream "
  SELECT o.*, c.customer_name 
  FROM iceberg_catalog.sales.orders o 
  JOIN iceberg_catalog.sales.customers c ON o.customer_id = c.id
" --stdout

# Export with custom SQL
sling run --src-conn ICEBERG --src-stream "
  SELECT date_trunc('month', order_date) as month, sum(amount) as total
  FROM iceberg_catalog.sales.orders 
  GROUP BY 1 ORDER BY 1
" --tgt-object file://./monthly_sales.csv

If you are facing issues connecting, please reach out to us at [email protected], on discord or open a Github Issue here.

Last updated

Was this helpful?