Backfill

Backfill mode allows you to load historical data within a specific range based on an update key. This is useful when you need to reload data for a particular time period or range of values.

Basic Backfill

Basic backfill requires:

  • A primary key to uniquely identify records

  • An update key to determine the range

  • A range specification defining the start and end values

Using CLI Flags

$ sling run --src-conn MY_SOURCE_DB \
  --src-stream 'my_schema.orders' \
  --tgt-conn MY_TARGET_DB \
  --tgt-object 'target_schema.orders' \
  --mode backfill \
  --primary-key order_id \
  --update-key order_date \
  --range '2023-01-01,2023-12-31'
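
The range can also be numeric when the update key is an integer column. Here is a sketch mirroring the transactions stream from the replication example below:

$ sling run --src-conn MY_SOURCE_DB \
  --src-stream 'my_schema.transactions' \
  --tgt-conn MY_TARGET_DB \
  --tgt-object 'target_schema.transactions' \
  --mode backfill \
  --primary-key id \
  --update-key id \
  --range '100000,200000'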

Using Replication Config

replication.yaml
source: MY_SOURCE_DB
target: MY_TARGET_DB

defaults:
  mode: backfill
  primary_key: [id]
  object: target_schema.{stream_table}

streams:
  my_schema.orders:
    update_key: order_date
    source_options:
      range: '2023-01-01,2023-12-31'

  my_schema.transactions:
    update_key: id # same as primary key
    source_options:
      range: '100000,200000'
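
To run this replication (a minimal sketch, assuming the config is saved as replication.yaml):

$ sling run -r replication.yaml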

Chunked Backfill

For large datasets, chunked backfill breaks the data down into smaller, manageable pieces. This helps optimize performance and resource usage. Chunking is available with Sling CLI Pro or on a Platform plan. It is especially useful when combined with the SLING_THREADS environment variable for parallel execution, as shown below.
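
For example, the thread count can be set when invoking the CLI (a minimal sketch; SLING_THREADS can also be set in the replication file's env section, as shown further below):

$ SLING_THREADS=5 sling run -r replication.yaml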

Supported Chunk Types

Define the chunk size with the chunk_size source option. It accepts the following types of values:

  1. Time-based chunks:

    • Hours: e.g., 6h

    • Days: e.g., 7d

    • Months: e.g., 1m

  2. Numeric chunks:

    • Integer ranges: e.g., 1000 for chunks of 1000 records

Example Configuration

replication.yaml
source: postgres
target: oracle

defaults:
  mode: backfill
  object: oracle.{stream_table_lower}
  primary_key: [id]

streams:
  public.orders_mariadb_pg:
    update_key: update_dt
    source_options:
      range: '2018-11-01,2018-12-01'
      chunk_size: 10d  # Process in 10-day chunks

  public.orders_sqlserver_pg:
    update_key: date
    source_options:
      range: '2019-01-01,2019-06-01'
      chunk_size: 2m   # Process in 2-month chunks

  public.orders_snowflake_pg:
    update_key: id  # same as primary key
    source_options:
      range: '1,800'
      chunk_size: 200  # Process in chunks of 200 IDs

env:
  SLING_THREADS: 3  # Process 3 streams concurrently

How Chunking Works

Sling automatically splits a stream's specified range into smaller sub-ranges, based on the chunk_size.

Each chunk is processed independently, allowing for:

  • Parallel processing capabilities (with SLING_THREADS)

  • Easier recovery if errors occur

  • Progress tracking per chunk

For example, consider the public.orders_snowflake_pg stream below:

public.orders_snowflake_pg:
  primary_key: [id]
  update_key: id     # Same as integer primary key
  source_options:
    range: '1,800'
    chunk_size: 200  # Process in chunks of 200 IDs

Once compiled, Sling will generate the following streams with a numeric suffix (e.g. _001):

public.orders_snowflake_pg_001:
  source_options:
    range: '1,200'

public.orders_snowflake_pg_002:
  source_options:
    range: '201,400'

public.orders_snowflake_pg_003:
  source_options:
    range: '401,600'

public.orders_snowflake_pg_004:
  source_options:
    range: '601,800'

For a time-based example:

public.orders_sqlserver_pg:
  primary_key: [id]
  update_key: date
  source_options:
    range: '2019-01-01,2019-06-01'
    chunk_size: 2m   # Process in 2-month chunks

This will yield the following streams (note the final chunk is shorter, since the five-month range is not an even multiple of the two-month chunk size):

public.orders_sqlserver_pg_001:
  source_options:
    range: '2019-01-01,2019-03-01'

public.orders_sqlserver_pg_002:
  source_options:
    range: '2019-03-01,2019-05-01'

public.orders_sqlserver_pg_003:
  source_options:
    range: '2019-05-01,2019-06-01'
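
Hour-based chunks from the list above work the same way. Here is a sketch using a hypothetical public.events stream, splitting a one-day range into 6-hour chunks:

public.events:
  update_key: event_ts
  source_options:
    range: '2023-01-01,2023-01-02'
    chunk_size: 6h  # Process in 6-hour chunks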
