Backfill

Backfill mode loads historical data within a specific range of an update key. This is useful when you need to reload data for a particular time period or range of values.

Basic Backfill

Basic backfill requires:

  • A primary key to uniquely identify records

  • An update key to determine the range

  • A range specification defining the start and end values

Using CLI Flags

$ sling run --src-conn MY_SOURCE_DB \
  --src-stream 'my_schema.orders' \
  --tgt-conn MY_TARGET_DB \
  --tgt-object 'target_schema.orders' \
  --mode backfill \
  --primary-key order_id \
  --update-key order_date \
  --range '2023-01-01,2023-12-31'
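When this runs, Sling reads only the rows whose update key falls inside the given range. Conceptually (the exact query Sling generates may differ), the extraction above behaves like a filter such as:

  order_date >= '2023-01-01' AND order_date <= '2023-12-31'

Both bounds come from the --range value, and the primary key (order_id) is then used to merge the extracted rows into the target table.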

Using Replication Config

replication.yaml
source: MY_SOURCE_DB
target: MY_TARGET_DB

defaults:
  mode: backfill
  primary_key: [id]
  object: target_schema.{stream_table}

streams:
  my_schema.orders:
    update_key: order_date
    source_options:
      range: '2023-01-01,2023-12-31'

  my_schema.transactions:
    update_key: id # same as primary key
    source_options:
      range: '100000,200000'
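Once the config is saved, the replication can be executed with the sling CLI; for example (the --streams flag limits the run to specific streams, if you only want to backfill one of them):

$ sling run -r replication.yaml

$ sling run -r replication.yaml --streams my_schema.orders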

Backfill Chunking

Backfill mode with chunking splits the backfill range into smaller sub-ranges that are loaded one at a time, which keeps each load manageable for large datasets. See the chunking documentation for details.

source: postgres
target: oracle

defaults:
  mode: backfill
  object: oracle.{stream_table_lower}
  primary_key: [id]

streams:
  public.orders_mariadb_pg:
    update_key: update_dt
    source_options:
      range: '2018-11-01,2018-12-01'
      chunk_size: 10d  # Process in 10-day chunks

  public.orders_sqlserver_pg:
    update_key: date
    source_options:
      range: '2019-01-01,2019-06-01'
      chunk_size: 2m   # Process in 2-month chunks

  public.orders_snowflake_pg:
    update_key: id  # same as primary key
    source_options:
      range: '1,800'
      chunk_size: 200  # Process in chunks of 200 IDs

  public.large_table_pg:
    update_key: created_at
    source_options:
      range: '2023-01-01,2023-12-31'
      chunk_count: 6  # Split into 6 equal time-based chunks

  public.user_data_pg:
    update_key: user_id
    source_options:
      range: '1000,50000'
      chunk_count: 10  # Split into 10 equal numeric chunks

env:
  SLING_THREADS: 3  # Process 3 streams concurrently
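As a rough illustration of how chunking subdivides a range (assuming chunks are cut sequentially from the start of the range), the public.orders_mariadb_pg stream above would be processed in three passes:

  2018-11-01 → 2018-11-11
  2018-11-11 → 2018-11-21
  2018-11-21 → 2018-12-01

Similarly, chunk_count: 6 on public.large_table_pg splits the 2023 range into six roughly two-month sub-ranges. With SLING_THREADS set to 3, up to three streams are processed concurrently.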
