Backfill
Backfill mode allows you to load historical data within a specific range based on an update key. This is useful when you need to reload data for a particular time period or range of values.
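In practice, the range is applied as a window on the update key: only rows whose update key falls between the two comma-separated bounds are read from the source. As a rough illustration (the exact SQL Sling generates depends on the source connector), a range of '2023-01-01,2023-12-31' on an order_date update key corresponds to a filter along the lines of:

where order_date >= '2023-01-01'
  and order_date <= '2023-12-31'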
Basic Backfill
Basic backfill requires:
A primary key to uniquely identify records
An update key to determine the range
A range specification defining the start and end values
Using CLI Flags
$ sling run --src-conn MY_SOURCE_DB \
--src-stream 'my_schema.orders' \
--tgt-conn MY_TARGET_DB \
--tgt-object 'target_schema.orders' \
--mode backfill \
--primary-key order_id \
--update-key order_date \
--range '2023-01-01,2023-12-31'

Using Replication Config
source: MY_SOURCE_DB
target: MY_TARGET_DB

defaults:
  mode: backfill
  primary_key: [id]
  object: target_schema.{stream_table}

streams:
  my_schema.orders:
    update_key: order_date
    source_options:
      range: '2023-01-01,2023-12-31'

  my_schema.transactions:
    update_key: id # same as primary key
    source_options:
      range: '100000,200000'

Backfill Chunking
Backfill mode with chunking splits the backfill range into smaller sub-ranges, which is useful for large datasets. See the chunking documentation for details.
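For example, the first stream below pairs a range of '2018-11-01,2018-12-01' with chunk_size: 10d, so the month is extracted as roughly three consecutive ~10-day sub-ranges rather than one large query (the exact chunk boundaries are determined by Sling). chunk_count works the other way around, dividing the range into a fixed number of equal chunks.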
source: postgres
target: oracle

defaults:
  mode: backfill
  object: oracle.{stream_table_lower}
  primary_key: [id]

streams:
  public.orders_mariadb_pg:
    update_key: update_dt
    source_options:
      range: '2018-11-01,2018-12-01'
      chunk_size: 10d # Process in 10-day chunks

  public.orders_sqlserver_pg:
    update_key: date
    source_options:
      range: '2019-01-01,2019-06-01'
      chunk_size: 2m # Process in 2-month chunks

  public.orders_snowflake_pg:
    update_key: id # same as primary key
    source_options:
      range: '1,800'
      chunk_size: 200 # Process in chunks of 200 IDs

  public.large_table_pg:
    update_key: created_at
    source_options:
      range: '2023-01-01,2023-12-31'
      chunk_count: 6 # Split into 6 equal time-based chunks

  public.user_data_pg:
    update_key: user_id
    source_options:
      range: '1000,50000'
      chunk_count: 10 # Split into 10 equal numeric chunks

env:
  SLING_THREADS: 3 # Process 3 streams concurrently
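Assuming the configuration above is saved to a file such as backfill_chunking.yaml (the file name here is just a placeholder), it can be executed with the sling CLI's replication flag:

$ sling run -r backfill_chunking.yaml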