Backfill mode allows you to load historical data within a specific range based on an update key. This is useful when you need to reload data for a particular time period or range of values.
## Basic Backfill
Basic backfill requires:

- A primary key (`primary_key`) to uniquely identify records
- An update key (`update_key`) to determine the range
- A range specification (`range`) defining the start and end values

For example:
```yaml
source: MY_SOURCE_DB
target: MY_TARGET_DB

defaults:
  mode: backfill
  primary_key: [id]
  object: target_schema.{stream_table}

streams:
  my_schema.orders:
    update_key: order_date
    source_options:
      range: '2023-01-01,2023-12-31'

  my_schema.transactions:
    update_key: id # same as primary key
    source_options:
      range: '100000,200000'
```
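Conceptually, the `range` bounds act as an inclusive filter on the update key when Sling reads from the source. The Python sketch below illustrates that idea only; it is not Sling's actual implementation, and the generated SQL fragment is a simplification (for instance, numeric keys would not be quoted):

```python
def backfill_filter(update_key: str, range_spec: str) -> str:
    """Turn a 'start,end' range spec into the inclusive filter it implies."""
    start, end = (part.strip() for part in range_spec.split(",", 1))
    return f"{update_key} >= '{start}' AND {update_key} <= '{end}'"

# The my_schema.orders stream above would be read with a filter like:
print(backfill_filter("order_date", "2023-01-01,2023-12-31"))
# order_date >= '2023-01-01' AND order_date <= '2023-12-31'
```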
## Chunked Backfill
### Supported Chunk Types
You can define the chunk size using the `chunk_size` source option. It accepts the following values:
**Time-based chunks:**

- Hours: e.g., `6h`
- Days: e.g., `7d`
- Months: e.g., `1m`

**Numeric chunks:**

- Integer ranges: e.g., `1000` for chunks of 1000 records
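To make these formats concrete, here is a small illustrative parser for the accepted tokens. This is a sketch for explanation only, not code from Sling, and the 30-day month approximation is this sketch's own shortcut:

```python
from datetime import timedelta

def parse_chunk_size(value) -> "timedelta | int":
    """Interpret a chunk_size value: '6h'/'7d'/'1m' (time-based) or an int (numeric)."""
    if isinstance(value, int):             # numeric chunks, e.g. 200 IDs per chunk
        return value
    count, unit = int(value[:-1]), value[-1]
    if unit == "h":
        return timedelta(hours=count)      # e.g. '6h'
    if unit == "d":
        return timedelta(days=count)       # e.g. '7d'
    if unit == "m":
        return timedelta(days=30 * count)  # e.g. '1m'; months approximated as 30 days
    raise ValueError(f"unsupported chunk_size: {value!r}")
```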
### Example Configuration
```yaml
# replication.yaml
source: postgres
target: oracle

defaults:
  mode: backfill
  object: oracle.{stream_table_lower}
  primary_key: [id]

streams:
  public.orders_mariadb_pg:
    update_key: update_dt
    source_options:
      range: '2018-11-01,2018-12-01'
      chunk_size: 10d # Process in 10-day chunks

  public.orders_sqlserver_pg:
    update_key: date
    source_options:
      range: '2019-01-01,2019-06-01'
      chunk_size: 2m # Process in 2-month chunks

  public.orders_snowflake_pg:
    update_key: id # same as primary key
    source_options:
      range: '1,800'
      chunk_size: 200 # Process in chunks of 200 IDs

env:
  SLING_THREADS: 3 # Process 3 streams concurrently
```
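Once saved, the replication runs as usual, e.g. `sling run -r replication.yaml`.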
### How Chunking Works
Sling automatically splits a stream's specified range into smaller sub-ranges, based on the `chunk_size`. Each chunk is processed independently, allowing for:

- Easier recovery if errors occur
- Progress tracking per chunk
For example, consider the `public.orders_snowflake_pg` stream below:
```yaml
public.orders_snowflake_pg:
  primary_key: [id]
  update_key: id # same as integer primary key
  source_options:
    range: '1,800'
    chunk_size: 200 # Process in chunks of 200 IDs
```
Once compiled, Sling will generate streams with a numeric suffix (e.g. `_001`), each covering its own slice of the range:
```yaml
public.orders_snowflake_pg_001:
  primary_key: [id]
  update_key: id
  source_options:
    range: '1,200'

public.orders_snowflake_pg_002:
  primary_key: [id]
  update_key: id
  source_options:
    range: '201,400'

public.orders_snowflake_pg_003:
  primary_key: [id]
  update_key: id
  source_options:
    range: '401,600'

public.orders_snowflake_pg_004:
  primary_key: [id]
  update_key: id
  source_options:
    range: '601,800'
```
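The boundary arithmetic behind this expansion can be sketched as follows. This is an illustration assuming inclusive bounds, not Sling's source code:

```python
def split_numeric_range(start: int, end: int, chunk: int) -> list[tuple[int, int]]:
    """Split an inclusive [start, end] ID range into chunk-sized sub-ranges."""
    sub_ranges, lo = [], start
    while lo <= end:
        hi = min(lo + chunk - 1, end)  # last chunk may be smaller
        sub_ranges.append((lo, hi))
        lo = hi + 1
    return sub_ranges

# range '1,800' with chunk_size 200 -> the four compiled streams above
print(split_numeric_range(1, 800, 200))
# [(1, 200), (201, 400), (401, 600), (601, 800)]
```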
For large datasets, chunked backfill breaks the data into smaller, manageable pieces, which helps optimize performance and resource usage. It is especially useful when combined with the `SLING_THREADS` environment variable for parallel execution, as the sketch below illustrates.
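Here is a toy model of chunks running concurrently; it is not Sling's scheduler, and `load_chunk` is a hypothetical placeholder for moving one chunk's rows from source to target:

```python
from concurrent.futures import ThreadPoolExecutor

def load_chunk(sub_range: tuple[int, int]) -> str:
    """Hypothetical stand-in for loading one chunk from source to target."""
    lo, hi = sub_range
    return f"loaded ids {lo}-{hi}"

chunks = [(1, 200), (201, 400), (401, 600), (601, 800)]

# SLING_THREADS: 3 -> up to three chunked streams in flight at once
with ThreadPoolExecutor(max_workers=3) as pool:
    for result in pool.map(load_chunk, chunks):
        print(result)
```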