Backfill mode allows you to load historical data within a specific range based on an update key. This is useful when you need to reload data for a particular time period or range of values.
Basic Backfill
Basic backfill requires:

- A primary key to uniquely identify records
- An update key to determine the range
- A range specification defining the start and end values
```yaml
source: MY_SOURCE_DB
target: MY_TARGET_DB

defaults:
  mode: backfill
  primary_key: [id]
  object: target_schema.{stream_table}

streams:
  my_schema.orders:
    update_key: order_date
    source_options:
      range: '2023-01-01,2023-12-31'

  my_schema.transactions:
    update_key: id # same as primary key
    source_options:
      range: '100000,200000'
```
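With this configuration saved to a file, the replication can be executed with the Sling CLI (the backfill.yaml filename here is illustrative):

```bash
# Run the backfill replication defined above
sling run -r backfill.yaml
```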
Chunked Backfill
For large datasets, chunked backfill breaks the data into smaller, more manageable pieces, which helps optimize performance and resource usage. Chunking is available with Sling CLI Pro or on a Platform plan, and it is especially useful when combined with the SLING_THREADS environment variable for parallel execution.
Supported Chunk Types
We can define the chunk size using the chunk_size source option. It accepts the following values:

- Time-based chunks:
  - Hours: e.g., 6h
  - Days: e.g., 7d
  - Months: e.g., 1m
- Numeric chunks:
  - Integer ranges: e.g., 1000 for chunks of 1000 records
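For instance, a one-year date range with a monthly chunk size would be split into twelve chunks (a minimal sketch; the stream and column names are illustrative):

```yaml
my_schema.events:
  update_key: event_date
  source_options:
    range: '2023-01-01,2024-01-01'
    chunk_size: 1m # one-year range / 1-month chunks = 12 chunks
```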
Example Configuration
replication.yaml
```yaml
source: postgres
target: oracle

defaults:
  mode: backfill
  object: oracle.{stream_table_lower}
  primary_key: [id]

streams:
  public.orders_mariadb_pg:
    update_key: update_dt
    source_options:
      range: '2018-11-01,2018-12-01'
      chunk_size: 10d # Process in 10-day chunks

  public.orders_sqlserver_pg:
    update_key: date
    source_options:
      range: '2019-01-01,2019-06-01'
      chunk_size: 2m # Process in 2-month chunks

  public.orders_snowflake_pg:
    update_key: id # same as primary key
    source_options:
      range: '1,800'
      chunk_size: 200 # Process in chunks of 200 IDs

env:
  SLING_THREADS: 3 # Process 3 streams concurrently
```
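Since SLING_THREADS is a regular environment variable, it can also be set in the shell rather than in the replication's env section:

```bash
# Process 3 streams concurrently (equivalent to the env section above)
SLING_THREADS=3 sling run -r replication.yaml
```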
How Chunking Works
Sling automatically splits a stream with the specified range into smaller sub-ranges, based on the chunk_size.
Each chunk is then processed independently, which allows chunks to run in parallel (when combined with SLING_THREADS) and keeps each unit of work small.
For example, consider the public.orders_snowflake_pg stream below:
```yaml
public.orders_snowflake_pg:
  primary_key: [id]
  update_key: id # Same as integer primary key
  source_options:
    range: '1,800'
    chunk_size: 200 # Process in chunks of 200 IDs
```
Once compiled, Sling will generate one stream per chunk, each with a numeric suffix (e.g. _001) and its own slice of the range, along these lines (exact boundary values are illustrative):

```yaml
public.orders_snowflake_pg_001:
  primary_key: [id]
  update_key: id
  source_options:
    range: '1,200'

public.orders_snowflake_pg_002:
  primary_key: [id]
  update_key: id
  source_options:
    range: '201,400'

public.orders_snowflake_pg_003:
  primary_key: [id]
  update_key: id
  source_options:
    range: '401,600'

public.orders_snowflake_pg_004:
  primary_key: [id]
  update_key: id
  source_options:
    range: '601,800'
```