Backfill
Examples of backfilling historical data from APIs to databases using range parameters
Backfill mode allows you to load historical data from APIs within a specific date or value range. This is useful when you need to reload data for a particular time period or catch up on historical data.
Learn more: Incremental Sync | Iteration
How API Backfill Works
API backfill uses the range parameter in combination with context.range_start and context.range_end to iterate over date or value ranges:
Define a range in your replication config using
source_options.rangeAccess the range values in your spec using
context.range_startandcontext.range_endUse these values with the
range()function to generate iterationsEach iteration fetches data for that specific date or value
Note: API backfill does not support chunking (like database backfill does). The range is processed through iteration.
Date Range Backfill
This is the most common use case - backfilling API data for a specific date range.
Spec File (analytics_api.yaml)
name: "Analytics API"
defaults:
state:
base_url: https://api.analytics.com/v1
request:
headers:
Authorization: "Bearer {secrets.api_key}"
endpoints:
daily_events:
description: "Get daily event data for a date range"
# Persist the last processed date for incremental runs
sync: [last_date]
iterate:
# Generate date range using context values from replication config
over: >
range(
coalesce(context.range_start, sync.last_date, date_format(date_add(now(), -7, "day"), "%Y-%m-%d")),
coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
"1d"
)
into: "state.current_date"
concurrency: 5 # Process 5 dates concurrently
state:
# Format the date for the API request
date: '{date_format(state.current_date, "%Y-%m-%d")}'
request:
url: "{state.base_url}/events/daily/{state.date}"
method: GET
response:
records:
jmespath: "data.events[]"
primary_key: ["event_id"]
processors:
# Track the latest date processed for incremental sync
- expression: "state.date"
output: "state.last_date"
aggregation: "maximum"
overrides:
mode: incremental # Use incremental mode for proper upsertingUsing Replication
Running with Sling: sling run -r /path/to/replication.yaml
source: MY_API
target: MY_TARGET_DB
defaults:
mode: full-refresh
streams:
daily_events:
object: analytics.events
source_options:
# Backfill data for January 2024
range: '2024-01-01,2024-01-31'Tip: The spec uses
overrides.mode: incrementalto ensure data is properly upserted based onprimary_key, while the replication config can usemode: full-refreshor leave it unset.
Using Python
from sling import Replication, ReplicationStream
replication = Replication(
source='MY_API',
target='MY_TARGET_DB',
streams={
'daily_events': ReplicationStream(
object='analytics.events',
source_options={
'range': '2024-01-01,2024-01-31' # Backfill January 2024
}
)
}
)
replication.run()Month Range Backfill
For APIs that accept month-level granularity, you can iterate by month.
Spec File (sales_api.yaml)
name: "Sales API"
defaults:
state:
base_url: https://api.sales.com/v2
request:
headers:
X-API-Key: "{secrets.api_key}"
endpoints:
monthly_sales:
description: "Get monthly sales data"
sync: [last_month]
iterate:
# Generate monthly range
over: >
range(
coalesce(context.range_start, sync.last_month, date_format(date_add(now(), -6, "month"), "%Y-%m-01")),
coalesce(context.range_end, date_format(now(), "%Y-%m-01")),
"1M"
)
into: "state.current_month"
concurrency: 3
state:
# Format as YYYY-MM for the API
month: '{date_format(state.current_month, "%Y-%m")}'
request:
url: "{state.base_url}/sales/summary"
parameters:
month: "{state.month}"
response:
records:
jmespath: "sales[]"
primary_key: ["sale_id"]
processors:
- expression: "state.month"
output: "state.last_month"
aggregation: "maximum"
overrides:
mode: incrementalUsing Replication
source: MY_API
target: MY_TARGET_DB
streams:
monthly_sales:
object: sales.monthly_summary
source_options:
# Backfill 6 months of data
range: '2023-07-01,2023-12-31'Numeric ID Range Backfill
For APIs that support ID-based pagination or filtering, you can backfill by ID range.
Spec File (orders_api.yaml)
name: "Orders API"
defaults:
state:
base_url: https://api.shop.com/v1
request:
headers:
Authorization: "Bearer {secrets.api_key}"
endpoints:
orders_by_id:
description: "Get orders by ID range"
sync: [last_order_id]
iterate:
# Generate numeric range (process 1000 IDs at a time)
over: >
range(
coalesce(context.range_start, sync.last_order_id, "1"),
coalesce(context.range_end, "999999"),
"1000"
)
into: "state.start_id"
concurrency: 5
state:
# Calculate end ID for this batch
end_id: '{to_int(state.start_id) + 999}'
request:
url: "{state.base_url}/orders"
parameters:
id_min: "{state.start_id}"
id_max: "{state.end_id}"
response:
records:
jmespath: "orders[]"
primary_key: ["order_id"]
processors:
# Track the highest order ID processed
- expression: "record.order_id"
output: "state.last_order_id"
aggregation: "maximum"
overrides:
mode: incrementalUsing Replication
source: MY_API
target: MY_TARGET_DB
streams:
orders_by_id:
object: ecommerce.orders
source_options:
# Backfill orders with IDs from 10000 to 50000
range: '10000,50000'Multiple Endpoints with Different Ranges
You can backfill multiple endpoints with different date ranges in a single replication.
source: MY_API
target: MY_TARGET_DB
streams:
# Backfill events for last month
daily_events:
object: analytics.events
source_options:
range: '2024-01-01,2024-01-31'
# Backfill sales for last quarter
monthly_sales:
object: sales.monthly_summary
source_options:
range: '2023-10-01,2023-12-31'
# Backfill specific order ID range
orders_by_id:
object: ecommerce.orders
source_options:
range: '10000,50000'
env:
SLING_THREADS: 3 # Process 3 streams concurrentlyOpen-Ended Range
You can specify only the start of a range, leaving the end open to process up to the current date.
source: MY_API
target: MY_TARGET_DB
streams:
daily_events:
object: analytics.events
source_options:
# Backfill from 2024-01-01 to yesterday
range: '2024-01-01,'In your spec, handle the open end with a default:
iterate:
over: >
range(
coalesce(context.range_start, sync.last_date, "2024-01-01"),
coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
"1d"
)Best Practices
1. Always Use Sync State
Persist the last processed value so subsequent runs can pick up where you left off:
sync: [last_date]
processors:
- expression: "state.date"
output: "state.last_date"
aggregation: "maximum"2. Set Appropriate Concurrency
Balance API rate limits with performance:
iterate:
concurrency: 5 # Adjust based on API rate limits
request:
rate: 10 # Max requests per second3. Use Incremental Mode Override
Even when backfilling, use incremental mode to properly handle upserts:
overrides:
mode: incremental4. Handle API Limits Gracefully
Add retry logic for rate limits and errors:
response:
rules:
- condition: "response.status == 429"
action: retry
max_attempts: 5
backoff: exponential5. Provide Reasonable Defaults
Use coalesce() to provide sensible defaults when range is not specified:
over: >
range(
coalesce(context.range_start, sync.last_date, date_format(date_add(now(), -30, "day"), "%Y-%m-%d")),
coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
"1d"
)Last updated
Was this helpful?