Backfill
Examples of backfilling historical data from APIs to databases using range parameters
Backfill mode allows you to load historical data from APIs within a specific date or value range. This is useful when you need to reload data for a particular time period or catch up on historical data.
Learn more: Incremental Sync | Iteration
How API Backfill Works
API backfill uses the range parameter in combination with context.range_start and context.range_end to iterate over date or value ranges:
Define a range in your replication config using
source_options.rangeAccess the range values in your spec using
context.range_startandcontext.range_endUse these values with the
range()function to generate iterationsEach iteration fetches data for that specific date or value
Note: API backfill does not support chunking (like database backfill does). The range is processed through iteration.
Date Range Backfill
This is the most common use case - backfilling API data for a specific date range.
Spec File (analytics_api.yaml)
name: "Analytics API"
defaults:
state:
base_url: https://api.analytics.com/v1
request:
headers:
Authorization: "Bearer {secrets.api_key}"
endpoints:
daily_events:
description: "Get daily event data for a date range"
# Persist the last processed date for incremental runs
sync: [last_date]
iterate:
# Generate date range using context values from replication config
over: >
range(
coalesce(context.range_start, sync.last_date, date_format(date_add(now(), -7, "day"), "%Y-%m-%d")),
coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
"1d"
)
into: "state.current_date"
concurrency: 5 # Process 5 dates concurrently
state:
# Format the date for the API request
date: '{date_format(state.current_date, "%Y-%m-%d")}'
request:
url: "{state.base_url}/events/daily/{state.date}"
method: GET
response:
records:
jmespath: "data.events[]"
primary_key: ["event_id"]
processors:
# Track the latest date processed for incremental sync
- expression: "state.date"
output: "state.last_date"
aggregation: "maximum"
overrides:
mode: incremental # Use incremental mode for proper upsertingUsing Replication
Running with Sling: sling run -r /path/to/replication.yaml
Tip: The spec uses
overrides.mode: incrementalto ensure data is properly upserted based onprimary_key, while the replication config can usemode: full-refreshor leave it unset.
Using Python
Month Range Backfill
For APIs that accept month-level granularity, you can iterate by month.
Spec File (sales_api.yaml)
Using Replication
Numeric ID Range Backfill
For APIs that support ID-based pagination or filtering, you can backfill by ID range.
Spec File (orders_api.yaml)
Using Replication
Multiple Endpoints with Different Ranges
You can backfill multiple endpoints with different date ranges in a single replication.
Open-Ended Range
You can specify only the start of a range, leaving the end open to process up to the current date.
In your spec, handle the open end with a default:
Best Practices
1. Always Use Sync State
Persist the last processed value so subsequent runs can pick up where you left off:
2. Set Appropriate Concurrency
Balance API rate limits with performance:
3. Use Incremental Mode Override
Even when backfilling, use incremental mode to properly handle upserts:
4. Handle API Limits Gracefully
Add retry logic for rate limits and errors:
5. Provide Reasonable Defaults
Use coalesce() to provide sensible defaults when range is not specified:
Last updated
Was this helpful?