Backfill

Examples of backfilling historical data from APIs to databases using range parameters

Backfill mode allows you to load historical data from APIs within a specific date or value range. This is useful when you need to reload data for a particular time period or catch up on historical data.

Learn more: Incremental Sync | Iteration

How API Backfill Works

API backfill uses the range parameter in combination with context.range_start and context.range_end to iterate over date or value ranges:

  1. Define a range in your replication config using source_options.range

  2. Access the range values in your spec using context.range_start and context.range_end

  3. Use these values with the range() function to generate iterations

  4. Each iteration fetches data for that specific date or value

Note: API backfill does not support chunking (like database backfill does). The range is processed through iteration.

Date Range Backfill

This is the most common use case - backfilling API data for a specific date range.

Spec File (analytics_api.yaml)

analytics_api.yaml
name: "Analytics API"

defaults:
  state:
    base_url: https://api.analytics.com/v1
  request:
    headers:
      Authorization: "Bearer {secrets.api_key}"

endpoints:
  daily_events:
    description: "Get daily event data for a date range"

    # Persist the last processed date for incremental runs
    sync: [last_date]

    iterate:
      # Generate date range using context values from replication config
      over: >
        range(
          coalesce(context.range_start, sync.last_date, date_format(date_add(now(), -7, "day"), "%Y-%m-%d")),
          coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
          "1d"
        )
      into: "state.current_date"
      concurrency: 5  # Process 5 dates concurrently

    state:
      # Format the date for the API request
      date: '{date_format(state.current_date, "%Y-%m-%d")}'

    request:
      url: "{state.base_url}/events/daily/{state.date}"
      method: GET

    response:
      records:
        jmespath: "data.events[]"
        primary_key: ["event_id"]

      processors:
        # Track the latest date processed for incremental sync
        - expression: "state.date"
          output: "state.last_date"
          aggregation: "maximum"

    overrides:
      mode: incremental  # Use incremental mode for proper upserting

Using Replication

Running with Sling: sling run -r /path/to/replication.yaml

Tip: The spec uses overrides.mode: incremental to ensure data is properly upserted based on primary_key, while the replication config can use mode: full-refresh or leave it unset.


Using Python

Month Range Backfill

For APIs that accept month-level granularity, you can iterate by month.

Spec File (sales_api.yaml)


Using Replication

Numeric ID Range Backfill

For APIs that support ID-based pagination or filtering, you can backfill by ID range.

Spec File (orders_api.yaml)


Using Replication

Multiple Endpoints with Different Ranges

You can backfill multiple endpoints with different date ranges in a single replication.

Open-Ended Range

You can specify only the start of a range, leaving the end open to process up to the current date.

In your spec, handle the open end with a default:

Best Practices

1. Always Use Sync State

Persist the last processed value so subsequent runs can pick up where you left off:

2. Set Appropriate Concurrency

Balance API rate limits with performance:

3. Use Incremental Mode Override

Even when backfilling, use incremental mode to properly handle upserts:

4. Handle API Limits Gracefully

Add retry logic for rate limits and errors:

5. Provide Reasonable Defaults

Use coalesce() to provide sensible defaults when range is not specified:

Last updated

Was this helpful?