Backfill

Examples of backfilling historical data from APIs to databases using range parameters

Backfill mode allows you to load historical data from APIs within a specific date or value range. This is useful when you need to reload data for a particular time period or catch up on historical data.

Learn more: Incremental Sync | Iteration

How API Backfill Works

API backfill uses the range parameter in combination with context.range_start and context.range_end to iterate over date or value ranges:

  1. Define a range in your replication config using source_options.range

  2. Access the range values in your spec using context.range_start and context.range_end

  3. Use these values with the range() function to generate iterations

  4. Each iteration fetches data for that specific date or value

Note: API backfill does not support chunking (like database backfill does). The range is processed through iteration.

Date Range Backfill

This is the most common use case - backfilling API data for a specific date range.

Spec File (analytics_api.yaml)

analytics_api.yaml
name: "Analytics API"

defaults:
  state:
    base_url: https://api.analytics.com/v1
  request:
    headers:
      Authorization: "Bearer {secrets.api_key}"

endpoints:
  daily_events:
    description: "Get daily event data for a date range"

    # Persist the last processed date for incremental runs
    sync: [last_date]

    iterate:
      # Generate date range using context values from replication config
      over: >
        range(
          coalesce(context.range_start, sync.last_date, date_format(date_add(now(), -7, "day"), "%Y-%m-%d")),
          coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
          "1d"
        )
      into: "state.current_date"
      concurrency: 5  # Process 5 dates concurrently

    state:
      # Format the date for the API request
      date: '{date_format(state.current_date, "%Y-%m-%d")}'

    request:
      url: "{state.base_url}/events/daily/{state.date}"
      method: GET

    response:
      records:
        jmespath: "data.events[]"
        primary_key: ["event_id"]

      processors:
        # Track the latest date processed for incremental sync
        - expression: "state.date"
          output: "state.last_date"
          aggregation: "maximum"

    overrides:
      mode: incremental  # Use incremental mode for proper upserting

Using Replication

Running with Sling: sling run -r /path/to/replication.yaml

replication.yaml
source: MY_API
target: MY_TARGET_DB

defaults:
  mode: full-refresh

streams:
  daily_events:
    object: analytics.events
    source_options:
      # Backfill data for January 2024
      range: '2024-01-01,2024-01-31'

Tip: The spec uses overrides.mode: incremental to ensure data is properly upserted based on primary_key, while the replication config can use mode: full-refresh or leave it unset.


Using Python

api_to_database.py
from sling import Replication, ReplicationStream

replication = Replication(
    source='MY_API',
    target='MY_TARGET_DB',
    streams={
        'daily_events': ReplicationStream(
            object='analytics.events',
            source_options={
                'range': '2024-01-01,2024-01-31'  # Backfill January 2024
            }
        )
    }
)

replication.run()

Month Range Backfill

For APIs that accept month-level granularity, you can iterate by month.

Spec File (sales_api.yaml)

sales_api.yaml
name: "Sales API"

defaults:
  state:
    base_url: https://api.sales.com/v2
  request:
    headers:
      X-API-Key: "{secrets.api_key}"

endpoints:
  monthly_sales:
    description: "Get monthly sales data"

    sync: [last_month]

    iterate:
      # Generate monthly range
      over: >
        range(
          coalesce(context.range_start, sync.last_month, date_format(date_add(now(), -6, "month"), "%Y-%m-01")),
          coalesce(context.range_end, date_format(now(), "%Y-%m-01")),
          "1M"
        )
      into: "state.current_month"
      concurrency: 3

    state:
      # Format as YYYY-MM for the API
      month: '{date_format(state.current_month, "%Y-%m")}'

    request:
      url: "{state.base_url}/sales/summary"
      parameters:
        month: "{state.month}"

    response:
      records:
        jmespath: "sales[]"
        primary_key: ["sale_id"]

      processors:
        - expression: "state.month"
          output: "state.last_month"
          aggregation: "maximum"

    overrides:
      mode: incremental

Using Replication

replication.yaml
source: MY_API
target: MY_TARGET_DB

streams:
  monthly_sales:
    object: sales.monthly_summary
    source_options:
      # Backfill 6 months of data
      range: '2023-07-01,2023-12-31'

Numeric ID Range Backfill

For APIs that support ID-based pagination or filtering, you can backfill by ID range.

Spec File (orders_api.yaml)

orders_api.yaml
name: "Orders API"

defaults:
  state:
    base_url: https://api.shop.com/v1
  request:
    headers:
      Authorization: "Bearer {secrets.api_key}"

endpoints:
  orders_by_id:
    description: "Get orders by ID range"

    sync: [last_order_id]

    iterate:
      # Generate numeric range (process 1000 IDs at a time)
      over: >
        range(
          coalesce(context.range_start, sync.last_order_id, "1"),
          coalesce(context.range_end, "999999"),
          "1000"
        )
      into: "state.start_id"
      concurrency: 5

    state:
      # Calculate end ID for this batch
      end_id: '{to_int(state.start_id) + 999}'

    request:
      url: "{state.base_url}/orders"
      parameters:
        id_min: "{state.start_id}"
        id_max: "{state.end_id}"

    response:
      records:
        jmespath: "orders[]"
        primary_key: ["order_id"]

      processors:
        # Track the highest order ID processed
        - expression: "record.order_id"
          output: "state.last_order_id"
          aggregation: "maximum"

    overrides:
      mode: incremental

Using Replication

replication.yaml
source: MY_API
target: MY_TARGET_DB

streams:
  orders_by_id:
    object: ecommerce.orders
    source_options:
      # Backfill orders with IDs from 10000 to 50000
      range: '10000,50000'

Multiple Endpoints with Different Ranges

You can backfill multiple endpoints with different date ranges in a single replication.

replication.yaml
source: MY_API
target: MY_TARGET_DB

streams:
  # Backfill events for last month
  daily_events:
    object: analytics.events
    source_options:
      range: '2024-01-01,2024-01-31'

  # Backfill sales for last quarter
  monthly_sales:
    object: sales.monthly_summary
    source_options:
      range: '2023-10-01,2023-12-31'

  # Backfill specific order ID range
  orders_by_id:
    object: ecommerce.orders
    source_options:
      range: '10000,50000'

env:
  SLING_THREADS: 3  # Process 3 streams concurrently

Open-Ended Range

You can specify only the start of a range, leaving the end open to process up to the current date.

replication.yaml
source: MY_API
target: MY_TARGET_DB

streams:
  daily_events:
    object: analytics.events
    source_options:
      # Backfill from 2024-01-01 to yesterday
      range: '2024-01-01,'

In your spec, handle the open end with a default:

iterate:
  over: >
    range(
      coalesce(context.range_start, sync.last_date, "2024-01-01"),
      coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
      "1d"
    )

Best Practices

1. Always Use Sync State

Persist the last processed value so subsequent runs can pick up where you left off:

sync: [last_date]

processors:
  - expression: "state.date"
    output: "state.last_date"
    aggregation: "maximum"

2. Set Appropriate Concurrency

Balance API rate limits with performance:

iterate:
  concurrency: 5  # Adjust based on API rate limits

request:
  rate: 10  # Max requests per second

3. Use Incremental Mode Override

Even when backfilling, use incremental mode to properly handle upserts:

overrides:
  mode: incremental

4. Handle API Limits Gracefully

Add retry logic for rate limits and errors:

response:
  rules:
    - condition: "response.status == 429"
      action: retry
      max_attempts: 5
      backoff: exponential

5. Provide Reasonable Defaults

Use coalesce() to provide sensible defaults when range is not specified:

over: >
  range(
    coalesce(context.range_start, sync.last_date, date_format(date_add(now(), -30, "day"), "%Y-%m-%d")),
    coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
    "1d"
  )

Last updated

Was this helpful?