Backfill

Examples of backfilling historical data from APIs to databases using range parameters

Backfill mode allows you to load historical data from APIs within a specific date or value range. This is useful when you need to reload data for a particular time period or catch up on historical data.

Learn more: Incremental Sync | Iteration

How API Backfill Works

API backfill uses the range parameter in combination with context.range_start and context.range_end to iterate over date or value ranges:

Define a range in your replication config using source_options.range
Access the range values in your spec using context.range_start and context.range_end
Use these values with the range() function to generate iterations
Each iteration fetches data for that specific date or value

Note: API backfill does not support chunking (like database backfill does). The range is processed through iteration.

Date Range Backfill

This is the most common use case - backfilling API data for a specific date range.

Spec File (analytics_api.yaml)

analytics_api.yaml

name: "Analytics API"

defaults:
  state:
    base_url: https://api.analytics.com/v1
  request:
    headers:
      Authorization: "Bearer {secrets.api_key}"

endpoints:
  daily_events:
    description: "Get daily event data for a date range"

    # Persist the last processed date for incremental runs
    sync: [last_date]

    iterate:
      # Generate date range using context values from replication config
      over: >
        range(
          coalesce(context.range_start, sync.last_date, date_format(date_add(now(), -7, "day"), "%Y-%m-%d")),
          coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
          "1d"
        )
      into: "state.current_date"
      concurrency: 5  # Process 5 dates concurrently

    state:
      # Format the date for the API request
      date: '{date_format(state.current_date, "%Y-%m-%d")}'

    request:
      url: "{state.base_url}/events/daily/{state.date}"
      method: GET

    response:
      records:
        jmespath: "data.events[]"
        primary_key: ["event_id"]

      processors:
        # Track the latest date processed for incremental sync
        - expression: "state.date"
          output: "state.last_date"
          aggregation: "maximum"

    overrides:
      mode: incremental  # Use incremental mode for proper upserting

Using Replication

Running with Sling: sling run -r /path/to/replication.yaml

replication.yaml

source: MY_API
target: MY_TARGET_DB

defaults:
  mode: full-refresh

streams:
  daily_events:
    object: analytics.events
    source_options:
      # Backfill data for January 2024
      range: '2024-01-01,2024-01-31'

Tip: The spec uses overrides.mode: incremental to ensure data is properly upserted based on primary_key, while the replication config can use mode: full-refresh or leave it unset.

Using Python

api_to_database.py

from sling import Replication, ReplicationStream

replication = Replication(
    source='MY_API',
    target='MY_TARGET_DB',
    streams={
        'daily_events': ReplicationStream(
            object='analytics.events',
            source_options={
                'range': '2024-01-01,2024-01-31'  # Backfill January 2024
            }
        )
    }
)

replication.run()

Month Range Backfill

For APIs that accept month-level granularity, you can iterate by month.

Spec File (sales_api.yaml)

sales_api.yaml

name: "Sales API"

defaults:
  state:
    base_url: https://api.sales.com/v2
  request:
    headers:
      X-API-Key: "{secrets.api_key}"

endpoints:
  monthly_sales:
    description: "Get monthly sales data"

    sync: [last_month]

    iterate:
      # Generate monthly range
      over: >
        range(
          coalesce(context.range_start, sync.last_month, date_format(date_add(now(), -6, "month"), "%Y-%m-01")),
          coalesce(context.range_end, date_format(now(), "%Y-%m-01")),
          "1M"
        )
      into: "state.current_month"
      concurrency: 3

    state:
      # Format as YYYY-MM for the API
      month: '{date_format(state.current_month, "%Y-%m")}'

    request:
      url: "{state.base_url}/sales/summary"
      parameters:
        month: "{state.month}"

    response:
      records:
        jmespath: "sales[]"
        primary_key: ["sale_id"]

      processors:
        - expression: "state.month"
          output: "state.last_month"
          aggregation: "maximum"

    overrides:
      mode: incremental

Using Replication

replication.yaml

source: MY_API
target: MY_TARGET_DB

streams:
  monthly_sales:
    object: sales.monthly_summary
    source_options:
      # Backfill 6 months of data
      range: '2023-07-01,2023-12-31'

Numeric ID Range Backfill

For APIs that support ID-based pagination or filtering, you can backfill by ID range.

Spec File (orders_api.yaml)

orders_api.yaml

name: "Orders API"

defaults:
  state:
    base_url: https://api.shop.com/v1
  request:
    headers:
      Authorization: "Bearer {secrets.api_key}"

endpoints:
  orders_by_id:
    description: "Get orders by ID range"

    sync: [last_order_id]

    iterate:
      # Generate numeric range (process 1000 IDs at a time)
      over: >
        range(
          coalesce(context.range_start, sync.last_order_id, "1"),
          coalesce(context.range_end, "999999"),
          "1000"
        )
      into: "state.start_id"
      concurrency: 5

    state:
      # Calculate end ID for this batch
      end_id: '{to_int(state.start_id) + 999}'

    request:
      url: "{state.base_url}/orders"
      parameters:
        id_min: "{state.start_id}"
        id_max: "{state.end_id}"

    response:
      records:
        jmespath: "orders[]"
        primary_key: ["order_id"]

      processors:
        # Track the highest order ID processed
        - expression: "record.order_id"
          output: "state.last_order_id"
          aggregation: "maximum"

    overrides:
      mode: incremental

Using Replication

replication.yaml

source: MY_API
target: MY_TARGET_DB

streams:
  orders_by_id:
    object: ecommerce.orders
    source_options:
      # Backfill orders with IDs from 10000 to 50000
      range: '10000,50000'

Multiple Endpoints with Different Ranges

You can backfill multiple endpoints with different date ranges in a single replication.

replication.yaml

source: MY_API
target: MY_TARGET_DB

streams:
  # Backfill events for last month
  daily_events:
    object: analytics.events
    source_options:
      range: '2024-01-01,2024-01-31'

  # Backfill sales for last quarter
  monthly_sales:
    object: sales.monthly_summary
    source_options:
      range: '2023-10-01,2023-12-31'

  # Backfill specific order ID range
  orders_by_id:
    object: ecommerce.orders
    source_options:
      range: '10000,50000'

env:
  SLING_THREADS: 3  # Process 3 streams concurrently

Open-Ended Range

You can specify only the start of a range, leaving the end open to process up to the current date.

replication.yaml

source: MY_API
target: MY_TARGET_DB

streams:
  daily_events:
    object: analytics.events
    source_options:
      # Backfill from 2024-01-01 to yesterday
      range: '2024-01-01,'

In your spec, handle the open end with a default:

iterate:
  over: >
    range(
      coalesce(context.range_start, sync.last_date, "2024-01-01"),
      coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
      "1d"
    )

Best Practices

1. Always Use Sync State

Persist the last processed value so subsequent runs can pick up where you left off:

sync: [last_date]

processors:
  - expression: "state.date"
    output: "state.last_date"
    aggregation: "maximum"

2. Set Appropriate Concurrency

Balance API rate limits with performance:

iterate:
  concurrency: 5  # Adjust based on API rate limits

request:
  rate: 10  # Max requests per second

3. Use Incremental Mode Override

Even when backfilling, use incremental mode to properly handle upserts:

overrides:
  mode: incremental

4. Handle API Limits Gracefully

Add retry logic for rate limits and errors:

response:
  rules:
    - condition: "response.status == 429"
      action: retry
      max_attempts: 5
      backoff: exponential

5. Provide Reasonable Defaults

Use coalesce() to provide sensible defaults when range is not specified:

over: >
  range(
    coalesce(context.range_start, sync.last_date, date_format(date_add(now(), -30, "day"), "%Y-%m-%d")),
    coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
    "1d"
  )

PreviousIncremental NextDynamic Endpoints ⚡

Last updated 3 months ago

Was this helpful?

hashtagHow API Backfill Works

hashtagDate Range Backfill

hashtagMonth Range Backfill

hashtagNumeric ID Range Backfill

hashtagMultiple Endpoints with Different Ranges

hashtagOpen-Ended Range

hashtagBest Practices

hashtag1. Always Use Sync State

hashtag2. Set Appropriate Concurrency

hashtag3. Use Incremental Mode Override

hashtag4. Handle API Limits Gracefully

hashtag5. Provide Reasonable Defaults

How API Backfill Works

Date Range Backfill

Month Range Backfill

Numeric ID Range Backfill

Multiple Endpoints with Different Ranges

Open-Ended Range

Best Practices

1. Always Use Sync State

2. Set Appropriate Concurrency

3. Use Incremental Mode Override

4. Handle API Limits Gracefully

5. Provide Reasonable Defaults