# Backfill

Backfill mode allows you to load historical data from APIs within a specific date or value range. This is useful when you need to reload data for a particular time period or catch up on historical data.

Learn more: [Incremental Sync](/concepts/api-specs/advanced.md#sync-state-for-incremental-loads) | [Iteration](/concepts/api-specs/request.md#iteration-looping-requests)

## How API Backfill Works

API backfill uses the `range` parameter in combination with `context.range_start` and `context.range_end` to iterate over date or value ranges:

1. Define a range in your replication config using `source_options.range`
2. Access the range values in your spec using `context.range_start` and `context.range_end`
3. Use these values with the `range()` function to generate iterations
4. Each iteration fetches data for that specific date or value

> **Note:** API backfill does not support chunking (like database backfill does). The range is processed through iteration.

## Date Range Backfill

This is the most common use case - backfilling API data for a specific date range.

**Spec File** (`analytics_api.yaml`)

{% code title="analytics\_api.yaml" overflow="wrap" %}

```yaml
name: "Analytics API"

defaults:
  state:
    base_url: https://api.analytics.com/v1
  request:
    headers:
      Authorization: "Bearer {secrets.api_key}"

endpoints:
  daily_events:
    description: "Get daily event data for a date range"

    # Persist the last processed date for incremental runs
    sync: [last_date]

    iterate:
      # Generate date range using context values from replication config
      over: >
        range(
          coalesce(context.range_start, sync.last_date, date_format(date_add(now(), -7, "day"), "%Y-%m-%d")),
          coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
          "1d"
        )
      into: "state.current_date"
      concurrency: 5  # Process 5 dates concurrently

    state:
      # Format the date for the API request
      date: '{date_format(state.current_date, "%Y-%m-%d")}'

    request:
      url: "{state.base_url}/events/daily/{state.date}"
      method: GET

    response:
      records:
        jmespath: "data.events[]"
        primary_key: ["event_id"]

      processors:
        # Track the latest date processed for incremental sync
        - expression: "state.date"
          output: "state.last_date"
          aggregation: "maximum"

    overrides:
      mode: incremental  # Use incremental mode for proper upserting
```

{% endcode %}

***

**Using Replication**

Running with Sling: `sling run -r /path/to/replication.yaml`

{% code title="replication.yaml" overflow="wrap" %}

```yaml
source: MY_API
target: MY_TARGET_DB

defaults:
  mode: full-refresh

streams:
  daily_events:
    object: analytics.events
    source_options:
      # Backfill data for January 2024
      range: '2024-01-01,2024-01-31'
```

{% endcode %}

> **Tip:** The spec uses `overrides.mode: incremental` to ensure data is properly upserted based on `primary_key`, while the replication config can use `mode: full-refresh` or leave it unset.

***

**Using Python**

{% code title="api\_to\_database.py" overflow="wrap" %}

```python
from sling import Replication, ReplicationStream

replication = Replication(
    source='MY_API',
    target='MY_TARGET_DB',
    streams={
        'daily_events': ReplicationStream(
            object='analytics.events',
            source_options={
                'range': '2024-01-01,2024-01-31'  # Backfill January 2024
            }
        )
    }
)

replication.run()
```

{% endcode %}

## Month Range Backfill

For APIs that accept month-level granularity, you can iterate by month.

**Spec File** (`sales_api.yaml`)

{% code title="sales\_api.yaml" overflow="wrap" %}

```yaml
name: "Sales API"

defaults:
  state:
    base_url: https://api.sales.com/v2
  request:
    headers:
      X-API-Key: "{secrets.api_key}"

endpoints:
  monthly_sales:
    description: "Get monthly sales data"

    sync: [last_month]

    iterate:
      # Generate monthly range
      over: >
        range(
          coalesce(context.range_start, sync.last_month, date_format(date_add(now(), -6, "month"), "%Y-%m-01")),
          coalesce(context.range_end, date_format(now(), "%Y-%m-01")),
          "1M"
        )
      into: "state.current_month"
      concurrency: 3

    state:
      # Format as YYYY-MM for the API
      month: '{date_format(state.current_month, "%Y-%m")}'

    request:
      url: "{state.base_url}/sales/summary"
      parameters:
        month: "{state.month}"

    response:
      records:
        jmespath: "sales[]"
        primary_key: ["sale_id"]

      processors:
        - expression: "state.month"
          output: "state.last_month"
          aggregation: "maximum"

    overrides:
      mode: incremental
```

{% endcode %}

***

**Using Replication**

{% code title="replication.yaml" overflow="wrap" %}

```yaml
source: MY_API
target: MY_TARGET_DB

streams:
  monthly_sales:
    object: sales.monthly_summary
    source_options:
      # Backfill 6 months of data
      range: '2023-07-01,2023-12-31'
```

{% endcode %}

## Numeric ID Range Backfill

For APIs that support ID-based pagination or filtering, you can backfill by ID range.

**Spec File** (`orders_api.yaml`)

{% code title="orders\_api.yaml" overflow="wrap" %}

```yaml
name: "Orders API"

defaults:
  state:
    base_url: https://api.shop.com/v1
  request:
    headers:
      Authorization: "Bearer {secrets.api_key}"

endpoints:
  orders_by_id:
    description: "Get orders by ID range"

    sync: [last_order_id]

    iterate:
      # Generate numeric range (process 1000 IDs at a time)
      over: >
        range(
          coalesce(context.range_start, sync.last_order_id, "1"),
          coalesce(context.range_end, "999999"),
          "1000"
        )
      into: "state.start_id"
      concurrency: 5

    state:
      # Calculate end ID for this batch
      end_id: '{to_int(state.start_id) + 999}'

    request:
      url: "{state.base_url}/orders"
      parameters:
        id_min: "{state.start_id}"
        id_max: "{state.end_id}"

    response:
      records:
        jmespath: "orders[]"
        primary_key: ["order_id"]

      processors:
        # Track the highest order ID processed
        - expression: "record.order_id"
          output: "state.last_order_id"
          aggregation: "maximum"

    overrides:
      mode: incremental
```

{% endcode %}

***

**Using Replication**

{% code title="replication.yaml" overflow="wrap" %}

```yaml
source: MY_API
target: MY_TARGET_DB

streams:
  orders_by_id:
    object: ecommerce.orders
    source_options:
      # Backfill orders with IDs from 10000 to 50000
      range: '10000,50000'
```

{% endcode %}

## Multiple Endpoints with Different Ranges

You can backfill multiple endpoints with different date ranges in a single replication.

{% code title="replication.yaml" overflow="wrap" %}

```yaml
source: MY_API
target: MY_TARGET_DB

streams:
  # Backfill events for last month
  daily_events:
    object: analytics.events
    source_options:
      range: '2024-01-01,2024-01-31'

  # Backfill sales for last quarter
  monthly_sales:
    object: sales.monthly_summary
    source_options:
      range: '2023-10-01,2023-12-31'

  # Backfill specific order ID range
  orders_by_id:
    object: ecommerce.orders
    source_options:
      range: '10000,50000'

env:
  SLING_THREADS: 3  # Process 3 streams concurrently
```

{% endcode %}

## Open-Ended Range

You can specify only the start of a range, leaving the end open to process up to the current date.

{% code title="replication.yaml" overflow="wrap" %}

```yaml
source: MY_API
target: MY_TARGET_DB

streams:
  daily_events:
    object: analytics.events
    source_options:
      # Backfill from 2024-01-01 to yesterday
      range: '2024-01-01,'
```

{% endcode %}

In your spec, handle the open end with a default:

```yaml
iterate:
  over: >
    range(
      coalesce(context.range_start, sync.last_date, "2024-01-01"),
      coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
      "1d"
    )
```

## Best Practices

### 1. Always Use Sync State

Persist the last processed value so subsequent runs can pick up where you left off:

```yaml
sync: [last_date]

processors:
  - expression: "state.date"
    output: "state.last_date"
    aggregation: "maximum"
```

### 2. Set Appropriate Concurrency

Balance API rate limits with performance:

```yaml
iterate:
  concurrency: 5  # Adjust based on API rate limits

request:
  rate: 10  # Max requests per second
```

### 3. Use Incremental Mode Override

Even when backfilling, use incremental mode to properly handle upserts:

```yaml
overrides:
  mode: incremental
```

### 4. Handle API Limits Gracefully

Add retry logic for rate limits and errors:

```yaml
response:
  rules:
    - condition: "response.status == 429"
      action: retry
      max_attempts: 5
      backoff: exponential
```

### 5. Provide Reasonable Defaults

Use `coalesce()` to provide sensible defaults when range is not specified:

```yaml
over: >
  range(
    coalesce(context.range_start, sync.last_date, date_format(date_add(now(), -30, "day"), "%Y-%m-%d")),
    coalesce(context.range_end, date_format(date_add(now(), -1, "day"), "%Y-%m-%d")),
    "1d"
  )
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.slingdata.io/examples/api-to-database/backfill.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
