# Change Capture (CDC)

Change Data Capture (CDC) continuously replicates row-level changes (inserts, updates, deletes) from a source database to a target database by reading the database's transaction log. Unlike `incremental` mode, which polls the source for new or updated rows, CDC captures every change as it happens, including deletes.

{% hint style="success" %}
**Advanced Plan Required**: CDC requires a [CLI Pro](https://docs.slingdata.io/sling-cli/cli-pro) token or an [Advanced Platform Plan](https://docs.slingdata.io/sling-platform/platform).
{% endhint %}

## Supported Sources

| Source                                                                     | Transaction Log       | Status      |
| -------------------------------------------------------------------------- | --------------------- | ----------- |
| [MySQL](https://docs.slingdata.io/concepts/change-capture/mysql)           | Binary log (binlog)   | Available   |
| [MariaDB](https://docs.slingdata.io/concepts/change-capture/mysql)         | Binary log (binlog)   | Available   |
| [PostgreSQL](https://docs.slingdata.io/concepts/change-capture/postgres)   | Write-Ahead Log (WAL) | Coming Soon |
| [SQL Server](https://docs.slingdata.io/concepts/change-capture/sql-server) | Change Tracking / CDC | Coming Soon |

## How It Works

CDC operates in two phases: an **initial load** that copies existing data, followed by **incremental change capture** that streams ongoing changes.

### Phase 1: Initial Load

On the first run for a given stream, Sling performs a full table copy from source to target:

1. **Position capture** — Sling records the current transaction log position (e.g., binlog file + offset) *before* the snapshot begins. This ensures no changes are lost between the snapshot and the first incremental run.
2. **Chunked reading** — Large tables are automatically split into primary-key-range chunks (configurable via `snapshot_chunk_size`). Each chunk is read, written, and checkpointed independently.
3. **Resumability** — If the process is interrupted (crash, timeout, kill), the next run detects the in-progress snapshot and resumes from the last completed chunk; completed chunks are not re-read.
4. **Completion** — Once all chunks are written, Sling marks the initial load as complete in the state store.

{% hint style="info" %}
Chunked mode requires an integer-like primary key for range splitting. If the table has no primary key or the PK is non-numeric, Sling falls back to a single-shot full table read automatically.
{% endhint %}
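
The snapshot behavior above maps to the `snapshot_*` options (detailed in the Options Reference below). As an illustrative fragment for a large table, with values chosen for the example:

```yaml
change_capture_options:
  snapshot_chunk_size: 100000   # rows per primary-key-range chunk
  snapshot_run_duration: 1h     # exit cleanly after 1h; the next run resumes at the last checkpoint
```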

### Phase 2: Incremental Changes

On subsequent runs, Sling reads the source database's transaction log from the last saved position:

1. **Read changes** — Reads inserts, updates, and deletes from the transaction log starting at the saved position, up to `run_max_events` or `run_max_duration`.
2. **Merge to target** — Applies changes to the target table using a merge strategy that handles inserts, updates, and deletes.
3. **Save position** — Persists the new log position in the state store so the next run picks up where this one left off.

Each run is bounded and exits after processing its batch. This makes CDC safe to schedule on a recurring interval (e.g., every 30 seconds or every 5 minutes) via cron or the Sling Platform.
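
The per-run bounds come from two options. For instance, a stream scheduled every few minutes might use (values are illustrative):

```yaml
change_capture_options:
  run_max_events: 10000    # exit after 10,000 change events...
  run_max_duration: 10m    # ...or after 10 minutes, whichever comes first
```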

### What Is a CDC Event?

A CDC event corresponds to a single statement in the transaction log, not a single row. A bulk insert like `INSERT INTO t VALUES (...), (...), (...)` produces **one event** containing multiple rows. Similarly, an `UPDATE ... WHERE status = 'old'` that modifies 1,000 rows is a single event with 1,000 row changes.

This means `run_max_events: 10000` does not necessarily equal 10,000 rows — it could represent significantly more rows depending on how the source application writes data. Keep this in mind when tuning `run_max_events` for high-throughput workloads.
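
A quick back-of-the-envelope sketch of the event/row distinction, using hypothetical per-event row counts:

```bash
# Hypothetical row counts carried by 5 consecutive CDC events.
# A multi-row INSERT or a wide UPDATE is one event with many row changes.
rows_per_event="3 1 1000 2 1"
total=0
for n in $rows_per_event; do total=$((total + n)); done
echo "$total rows from 5 events"   # prints "1007 rows from 5 events"
```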

### Lifecycle Diagram

```
First Run                          Subsequent Runs
─────────                          ────────────────
┌─────────────────────┐            ┌──────────────────────┐
│ Record log position │            │ Read from saved      │
│ (before snapshot)   │            │ log position         │
└─────────┬───────────┘            └──────────┬───────────┘
          │                                   │
┌─────────▼───────────┐            ┌──────────▼───────────┐
│ Read chunk 1        │            │ Capture changes      │
│ Write to target     │            │ (inserts, updates,   │
│ Checkpoint          │            │  deletes)            │
└─────────┬───────────┘            └──────────┬───────────┘
          │                                   │
┌─────────▼───────────┐            ┌──────────▼───────────┐
│ Read chunk 2...N    │            │ Merge into target    │
│ Write + checkpoint  │            │ table                │
└─────────┬───────────┘            └──────────┬───────────┘
          │                                   │
┌─────────▼───────────┐            ┌──────────▼───────────┐
│ Mark initial load   │            │ Save new log         │
│ complete            │            │ position             │
└─────────────────────┘            └──────────────────────┘
```

## Replication Structure

CDC is configured using `mode: change-capture` in a standard Sling replication file. CDC-specific options go under `change_capture_options`:

```yaml
source: MY_SOURCE
target: MY_TARGET

defaults:
  mode: change-capture
  primary_key: [id]
  object: public.{stream_table}
  change_capture_options:
    run_max_events: 10000       # max change events per run
    run_max_duration: 10m       # max wall-clock time per run
    soft_delete: false          # when true, deleted rows are kept, marked with _sling_synced_op='D'
    snapshot_start: now         # 'now' or 'beginning' for the very first run
    snapshot_chunk_size: 100000 # rows per chunk during initial snapshot
    snapshot_run_duration: 30m  # cap time spent on initial snapshot per run
    # replay_from: "2025-01-01T00:00:00Z"  # rewind to re-process from a point in time (beta)
    retry_attempts: 3           # retries on transient failures
    retry_delay: 5s             # delay between retries

streams:
  my_database.users:

  my_database.orders:
    change_capture_options:
      soft_delete: true         # per-stream override: preserve deleted rows
      run_max_events: 50000     # higher event budget for a busy table
      run_max_duration: 5m      # shorter run window

  my_database.products:
    change_capture_options:
      snapshot_start: beginning         # capture full history from earliest log position
      snapshot_chunk_size: 50000        # smaller chunks for a wide table
      snapshot_run_duration: 30m        # resume snapshot across runs
```

Options set in `defaults.change_capture_options` apply to all streams. Per-stream `change_capture_options` override the defaults.
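
Assuming the YAML above is saved as `cdc_replication.yaml` (the filename is arbitrary), the replication runs via the standard `sling run` invocation:

```bash
# One invocation processes one bounded batch of changes, then exits
sling run -r cdc_replication.yaml
```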

## Options Reference

<table data-full-width="false"><thead><tr><th>Key</th><th>Description</th></tr></thead><tbody><tr><td><code>run_max_events</code></td><td>Maximum number of change events to process per run. When this limit is reached, Sling saves the position and exits. Default is <code>10000</code>.</td></tr><tr><td><code>run_max_duration</code></td><td>Maximum duration per run (e.g., <code>30s</code>, <code>10m</code>, <code>1h</code>). If no events arrive within this window, the run completes with zero changes. Default is <code>10m</code>.</td></tr><tr><td><code>soft_delete</code></td><td>When <code>true</code>, DELETE events mark the row with <code>_sling_synced_op = 'D'</code> and update <code>_sling_synced_at</code> instead of removing the row. Default is <code>false</code>.</td></tr><tr><td><code>snapshot_start</code></td><td>Where to start reading the transaction log on the very first run. Default is <code>now</code>. Use <code>beginning</code> to read from the earliest available log position.</td></tr><tr><td><code>snapshot_chunk_size</code></td><td>Number of rows per chunk during the initial snapshot. Default is <code>100000</code>.</td></tr><tr><td><code>snapshot_run_duration</code></td><td>Maximum time to spend on the initial snapshot per run (e.g., <code>30m</code>, <code>1h</code>). When the budget is exhausted, Sling exits cleanly after the current chunk and resumes on the next run. Default: no limit.</td></tr><tr><td><code>replay_from</code></td><td>Rewind the CDC position to re-process changes from an earlier point. Accepts source-specific formats (e.g., RFC 3339 timestamp, binlog position, GTID set). Applied once per unique value.</td></tr><tr><td><code>retry_attempts</code></td><td>Number of retry attempts on transient failures. Default is <code>3</code>.</td></tr><tr><td><code>retry_delay</code></td><td>Delay between retries. Default is <code>5s</code>.</td></tr></tbody></table>

## CDC Metadata Columns

Sling adds three metadata columns to every CDC-managed target table:

| Column             | Type          | Description                                                                                                                                                       |
| ------------------ | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `_sling_synced_at` | `timestamptz` | Timestamp when the row was last synced.                                                                                                                           |
| `_sling_synced_op` | `varchar`     | The operation type: `S` (snapshot), `I` (insert), `U` (update), `D` (delete). When `soft_delete: true`, deleted rows are preserved with `_sling_synced_op = 'D'`. |
| `_sling_cdc_seq`   | `bigint`      | Monotonically increasing sequence number for ordering events within and across runs.                                                                              |
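
These columns can be queried directly on the target. For example, with `soft_delete: true` and a hypothetical `public.users` target table:

```sql
-- Rows deleted at the source but preserved on the target
SELECT *
FROM public.users
WHERE _sling_synced_op = 'D'
ORDER BY _sling_cdc_seq DESC;
```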

## State Management

CDC state is stored in the connection specified by the `SLING_STATE` environment variable. This tracks:

* The current transaction log position
* Whether the initial snapshot is complete
* Checkpoint progress for in-progress snapshots
* Total rows captured

```bash
# Store state in a PostgreSQL table
export SLING_STATE='MY_POSTGRES/sling_state'

# Or store state in a file-based connection (e.g., S3)
export SLING_STATE='MY_AWS/sling_state'
```

{% hint style="warning" %}
The `SLING_STATE` connection must be configured before running CDC replications. Without it, Sling cannot track positions and each run would repeat the initial snapshot. See location string details [here](https://docs.slingdata.io/sling-cli/environment#location-string).
{% endhint %}

## Scheduling

CDC is designed to be run repeatedly on a schedule. Each run processes a bounded batch of changes (controlled by `run_max_events` and `run_max_duration`) and exits.

Configure the replication in the [Sling Platform](https://docs.slingdata.io/sling-platform/platform) UI with a schedule interval. The platform handles orchestration, monitoring, and alerting automatically.

![Sling Platform UI](https://3453272330-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M93cpHl7B7NPZlDrubS%2Fuploads%2Fgit-blob-22e0411029f6c0376d78bd0f2681dea8f806b073%2Fsling-platform-ui.png?alt=media)
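
For CLI-based deployments, a plain cron entry works as well. A hypothetical crontab line running the batch every 5 minutes:

```bash
# m h dom mon dow  command
*/5 * * * * sling run -r /path/to/cdc_replication.yaml >> /var/log/sling_cdc.log 2>&1
```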

## Comparison with Other Modes

| Feature          | `change-capture`    | `incremental`           | `full-refresh`  |
| ---------------- | ------------------- | ----------------------- | --------------- |
| Captures inserts | Yes                 | Yes                     | Yes             |
| Captures updates | Yes                 | Yes (with `update_key`) | Yes             |
| Captures deletes | Yes                 | With `delete_missing`   | Yes             |
| Reads from       | Transaction log     | Table query             | Table query     |
| State tracking   | Log position        | Max update\_key value   | None            |
| Source load      | Minimal (reads log) | Queries table           | Full table scan |
| Initial setup    | Automatic snapshot  | Manual first load       | N/A             |
