Parallel Chunking ⚡

Chunking is a feature in Sling that breaks down large data transfers into smaller, manageable parts. This is particularly useful for optimizing performance, managing resources, and enabling parallel processing during incremental and backfill operations. Chunking is available with Sling CLI Pro or on a Platform plan.

Supported Chunk Types

Sling supports several chunking strategies via the source_options.chunk_size or source_options.chunk_count parameters:

Time-based chunks:

  • Hours: e.g., 6h

  • Days: e.g., 7d

  • Weeks: e.g., 1w

  • Months: e.g., 1m

  • Years: e.g., 1y

Numeric chunks:

  • Integer ranges: e.g., 1000 for chunks of 1000 records

Count-based chunks (v1.4.14+):

  • Specific number of chunks: e.g., 5 to split data into 5 equal parts

Expression-based chunks (v1.4.14+):

  • Custom expressions: e.g., mod(abs(hashtext(column_name)), {chunk_count})

Each chunk is processed independently, allowing for parallel execution when combined with SLING_THREADS.
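As a rough illustration of independent chunks running concurrently, the Python sketch below reads a worker count from SLING_THREADS and hands each chunk to a thread pool. It is a conceptual model only, not Sling's internals: process_chunk, run_chunks, and the single-worker fallback are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Hypothetical stand-in for transferring one chunk; returns its size."""
    start, end = chunk
    return end - start

def run_chunks(chunks, threads=None):
    # Assumption: fall back to a single worker when SLING_THREADS is unset.
    if threads is None:
        threads = int(os.environ.get("SLING_THREADS", "1"))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in input order even though chunks run concurrently.
        return list(pool.map(process_chunk, chunks))

chunks = [(0, 1000), (1000, 2000), (2000, 2500)]
results = run_chunks(chunks, threads=3)
```

Because no chunk depends on another, raising the worker count changes only how many run at once, not the result.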

Chunking in Different Modes

Chunking works across all replication modes (full-refresh, truncate, incremental, and backfill), helping process large datasets by breaking them into smaller batches. This is useful for memory management, progress tracking, and reducing source database load.

Time-Range Chunking

Time-range chunking splits data into sub-ranges based on a date/time column, and works across the different modes.
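To make the splitting concrete, here is a minimal Python sketch of how a time range could be cut into sub-ranges from a chunk size such as 6h, 7d or 1w. It is an assumption-laden illustration rather than Sling's implementation: the function names are hypothetical, the sub-ranges are treated as half-open, and months and years are left out because their length varies.

```python
from datetime import datetime, timedelta

# Units with a fixed duration; months and years vary in length and are omitted here.
UNITS = {"h": "hours", "d": "days", "w": "weeks"}

def parse_chunk_size(value):
    """Turn a string such as '6h', '7d' or '1w' into a timedelta."""
    amount, unit = int(value[:-1]), value[-1]
    if unit not in UNITS or amount <= 0:
        raise ValueError(f"unsupported chunk size: {value!r}")
    return timedelta(**{UNITS[unit]: amount})

def time_chunks(start, end, chunk_size):
    """Split [start, end) into consecutive sub-ranges of at most chunk_size."""
    step = parse_chunk_size(chunk_size)
    chunks = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # the last chunk may be shorter
        chunks.append((cursor, upper))
        cursor = upper
    return chunks

ranges = time_chunks(datetime(2024, 1, 1), datetime(2024, 1, 20), "7d")
```

With a 19-day range and 7d chunks, the last sub-range is shorter than the others, which is why the upper bound is clamped to the end of the range.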

Numeric-Range Chunking

Numeric-range chunking splits data based on a numeric column such as an ID.
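The idea can be sketched in Python as follows, assuming the chunk size is measured in key values between the smallest and largest ID. The helper name and the inclusive bounds are hypothetical choices for illustration, not Sling's actual behavior.

```python
def numeric_chunks(min_id, max_id, chunk_size):
    """Split the inclusive range [min_id, max_id] into spans of chunk_size values."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    lower = min_id
    while lower <= max_id:
        upper = min(lower + chunk_size - 1, max_id)  # clamp the final span
        chunks.append((lower, upper))
        lower = upper + 1
    return chunks

bounds = numeric_chunks(1, 2500, 1000)
```

Under this assumption, gaps in the ID sequence mean a span can hold fewer rows than the chunk size suggests.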

Count-based Chunking

Count-based chunking splits data into a specific number of equal chunks.
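A minimal Python sketch of the arithmetic, assuming the split is made over the range of a numeric key: when the range does not divide evenly, the first few chunks take one extra value. The function name and the remainder handling are assumptions, not a description of Sling's internals.

```python
def count_chunks(min_id, max_id, chunk_count):
    """Split the inclusive range [min_id, max_id] into chunk_count near-equal parts."""
    total = max_id - min_id + 1
    if chunk_count <= 0 or chunk_count > total:
        raise ValueError("chunk_count must be between 1 and the number of values")
    base, extra = divmod(total, chunk_count)
    chunks = []
    lower = min_id
    for i in range(chunk_count):
        # The first `extra` chunks take one additional value each.
        size = base + (1 if i < extra else 0)
        chunks.append((lower, lower + size - 1))
        lower += size
    return chunks

parts = count_chunks(1, 1002, 5)
```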

Chunking by Expression

Expression-based chunking allows you to define custom SQL expressions to distribute data across chunks using the chunk_expr parameter. This works across all modes and is particularly useful for:

  • Hash-based distribution for even data splitting

  • Custom partitioning logic based on specific columns

  • Complex expressions that don't rely on sequential values

  • Tables without an update key, since none is needed
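The effect of an expression such as mod(abs(hashtext(column_name)), {chunk_count}) can be imitated in Python to see how rows are assigned to chunks. This is only an analogy: zlib.crc32 stands in for the database hash function, and the sample values are made up.

```python
import zlib
from collections import Counter

def chunk_index(value, chunk_count):
    """Python analogue of mod(abs(hash(column)), chunk_count)."""
    # zlib.crc32 is a stand-in hash, not PostgreSQL's hashtext(); it is already
    # non-negative in Python 3, so no abs() is needed here.
    return zlib.crc32(str(value).encode("utf-8")) % chunk_count

chunk_count = 4
names = [f"customer_{i}" for i in range(1000)]  # made-up sample values
distribution = Counter(chunk_index(name, chunk_count) for name in names)
```

Every value maps to exactly one index between 0 and chunk_count - 1, and the same value always maps to the same index, so the chunks never overlap and together cover all rows.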

Mixed Chunking Strategies

For more on incremental mode basics, see incremental.md.

Chunking in Backfill Mode

Backfill mode with chunking loads historical data in smaller ranges, which keeps large backfills manageable.

Sling splits the specified range into smaller sub-ranges based on chunk_size or chunk_count. Each sub-range is processed as a separate stream.
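As an illustration of the chunk_count case, the Python sketch below divides a backfill range into equally long sub-ranges, each of which would become its own stream. The function name, the half-open sub-ranges, and the example dates are assumptions made for the sketch.

```python
from datetime import datetime

def backfill_subranges(range_start, range_end, chunk_count):
    """Split [range_start, range_end) into chunk_count equal time sub-ranges."""
    if chunk_count <= 0:
        raise ValueError("chunk_count must be positive")
    step = (range_end - range_start) / chunk_count
    # Use range_end itself as the last edge so the sub-ranges cover the range exactly.
    edges = [range_start + step * i for i in range(chunk_count)] + [range_end]
    return list(zip(edges[:-1], edges[1:]))

subranges = backfill_subranges(datetime(2023, 1, 1), datetime(2023, 1, 11), 5)
```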

For more on backfill mode, see backfill.md.

Chunking with Custom SQL

Combine custom SQL queries with chunking using the {incremental_where_cond} variable in your SQL.
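Conceptually, the placeholder is filled in once per chunk, so the same query runs several times over different ranges. The Python sketch below imitates that substitution. The table name, the column name, and the exact shape of the generated condition are assumptions, and the condition Sling actually produces may differ.

```python
PLACEHOLDER = "{incremental_where_cond}"

def render_chunk_queries(sql, column, bounds):
    """Replace the placeholder with one range condition per chunk."""
    if PLACEHOLDER not in sql:
        raise ValueError("query must contain " + PLACEHOLDER)
    queries = []
    for lower, upper in bounds:
        # Assumption: a half-open range filter; the condition Sling generates may differ.
        condition = f"{column} >= {lower} and {column} < {upper}"
        queries.append(sql.replace(PLACEHOLDER, condition))
    return queries

# my_table and id are hypothetical names used only for this example.
sql = "select * from my_table where {incremental_where_cond}"
queries = render_chunk_queries(sql, "id", [(0, 1000), (1000, 2000)])
```

In this sketch, a query without the placeholder is rejected because there is nowhere to apply the per-chunk filter.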

For more on custom SQL, see custom-sql.md.

Key Benefits and Notes

  • Parallelism: Use SLING_THREADS for concurrent chunk processing

  • Error Recovery: Chunks allow resuming from failures without restarting everything

  • Progress Tracking: Better visibility for large operations

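  • Requirements: Works across all replication modes; time-range and numeric-range chunking need an update_key of a matching type (date/time or numeric), while expression-based chunking needs no update key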

  • Best Practices: Test chunk sizes for optimal performance; monitor resource usage

For complete replication mode details, see Replication Modes.
