Parallel Chunking ⚡
Chunking is a feature in Sling that breaks down large data transfers into smaller, manageable parts. This is particularly useful for optimizing performance, managing resources, and enabling parallel processing during incremental and backfill operations. Chunking is available with Sling CLI Pro or on a Platform plan.
Supported Chunk Types
Sling supports several chunking strategies via the source_options.chunk_size or source_options.chunk_count parameters:
Time-based chunks:
  Hours: e.g., 6h
  Days: e.g., 7d
  Weeks: e.g., 1w
  Months: e.g., 1m
  Years: e.g., 1y
Numeric chunks:
  Integer ranges: e.g., 1000 for chunks of 1000 records
Count-based chunks (v1.4.14+):
  Specific number of chunks: e.g., 5 to split data into 5 equal parts
Expression-based chunks (v1.4.14+):
  Custom expressions: e.g., mod(abs(hashtext(column_name)), {chunk_count})
Each chunk is processed independently, allowing for parallel execution when combined with SLING_THREADS.
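To make the idea of independent chunks concrete, here is a minimal Python sketch of a worker pool running one task per chunk. It is an illustration only, not Sling's implementation: `run_chunks`, `copy_chunk`, and the `threads` argument are hypothetical names, with `threads` playing the role that SLING_THREADS plays in Sling.

```python
from concurrent.futures import ThreadPoolExecutor


def run_chunks(chunks, process, threads=3):
    """Run process(chunk) for every chunk using up to `threads` workers.

    Results are returned in chunk order, even if chunks finish out of order.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(process, chunks))


def copy_chunk(bounds):
    """Hypothetical stand-in for transferring one chunk: report its width."""
    low, high = bounds
    return high - low


# Three independent chunks, processed by two workers.
chunks = [(0, 1000), (1000, 2000), (2000, 2500)]
results = run_chunks(chunks, copy_chunk, threads=2)
```

Because no chunk depends on another, the worker count can be raised or lowered without changing the result, only the elapsed time and the load on the source.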
Chunking in Different Modes
Chunking works across all replication modes (full-refresh, truncate, incremental, and backfill), helping process large datasets by breaking them into smaller batches. This is useful for memory management, progress tracking, and reducing source database load.
Time-Range Chunking
Time-range chunking splits data based on date/time columns across different modes.
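The splitting itself can be sketched in a few lines of Python. This is a hedged illustration rather than Sling's actual logic: `parse_chunk_size` and `time_chunks` are hypothetical helpers, only the fixed-length units (h, d, w) are handled, and the half-open `[low, high)` boundaries are an assumption.

```python
import re
from datetime import datetime, timedelta

# Fixed-length units only; months and years need calendar arithmetic.
_UNITS = {
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
}


def parse_chunk_size(text):
    """Turn a string such as '6h' or '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([hdw])", text)
    if not match or int(match.group(1)) == 0:
        raise ValueError(f"unsupported chunk size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def time_chunks(start, end, chunk_size):
    """Split [start, end) into consecutive half-open ranges."""
    step = parse_chunk_size(chunk_size)
    chunks = []
    low = start
    while low < end:
        high = min(low + step, end)  # the last chunk may be shorter
        chunks.append((low, high))
        low = high
    return chunks


# One day split into 6-hour chunks gives four ranges.
chunks = time_chunks(datetime(2024, 1, 1), datetime(2024, 1, 2), "6h")
```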
Numeric-Range Chunking
Numeric-range chunking splits data based on numeric columns like IDs.
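As a rough sketch of the idea (not Sling's implementation), the following Python splits an ID range into consecutive ranges of a fixed width. `numeric_chunks` is a hypothetical helper and the inclusive bounds are an assumption. In this sketch a chunk covers `chunk_size` consecutive key values, so it holds exactly that many records only when the IDs have no gaps.

```python
def numeric_chunks(min_value, max_value, chunk_size):
    """Split the inclusive range [min_value, max_value] into fixed-width ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    low = min_value
    while low <= max_value:
        high = min(low + chunk_size - 1, max_value)  # last chunk may be narrower
        chunks.append((low, high))
        low = high + 1
    return chunks


# IDs 1..2500 with a chunk size of 1000 give three ranges.
chunks = numeric_chunks(1, 2500, 1000)
```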
Count-based Chunking
Count-based chunking splits data into a specific number of equal chunks.
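One way to picture count-based splitting is to divide a key range into a requested number of near-equal parts. The Python below is a sketch under that assumption, not Sling's actual algorithm. `count_chunks` is a hypothetical helper, and spreading the remainder over the first chunks is just one reasonable choice.

```python
def count_chunks(min_value, max_value, chunk_count):
    """Split the inclusive range [min_value, max_value] into chunk_count parts."""
    if chunk_count <= 0:
        raise ValueError("chunk_count must be positive")
    total = max_value - min_value + 1
    base, extra = divmod(total, chunk_count)
    chunks = []
    low = min_value
    for index in range(chunk_count):
        # The first `extra` chunks take one additional value each.
        size = base + (1 if index < extra else 0)
        if size == 0:
            break  # fewer values than requested chunks
        high = low + size - 1
        chunks.append((low, high))
        low = high + 1
    return chunks


# 1003 values split into 5 parts: three of 201 values, two of 200.
chunks = count_chunks(1, 1003, 5)
```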
Chunking by Expression
Expression-based chunking allows you to define custom SQL expressions to distribute data across chunks using the chunk_expr parameter. This works across all modes and is particularly useful for:
Hash-based distribution for even data splitting
Custom partitioning logic based on specific columns
Complex expressions that don't rely on sequential values
Tables with no suitable update key, since none is needed
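The effect of a hash expression such as mod(abs(hashtext(column_name)), {chunk_count}) can be imitated in Python. This is only a sketch: `zlib.crc32` stands in for the database hash function, and `bucket_for` and `distribute` are hypothetical helpers. The point it shows is that every value maps to exactly one chunk, regardless of ordering or gaps in the column.

```python
import zlib


def bucket_for(value, chunk_count):
    """Map a value to a chunk index in [0, chunk_count) via a stable hash."""
    digest = zlib.crc32(str(value).encode("utf-8"))  # stand-in for the SQL hash
    return digest % chunk_count


def distribute(values, chunk_count):
    """Group values into chunk_count buckets by their hash."""
    buckets = [[] for _ in range(chunk_count)]
    for value in values:
        buckets[bucket_for(value, chunk_count)].append(value)
    return buckets


# Non-sequential string keys still spread across all four chunks' buckets.
customer_ids = [f"cust-{n}" for n in range(1000)]
buckets = distribute(customer_ids, 4)
sizes = [len(bucket) for bucket in buckets]
```

Since the hash of a value never changes, re-running a single chunk selects the same rows each time.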
Mixed Chunking Strategies
For more on incremental mode basics, see incremental.md.
Chunking in Backfill Mode
Backfill mode with chunking loads historical data in smaller ranges, which keeps very large datasets manageable.
Sling splits the specified range into smaller sub-ranges based on chunk_size or chunk_count. Each sub-range is processed as a separate stream.
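For a monthly chunk size, the sub-ranges might look like the output of the Python sketch below. It is an assumption-laden illustration, not Sling's implementation: `monthly_subranges` is a hypothetical helper, and aligning sub-ranges to calendar-month boundaries is a choice made here for clarity.

```python
from datetime import date


def next_month_start(day):
    """Return the first day of the month after `day`."""
    if day.month == 12:
        return date(day.year + 1, 1, 1)
    return date(day.year, day.month + 1, 1)


def monthly_subranges(start, end):
    """Split [start, end) at calendar-month boundaries."""
    subranges = []
    low = start
    while low < end:
        high = min(next_month_start(low), end)
        subranges.append((low, high))
        low = high
    return subranges


# A backfill range from mid-January to mid-April yields four sub-ranges.
subranges = monthly_subranges(date(2023, 1, 15), date(2023, 4, 10))
```

Each tuple corresponds to one unit of work, in the same way each sub-range becomes its own stream.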
For more on backfill mode, see backfill.md.
Chunking with Custom SQL
Combine custom SQL queries with chunking using the {incremental_where_cond} variable in your SQL.
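Conceptually, the placeholder is filled with a different condition for each chunk. The Python sketch below illustrates that substitution. The exact condition Sling generates is not shown in this page, so the `>=` / `<` form is an assumption, and `render_chunk_queries`, the `orders` table, and the `id` column are hypothetical.

```python
def render_chunk_queries(sql_template, update_key, chunks):
    """Produce one query per chunk by filling in the where-condition placeholder."""
    queries = []
    for low, high in chunks:
        # Assumed condition shape: a half-open range on the update key.
        condition = f"{update_key} >= {low} and {update_key} < {high}"
        queries.append(sql_template.replace("{incremental_where_cond}", condition))
    return queries


template = (
    "select * from orders "
    "where {incremental_where_cond} and status = 'paid'"
)
queries = render_chunk_queries(template, "id", [(0, 1000), (1000, 2000)])
```

The rest of the query, such as the extra `status` filter here, is left untouched, so each chunk runs the same SQL over a different slice.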
For more on custom SQL, see custom-sql.md.
Key Benefits and Notes
Parallelism: Use SLING_THREADS for concurrent chunk processing
Error Recovery: Chunks allow resuming from failures without restarting everything
Progress Tracking: Better visibility for large operations
Requirements: Works with incremental and backfill modes; needs appropriate update_key type
Best Practices: Test chunk sizes for optimal performance; monitor resource usage
For complete replication mode details, see Replication Modes.