Data Quality
Learn how to use sling for data quality
Sling provides several features to enforce and monitor data quality throughout your pipeline:
1. Constraints
Constraints allow you to validate data at ingestion time using SQL-like syntax. They're specified at the column level and can prevent invalid data from entering your system.
```yaml
streams:
  my_stream:
    columns:
      # Ensure email is valid
      email: string | value ~ '^[^@]+@[^@]+\.[^@]+$'

      # Status can only be specific values
      status: string | value in ('active', 'pending', 'inactive')

      # Amount must be positive
      amount: decimal | value > 0
```
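For context, the `columns` block above sits inside an ordinary replication config. A minimal sketch, where the connection names and the `full-refresh` mode are placeholders rather than part of the example above:

```yaml
source: MY_SOURCE_DB   # placeholder connection name
target: MY_TARGET_DB   # placeholder connection name

defaults:
  mode: full-refresh

streams:
  my_stream:
    columns:
      # Rows violating this constraint are caught at ingestion time
      amount: decimal | value > 0
```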
2. Check Hooks

Check hooks enable you to implement custom validation logic at any point in your pipeline. They're particularly useful for:

- Validating row counts
- Ensuring data freshness
- Implementing complex business rules
```yaml
hooks:
  post:
    # Ensure minimum row count
    - type: check
      check: "run.total_rows >= 1000"
      on_failure: abort

    # Verify recent data
    - type: check
      check: "state.freshness_check.result.last_update >= timestamp.unix - 3600"
      on_failure: warn
3. Query Hooks
Query hooks allow you to run SQL-based quality checks and store results:
```yaml
hooks:
  post:
    - type: query
      connection: target_db
      query: |
        INSERT INTO quality_metrics (
          stream_name,
          check_time,
          null_rate,
          duplicate_rate
        )
        SELECT
          '{run.stream.name}',
          CURRENT_TIMESTAMP,
          -- percentage of rows with a NULL email
          SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*),
          -- percentage of rows that are duplicates by id
          (COUNT(*) - COUNT(DISTINCT id)) * 100.0 / COUNT(*)
        FROM {run.object.full_name}
```
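Nothing in this hook creates the `quality_metrics` table itself. One way to guarantee it exists is an idempotent `pre` hook; a sketch, assuming a generic SQL dialect and the column types implied by the insert above:

```yaml
hooks:
  pre:
    # One-time setup for the metrics table; DDL syntax varies by database
    - type: query
      connection: target_db
      query: |
        CREATE TABLE IF NOT EXISTS quality_metrics (
          stream_name    varchar(255),
          check_time     timestamp,
          null_rate      numeric,
          duplicate_rate numeric
        )
```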
Failure Handling
All data quality features support flexible failure handling:
- `abort`: Stop processing immediately
- `warn`: Continue but emit a warning
- `skip`: Skip the problematic record (for constraints)
- `quiet`: Continue silently
This allows you to implement the appropriate level of strictness for your use case.
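For example, the same row-count check can be tiered from hard requirement to informational by varying `on_failure` (the thresholds here are illustrative):

```yaml
hooks:
  post:
    # Hard floor: stop the run entirely if it fails
    - type: check
      check: "run.total_rows > 0"
      on_failure: abort

    # Soft target: continue, but surface a warning
    - type: check
      check: "run.total_rows >= 1000"
      on_failure: warn

    # Stretch goal: evaluate silently
    - type: check
      check: "run.total_rows >= 10000"
      on_failure: quiet
```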
By combining these features, you can build robust data quality checks throughout your pipeline, from ingestion to final delivery.