Data Quality

Learn how to use Sling for data quality

Sling provides several powerful features to ensure and maintain data quality throughout your data pipeline:

Constraints

Constraints allow you to validate data at ingestion time using SQL-like syntax. They're specified at the column level and can prevent invalid data from entering your system.

streams:
  my_stream:
    columns:
      # Ensure email is valid
      email: string | value ~ '^[^@]+@[^@]+\.[^@]+$'
      
      # Status can only be specific values
      status: string | value in ('active', 'pending', 'inactive')
      
      # Amount must be positive
      amount: decimal | value > 0
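
Constraints are evaluated per record as data is read, so invalid values are caught before they reach the target. The replication runs as usual; assuming the config above is saved as replication.yaml:

sling run -r replication.yaml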

Check Hooks

Check hooks enable you to implement custom validation logic at any point in your pipeline. They're particularly useful for:

  • Validating row counts

  • Ensuring data freshness

  • Implementing complex business rules

hooks:
  post:
    # Ensure minimum row count
    - type: check
      check: "run.total_rows >= 1000"
      on_failure: abort
    
    # Verify recent data
    - type: check
      check: "state.freshness_check.result.last_update >= timestamp.unix - 3600"
      on_failure: warn
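
The freshness check above reads state.freshness_check.result.last_update, which assumes an earlier hook stored the table's last update time under the id freshness_check. A minimal sketch of such a hook, placed earlier in the same post list (the updated_at column, table name, and Postgres-style epoch extraction are illustrative):

hooks:
  post:
    # Runs before the freshness check; captures the latest update time as a Unix epoch
    - type: query
      id: freshness_check
      connection: target_db
      query: "SELECT max(extract(epoch from updated_at)) AS last_update FROM my_schema.my_table"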

Query Hooks

Query hooks allow you to run SQL-based quality checks and store results:

hooks:
  post:
    - type: query
      connection: target_db
      query: |
        INSERT INTO quality_metrics (
          stream_name,
          check_time,
          null_rate,
          duplicate_rate
        )
        SELECT 
          '{run.stream.name}',
          CURRENT_TIMESTAMP,
          SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*),
          (COUNT(*) - COUNT(DISTINCT id)) * 100.0 / COUNT(*)
        FROM {run.object.full_name}
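
Here, {run.stream.name} and {run.object.full_name} are runtime variables that resolve to the current stream's name and the fully qualified target table when the hook executes.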

Failure Handling

All data quality features support flexible failure handling:

  • abort: Stop processing immediately

  • warn: Continue but emit a warning

  • skip: Skip the problematic record (for constraints)

  • quiet: Continue silently

This allows you to implement the appropriate level of strictness for your use case.

By combining these features, you can build robust data quality checks throughout your pipeline, from ingestion to final delivery.
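
For example, a query hook can compute a metric that a subsequent check hook enforces. The sketch below follows the state.<id>.result pattern shown earlier; the 5% threshold and the email column are illustrative:

hooks:
  post:
    # Compute the share of rows with a missing email
    - type: query
      id: null_rate_check
      connection: target_db
      query: |
        SELECT SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS null_rate
        FROM {run.object.full_name}

    # Abort the run if more than 5% of emails are missing
    - type: check
      check: "state.null_rate_check.result.null_rate <= 5"
      on_failure: abort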
