Transformations

Using Sling transformations

Can Sling do the T part of ELT?

Not in the traditional sense. However, Sling is capable of the transformations below.

Custom SQL

You can provide a custom SELECT query as the source stream by passing SQL as the source.stream input. Here is an example:

sling run \
  --src-conn STARROCKS \
  --src-stream "SELECT columnB, columnA FROM tbl WHERE columnB > 6000" \
  --tgt-conn MYSQL \
  --tgt-object mysql.tbl \
  --mode full-refresh
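
For repeatable runs, the same job can be expressed as a replication config. Below is a minimal sketch, assuming the stream-level sql key carries the custom query (the stream name custom_query is an illustrative label; verify the key against the replication docs for your version):

source: STARROCKS
target: MYSQL

streams:
  custom_query:   # arbitrary label for a SQL-based stream (assumed)
    sql: SELECT columnB, columnA FROM tbl WHERE columnB > 6000
    object: mysql.tbl
    mode: full-refresh

It could then be run with sling run -r replication.yaml.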

Column Type Casting

Column types can be coerced via the columns key in source.options (see here). This is especially useful for file sources. Here is an example casting column types:

sling run \
  --src-conn GCS \
  --src-stream gs://path/to/file3.csv \
  --tgt-conn CLICKHOUSE \
  --tgt-object default.gcs_{stream_file_name} \
  --src-options '{columns: { col1: string, col2: "decimal(20,10)", col3: json } }'
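
In a replication config, the same columns map sits under a stream's source_options. A hedged sketch mirroring the CLI flags above (the keys are assumed to map one-to-one):

source: GCS
target: CLICKHOUSE

streams:
  gs://path/to/file3.csv:
    object: default.gcs_{stream_file_name}
    source_options:
      columns:
        col1: string
        col2: "decimal(20,10)"
        col3: json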

Special String Transforms

Special string transforms, such as encoding conversions and UUID parsing, are available via the transforms key in source.options (see here). Here are some examples:

# Transform at stream level
sling run \
  --src-conn POSTGRES \
  --src-stream my_schema.my_table \
  --tgt-conn SNOWFLAKE \
  --tgt-object new_schema.{stream_schema}_{stream_table} \
  --src-options '{ transforms: [ decode_utf16, replace_accents ] }' # will decode first, then replace accents

# Transform at column level
sling run \
  --src-conn POSTGRES \
  --src-stream my_schema.my_table \
  --stdout --limit 10 \
  --src-options '{ transforms: { col1: [ hash_md5 ], col5: [ trim_space, parse_uuid ] } }'

# Transform at stream level (with "*") and column level
sling run \
  --src-conn POSTGRES \
  --src-stream my_schema.my_table \
  --stdout --limit 10 \
  --src-options '{ transforms: { "*": [ decode_utf16 ], col_sensitive: [ hash_sha512 ] } }'
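
The same transforms map can live in a replication config under source_options. A minimal sketch, assuming the CLI options translate directly (column names are illustrative):

source: POSTGRES
target: SNOWFLAKE

streams:
  my_schema.my_table:
    object: new_schema.{stream_schema}_{stream_table}
    source_options:
      transforms:
        "*": [ decode_utf16 ]           # stream-level, applied to all columns
        col_sensitive: [ hash_sha512 ]  # column-level, applied to one column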

JSON Flattening

JSON structures can be flattened using the flatten key in source.options (see here). Example:

sling run \
  --src-conn AWS_S3 \
  --src-stream s3://path/to/file.json \
  --tgt-object file://./target/models.csv \
  --src-options '{flatten: true}'
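
As a replication config, the flatten option again goes under source_options. A sketch, assuming the built-in LOCAL connection can stand in for the local file target:

source: AWS_S3
target: LOCAL

streams:
  s3://path/to/file.json:
    object: file://./target/models.csv
    source_options:
      flatten: true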

JMESPath Transforms

JMESPath transformations for JSON source files are available via the jmespath key in source.options. Here is an example extracting the models information from the dbt manifest file using the jmespath + flatten options:

sling run \
  --src-stream file://./target/manifest.json \
  --tgt-object file://./target/models.csv \
  --src-options '{
      jmespath: "nodes.*.{resource_type: resource_type, database: database, schema: schema, name: name, relation_name: relation_name, original_file_path: original_file_path, materialized: config.materialized }",
      flatten: true
    }'
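
To make the projection concrete, here is a sketch of what the expression does to a single manifest entry (abbreviated, with illustrative values; real manifests carry many more keys):

# Given one entry under nodes:
#   "model.my_project.users": {
#     "resource_type": "model", "database": "analytics", "schema": "public",
#     "name": "users", "relation_name": "analytics.public.users",
#     "original_file_path": "models/users.sql",
#     "config": { "materialized": "table" }
#   }
#
# nodes.*.{...} emits one flat record per node:
#   { "resource_type": "model", "database": "analytics", "schema": "public",
#     "name": "users", "relation_name": "analytics.public.users",
#     "original_file_path": "models/users.sql", "materialized": "table" }

flatten: true then expands any remaining nested values into top-level columns before the CSV is written.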

Schema Evolution

When using Sling to extract/load data in an incremental manner, it will attempt to match whatever columns are present in both the source stream and the target table. If an extra column is present in the source stream, Sling will add it to the target table. If no columns from the source stream match at all, it will error. At least the primary_key or update_key must be present in the target table.

See below for a simple example, mimicking the addition and removal of columns.

# Initial data

$ echo 'a,b,c
1,2,3
4,5,6' > test1.csv

$ sling run \
  --src-stream file://./test1.csv \
  --tgt-conn postgres \
  --tgt-object public.test1

<...log output omitted>

$ sling run \
  --src-conn postgres \
  --src-stream public.test1 \
  --stdout

a,b,c,_sling_loaded_at
1,2,3,1707869559
4,5,6,1707869559
# test2.csv is missing column b

$ echo 'a,c
7,8' > test2.csv

$ sling run \
  --src-stream file://./test2.csv \
  --tgt-conn postgres \
  --tgt-object public.test1 \
  --mode incremental \
  --primary-key a

<...log output omitted>

$ sling run \
  --src-conn postgres \
  --src-stream public.test1 \
  --stdout

a,b,c,_sling_loaded_at
1,2,3,1707869559
4,5,6,1707869559
7,,8,1707869689
# test3.csv is missing columns b and c, and has an extra column d

$ echo 'a,d
9,10' > test3.csv

$ sling run \
  --src-stream file://./test3.csv \
  --tgt-conn postgres \
  --tgt-object public.test1 \
  --mode incremental \
  --primary-key a

<...log output omitted>

$ sling run \
  --src-conn postgres \
  --src-stream public.test1 \
  --stdout

a,b,c,_sling_loaded_at,d
1,2,3,1707869559,
4,5,6,1707869559,
7,,8,1707869689,
9,,,1707870320,10

We can see that Sling handled the changes properly, in a non-destructive manner. If the source stream were from a database, the same rules would apply whether a column disappeared or appeared.
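
The incremental loads above can also be written as a replication config. A hedged sketch, assuming the built-in LOCAL connection covers local files and that primary_key is accepted at the stream level:

source: LOCAL
target: POSTGRES

streams:
  file://./test2.csv:
    object: public.test1
    mode: incremental
    primary_key: [a]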

Suggestions?

There are currently no plans to support other custom transformations in the traditional "T" sense. However, any suggestions are welcome.
