Source Options

Specification

The following keys are accepted in the `source_options` property:

| Key | Description |
| --- | --- |
| `compression` | (File source only) The type of compression to use when reading files. Valid inputs are `none`, `auto`, `gzip`, `zstd` and `snappy`. Default is `auto`. |
| `chunk_size` | (Database source only) The chunk size for backfill processing, which tells Sling to split the backfill range into multiple streams. Accepts values such as `12h`, `7d` or `1m`. |
| `datetime_format` | The ISO 8601 date format to use when reading date values. Default is `auto`. |
| `delimiter` | (File source only) The delimiter to use when parsing tabular files. Default is `auto`. |
| `encoding` | (File source only) The text encoding to use when reading files. This is essential for correctly reading files that contain special characters or were created with non-UTF-8 encodings. Options are: `latin1`, `latin5`, `latin9`, `utf8`, `utf8_bom`, `utf16`, `windows1250`, `windows1252`. Default is `utf8`. |
| `escape` | (File source only, since v1.2.4) The escape character to use when parsing tabular files. Default is `"`. |
| `empty_as_null` | Whether empty fields should be treated as NULL. Default is `false` starting in v1.4.5. Prior to that, the default depended on the kind of source connection: `false` for database connections, `true` for storage connections. |
| `flatten` | Whether to flatten a semi-structured source format (JSON, XML). Accepts `true` or `false` boolean values. Since v1.4.5, also accepts an integer representing the maximum flattening depth, where `0` means infinite depth. |
| `format` | (File source only) The format of the file(s). Options are: `csv`, `parquet`, `xlsx`, `avro`, `json`, `jsonlines`, `sas7bdat` and `xml`. |
| `header` | (File source only) Whether to treat the first line as a header. Default is `true`. |
| `jmespath` | (File and NoSQL database sources only) A JMESPath expression used to filter / extract nested JSON data. See https://jmespath.org/ for details. |
| `limit` | The maximum number of rows to pull from the source. |
| `null_if` | A case-sensitive string value that should be treated as a database NULL when encountered. Default is `NULL`. |
| `sheet` | (Excel source files only) The name of the sheet to use as a data source, for example `Sheet1`. Default is the first sheet. You can also specify a range (`Sheet2!B:H`, `Sheet3!B1:H70`). |
| `range` | The range to use for backfill mode, separated by a single comma. Example: `2021-01-01,2021-02-01` or `1,10000`. |
| `skip_blank_lines` | Whether blank lines should be skipped when encountered. Default is `false`. |
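
To see how several of these keys fit together, here is an illustrative sketch of a file-to-database replication. The connection names and file paths are assumptions for the example, not part of the specification:

```yaml
source: LOCAL            # assumed local storage connection
target: MY_POSTGRES      # assumed database connection

streams:
  "file://data/orders.csv.gz":
    object: public.orders
    source_options:
      format: csv            # parse as CSV
      compression: gzip      # declare gzip explicitly instead of relying on auto
      delimiter: ","
      header: true           # treat the first line as a header
      empty_as_null: true    # treat empty fields as NULL
      limit: 100000          # pull at most 100,000 rows
```

Similarly, a sketch of a database backfill showing how `range` and `chunk_size` work together (the connection, table and column names here are hypothetical):

```yaml
source: MY_POSTGRES
target: MY_SNOWFLAKE

streams:
  public.events:
    object: analytics.events
    mode: backfill
    primary_key: [id]
    update_key: created_at
    source_options:
      range: "2021-01-01,2021-02-01"  # backfill window, comma-separated
      chunk_size: 7d                  # split the window into 7-day chunks
```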

Encoding Options

When working with files that contain special characters (accented letters, non-English text, etc.), it's crucial to specify the correct encoding to ensure data integrity. Without the proper encoding, special characters may appear garbled or corrupted.

Supported Encodings

  • latin1 (ISO-8859-1): Western European languages

  • latin5 (ISO-8859-5): Cyrillic alphabet (Russian, Bulgarian, etc.)

  • latin9 (ISO-8859-15): Western European with Euro symbol

  • utf8: Unicode UTF-8 (default, most common)

  • utf8_bom: UTF-8 with Byte Order Mark

  • utf16: Unicode UTF-16

  • windows1250: Central European languages (Windows)

  • windows1252: Western European languages (Windows)

Examples

```yaml
streams:
  # Reading a Latin-1 encoded CSV file with French characters
  "file://data/customers_french.csv":
    object: public.customers
    source_options:
      encoding: latin1
      header: true

  # Reading a Windows-1252 encoded file with special quotes and em-dashes
  "file://data/documents_windows.csv":
    object: public.documents
    source_options:
      encoding: windows1252
      header: true

  # Reading a UTF-8 file with BOM (Byte Order Mark)
  "file://data/international.csv":
    object: public.international_data
    source_options:
      encoding: utf8_bom
      header: true
```

CLI Usage

```bash
# Specify encoding when reading files with special characters
sling run \
  --src-stream "file://./data/latin1_file.csv" \
  --tgt-conn POSTGRES \
  --tgt-object public.my_table \
  --src-options '{"encoding": "latin1", "header": true}'
```

Important: If you don't specify the correct encoding, files with special characters may appear corrupted or cause processing errors. Always verify the encoding of your source files, especially when working with data from different regions or legacy systems.
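
If you're unsure what encoding a file uses, one way to check before running Sling is the `file` utility found on most Unix-like systems (a general OS tool, not a Sling feature; flags vary by platform):

```bash
# Ask GNU file to guess the character encoding of a source file
file --mime-encoding data/customers_french.csv
# Example output: data/customers_french.csv: iso-8859-1  -> use encoding: latin1
```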
