We can specify the stream string as dumps/orders/{YYYY}/{MM}/{DD}/*.parquet. The replication below uses the folder path and sets the file format in source_options:
source: aws_s3
target: postgres
defaults:
  mode: incremental
streams:
  # no need to specify scheme://bucket
  "dumps/orders/{YYYY}/{MM}/{DD}/":
    id: orders # id to run only this stream
    object: 'target_schema.orders'
    source_options:
      format: parquet
env:
  # uses the `path/to/folder` in the same AWS_S3 connection
  SLING_STATE: AWS_S3/path/to/folder
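To make the date placeholders in the stream path concrete, here is a minimal Python sketch of how a template such as dumps/orders/{YYYY}/{MM}/{DD}/ maps to one folder per day. It is only an illustration, not Sling's implementation: the helper names expand_path and paths_for_range are hypothetical, and treating the end date as inclusive is an assumption of this sketch.

```python
from datetime import date, timedelta


def expand_path(template: str, day: date) -> str:
    """Fill {YYYY}, {MM} and {DD} with the zero-padded parts of `day`."""
    return (
        template.replace("{YYYY}", f"{day.year:04d}")
        .replace("{MM}", f"{day.month:02d}")
        .replace("{DD}", f"{day.day:02d}")
    )


def paths_for_range(template: str, start: date, end: date):
    """Yield one expanded path per day, oldest first.

    The end date is inclusive here; that is a choice of this sketch,
    not a documented Sling behavior.
    """
    day = start
    while day <= end:
        yield expand_path(template, day)
        day += timedelta(days=1)


TEMPLATE = "dumps/orders/{YYYY}/{MM}/{DD}/"

for path in paths_for_range(TEMPLATE, date(2021, 12, 31), date(2022, 1, 1)):
    print(path)
```

Each printed path is a folder that would need to hold that day's parquet files for the stream definition above to pick them up.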
Backfilling
The best practice for a first load is to backfill the date range of interest. Sling will ingest all data found under the stream path for that range. Upon completion, the incremental date value is set to the last date in the backfill range, so backfills must be run in ascending order (e.g. 2021-01-01,2021-12-31, then 2022-01-01,2022-12-31, and so on).
# backfill from 2021-01-01 to 2022-01-01
# incremental state value will be set to 2022-01-01
sling run -d -r replication.yaml --mode backfill \
--range 2021-01-01,2022-01-01 \
--streams "orders" # only run "orders" stream
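Because backfills must run oldest first, it can help to generate the commands rather than type them by hand. The following Python sketch builds one backfill command per calendar year in ascending order, using only the flags shown above (-r, --mode, --range, --streams). The helper names yearly_ranges and backfill_command are hypothetical, and splitting the backfill by calendar year is just one possible choice.

```python
import shlex
from datetime import date


def yearly_ranges(first_year: int, last_year: int):
    """Return (start, end) date pairs, one per calendar year, oldest first."""
    return [
        (date(year, 1, 1), date(year, 12, 31))
        for year in range(first_year, last_year + 1)
    ]


def backfill_command(start: date, end: date, stream: str = "orders",
                     replication: str = "replication.yaml"):
    """Build the argument list for a single backfill run."""
    return [
        "sling", "run",
        "-r", replication,
        "--mode", "backfill",
        "--range", f"{start.isoformat()},{end.isoformat()}",
        "--streams", stream,
    ]


# Print the commands in the order they should be executed.
for start, end in yearly_ranges(2021, 2022):
    print(shlex.join(backfill_command(start, end)))
```

The script only prints the commands. To execute them, each argument list could be passed to subprocess.run with check=True so that a failed year stops the sequence before a later range moves the incremental value forward.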
Incremental
Run normally. Sling will process the next incremental date value.
We can also provide an environment variable called SLING_STATE, which is the location where Sling stores the respective incremental values. See the documentation for more details. This allows you to edit the stored value directly if you want to change the incremental marker manually.