Reading Excel

Examples of reading Excel files with various range options

Sling supports reading Excel files (.xlsx) with flexible range selection options. You can extract data from specific sheets, cell ranges, row ranges, or column ranges.

Range Syntax

The sheet source option supports several range formats:

Format

Example

Description

No sheet

(empty)

Uses first sheet in workbook

Sheet only

Sheet1

Read entire sheet

Range only

!A:F

First sheet, columns A through F

Cell range

Sheet1!A1:F100

Standard cell range

Column range

Sheet1!A:F

All rows, columns A through F

Row range

Sheet1!5:20

Rows 5-20, auto-detect columns with data

Row start

Sheet1!5:

From row 5 to end, auto-detect columns

Partial range

Sheet1!A5:F

From A5 to column F, last row

When no sheet name is provided (or when using ! prefix without a sheet name), Sling automatically uses the first sheet in the workbook.

Excel to Database

Using CLI Flags

sling.sh

# Read first sheet (no sheet option = uses first sheet)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.excel_data' \
  --mode full-refresh

# Read first sheet with column range (! prefix without sheet name)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "!A:F" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.excel_data' \
  --mode full-refresh

# Read specific sheet
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh

# Read specific cell range (A1 to F100)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!A1:F100" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh

# Read column range (all rows, columns A through F)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!A:F" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh

# Read row range (rows 5-20, auto-detect columns)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!5:20" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_subset' \
  --mode full-refresh

# Read from row 5 to end (auto-detect columns and last row)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!5:" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_from_row5' \
  --mode full-refresh

# Read partial range (from A5 to column F, extends to last row)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!A5:F" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh

Using Replication

Running with Sling: sling run -r /path/to/replication.yaml

replication.yaml

source: LOCAL
target: MY_POSTGRES

defaults:
  mode: full-refresh

streams:
  # Read entire first sheet
  "file:///data/reports/quarterly.xlsx":
    object: public.quarterly_report

  # Read specific sheet by name
  "file:///data/reports/annual.xlsx":
    object: public.annual_summary
    source_options:
      sheet: "Summary"

  # Read specific cell range (useful for reports with headers/footers to skip)
  "file:///data/reports/financial.xlsx":
    object: public.financial_data
    source_options:
      sheet: "Q4 Results!B3:H50"

  # Read column range - all rows but only columns A through E
  "file:///data/reports/inventory.xlsx":
    object: public.inventory
    source_options:
      sheet: "Stock!A:E"

  # Read row range - rows 10-500, auto-detects columns with data
  # Useful when data starts after header rows or intro text
  "file:///data/reports/transactions.xlsx":
    object: public.transactions
    source_options:
      sheet: "Data!10:500"

  # Partial range - from row 5 to end, columns A through G
  # Great for skipping title rows while reading to the end
  "file:///data/reports/employees.xlsx":
    object: public.employees
    source_options:
      sheet: "Directory!A5:G"

  # Cast specific columns to ensure correct types
  "file:///data/reports/sales.xlsx":
    object: public.sales
    source_options:
      sheet: "Monthly!A:J"
    columns:
      sale_date: date
      amount: decimal
      quantity: integer

  # Multiple sheets from same file using wildcards
  "file:///data/reports/*.xlsx":
    object: public.{stream_file_name}
    source_options:
      sheet: "Sheet1!A:Z"

env:
  SLING_SAMPLE_SIZE: 2000  # Increase sample size for better type inference
  SLING_ROW_NUM_COLUMN: true  # Add row number column for tracking

Using Python

replication.py

from sling import Replication, ReplicationStream, SourceOptions, Mode
import os

os.environ['MY_POSTGRES'] = '...'

# Example 1: Basic Excel to database with different range types
replication = Replication(
    source='LOCAL',
    target='MY_POSTGRES',
    defaults={'mode': Mode.FULL_REFRESH},
    streams={
        # Entire sheet
        'file:///data/reports/data.xlsx': ReplicationStream(
            object='public.full_data',
            source_options=SourceOptions(sheet='Sheet1')
        ),

        # Specific cell range
        'file:///data/reports/financial.xlsx': ReplicationStream(
            object='public.financial',
            source_options=SourceOptions(sheet='Q4!B2:K100')
        ),

        # Column range (all rows)
        'file:///data/reports/inventory.xlsx': ReplicationStream(
            object='public.inventory',
            source_options=SourceOptions(sheet='Stock!A:F')
        ),

        # Row range (auto-detect columns)
        'file:///data/reports/logs.xlsx': ReplicationStream(
            object='public.logs',
            source_options=SourceOptions(sheet='Activity!15:1000')
        ),

        # Partial range (skip header rows)
        'file:///data/reports/contacts.xlsx': ReplicationStream(
            object='public.contacts',
            source_options=SourceOptions(sheet='People!A3:H')
        )
    },
    env={
        'SLING_SAMPLE_SIZE': '2000',
        'SLING_ROW_NUM_COLUMN': 'true'
    }
)

replication.run()

Excel to File (Parquet, CSV, JSON)

Convert Excel files to other formats like Parquet for efficient storage and querying.

Using CLI Flags

sling.sh

# Excel to Parquet
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Data!A:M" }' \
  --tgt-conn LOCAL \
  --tgt-object 'file:///output/report.parquet'

# Excel to CSV
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Summary!5:100" }' \
  --tgt-conn LOCAL \
  --tgt-object 'file:///output/summary.csv'

# Excel to JSON Lines
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Records" }' \
  --tgt-conn LOCAL \
  --tgt-object 'file:///output/records.jsonl'

# Excel to cloud storage (S3)
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Data!A2:Z" }' \
  --tgt-conn MY_S3 \
  --tgt-object 's3://my-bucket/exports/report.parquet'

Using Replication

replication.yaml

source: LOCAL
target: MY_S3

defaults:
  mode: full-refresh

streams:
  # Convert Excel sheets to Parquet files
  "file:///data/reports/sales_2024.xlsx":
    object: "s3://data-lake/sales/2024/data.parquet"
    source_options:
      sheet: "Transactions!A:P"

  # Extract specific rows to CSV
  "file:///data/reports/quarterly.xlsx":
    object: "s3://exports/quarterly_summary.csv"
    source_options:
      sheet: "Summary!2:50"
    target_options:
      format: csv

  # Multiple Excel files to Parquet
  "file:///data/reports/monthly_*.xlsx":
    object: "s3://data-lake/monthly/{stream_file_name}.parquet"
    source_options:
      sheet: "Data!A:K"

Using Python

replication.py

from sling import Replication, ReplicationStream, SourceOptions, TargetOptions, Mode

# Convert Excel to Parquet on S3
replication = Replication(
    source='LOCAL',
    target='MY_S3',
    defaults={'mode': Mode.FULL_REFRESH},
    streams={
        # Excel to Parquet
        'file:///data/large_dataset.xlsx': ReplicationStream(
            object='s3://analytics/datasets/data.parquet',
            source_options=SourceOptions(sheet='Raw!A:AA')
        ),

        # Excel to partitioned Parquet
        'file:///data/events.xlsx': ReplicationStream(
            object='s3://analytics/events/',
            source_options=SourceOptions(sheet='Events!5:10000'),
            target_options=TargetOptions(
                format='parquet',
                file_max_rows=100000
            )
        )
    }
)

replication.run()

Advanced Options

Password-Protected Excel Files

replication.yaml

source: LOCAL
target: MY_POSTGRES

streams:
  "file:///data/secure_report.xlsx":
    object: public.secure_data
    source_options:
      sheet: "Confidential!A:F"
      password: "${EXCEL_PASSWORD}"  # Use environment variable

Custom Date Patterns

replication.yaml

source: LOCAL
target: MY_POSTGRES

streams:
  "file:///data/international_dates.xlsx":
    object: public.date_data
    source_options:
      sheet: "Dates!A:D"
      short_date_pattern: "DD/MM/YYYY"
      long_date_pattern: "DD MMMM YYYY"
      long_time_pattern: "HH:mm:ss"

Type Casting

replication.yaml

source: LOCAL
target: MY_POSTGRES

streams:
  "file:///data/mixed_types.xlsx":
    object: public.typed_data
    source_options:
      sheet: "Data!A:J"
    columns:
      id: integer
      price: decimal
      created_at: timestamp
      is_active: bool
      metadata: json
      "*": string  # Cast remaining columns to string

Environment Variables

Variable

Description

Default

SLING_SAMPLE_SIZE

Number of rows to sample for type inference

900

SLING_ROW_NUM_COLUMN

Add _sling_row_num column with Excel row number

false

SLING_STREAM_URL_COLUMN

Add _sling_stream_url column with file path

false

Tips

Skip Header Rows: Use row ranges like 5:1000 or partial ranges like A5:F to skip title rows or headers that aren't your column names.
Auto-Detect Columns: Row-only ranges (10:500) automatically detect the rightmost column with data, useful for sheets where column count varies.
Memory Efficiency: Sling streams data row-by-row, so large Excel files are processed efficiently without loading the entire file into memory.
Type Inference: Increase SLING_SAMPLE_SIZE for better type detection on large files with varied data.
Multiple Sheets: Process multiple sheets by creating separate stream entries for each sheet in your replication config.

PreviousMultiple Files & Cross-Bucket NextDatabase to Database

Last updated 1 month ago

Was this helpful?

hashtagRange Syntax

hashtagExcel to Database

hashtagUsing CLI Flags

hashtagUsing Replication

hashtagUsing Python

hashtagExcel to File (Parquet, CSV, JSON)

hashtagUsing CLI Flags

hashtagUsing Replication

hashtagUsing Python

hashtagAdvanced Options

hashtagPassword-Protected Excel Files

hashtagCustom Date Patterns

hashtagType Casting

hashtagEnvironment Variables

hashtagTips

Range Syntax

Excel to Database

Using CLI Flags

Using Replication

Using Python

Excel to File (Parquet, CSV, JSON)

Using CLI Flags

Using Replication

Using Python

Advanced Options

Password-Protected Excel Files

Custom Date Patterns

Type Casting

Environment Variables

Tips