# Reading Excel

Sling supports reading Excel files (`.xlsx` and macro-enabled `.xlsm`) with flexible range selection options. You can extract data from specific sheets, cell ranges, row ranges, or column ranges.

{% hint style="info" %}
Macro-enabled workbooks (`.xlsm`) are read the same way as `.xlsx` files — Sling extracts the sheet data and ignores any embedded VBA macros. All range, sheet, and source options below apply equally to `.xlsm` files.
{% endhint %}

## Range Syntax

The `sheet` source option supports several range formats:

| Format        | Example          | Description                              |
| ------------- | ---------------- | ---------------------------------------- |
| No sheet      | *(empty)*        | Uses first sheet in workbook             |
| Sheet only    | `Sheet1`         | Read entire sheet                        |
| Range only    | `!A:F`           | First sheet, columns A through F         |
| Cell range    | `Sheet1!A1:F100` | Standard cell range                      |
| Column range  | `Sheet1!A:F`     | All rows, columns A through F            |
| Row range     | `Sheet1!5:20`    | Rows 5-20, auto-detect columns with data |
| Row start     | `Sheet1!5:`      | From row 5 to end, auto-detect columns   |
| Partial range | `Sheet1!A5:F`    | From A5 to column F, last row            |

{% hint style="info" %}
When no sheet name is provided (or when using `!` prefix without a sheet name), Sling automatically uses the first sheet in the workbook.
{% endhint %}

## Excel to Database

### Using CLI Flags

{% code title="sling.sh" overflow="wrap" %}

```bash
# Read first sheet (no sheet option = uses first sheet)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.excel_data' \
  --mode full-refresh

# Read first sheet with column range (! prefix without sheet name)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "!A:F" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.excel_data' \
  --mode full-refresh

# Read specific sheet
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh

# Read specific cell range (A1 to F100)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!A1:F100" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh

# Read column range (all rows, columns A through F)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!A:F" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh

# Read row range (rows 5-20, auto-detect columns)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!5:20" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_subset' \
  --mode full-refresh

# Read from row 5 to end (auto-detect columns and last row)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!5:" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_from_row5' \
  --mode full-refresh

# Read partial range (from A5 to column F, extends to last row)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!A5:F" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh
```

{% endcode %}

### Using Replication

Running with Sling: `sling run -r /path/to/replication.yaml`

{% code title="replication.yaml" overflow="wrap" %}

```yaml
source: LOCAL
target: MY_POSTGRES

defaults:
  mode: full-refresh

streams:
  # Read entire first sheet
  "file:///data/reports/quarterly.xlsx":
    object: public.quarterly_report

  # Read specific sheet by name
  "file:///data/reports/annual.xlsx":
    object: public.annual_summary
    source_options:
      sheet: "Summary"

  # Read specific cell range (useful for reports with headers/footers to skip)
  "file:///data/reports/financial.xlsx":
    object: public.financial_data
    source_options:
      sheet: "Q4 Results!B3:H50"

  # Read column range - all rows but only columns A through E
  "file:///data/reports/inventory.xlsx":
    object: public.inventory
    source_options:
      sheet: "Stock!A:E"

  # Read row range - rows 10-500, auto-detects columns with data
  # Useful when data starts after header rows or intro text
  "file:///data/reports/transactions.xlsx":
    object: public.transactions
    source_options:
      sheet: "Data!10:500"

  # Partial range - from row 5 to end, columns A through G
  # Great for skipping title rows while reading to the end
  "file:///data/reports/employees.xlsx":
    object: public.employees
    source_options:
      sheet: "Directory!A5:G"

  # Cast specific columns to ensure correct types
  "file:///data/reports/sales.xlsx":
    object: public.sales
    source_options:
      sheet: "Monthly!A:J"
    columns:
      sale_date: date
      amount: decimal
      quantity: integer

  # Multiple sheets from same file using wildcards
  "file:///data/reports/*.xlsx":
    object: public.{stream_file_name}
    source_options:
      sheet: "Sheet1!A:Z"

env:
  SLING_SAMPLE_SIZE: 2000  # Increase sample size for better type inference
  SLING_ROW_NUM_COLUMN: true  # Add row number column for tracking
```

{% endcode %}

### Using Python

{% code title="replication.py" overflow="wrap" %}

```python
from sling import Replication, ReplicationStream, SourceOptions, Mode
import os

os.environ['MY_POSTGRES'] = '...'

# Example 1: Basic Excel to database with different range types
replication = Replication(
    source='LOCAL',
    target='MY_POSTGRES',
    defaults={'mode': Mode.FULL_REFRESH},
    streams={
        # Entire sheet
        'file:///data/reports/data.xlsx': ReplicationStream(
            object='public.full_data',
            source_options=SourceOptions(sheet='Sheet1')
        ),

        # Specific cell range
        'file:///data/reports/financial.xlsx': ReplicationStream(
            object='public.financial',
            source_options=SourceOptions(sheet='Q4!B2:K100')
        ),

        # Column range (all rows)
        'file:///data/reports/inventory.xlsx': ReplicationStream(
            object='public.inventory',
            source_options=SourceOptions(sheet='Stock!A:F')
        ),

        # Row range (auto-detect columns)
        'file:///data/reports/logs.xlsx': ReplicationStream(
            object='public.logs',
            source_options=SourceOptions(sheet='Activity!15:1000')
        ),

        # Partial range (skip header rows)
        'file:///data/reports/contacts.xlsx': ReplicationStream(
            object='public.contacts',
            source_options=SourceOptions(sheet='People!A3:H')
        )
    },
    env={
        'SLING_SAMPLE_SIZE': '2000',
        'SLING_ROW_NUM_COLUMN': 'true'
    }
)

replication.run()
```

{% endcode %}

## Excel to File (Parquet, CSV, JSON)

Convert Excel files to other formats like Parquet for efficient storage and querying.

### Using CLI Flags

{% code title="sling.sh" overflow="wrap" %}

```bash
# Excel to Parquet
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Data!A:M" }' \
  --tgt-conn LOCAL \
  --tgt-object 'file:///output/report.parquet'

# Excel to CSV
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Summary!5:100" }' \
  --tgt-conn LOCAL \
  --tgt-object 'file:///output/summary.csv'

# Excel to JSON Lines
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Records" }' \
  --tgt-conn LOCAL \
  --tgt-object 'file:///output/records.jsonl'

# Excel to cloud storage (S3)
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Data!A2:Z" }' \
  --tgt-conn MY_S3 \
  --tgt-object 's3://my-bucket/exports/report.parquet'
```

{% endcode %}

### Using Replication

{% code title="replication.yaml" overflow="wrap" %}

```yaml
source: LOCAL
target: MY_S3

defaults:
  mode: full-refresh

streams:
  # Convert Excel sheets to Parquet files
  "file:///data/reports/sales_2024.xlsx":
    object: "s3://data-lake/sales/2024/data.parquet"
    source_options:
      sheet: "Transactions!A:P"

  # Extract specific rows to CSV
  "file:///data/reports/quarterly.xlsx":
    object: "s3://exports/quarterly_summary.csv"
    source_options:
      sheet: "Summary!2:50"
    target_options:
      format: csv

  # Multiple Excel files to Parquet
  "file:///data/reports/monthly_*.xlsx":
    object: "s3://data-lake/monthly/{stream_file_name}.parquet"
    source_options:
      sheet: "Data!A:K"
```

{% endcode %}

### Using Python

{% code title="replication.py" overflow="wrap" %}

```python
from sling import Replication, ReplicationStream, SourceOptions, TargetOptions, Mode

# Convert Excel to Parquet on S3
replication = Replication(
    source='LOCAL',
    target='MY_S3',
    defaults={'mode': Mode.FULL_REFRESH},
    streams={
        # Excel to Parquet
        'file:///data/large_dataset.xlsx': ReplicationStream(
            object='s3://analytics/datasets/data.parquet',
            source_options=SourceOptions(sheet='Raw!A:AA')
        ),

        # Excel to partitioned Parquet
        'file:///data/events.xlsx': ReplicationStream(
            object='s3://analytics/events/',
            source_options=SourceOptions(sheet='Events!5:10000'),
            target_options=TargetOptions(
                format='parquet',
                file_max_rows=100000
            )
        )
    }
)

replication.run()
```

{% endcode %}

## Advanced Options

### Password-Protected Excel Files

{% code title="replication.yaml" %}

```yaml
source: LOCAL
target: MY_POSTGRES

streams:
  "file:///data/secure_report.xlsx":
    object: public.secure_data
    source_options:
      sheet: "Confidential!A:F"
      password: "${EXCEL_PASSWORD}"  # Use environment variable
```

{% endcode %}

### Custom Date Patterns

{% code title="replication.yaml" %}

```yaml
source: LOCAL
target: MY_POSTGRES

streams:
  "file:///data/international_dates.xlsx":
    object: public.date_data
    source_options:
      sheet: "Dates!A:D"
      short_date_pattern: "DD/MM/YYYY"
      long_date_pattern: "DD MMMM YYYY"
      long_time_pattern: "HH:mm:ss"
```

{% endcode %}

### Type Casting

{% code title="replication.yaml" %}

```yaml
source: LOCAL
target: MY_POSTGRES

streams:
  "file:///data/mixed_types.xlsx":
    object: public.typed_data
    source_options:
      sheet: "Data!A:J"
    columns:
      id: integer
      price: decimal
      created_at: timestamp
      is_active: bool
      metadata: json
      "*": string  # Cast remaining columns to string
```

{% endcode %}

## Environment Variables

| Variable                  | Description                                       | Default |
| ------------------------- | ------------------------------------------------- | ------- |
| `SLING_SAMPLE_SIZE`       | Number of rows to sample for type inference       | 900     |
| `SLING_ROW_NUM_COLUMN`    | Add `_sling_row_num` column with Excel row number | false   |
| `SLING_STREAM_URL_COLUMN` | Add `_sling_stream_url` column with file path     | false   |

## Tips

1. **Skip Header Rows**: Use row ranges like `5:1000` or partial ranges like `A5:F` to skip title rows or headers that aren't your column names.
2. **Auto-Detect Columns**: Row-only ranges (`10:500`) automatically detect the rightmost column with data, useful for sheets where column count varies.
3. **Memory Efficiency**: Sling streams data row-by-row, so large Excel files are processed efficiently without loading the entire file into memory.
4. **Type Inference**: Increase `SLING_SAMPLE_SIZE` for better type detection on large files with varied data.
5. **Multiple Sheets**: Process multiple sheets by creating separate stream entries for each sheet in your replication config.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.slingdata.io/examples/file-to-database/reading-excel.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
