# Reading Excel

Sling supports reading Excel files (`.xlsx`) with flexible range selection options. You can extract data from specific sheets, cell ranges, row ranges, or column ranges.

## Range Syntax

The `sheet` source option supports several range formats:

| Format        | Example          | Description                              |
| ------------- | ---------------- | ---------------------------------------- |
| No sheet      | *(empty)*        | Uses first sheet in workbook             |
| Sheet only    | `Sheet1`         | Read entire sheet                        |
| Range only    | `!A:F`           | First sheet, columns A through F         |
| Cell range    | `Sheet1!A1:F100` | Standard cell range                      |
| Column range  | `Sheet1!A:F`     | All rows, columns A through F            |
| Row range     | `Sheet1!5:20`    | Rows 5-20, auto-detect columns with data |
| Row start     | `Sheet1!5:`      | From row 5 to end, auto-detect columns   |
| Partial range | `Sheet1!A5:F`    | From A5 to column F, last row            |

{% hint style="info" %}
When no sheet name is provided (or when using `!` prefix without a sheet name), Sling automatically uses the first sheet in the workbook.
{% endhint %}

## Excel to Database

### Using CLI Flags

{% code title="sling.sh" overflow="wrap" %}

```bash
# Read first sheet (no sheet option = uses first sheet)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.excel_data' \
  --mode full-refresh

# Read first sheet with column range (! prefix without sheet name)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "!A:F" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.excel_data' \
  --mode full-refresh

# Read specific sheet
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh

# Read specific cell range (A1 to F100)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!A1:F100" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh

# Read column range (all rows, columns A through F)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!A:F" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh

# Read row range (rows 5-20, auto-detect columns)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!5:20" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_subset' \
  --mode full-refresh

# Read from row 5 to end (auto-detect columns and last row)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!5:" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_from_row5' \
  --mode full-refresh

# Read partial range (from A5 to column F, extends to last row)
$ sling run --src-stream 'file:///path/to/data.xlsx' \
  --src-options '{ sheet: "Sales!A5:F" }' \
  --tgt-conn MY_POSTGRES \
  --tgt-object 'public.sales_data' \
  --mode full-refresh
```

{% endcode %}

### Using Replication

Running with Sling: `sling run -r /path/to/replication.yaml`

{% code title="replication.yaml" overflow="wrap" %}

```yaml
source: LOCAL
target: MY_POSTGRES

defaults:
  mode: full-refresh

streams:
  # Read entire first sheet
  "file:///data/reports/quarterly.xlsx":
    object: public.quarterly_report

  # Read specific sheet by name
  "file:///data/reports/annual.xlsx":
    object: public.annual_summary
    source_options:
      sheet: "Summary"

  # Read specific cell range (useful for reports with headers/footers to skip)
  "file:///data/reports/financial.xlsx":
    object: public.financial_data
    source_options:
      sheet: "Q4 Results!B3:H50"

  # Read column range - all rows but only columns A through E
  "file:///data/reports/inventory.xlsx":
    object: public.inventory
    source_options:
      sheet: "Stock!A:E"

  # Read row range - rows 10-500, auto-detects columns with data
  # Useful when data starts after header rows or intro text
  "file:///data/reports/transactions.xlsx":
    object: public.transactions
    source_options:
      sheet: "Data!10:500"

  # Partial range - from row 5 to end, columns A through G
  # Great for skipping title rows while reading to the end
  "file:///data/reports/employees.xlsx":
    object: public.employees
    source_options:
      sheet: "Directory!A5:G"

  # Cast specific columns to ensure correct types
  "file:///data/reports/sales.xlsx":
    object: public.sales
    source_options:
      sheet: "Monthly!A:J"
    columns:
      sale_date: date
      amount: decimal
      quantity: integer

  # Multiple sheets from same file using wildcards
  "file:///data/reports/*.xlsx":
    object: public.{stream_file_name}
    source_options:
      sheet: "Sheet1!A:Z"

env:
  SLING_SAMPLE_SIZE: 2000  # Increase sample size for better type inference
  SLING_ROW_NUM_COLUMN: true  # Add row number column for tracking
```

{% endcode %}

### Using Python

{% code title="replication.py" overflow="wrap" %}

```python
from sling import Replication, ReplicationStream, SourceOptions, Mode
import os

os.environ['MY_POSTGRES'] = '...'

# Example 1: Basic Excel to database with different range types
replication = Replication(
    source='LOCAL',
    target='MY_POSTGRES',
    defaults={'mode': Mode.FULL_REFRESH},
    streams={
        # Entire sheet
        'file:///data/reports/data.xlsx': ReplicationStream(
            object='public.full_data',
            source_options=SourceOptions(sheet='Sheet1')
        ),

        # Specific cell range
        'file:///data/reports/financial.xlsx': ReplicationStream(
            object='public.financial',
            source_options=SourceOptions(sheet='Q4!B2:K100')
        ),

        # Column range (all rows)
        'file:///data/reports/inventory.xlsx': ReplicationStream(
            object='public.inventory',
            source_options=SourceOptions(sheet='Stock!A:F')
        ),

        # Row range (auto-detect columns)
        'file:///data/reports/logs.xlsx': ReplicationStream(
            object='public.logs',
            source_options=SourceOptions(sheet='Activity!15:1000')
        ),

        # Partial range (skip header rows)
        'file:///data/reports/contacts.xlsx': ReplicationStream(
            object='public.contacts',
            source_options=SourceOptions(sheet='People!A3:H')
        )
    },
    env={
        'SLING_SAMPLE_SIZE': '2000',
        'SLING_ROW_NUM_COLUMN': 'true'
    }
)

replication.run()
```

{% endcode %}

## Excel to File (Parquet, CSV, JSON)

Convert Excel files to other formats like Parquet for efficient storage and querying.

### Using CLI Flags

{% code title="sling.sh" overflow="wrap" %}

```bash
# Excel to Parquet
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Data!A:M" }' \
  --tgt-conn LOCAL \
  --tgt-object 'file:///output/report.parquet'

# Excel to CSV
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Summary!5:100" }' \
  --tgt-conn LOCAL \
  --tgt-object 'file:///output/summary.csv'

# Excel to JSON Lines
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Records" }' \
  --tgt-conn LOCAL \
  --tgt-object 'file:///output/records.jsonl'

# Excel to cloud storage (S3)
$ sling run --src-stream 'file:///data/report.xlsx' \
  --src-options '{ sheet: "Data!A2:Z" }' \
  --tgt-conn MY_S3 \
  --tgt-object 's3://my-bucket/exports/report.parquet'
```

{% endcode %}

### Using Replication

{% code title="replication.yaml" overflow="wrap" %}

```yaml
source: LOCAL
target: MY_S3

defaults:
  mode: full-refresh

streams:
  # Convert Excel sheets to Parquet files
  "file:///data/reports/sales_2024.xlsx":
    object: "s3://data-lake/sales/2024/data.parquet"
    source_options:
      sheet: "Transactions!A:P"

  # Extract specific rows to CSV
  "file:///data/reports/quarterly.xlsx":
    object: "s3://exports/quarterly_summary.csv"
    source_options:
      sheet: "Summary!2:50"
    target_options:
      format: csv

  # Multiple Excel files to Parquet
  "file:///data/reports/monthly_*.xlsx":
    object: "s3://data-lake/monthly/{stream_file_name}.parquet"
    source_options:
      sheet: "Data!A:K"
```

{% endcode %}

### Using Python

{% code title="replication.py" overflow="wrap" %}

```python
from sling import Replication, ReplicationStream, SourceOptions, TargetOptions, Mode

# Convert Excel to Parquet on S3
replication = Replication(
    source='LOCAL',
    target='MY_S3',
    defaults={'mode': Mode.FULL_REFRESH},
    streams={
        # Excel to Parquet
        'file:///data/large_dataset.xlsx': ReplicationStream(
            object='s3://analytics/datasets/data.parquet',
            source_options=SourceOptions(sheet='Raw!A:AA')
        ),

        # Excel to partitioned Parquet
        'file:///data/events.xlsx': ReplicationStream(
            object='s3://analytics/events/',
            source_options=SourceOptions(sheet='Events!5:10000'),
            target_options=TargetOptions(
                format='parquet',
                file_max_rows=100000
            )
        )
    }
)

replication.run()
```

{% endcode %}

## Advanced Options

### Password-Protected Excel Files

{% code title="replication.yaml" %}

```yaml
source: LOCAL
target: MY_POSTGRES

streams:
  "file:///data/secure_report.xlsx":
    object: public.secure_data
    source_options:
      sheet: "Confidential!A:F"
      password: "${EXCEL_PASSWORD}"  # Use environment variable
```

{% endcode %}

### Custom Date Patterns

{% code title="replication.yaml" %}

```yaml
source: LOCAL
target: MY_POSTGRES

streams:
  "file:///data/international_dates.xlsx":
    object: public.date_data
    source_options:
      sheet: "Dates!A:D"
      short_date_pattern: "DD/MM/YYYY"
      long_date_pattern: "DD MMMM YYYY"
      long_time_pattern: "HH:mm:ss"
```

{% endcode %}

### Type Casting

{% code title="replication.yaml" %}

```yaml
source: LOCAL
target: MY_POSTGRES

streams:
  "file:///data/mixed_types.xlsx":
    object: public.typed_data
    source_options:
      sheet: "Data!A:J"
    columns:
      id: integer
      price: decimal
      created_at: timestamp
      is_active: bool
      metadata: json
      "*": string  # Cast remaining columns to string
```

{% endcode %}

## Environment Variables

| Variable                  | Description                                       | Default |
| ------------------------- | ------------------------------------------------- | ------- |
| `SLING_SAMPLE_SIZE`       | Number of rows to sample for type inference       | 900     |
| `SLING_ROW_NUM_COLUMN`    | Add `_sling_row_num` column with Excel row number | false   |
| `SLING_STREAM_URL_COLUMN` | Add `_sling_stream_url` column with file path     | false   |

## Tips

1. **Skip Header Rows**: Use row ranges like `5:1000` or partial ranges like `A5:F` to skip title rows or headers that aren't your column names.
2. **Auto-Detect Columns**: Row-only ranges (`10:500`) automatically detect the rightmost column with data, useful for sheets where column count varies.
3. **Memory Efficiency**: Sling streams data row-by-row, so large Excel files are processed efficiently without loading the entire file into memory.
4. **Type Inference**: Increase `SLING_SAMPLE_SIZE` for better type detection on large files with varied data.
5. **Multiple Sheets**: Process multiple sheets by creating separate stream entries for each sheet in your replication config.
