Three Design Decisions Shape Data Lake ETL Performance

At CASABLANCA, we build hotel management software used by over 2,000 hospitality businesses across the DACH region. Our engineering mission is to deliver the data infrastructure that makes 5-star guest experiences possible — reliably, efficiently, and without overengineering.

The properties we work with are midsize operations. They generate a few gigabytes of operational data per day, not hundreds of gigabytes or terabytes. They do not run Spark clusters. They need pipelines that are fast, cheap, and straightforward to maintain. Two pain points come up consistently in practice. The first is the small file problem: time-based data partitioned too granularly produces thousands of tiny files that quietly degrade performance over time. The second is the hidden cost of blob storage: Azure bills per file operation rather than per byte, so a pipeline that writes 7,859 files costs dramatically more than one writing 285, even when the underlying data is identical.

This benchmark was built to quantify both. We tested three interacting decisions — partitioning strategy, framework choice, and concurrency level — because in practice these are made together, their effects compound, and the combination matters more than any single choice in isolation.

The specific task: converting raw JSON snapshots from the bronze layer into partitioned Parquet files on Azure Blob Storage.

Input: 20 daily operational snapshots from 9 tenants, covering roughly 2 years of time-series records — 124,666 records, 29.8 MB in total.

A quick note on bronze and silver

The bronze layer holds raw source data exactly as it arrives from source systems. The silver layer is the next step: data is cleaned, structured, and stored in a way that downstream queries and services can use efficiently. Bronze-to-silver conversion is where format, partitioning, and layout decisions get made — and those decisions compound over time.

What we tested

We tested four Python stacks for converting raw JSON into Parquet: built-in json as a baseline, orjson + PyArrow, orjson + Polars, and DuckDB. We compared two ways of partitioning the output files: daily (one file per tenant per day — 7,859 files total) and monthly (one file per tenant per month — 285 files total). On top of that, we varied the number of parallel upload and download workers to find where performance peaks and where adding more threads starts to hurt.

The input data consists of daily time-series records per tenant per room category. Each record captures a date, a room category identifier, and a set of daily availability and occupancy figures — roughly 10 fields per entry. The 20 snapshots span approximately 2 years of data across 9 tenants.
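To make that concrete, a single record might look like the following sketch. The field names here are hypothetical; the post only describes the shape (a date, a room category identifier, and roughly ten daily availability and occupancy figures):

```python
# Hypothetical example of one daily time-series record per tenant and
# room category. Field names are illustrative, not the actual schema.
record = {
    "date": "2024-03-15",
    "room_category_id": "DBL-STD",
    "rooms_total": 24,
    "rooms_available": 7,
    "rooms_occupied": 17,
    "arrivals": 5,
    "departures": 3,
    "overbookings": 0,
    "out_of_order": 1,
    "occupancy_rate": 0.708,
}

assert len(record) == 10  # roughly 10 fields per entry
```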

Everything was tested against Azure Blob Storage with zstd compression.

Tests ran on an 8-core AMD Ryzen 7 machine (16 logical processors) with 30 GB RAM running Windows 11. Resource consumption per framework (CPU and memory usage during conversion and upload) was not measured in this run and is planned as a follow-up — particularly relevant when evaluating worker counts relative to available CPU cores.

To show the range before we get into the details:

Same data: 124,666 records, 29.8 MB raw JSON input. Write and read times at 16 parallel workers.

| Configuration | Output size | Write time | Read time | Cost per run |
|---|---|---|---|---|
| Daily partitions, plain JSON (baseline) | 31 MB (+4% vs input) | 56 s | 27.8 s | €0.046 |
| Monthly partitions, PyArrow | 1.1 MB (−96%) | 3.3 s | 1.8 s | €0.0017 |
| Monthly partitions, Polars | 1.3 MB (−96%) | 3.9 s | 1.4 s | €0.0017 |

Monthly + PyArrow writes 17× faster and costs 27× less than the baseline. Monthly + Polars reads 20× faster. The data is identical in all three cases — only the configuration changed.

The rest of this post explains which decisions drove those differences and what that means in practice.

Partitioning granularity sets the structural baseline

Partitioning is the decision that affects everything downstream. Daily granularity produced 7,859 output files. Monthly produced 285. That difference alone shaped write speed, compression, read performance, and cloud cost — before framework or concurrency even enter the picture.

At daily granularity, the write pipeline was slow — not because of the framework, but because uploading thousands of individual small files to Azure Blob Storage is inherently expensive. The storage service charges per file operation, so 7,859 files means 7,859 separate upload calls. That upload overhead consumed the vast majority of total write time across all stacks, regardless of which library did the conversion. At monthly granularity, the same pipeline completed in 3.3–3.9 seconds. At daily granularity, it took 56–83 seconds depending on the stack — 14 to 22 times slower. The data was the same. Only the file count changed.
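The file-count gap falls directly out of the partition key. A stdlib-only sketch makes the mechanics visible; the tenant count and date range below are chosen to roughly match the benchmark's scale, though the exact benchmark figures (7,859 and 285) also depend on per-tenant history and room categories:

```python
from datetime import date, timedelta

def partition_key(tenant: str, d: date, granularity: str) -> str:
    # The key determines which output file a record lands in.
    if granularity == "daily":
        return f"{tenant}/{d:%Y-%m-%d}.parquet"
    return f"{tenant}/{d:%Y-%m}.parquet"        # monthly

tenants = [f"tenant_{i}" for i in range(9)]
days = [date(2023, 1, 1) + timedelta(days=i) for i in range(730)]  # ~2 years

daily   = {partition_key(t, d, "daily")   for t in tenants for d in days}
monthly = {partition_key(t, d, "monthly") for t in tenants for d in days}

print(len(daily), len(monthly))  # 6570 216 -- same records, ~30x fewer files monthly
```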

The compression story was equally stark. At daily granularity, two of our four methods actually produced output that was larger than the raw input — the pipeline ran, the conversion completed, and the result was more data, not less. At monthly granularity, those same stacks compressed the same records down to around 1.1–1.3 MB from 29.8 MB — a 95–96% reduction, achieved purely by changing how the files were laid out.

Cost followed the same pattern. At these data volumes, Azure Blob Storage cost is driven by the number of file operations, not by how much data you store. Fewer files means fewer operations means a lower bill — up to 27× lower in this benchmark for Parquet-based configurations.
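The arithmetic is easy to check against the benchmark's own numbers. Backing an effective per-operation price out of the daily-partition run and applying it to the monthly file count reproduces the reported monthly cost (this is an illustrative flat-rate model; real Azure pricing distinguishes operation types and tiers):

```python
# Illustrative: back out a per-operation price from the benchmark's
# daily-partition run, then apply it to the monthly file count.
daily_files, monthly_files = 7_859, 285
daily_cost_eur = 0.046                    # reported cost of the daily baseline run

price_per_op = daily_cost_eur / daily_files
monthly_cost_eur = price_per_op * monthly_files

print(f"{monthly_cost_eur:.4f}")                               # ~0.0017
print(f"{daily_files / monthly_files:.1f}x fewer operations")  # 27.6x
```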

If your pipeline cannot use monthly partitions, the framework and concurrency sections below are your most important levers.

The framework you choose matters most on reads

Once you are on monthly partitions, the three Parquet stacks write at roughly the same speed — all complete in 3.3–3.9 seconds at 16 workers, with the actual conversion taking just 1–2 seconds of that. At this scale, write differences between frameworks are small enough to be a secondary concern.

Reads are where the choice becomes meaningful. With 16 parallel workers on monthly partitions, Polars read all 124,666 records in 1.41 seconds. DuckDB completed in 1.61 seconds, PyArrow in 1.75 seconds. The gap is not random — Polars deserialises data internally at 4.2× the speed of PyArrow (2.96 MB/s vs 0.71 MB/s). That difference compounds as data volumes grow.

DuckDB is the right pick if storage footprint is the priority. It produced the smallest files at both granularities: 0.84 MB of total output on monthly partitions, a 97.2% reduction. Read performance was close to Polars at monthly scale, and the slightly higher conversion time (1.9 seconds versus 1.1–1.2 seconds for the others) is immaterial in practice.

At daily granularity, framework selection has less impact on write speed because the upload bottleneck dominates regardless of which library processes the data. DuckDB's compression advantage still holds and matters for storage costs, but no stack change will meaningfully shorten the write time when 7,859 individual file uploads are the constraint.

On monthly partitions: use Polars for the fastest reads, PyArrow if your pipeline is more write-heavy, or DuckDB if minimising file size is the priority.
On daily partitions: use DuckDB for the best compression. Do not expect framework choice to significantly change your write time — the bottleneck is file count, not the library.

More parallel workers is not always better

Adding more parallel workers improves performance up to a point — then it starts to hurt. The benchmark showed this consistently across write and read tests, and the optimal setting was never the highest one we tested.

For writes at daily granularity, 32 workers was the sweet spot. PyArrow dropped from 71.2 seconds at 16 workers to 53.9 seconds at 32. Polars went from 64.3 to 53.2 seconds. At 64 workers, both got slower again — 76.2 and 81.5 seconds respectively. Beyond a certain point, the overhead of managing connections outweighs the benefit of additional parallelism.
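In these tests, the worker count corresponds to the size of a thread pool wrapped around the per-file upload calls. A stdlib sketch with a stubbed upload function (the real pipeline would call the Azure Blob SDK at that point):

```python
from concurrent.futures import ThreadPoolExecutor

def upload_blob(path: str) -> str:
    # Stub: the real pipeline would call the Azure Blob Storage SDK here.
    # Each file is one billable operation, regardless of its size.
    return path

def upload_all(paths: list[str], max_workers: int) -> int:
    # One task per file: with 7,859 daily files there is plenty of work to
    # spread across 32 threads; with 285 monthly files, far less.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(upload_blob, paths))
    return len(results)

print(upload_all([f"file_{i}.parquet" for i in range(285)], max_workers=16))  # 285
```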

For writes at monthly granularity, concurrency differences were small. The full range across all methods and worker counts was 3.0–3.9 seconds, so tuning matters less here. The best single result was PyArrow at 32 workers, completing in 3.1 seconds — but the gain over 16 workers is marginal.

For reads at monthly granularity, 16 workers was the clear sweet spot. With only 285 files to fetch, scaling beyond that added coordination cost with little return — 64 workers was slower than 16 in every monthly read test we ran.

The underlying reason is the same in all cases: the right concurrency setting depends on how many files your pipeline is working with. A setting that is optimal for 285 files will behave very differently on 7,859. Concurrency should follow your storage layout, not be configured independently of it.

Target 32 workers for writes and 16 workers for reads. 64 workers was strictly worse in every test we ran at both granularities.

The best result comes from getting all three right

Each of the three decisions has independent value — but they also interact, and the full gain only shows up when all three align.

The clearest illustration is on reads. Monthly partitions with Polars at 16 workers read all 124,666 records in 1.41 seconds. The weakest configuration — daily partitions, plain JSON, same 16 workers — took 27.8 seconds for identical data. That is a 20× difference from two decisions: partition granularity reduced the file count from 7,859 to 285, and Polars added its own deserialisation speed advantage on top. Neither change alone gets you there.

The same logic applies to writes. PyArrow on monthly partitions completes in 3.3 seconds. The same stack on daily partitions takes 71.2 seconds — 21× slower, with framework and concurrency held constant. That is purely the cost of 7,859 upload operations versus 285.

And concurrency has its own independent contribution. At daily granularity, moving from 16 to 32 upload workers cuts PyArrow write time from 71.2 to 53.9 seconds — without touching the stack or the storage layout. A meaningful gain from one configuration change.

No single lever tells the whole story. The sections above cover each dimension separately, but in practice they should be considered together.

One thing to take away

These three decisions — partitioning, framework, concurrency — are usually optimised separately, in whatever order a team happens to encounter them. This benchmark suggests they are worth examining together. The interaction between them is where most of the performance and cost difference actually lives.

If you have run similar benchmarks and hit different results, or tested conditions we did not cover, we would love to compare notes — we are always happy to exchange findings with anyone working on similar problems.