Data Platforms · 03 of 04 · Spark

The engine behind the dashboard you trust.

Spark powers most modern data work, including the warehouses that try not to admit it. We tune real Spark jobs at real scale — partitions, shuffles, broadcast joins, AQE, the unsexy levers that turn a 4-hour batch into 22 minutes.

180+ Spark jobs in production
−74% Avg. batch-runtime cut
4 OSS contributors on team
What we do with Spark

The boring levers, pulled.

Spark performance isn't a mystery. It's a finite list of levers — and a discipline to actually pull them.

Performance Tuning

Partition strategy, broadcast joins, AQE configuration, file-size hygiene. We measure, profile, fix — in that order.
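File-size hygiene mostly comes down to arithmetic. A minimal sketch, assuming a target of roughly 128 MiB per output file (a common rule of thumb, not a figure from any specific engagement); the function name and the 8 GiB example are hypothetical.

```python
import math

MB = 1024 * 1024


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * MB) -> int:
    """Return how many output files keep each file near target_file_bytes."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative, and the target must be positive")
    # Always write at least one file, even for an empty or tiny output.
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical example: 8 GiB of output at ~128 MiB per file.
n = target_file_count(8 * 1024 * MB)
print(n)  # 64
```

With a DataFrame in scope, the count would typically feed `df.repartition(n)` just before the write. Measure the output size first; a guessed size gives a guessed file count.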

AQE · Partitions · Broadcast

Structured Streaming

Sub-minute pipelines with exactly-once semantics, stateful streaming with watermarking, checkpointing that survives restart.

Streaming · Watermarks · State

PySpark Engineering

PySpark code that's testable, type-hinted, package-able. Less notebook spaghetti, more production library.

PySpark · Pandas-on-Spark

Cluster Sizing & Spot

Right-sized clusters with autoscaling and spot pools, instance-type selection, dynamic allocation tuned to job shape.
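Dynamic allocation is a handful of settings. A minimal sketch of the config we would start from, assuming Spark 3.x with shuffle tracking rather than an external shuffle service; the executor bounds (2 and 40) and the helper name are hypothetical and should come from the job's measured shape.

```python
from typing import Dict


def dynamic_allocation_conf(
    min_executors: int, max_executors: int, idle_timeout_s: int = 60
) -> Dict[str, str]:
    """Build Spark dynamic-allocation settings as a plain key/value mapping."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Lets executors scale down without an external shuffle service.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }


# Hypothetical bounds; render them as spark-submit flags.
conf = dynamic_allocation_conf(min_executors=2, max_executors=40)
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Keeping the settings in a plain mapping makes them easy to diff between iterations, which matters when only one lever should change per run.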

Autoscale · Spot · Photon

Delta / Iceberg Tuning

OPTIMIZE, Z-ORDER, vacuum policy, liquid clustering. The table maintenance most teams skip until the queries hurt.
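Scheduled maintenance can be as small as two statements per table. A minimal sketch for Delta tables, assuming a hypothetical table named `events` and the 168-hour (7-day) retention that is Delta's default; Iceberg exposes compaction through its own procedures, so this does not carry over as-is.

```python
from typing import List, Sequence


def optimize_sql(table: str, zorder_cols: Sequence[str] = ()) -> str:
    """Build a Delta OPTIMIZE statement, with ZORDER BY when columns are given."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += f" ZORDER BY ({', '.join(zorder_cols)})"
    return stmt


def vacuum_sql(table: str, retain_hours: int = 168) -> str:
    """Build a Delta VACUUM statement with an explicit retention window."""
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"


def maintenance_statements(
    table: str, zorder_cols: Sequence[str] = (), retain_hours: int = 168
) -> List[str]:
    """Compact first, then vacuum. Identifiers are assumed trusted, not user input."""
    return [optimize_sql(table, zorder_cols), vacuum_sql(table, retain_hours)]


# Hypothetical table name; the column names are illustrative.
for stmt in maintenance_statements("events", ["user_id", "event_ts"]):
    print(stmt)
```

In a scheduled job, each statement would be passed to `spark.sql(stmt)`.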

Delta · Iceberg · Z-ORDER

Spark on Kubernetes

Spark on EKS / GKE with the Spark Operator, history server, dynamic allocation. Cluster ops that the platform team owns, not the data team.

K8s · Operator · History Server
The Spark performance loop

Measure, profile, fix. Repeat.

A measured Spark job tells you exactly where it's slow. Our job is to listen, then pull the right lever.

ETY · SPARK TUNING LOOP · job: enrich_events · 3 iterations

Iteration 01 · baseline
4h 12m · skewed shuffle · 3,800 small files
cost · $182

Iteration 02 · AQE + broadcast users
1h 24m · skew resolved · 280 tasks
cost · $61 · −66%

Iteration 03 · Z-ORDER + file compaction
22m · z-order on join key · 64 files
cost · $14 · −92%

LEVERS PULLED
· spark.sql.adaptive.enabled · spark.sql.adaptive.skewJoin.enabled
· broadcast(users) where size < 256MB
· OPTIMIZE … ZORDER BY (user_id, event_ts)
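The AQE levers are plain configuration. A minimal sketch, assuming an existing SparkSession is passed in as `spark`. Raising `spark.sql.autoBroadcastJoinThreshold` to 256 MB is our assumption for illustration; the job above used an explicit broadcast hint, and the coalesce setting is an extra not listed among the levers.

```python
from typing import Dict

MB = 1024 * 1024


def aqe_conf(broadcast_threshold_mb: int = 256) -> Dict[str, str]:
    """Build adaptive-query-execution settings as a key/value mapping."""
    if broadcast_threshold_mb < 0:
        raise ValueError("broadcast_threshold_mb must be non-negative")
    return {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.skewJoin.enabled": "true",
        # Assumption: also coalesce small shuffle partitions after each stage.
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # The threshold is expressed in bytes.
        "spark.sql.autoBroadcastJoinThreshold": str(broadcast_threshold_mb * MB),
    }


def apply_conf(spark, conf: Dict[str, str]) -> None:
    """Apply the settings to a live session via its runtime config."""
    for key, value in conf.items():
        spark.conf.set(key, value)


conf = aqe_conf()
print(conf["spark.sql.autoBroadcastJoinThreshold"])  # 268435456
```

An explicit hint such as `events.join(broadcast(users), "user_id")`, with `broadcast` imported from `pyspark.sql.functions`, is the alternative when only one dimension table should be broadcast.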

The Spark UI tells you everything. Listen.

Most "slow Spark" is one of five things: skewed shuffle, small files, the wrong join hint, missing partition pruning, or undersized executors. We instrument the job, read the receipts, and pull one lever, not five.

  • 1
    Profile, don't guess

    Spark UI, history server, query plans. The receipt is in the run, not the README.

  • 2
    One change per iteration

    Apply one lever, re-run, measure. Two changes at once is a mystery, not a fix.

  • 3
    Maintenance, not heroics

    OPTIMIZE and VACUUM scheduled. Most "performance regressions" are file-count drift.

  • 4
    Cost as a guardrail

    Per-job cost reviewed weekly. Cluster size revisited when the workload shape changes.
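"Profile, don't guess" can start with one number per stage. A minimal sketch of a skew check on task durations copied from the Spark UI, assuming a slowest-to-median ratio is a fair first signal; the 5x threshold and the sample durations are hypothetical.

```python
from statistics import median
from typing import Sequence


def skew_ratio(task_seconds: Sequence[float]) -> float:
    """Return the slowest task's duration divided by the median task duration."""
    if not task_seconds:
        raise ValueError("need at least one task duration")
    mid = median(task_seconds)
    if mid <= 0:
        raise ValueError("median duration must be positive")
    return max(task_seconds) / mid


def looks_skewed(task_seconds: Sequence[float], threshold: float = 5.0) -> bool:
    """Flag a stage whose slowest task runs far longer than a typical one."""
    return skew_ratio(task_seconds) >= threshold


# Hypothetical stage: four ordinary tasks and one straggler.
durations = [12, 11, 13, 12, 240]
print(skew_ratio(durations))  # 20.0
print(looks_skewed(durations))  # True
```

A high ratio points at skew handling. A low ratio with a slow stage points elsewhere: file counts, join strategy, or executor sizing.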

The Spark surface area

Spark runs somewhere. We've shipped it everywhere.

EMR, Databricks, Dataproc, on-prem Kubernetes. We meet you where the cluster is.

Languages & APIs

PySpark · Spark SQL · Scala Spark · Pandas-on-Spark

Runtimes

Databricks · EMR · Dataproc · Spark on K8s

Streaming

Structured Streaming · Auto Loader · Kafka · Kinesis

Table Formats

Delta Lake · Iceberg · Hudi · Parquet

Optimization

AQE · CBO · Z-ORDER · Liquid Clustering · Photon

Orchestration

Airflow · Dagster · Workflows · Argo

ML on Spark

MLlib · Pandas UDF · Petastorm · Horovod

Observability

Spark UI · History Server · Prometheus · Datadog
Recent Spark work

Hours into minutes. Dollars into cents.

Three quick takes from the last twelve months.

Travel · 8TB daily batch

4h12m batch down to 22m, with 92% lower spend.

Skew handling via AQE, broadcast join for the dimension, Z-ORDER on the join key. The shuffle went from 240GB to 18GB.

−92% Cost
11× Speed
AQE · Z-ORDER · Delta
IoT · 90k events/s streaming

Stateful streaming with exactly-once, no surprises.

Structured Streaming with RocksDB state, watermarks tuned to the actual late-arrival distribution, checkpoints to S3 with a tested restart path.
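Tuning a watermark to the late-arrival distribution means picking a delay from measured lateness, not from habit. A minimal sketch, assuming lateness is arrival time minus event time in seconds and using a nearest-rank percentile; the sample values and the 90% coverage target are hypothetical.

```python
import math
from typing import Sequence


def watermark_delay_seconds(
    lateness_seconds: Sequence[float], coverage: float = 0.99
) -> int:
    """Return the smallest whole-second delay covering `coverage` of observed events."""
    if not lateness_seconds:
        raise ValueError("need at least one lateness observation")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    # Events that arrive early count as zero lateness.
    ordered = sorted(max(0.0, s) for s in lateness_seconds)
    # Nearest-rank percentile; the small epsilon guards against float round-up.
    rank = max(1, math.ceil(coverage * len(ordered) - 1e-9))
    return math.ceil(ordered[rank - 1])


# Hypothetical lateness sample with one extreme straggler.
sample = [1, 2, 2, 3, 5, 8, 13, 21, 34, 600]
delay = watermark_delay_seconds(sample, coverage=0.9)
print(f"{delay} seconds")  # 34 seconds
```

The resulting string is the kind of value passed as the delay threshold to `withWatermark`. The trade-off is explicit: higher coverage keeps more late events but holds state longer.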

90k/s Sustained
0 Duplicates · 12 mo
Streaming · RocksDB · Kafka
Genomics · large-scale ML featurization

Pandas UDFs replaced a 14-day PySpark batch.

Vectorized UDFs with Arrow, Petastorm for ML hand-off, cluster right-sizing. Batch finished in under a day; scientists shipped weekly.

14× Faster
weekly Cadence
Pandas UDF · Arrow · Petastorm

Spark, tuned by people who've broken it.

30 minutes. Bring your worst Spark UI screenshot. We'll point to the lever that'll pay for the call.