Data Platforms · 03 of 04 · Spark

The engine behind the dashboard you trust.

Spark powers most modern data work, including the warehouses that try not to admit it. We tune real Spark jobs at real scale — partitions, shuffles, broadcast joins, AQE, the unsexy levers that turn a 4-hour batch into 22 minutes.


180+ Spark jobs in production

−74% Avg. batch-runtime cut

4 OSS contributors on team

What we do with Spark

The boring levers, pulled.

Spark performance isn't a mystery. It's a finite list of levers — and a discipline to actually pull them.

Performance Tuning

Partition strategy, broadcast joins, AQE configuration, file-size hygiene. We measure, profile, fix — in that order.

AQE · Partitions · Broadcast
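As a rough illustration of what "AQE configuration" and broadcast tuning look like in practice, the sketch below collects a handful of standard Spark SQL properties into one baseline and renders them as `spark-submit` arguments. The property names are real Spark settings; the values and the `BASELINE_CONF` and `to_submit_args` names are assumptions made for this example, not settings from any job described here.

```python
# Sketch: a baseline for the levers named above. Keys are standard Spark
# configuration properties; values are illustrative starting points only
# and should be revisited after profiling a real job.
BASELINE_CONF = {
    # Adaptive Query Execution: re-plan at runtime using shuffle statistics.
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Broadcast tables below this size in bytes (assumed value: 64 MiB).
    "spark.sql.autoBroadcastJoinThreshold": str(64 * 1024 * 1024),
    # Initial shuffle partition count (assumed value).
    "spark.sql.shuffle.partitions": "400",
    # Target bytes per input partition when reading files (128 MiB).
    "spark.sql.files.maxPartitionBytes": str(128 * 1024 * 1024),
}


def to_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(BASELINE_CONF)))
```

Keeping the baseline in one dictionary makes the "measure, profile, fix" order easier to follow: a single value changes per run, and the diff shows which one.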

Structured Streaming

Sub-minute pipelines with exactly-once semantics, stateful streaming with watermarking, checkpointing that survives restarts.

Streaming · Watermarks · State

PySpark Engineering

PySpark code that's testable, type-hinted, package-able. Less notebook spaghetti, more production library.

PySpark · Pandas-on-Spark

Cluster Sizing & Spot

Right-sized clusters with autoscaling and spot pools, instance-type selection, dynamic allocation tuned to job shape.

Autoscale · Spot · Photon

Delta / Iceberg Tuning

OPTIMIZE, Z-ORDER, vacuum policy, liquid clustering. The table maintenance most teams skip until the queries hurt.

Delta · Iceberg · Z-ORDER
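For Delta tables, the maintenance named above comes down to a few SQL statements run on a schedule. The sketch below only builds those statements as strings. The function name, the table name, and the Z-ORDER column are hypothetical. The 168-hour default mirrors Delta's standard 7-day retention threshold. Iceberg uses different maintenance procedures and is not covered here.

```python
def maintenance_statements(table, zorder_cols, retain_hours=168):
    """Build Delta Lake maintenance SQL for one table.

    `table` and `zorder_cols` are hypothetical inputs for illustration.
    Returns the statements in the order they would usually run:
    compaction first, then cleanup of unreferenced files.
    """
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        optimize += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    vacuum = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    return [optimize, vacuum]


if __name__ == "__main__":
    # With a live session this would be: for s in stmts: spark.sql(s)
    for stmt in maintenance_statements("sales.bookings", ["customer_id"]):
        print(stmt)
```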

Spark on Kubernetes

Spark on EKS / GKE with the Spark Operator, history server, dynamic allocation. Cluster ops that the platform team owns, not the data team.

K8s · Operator · History Server

The Spark performance loop

Measure, profile, fix. Repeat.

A measured Spark job tells you exactly where it's slow. Our job is to listen, then pull the right lever.

The Spark UI tells you everything. Listen.

Most "slow Spark" is one of five things: skewed shuffle, small files, the wrong join hint, missing partition pruning, or undersized executors. We instrument the job, read the receipts, and pull one lever, not five.

1. Profile, don't guess
Spark UI, history server, query plans. The receipt is in the run, not the README.
2. One change per iteration
Apply one lever, re-run, measure. Two changes at once is a mystery, not a fix.
3. Maintenance, not heroics
OPTIMIZE and VACUUM scheduled. Most "performance regressions" are file-count drift.
4. Cost as a guardrail
Per-job cost reviewed weekly. Cluster size revisited when the workload shape changes.
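One way to turn "profile, don't guess" into a number is to compare the slowest task in a stage against the median task, using the per-task durations the Spark UI already shows. The sketch below assumes those durations have been copied out as a plain list of seconds. The function names and the threshold of 5 are assumptions for illustration, not a Spark API or a recommended cutoff.

```python
from statistics import median


def skew_ratio(task_seconds):
    """Return the slowest task's duration divided by the stage median."""
    if not task_seconds:
        raise ValueError("need at least one task duration")
    mid = median(task_seconds)
    slowest = max(task_seconds)
    if mid == 0:
        # Degenerate case: at least half the tasks took no measurable time.
        return 1.0 if slowest == 0 else float("inf")
    return slowest / mid


def looks_skewed(task_seconds, threshold=5.0):
    """Flag a stage whose slowest task is `threshold` times the median or more."""
    return skew_ratio(task_seconds) >= threshold


if __name__ == "__main__":
    even_stage = [10, 12, 11, 9, 13]
    skewed_stage = [10, 12, 11, 9, 240]  # one straggler task
    print(looks_skewed(even_stage), looks_skewed(skewed_stage))
```

A high ratio points at the skewed-shuffle lever. A flat ratio on a slow stage suggests looking at one of the other four causes instead.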

The Spark surface area

Spark runs somewhere. We've shipped it everywhere.

EMR, Databricks, Dataproc, on-prem Kubernetes. We meet you where the cluster is.

Languages & APIs

PySpark · Spark SQL · Scala Spark · Pandas-on-Spark

Runtimes

Databricks · EMR · Dataproc · Spark on K8s

Streaming

Structured Streaming · Auto Loader · Kafka · Kinesis

Table Formats

Delta Lake · Iceberg · Hudi · Parquet

Optimization

AQE · CBO · Z-ORDER · Liquid Clustering · Photon

Orchestration

Airflow · Dagster · Workflows · Argo

ML on Spark

MLlib · Pandas UDF · Petastorm · Horovod

Observability

Spark UI · History Server · Prometheus · Datadog

Recent Spark work

Hours into minutes. Dollars into cents.

Three quick takes from the last twelve months.

Travel · 8TB daily batch

4h12m batch down to 22m, with 92% lower spend.

Skew handling via AQE, broadcast join for the dimension, Z-ORDER on the join key. The shuffle went from 240GB to 18GB.

−92% Cost

11× Speed

AQE · Z-ORDER · Delta

IoT · 90k events/s streaming

Stateful streaming with exactly-once, no surprises.

Structured Streaming with RocksDB state, watermarks tuned to the actual late-arrival distribution, checkpoints to S3 with a tested restart path.

90k/s Sustained

0 Duplicates · 12 mo

Streaming · RocksDB · Kafka

Genomics · large-scale ML featurization

Pandas UDFs replaced a 14-day PySpark batch.

Vectorized UDFs with Arrow, Petastorm for ML hand-off, cluster right-sizing. Batch finished in under a day; scientists shipped weekly.

14× Faster

weekly Cadence

Pandas UDF · Arrow · Petastorm

Databricks

The lakehouse that wraps Spark with governance, ML and SQL.

dbt

SQL-first transformations that complement Spark for the modeling layer.

Spark, tuned by people who've broken it.

30 minutes. Bring your worst Spark UI screenshot. We'll point to the lever that'll pay for the call.
