Performance Tuning
Partition strategy, broadcast joins, AQE configuration, file-size hygiene. We measure, profile, fix — in that order.
AQE · Partitions · Broadcast
Spark powers most modern data work, including the warehouses that try not to admit it. We tune real Spark jobs at real scale: partitions, shuffles, broadcast joins, AQE, the unsexy levers that turn a 4-hour batch into 22 minutes.
Spark performance isn't a mystery. It's a finite list of levers, and the discipline to actually pull them.
Sub-minute pipelines with exactly-once semantics, stateful streaming with watermarking, checkpointing that survives restart.
Streaming · Watermarks · State
PySpark code that's testable, type-hinted, package-able. Less notebook spaghetti, more production library.
PySpark · Pandas-on-Spark
Right-sized clusters with autoscaling and spot pools, instance-type selection, dynamic allocation tuned to job shape.
Autoscale · Spot · Photon
OPTIMIZE, Z-ORDER, vacuum policy, liquid clustering. The table maintenance most teams skip until the queries hurt.
Delta · Iceberg · Z-ORDER
Spark on EKS / GKE with the Spark Operator, history server, dynamic allocation. Cluster ops that the platform team owns, not the data team.
K8s · Operator · History Server
A measured Spark job tells you exactly where it's slow. Our job is to listen, then pull the right lever.
Most "slow Spark" is one of five things: skewed shuffle, small files, the wrong join hint, missing partition pruning, or undersized executors. We instrument the job, read the receipts, and pull one lever, not five.
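To make the first item on that list concrete, here is a minimal sketch of a skew check over per-partition shuffle-read sizes, the kind of numbers you can read off a stage in the Spark UI. It mirrors the rule AQE uses for skew joins (larger than a factor times the median, and larger than an absolute threshold), with defaults taken from Spark 3.x. The function name and the sample sizes are hypothetical.

```python
from statistics import median

MB = 1024 * 1024

# Defaults mirror the Spark 3.x AQE skew-join settings:
#   spark.sql.adaptive.skewJoin.skewedPartitionFactor           (5)
#   spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes (256MB)
SKEW_FACTOR = 5.0
SKEW_THRESHOLD_BYTES = 256 * MB


def skewed_partitions(sizes_bytes, factor=SKEW_FACTOR, threshold=SKEW_THRESHOLD_BYTES):
    """Return indices of partitions that exceed both factor * median and threshold."""
    if not sizes_bytes:
        return []
    med = median(sizes_bytes)
    return [
        i
        for i, size in enumerate(sizes_bytes)
        if size > factor * med and size > threshold
    ]


# Hypothetical shuffle-read sizes for one stage, in bytes.
sizes = [64 * MB, 70 * MB, 58 * MB, 66 * MB, 2400 * MB, 61 * MB]
print(skewed_partitions(sizes))  # → [4]
```

One flagged partition out of six is the classic shape of a hot join key. An empty result suggests the slowness is likely one of the other four causes.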
Spark UI, history server, query plans. The receipt is in the run, not the README.
Apply one lever, re-run, measure. Two changes at once is a mystery, not a fix.
OPTIMIZE and VACUUM scheduled. Most "performance regressions" are file-count drift.
Per-job cost reviewed weekly. Cluster size revisited when the workload shape changes.
EMR, Databricks, Dataproc, on-prem Kubernetes. We meet you where the cluster is.
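The "one lever, re-run, measure" step can be sketched as a small generator that produces one candidate configuration per lever, each differing from the baseline in exactly one setting. The configuration keys are real Spark SQL options; the values, the `BASELINE` and `LEVERS` names, and the helper itself are illustrative assumptions, not recommendations.

```python
# Baseline session settings. Keys are real Spark SQL options;
# values are illustrative only.
BASELINE = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "false",
    "spark.sql.adaptive.coalescePartitions.enabled": "false",
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
    "spark.sql.shuffle.partitions": "200",
}

# Each lever is a single (key, new_value) change to try in isolation.
LEVERS = [
    ("spark.sql.adaptive.skewJoin.enabled", "true"),
    ("spark.sql.adaptive.coalescePartitions.enabled", "true"),
    ("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024)),
]


def one_lever_runs(baseline, levers):
    """Yield (changed_key, config) pairs that differ from baseline in one key."""
    for key, value in levers:
        if key not in baseline:
            raise KeyError(f"lever {key!r} is not in the baseline")
        if baseline[key] == value:
            continue  # not a change, so there is nothing to measure
        config = dict(baseline)  # copy so the baseline is never mutated
        config[key] = value
        yield key, config


for changed, config in one_lever_runs(BASELINE, LEVERS):
    print(changed, "->", config[changed])
```

In a real session, each candidate would be applied with `spark.conf.set(key, value)` before re-running the job, and the run time compared against the baseline run.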
Three quick takes from the last twelve months.
Skew handling via AQE, broadcast join for the dimension, Z-ORDER on the join key. The shuffle went from 240GB to 18GB.
Structured Streaming with RocksDB state, watermarks tuned to actual late arrival distribution, checkpoint to S3 with a tested restart path.
Vectorized UDFs with Arrow, Petastorm for ML hand-off, cluster right-sizing. Batch finished in under a day; scientists shipped weekly.
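For the streaming take above, "watermarks tuned to actual late arrival distribution" can be sketched as a percentile over measured lateness. This is a minimal illustration, assuming you have already collected lateness samples (arrival time minus event time); the function name, the coverage target, and the sample values are hypothetical.

```python
import math


def watermark_delay_seconds(lateness_seconds, coverage=0.99):
    """Smallest observed lateness covering `coverage` of events (nearest-rank percentile)."""
    if not lateness_seconds:
        raise ValueError("need at least one lateness sample")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(lateness_seconds)
    rank = math.ceil(coverage * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


# Hypothetical samples: seconds between event time and arrival time.
samples = [2, 3, 3, 4, 5, 5, 6, 8, 12, 45, 90, 600]
delay = watermark_delay_seconds(samples, coverage=0.9)
print(f"{delay} seconds")  # → 90 seconds
```

The resulting string is the shape PySpark expects as the delay threshold in `df.withWatermark("event_time", "90 seconds")`, where `event_time` is a placeholder column name. A higher coverage target drops fewer late events but keeps more state around, so the trade-off is worth measuring rather than guessing.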
30 minutes. Bring your worst Spark UI screenshot. We'll point to the lever that'll pay for the call.