Modern data platforms are expected to process massive volumes of data quickly, reliably, and at scale. From application logs and event streams to machine learning feature pipelines, these workloads often exceed the limits of a single machine. Apache Spark was designed to solve this problem by distributing computation across clusters, while Kubernetes provides a flexible and cloud-native way to run those clusters. Together, Spark and Kubernetes form a powerful foundation for large-scale data processing in modern cloud environments, including Amazon EKS.

This article walks through Apache Spark fundamentals, its execution model, and how to run scalable Spark jobs on Kubernetes using Amazon EKS.

What is Apache Spark?

Apache Spark is an open-source, distributed data processing engine built to process large volumes of data efficiently. It enables parallel execution of complex workloads, such as transformations, joins, and aggregations, across many machines.

Spark is commonly used for:

  • Large-scale ETL pipelines
  • Batch analytics
  • Machine learning workloads
  • Stream processing

Its main advantages are scalability and speed: Spark processes data in memory when possible and distributes work across multiple executors.

How does Spark work?

Spark follows a driver–executor architecture:

  • Driver
    • Coordinates the application
    • Builds the execution plan
    • Schedules tasks
  • Executors
    • Run tasks in parallel
    • Process partitions of data
    • Can be added or removed during execution (dynamic allocation)

Execution flow:

  1. A Spark job is submitted
  2. The driver creates a logical execution plan
  3. The plan is split into stages and tasks
  4. Executors process data in parallel
  5. Results are aggregated and written to storage

This model allows Spark to handle datasets that are far too large for a single machine.

A real-world use case: Large-scale log processing

A high-traffic application can generate hundreds of GBs or even TBs of logs per day.
Data teams need to clean, enrich, and aggregate this data to produce analytics and reports.

Spark is a good fit because it:

  • Splits data into partitions and processes them in parallel
  • Efficiently handles large joins and aggregations
  • Scales with data size

What might take hours on a single server can finish in minutes with Spark.

Why run Spark on Kubernetes?

Traditionally, Spark ran on YARN or dedicated clusters. Today, Kubernetes is a popular runtime because it provides:

  • Elastic scaling per job
  • Strong isolation between workloads
  • Unified infrastructure for all applications
  • Native integration with cloud IAM, networking, and autoscaling

Running Spark on Kubernetes lets you treat Spark jobs like any other containerized workload—scaling up when needed and tearing everything down when the job completes.

Tailoring Spark on Kubernetes for Amazon EKS

When running Spark on Amazon EKS, a few AWS-specific integrations make the setup more secure and production-ready.

Identity and access
IAM Roles for Service Accounts (IRSA) are commonly used so Spark drivers and executors can access AWS services such as S3 without static credentials.

Container images
Spark images are usually stored in Amazon ECR. Production setups often rely on custom images that include Hadoop AWS libraries and application-specific dependencies.

Autoscaling and compute
Spark workloads benefit from dynamic scaling:

  • Executors scale per job
  • Node capacity scales via Cluster Autoscaler or Karpenter
  • Spot instances are often used for executors to reduce cost

Observability
Drivers and executors run as standard pods and integrate naturally with CloudWatch, Prometheus, and cluster-level logging.
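As an illustration, these EKS-specific pieces often surface as spark-submit flags like the following. This is a sketch, not a complete command: the account ID, region, bucket, and endpoint are placeholders, the "spark" service account is assumed to be annotated with an IRSA role, the image is assumed to bundle hadoop-aws, and the capacity-type selector applies only if Karpenter manages the nodes.

```shell
spark-submit \
  --master k8s://https://<KUBERNETES_API_ENDPOINT> \
  --deploy-mode cluster \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/spark:3.5.1 \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.kubernetes.executor.node.selector.karpenter.sh/capacity-type=spot \
  s3a://<BUCKET>/jobs/etl.py
```

The S3A credentials provider line is what lets the pods pick up the IRSA web-identity token instead of static keys; the executor node selector steers only executors (not the driver) onto Spot capacity.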

How to actually run a Spark job on Kubernetes

At a high level, running Spark on Kubernetes looks like this:

  1. Spark runs inside containers
  2. The driver runs as a Kubernetes pod
  3. Executors are created dynamically as separate pods
  4. Kubernetes schedules and manages all resources

Before jumping into commands, it helps to understand the runtime architecture.

Architecture: Spark on Kubernetes (simplified)

A typical Spark-on-Kubernetes flow looks like this:

  1. spark-submit: A user or CI system submits a Spark job using spark-submit.
  2. Kubernetes API server: The submission is sent to the Kubernetes API, which creates the Spark driver pod.
  3. Spark driver pod: The driver:
    • Builds the execution plan
    • Requests executor pods from Kubernetes
    • Coordinates task execution
  4. Kubernetes scheduler: The scheduler places executor pods on available nodes based on CPU and memory requirements.
  5. Spark executor pods: Executors run tasks in parallel, processing data partitions and reporting results back to the driver.
  6. Job completion: When the job finishes, executor pods terminate automatically, followed by the driver pod.

This clean mapping between Spark components and Kubernetes primitives is what makes Spark on Kubernetes easy to operate and scale.

Minimal hands-on example: Spark on Kubernetes

1. Create namespace and permissions (YAML)


# spark-setup.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: spark
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: spark
rules:
  - apiGroups: [""]
    # configmaps included: Spark 3.x drivers create per-executor ConfigMaps
    resources: ["pods", "pods/log", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-rolebinding
  namespace: spark
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io

Apply it:

kubectl apply -f spark-setup.yaml

2. Submit a Spark job


spark-submit \
  --master k8s://https://<KUBERNETES_API_ENDPOINT> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=apache/spark:3.5.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar

This creates:

  • One driver pod
  • Multiple executor pods
  • Automatic cleanup when the job finishes
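While the job runs, the driver and executors can be inspected like any other pods; Spark itself labels them (spark-role=driver and spark-role=executor), which makes them easy to select:

```shell
# Watch driver and executor pods appear and terminate
kubectl get pods -n spark -w

# Follow the driver log, where job progress and the final result are printed
kubectl logs -n spark -l spark-role=driver -f
```

Because the pods are labeled rather than predictably named, the label selector keeps working across job reruns.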

Apache Spark at a glance

Apache Spark is a powerful engine for large-scale data processing, and Kubernetes provides a modern, flexible runtime for running Spark jobs. On Amazon EKS, tight integration with IAM, autoscaling, and object storage makes Spark both scalable and cost-efficient.

With this foundation, teams can extend the setup with production features such as S3 integration, dynamic executor allocation, Spot usage, and monitoring—while keeping Spark workloads fully cloud-native.