Modern data platforms are expected to process massive volumes of data quickly, reliably, and at scale. From application logs and event streams to machine learning feature pipelines, these workloads often exceed the limits of a single machine. Apache Spark was designed to solve this problem by distributing computation across clusters, while Kubernetes provides a flexible and cloud-native way to run those clusters. Together, Spark and Kubernetes form a powerful foundation for large-scale data processing in modern cloud environments, including Amazon EKS.
This article walks through Apache Spark fundamentals, its execution model, and how to run scalable Spark jobs on Kubernetes using Amazon EKS.
What is Apache Spark?
Apache Spark is an open-source, distributed data processing engine built to process large volumes of data efficiently. It enables parallel execution of complex workloads, such as transformations, joins, and aggregations, across many machines.
Spark is commonly used for:
- Large-scale ETL pipelines
- Batch analytics
- Machine learning workloads
- Stream processing
Its main advantages are scalability and speed: Spark processes data in memory when possible and distributes work across multiple executors.
How does Spark work?
Spark follows a driver–executor architecture:
- Driver
  - Coordinates the application
  - Builds the execution plan
  - Schedules tasks
- Executors
  - Run tasks in parallel
  - Process partitions of data
  - Scale up and down during execution
Execution flow:
- A Spark job is submitted
- The driver creates a logical execution plan
- The plan is split into stages and tasks
- Executors process data in parallel
- Results are aggregated and written to storage
This model allows Spark to handle datasets that are far too large for a single machine.
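The flow above can be sketched in plain Python: a "driver" builds the plan by splitting the input into partitions, "executors" (worker threads here, standing in for Spark's executor processes) process them in parallel, and the partial results are aggregated. This is only an analogy for the execution model, not Spark itself.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    # "Executor" work: a per-partition aggregation (here, a sum of squares)
    return sum(x * x for x in partition)

def run_job(data, num_partitions=4):
    # "Driver" work: build the plan by splitting data into partitions,
    # then schedule one task per partition
    size = max(1, len(data) // num_partitions)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # Worker threads stand in for Spark's executor processes
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(process_partition, partitions))
    # Final stage: aggregate the partial results
    return sum(partial_results)

print(run_job(list(range(1000))))  # → 332833500
```

Spark applies the same split-process-aggregate pattern, but across machines, with fault tolerance and a query optimizer on top.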
A real-world use case: Large-scale log processing
A high-traffic application can generate hundreds of GBs or even TBs of logs per day.
Data teams need to clean, enrich, and aggregate this data to produce analytics and reports.
Spark is a good fit because it:
- Splits data into partitions and processes them in parallel
- Efficiently handles large joins and aggregations
- Scales with data size
What might take hours on a single server can finish in minutes with Spark.
Why run Spark on Kubernetes?
Traditionally, Spark ran on YARN or dedicated clusters. Today, Kubernetes is a popular runtime because it provides:
- Elastic scaling per job
- Strong isolation between workloads
- Unified infrastructure for all applications
- Native integration with cloud IAM, networking, and autoscaling
Running Spark on Kubernetes lets you treat Spark jobs like any other containerized workload—scaling up when needed and tearing everything down when the job completes.
Tailoring Spark on Kubernetes for Amazon EKS
When running Spark on Amazon EKS, a few AWS-specific integrations make the setup more secure and production-ready.
Identity and access
IAM Roles for Service Accounts (IRSA) are commonly used so Spark drivers and executors can access AWS services such as S3 without static credentials.
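With IRSA, the Kubernetes service account is annotated with an IAM role ARN, and pods using that service account receive temporary AWS credentials automatically. A minimal sketch, where the account ID and role name are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark
  annotations:
    # Hypothetical role ARN; replace with a role granting the S3 access the job needs
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/spark-s3-access
```

Both the driver and executor pods should be configured to use this service account so that every component of the job authenticates the same way.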
Container images
Spark images are usually stored in Amazon ECR. Production setups often rely on custom images that include Hadoop AWS libraries and application-specific dependencies.
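A custom image typically starts from the official Spark base image and layers in the Hadoop AWS connector plus application code. A sketch, assuming Spark 3.5.1 and the hadoop-aws/aws-sdk versions matching its Hadoop build (verify compatibility for your own versions; the application jar path is a placeholder):

```dockerfile
FROM apache/spark:3.5.1

# Hadoop AWS connector and AWS SDK bundle, enabling s3a:// access.
# Versions must match the Hadoop version Spark was built against.
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar /opt/spark/jars/

# Application code (hypothetical jar name)
COPY target/my-spark-app.jar /opt/spark/app/
```

Push the resulting image to ECR and reference it via spark.kubernetes.container.image.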
Autoscaling and compute
Spark workloads benefit from dynamic scaling:
- Executors scale per job
- Node capacity scales via Cluster Autoscaler or Karpenter
- Spot instances are often used for executors to reduce cost
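Pod-level settings such as tolerations cannot be expressed as plain spark-submit confs, so a common approach is an executor pod template passed via spark.kubernetes.executor.podTemplateFile. A sketch targeting Spot capacity under Karpenter (the taint key and label values depend on your provisioner setup):

```yaml
# executor-pod-template.yaml
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    karpenter.sh/capacity-type: spot
  tolerations:
    - key: "spot"          # hypothetical taint applied to Spot nodes
      operator: "Exists"
      effect: "NoSchedule"
```

Keeping the driver on on-demand capacity while executors run on Spot is a common cost/reliability trade-off, since losing the driver kills the whole job.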
Observability
Drivers and executors run as standard pods and integrate naturally with CloudWatch, Prometheus, and cluster-level logging.
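For Prometheus scraping specifically, Spark 3 ships a built-in PrometheusServlet metrics sink. A sketch of the relevant spark-submit confs, assuming Spark 3.x (endpoint paths follow the Spark monitoring documentation):

```shell
--conf spark.ui.prometheus.enabled=true \
--conf "spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet" \
--conf "spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus"
```

With these set, driver and executor metrics are exposed in Prometheus format on the driver UI port, ready for a scrape annotation or a ServiceMonitor.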
How to actually run a Spark job on Kubernetes
At a high level, running Spark on Kubernetes looks like this:
- Spark runs inside containers
- The driver runs as a Kubernetes pod
- Executors are created dynamically as separate pods
- Kubernetes schedules and manages all resources
Before jumping into commands, it helps to understand the runtime architecture.
Architecture: Spark on Kubernetes (simplified)
[Diagram: spark-submit → Kubernetes API Server → Spark driver pod → Kubernetes scheduler → Spark executor pods]
This diagram illustrates a typical Spark-on-Kubernetes flow:
- spark-submit: A user or CI system submits a Spark job using spark-submit.
- Kubernetes API Server: The submission is sent to the Kubernetes API, which creates the Spark driver pod.
- Spark driver pod: The driver:
  - Builds the execution plan
  - Requests executor pods from Kubernetes
  - Coordinates task execution
- Kubernetes scheduler: The scheduler places executor pods on available nodes based on CPU and memory requirements.
- Spark executor pods: Executors run tasks in parallel, processing data partitions and reporting results back to the driver.
- Job completion: When the job finishes, executor pods terminate automatically, followed by the driver pod.
This clean mapping between Spark components and Kubernetes primitives is what makes Spark on Kubernetes easy to operate and scale.
Minimal hands-on example: Spark on Kubernetes
1. Create namespace and permissions (YAML)
# spark-setup.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: spark
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: spark
rules:
  - apiGroups: [""]
    # configmaps included because Spark 3+ drivers create per-executor config maps
    resources: ["pods", "pods/log", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-rolebinding
  namespace: spark
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
Apply it:
kubectl apply -f spark-setup.yaml
2. Submit a Spark job
spark-submit \
  --master k8s://https://<KUBERNETES_API_ENDPOINT> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=apache/spark:3.5.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar
This creates:
- One driver pod
- Multiple executor pods
- Automatic cleanup when the job finishes
Apache Spark at a glance
Apache Spark is a powerful engine for large-scale data processing, and Kubernetes provides a modern, flexible runtime for running Spark jobs. On Amazon EKS, tight integration with IAM, autoscaling, and object storage makes Spark both scalable and cost-efficient.
With this foundation, teams can extend the setup with production features such as S3 integration, dynamic executor allocation, Spot usage, and monitoring—while keeping Spark workloads fully cloud-native.
