Modern data platforms are expected to process massive volumes of data quickly, reliably, and at scale. From application logs and event streams to machine learning feature pipelines, these workloads often exceed the limits of a single machine. Apache Spark was designed to solve this problem by distributing computation across clusters, while Kubernetes provides a flexible and cloud-native way to run those clusters. Together, Spark and Kubernetes form a powerful foundation for large-scale data processing in modern cloud environments, including Amazon EKS.

This article walks through Apache Spark fundamentals, its execution model, and how to run scalable Spark jobs on Kubernetes using Amazon EKS.

What is Apache Spark?

Apache Spark is an open-source, distributed data processing engine built to process large volumes of data efficiently. It enables parallel execution of complex workloads, such as transformations, joins, and aggregations, across many machines.

Spark is commonly used for:

  • Large-scale ETL pipelines
  • Batch analytics
  • Machine learning workloads
  • Stream processing

Its main advantages are scalability and speed: Spark processes data in memory when possible and distributes work across multiple executors.

How does Spark work?

Spark follows a driver–executor architecture:

  • Driver
    • Coordinates the application
    • Builds the execution plan
    • Schedules tasks
  • Executors
    • Run tasks in parallel
    • Process partitions of data
    • Can be added or removed during execution (dynamic allocation)

Execution flow:

  1. A Spark job is submitted
  2. The driver creates a logical execution plan
  3. The plan is split into stages and tasks
  4. Executors process data in parallel
  5. Results are aggregated and written to storage

This model allows Spark to handle datasets that are far too large for a single machine.

A real-world use case: Large-scale log processing

A high-traffic application can generate hundreds of GBs or even TBs of logs per day.
Data teams need to clean, enrich, and aggregate this data to produce analytics and reports.

Spark is a good fit because it:

  • Splits data into partitions and processes them in parallel
  • Efficiently handles large joins and aggregations
  • Scales with data size

What might take hours on a single server can finish in minutes with Spark.

Why run Spark on Kubernetes?

Traditionally, Spark ran on YARN or dedicated clusters. Today, Kubernetes is a popular runtime because it provides:

  • Elastic scaling per job
  • Strong isolation between workloads
  • Unified infrastructure for all applications
  • Native integration with cloud IAM, networking, and autoscaling

Running Spark on Kubernetes lets you treat Spark jobs like any other containerized workload—scaling up when needed and tearing everything down when the job completes.

Tailoring Spark on Kubernetes for Amazon EKS

When running Spark on Amazon EKS, a few AWS-specific integrations make the setup more secure and production-ready.

Identity and access
IAM Roles for Service Accounts (IRSA) are commonly used so Spark drivers and executors can access AWS services such as S3 without static credentials.

Container images
Spark images are usually stored in Amazon ECR. Production setups often rely on custom images that include Hadoop AWS libraries and application-specific dependencies.

Autoscaling and compute
Spark workloads benefit from dynamic scaling:

  • Executors scale per job
  • Node capacity scales via Cluster Autoscaler or Karpenter
  • Spot instances are often used for executors to reduce cost

Observability
Drivers and executors run as standard pods and integrate naturally with CloudWatch, Prometheus, and cluster-level logging.
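As an illustration, these EKS-specific pieces often surface as spark-submit flags like the following. This is a sketch, not a complete command: the account ID, region, bucket, and endpoint are placeholders, the "spark" service account is assumed to be annotated with an IRSA role, the image is assumed to bundle hadoop-aws, and the capacity-type selector applies only if Karpenter manages the nodes.

```shell
spark-submit \
  --master k8s://https://<KUBERNETES_API_ENDPOINT> \
  --deploy-mode cluster \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/spark:3.5.1 \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.kubernetes.executor.node.selector.karpenter.sh/capacity-type=spot \
  s3a://<BUCKET>/jobs/etl.py
```

The S3A credentials provider line is what lets the pods pick up the IRSA web-identity token instead of static keys; the executor node selector steers only executors (not the driver) onto Spot capacity.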

How to actually run a Spark job on Kubernetes

At a high level, running Spark on Kubernetes looks like this:

  1. Spark runs inside containers
  2. The driver runs as a Kubernetes pod
  3. Executors are created dynamically as separate pods
  4. Kubernetes schedules and manages all resources

Before jumping into commands, it helps to understand the runtime architecture.

Architecture: Spark on Kubernetes (simplified)

A typical Spark-on-Kubernetes flow looks like this:

  1. spark-submit: A user or CI system submits a Spark job using spark-submit.
  2. Kubernetes API server: The submission is sent to the Kubernetes API, which creates the Spark driver pod.
  3. Spark driver pod: The driver:
    • Builds the execution plan
    • Requests executor pods from Kubernetes
    • Coordinates task execution
  4. Kubernetes scheduler: The scheduler places executor pods on available nodes based on CPU and memory requirements.
  5. Spark executor pods: Executors run tasks in parallel, processing data partitions and reporting results back to the driver.
  6. Job completion: When the job finishes, executor pods terminate automatically, followed by the driver pod.

This clean mapping between Spark components and Kubernetes primitives is what makes Spark on Kubernetes easy to operate and scale.

Minimal hands-on example: Spark on Kubernetes

1. Create namespace and permissions (YAML)


# spark-setup.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: spark
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: spark
rules:
  - apiGroups: [""]
    # configmaps included: Spark 3.x drivers create per-executor ConfigMaps
    resources: ["pods", "pods/log", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-rolebinding
  namespace: spark
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io

Apply it:

kubectl apply -f spark-setup.yaml

2. Submit a Spark job


spark-submit \
  --master k8s://https://<KUBERNETES_API_ENDPOINT> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=apache/spark:3.5.1 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar

This creates:

  • One driver pod
  • Multiple executor pods
  • Automatic cleanup when the job finishes
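While the job runs, the driver and executors can be inspected like any other pods; Spark itself labels them (spark-role=driver and spark-role=executor), which makes them easy to select:

```shell
# Watch driver and executor pods appear and terminate
kubectl get pods -n spark -w

# Follow the driver log, where job progress and the final result are printed
kubectl logs -n spark -l spark-role=driver -f
```

Because the pods are labeled rather than predictably named, the label selector keeps working across job reruns.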

Apache Spark at a glance

Apache Spark is a powerful engine for large-scale data processing, and Kubernetes provides a modern, flexible runtime for running Spark jobs. On Amazon EKS, tight integration with IAM, autoscaling, and object storage makes Spark both scalable and cost-efficient.

With this foundation, teams can extend the setup with production features such as S3 integration, dynamic executor allocation, Spot usage, and monitoring—while keeping Spark workloads fully cloud-native.