From Zero to EMR Serverless Dev Environment with Terraform
This Terraform setup provisions a practical EMR Serverless development environment on AWS — not just isolated resources, but a coherent workspace where engineers can actually experiment, iterate, and understand how EMR behaves in real conditions. It includes networking, storage, IAM roles, logging, and EMR Studio so developers can run Spark workloads without spending days wiring infrastructure manually.
The goal here is not “deploy infrastructure for the sake of it.” The goal is to remove friction from testing distributed data workloads.
Architecture overview
The diagram below summarizes how the main pieces fit together: inputs (variables) drive a VPC with public and private subnets, EMR Serverless and EMR Studio in the private tier, an S3 gateway endpoint for in-VPC access to the bucket, and regional resources (S3, CloudWatch) plus IAM roles wired to the outputs you use from scripts and CI.
```mermaid
graph TB
  subgraph inputs
    V_region[region]
    V_project[project_name]
    V_release[emr_release_label]
    V_emr[emr_serverless]
    V_tags[tags]
  end
  subgraph vpc[VPC module]
    subgraph public[Public Subnets 2 AZs]
      NAT[NAT Gateway]
    end
    subgraph private[Private Subnets 2 AZs]
      SG[Security Group]
      EMR_APP[EMR Serverless App]
      STUDIO[EMR Studio]
    end
    IGW[Internet Gateway]
    S3_EP[S3 Gateway Endpoint]
  end
  subgraph regional[Regional]
    S3[S3 Bucket]
    CW[CloudWatch Log Group]
  end
  subgraph iam[IAM]
    ROLE_EXEC[Execution Role]
    ROLE_STUDIO[Studio Role]
  end
  subgraph outputs
    O_APP[application_id]
    O_ROLE[execution_role_arn]
    O_STUDIO_URL[emr_studio_url]
    O_BUCKET[bucket_name]
    O_LOG[log_group_name]
    O_REGION[region]
  end
  V_emr --> EMR_APP
  ROLE_EXEC --> EMR_APP
  SG --> EMR_APP
  EMR_APP --> S3
  EMR_APP --> CW
  ROLE_STUDIO --> STUDIO
  SG --> STUDIO
  STUDIO --> S3
  S3_EP --> S3
  NAT --> IGW
  EMR_APP --> O_APP
  ROLE_EXEC --> O_ROLE
  STUDIO --> O_STUDIO_URL
  S3 --> O_BUCKET
  CW --> O_LOG
  V_region --> O_REGION
```
Why this stack exists
If you’ve ever tried to test EMR seriously, you probably noticed something: it’s not trivial to spin up a safe, reproducible dev environment. Permissions, networking, logging, data access, runtime dependencies — everything needs to line up, and when one piece is missing, debugging quickly becomes painful.
This stack gives developers a controlled space to:
- run Spark jobs on EMR Serverless without touching production,
- keep scripts, dependencies, inputs, and outputs in a single S3 location,
- inspect logs centrally in CloudWatch,
- interact through EMR Studio instead of raw CLI workflows.
Everything is split into logical Terraform files so the environment remains understandable and maintainable rather than becoming a monolithic IaC blob.
Terraform resources summary
This table summarizes the Terraform resources that make up the EMR Serverless development environment: networking, compute, security, IAM, storage, and observability.
| Layer | Resource / module | Purpose |
|---|---|---|
| Networking | module.vpc | VPC, 2 public + 2 private subnets, 1 NAT, IGW |
| Networking | aws_vpc_endpoint.s3 | S3 Gateway endpoint on private route tables |
| Compute | aws_emrserverless_application.main | Spark app (driver + executors in private subnets) |
| Compute | aws_emr_studio.main | EMR Studio (notebooks, in private subnets) |
| Security | aws_security_group.emr_serverless | Single SG for app + studio (egress only) |
| IAM | aws_iam_role.emr_serverless_execution | Role + inline policy (S3, CW, EC2 ENI in VPC) |
| IAM | aws_iam_role.emr_studio | Role + managed policies (EMR Editors, S3 Full Access, EMR Full Access) |
| Storage | aws_s3_bucket.emr | One bucket: logs, libraries, studio, tests |
| Observability | aws_cloudwatch_log_group.emr | Log group for EMR Serverless |
data.tf: environment awareness and policy composition
This file doesn’t create infrastructure. It teaches Terraform about the environment it’s operating in and builds IAM policy documents dynamically.
Context lookups
These data sources remove hardcoded assumptions:
- availability zones for resilient subnet placement,
- account ID for deterministic naming,
- region and partition so ARNs work across standard AWS, GovCloud, or other partitions.
That might sound minor, but this is what makes infrastructure portable instead of brittle.
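A minimal sketch of these lookups, using the data source names referenced later in the resource code (`data.aws_availability_zones.available`, `data.aws_caller_identity.current`, and so on):

```hcl
# Availability zones in the target region, so subnets can be
# placed deterministically across two of them.
data "aws_availability_zones" "available" {
  state = "available"
}

# Account ID, used to build a globally unique S3 bucket name.
data "aws_caller_identity" "current" {}

# Region and partition keep ARNs valid outside the standard
# "aws" partition (e.g. aws-us-gov, aws-cn).
data "aws_region" "current" {}
data "aws_partition" "current" {}
```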
IAM policy documents
This is where most EMR deployments either become secure… or messy.
We generate:
- a trust policy for EMR Studio,
- a trust policy for EMR Serverless execution,
- an execution policy that allows:
- access to AWS-managed EMR runtime artifacts,
- controlled access to your project S3 bucket,
- network interface creation inside the VPC,
- log publishing to CloudWatch.
Using aws_iam_policy_document keeps policies readable, composable, and version-controlled. It produces JSON only; actual IAM roles attach it later.
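To make that concrete, here is a sketch of what two of these documents could look like. The document names match the references used by the role resources; the statement contents are illustrative, not the repository's exact policy.

```hcl
# Trust policy: let the EMR Serverless service assume the execution role.
data "aws_iam_policy_document" "emr_serverless_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["emr-serverless.amazonaws.com"]
    }
  }
}

# Execution policy fragment: scoped access to the project bucket.
# (The full policy also covers EMR runtime artifacts, ENI creation,
# and CloudWatch logging.)
data "aws_iam_policy_document" "emr_serverless_execution" {
  statement {
    actions = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    resources = [
      aws_s3_bucket.emr.arn,
      "${aws_s3_bucket.emr.arn}/*",
    ]
  }
}
```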
main.tf: building the actual playground
This file turns design into running infrastructure.
Networking — isolation with connectivity
The VPC module creates:
- private subnets for compute isolation,
- public subnets mainly for NAT,
- DNS support so AWS service discovery works,
- a NAT gateway so workloads can reach required AWS endpoints safely.
An S3 VPC endpoint is added intentionally. Without it, Spark jobs in private subnets would route S3 traffic through NAT, increasing cost and latency. This small addition usually pays for itself quickly.
```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.8.1"

  name = "${var.project_name}-vpc"
  cidr = "10.100.0.0/16"

  azs             = slice(data.aws_availability_zones.available.names, 0, 2)
  private_subnets = ["10.100.1.0/24", "10.100.2.0/24"]
  public_subnets  = ["10.100.101.0/24", "10.100.102.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(var.tags, { Name = "${var.project_name}-vpc" })
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${data.aws_region.current.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids

  tags = merge(var.tags, { Name = "${var.project_name}-s3-ep" })
}
```
Storage — the operational anchor
The S3 bucket becomes the central artifact store:
- job scripts,
- runtime dependencies,
- intermediate outputs,
- logs and test artifacts.
Public access is blocked, versioning protects against accidental overwrites, and encryption is enforced by default. These aren’t luxury settings; they’re baseline operational hygiene.
```hcl
resource "aws_s3_bucket" "emr" {
  bucket        = "${var.project_name}-bucket-${data.aws_caller_identity.current.account_id}"
  force_destroy = true

  tags = merge(var.tags, { Name = "${var.project_name}-bucket" })
}

resource "aws_s3_bucket_public_access_block" "emr" {
  bucket = aws_s3_bucket.emr.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```
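The versioning and default encryption mentioned above could be expressed like this (a sketch that assumes the `aws_s3_bucket.emr` resource shown above; the repository may use KMS instead of SSE-S3):

```hcl
# Protect against accidental overwrites of scripts and outputs.
resource "aws_s3_bucket_versioning" "emr" {
  bucket = aws_s3_bucket.emr.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Enforce server-side encryption by default (SSE-S3 here).
resource "aws_s3_bucket_server_side_encryption_configuration" "emr" {
  bucket = aws_s3_bucket.emr.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```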
Logging — observability from day one
A dedicated CloudWatch log group captures driver and executor logs. When something fails in distributed processing, logs are often the only way to understand what happened. Centralizing logs from day one avoids painful retrofits later.
```hcl
resource "aws_cloudwatch_log_group" "emr" {
  name              = "/aws/emr-serverless/${var.project_name}"
  retention_in_days = 7

  tags = merge(var.tags, { Name = "${var.project_name}-log-group" })
}
```
Security and IAM — where most complexity lives
Two roles matter here:
- EMR Serverless execution role — what Spark jobs actually run as.
- EMR Studio service role — what enables the interactive workspace.
Security groups remain intentionally simple: outbound allowed, inbound restricted. EMR Serverless relies heavily on AWS-managed service communication, so overly strict network rules often cause subtle failures that are difficult to diagnose.
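The shared security group referenced by both the application and Studio could look like this (a sketch; the resource name matches the `aws_security_group.emr_serverless` references below, but the exact rules are an assumption):

```hcl
# Egress-only security group shared by the EMR Serverless app and
# EMR Studio. No inbound rules: workloads initiate outbound
# connections to AWS services; nothing connects in.
resource "aws_security_group" "emr_serverless" {
  name_prefix = "${var.project_name}-emr-"
  vpc_id      = module.vpc.vpc_id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = merge(var.tags, { Name = "${var.project_name}-emr-sg" })
}
```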
```hcl
resource "aws_iam_role" "emr_serverless_execution" {
  name               = "${var.project_name}-execution-role"
  assume_role_policy = data.aws_iam_policy_document.emr_serverless_assume.json

  tags = merge(var.tags, { Name = "${var.project_name}-execution-role" })
}

resource "aws_iam_role_policy" "emr_serverless_execution" {
  name   = "${var.project_name}-execution-policy"
  role   = aws_iam_role.emr_serverless_execution.id
  policy = data.aws_iam_policy_document.emr_serverless_execution.json
}

resource "aws_iam_role" "emr_studio" {
  name               = "${var.project_name}-studio-role"
  assume_role_policy = data.aws_iam_policy_document.emr_studio_assume.json

  tags = merge(var.tags, { Name = "${var.project_name}-studio-role" })
}

resource "aws_iam_role_policy_attachment" "emr_studio_editors" {
  role       = aws_iam_role.emr_studio.name
  policy_arn = "arn:${data.aws_partition.current.partition}:iam::aws:policy/service-role/AmazonElasticMapReduceEditorsRole"
}

resource "aws_iam_role_policy_attachment" "emr_studio_s3" {
  role       = aws_iam_role.emr_studio.name
  policy_arn = "arn:${data.aws_partition.current.partition}:iam::aws:policy/AmazonS3FullAccess"
}

resource "aws_iam_role_policy_attachment" "emr_studio_emr" {
  role       = aws_iam_role.emr_studio.name
  policy_arn = "arn:${data.aws_partition.current.partition}:iam::aws:policy/AmazonElasticMapReduceFullAccess"
}
```
Compute and developer interface
This is where the environment becomes usable:
- EMR Serverless application configured for Spark workloads, auto-start/stop behavior, and defined capacity boundaries.
- EMR Studio connected to the VPC, security group, service role, and S3 location so developers can run notebooks, inspect jobs, and iterate quickly.
Without Studio, EMR experimentation tends to become CLI-heavy and slower for most teams.
```hcl
resource "aws_emrserverless_application" "main" {
  name          = "${var.project_name}-app"
  release_label = var.emr_release_label
  type          = var.emr_serverless.application_type
  architecture  = var.emr_serverless.architecture

  auto_start_configuration {
    enabled = true
  }

  auto_stop_configuration {
    enabled              = true
    idle_timeout_minutes = var.emr_serverless.auto_stop_idle_minutes
  }

  network_configuration {
    subnet_ids         = module.vpc.private_subnets
    security_group_ids = [aws_security_group.emr_serverless.id]
  }

  initial_capacity {
    initial_capacity_type = "Driver"
    initial_capacity_config {
      worker_count = var.emr_serverless.driver.worker_count
      worker_configuration {
        cpu    = var.emr_serverless.driver.cpu
        memory = var.emr_serverless.driver.memory
        disk   = var.emr_serverless.driver.disk
      }
    }
  }

  initial_capacity {
    initial_capacity_type = "Executor"
    initial_capacity_config {
      worker_count = var.emr_serverless.executor.worker_count
      worker_configuration {
        cpu    = var.emr_serverless.executor.cpu
        memory = var.emr_serverless.executor.memory
        disk   = var.emr_serverless.executor.disk
      }
    }
  }

  maximum_capacity {
    cpu    = var.emr_serverless.maximum_capacity.cpu
    memory = var.emr_serverless.maximum_capacity.memory
    disk   = var.emr_serverless.maximum_capacity.disk
  }

  tags = var.tags
}

resource "aws_emr_studio" "main" {
  name                        = "${var.project_name}-studio"
  auth_mode                   = "IAM"
  vpc_id                      = module.vpc.vpc_id
  subnet_ids                  = module.vpc.private_subnets
  service_role                = aws_iam_role.emr_studio.arn
  workspace_security_group_id = aws_security_group.emr_serverless.id
  engine_security_group_id    = aws_security_group.emr_serverless.id
  default_s3_location         = "s3://${aws_s3_bucket.emr.id}/studio/"

  tags = var.tags
}
```
variables.tf: making the stack reusable
This file abstracts environment-specific settings:
- region and naming,
- EMR release label,
- capacity tuning (driver/executor sizing, idle timeout),
- tagging standards.
Separating configuration from infrastructure code keeps the module reusable across dev, staging, or experimental environments without rewriting resources.
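The shape of the `emr_serverless` variable can be inferred from the references in main.tf; a sketch of its declaration (attribute structure derived from usage, defaults omitted):

```hcl
variable "emr_serverless" {
  description = "Sizing and behavior of the EMR Serverless application."
  type = object({
    application_type       = string # e.g. "SPARK"
    architecture           = string # e.g. "ARM64" or "X86_64"
    auto_stop_idle_minutes = number

    driver   = object({ worker_count = number, cpu = string, memory = string, disk = string })
    executor = object({ worker_count = number, cpu = string, memory = string, disk = string })

    maximum_capacity = object({ cpu = string, memory = string, disk = string })
  })
}
```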
outputs.tf: operational handoff
Outputs expose what engineers actually need after deployment:
- EMR Serverless application ID,
- execution role ARN,
- Studio URL,
- S3 bucket name,
- log group,
- region.
These values typically feed CI pipelines, test scripts, or job submission tooling.
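A sketch of a few of these outputs, assuming the resource names used throughout this article:

```hcl
output "application_id" {
  value = aws_emrserverless_application.main.id
}

output "execution_role_arn" {
  value = aws_iam_role.emr_serverless_execution.arn
}

output "emr_studio_url" {
  value = aws_emr_studio.main.url
}
```

In CI, these are typically read with `terraform output -raw application_id` and passed to job submission tooling.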
terraform.tfvars(.example): deployment profiles
These files define concrete environment values. Think of them as profiles:
- which region,
- how large the EMR application should be,
- naming conventions,
- tagging policies.
This separation makes spinning up multiple environments predictable.
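An illustrative profile (values here are examples, not recommendations; EMR Serverless expects capacity strings such as "2 vCPU" and "8 GB"):

```hcl
# terraform.tfvars.example — one possible dev profile.
region            = "eu-west-1"
project_name      = "emr-dev"
emr_release_label = "emr-7.1.0"

emr_serverless = {
  application_type       = "SPARK"
  architecture           = "ARM64"
  auto_stop_idle_minutes = 15

  driver   = { worker_count = 1, cpu = "2 vCPU", memory = "8 GB", disk = "20 GB" }
  executor = { worker_count = 2, cpu = "4 vCPU", memory = "16 GB", disk = "20 GB" }

  maximum_capacity = { cpu = "16 vCPU", memory = "64 GB", disk = "100 GB" }
}

tags = { Environment = "dev", Project = "emr-dev" }
```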
This environment isn’t intended to be production-ready out of the box. It’s designed to make experimentation with EMR Serverless practical: reproducible infrastructure, sensible defaults, and enough observability to understand what your workloads are actually doing.
In the next post, I’ll move from infrastructure to running real Spark workloads on this environment and share what that experience looks like in practice.
Repository Reference
The complete Terraform configuration used in this article is available on GitHub.
The repository includes the full infrastructure code, example configuration files, diagrams, and deployment guidance so you can reproduce the environment, experiment safely, or adapt it to your own workloads.
Feel free to explore the code, open issues, or reuse the setup as a starting point for building your own EMR Serverless development environments.