From Zero to EMR Serverless Dev Environment with Terraform
This Terraform setup provisions a practical EMR Serverless development environment on AWS — not just isolated resources, but a coherent workspace where engineers can actually experiment, iterate, and understand how EMR behaves in real conditions. It includes networking, storage, IAM roles, logging, and EMR Studio so developers can run Spark workloads without spending days wiring infrastructure manually.
The goal here is not “deploy infrastructure for the sake of it.” The goal is to remove friction from testing distributed data workloads.
Architecture overview
The diagram below summarizes how the main pieces fit together: inputs (variables) drive a VPC with public and private subnets, EMR Serverless and EMR Studio in the private tier, an S3 gateway endpoint for in-VPC access to the bucket, and regional resources (S3, CloudWatch) plus IAM roles wired to the outputs you use from scripts and CI.
```mermaid
graph TB
  subgraph inputs
    V_region[region]
    V_project[project_name]
    V_release[emr_release_label]
    V_emr[emr_serverless]
    V_tags[tags]
  end
  subgraph vpc[VPC module]
    subgraph public[Public Subnets 2 AZs]
      NAT[NAT Gateway]
    end
    subgraph private[Private Subnets 2 AZs]
      SG[Security Group]
      EMR_APP[EMR Serverless App]
      STUDIO[EMR Studio]
    end
    IGW[Internet Gateway]
    S3_EP[S3 Gateway Endpoint]
  end
  subgraph regional[Regional]
    S3[S3 Bucket]
    CW[CloudWatch Log Group]
  end
  subgraph iam[IAM]
    ROLE_EXEC[Execution Role]
    ROLE_STUDIO[Studio Role]
  end
  subgraph outputs
    O_APP[application_id]
    O_ROLE[execution_role_arn]
    O_STUDIO_URL[emr_studio_url]
    O_BUCKET[bucket_name]
    O_LOG[log_group_name]
    O_REGION[region]
  end
  V_emr --> EMR_APP
  ROLE_EXEC --> EMR_APP
  SG --> EMR_APP
  EMR_APP --> S3
  EMR_APP --> CW
  ROLE_STUDIO --> STUDIO
  SG --> STUDIO
  STUDIO --> S3
  S3_EP --> S3
  NAT --> IGW
  EMR_APP --> O_APP
  ROLE_EXEC --> O_ROLE
  STUDIO --> O_STUDIO_URL
  S3 --> O_BUCKET
  CW --> O_LOG
  V_region --> O_REGION
```
Why this stack exists
If you’ve ever tried to test EMR seriously, you probably noticed something: it’s not trivial to spin up a safe, reproducible dev environment. Permissions, networking, logging, data access, runtime dependencies — everything needs to line up, and when one piece is missing, debugging quickly becomes painful.
This stack gives developers a controlled space to:
- run Spark jobs on EMR Serverless without touching production,
- keep scripts, dependencies, inputs, and outputs in a single S3 location,
- inspect logs centrally in CloudWatch,
- interact through EMR Studio instead of raw CLI workflows.
Everything is split into logical Terraform files so the environment remains understandable and maintainable rather than becoming a monolithic IaC blob.
Terraform resources summary
This table summarizes the Terraform resources that make up the EMR Serverless development environment: networking, compute, security, IAM, storage, and observability.
| Layer | Resource / module | Purpose |
|---|---|---|
| Networking | module.vpc | VPC, 2 public + 2 private subnets, 1 NAT, IGW |
| Networking | aws_vpc_endpoint.s3 | S3 Gateway endpoint on private route tables |
| Compute | aws_emrserverless_application.main | Spark app (driver + executors in private subnets) |
| Compute | aws_emr_studio.main | EMR Studio (notebooks, in private subnets) |
| Security | aws_security_group.emr_serverless | Single SG for app + studio (egress only) |
| IAM | aws_iam_role.emr_serverless_execution | Role + inline policy (S3, CW, EC2 ENI in VPC) |
| IAM | aws_iam_role.emr_studio | Role + managed policies (EMR Editors, S3 Full Access, EMR Full Access) |
| Storage | aws_s3_bucket.emr | One bucket: logs, libraries, studio, tests |
| Observability | aws_cloudwatch_log_group.emr | Log group for EMR Serverless |
data.tf: environment awareness and policy composition
This file doesn’t create infrastructure. It teaches Terraform about the environment it’s operating in and builds IAM policy documents dynamically.
Context lookups
These data sources remove hardcoded assumptions:
- availability zones for resilient subnet placement,
- account ID for deterministic naming,
- region and partition so ARNs work across standard AWS, GovCloud, or other partitions.
That might sound minor, but this is what makes infrastructure portable instead of brittle.
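A minimal sketch of these lookups, using the data source names referenced later in the resource code (`data.aws_availability_zones.available`, `data.aws_caller_identity.current`, and so on):

```hcl
# Availability zones in the target region, so subnets can be
# placed deterministically across two of them.
data "aws_availability_zones" "available" {
  state = "available"
}

# Account ID, used to build a globally unique S3 bucket name.
data "aws_caller_identity" "current" {}

# Region and partition keep ARNs valid outside the standard
# "aws" partition (e.g. aws-us-gov, aws-cn).
data "aws_region" "current" {}
data "aws_partition" "current" {}
```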
IAM policy documents
This is where most EMR deployments either become secure… or messy.
We generate:
- a trust policy for EMR Studio,
- a trust policy for EMR Serverless execution,
- an execution policy that allows:
- access to AWS-managed EMR runtime artifacts,
- controlled access to your project S3 bucket,
- network interface creation inside the VPC,
- log publishing to CloudWatch.
Using aws_iam_policy_document keeps policies readable, composable, and version-controlled. It produces JSON only; actual IAM roles attach it later.
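To make that concrete, here is a sketch of what two of these documents could look like. The document names match the references used by the role resources; the statement contents are illustrative, not the repository's exact policy.

```hcl
# Trust policy: let the EMR Serverless service assume the execution role.
data "aws_iam_policy_document" "emr_serverless_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["emr-serverless.amazonaws.com"]
    }
  }
}

# Execution policy fragment: scoped access to the project bucket.
# (The full policy also covers EMR runtime artifacts, ENI creation,
# and CloudWatch logging.)
data "aws_iam_policy_document" "emr_serverless_execution" {
  statement {
    actions = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
    resources = [
      aws_s3_bucket.emr.arn,
      "${aws_s3_bucket.emr.arn}/*",
    ]
  }
}
```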
main.tf: building the actual playground
This file turns design into running infrastructure.
Networking — isolation with connectivity
The VPC module creates:
- private subnets for compute isolation,
- public subnets mainly for NAT,
- DNS support so AWS service discovery works,
- a NAT gateway so workloads can reach required AWS endpoints safely.
An S3 VPC endpoint is added intentionally. Without it, Spark jobs in private subnets would route S3 traffic through NAT, increasing cost and latency. This small addition usually pays for itself quickly.
```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.8.1"

  name = "${var.project_name}-vpc"
  cidr = "10.100.0.0/16"

  azs             = slice(data.aws_availability_zones.available.names, 0, 2)
  private_subnets = ["10.100.1.0/24", "10.100.2.0/24"]
  public_subnets  = ["10.100.101.0/24", "10.100.102.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(var.tags, { Name = "${var.project_name}-vpc" })
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${data.aws_region.current.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids

  tags = merge(var.tags, { Name = "${var.project_name}-s3-ep" })
}
```
Storage — the operational anchor
The S3 bucket becomes the central artifact store:
- job scripts,
- runtime dependencies,
- intermediate outputs,
- logs and test artifacts.
Public access is blocked, versioning protects against accidental overwrites, and encryption is enforced by default. These aren’t luxury settings; they’re baseline operational hygiene.
```hcl
resource "aws_s3_bucket" "emr" {
  bucket        = "${var.project_name}-bucket-${data.aws_caller_identity.current.account_id}"
  force_destroy = true

  tags = merge(var.tags, { Name = "${var.project_name}-bucket" })
}

resource "aws_s3_bucket_public_access_block" "emr" {
  bucket = aws_s3_bucket.emr.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```
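The versioning and default encryption mentioned above could be expressed like this (a sketch that assumes the `aws_s3_bucket.emr` resource shown above; the repository may use KMS instead of SSE-S3):

```hcl
# Protect against accidental overwrites of scripts and outputs.
resource "aws_s3_bucket_versioning" "emr" {
  bucket = aws_s3_bucket.emr.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Enforce server-side encryption by default (SSE-S3 here).
resource "aws_s3_bucket_server_side_encryption_configuration" "emr" {
  bucket = aws_s3_bucket.emr.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```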
Logging — observability from day one
A dedicated CloudWatch log group captures driver and executor logs. When something fails in distributed processing, logs are often the only way to understand what happened. Centralizing logs from day one avoids painful retrofits later.
```hcl
resource "aws_cloudwatch_log_group" "emr" {
  name              = "/aws/emr-serverless/${var.project_name}"
  retention_in_days = 7

  tags = merge(var.tags, { Name = "${var.project_name}-log-group" })
}
```
Security and IAM — where most complexity lives
Two roles matter here:
- EMR Serverless execution role — what Spark jobs actually run as.
- EMR Studio service role — what enables the interactive workspace.
Security groups remain intentionally simple: outbound allowed, inbound restricted. EMR Serverless relies heavily on AWS-managed service communication, so overly strict network rules often cause subtle failures that are difficult to diagnose.
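The shared security group referenced by both the application and Studio could look like this (a sketch; the resource name matches the `aws_security_group.emr_serverless` references below, but the exact rules are an assumption):

```hcl
# Egress-only security group shared by the EMR Serverless app and
# EMR Studio. No inbound rules: workloads initiate outbound
# connections to AWS services; nothing connects in.
resource "aws_security_group" "emr_serverless" {
  name_prefix = "${var.project_name}-emr-"
  vpc_id      = module.vpc.vpc_id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = merge(var.tags, { Name = "${var.project_name}-emr-sg" })
}
```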
```hcl
resource "aws_iam_role" "emr_serverless_execution" {
  name               = "${var.project_name}-execution-role"
  assume_role_policy = data.aws_iam_policy_document.emr_serverless_assume.json

  tags = merge(var.tags, { Name = "${var.project_name}-execution-role" })
}

resource "aws_iam_role_policy" "emr_serverless_execution" {
  name   = "${var.project_name}-execution-policy"
  role   = aws_iam_role.emr_serverless_execution.id
  policy = data.aws_iam_policy_document.emr_serverless_execution.json
}

resource "aws_iam_role" "emr_studio" {
  name               = "${var.project_name}-studio-role"
  assume_role_policy = data.aws_iam_policy_document.emr_studio_assume.json

  tags = merge(var.tags, { Name = "${var.project_name}-studio-role" })
}

resource "aws_iam_role_policy_attachment" "emr_studio_editors" {
  role       = aws_iam_role.emr_studio.name
  policy_arn = "arn:${data.aws_partition.current.partition}:iam::aws:policy/service-role/AmazonElasticMapReduceEditorsRole"
}

resource "aws_iam_role_policy_attachment" "emr_studio_s3" {
  role       = aws_iam_role.emr_studio.name
  policy_arn = "arn:${data.aws_partition.current.partition}:iam::aws:policy/AmazonS3FullAccess"
}

resource "aws_iam_role_policy_attachment" "emr_studio_emr" {
  role       = aws_iam_role.emr_studio.name
  policy_arn = "arn:${data.aws_partition.current.partition}:iam::aws:policy/AmazonElasticMapReduceFullAccess"
}
```
Compute and developer interface
This is where the environment becomes usable:
- EMR Serverless application configured for Spark workloads, auto-start/stop behavior, and defined capacity boundaries.
- EMR Studio connected to the VPC, security group, service role, and S3 location so developers can run notebooks, inspect jobs, and iterate quickly.
Without Studio, EMR experimentation tends to become CLI-heavy and slower for most teams.
```hcl
resource "aws_emrserverless_application" "main" {
  name          = "${var.project_name}-app"
  release_label = var.emr_release_label
  type          = var.emr_serverless.application_type
  architecture  = var.emr_serverless.architecture

  auto_start_configuration {
    enabled = true
  }

  auto_stop_configuration {
    enabled              = true
    idle_timeout_minutes = var.emr_serverless.auto_stop_idle_minutes
  }

  network_configuration {
    subnet_ids         = module.vpc.private_subnets
    security_group_ids = [aws_security_group.emr_serverless.id]
  }

  initial_capacity {
    initial_capacity_type = "Driver"
    initial_capacity_config {
      worker_count = var.emr_serverless.driver.worker_count
      worker_configuration {
        cpu    = var.emr_serverless.driver.cpu
        memory = var.emr_serverless.driver.memory
        disk   = var.emr_serverless.driver.disk
      }
    }
  }

  initial_capacity {
    initial_capacity_type = "Executor"
    initial_capacity_config {
      worker_count = var.emr_serverless.executor.worker_count
      worker_configuration {
        cpu    = var.emr_serverless.executor.cpu
        memory = var.emr_serverless.executor.memory
        disk   = var.emr_serverless.executor.disk
      }
    }
  }

  maximum_capacity {
    cpu    = var.emr_serverless.maximum_capacity.cpu
    memory = var.emr_serverless.maximum_capacity.memory
    disk   = var.emr_serverless.maximum_capacity.disk
  }

  tags = var.tags
}

resource "aws_emr_studio" "main" {
  name                        = "${var.project_name}-studio"
  auth_mode                   = "IAM"
  vpc_id                      = module.vpc.vpc_id
  subnet_ids                  = module.vpc.private_subnets
  service_role                = aws_iam_role.emr_studio.arn
  workspace_security_group_id = aws_security_group.emr_serverless.id
  engine_security_group_id    = aws_security_group.emr_serverless.id
  default_s3_location         = "s3://${aws_s3_bucket.emr.id}/studio/"

  tags = var.tags
}
```
variables.tf: making the stack reusable
This file abstracts environment-specific settings:
- region and naming,
- EMR release label,
- capacity tuning (driver/executor sizing, idle timeout),
- tagging standards.
Separating configuration from infrastructure code keeps the module reusable across dev, staging, or experimental environments without rewriting resources.
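The shape of the `emr_serverless` variable can be inferred from the references in main.tf; a sketch of its declaration (attribute structure derived from usage, defaults omitted):

```hcl
variable "emr_serverless" {
  description = "Sizing and behavior of the EMR Serverless application."
  type = object({
    application_type       = string # e.g. "SPARK"
    architecture           = string # e.g. "ARM64" or "X86_64"
    auto_stop_idle_minutes = number

    driver   = object({ worker_count = number, cpu = string, memory = string, disk = string })
    executor = object({ worker_count = number, cpu = string, memory = string, disk = string })

    maximum_capacity = object({ cpu = string, memory = string, disk = string })
  })
}
```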
outputs.tf: operational handoff
Outputs expose what engineers actually need after deployment:
- EMR Serverless application ID,
- execution role ARN,
- Studio URL,
- S3 bucket name,
- log group,
- region.
These values typically feed CI pipelines, test scripts, or job submission tooling.
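A sketch of a few of these outputs, assuming the resource names used throughout this article:

```hcl
output "application_id" {
  value = aws_emrserverless_application.main.id
}

output "execution_role_arn" {
  value = aws_iam_role.emr_serverless_execution.arn
}

output "emr_studio_url" {
  value = aws_emr_studio.main.url
}
```

In CI, these are typically read with `terraform output -raw application_id` and passed to job submission tooling.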
terraform.tfvars(.example): deployment profiles
These files define concrete environment values. Think of them as profiles:
- which region,
- how large the EMR application should be,
- naming conventions,
- tagging policies.
This separation makes spinning up multiple environments predictable.
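An illustrative profile (values here are examples, not recommendations; EMR Serverless expects capacity strings such as "2 vCPU" and "8 GB"):

```hcl
# terraform.tfvars.example — one possible dev profile.
region            = "eu-west-1"
project_name      = "emr-dev"
emr_release_label = "emr-7.1.0"

emr_serverless = {
  application_type       = "SPARK"
  architecture           = "ARM64"
  auto_stop_idle_minutes = 15

  driver   = { worker_count = 1, cpu = "2 vCPU", memory = "8 GB", disk = "20 GB" }
  executor = { worker_count = 2, cpu = "4 vCPU", memory = "16 GB", disk = "20 GB" }

  maximum_capacity = { cpu = "16 vCPU", memory = "64 GB", disk = "100 GB" }
}

tags = { Environment = "dev", Project = "emr-dev" }
```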
This environment isn’t intended to be production-ready out of the box. It’s designed to make experimentation with EMR Serverless practical: reproducible infrastructure, sensible defaults, and enough observability to understand what your workloads are actually doing.
In the next post, I’ll move from infrastructure to running real Spark workloads on this environment and share what that experience looks like in practice.
Repository Reference
The complete Terraform configuration used in this article is available on GitHub.
The repository includes the full infrastructure code, example configuration files, diagrams, and deployment guidance so you can reproduce the environment, experiment safely, or adapt it to your own workloads.
Feel free to explore the code, open issues, or reuse the setup as a starting point for building your own EMR Serverless development environments.