TL;DR: Is This Guide For You?

Use this architecture if:

  • Serving 5M+ requests/month (cost savings justify complexity)
  • Need full infrastructure control (compliance, custom optimizations)
  • Have AWS/DevOps expertise on your team
  • Want 60-70% cost savings vs Vercel at scale
  • Need sub-120ms P99 response times globally

Stick with Vercel if:

  • Under 5M requests/month (managed hosting is more cost-effective)
  • Small team focused on features, not infrastructure
  • Need preview deployments for every PR out-of-the-box
  • Value convenience over cost optimization

What you'll learn:

  • Complete Terraform infrastructure setup (VPC, ECS, CloudFront, WAF)
  • Docker optimization techniques (800MB → 350MB images)
  • Advanced CloudFront caching strategies (95%+ cache hit ratio)
  • Zero-downtime CI/CD pipeline with automatic rollback
  • Production troubleshooting and monitoring

Introduction: Why Self-Host Next.js?

Vercel provides an exceptional developer experience for Next.js applications, but at scale, you might find yourself questioning the economics. When you're serving millions of requests per month, the cost difference between Vercel and self-hosting can be substantial. Beyond cost, self-hosting gives you complete control over your infrastructure, better compliance options, and the ability to fine-tune every aspect of your stack.

But self-hosting Next.js properly is non-trivial. You need to handle:

  • Container orchestration and scaling
  • CDN integration and cache invalidation
  • Zero-downtime deployments
  • Security hardening
  • Performance optimization
  • Cost management

This guide walks through a production-grade architecture that addresses all of these challenges, optimized over months of real-world operation serving millions of requests daily.

Architecture Overview

Our architecture leverages AWS managed services to create a highly available, performant, and cost-effective Next.js hosting platform:

User Request
    ↓
CloudFront CDN (with WAF)
    ↓
Application Load Balancer (HTTPS)
    ↓
ECS Fargate Tasks (Auto-scaling)
    ↓
Next.js Application (Standalone Mode)

Key Components:

  • VPC: Multi-AZ setup with public/private subnets
  • ECS Fargate: Container orchestration without server management
  • Application Load Balancer: HTTPS termination and health checking
  • CloudFront: Global CDN with sophisticated caching policies
  • WAF: Security layer with managed rules and bot protection
  • S3: Build artifact storage and CloudFront logging

This architecture provides:

  • 99.99% availability with multi-AZ deployment
  • Global edge caching via CloudFront's 400+ locations
  • Auto-scaling based on CPU, memory, and request count
  • Zero-downtime deployments with circuit breaker rollback
  • ~350MB Docker images (down from 800MB+)
  • Sub-120ms P99 response times for cached requests globally

Part 1: Infrastructure as Code with Terraform

VPC and Networking Setup

Start with a proper VPC foundation spanning multiple availability zones:

resource "aws_vpc" "main" {
  cidr_block = var.environment == "production" ? "10.10.0.0/16" : "10.20.0.0/16"

  tags = {
    Name = "${var.app_name}-${var.environment}-vpc"
  }
}

# Public subnets for ALB (adjust AZs based on your region)
resource "aws_subnet" "public_subnet_1" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.environment == "production" ? "10.10.1.0/24" : "10.20.1.0/24"
  availability_zone = "${var.aws_region}a"
  map_public_ip_on_launch = true
}

resource "aws_subnet" "public_subnet_2" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.environment == "production" ? "10.10.2.0/24" : "10.20.2.0/24"
  availability_zone = "${var.aws_region}b"
  map_public_ip_on_launch = true
}

# Private subnets for ECS tasks
resource "aws_subnet" "private_subnet_1" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.environment == "production" ? "10.10.10.0/24" : "10.20.10.0/24"
  availability_zone = "${var.aws_region}a"
}

resource "aws_subnet" "private_subnet_2" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.environment == "production" ? "10.10.11.0/24" : "10.20.11.0/24"
  availability_zone = "${var.aws_region}b"
}

Why this matters:

  • Multi-AZ deployment: If one availability zone fails, your app stays up
  • Public/private subnet separation: ALB in public, ECS tasks in private for security
  • Different CIDR blocks per environment: Prevents IP conflicts if you ever need VPC peering
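
The subnets above also need an internet gateway and route tables before any traffic flows. Here's a minimal sketch of the public-side routing; the resource names are assumptions, adjust them to your layout:

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# Public route table: default route out through the internet gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public_1" {
  subnet_id      = aws_subnet.public_subnet_1.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "public_2" {
  subnet_id      = aws_subnet.public_subnet_2.id
  route_table_id = aws_route_table.public.id
}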

NAT Gateway for Private Subnet Internet Access

ECS tasks in private subnets need internet access for pulling Docker images and making external API calls:

resource "aws_eip" "nat" {
  tags = {
    Name = "${var.app_name}-${var.environment}-nat-eip"
  }
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public_subnet_1.id
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

Cost consideration: NAT Gateway costs ~$32/month plus data transfer. For production, this is essential. For development environments, you might consider placing tasks in public subnets to save costs.
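
The private route table above also has to be associated with the private subnets, otherwise ECS tasks fall back to the VPC's main route table and never reach the NAT Gateway. A minimal sketch (resource names assumed):

resource "aws_route_table_association" "private_1" {
  subnet_id      = aws_subnet.private_subnet_1.id
  route_table_id = aws_route_table.private.id
}

resource "aws_route_table_association" "private_2" {
  subnet_id      = aws_subnet.private_subnet_2.id
  route_table_id = aws_route_table.private.id
}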

Security Groups: The CloudFront Challenge

Here's a gotcha: CloudFront uses 60+ IP ranges globally. AWS security groups have a 60-rule limit per group. The solution? Split CloudFront IPs across multiple security groups:

data "aws_ip_ranges" "cloudfront" {
  regions  = ["global"]
  services = ["cloudfront"]
}

# First 50 CloudFront IPs
resource "aws_security_group" "cloudfront_sg_1" {
  name        = "${var.app_name}-${var.environment}-cf-sg-1"
  description = "Security group for CloudFront IP ranges (1-50)"
  vpc_id      = aws_vpc.main.id

  dynamic "ingress" {
    for_each = slice(data.aws_ip_ranges.cloudfront.cidr_blocks, 0, 50)
    content {
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = [ingress.value]
      description = "Allow HTTPS from CloudFront"
    }
  }
}

# Remaining CloudFront IPs
resource "aws_security_group" "cloudfront_sg_2" {
  name        = "${var.app_name}-${var.environment}-cf-sg-2"
  description = "Security group for CloudFront IP ranges (51+)"
  vpc_id      = aws_vpc.main.id

  dynamic "ingress" {
    for_each = slice(data.aws_ip_ranges.cloudfront.cidr_blocks, 50, length(data.aws_ip_ranges.cloudfront.cidr_blocks))
    content {
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = [ingress.value]
      description = "Allow HTTPS from CloudFront"
    }
  }
}

# Main ALB security group (for office IPs, monitoring, etc.)
resource "aws_security_group" "alb_sg" {
  name        = "${var.app_name}-${var.environment}-alb-sg"
  description = "Security group for ALB"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["YOUR_OFFICE_IP/32"]  # Optional: direct ALB access
    description = "Allow HTTPS from office"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# ECS tasks security group
resource "aws_security_group" "ecs_sg" {
  name        = "${var.app_name}-${var.environment}-ecs-sg"
  description = "Security group for ECS tasks"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = var.app_port  # e.g., 3000 for Next.js, 8080 for others
    to_port         = var.app_port
    protocol        = "tcp"
    security_groups = [
      aws_security_group.alb_sg.id,
      aws_security_group.cloudfront_sg_1.id,
      aws_security_group.cloudfront_sg_2.id,
    ]
    description = "Allow traffic from ALB"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Impact: Because the rules come from the aws_ip_ranges data source, your security groups pick up new CloudFront IP ranges on the next terraform apply. This prevents mysterious connection failures months later.

ECS Cluster with Auto-Scaling

ECS Fargate removes the need to manage EC2 instances:

resource "aws_ecs_cluster" "main" {
  name = "ecs-${var.app_name}-${var.environment}-cluster"

  configuration {
    execute_command_configuration {
      logging = "DEFAULT"
    }
  }

  setting {
    name  = "containerInsights"
    value = "enabled"  # Essential for monitoring
  }
}

resource "aws_ecs_service" "main" {
  cluster                            = aws_ecs_cluster.main.arn
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100
  desired_count                      = var.environment == "production" ? var.prod_task_count : var.staging_task_count
  health_check_grace_period_seconds  = 120
  name                               = "${var.app_name}-${var.environment}"
  task_definition                    = aws_ecs_task_definition.main.arn

  capacity_provider_strategy {
    base              = var.environment == "production" ? var.prod_task_count : var.staging_task_count
    capacity_provider = var.environment == "production" ? "FARGATE" : "FARGATE_SPOT"
    weight            = 1
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true  # Auto-rollback on failed deployments
  }

  load_balancer {
    container_name   = "${var.app_name}-${var.environment}"
    container_port   = var.app_port
    target_group_arn = aws_lb_target_group.main.arn
  }

  network_configuration {
    assign_public_ip = false
    security_groups  = [aws_security_group.ecs_sg.id]
    subnets          = [aws_subnet.private_subnet_1.id, aws_subnet.private_subnet_2.id]
  }
}

Key decisions:

  • FARGATE_SPOT for staging: Save ~70% on compute costs for non-production
  • Circuit breaker enabled: Automatically rolls back bad deployments
  • 120s health check grace period: Gives Next.js time to start up and warm up
  • 200% deployment max, 100% min: ECS starts a full replacement set of tasks before draining the old ones
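
The service references aws_ecs_task_definition.main, which isn't shown above. Here's a minimal Fargate task definition sketch; the CPU/memory sizing, log group name, and IAM role names are assumptions you'll need to adapt:

resource "aws_ecs_task_definition" "main" {
  family                   = "${var.app_name}-${var.environment}"
  cpu                      = 1024  # 1 vCPU (size to your workload)
  memory                   = 2048  # 2 GB
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  execution_role_arn       = aws_iam_role.ecs_execution.arn  # pulls the image, writes logs (assumed to exist)
  task_role_arn            = aws_iam_role.ecs_task.arn       # runtime AWS access (assumed to exist)

  container_definitions = jsonencode([
    {
      name         = "${var.app_name}-${var.environment}"
      image        = var.docker_image
      essential    = true
      portMappings = [{ containerPort = var.app_port, protocol = "tcp" }]
      environment = [
        { name = "NODE_ENV", value = "production" },
        { name = "KEEP_ALIVE_TIMEOUT", value = "35000" }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/${var.app_name}-${var.environment}"
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "app"
        }
      }
    }
  ])
}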

Three-Dimensional Auto-Scaling

Scale on CPU, memory, AND request count for comprehensive coverage:

resource "aws_appautoscaling_target" "ecs_service" {
  max_capacity       = var.max_task_count
  min_capacity       = var.min_task_count
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.main.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# CPU-based scaling
resource "aws_appautoscaling_policy" "cpu_scaling" {
  name               = "cpu-scaling-policy"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = var.cpu_target_value  # e.g., 75.0
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    scale_in_cooldown  = var.scale_in_cooldown  # e.g., 120
    scale_out_cooldown = var.scale_out_cooldown  # e.g., 30
  }
}

# Memory-based scaling
resource "aws_appautoscaling_policy" "memory_scaling" {
  name               = "memory-scaling-policy"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = var.memory_target_value  # e.g., 80.0
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
    scale_in_cooldown  = var.scale_in_cooldown
    scale_out_cooldown = var.scale_out_cooldown
  }
}

# Request count-based scaling
resource "aws_appautoscaling_policy" "request_count_scaling" {
  name               = "request-count-scaling-policy"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.main.arn_suffix}"
    }
    target_value       = var.requests_per_target  # e.g., 1000
    scale_in_cooldown  = var.scale_in_cooldown
    scale_out_cooldown = var.scale_out_cooldown
  }
}

Why three metrics?

  • CPU: Catches compute-intensive operations (image processing, data transformation)
  • Memory: Catches memory leaks or large data operations
  • Request count: Proactively scales before CPU/memory spike from traffic surge

Production vs staging cooldowns:

  • Production: Aggressive scale-out (e.g., 30s), conservative scale-in (e.g., 120s)
  • Staging: Conservative on both to reduce churn and save costs
  • Tune based on your traffic patterns and cost sensitivity

Application Load Balancer Configuration

resource "aws_lb" "main" {
  name               = "nextjs-ecs-${var.environment}"
  internal           = false
  load_balancer_type = "application"
  security_groups = [
    aws_security_group.alb_sg.id,
    aws_security_group.cloudfront_sg_1.id,
    aws_security_group.cloudfront_sg_2.id,
  ]
  subnets = [aws_subnet.public_subnet_1.id, aws_subnet.public_subnet_2.id]

  enable_deletion_protection = false
  enable_http2              = true
  idle_timeout              = 30  # Important for keepAlive tuning

  access_logs {
    bucket  = "your-alb-logs-bucket"
    enabled = true
    prefix  = "alb-logs/${var.environment}"
  }
}

resource "aws_lb_target_group" "nextjs" {
  name        = "nextjs-${var.environment}"
  port        = var.app_port  # e.g., 3000 for Next.js
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"  # Required for Fargate

  deregistration_delay = 30  # Drain connections for 30s before killing

  health_check {
    interval            = 15   # Check every 15 seconds
    path                = var.health_check_path  # e.g., "/health", "/api/status"
    protocol            = "HTTP"
    healthy_threshold   = 3    # Healthy after 3 successful checks
    unhealthy_threshold = 3    # Unhealthy after 3 failed checks
    timeout             = 5
    matcher             = "200-299"
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-FIPS-2023-04"
  certificate_arn   = "YOUR_ACM_CERTIFICATE_ARN"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.main.arn
  }
}

Health check tuning:

  • 15-second interval with threshold of 3 means unhealthy tasks are removed in 45s
  • New tasks become healthy after 45s (3 successful checks)
  • Balance between fast detection and avoiding false positives
  • Adjust interval and thresholds based on your app startup time

Deregistration delay (30s):

  • Allows in-flight requests to complete before terminating the task
  • Must be less than the deployment health check grace period (120s)
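
Not shown above: if you also open port 80 on the ALB security groups, an HTTP listener that redirects to HTTPS keeps plain-HTTP requests from dead-ending. A minimal sketch:

resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}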

Terraform Variables (variables.tf)

All the Terraform code above uses variables. Key ones you'll need:

variable "environment" { type = string }  # production, staging
variable "app_name" { type = string }
variable "docker_image" { type = string }
variable "domain_name" { type = string }
variable "acm_certificate_arn" { type = string }

# ECS scaling
variable "min_task_count" { default = 2 }
variable "max_task_count" { default = 10 }
variable "cpu_target_value" { default = 75.0 }
variable "memory_target_value" { default = 80.0 }

Example values:

environment = "production"
app_name    = "my-nextjs-app"
docker_image = "registry/app:prod-abc123"
domain_name = "example.com"

Part 2: Docker Optimization for Next.js

The Standalone Mode Revolution

Next.js 12.2+ introduced standalone mode, which dramatically reduces Docker image sizes:

// next.config.js
export default {
  output: 'standalone',  // This is the magic line
  swcMinify: true,
  // ... rest of config
}

What happens:

  • Next.js traces your dependencies and creates .next/standalone with only required node_modules
  • Before: 800MB+ Docker images with full node_modules
  • After: 350-500MB Docker images with minimal dependencies
  • Impact: Faster deploys, lower storage costs, quicker task startup
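
To sanity-check the standalone output locally before containerizing it, you can run the traced server directly. The paths follow Next.js's documented standalone layout; the port is just an example:

npm run build

# The standalone server expects static assets and public/ alongside it
cp -r .next/static .next/standalone/.next/static
cp -r public .next/standalone/public

# Serve on port 3000, listening on all interfaces
PORT=3000 HOSTNAME=0.0.0.0 node .next/standalone/server.js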

Production Dockerfile Optimized

FROM node:lts-alpine

# Install dumb-init for proper signal handling
RUN apk add --no-cache dumb-init

WORKDIR /app

# Set production environment
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
ENV HOSTNAME=0.0.0.0

# CRITICAL: Set keepAliveTimeout for ALB compatibility
# ALB idle timeout is 30s, so we set this HIGHER to ensure
# ALB closes connections before the app does
ENV KEEP_ALIVE_TIMEOUT=35000

# Copy standalone server (contains minimal node_modules)
COPY .next/standalone ./

# Copy static assets
COPY .next/static ./.next/static
COPY public ./public

# Copy .env.production for runtime environment variables
COPY .env.production ./.env.production

# Fix hostname to 0.0.0.0 (Docker overrides HOSTNAME env at runtime)
RUN sed -i "s/const hostname = process.env.HOSTNAME || '0.0.0.0'/const hostname = '0.0.0.0'/" server.js

# Set headersTimeout for ALB compatibility
# Inject after line 6 (const __dirname) using ES module import
RUN head -6 server.js > server.js.tmp && \
    printf '\nimport http from '"'"'http'"'"';\n' >> server.js.tmp && \
    printf 'const httpServer = http.Server.prototype;\n' >> server.js.tmp && \
    printf 'const originalListen = httpServer.listen;\n' >> server.js.tmp && \
    printf 'httpServer.listen = function(...args) {\n' >> server.js.tmp && \
    printf '  const result = originalListen.apply(this, args);\n' >> server.js.tmp && \
    printf '  this.headersTimeout = 36000;\n' >> server.js.tmp && \
    printf '  return result;\n' >> server.js.tmp && \
    printf '};\n\n' >> server.js.tmp && \
    tail -n +7 server.js >> server.js.tmp && \
    mv server.js.tmp server.js

EXPOSE 3000

# Health check for ECS (adjust port and path as needed)
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})"

# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "server.js"]

The ALB Timeout Problem (And Solution)

This is a critical issue that will cause mysterious 502 errors if not addressed:

The Problem:

  • ALB has a configurable idle timeout (commonly 30-60 seconds, default 60s)
  • Node.js default keepAliveTimeout is 5 seconds (too short!)
  • If the backend keepAliveTimeout is SHORTER than ALB idle timeout, ALB will try to reuse closed connections
  • This causes 502 errors when ALB sends requests on already-closed connections

The Solution:

  • Set KEEP_ALIVE_TIMEOUT to be HIGHER than ALB idle timeout (e.g., 35000ms for 30s ALB timeout)
  • Set headersTimeout slightly higher than keepAliveTimeout (e.g., 36000ms)
  • This ensures ALB closes idle connections BEFORE the backend does
  • The backend keeps connections open longer, preventing ALB from reusing closed connections

Impact:

  • Before fix: Random 502 errors under load
  • After fix: Zero timeout-related errors

Why the shell script patching?

  • Next.js standalone server.js doesn't expose headersTimeout yet (as of late 2024)
  • KEEP_ALIVE_TIMEOUT is officially supported via env var
  • headersTimeout requires patching server.js
  • This is a temporary workaround until Next.js adds official support

dumb-init: The Signal Handling Hero

ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "server.js"]

Why dumb-init?

  • Docker sends SIGTERM to PID 1 when stopping containers
  • Node.js doesn't handle signals properly as PID 1
  • Without dumb-init: 30s forced kill during deployments
  • With dumb-init: Graceful shutdowns, in-flight requests complete

Impact: Zero dropped requests during deployments.

Part 3: CloudFront CDN Configuration

CloudFront is where the real performance magic happens. Our setup uses 5 different cache policies for different content types.

Cache Policy 1: HTML Pages (Aggressive Caching)

resource "aws_cloudfront_cache_policy" "html_cache_policy" {
  name        = "html-cache-policy-${var.environment}"
  comment     = "Cache policy for HTML pages - 1 year TTL since we invalidate on deploy"

  default_ttl = 31536000  # 1 year
  max_ttl     = 31536000
  min_ttl     = 31536000  # Override origin max-age=0

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    headers_config {
      header_behavior = "none"  # Don't vary on geolocation for HTML
    }

    cookies_config {
      cookie_behavior = "none"
    }

    query_strings_config {
      query_string_behavior = "all"  # UTM params, etc.
    }
  }
}

Why 1 year TTL for HTML?

  • CloudFront caching is safe because we invalidate on every deploy
  • Long TTL = high cache hit ratio = faster response times
  • min_ttl = 31536000 overrides Next.js's Cache-Control: max-age=0

Impact:

  • 95%+ cache hit ratio for HTML pages
  • P50 response time: <50ms globally
  • P99 response time: <120ms globally

Cache Policy 2: Static Assets (Immutable Content)

resource "aws_cloudfront_cache_policy" "static_assets_cache_policy" {
  name    = "static-assets-cache-policy-${var.environment}"
  comment = "For /static/* (invalidated) and /_next/static/* (immutable)"

  default_ttl = 31536000
  max_ttl     = 31536000
  min_ttl     = 0  # Allow immediate updates if needed

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    headers_config {
      header_behavior = "none"
    }

    cookies_config {
      cookie_behavior = "none"
    }

    query_strings_config {
      query_string_behavior = "none"
    }
  }
}

Two types of static assets:

  1. /_next/static/*: Content-addressed (contains BUILD_ID and hash)
    • Files: /_next/static/[BUILD_ID]/[hash].js
    • NEVER invalidate these
    • Old clients need old bundles to prevent "chunk failed to load" errors
  2. /static/*: Public folder assets
    • Can be invalidated on deploy
    • Use versioned filenames when possible

Cache Policy 3: Next.js Data Files

resource "aws_cloudfront_cache_policy" "nextjs_data_cache_policy" {
  name    = "nextjs-data-cache-policy-${var.environment}"
  comment = "For /_next/data/* - invalidated on every deploy"

  default_ttl = 31536000
  max_ttl     = 31536000
  min_ttl     = 0

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    headers_config {
      header_behavior = "none"
    }

    cookies_config {
      cookie_behavior = "none"
    }

    query_strings_config {
      query_string_behavior = "none"
    }
  }
}

What are data files?

  • /_next/data/[BUILD_ID]/page.json
  • Used by Next.js client-side navigation
  • Must be invalidated on deploy for content updates

Cache Policy 4: Image Optimization

resource "aws_cloudfront_cache_policy" "image_cache_policy" {
  name    = "image-cache-policy-${var.environment}"
  comment = "For Next.js image optimizer - NOT invalidated"

  default_ttl = 31536000
  max_ttl     = 31536000
  min_ttl     = 31536000  # Override Next.js Cache-Control headers

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    headers_config {
      header_behavior = "whitelist"
      headers {
        items = ["Accept"]  # Distinguish WebP/AVIF vs JPEG
      }
    }

    cookies_config {
      cookie_behavior = "none"
    }

    query_strings_config {
      query_string_behavior = "all"  # w, q, url parameters
    }
  }
}

Why vary on Accept header?

  • Modern browsers send Accept: image/avif,image/webp,*/*
  • Old browsers send Accept: */*
  • CloudFront serves appropriate format based on client support
  • Each format cached separately

Why override Cache-Control?

  • Next.js Image Optimization sets short cache times
  • We want aggressive CloudFront caching
  • Images don't change (or use versioned URLs if they do)

Cache Policy 5: API Routes (No Cache)

resource "aws_cloudfront_cache_policy" "no_cache_policy" {
  name    = "no-cache-policy-${var.environment}"
  comment = "Disable caching for API requests"

  default_ttl = 0
  max_ttl     = 0
  min_ttl     = 0

  parameters_in_cache_key_and_forwarded_to_origin {
    headers_config {
      header_behavior = "none"
    }
    cookies_config {
      cookie_behavior = "none"
    }
    query_strings_config {
      query_string_behavior = "none"
    }
  }
}

CloudFront Distribution with Ordered Cache Behaviors

resource "aws_cloudfront_distribution" "main" {
  enabled         = true
  is_ipv6_enabled = true
  http_version    = "http2and3"  # Enable HTTP/3
  price_class     = var.environment == "production" ? "PriceClass_All" : "PriceClass_100"

  origin {
    domain_name = var.environment == "production" ? "alb-prod.example.com" : "alb-staging.example.com"
    origin_id   = "origin-nextjs-${var.environment}"

    origin_shield {
      enabled              = true
      origin_shield_region = "${var.aws_region}"
    }

    custom_origin_config {
      http_port                = 80
      https_port               = 443
      origin_protocol_policy   = "https-only"
      origin_ssl_protocols     = ["TLSv1.2"]
      origin_read_timeout      = 45
      origin_keepalive_timeout = 10
    }
  }

  # Access logging for performance analysis
  logging_config {
    include_cookies = false
    bucket          = aws_s3_bucket.cloudfront_logs.bucket_domain_name
    prefix          = "cloudfront-logs/"
  }

  aliases = var.environment == "production" ? ["example.com"] : ["staging.example.com"]

  # Default cache behavior for HTML pages
  default_cache_behavior {
    target_origin_id       = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD", "OPTIONS"]
    cached_methods         = ["GET", "HEAD"]
    cache_policy_id        = aws_cloudfront_cache_policy.html_cache_policy.id
    compress               = true
    origin_request_policy_id = aws_cloudfront_origin_request_policy.geolocation_policy.id
  }

  # API routes (no cache)
  ordered_cache_behavior {
    path_pattern     = "/api/*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS", "POST", "PUT", "PATCH", "DELETE"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    cache_policy_id        = aws_cloudfront_cache_policy.no_cache_policy.id
    compress               = true
    origin_request_policy_id = aws_cloudfront_origin_request_policy.geolocation_policy.id
  }

  # Static assets
  ordered_cache_behavior {
    path_pattern     = "/static/*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD", "OPTIONS"]
    target_origin_id = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    cache_policy_id        = aws_cloudfront_cache_policy.static_assets_cache_policy.id
    compress               = true
  }

  # Next.js immutable assets (NEVER invalidate)
  ordered_cache_behavior {
    path_pattern     = "/_next/static/*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD", "OPTIONS"]
    target_origin_id = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    cache_policy_id        = aws_cloudfront_cache_policy.static_assets_cache_policy.id
    compress               = true
  }

  # Next.js data files (invalidate on deploy)
  ordered_cache_behavior {
    path_pattern     = "/_next/data/*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD", "OPTIONS"]
    target_origin_id = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    cache_policy_id        = aws_cloudfront_cache_policy.nextjs_data_cache_policy.id
    compress               = true
  }

  # Images (aggressive caching)
  ordered_cache_behavior {
    path_pattern     = "/_next/image*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD", "OPTIONS"]
    target_origin_id = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    cache_policy_id        = aws_cloudfront_cache_policy.image_cache_policy.id
    compress               = true
  }

  restrictions {
    geo_restriction {
      restriction_type = "blacklist"
      locations        = ["CN"]  # Block China if needed
    }
  }

  viewer_certificate {
    acm_certificate_arn      = "YOUR_ACM_CERTIFICATE_ARN"
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }

  web_acl_id = aws_wafv2_web_acl.main.arn
}
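
The behaviors above reference aws_cloudfront_origin_request_policy.geolocation_policy, which isn't shown. Here's a minimal sketch that forwards viewer headers plus CloudFront's geolocation header to the origin; the exact header list is an assumption, adjust it to whatever your app actually reads:

resource "aws_cloudfront_origin_request_policy" "geolocation_policy" {
  name    = "geolocation-policy-${var.environment}"
  comment = "Forward viewer headers plus CloudFront geolocation to the origin"

  cookies_config {
    cookie_behavior = "all"
  }

  headers_config {
    header_behavior = "allViewerAndWhitelistCloudFront"
    headers {
      items = ["CloudFront-Viewer-Country"]
    }
  }

  query_strings_config {
    query_string_behavior = "all"
  }
}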

Origin Shield: Regional Caching Layer

origin_shield {
  enabled              = true
  origin_shield_region = "${var.aws_region}"
}

What is Origin Shield?

  • An additional caching layer between CloudFront's edge locations and your origin
  • Cache misses from every edge location funnel through the single Origin Shield region you choose
  • Origin Shield then fetches from your ALB only if it doesn't already have the object

Impact:

  • Before: 400+ edge locations fetching from your ALB on cache misses
  • After: misses funnel through one Origin Shield region before reaching your ALB
  • Result: 80-90% reduction in origin requests
  • Cost: ~$0.005/10,000 requests (~$10/month for 20M requests)

Server-Timing Headers for Performance Debugging

resource "aws_cloudfront_response_headers_policy" "security_headers_policy" {
  name = "nextjs-security-headers-${var.environment}"

  security_headers_config {
    # ... security headers ...
  }

  server_timing_headers_config {
    enabled       = true
    sampling_rate = 10  # 10% of requests
  }
}

What you get:

Server-Timing: cdn-cache-miss;desc="cache miss"
Server-Timing: cdn-downstream-fbl;dur=50
Server-Timing: cdn-upstream-fbl;dur=100

Metrics provided:

  • cdn-cache-hit or cdn-cache-miss: Whether request hit cache
  • cdn-downstream-fbl: Time to first byte from CloudFront to client
  • cdn-upstream-fbl: Time to first byte from origin to CloudFront

Impact: Essential for diagnosing slow response times and cache hit ratios.
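
With a 10% sampling rate only some responses carry the header, so you may need a few attempts. A quick check from the command line (the domain is a placeholder); CloudFront's x-cache header shows hit/miss status on every response:

# Inspect CDN timing and cache status for a page
curl -sI https://example.com/ | grep -iE 'server-timing|x-cache'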

Part 4: WAF Configuration

resource "aws_wafv2_web_acl" "main" {
  provider    = aws.us_east_1  # Must be us-east-1 for CloudFront (global service)
  name        = "${var.app_name}-${var.environment}-waf"
  scope       = "CLOUDFRONT"

  default_action {
    dynamic "allow" {
      for_each = var.environment == "production" ? [1] : []
      content {}
    }
    dynamic "block" {
      for_each = var.environment == "staging" ? [1] : []
      content {}
    }
  }

  # Staging: IP allowlist (office + CI/CD)
  dynamic "rule" {
    for_each = var.environment == "staging" ? [1] : []
    content {
      name     = "StagingAllowedIPs"
      priority = 1

      action {
        allow {}
      }

      statement {
        ip_set_reference_statement {
          arn = aws_wafv2_ip_set.staging_allowed_ips[0].arn
        }
      }

      visibility_config {
        cloudwatch_metrics_enabled = true
        metric_name                = "StagingAllowedIPs"
        sampled_requests_enabled   = true
      }
    }
  }

  # AWS Managed Rules
  rule {
    name     = "AWSManagedRulesCommonRuleSet"
    priority = 2

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "AWSManagedRulesCommonRuleSet"
      sampled_requests_enabled   = true
    }
  }

  rule {
    name     = "AWSManagedRulesSQLiRuleSet"
    priority = 4

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesSQLiRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "AWSManagedRulesSQLiRuleSet"
      sampled_requests_enabled   = true
    }
  }

  # Bot Control with overrides
  rule {
    name     = "AWSManagedRulesBotControlRuleSet"
    priority = 6

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesBotControlRuleSet"
        vendor_name = "AWS"

        managed_rule_group_configs {
          aws_managed_rules_bot_control_rule_set {
            inspection_level = "COMMON"
          }
        }

        # Allow SEO bots, monitoring tools
        rule_action_override {
          name = "CategorySearchEngine"
          action_to_use {
            count {}  # Monitor but don't block
          }
        }

        rule_action_override {
          name = "CategoryMonitoring"
          action_to_use {
            count {}
          }
        }
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "BotControl"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "nextjs-${var.environment}-waf"
    sampled_requests_enabled   = true
  }
}
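
The staging rule references aws_wafv2_ip_set.staging_allowed_ips, which isn't defined above. A minimal sketch; the count guard mirrors the conditional rule, it assumes the same aws.us_east_1 provider alias, and the addresses are placeholders:

resource "aws_wafv2_ip_set" "staging_allowed_ips" {
  count              = var.environment == "staging" ? 1 : 0
  provider           = aws.us_east_1
  name               = "${var.app_name}-staging-allowed-ips"
  scope              = "CLOUDFRONT"
  ip_address_version = "IPV4"

  addresses = [
    "203.0.113.10/32",   # office IP (placeholder)
    "198.51.100.0/24",   # CI/CD range (placeholder)
  ]
}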

WAF costs:

  • Base: $5/month per web ACL
  • Rules: $1/month per rule
  • Requests: $0.60 per 1M requests
  • Bot Control: $10/month + $1 per 1M requests

For 20M requests/month:

  • Base + 5 rules: $10
  • Requests: $12
  • Bot Control: $30
  • Total: ~$52/month

Impact: Blocks ~2-5% of malicious traffic automatically.

Part 5: CI/CD Pipeline Deep Dive

Build Optimization with S3 Artifacts

- step:
    name: Build
    script:
      - npm install
      - export NODE_ENV=production
      - export NODE_OPTIONS="--max_old_space_size=49152"  # 48GB heap
      - npm run build
      - export COMMIT_SHORT=$(echo $BITBUCKET_COMMIT | cut -c1-7)

      # Compress and upload to S3
      - tar czf next-build.tar.gz .next/standalone .next/static pages-json
      - aws s3 cp next-build.tar.gz s3://${BUILD_ARTIFACTS_BUCKET}/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/next-build.tar.gz
    artifacts:
      - public/**
      - package.json

Why S3 for artifacts?

  • Build step runs on 16x instance with 128GB RAM
  • Docker step runs on instance with Docker service
  • S3 allows passing large artifacts (100MB+) between steps
  • Artifacts are cleaned up after deploy

Why 48GB heap?

  • Next.js build can use 4-8GB for large sites
  • CI/CD 16x has 128GB RAM
  • Setting max_old_space_size prevents OOM crashes

Docker Build Step

- step:
    name: Dockerize
    services:
      - docker
    script:
      - export COMMIT_SHORT=$(echo $BITBUCKET_COMMIT | cut -c1-7)

      # Download build artifacts
      - aws s3 cp s3://${BUILD_ARTIFACTS_BUCKET}/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/next-build.tar.gz .
      - tar xzf next-build.tar.gz

      # Build and push Docker image
      - docker login -u ${DOCKER_REGISTRY_USERNAME} -p ${DOCKER_REGISTRY_PASSWORD}
      - docker build --no-cache -f ${ENVIRONMENT}.Dockerfile -t org/nextjs:${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER} .
      - docker push org/nextjs:${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER}

      # Clean up S3
      - aws s3 rm --recursive s3://${BUILD_ARTIFACTS_BUCKET}/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/

Why separate build and dockerize?

  • Build needs lots of CPU/RAM
  • Dockerize needs Docker service
  • Can't run both on same step efficiently
  • S3 is the glue

Terraform Apply

- step:
    name: Terraform Apply
    deployment: Production
    script:
      - curl -LO "https://releases.hashicorp.com/terraform/1.9.5/terraform_1.9.5_linux_amd64.zip"
      - unzip terraform_1.9.5_linux_amd64.zip
      - mv terraform /usr/local/bin/

      - export COMMIT_SHORT=$(echo $BITBUCKET_COMMIT | cut -c1-7)

      - terraform init -backend-config="key=state/${tf_aws_key}.tfstate" -backend-config="bucket=${tf_aws_bucket}"

      - terraform plan -parallelism=10 \
          -var="environment=${ENVIRONMENT}" \
          -var="image=org/nextjs:${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER}" \
          -out=terraform.tfplan

      - terraform apply -parallelism=10 -auto-approve terraform.tfplan

parallelism=10: Create/update 10 resources at a time (default is 10, but being explicit helps)

ECS Stability Check (The Critical Step)

This bash script ensures deployment succeeded before invalidating CloudFront cache:

#!/bin/bash
set -e

ENVIRONMENT="${DEPLOYMENT_ENV}"
SERVICE_NAME="${REPO_NAME}"
REGION="${AWS_REGION}"  # e.g., us-east-1
CLUSTER_NAME="ecs-nextjs-${ENVIRONMENT}-cluster"
SERVICE_NAME_FULL="${SERVICE_NAME}-${ENVIRONMENT}"
MAX_WAIT_SECONDS=1200  # 20 minutes
POLL_INTERVAL=30

check_circuit_breaker_rollback() {
    local deployment_info=$(aws ecs describe-services \
        --cluster "${CLUSTER_NAME}" \
        --services "${SERVICE_NAME_FULL}" \
        --region "${REGION}" \
        --query 'services[0].deployments[0]' \
        --output json)

    local rollout_state=$(echo "$deployment_info" | jq -r '.rolloutState // "null"')
    local rollout_reason=$(echo "$deployment_info" | jq -r '.rolloutStateReason // "null"')

    if [ "$rollout_state" = "FAILED" ]; then
        echo "❌ Deployment FAILED - rollout state: $rollout_state"
        echo "❌ Rollout reason: $rollout_reason"
        return 1
    fi

    if echo "$rollout_reason" | grep -i "circuit breaker" > /dev/null; then
        echo "❌ Circuit breaker triggered rollback: $rollout_reason"
        return 1
    fi

    return 0
}

check_expected_image() {
    local expected_image_tag="${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER}"
    local current_image=$(aws ecs describe-services \
        --cluster "${CLUSTER_NAME}" \
        --services "${SERVICE_NAME_FULL}" \
        --region "${REGION}" \
        --query 'services[0].taskDefinition' \
        --output text)

    local task_image=$(aws ecs describe-task-definition \
        --task-definition "$current_image" \
        --region "${REGION}" \
        --query 'taskDefinition.containerDefinitions[0].image' \
        --output text)

    if echo "$task_image" | grep "$expected_image_tag" > /dev/null; then
        echo "✅ Expected image is deployed: $task_image"
        return 0
    else
        echo "⚠️ Current image ($task_image) doesn't match expected ($expected_image_tag)"
        return 1
    fi
}

# Initial check
if ! check_circuit_breaker_rollback; then
    echo "❌ Deployment already failed - aborting cache invalidation"
    exit 1
fi

# Wait for stability
start_time=$(date +%s)
while true; do
    current_time=$(date +%s)
    elapsed=$((current_time - start_time))

    if [ $elapsed -ge $MAX_WAIT_SECONDS ]; then
        echo "❌ Timeout reached (${MAX_WAIT_SECONDS}s)"
        exit 1
    fi

    # Check for rollback
    if ! check_circuit_breaker_rollback; then
        echo "❌ Deployment was rolled back"
        exit 1
    fi

    # Check if stable
    if timeout 60 aws ecs wait services-stable \
        --cluster "${CLUSTER_NAME}" \
        --services "${SERVICE_NAME_FULL}" \
        --region "${REGION}" 2>/dev/null; then
        echo "✅ ECS service is now stable"
        break
    else
        echo "⏳ Still waiting for stability... (${elapsed}s elapsed)"
        sleep $POLL_INTERVAL
    fi
done

# Final verification
RUNNING_COUNT=$(aws ecs describe-services \
    --cluster "${CLUSTER_NAME}" \
    --services "${SERVICE_NAME_FULL}" \
    --region "${REGION}" \
    --query 'services[0].runningCount' \
    --output text)

DESIRED_COUNT=$(aws ecs describe-services \
    --cluster "${CLUSTER_NAME}" \
    --services "${SERVICE_NAME_FULL}" \
    --region "${REGION}" \
    --query 'services[0].desiredCount' \
    --output text)

if [ "$RUNNING_COUNT" != "$DESIRED_COUNT" ]; then
    echo "❌ Running count ($RUNNING_COUNT) doesn't match desired ($DESIRED_COUNT)"
    exit 1
fi

if ! check_expected_image; then
    echo "❌ Expected image is not deployed"
    exit 1
fi

echo "✅ All checks passed - ready for CloudFront invalidation"

Why this is critical:

  • Circuit breaker might rollback due to failed health checks
  • If we invalidate CloudFront cache before rollback, users see errors
  • This script verifies deployment succeeded before invalidating cache

CloudFront Invalidation Strategy

#!/bin/bash
set -e

output_file="root_urls.txt"
> "$output_file"

# CRITICAL: Never invalidate /_next/static/* (immutable assets)
# This prevents "chunk failed to load" errors
echo "/_next/data/*" >> "$output_file"
echo "/static/*" >> "$output_file"

# Extract URLs from sitemaps
for sm in public/sitemap*.xml; do
  grep '<loc>' "$sm" | \
  sed -n 's|.*<loc>https://example.com\(/[^/]*\).*|\1|p' | \
  sort -u | \
  while read -r line; do
      if [ "$line" = "/" ]; then
          echo "$line" >> "$output_file"
      else
          echo "$line" >> "$output_file"
          echo "${line}/" >> "$output_file"
          echo "${line}/*" >> "$output_file"
      fi
  done
done

echo "/sitemap.xml" >> "$output_file"
echo "/robots.txt" >> "$output_file"
echo "/404" >> "$output_file"

Why three variations (path, path/, path/*):

  • CloudFront caches /about, /about/, and /about/index.html as separate cache keys
  • Invalidating all three ensures consistent behavior

Batch invalidation with retry:

DISTRIBUTION_ID="YOUR_DISTRIBUTION_ID"
BATCH_SIZE=10

# Load the invalidation paths collected above
mapfile -t URIS < root_urls.txt

invalidate_batch() {
    local batch=("$@")

    for attempt in {1..5}; do
        local wait_time=$((2 ** (attempt - 1) * 5))  # Exponential backoff

        if output=$(aws cloudfront create-invalidation --distribution-id "$DISTRIBUTION_ID" --paths "${batch[@]}" 2>&1); then
            echo "✅ Successfully invalidated batch on attempt $attempt"
            return 0
        else
            echo "⚠️ Error on attempt $attempt. Waiting ${wait_time}s before retry"
            sleep $wait_time
        fi
    done

    echo "❌ Failed to invalidate batch after 5 attempts"
    return 1
}

for ((i=0; i<${#URIS[@]}; i+=BATCH_SIZE)); do
    batch=("${URIS[@]:i:BATCH_SIZE}")
    invalidate_batch "${batch[@]}"
    sleep 10  # Rate limiting between batches
done

Why small batches (10)?

  • CloudFront has rate limits
  • Small batches = more reliable
  • Can retry individual batches without losing all progress

Cache Warmup

- step:
    name: Warm Cache
    atlassian-ip-ranges: true  # Whitelist CI/CD IPs
    script:
      - cd cache-warmup
      - python3 -m venv .venv
      - .venv/bin/pip install -r requirements.txt
      - .venv/bin/python cache-warmup.py \
          --mode pages \
          --site-url https://example.com \
          --sitemap ../public/sitemap.xml \
          --concurrent 50 \
          --rps 100

What this does:

  • Reads sitemap.xml
  • Sends HEAD requests to all pages
  • Configurable concurrency and rate limiting
  • Warms CloudFront edge caches globally

Impact:

  • First user request after deploy: Cache hit (warm)
  • Without warmup: Cache miss (cold) = slower

Adjust concurrency/RPS based on your infrastructure capacity.
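
The warmup script itself is project-specific and not shown here; a stripped-down equivalent with curl and xargs illustrates the idea. The sitemap path, concurrency, and output format are placeholders:

# Extract URLs from the sitemap and send HEAD requests, 20 in parallel
grep '<loc>' public/sitemap.xml \
  | sed -E 's|.*<loc>(.*)</loc>.*|\1|' \
  | xargs -P 20 -n 1 curl -s -o /dev/null -I -w '%{http_code} %{url_effective}\n'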

Part 6: Cost Optimization

Fargate Spot for Staging

capacity_provider_strategy {
  base              = var.environment == "production" ? 5 : 2
  capacity_provider = var.environment == "production" ? "FARGATE" : "FARGATE_SPOT"
  weight            = 1
}

Fargate Spot savings:

  • Regular Fargate: ~$0.04/vCPU/hour + ~$0.004/GB/hour (varies by region)
  • Fargate Spot: ~70% cheaper
  • For staging: Can reduce compute costs from ~$100/month to ~$30/month

When to use Spot:

  • Staging, development environments
  • Fault-tolerant workloads
  • When you can handle occasional interruptions

When NOT to use Spot:

  • Production (user-facing)
  • When interruptions are unacceptable

CloudFront Price Class Optimization

price_class = var.environment == "production" ? "PriceClass_All" : "PriceClass_100"

Price classes:

  • PriceClass_100: North America, Europe (~$0.085/GB)
  • PriceClass_200: Above + Asia, Africa, Middle East (~$0.100/GB)
  • PriceClass_All: All edge locations (~$0.120/GB)

For staging with PriceClass_100:

  • Limited to North America and Europe edge locations
  • Saves ~30% on data transfer
  • Perfectly fine for internal testing

S3 Lifecycle Policies

resource "aws_s3_bucket_lifecycle_configuration" "cloudfront_logs" {
  bucket = aws_s3_bucket.cloudfront_logs.id

  rule {
    id     = "delete-old-logs"
    status = "Enabled"

    expiration {
      days = 15
    }
  }
}

Impact:

  • CloudFront logs: ~10-50MB/day
  • Without lifecycle: Grows indefinitely
  • With 15-day retention: Cap at ~750MB
  • Savings: Minimal cost, prevents bloat
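
The lifecycle rule assumes an aws_s3_bucket.cloudfront_logs bucket, which isn't shown elsewhere. A minimal sketch; note that CloudFront standard logging writes via S3 ACLs, so the bucket must not enforce bucket-owner-only ownership:

resource "aws_s3_bucket" "cloudfront_logs" {
  bucket = "${var.app_name}-${var.environment}-cloudfront-logs"
}

# CloudFront standard logs are delivered using the bucket ACL mechanism
resource "aws_s3_bucket_ownership_controls" "cloudfront_logs" {
  bucket = aws_s3_bucket.cloudfront_logs.id

  rule {
    object_ownership = "BucketOwnerPreferred"
  }
}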

Resource Cleanup in Pipeline

# Clean up S3 artifacts after Docker push
aws s3 rm --recursive s3://build-artifacts/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/

Impact:

  • Build artifacts: 100-500MB per build
  • S3 standard storage: ~$0.023/GB/month
  • At 100 builds/month, uncleaned artifacts accumulate by 10-50GB every month, indefinitely
  • With cleanup: effectively $0

Part 7: Performance Metrics

After implementing this architecture, here are real-world performance metrics:

CloudFront Cache Hit Ratio

  • HTML pages: 95-98%
  • Static assets: 99%+
  • Images: 98%+
  • Overall: 96%+

What this means:

  • Only 4% of requests hit origin
  • 96% served from edge in <50ms

Response Time Distribution

P50 (Median):

  • Cache hit: 30-50ms
  • Cache miss: 150-300ms

P95:

  • Cache hit: 80-120ms
  • Cache miss: 400-600ms

P99:

  • Cache hit: 50-120ms
  • Cache miss: 800-1200ms

Deployment Metrics

  • Build time: 5-8 minutes
  • Docker build: 2-3 minutes
  • Terraform apply: 1-2 minutes
  • ECS rollout: 3-5 minutes
  • Cache invalidation: 2-3 minutes
  • Total: 13-21 minutes

Zero-downtime deployments: 100% success rate with circuit breaker

Cost Breakdown (Example: 20M requests/month)

Service                                    Estimated Cost/Month
ECS Fargate (varies by task size/count)    $150-250
Application Load Balancer                  $20-30
NAT Gateway                                $30-40
CloudFront (varies by data transfer)       $150-200
WAF                                        $50-70
S3 (logs)                                  $5-10
CloudWatch Logs                            $10-20
Total                                      $415-620

Comparable Vercel cost: $2000-3500/month for similar traffic

Note: Actual costs vary based on task sizing, data transfer, and traffic patterns.

Part 8: Lessons Learned & Best Practices

What Worked Extremely Well

1. Standalone Mode

  • Single biggest win: 800MB → 350MB images
  • Faster deployments, lower storage costs
  • Enable it immediately

2. Circuit Breaker

  • Saved us from multiple bad deployments
  • Auto-rollback prevented user impact
  • Always enable with rollback = true

3. Multiple CloudFront Cache Policies

  • Different content needs different caching
  • HTML, static assets, images, API all optimized separately
  • 95%+ cache hit ratio

4. Never Invalidate /_next/static/*

  • Content-addressed assets are immutable
  • Old clients need old bundles
  • Prevents "chunk failed to load" errors

5. ECS Stability Check Before Cache Invalidation

  • Prevents invalidating cache for failed deployments
  • Catches rollbacks before users see errors

What We'd Do Differently

1. Start with CloudWatch Container Insights Enabled

  • We added this later
  • Essential for debugging performance issues
  • Enable from day one

2. Implement Server-Timing Headers Earlier

  • Took months to add
  • Would have helped diagnose issues faster
  • 10% sampling is plenty

3. Use Terraform Modules from the Start

  • Our main.tf is 1500+ lines
  • Should have split into modules earlier
  • Network, ECS, CloudFront, WAF modules

4. Add CloudFront Logging from Day One

  • Added later for performance analysis
  • Invaluable for debugging
  • S3 costs are minimal

Common Pitfalls to Avoid

1. ALB Timeout Issues

  • Symptom: Random 502 errors
  • Cause: keepAliveTimeout < ALB idle timeout (backend closes connections before ALB)
  • Solution: Set keepAliveTimeout HIGHER than ALB idle timeout (e.g., 35s for 30s ALB, or 65s for 60s ALB)

2. Invalidating /_next/static/*

  • Symptom: "ChunkLoadError" for users
  • Cause: Old clients request old bundles, but they're invalidated
  • Solution: NEVER invalidate content-addressed assets

3. Not Checking Circuit Breaker Before Invalidation

  • Symptom: Users see errors after "successful" deploy
  • Cause: Deployment rolled back, but cache was invalidated
  • Solution: Check ECS deployment status before invalidation

4. CloudFront Security Group Limits

  • Symptom: "Security group rule limit exceeded"
  • Cause: CloudFront has 60+ IP ranges, SG has 60-rule limit
  • Solution: Split across multiple security groups

5. Not Setting Health Check Grace Period

  • Symptom: Tasks marked unhealthy during startup
  • Cause: Health checks start before app is ready
  • Solution: Set grace period to 120s for Next.js

When to Use This Architecture

Use this when:

  • You're serving 5M+ requests/month (cost savings justify complexity)
  • You need full control over infrastructure
  • You have compliance requirements (SOC2, HIPAA, etc.)
  • You want to fine-tune every aspect of performance
  • Your team has AWS/DevOps expertise

Stick with Vercel when:

  • You're under 5M requests/month
  • You want zero infrastructure management
  • You need preview deployments for every PR
  • Your team is small and focused on features
  • You value convenience over cost optimization

Part 9: Security Best Practices

Environment-Specific WAF Rules

default_action {
  dynamic "allow" {
    for_each = var.environment == "production" ? [1] : []
    content {}
  }
  dynamic "block" {
    for_each = var.environment == "staging" ? [1] : []
    content {}
  }
}

  • Production: allow by default, filtered by the managed rules
  • Staging: block by default, with an IP allowlist

Security Headers via CloudFront

security_headers_config {
  content_type_options {
    override = true
  }
  frame_options {
    frame_option = "DENY"
    override     = true
  }
  referrer_policy {
    referrer_policy = "strict-origin-when-cross-origin"
    override        = true
  }
  xss_protection {
    mode_block = true
    protection = true
    override   = true
  }
  strict_transport_security {
    access_control_max_age_sec = 31536000
    include_subdomains         = true
    preload                    = true
    override                   = true
  }
}

Impact: A+ rating on securityheaders.com

Secrets Management

Never put secrets in:

  • Dockerfile
  • Environment variables in task definition
  • Terraform files

Use:

  • AWS Secrets Manager for sensitive values
  • IAM roles for AWS service access
  • Environment variables in .env.production (not committed to git)
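
In an ECS task definition, Secrets Manager values are injected through the secrets block rather than environment, so the plaintext value never appears in Terraform state diffs or the console. A sketch of the relevant container-definition fragment; the variable name and ARN are placeholders, and the execution role needs secretsmanager:GetSecretValue on that ARN:

container_definitions = jsonencode([
  {
    name  = "${var.app_name}-${var.environment}"
    image = var.docker_image

    # Non-sensitive configuration can stay in environment
    environment = [
      { name = "NODE_ENV", value = "production" }
    ]

    # Sensitive values are resolved by ECS at task start
    secrets = [
      {
        name      = "DATABASE_URL"  # placeholder variable name
        valueFrom = "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/database-url"  # placeholder ARN
      }
    ]
  }
])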

Part 10: Monitoring & Alerts

Set up CloudWatch alarms for production. Example alarm for high CPU:

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "${var.app_name}-ecs-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  threshold           = 85
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ServiceName = aws_ecs_service.main.name
    ClusterName = aws_ecs_cluster.main.name
  }
}

Set up similar alarms for:

  • ECS memory utilization (> 85%)
  • ALB 5xx errors (> 10 per 5min)
  • ALB response time (> 2s)
  • CloudFront error rate (> 1%)
  • CloudFront cache hit ratio (< 85%)
  • ECS task count (< minimum)
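
The ALB-side alarms use a different namespace and dimensions than the ECS example above. Here's a sketch for the 5xx alarm from the list; the threshold matches the "> 10 per 5min" guideline:

resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "${var.app_name}-alb-5xx-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  treat_missing_data  = "notBreaching"  # no 5xx datapoints means healthy
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }
}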

SNS Topic for Alerts

resource "aws_sns_topic" "alerts" {
  name = "${var.app_name}-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "devops@example.com"
}

Impact: Alarms detect issues in 5-10 minutes. Add a CloudWatch Dashboard to visualize ECS CPU/memory, ALB response times, and CloudFront cache metrics.

Part 11: Troubleshooting Guide

Issue 1: ECS Tasks Fail to Start

Common causes:

  • Docker image pull errors (check ECS service events for "CannotPullContainerError")
  • Insufficient CPU/memory allocation
  • Health checks failing too quickly

Fix:

  • Verify image exists and task execution role has ECR permissions
  • Increase health_check_grace_period_seconds = 180
  • Check NAT Gateway is working for private subnets

Issue 2: 502 Bad Gateway Errors

Most common cause: ALB timeout mismatch

Fix:

  • Set KEEP_ALIVE_TIMEOUT > ALB idle_timeout (e.g., 35s for 30s ALB, 65s for 60s ALB)
  • Set headersTimeout slightly higher than keepAliveTimeout
  • Check Container Insights for memory issues (increase task memory if needed)

Issue 3: CloudFront Serving Stale Content

Common causes:

  • Invalidation paths didn't cover all variations (path, path/, and path/*)
  • Invalidation hasn't completed yet, or was skipped because the deployment failed

Fix:

  • Only invalidate HTML pages and /_next/data/*; never invalidate /_next/static/* (content-addressed, immutable assets)
  • Wait for the invalidation to complete before testing
  • Clear browser cache when debugging

Part 12: Health Check Implementation

Create a simple health check endpoint for ALB:

// pages/api/health.ts
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(req: NextApiRequest, res: NextApiResponse) {
  res.status(200).json({
    status: 'ok',
    timestamp: new Date().toISOString()
  });
}

Configure in Terraform:

health_check {
  path     = "/api/health"
  interval = 15
  timeout  = 5
  healthy_threshold   = 3
  unhealthy_threshold = 3
}

Keep it simple: Health check should respond in <100ms. For advanced monitoring (database checks, external APIs), create a separate /api/health/detailed endpoint.

Conclusion

Self-hosting Next.js on AWS is non-trivial, but with the right architecture, you can achieve:

  • 99.99% uptime with multi-AZ deployment
  • Sub-120ms P99 response times globally via CloudFront
  • 60-70% cost savings vs Vercel at scale
  • Zero-downtime deployments with circuit breaker protection
  • Enterprise-grade security with WAF and managed rules

The key principles:

  1. Use Next.js standalone mode for minimal Docker images
  2. Aggressive CloudFront caching with smart invalidation
  3. Never invalidate content-addressed assets (/_next/static/*)
  4. Verify deployment success before cache invalidation
  5. Auto-scaling on multiple metrics (CPU, memory, requests)
  6. Circuit breaker for automatic rollback
  7. Origin Shield to reduce origin load
  8. Environment-specific optimizations (Spot for staging, different auto-scaling)

This architecture has served millions of requests daily for months with zero downtime and excellent performance. The upfront complexity pays dividends in control, performance, and cost savings.

Next Steps

If you're implementing this architecture:

  1. Start with Terraform modules for each component
  2. Implement the Docker optimization first (standalone mode)
  3. Set up CI/CD with stability checks
  4. Add CloudFront gradually (start with simple caching)
  5. Tune auto-scaling thresholds based on your traffic
  6. Monitor everything with CloudWatch and Server-Timing headers
  7. Iterate on cache policies based on real data

The code examples in this guide are production-tested and ready to adapt for your use case. Happy self-hosting!