TL;DR: Is This Guide For You?

Use this architecture if:

  • Serving 5M+ requests/month (cost savings justify complexity)
  • Need full infrastructure control (compliance, custom optimizations)
  • Have AWS/DevOps expertise on your team
  • Want 60-70% cost savings vs Vercel at scale
  • Need sub-120ms P99 response times globally

Stick with Vercel if:

  • Under 5M requests/month (managed hosting is more cost-effective)
  • Small team focused on features, not infrastructure
  • Need preview deployments for every PR out-of-the-box
  • Value convenience over cost optimization

What you'll learn:

  • Complete Terraform infrastructure setup (VPC, ECS, CloudFront, WAF)
  • Docker optimization techniques (800MB → 350MB images)
  • Advanced CloudFront caching strategies (95%+ cache hit ratio)
  • Zero-downtime CI/CD pipeline with automatic rollback
  • Production troubleshooting and monitoring

Introduction: Why Self-Host Next.js?

Vercel provides an exceptional developer experience for Next.js applications, but at scale, you might find yourself questioning the economics. When you're serving millions of requests per month, the cost difference between Vercel and self-hosting can be substantial. Beyond cost, self-hosting gives you complete control over your infrastructure, better compliance options, and the ability to fine-tune every aspect of your stack.

But self-hosting Next.js properly is non-trivial. You need to handle:

  • Container orchestration and scaling
  • CDN integration and cache invalidation
  • Zero-downtime deployments
  • Security hardening
  • Performance optimization
  • Cost management

This guide walks through a production-grade architecture that addresses all of these challenges, optimized over months of real-world operation serving millions of requests daily.

Architecture Overview

Our architecture leverages AWS managed services to create a highly available, performant, and cost-effective Next.js hosting platform:

User Request
    ↓
CloudFront CDN (with WAF)
    ↓
Application Load Balancer (HTTPS)
    ↓
ECS Fargate Tasks (Auto-scaling)
    ↓
Next.js Application (Standalone Mode)

Key Components:

  • VPC: Multi-AZ setup with public/private subnets
  • ECS Fargate: Container orchestration without server management
  • Application Load Balancer: HTTPS termination and health checking
  • CloudFront: Global CDN with sophisticated caching policies
  • WAF: Security layer with managed rules and bot protection
  • S3: Build artifact storage and CloudFront logging

This architecture provides:

  • 99.99% availability with multi-AZ deployment
  • Global edge caching via CloudFront's 400+ locations
  • Auto-scaling based on CPU, memory, and request count
  • Zero-downtime deployments with circuit breaker rollback
  • ~350MB Docker images (down from 800MB+)
  • Sub-120ms P99 response times for cached requests globally

Part 1: Infrastructure as Code with Terraform

VPC and Networking Setup

Start with a proper VPC foundation spanning multiple availability zones:

resource "aws_vpc" "main" {
  cidr_block = var.environment == "production" ? "10.10.0.0/16" : "10.20.0.0/16"

  tags = {
    Name = "${var.app_name}-${var.environment}-vpc"
  }
}

# Public subnets for ALB (adjust AZs based on your region)
resource "aws_subnet" "public_subnet_1" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.environment == "production" ? "10.10.1.0/24" : "10.20.1.0/24"
  availability_zone = "${var.aws_region}a"
  map_public_ip_on_launch = true
}

resource "aws_subnet" "public_subnet_2" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.environment == "production" ? "10.10.2.0/24" : "10.20.2.0/24"
  availability_zone = "${var.aws_region}b"
  map_public_ip_on_launch = true
}

# Private subnets for ECS tasks
resource "aws_subnet" "private_subnet_1" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.environment == "production" ? "10.10.10.0/24" : "10.20.10.0/24"
  availability_zone = "${var.aws_region}a"
}

resource "aws_subnet" "private_subnet_2" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.environment == "production" ? "10.10.11.0/24" : "10.20.11.0/24"
  availability_zone = "${var.aws_region}b"
}

Why this matters:

  • Multi-AZ deployment: If one availability zone fails, your app stays up
  • Public/private subnet separation: ALB in public, ECS tasks in private for security
  • Different CIDR blocks per environment: Prevents IP conflicts if you ever need VPC peering
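
The subnets above also need an internet gateway and route tables before any traffic flows. Here's a minimal sketch of the public-side routing; the resource names are assumptions, adjust them to your layout:

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# Public route table: default route out through the internet gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "public_1" {
  subnet_id      = aws_subnet.public_subnet_1.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "public_2" {
  subnet_id      = aws_subnet.public_subnet_2.id
  route_table_id = aws_route_table.public.id
}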

NAT Gateway for Private Subnet Internet Access

ECS tasks in private subnets need internet access for pulling Docker images and making external API calls:

resource "aws_eip" "nat" {
  tags = {
    Name = "${var.app_name}-${var.environment}-nat-eip"
  }
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public_subnet_1.id
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }
}

Cost consideration: NAT Gateway costs ~$32/month plus data transfer. For production, this is essential. For development environments, you might consider placing tasks in public subnets to save costs.
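
The private route table above also has to be associated with the private subnets, otherwise ECS tasks fall back to the VPC's main route table and never reach the NAT Gateway. A minimal sketch (resource names assumed):

resource "aws_route_table_association" "private_1" {
  subnet_id      = aws_subnet.private_subnet_1.id
  route_table_id = aws_route_table.private.id
}

resource "aws_route_table_association" "private_2" {
  subnet_id      = aws_subnet.private_subnet_2.id
  route_table_id = aws_route_table.private.id
}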

Security Groups: The CloudFront Challenge

Here's a gotcha: CloudFront uses 60+ IP ranges globally. AWS security groups have a 60-rule limit per group. The solution? Split CloudFront IPs across multiple security groups:

data "aws_ip_ranges" "cloudfront" {
  regions  = ["global"]
  services = ["cloudfront"]
}

# First 50 CloudFront IPs
resource "aws_security_group" "cloudfront_sg_1" {
  name        = "${var.app_name}-${var.environment}-cf-sg-1"
  description = "Security group for CloudFront IP ranges (1-50)"
  vpc_id      = aws_vpc.main.id

  dynamic "ingress" {
    for_each = slice(data.aws_ip_ranges.cloudfront.cidr_blocks, 0, 50)
    content {
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = [ingress.value]
      description = "Allow HTTPS from CloudFront"
    }
  }
}

# Remaining CloudFront IPs
resource "aws_security_group" "cloudfront_sg_2" {
  name        = "${var.app_name}-${var.environment}-cf-sg-2"
  description = "Security group for CloudFront IP ranges (51+)"
  vpc_id      = aws_vpc.main.id

  dynamic "ingress" {
    for_each = slice(data.aws_ip_ranges.cloudfront.cidr_blocks, 50, length(data.aws_ip_ranges.cloudfront.cidr_blocks))
    content {
      from_port   = 443
      to_port     = 443
      protocol    = "tcp"
      cidr_blocks = [ingress.value]
      description = "Allow HTTPS from CloudFront"
    }
  }
}

# Main ALB security group (for office IPs, monitoring, etc.)
resource "aws_security_group" "alb_sg" {
  name        = "${var.app_name}-${var.environment}-alb-sg"
  description = "Security group for ALB"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["YOUR_OFFICE_IP/32"]  # Optional: direct ALB access
    description = "Allow HTTPS from office"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# ECS tasks security group
resource "aws_security_group" "ecs_sg" {
  name        = "${var.app_name}-${var.environment}-ecs-sg"
  description = "Security group for ECS tasks"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = var.app_port  # e.g., 3000 for Next.js, 8080 for others
    to_port         = var.app_port
    protocol        = "tcp"
    security_groups = [
      aws_security_group.alb_sg.id,
      aws_security_group.cloudfront_sg_1.id,
      aws_security_group.cloudfront_sg_2.id,
    ]
    description = "Allow traffic from ALB"
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

Impact: Because the rules come from the aws_ip_ranges data source, your security groups pick up new CloudFront IP ranges on the next terraform apply. This prevents mysterious connection failures months later.

ECS Cluster with Auto-Scaling

ECS Fargate removes the need to manage EC2 instances:

resource "aws_ecs_cluster" "main" {
  name = "ecs-${var.app_name}-${var.environment}-cluster"

  configuration {
    execute_command_configuration {
      logging = "DEFAULT"
    }
  }

  setting {
    name  = "containerInsights"
    value = "enabled"  # Essential for monitoring
  }
}

resource "aws_ecs_service" "main" {
  cluster                            = aws_ecs_cluster.main.arn
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100
  desired_count                      = var.environment == "production" ? var.prod_task_count : var.staging_task_count
  health_check_grace_period_seconds  = 120
  name                               = "${var.app_name}-${var.environment}"
  task_definition                    = aws_ecs_task_definition.main.arn

  capacity_provider_strategy {
    base              = var.environment == "production" ? var.prod_task_count : var.staging_task_count
    capacity_provider = var.environment == "production" ? "FARGATE" : "FARGATE_SPOT"
    weight            = 1
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true  # Auto-rollback on failed deployments
  }

  load_balancer {
    container_name   = "${var.app_name}-${var.environment}"
    container_port   = var.app_port
    target_group_arn = aws_lb_target_group.main.arn
  }

  network_configuration {
    assign_public_ip = false
    security_groups  = [aws_security_group.ecs_sg.id]
    subnets          = [aws_subnet.private_subnet_1.id, aws_subnet.private_subnet_2.id]
  }
}

Key decisions:

  • FARGATE_SPOT for staging: Save ~70% on compute costs for non-production
  • Circuit breaker enabled: Automatically rolls back bad deployments
  • 120s health check grace period: Gives Next.js time to start up and warm up
  • 200% deployment max, 100% min: ECS starts a full replacement set of tasks before draining the old ones
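
The service references aws_ecs_task_definition.main, which isn't shown above. Here's a minimal Fargate task definition sketch; the CPU/memory sizing, log group name, and IAM role names are assumptions you'll need to adapt:

resource "aws_ecs_task_definition" "main" {
  family                   = "${var.app_name}-${var.environment}"
  cpu                      = 1024  # 1 vCPU (size to your workload)
  memory                   = 2048  # 2 GB
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  execution_role_arn       = aws_iam_role.ecs_execution.arn  # pulls the image, writes logs (assumed to exist)
  task_role_arn            = aws_iam_role.ecs_task.arn       # runtime AWS access (assumed to exist)

  container_definitions = jsonencode([
    {
      name         = "${var.app_name}-${var.environment}"
      image        = var.docker_image
      essential    = true
      portMappings = [{ containerPort = var.app_port, protocol = "tcp" }]
      environment = [
        { name = "NODE_ENV", value = "production" },
        { name = "KEEP_ALIVE_TIMEOUT", value = "35000" }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = "/ecs/${var.app_name}-${var.environment}"
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "app"
        }
      }
    }
  ])
}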

Three-Dimensional Auto-Scaling

Scale on CPU, memory, AND request count for comprehensive coverage:

resource "aws_appautoscaling_target" "ecs_service" {
  max_capacity       = var.max_task_count
  min_capacity       = var.min_task_count
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.main.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# CPU-based scaling
resource "aws_appautoscaling_policy" "cpu_scaling" {
  name               = "cpu-scaling-policy"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = var.cpu_target_value  # e.g., 75.0
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    scale_in_cooldown  = var.scale_in_cooldown  # e.g., 120
    scale_out_cooldown = var.scale_out_cooldown  # e.g., 30
  }
}

# Memory-based scaling
resource "aws_appautoscaling_policy" "memory_scaling" {
  name               = "memory-scaling-policy"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value       = var.memory_target_value  # e.g., 80.0
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
    scale_in_cooldown  = var.scale_in_cooldown
    scale_out_cooldown = var.scale_out_cooldown
  }
}

# Request count-based scaling
resource "aws_appautoscaling_policy" "request_count_scaling" {
  name               = "request-count-scaling-policy"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      resource_label         = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.main.arn_suffix}"
    }
    target_value       = var.requests_per_target  # e.g., 1000
    scale_in_cooldown  = var.scale_in_cooldown
    scale_out_cooldown = var.scale_out_cooldown
  }
}

Why three metrics?

  • CPU: Catches compute-intensive operations (image processing, data transformation)
  • Memory: Catches memory leaks or large data operations
  • Request count: Proactively scales before CPU/memory spike from traffic surge

Production vs staging cooldowns:

  • Production: Aggressive scale-out (e.g., 30s), conservative scale-in (e.g., 120s)
  • Staging: Conservative on both to reduce churn and save costs
  • Tune based on your traffic patterns and cost sensitivity

Application Load Balancer Configuration

resource "aws_lb" "main" {
  name               = "nextjs-ecs-${var.environment}"
  internal           = false
  load_balancer_type = "application"
  security_groups = [
    aws_security_group.alb_sg.id,
    aws_security_group.cloudfront_sg_1.id,
    aws_security_group.cloudfront_sg_2.id,
  ]
  subnets = [aws_subnet.public_subnet_1.id, aws_subnet.public_subnet_2.id]

  enable_deletion_protection = false
  enable_http2              = true
  idle_timeout              = 30  # Important for keepAlive tuning

  access_logs {
    bucket  = "your-alb-logs-bucket"
    enabled = true
    prefix  = "alb-logs/${var.environment}"
  }
}

resource "aws_lb_target_group" "nextjs" {
  name        = "nextjs-${var.environment}"
  port        = var.app_port  # e.g., 3000 for Next.js
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"  # Required for Fargate

  deregistration_delay = 30  # Drain connections for 30s before killing

  health_check {
    interval            = 15   # Check every 15 seconds
    path                = var.health_check_path  # e.g., "/health", "/api/status"
    protocol            = "HTTP"
    healthy_threshold   = 3    # Healthy after 3 successful checks
    unhealthy_threshold = 3    # Unhealthy after 3 failed checks
    timeout             = 5
    matcher             = "200-299"
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-FIPS-2023-04"
  certificate_arn   = "YOUR_ACM_CERTIFICATE_ARN"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.main.arn
  }
}

Health check tuning:

  • 15-second interval with threshold of 3 means unhealthy tasks are removed in 45s
  • New tasks become healthy after 45s (3 successful checks)
  • Balance between fast detection and avoiding false positives
  • Adjust interval and thresholds based on your app startup time

Deregistration delay (30s):

  • Allows in-flight requests to complete before terminating the task
  • Must be less than the deployment health check grace period (120s)
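
Not shown above: if you also open port 80 on the ALB security groups, an HTTP listener that redirects to HTTPS keeps plain-HTTP requests from dead-ending. A minimal sketch:

resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}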

Terraform Variables (variables.tf)

All the Terraform code above uses variables. Key ones you'll need:

variable "environment" { type = string }  # production, staging
variable "app_name" { type = string }
variable "docker_image" { type = string }
variable "domain_name" { type = string }
variable "acm_certificate_arn" { type = string }

# ECS scaling
variable "min_task_count" { default = 2 }
variable "max_task_count" { default = 10 }
variable "cpu_target_value" { default = 75.0 }
variable "memory_target_value" { default = 80.0 }

Example values:

environment = "production"
app_name    = "my-nextjs-app"
docker_image = "registry/app:prod-abc123"
domain_name = "example.com"

Part 2: Docker Optimization for Next.js

The Standalone Mode Revolution

Next.js 12.2+ introduced standalone mode, which dramatically reduces Docker image sizes:

// next.config.js
export default {
  output: 'standalone',  // This is the magic line
  swcMinify: true,
  // ... rest of config
}

What happens:

  • Next.js traces your dependencies and creates .next/standalone with only required node_modules
  • Before: 800MB+ Docker images with full node_modules
  • After: 350-500MB Docker images with minimal dependencies
  • Impact: Faster deploys, lower storage costs, quicker task startup
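
To sanity-check the standalone output locally before containerizing it, you can run the traced server directly. The paths follow Next.js's documented standalone layout; the port is just an example:

npm run build

# The standalone server expects static assets and public/ alongside it
cp -r .next/static .next/standalone/.next/static
cp -r public .next/standalone/public

# Serve on port 3000, listening on all interfaces
PORT=3000 HOSTNAME=0.0.0.0 node .next/standalone/server.js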

Production Dockerfile Optimized

FROM node:lts-alpine

# Install dumb-init for proper signal handling
RUN apk add --no-cache dumb-init

WORKDIR /app

# Set production environment
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
ENV HOSTNAME=0.0.0.0

# CRITICAL: Set keepAliveTimeout for ALB compatibility
# ALB idle timeout is 30s, so we set this HIGHER to ensure
# ALB closes connections before the app does
ENV KEEP_ALIVE_TIMEOUT=35000

# Copy standalone server (contains minimal node_modules)
COPY .next/standalone ./

# Copy static assets
COPY .next/static ./.next/static
COPY public ./public

# Copy .env.production for runtime environment variables
COPY .env.production ./.env.production

# Fix hostname to 0.0.0.0 (Docker overrides HOSTNAME env at runtime)
RUN sed -i "s/const hostname = process.env.HOSTNAME || '0.0.0.0'/const hostname = '0.0.0.0'/" server.js

# Set headersTimeout for ALB compatibility
# Inject after line 6 (const __dirname) using ES module import
RUN head -6 server.js > server.js.tmp && \
    printf '\nimport http from '"'"'http'"'"';\n' >> server.js.tmp && \
    printf 'const httpServer = http.Server.prototype;\n' >> server.js.tmp && \
    printf 'const originalListen = httpServer.listen;\n' >> server.js.tmp && \
    printf 'httpServer.listen = function(...args) {\n' >> server.js.tmp && \
    printf '  const result = originalListen.apply(this, args);\n' >> server.js.tmp && \
    printf '  this.headersTimeout = 36000;\n' >> server.js.tmp && \
    printf '  return result;\n' >> server.js.tmp && \
    printf '};\n\n' >> server.js.tmp && \
    tail -n +7 server.js >> server.js.tmp && \
    mv server.js.tmp server.js

EXPOSE 3000

# Health check for ECS (adjust port and path as needed)
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})"

# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "server.js"]

The ALB Timeout Problem (And Solution)

This is a critical issue that will cause mysterious 502 errors if not addressed:

The Problem:

  • ALB has a configurable idle timeout (commonly 30-60 seconds, default 60s)
  • Node.js default keepAliveTimeout is 5 seconds (too short!)
  • If the backend keepAliveTimeout is SHORTER than ALB idle timeout, ALB will try to reuse closed connections
  • This causes 502 errors when ALB sends requests on already-closed connections

The Solution:

  • Set KEEP_ALIVE_TIMEOUT to be HIGHER than ALB idle timeout (e.g., 35000ms for 30s ALB timeout)
  • Set headersTimeout slightly higher than keepAliveTimeout (e.g., 36000ms)
  • This ensures ALB closes idle connections BEFORE the backend does
  • The backend keeps connections open longer, preventing ALB from reusing closed connections

Impact:

  • Before fix: Random 502 errors under load
  • After fix: Zero timeout-related errors

Why the shell script patching?

  • Next.js standalone server.js doesn't expose headersTimeout yet (as of late 2024)
  • KEEP_ALIVE_TIMEOUT is officially supported via env var
  • headersTimeout requires patching server.js
  • This is a temporary workaround until Next.js adds official support

dumb-init: The Signal Handling Hero

ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "server.js"]

Why dumb-init?

  • Docker sends SIGTERM to PID 1 when stopping containers
  • Node.js doesn't handle signals properly as PID 1
  • Without dumb-init: 30s forced kill during deployments
  • With dumb-init: Graceful shutdowns, in-flight requests complete

Impact: Zero dropped requests during deployments.

Part 3: CloudFront CDN Configuration

CloudFront is where the real performance magic happens. Our setup uses 5 different cache policies for different content types.

Cache Policy 1: HTML Pages (Aggressive Caching)

resource "aws_cloudfront_cache_policy" "html_cache_policy" {
  name        = "html-cache-policy-${var.environment}"
  comment     = "Cache policy for HTML pages - 1 year TTL since we invalidate on deploy"

  default_ttl = 31536000  # 1 year
  max_ttl     = 31536000
  min_ttl     = 31536000  # Override origin max-age=0

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    headers_config {
      header_behavior = "none"  # Don't vary on geolocation for HTML
    }

    cookies_config {
      cookie_behavior = "none"
    }

    query_strings_config {
      query_string_behavior = "all"  # UTM params, etc.
    }
  }
}

Why 1 year TTL for HTML?

  • CloudFront caching is safe because we invalidate on every deploy
  • Long TTL = high cache hit ratio = faster response times
  • min_ttl = 31536000 overrides Next.js's Cache-Control: max-age=0

Impact:

  • 95%+ cache hit ratio for HTML pages
  • P50 response time: <50ms globally
  • P99 response time: <120ms globally

Cache Policy 2: Static Assets (Immutable Content)

resource "aws_cloudfront_cache_policy" "static_assets_cache_policy" {
  name    = "static-assets-cache-policy-${var.environment}"
  comment = "For /static/* (invalidated) and /_next/static/* (immutable)"

  default_ttl = 31536000
  max_ttl     = 31536000
  min_ttl     = 0  # Allow immediate updates if needed

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    headers_config {
      header_behavior = "none"
    }

    cookies_config {
      cookie_behavior = "none"
    }

    query_strings_config {
      query_string_behavior = "none"
    }
  }
}

Two types of static assets:

  1. /_next/static/*: Content-addressed (contains BUILD_ID and hash)
    • Files: /_next/static/[BUILD_ID]/[hash].js
    • NEVER invalidate these
    • Old clients need old bundles to prevent "chunk failed to load" errors
  2. /static/*: Public folder assets
    • Can be invalidated on deploy
    • Use versioned filenames when possible

Cache Policy 3: Next.js Data Files

resource "aws_cloudfront_cache_policy" "nextjs_data_cache_policy" {
  name    = "nextjs-data-cache-policy-${var.environment}"
  comment = "For /_next/data/* - invalidated on every deploy"

  default_ttl = 31536000
  max_ttl     = 31536000
  min_ttl     = 0

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    headers_config {
      header_behavior = "none"
    }

    cookies_config {
      cookie_behavior = "none"
    }

    query_strings_config {
      query_string_behavior = "none"
    }
  }
}

What are data files?

  • /_next/data/[BUILD_ID]/page.json
  • Used by Next.js client-side navigation
  • Must be invalidated on deploy for content updates

Cache Policy 4: Image Optimization

resource "aws_cloudfront_cache_policy" "image_cache_policy" {
  name    = "image-cache-policy-${var.environment}"
  comment = "For Next.js image optimizer - NOT invalidated"

  default_ttl = 31536000
  max_ttl     = 31536000
  min_ttl     = 31536000  # Override Next.js Cache-Control headers

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    headers_config {
      header_behavior = "whitelist"
      headers {
        items = ["Accept"]  # Distinguish WebP/AVIF vs JPEG
      }
    }

    cookies_config {
      cookie_behavior = "none"
    }

    query_strings_config {
      query_string_behavior = "all"  # w, q, url parameters
    }
  }
}

Why vary on Accept header?

  • Modern browsers send Accept: image/avif,image/webp,*/*
  • Old browsers send Accept: */*
  • CloudFront serves appropriate format based on client support
  • Each format cached separately

Why override Cache-Control?

  • Next.js Image Optimization sets short cache times
  • We want aggressive CloudFront caching
  • Images don't change (or use versioned URLs if they do)

Cache Policy 5: API Routes (No Cache)

resource "aws_cloudfront_cache_policy" "no_cache_policy" {
  name    = "no-cache-policy-${var.environment}"
  comment = "Disable caching for API requests"

  default_ttl = 0
  max_ttl     = 0
  min_ttl     = 0

  parameters_in_cache_key_and_forwarded_to_origin {
    headers_config {
      header_behavior = "none"
    }
    cookies_config {
      cookie_behavior = "none"
    }
    query_strings_config {
      query_string_behavior = "none"
    }
  }
}

CloudFront Distribution with Ordered Cache Behaviors

resource "aws_cloudfront_distribution" "main" {
  enabled         = true
  is_ipv6_enabled = true
  http_version    = "http2and3"  # Enable HTTP/3
  price_class     = var.environment == "production" ? "PriceClass_All" : "PriceClass_100"

  origin {
    domain_name = var.environment == "production" ? "alb-prod.example.com" : "alb-staging.example.com"
    origin_id   = "origin-nextjs-${var.environment}"

    origin_shield {
      enabled              = true
      origin_shield_region = "${var.aws_region}"
    }

    custom_origin_config {
      http_port                = 80
      https_port               = 443
      origin_protocol_policy   = "https-only"
      origin_ssl_protocols     = ["TLSv1.2"]
      origin_read_timeout      = 45
      origin_keepalive_timeout = 10
    }
  }

  # Access logging for performance analysis
  logging_config {
    include_cookies = false
    bucket          = aws_s3_bucket.cloudfront_logs.bucket_domain_name
    prefix          = "cloudfront-logs/"
  }

  aliases = var.environment == "production" ? ["example.com"] : ["staging.example.com"]

  # Default cache behavior for HTML pages
  default_cache_behavior {
    target_origin_id       = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD", "OPTIONS"]
    cached_methods         = ["GET", "HEAD"]
    cache_policy_id        = aws_cloudfront_cache_policy.html_cache_policy.id
    compress               = true
    origin_request_policy_id = aws_cloudfront_origin_request_policy.geolocation_policy.id
  }

  # API routes (no cache)
  ordered_cache_behavior {
    path_pattern     = "/api/*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS", "POST", "PUT", "PATCH", "DELETE"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    cache_policy_id        = aws_cloudfront_cache_policy.no_cache_policy.id
    compress               = true
    origin_request_policy_id = aws_cloudfront_origin_request_policy.geolocation_policy.id
  }

  # Static assets
  ordered_cache_behavior {
    path_pattern     = "/static/*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD", "OPTIONS"]
    target_origin_id = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    cache_policy_id        = aws_cloudfront_cache_policy.static_assets_cache_policy.id
    compress               = true
  }

  # Next.js immutable assets (NEVER invalidate)
  ordered_cache_behavior {
    path_pattern     = "/_next/static/*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD", "OPTIONS"]
    target_origin_id = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    cache_policy_id        = aws_cloudfront_cache_policy.static_assets_cache_policy.id
    compress               = true
  }

  # Next.js data files (invalidate on deploy)
  ordered_cache_behavior {
    path_pattern     = "/_next/data/*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD", "OPTIONS"]
    target_origin_id = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    cache_policy_id        = aws_cloudfront_cache_policy.nextjs_data_cache_policy.id
    compress               = true
  }

  # Images (aggressive caching)
  ordered_cache_behavior {
    path_pattern     = "/_next/image*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD", "OPTIONS"]
    target_origin_id = "origin-nextjs-${var.environment}"
    viewer_protocol_policy = "redirect-to-https"
    cache_policy_id        = aws_cloudfront_cache_policy.image_cache_policy.id
    compress               = true
  }

  restrictions {
    geo_restriction {
      restriction_type = "blacklist"
      locations        = ["CN"]  # Block China if needed
    }
  }

  viewer_certificate {
    acm_certificate_arn      = "YOUR_ACM_CERTIFICATE_ARN"
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }

  web_acl_id = aws_wafv2_web_acl.main.arn
}
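
The behaviors above reference aws_cloudfront_origin_request_policy.geolocation_policy, which isn't shown. Here's a minimal sketch that forwards viewer headers plus CloudFront's geolocation header to the origin; the exact header list is an assumption, adjust it to whatever your app actually reads:

resource "aws_cloudfront_origin_request_policy" "geolocation_policy" {
  name    = "geolocation-policy-${var.environment}"
  comment = "Forward viewer headers plus CloudFront geolocation to the origin"

  cookies_config {
    cookie_behavior = "all"
  }

  headers_config {
    header_behavior = "allViewerAndWhitelistCloudFront"
    headers {
      items = ["CloudFront-Viewer-Country"]
    }
  }

  query_strings_config {
    query_string_behavior = "all"
  }
}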

Origin Shield: Regional Caching Layer

origin_shield {
  enabled              = true
  origin_shield_region = "${var.aws_region}"
}

What is Origin Shield?

  • An additional caching layer between CloudFront's edge locations and your origin
  • Cache misses from every edge location funnel through the single Origin Shield region you choose
  • Origin Shield then fetches from your ALB only if it doesn't already have the object

Impact:

  • Before: 400+ edge locations fetching from your ALB on cache misses
  • After: misses funnel through one Origin Shield region before reaching your ALB
  • Result: 80-90% reduction in origin requests
  • Cost: ~$0.005/10,000 requests (~$10/month for 20M requests)

Server-Timing Headers for Performance Debugging

resource "aws_cloudfront_response_headers_policy" "security_headers_policy" {
  name = "nextjs-security-headers-${var.environment}"

  security_headers_config {
    # ... security headers ...
  }

  server_timing_headers_config {
    enabled       = true
    sampling_rate = 10  # 10% of requests
  }
}

What you get:

Server-Timing: cdn-cache-miss;desc="cache miss"
Server-Timing: cdn-downstream-fbl;dur=50
Server-Timing: cdn-upstream-fbl;dur=100

Metrics provided:

  • cdn-cache-hit or cdn-cache-miss: Whether request hit cache
  • cdn-downstream-fbl: Time to first byte from CloudFront to client
  • cdn-upstream-fbl: Time to first byte from origin to CloudFront

Impact: Essential for diagnosing slow response times and cache hit ratios.
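
With a 10% sampling rate only some responses carry the header, so you may need a few attempts. A quick check from the command line (the domain is a placeholder); CloudFront's x-cache header shows hit/miss status on every response:

# Inspect CDN timing and cache status for a page
curl -sI https://example.com/ | grep -iE 'server-timing|x-cache'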

Part 4: WAF Configuration

resource "aws_wafv2_web_acl" "main" {
  provider    = aws.us_east_1  # Must be us-east-1 for CloudFront (global service)
  name        = "${var.app_name}-${var.environment}-waf"
  scope       = "CLOUDFRONT"

  default_action {
    dynamic "allow" {
      for_each = var.environment == "production" ? [1] : []
      content {}
    }
    dynamic "block" {
      for_each = var.environment == "staging" ? [1] : []
      content {}
    }
  }

  # Staging: IP allowlist (office + CI/CD)
  dynamic "rule" {
    for_each = var.environment == "staging" ? [1] : []
    content {
      name     = "StagingAllowedIPs"
      priority = 1

      action {
        allow {}
      }

      statement {
        ip_set_reference_statement {
          arn = aws_wafv2_ip_set.staging_allowed_ips[0].arn
        }
      }

      visibility_config {
        cloudwatch_metrics_enabled = true
        metric_name                = "StagingAllowedIPs"
        sampled_requests_enabled   = true
      }
    }
  }

  # AWS Managed Rules
  rule {
    name     = "AWSManagedRulesCommonRuleSet"
    priority = 2

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "AWSManagedRulesCommonRuleSet"
      sampled_requests_enabled   = true
    }
  }

  rule {
    name     = "AWSManagedRulesSQLiRuleSet"
    priority = 4

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesSQLiRuleSet"
        vendor_name = "AWS"
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "AWSManagedRulesSQLiRuleSet"
      sampled_requests_enabled   = true
    }
  }

  # Bot Control with overrides
  rule {
    name     = "AWSManagedRulesBotControlRuleSet"
    priority = 6

    override_action {
      none {}
    }

    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesBotControlRuleSet"
        vendor_name = "AWS"

        managed_rule_group_configs {
          aws_managed_rules_bot_control_rule_set {
            inspection_level = "COMMON"
          }
        }

        # Allow SEO bots, monitoring tools
        rule_action_override {
          name = "CategorySearchEngine"
          action_to_use {
            count {}  # Monitor but don't block
          }
        }

        rule_action_override {
          name = "CategoryMonitoring"
          action_to_use {
            count {}
          }
        }
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "BotControl"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "nextjs-${var.environment}-waf"
    sampled_requests_enabled   = true
  }
}
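
The staging rule references aws_wafv2_ip_set.staging_allowed_ips, which isn't defined above. A minimal sketch; the count guard mirrors the conditional rule, it assumes the same aws.us_east_1 provider alias, and the addresses are placeholders:

resource "aws_wafv2_ip_set" "staging_allowed_ips" {
  count              = var.environment == "staging" ? 1 : 0
  provider           = aws.us_east_1
  name               = "${var.app_name}-staging-allowed-ips"
  scope              = "CLOUDFRONT"
  ip_address_version = "IPV4"

  addresses = [
    "203.0.113.10/32",   # office IP (placeholder)
    "198.51.100.0/24",   # CI/CD range (placeholder)
  ]
}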

WAF costs:

  • Base: $5/month per web ACL
  • Rules: $1/month per rule
  • Requests: $0.60 per 1M requests
  • Bot Control: $10/month + $1 per 1M requests

For 20M requests/month:

  • Base + 5 rules: $10
  • Requests: $12
  • Bot Control: $30
  • Total: ~$52/month

Impact: Blocks ~2-5% of malicious traffic automatically.

Part 5: CI/CD Pipeline Deep Dive

Build Optimization with S3 Artifacts

- step:
    name: Build
    script:
      - npm install
      - export NODE_ENV=production
      - export NODE_OPTIONS="--max_old_space_size=49152"  # 48GB heap
      - npm run build
      - export COMMIT_SHORT=$(echo $BITBUCKET_COMMIT | cut -c1-7)

      # Compress and upload to S3
      - tar czf next-build.tar.gz .next/standalone .next/static pages-json
      - aws s3 cp next-build.tar.gz s3://${BUILD_ARTIFACTS_BUCKET}/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/next-build.tar.gz
    artifacts:
      - public/**
      - package.json

Why S3 for artifacts?

  • Build step runs on 16x instance with 128GB RAM
  • Docker step runs on instance with Docker service
  • S3 allows passing large artifacts (100MB+) between steps
  • Artifacts are cleaned up after deploy

Why 48GB heap?

  • Next.js build can use 4-8GB for large sites
  • CI/CD 16x has 128GB RAM
  • Setting max_old_space_size prevents OOM crashes

Docker Build Step

- step:
    name: Dockerize
    services:
      - docker
    script:
      - export COMMIT_SHORT=$(echo $BITBUCKET_COMMIT | cut -c1-7)

      # Download build artifacts
      - aws s3 cp s3://${BUILD_ARTIFACTS_BUCKET}/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/next-build.tar.gz .
      - tar xzf next-build.tar.gz

      # Build and push Docker image
      - docker login -u ${DOCKER_REGISTRY_USERNAME} -p ${DOCKER_REGISTRY_PASSWORD}
      - docker build --no-cache -f ${ENVIRONMENT}.Dockerfile -t org/nextjs:${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER} .
      - docker push org/nextjs:${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER}

      # Clean up S3
      - aws s3 rm --recursive s3://${BUILD_ARTIFACTS_BUCKET}/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/

Why separate build and dockerize?

  • Build needs lots of CPU/RAM
  • Dockerize needs Docker service
  • Can't run both on same step efficiently
  • S3 is the glue

Terraform Apply

- step:
    name: Terraform Apply
    deployment: Production
    script:
      - curl -LO "https://releases.hashicorp.com/terraform/1.9.5/terraform_1.9.5_linux_amd64.zip"
      - unzip terraform_1.9.5_linux_amd64.zip
      - mv terraform /usr/local/bin/

      - export COMMIT_SHORT=$(echo $BITBUCKET_COMMIT | cut -c1-7)

      - terraform init -backend-config="key=state/${tf_aws_key}.tfstate" -backend-config="bucket=${tf_aws_bucket}"

      - terraform plan -parallelism=10 \
          -var="environment=${ENVIRONMENT}" \
          -var="image=org/nextjs:${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER}" \
          -out=terraform.tfplan

      - terraform apply -parallelism=10 -auto-approve terraform.tfplan

parallelism=10: Create/update 10 resources at a time (default is 10, but being explicit helps)

ECS Stability Check (The Critical Step)

This bash script ensures deployment succeeded before invalidating CloudFront cache:

#!/bin/bash
set -e

ENVIRONMENT="${DEPLOYMENT_ENV}"
SERVICE_NAME="${REPO_NAME}"
REGION="${AWS_REGION}"  # e.g., us-east-1
CLUSTER_NAME="ecs-nextjs-${ENVIRONMENT}-cluster"
SERVICE_NAME_FULL="${SERVICE_NAME}-${ENVIRONMENT}"
MAX_WAIT_SECONDS=1200  # 20 minutes
POLL_INTERVAL=30

check_circuit_breaker_rollback() {
    local deployment_info=$(aws ecs describe-services \
        --cluster "${CLUSTER_NAME}" \
        --services "${SERVICE_NAME_FULL}" \
        --region "${REGION}" \
        --query 'services[0].deployments[0]' \
        --output json)

    local rollout_state=$(echo "$deployment_info" | jq -r '.rolloutState // "null"')
    local rollout_reason=$(echo "$deployment_info" | jq -r '.rolloutStateReason // "null"')

    if [ "$rollout_state" = "FAILED" ]; then
        echo "❌ Deployment FAILED - rollout state: $rollout_state"
        echo "❌ Rollout reason: $rollout_reason"
        return 1
    fi

    if echo "$rollout_reason" | grep -i "circuit breaker" > /dev/null; then
        echo "❌ Circuit breaker triggered rollback: $rollout_reason"
        return 1
    fi

    return 0
}

check_expected_image() {
    local expected_image_tag="${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER}"
    local current_image=$(aws ecs describe-services \
        --cluster "${CLUSTER_NAME}" \
        --services "${SERVICE_NAME_FULL}" \
        --region "${REGION}" \
        --query 'services[0].taskDefinition' \
        --output text)

    local task_image=$(aws ecs describe-task-definition \
        --task-definition "$current_image" \
        --region "${REGION}" \
        --query 'taskDefinition.containerDefinitions[0].image' \
        --output text)

    if echo "$task_image" | grep "$expected_image_tag" > /dev/null; then
        echo "✅ Expected image is deployed: $task_image"
        return 0
    else
        echo "⚠️ Current image ($task_image) doesn't match expected ($expected_image_tag)"
        return 1
    fi
}

# Initial check
if ! check_circuit_breaker_rollback; then
    echo "❌ Deployment already failed - aborting cache invalidation"
    exit 1
fi

# Wait for stability
start_time=$(date +%s)
while true; do
    current_time=$(date +%s)
    elapsed=$((current_time - start_time))

    if [ $elapsed -ge $MAX_WAIT_SECONDS ]; then
        echo "❌ Timeout reached (${MAX_WAIT_SECONDS}s)"
        exit 1
    fi

    # Check for rollback
    if ! check_circuit_breaker_rollback; then
        echo "❌ Deployment was rolled back"
        exit 1
    fi

    # Check if stable
    if timeout 60 aws ecs wait services-stable \
        --cluster "${CLUSTER_NAME}" \
        --services "${SERVICE_NAME_FULL}" \
        --region "${REGION}" 2>/dev/null; then
        echo "✅ ECS service is now stable"
        break
    else
        echo "⏳ Still waiting for stability... (${elapsed}s elapsed)"
        sleep $POLL_INTERVAL
    fi
done

# Final verification
RUNNING_COUNT=$(aws ecs describe-services \
    --cluster "${CLUSTER_NAME}" \
    --services "${SERVICE_NAME_FULL}" \
    --region "${REGION}" \
    --query 'services[0].runningCount' \
    --output text)

DESIRED_COUNT=$(aws ecs describe-services \
    --cluster "${CLUSTER_NAME}" \
    --services "${SERVICE_NAME_FULL}" \
    --region "${REGION}" \
    --query 'services[0].desiredCount' \
    --output text)

if [ "$RUNNING_COUNT" != "$DESIRED_COUNT" ]; then
    echo "❌ Running count ($RUNNING_COUNT) doesn't match desired ($DESIRED_COUNT)"
    exit 1
fi

if ! check_expected_image; then
    echo "❌ Expected image is not deployed"
    exit 1
fi

echo "✅ All checks passed - ready for CloudFront invalidation"

Why this is critical:

  • Circuit breaker might rollback due to failed health checks
  • If we invalidate CloudFront cache before rollback, users see errors
  • This script verifies deployment succeeded before invalidating cache

CloudFront Invalidation Strategy

#!/bin/bash
set -e

output_file="root_urls.txt"
> "$output_file"

# CRITICAL: Never invalidate /_next/static/* (immutable assets)
# This prevents "chunk failed to load" errors
echo "/_next/data/*" >> "$output_file"
echo "/static/*" >> "$output_file"

# Extract URLs from sitemaps
for sm in public/sitemap*.xml; do
  grep '<loc>' "$sm" | \
  sed -n 's|.*<loc>https://example.com\(/[^/]*\).*|\1|p' | \
  sort -u | \
  while read -r line; do
      if [ "$line" = "/" ]; then
          echo "$line" >> "$output_file"
      else
          echo "$line" >> "$output_file"
          echo "${line}/" >> "$output_file"
          echo "${line}/*" >> "$output_file"
      fi
  done
done

echo "/sitemap.xml" >> "$output_file"
echo "/robots.txt" >> "$output_file"
echo "/404" >> "$output_file"

Why three variations (path, path/, path/*):

  • CloudFront caches /about, /about/, and /about/index.html as separate cache keys
  • Invalidating all three ensures consistent behavior

Batch invalidation with retry:

DISTRIBUTION_ID="YOUR_DISTRIBUTION_ID"
BATCH_SIZE=10

# Load the invalidation paths collected above
mapfile -t URIS < root_urls.txt

invalidate_batch() {
    local batch=("$@")

    for attempt in {1..5}; do
        local wait_time=$((2 ** (attempt - 1) * 5))  # Exponential backoff

        if output=$(aws cloudfront create-invalidation --distribution-id "$DISTRIBUTION_ID" --paths "${batch[@]}" 2>&1); then
            echo "✅ Successfully invalidated batch on attempt $attempt"
            return 0
        else
            echo "⚠️ Error on attempt $attempt. Waiting ${wait_time}s before retry"
            sleep $wait_time
        fi
    done

    echo "❌ Failed to invalidate batch after 5 attempts"
    return 1
}

for ((i=0; i<${#URIS[@]}; i+=BATCH_SIZE)); do
    batch=("${URIS[@]:i:BATCH_SIZE}")
    invalidate_batch "${batch[@]}"
    sleep 10  # Rate limiting between batches
done

Why small batches (10)?

  • CloudFront has rate limits
  • Small batches = more reliable
  • Can retry individual batches without losing all progress

Cache Warmup

- step:
    name: Warm Cache
    atlassian-ip-ranges: true  # Whitelist CI/CD IPs
    script:
      - cd cache-warmup
      - python3 -m venv .venv
      - .venv/bin/pip install -r requirements.txt
      - .venv/bin/python cache-warmup.py \
          --mode pages \
          --site-url https://example.com \
          --sitemap ../public/sitemap.xml \
          --concurrent 50 \
          --rps 100

What this does:

  • Reads sitemap.xml
  • Sends HEAD requests to all pages
  • Configurable concurrency and rate limiting
  • Warms CloudFront edge caches globally

Impact:

  • First user request after deploy: Cache hit (warm)
  • Without warmup: Cache miss (cold) = slower

Adjust concurrency/RPS based on your infrastructure capacity.
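
The warmup script itself is project-specific and not shown here; a stripped-down equivalent with curl and xargs illustrates the idea. The sitemap path, concurrency, and output format are placeholders:

# Extract URLs from the sitemap and send HEAD requests, 20 in parallel
grep '<loc>' public/sitemap.xml \
  | sed -E 's|.*<loc>(.*)</loc>.*|\1|' \
  | xargs -P 20 -n 1 curl -s -o /dev/null -I -w '%{http_code} %{url_effective}\n'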

Part 6: Cost Optimization

Fargate Spot for Staging

capacity_provider_strategy {
  base              = var.environment == "production" ? 5 : 2
  capacity_provider = var.environment == "production" ? "FARGATE" : "FARGATE_SPOT"
  weight            = 1
}

Fargate Spot savings:

  • Regular Fargate: ~$0.04/vCPU/hour + ~$0.004/GB/hour (varies by region)
  • Fargate Spot: ~70% cheaper
  • For staging: Can reduce compute costs from ~$100/month to ~$30/month

When to use Spot:

  • Staging, development environments
  • Fault-tolerant workloads
  • When you can handle occasional interruptions

When NOT to use Spot:

  • Production (user-facing)
  • When interruptions are unacceptable

CloudFront Price Class Optimization

price_class = var.environment == "production" ? "PriceClass_All" : "PriceClass_100"

Price classes:

  • PriceClass_100: North America, Europe (~$0.085/GB)
  • PriceClass_200: Above + Asia, Africa, Middle East (~$0.100/GB)
  • PriceClass_All: All edge locations (~$0.120/GB)

For staging with PriceClass_100:

  • Limited to North America and Europe edge locations
  • Saves ~30% on data transfer
  • Perfectly fine for internal testing

S3 Lifecycle Policies

resource "aws_s3_bucket_lifecycle_configuration" "cloudfront_logs" {
  bucket = aws_s3_bucket.cloudfront_logs.id

  rule {
    id     = "delete-old-logs"
    status = "Enabled"

    expiration {
      days = 15
    }
  }
}

Impact:

  • CloudFront logs: ~10-50MB/day
  • Without lifecycle: Grows indefinitely
  • With 15-day retention: Cap at ~750MB
  • Savings: Minimal cost, prevents bloat
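
The lifecycle rule assumes an aws_s3_bucket.cloudfront_logs bucket, which isn't shown elsewhere. A minimal sketch; note that CloudFront standard logging writes via S3 ACLs, so the bucket must not enforce bucket-owner-only ownership:

resource "aws_s3_bucket" "cloudfront_logs" {
  bucket = "${var.app_name}-${var.environment}-cloudfront-logs"
}

# CloudFront standard logs are delivered using the bucket ACL mechanism
resource "aws_s3_bucket_ownership_controls" "cloudfront_logs" {
  bucket = aws_s3_bucket.cloudfront_logs.id

  rule {
    object_ownership = "BucketOwnerPreferred"
  }
}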

Resource Cleanup in Pipeline

# Clean up S3 artifacts after Docker push
aws s3 rm --recursive s3://build-artifacts/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/

Impact:

  • Build artifacts: 100-500MB per build
  • S3 standard storage: ~$0.023/GB/month
  • At 100 builds/month, uncleaned artifacts accumulate by 10-50GB every month, indefinitely
  • With cleanup: effectively $0

Part 7: Performance Metrics

After implementing this architecture, here are real-world performance metrics:

CloudFront Cache Hit Ratio

  • HTML pages: 95-98%
  • Static assets: 99%+
  • Images: 98%+
  • Overall: 96%+

What this means:

  • Only 4% of requests hit origin
  • 96% served from edge in <50ms

Response Time Distribution

P50 (Median):

  • Cache hit: 30-50ms
  • Cache miss: 150-300ms

P95:

  • Cache hit: 80-120ms
  • Cache miss: 400-600ms

P99:

  • Cache hit: 50-120ms
  • Cache miss: 800-1200ms

Deployment Metrics

  • Build time: 5-8 minutes
  • Docker build: 2-3 minutes
  • Terraform apply: 1-2 minutes
  • ECS rollout: 3-5 minutes
  • Cache invalidation: 2-3 minutes
  • Total: 13-21 minutes

Zero-downtime deployments: 100% success rate with circuit breaker

Cost Breakdown (Example: 20M requests/month)

Service                                    Estimated Cost/Month
ECS Fargate (varies by task size/count)    $150-250
Application Load Balancer                  $20-30
NAT Gateway                                $30-40
CloudFront (varies by data transfer)       $150-200
WAF                                        $50-70
S3 (logs)                                  $5-10
CloudWatch Logs                            $10-20
Total                                      $415-620

Comparable Vercel cost: $2000-3500/month for similar traffic

Note: Actual costs vary based on task sizing, data transfer, and traffic patterns.

Part 8: Lessons Learned & Best Practices

What Worked Extremely Well

1. Standalone Mode

  • Single biggest win: 800MB → 350MB images
  • Faster deployments, lower storage costs
  • Enable it immediately

2. Circuit Breaker

  • Saved us from multiple bad deployments
  • Auto-rollback prevented user impact
  • Always enable with rollback = true

3. Multiple CloudFront Cache Policies

  • Different content needs different caching
  • HTML, static assets, images, API all optimized separately
  • 95%+ cache hit ratio

4. Never Invalidate /_next/static/*

  • Content-addressed assets are immutable
  • Old clients need old bundles
  • Prevents "chunk failed to load" errors

5. ECS Stability Check Before Cache Invalidation

  • Prevents invalidating cache for failed deployments
  • Catches rollbacks before users see errors

What We'd Do Differently

1. Start with CloudWatch Container Insights Enabled

  • We added this later
  • Essential for debugging performance issues
  • Enable from day one

2. Implement Server-Timing Headers Earlier

  • Took months to add
  • Would have helped diagnose issues faster
  • 10% sampling is plenty

3. Use Terraform Modules from the Start

  • Our main.tf is 1500+ lines
  • Should have split into modules earlier
  • Network, ECS, CloudFront, WAF modules

4. Add CloudFront Logging from Day One

  • Added later for performance analysis
  • Invaluable for debugging
  • S3 costs are minimal

Common Pitfalls to Avoid

1. ALB Timeout Issues

  • Symptom: Random 502 errors
  • Cause: keepAliveTimeout < ALB idle timeout (backend closes connections before ALB)
  • Solution: Set keepAliveTimeout HIGHER than ALB idle timeout (e.g., 35s for 30s ALB, or 65s for 60s ALB)

2. Invalidating /_next/static/*

  • Symptom: "ChunkLoadError" for users
  • Cause: Old clients request old bundles, but they're invalidated
  • Solution: NEVER invalidate content-addressed assets

3. Not Checking Circuit Breaker Before Invalidation

  • Symptom: Users see errors after "successful" deploy
  • Cause: Deployment rolled back, but cache was invalidated
  • Solution: Check ECS deployment status before invalidation

4. CloudFront Security Group Limits

  • Symptom: "Security group rule limit exceeded"
  • Cause: CloudFront has 60+ IP ranges, SG has 60-rule limit
  • Solution: Split across multiple security groups

5. Not Setting Health Check Grace Period

  • Symptom: Tasks marked unhealthy during startup
  • Cause: Health checks start before app is ready
  • Solution: Set grace period to 120s for Next.js

When to Use This Architecture

Use this when:

  • You're serving 5M+ requests/month (cost savings justify complexity)
  • You need full control over infrastructure
  • You have compliance requirements (SOC2, HIPAA, etc.)
  • You want to fine-tune every aspect of performance
  • Your team has AWS/DevOps expertise

Stick with Vercel when:

  • You're under 5M requests/month
  • You want zero infrastructure management
  • You need preview deployments for every PR
  • Your team is small and focused on features
  • You value convenience over cost optimization

Part 9: Security Best Practices

Environment-Specific WAF Rules

default_action {
  dynamic "allow" {
    for_each = var.environment == "production" ? [1] : []
    content {}
  }
  dynamic "block" {
    for_each = var.environment == "staging" ? [1] : []
    content {}
  }
}

  • Production: allow by default, filtered by the managed rules
  • Staging: block by default, with an IP allowlist

Security Headers via CloudFront

security_headers_config {
  content_type_options {
    override = true
  }
  frame_options {
    frame_option = "DENY"
    override     = true
  }
  referrer_policy {
    referrer_policy = "strict-origin-when-cross-origin"
    override        = true
  }
  xss_protection {
    mode_block = true
    protection = true
    override   = true
  }
  strict_transport_security {
    access_control_max_age_sec = 31536000
    include_subdomains         = true
    preload                    = true
    override                   = true
  }
}

Impact: A+ rating on securityheaders.com

Secrets Management

Never put secrets in:

  • Dockerfile
  • Environment variables in task definition
  • Terraform files

Use:

  • AWS Secrets Manager for sensitive values
  • IAM roles for AWS service access
  • Environment variables in .env.production (not committed to git)
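
In an ECS task definition, Secrets Manager values are injected through the secrets block rather than environment, so the plaintext value never appears in Terraform state diffs or the console. A sketch of the relevant container-definition fragment; the variable name and ARN are placeholders, and the execution role needs secretsmanager:GetSecretValue on that ARN:

container_definitions = jsonencode([
  {
    name  = "${var.app_name}-${var.environment}"
    image = var.docker_image

    # Non-sensitive configuration can stay in environment
    environment = [
      { name = "NODE_ENV", value = "production" }
    ]

    # Sensitive values are resolved by ECS at task start
    secrets = [
      {
        name      = "DATABASE_URL"  # placeholder variable name
        valueFrom = "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/database-url"  # placeholder ARN
      }
    ]
  }
])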

Part 10: Monitoring & Alerts

Set up CloudWatch alarms for production. Example alarm for high CPU:

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "${var.app_name}-ecs-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  threshold           = 85
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ServiceName = aws_ecs_service.main.name
    ClusterName = aws_ecs_cluster.main.name
  }
}

Set up similar alarms for:

  • ECS memory utilization (> 85%)
  • ALB 5xx errors (> 10 per 5min)
  • ALB response time (> 2s)
  • CloudFront error rate (> 1%)
  • CloudFront cache hit ratio (< 85%)
  • ECS task count (< minimum)
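
The ALB-side alarms use a different namespace and dimensions than the ECS example above. Here's a sketch for the 5xx alarm from the list; the threshold matches the "> 10 per 5min" guideline:

resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "${var.app_name}-alb-5xx-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  treat_missing_data  = "notBreaching"  # no 5xx datapoints means healthy
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    LoadBalancer = aws_lb.main.arn_suffix
  }
}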

SNS Topic for Alerts

resource "aws_sns_topic" "alerts" {
  name = "${var.app_name}-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "devops@example.com"
}

Impact: Alarms detect issues in 5-10 minutes. Add a CloudWatch Dashboard to visualize ECS CPU/memory, ALB response times, and CloudFront cache metrics.

Part 11: Troubleshooting Guide

Issue 1: ECS Tasks Fail to Start

Common causes:

  • Docker image pull errors (check ECS service events for "CannotPullContainerError")
  • Insufficient CPU/memory allocation
  • Health checks failing too quickly

Fix:

  • Verify image exists and task execution role has ECR permissions
  • Increase health_check_grace_period_seconds = 180
  • Check NAT Gateway is working for private subnets

Issue 2: 502 Bad Gateway Errors

Most common cause: ALB timeout mismatch

Fix:

  • Set KEEP_ALIVE_TIMEOUT > ALB idle_timeout (e.g., 35s for 30s ALB, 65s for 60s ALB)
  • Set headersTimeout slightly higher than keepAliveTimeout
  • Check Container Insights for memory issues (increase task memory if needed)

Issue 3: CloudFront Serving Stale Content

Common causes:

  • Invalidation paths didn't cover all variations (path, path/, and path/*)
  • Invalidation hasn't completed yet, or was skipped because the deployment failed

Fix:

  • Only invalidate HTML pages and /_next/data/*; never invalidate /_next/static/* (content-addressed, immutable assets)
  • Wait for the invalidation to complete before testing
  • Clear browser cache when debugging

Part 12: Health Check Implementation

Create a simple health check endpoint for ALB:

// pages/api/health.ts
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(req: NextApiRequest, res: NextApiResponse) {
  res.status(200).json({
    status: 'ok',
    timestamp: new Date().toISOString()
  });
}

Configure in Terraform:

health_check {
  path     = "/api/health"
  interval = 15
  timeout  = 5
  healthy_threshold   = 3
  unhealthy_threshold = 3
}

Keep it simple: Health check should respond in <100ms. For advanced monitoring (database checks, external APIs), create a separate /api/health/detailed endpoint.

Conclusion

Self-hosting Next.js on AWS is non-trivial, but with the right architecture, you can achieve:

  • 99.99% uptime with multi-AZ deployment
  • Sub-120ms P99 response times globally via CloudFront
  • 60-70% cost savings vs Vercel at scale
  • Zero-downtime deployments with circuit breaker protection
  • Enterprise-grade security with WAF and managed rules

The key principles:

  1. Use Next.js standalone mode for minimal Docker images
  2. Aggressive CloudFront caching with smart invalidation
  3. Never invalidate content-addressed assets (/_next/static/*)
  4. Verify deployment success before cache invalidation
  5. Auto-scaling on multiple metrics (CPU, memory, requests)
  6. Circuit breaker for automatic rollback
  7. Origin Shield to reduce origin load
  8. Environment-specific optimizations (Spot for staging, different auto-scaling)

This architecture has served millions of requests daily for months with zero downtime and excellent performance. The upfront complexity pays dividends in control, performance, and cost savings.

Next Steps

If you're implementing this architecture:

  1. Start with Terraform modules for each component
  2. Implement the Docker optimization first (standalone mode)
  3. Set up CI/CD with stability checks
  4. Add CloudFront gradually (start with simple caching)
  5. Tune auto-scaling thresholds based on your traffic
  6. Monitor everything with CloudWatch and Server-Timing headers
  7. Iterate on cache policies based on real data

The code examples in this guide are production-tested and ready to adapt for your use case. Happy self-hosting!