TL;DR: Is This Guide For You?
Use this architecture if:
- Serving 5M+ requests/month (cost savings justify complexity)
- Need full infrastructure control (compliance, custom optimizations)
- Have AWS/DevOps expertise on your team
- Want 60-70% cost savings vs Vercel at scale
- Need sub-120ms P99 response times globally
Stick with Vercel if:
- Under 5M requests/month (managed hosting is more cost-effective)
- Small team focused on features, not infrastructure
- Need preview deployments for every PR out-of-the-box
- Value convenience over cost optimization
What you'll learn:
- Complete Terraform infrastructure setup (VPC, ECS, CloudFront, WAF)
- Docker optimization techniques (800MB → 350MB images)
- Advanced CloudFront caching strategies (95%+ cache hit ratio)
- Zero-downtime CI/CD pipeline with automatic rollback
- Production troubleshooting and monitoring
Introduction: Why Self-Host Next.js?
Vercel provides an exceptional developer experience for Next.js applications, but at scale, you might find yourself questioning the economics. When you're serving millions of requests per month, the cost difference between Vercel and self-hosting can be substantial. Beyond cost, self-hosting gives you complete control over your infrastructure, better compliance options, and the ability to fine-tune every aspect of your stack.
But self-hosting Next.js properly is non-trivial. You need to handle:
- Container orchestration and scaling
- CDN integration and cache invalidation
- Zero-downtime deployments
- Security hardening
- Performance optimization
- Cost management
This guide walks through a production-grade architecture that addresses all of these challenges, optimized over months of real-world operation serving millions of requests daily.
Architecture Overview
Our architecture leverages AWS managed services to create a highly available, performant, and cost-effective Next.js hosting platform:
User Request
↓
CloudFront CDN (with WAF)
↓
Application Load Balancer (HTTPS)
↓
ECS Fargate Tasks (Auto-scaling)
↓
Next.js Application (Standalone Mode)
Key Components:
- VPC: Multi-AZ setup with public/private subnets
- ECS Fargate: Container orchestration without server management
- Application Load Balancer: HTTPS termination and health checking
- CloudFront: Global CDN with sophisticated caching policies
- WAF: Security layer with managed rules and bot protection
- S3: Build artifact storage and CloudFront logging
This architecture provides:
- 99.99% availability with multi-AZ deployment
- Global edge caching via CloudFront's 400+ locations
- Auto-scaling based on CPU, memory, and request count
- Zero-downtime deployments with circuit breaker rollback
- ~350MB Docker images (down from 800MB+)
- Sub-second P99 response times globally
Part 1: Infrastructure as Code with Terraform
VPC and Networking Setup
Start with a proper VPC foundation spanning multiple availability zones:
resource "aws_vpc" "main" {
cidr_block = var.environment == "production" ? "10.10.0.0/16" : "10.20.0.0/16"
tags = {
Name = "${var.app_name}-${var.environment}-vpc"
}
}
# Public subnets for ALB (adjust AZs based on your region)
resource "aws_subnet" "public_subnet_1" {
vpc_id = aws_vpc.main.id
cidr_block = var.environment == "production" ? "10.10.1.0/24" : "10.20.1.0/24"
availability_zone = "${var.aws_region}a"
map_public_ip_on_launch = true
}
resource "aws_subnet" "public_subnet_2" {
vpc_id = aws_vpc.main.id
cidr_block = var.environment == "production" ? "10.10.2.0/24" : "10.20.2.0/24"
availability_zone = "${var.aws_region}b"
map_public_ip_on_launch = true
}
# Private subnets for ECS tasks
resource "aws_subnet" "private_subnet_1" {
vpc_id = aws_vpc.main.id
cidr_block = var.environment == "production" ? "10.10.10.0/24" : "10.20.10.0/24"
availability_zone = "${var.aws_region}a"
}
resource "aws_subnet" "private_subnet_2" {
vpc_id = aws_vpc.main.id
cidr_block = var.environment == "production" ? "10.10.11.0/24" : "10.20.11.0/24"
availability_zone = "${var.aws_region}b"
}
Why this matters:
- Multi-AZ deployment: If one availability zone fails, your app stays up
- Public/private subnet separation: ALB in public, ECS tasks in private for security
- Different CIDR blocks per environment: Prevents IP conflicts if you ever need VPC peering
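One thing the snippet above leaves out: the public subnets also need an Internet Gateway and a route table pointing at it, or the ALB will never be reachable. A minimal sketch using the same naming scheme (resource names here are illustrative):
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "${var.app_name}-${var.environment}-igw"
}
}
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
}
resource "aws_route_table_association" "public_1" {
subnet_id = aws_subnet.public_subnet_1.id
route_table_id = aws_route_table.public.id
}
resource "aws_route_table_association" "public_2" {
subnet_id = aws_subnet.public_subnet_2.id
route_table_id = aws_route_table.public.id
}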
NAT Gateway for Private Subnet Internet Access
ECS tasks in private subnets need internet access for pulling Docker images and making external API calls:
resource "aws_eip" "nat" {
tags = {
Name = "${var.app_name}-${var.environment}-nat-eip"
}
}
resource "aws_nat_gateway" "main" {
allocation_id = aws_eip.nat.id
subnet_id = aws_subnet.public_subnet_1.id
}
resource "aws_route_table" "private" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
nat_gateway_id = aws_nat_gateway.main.id
}
}
Cost consideration: NAT Gateway costs ~$32/month plus data transfer. For production, this is essential. For development environments, you might consider placing tasks in public subnets to save costs.
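Note that the private route table above also has to be associated with the private subnets, or the ECS tasks will not actually route through the NAT Gateway. A short sketch, assuming the subnet and route table names used earlier:
resource "aws_route_table_association" "private_1" {
subnet_id = aws_subnet.private_subnet_1.id
route_table_id = aws_route_table.private.id
}
resource "aws_route_table_association" "private_2" {
subnet_id = aws_subnet.private_subnet_2.id
route_table_id = aws_route_table.private.id
}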
Security Groups: The CloudFront Challenge
Here's a gotcha: CloudFront publishes more IP ranges than a single security group can hold, because AWS security groups default to a limit of 60 inbound rules per group. The solution? Split the CloudFront ranges across multiple security groups:
data "aws_ip_ranges" "cloudfront" {
regions = ["global"]
services = ["cloudfront"]
}
# First 50 CloudFront IPs
resource "aws_security_group" "cloudfront_sg_1" {
name = "${var.app_name}-${var.environment}-cf-sg-1"
description = "Security group for CloudFront IP ranges (1-50)"
vpc_id = aws_vpc.main.id
dynamic "ingress" {
for_each = slice(data.aws_ip_ranges.cloudfront.cidr_blocks, 0, 50)
content {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = [ingress.value]
description = "Allow HTTPS from CloudFront"
}
}
}
# Remaining CloudFront IPs
resource "aws_security_group" "cloudfront_sg_2" {
name = "${var.app_name}-${var.environment}-cf-sg-2"
description = "Security group for CloudFront IP ranges (51+)"
vpc_id = aws_vpc.main.id
dynamic "ingress" {
for_each = slice(data.aws_ip_ranges.cloudfront.cidr_blocks, 50, length(data.aws_ip_ranges.cloudfront.cidr_blocks))
content {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = [ingress.value]
description = "Allow HTTPS from CloudFront"
}
}
}
# Main ALB security group (for office IPs, monitoring, etc.)
resource "aws_security_group" "alb_sg" {
name = "${var.app_name}-${var.environment}-alb-sg"
description = "Security group for ALB"
vpc_id = aws_vpc.main.id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["YOUR_OFFICE_IP/32"] # Optional: direct ALB access
description = "Allow HTTPS from office"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
# ECS tasks security group
resource "aws_security_group" "ecs_sg" {
name = "${var.app_name}-${var.environment}-ecs-sg"
description = "Security group for ECS tasks"
vpc_id = aws_vpc.main.id
ingress {
from_port = var.app_port # e.g., 3000 for Next.js, 8080 for others
to_port = var.app_port
protocol = "tcp"
security_groups = [
aws_security_group.alb_sg.id,
aws_security_group.cloudfront_sg_1.id,
aws_security_group.cloudfront_sg_2.id,
]
description = "Allow traffic from ALB"
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
Impact: Because the rules come from the aws_ip_ranges data source, your security groups pick up newly published CloudFront ranges on the next terraform apply. This prevents mysterious connection failures months later.
ECS Cluster with Auto-Scaling
ECS Fargate removes the need to manage EC2 instances:
resource "aws_ecs_cluster" "main" {
name = "ecs-${var.app_name}-${var.environment}-cluster"
configuration {
execute_command_configuration {
logging = "DEFAULT"
}
}
setting {
name = "containerInsights"
value = "enabled" # Essential for monitoring
}
}
resource "aws_ecs_service" "main" {
cluster = aws_ecs_cluster.main.arn
deployment_maximum_percent = 200
deployment_minimum_healthy_percent = 100
desired_count = var.environment == "production" ? var.prod_task_count : var.staging_task_count
health_check_grace_period_seconds = 120
name = "${var.app_name}-${var.environment}"
task_definition = aws_ecs_task_definition.main.arn
capacity_provider_strategy {
base = var.environment == "production" ? var.prod_task_count : var.staging_task_count
capacity_provider = var.environment == "production" ? "FARGATE" : "FARGATE_SPOT"
weight = 1
}
deployment_circuit_breaker {
enable = true
rollback = true # Auto-rollback on failed deployments
}
load_balancer {
container_name = "${var.app_name}-${var.environment}"
container_port = var.app_port
target_group_arn = aws_lb_target_group.main.arn
}
network_configuration {
assign_public_ip = false
security_groups = [aws_security_group.ecs_sg.id]
subnets = [aws_subnet.private_subnet_1.id, aws_subnet.private_subnet_2.id]
}
}
Key decisions:
- FARGATE_SPOT for staging: Save ~70% on compute costs for non-production
- Circuit breaker enabled: Automatically rolls back bad deployments
- 120s health check grace period: Gives Next.js time to start up and warm up
- 200% deployment max, 100% min: Allows full blue/green deployment
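The service references aws_ecs_task_definition.main, which isn't shown above. A minimal sketch of what it might look like; the IAM role names, CPU/memory sizing, and log group are illustrative assumptions, and the container name and port must match the service's load_balancer block:
resource "aws_ecs_task_definition" "main" {
family = "${var.app_name}-${var.environment}"
requires_compatibilities = ["FARGATE"]
network_mode = "awsvpc"
cpu = 1024 # 1 vCPU; size to your workload
memory = 2048 # MB
execution_role_arn = aws_iam_role.ecs_execution.arn # assumed role with ECR pull + CloudWatch Logs access
task_role_arn = aws_iam_role.ecs_task.arn # assumed role for runtime AWS access
container_definitions = jsonencode([
{
name = "${var.app_name}-${var.environment}" # must match container_name in the load_balancer block
image = var.docker_image
essential = true
portMappings = [{ containerPort = var.app_port, protocol = "tcp" }]
environment = [
{ name = "NODE_ENV", value = "production" },
{ name = "KEEP_ALIVE_TIMEOUT", value = "35000" }
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = "/ecs/${var.app_name}-${var.environment}" # assumes a matching aws_cloudwatch_log_group
"awslogs-region" = var.aws_region
"awslogs-stream-prefix" = "app"
}
}
stopTimeout = 30 # pairs with the 30s ALB deregistration delay so in-flight requests can drain
}
])
}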
Three-Dimensional Auto-Scaling
Scale on CPU, memory, AND request count for comprehensive coverage:
resource "aws_appautoscaling_target" "ecs_service" {
max_capacity = var.max_task_count
min_capacity = var.min_task_count
resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.main.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
# CPU-based scaling
resource "aws_appautoscaling_policy" "cpu_scaling" {
name = "cpu-scaling-policy"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs_service.resource_id
scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs_service.service_namespace
target_tracking_scaling_policy_configuration {
target_value = var.cpu_target_value # e.g., 75.0
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
scale_in_cooldown = var.scale_in_cooldown # e.g., 120
scale_out_cooldown = var.scale_out_cooldown # e.g., 30
}
}
# Memory-based scaling
resource "aws_appautoscaling_policy" "memory_scaling" {
name = "memory-scaling-policy"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs_service.resource_id
scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs_service.service_namespace
target_tracking_scaling_policy_configuration {
target_value = var.memory_target_value # e.g., 80.0
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageMemoryUtilization"
}
scale_in_cooldown = var.scale_in_cooldown
scale_out_cooldown = var.scale_out_cooldown
}
}
# Request count-based scaling
resource "aws_appautoscaling_policy" "request_count_scaling" {
name = "request-count-scaling-policy"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs_service.resource_id
scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs_service.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ALBRequestCountPerTarget"
resource_label = "${aws_lb.main.arn_suffix}/${aws_lb_target_group.main.arn_suffix}"
}
target_value = var.requests_per_target # e.g., 1000
scale_in_cooldown = var.scale_in_cooldown
scale_out_cooldown = var.scale_out_cooldown
}
}
Why three metrics?
- CPU: Catches compute-intensive operations (image processing, data transformation)
- Memory: Catches memory leaks or large data operations
- Request count: Proactively scales before CPU/memory spike from traffic surge
Production vs staging cooldowns:
- Production: Aggressive scale-out (e.g., 30s), conservative scale-in (e.g., 120s)
- Staging: Conservative on both to reduce churn and save costs
- Tune based on your traffic patterns and cost sensitivity
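One way to encode those per-environment cooldowns is a locals map keyed by environment; a sketch (the numbers are illustrative) that the three policies above can reference instead of the scale_in_cooldown/scale_out_cooldown variables:
locals {
# Aggressive scale-out and conservative scale-in for production; slower on both for staging
scaling_cooldowns = {
production = { scale_in = 120, scale_out = 30 }
staging = { scale_in = 300, scale_out = 120 }
}
scale_in_cooldown = local.scaling_cooldowns[var.environment].scale_in
scale_out_cooldown = local.scaling_cooldowns[var.environment].scale_out
}
Reference local.scale_in_cooldown and local.scale_out_cooldown from the scaling policies, or use the map to feed defaults into the existing variables.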
Application Load Balancer Configuration
resource "aws_lb" "main" {
name = "nextjs-ecs-${var.environment}"
internal = false
load_balancer_type = "application"
security_groups = [
aws_security_group.alb_sg.id,
aws_security_group.cloudfront_sg_1.id,
aws_security_group.cloudfront_sg_2.id,
]
subnets = [aws_subnet.public_subnet_1.id, aws_subnet.public_subnet_2.id]
enable_deletion_protection = false
enable_http2 = true
idle_timeout = 30 # Important for keepAlive tuning
access_logs {
bucket = "your-alb-logs-bucket"
enabled = true
prefix = "alb-logs/${var.environment}"
}
}
resource "aws_lb_target_group" "nextjs" {
name = "nextjs-${var.environment}"
port = 3000
protocol = "HTTP"
vpc_id = aws_vpc.main.id
target_type = "ip" # Required for Fargate
deregistration_delay = 30 # Drain connections for 30s before killing
health_check {
interval = 15 # Check every 15 seconds
path = var.health_check_path # e.g., "/health", "/api/status"
protocol = "HTTP"
healthy_threshold = 3 # Healthy after 3 successful checks
unhealthy_threshold = 3 # Unhealthy after 3 failed checks
timeout = 5
matcher = "200-299"
}
}
resource "aws_lb_listener" "https" {
load_balancer_arn = aws_lb.main.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-FIPS-2023-04"
certificate_arn = "YOUR_ACM_CERTIFICATE_ARN"
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.main.arn
}
}
Health check tuning:
- 15-second interval with threshold of 3 means unhealthy tasks are removed in 45s
- New tasks become healthy after 45s (3 successful checks)
- Balance between fast detection and avoiding false positives
- Adjust interval and thresholds based on your app startup time
Deregistration delay (30s):
- Allows in-flight requests to complete before terminating the task
- Must be less than the deployment health check grace period (120s)
Terraform Variables (variables.tf)
All the Terraform code above uses variables. Key ones you'll need:
variable "environment" { type = string } # production, staging
variable "app_name" { type = string }
variable "docker_image" { type = string }
variable "domain_name" { type = string }
variable "acm_certificate_arn" { type = string }
# ECS scaling
variable "min_task_count" { default = 2 }
variable "max_task_count" { default = 10 }
variable "cpu_target_value" { default = 75.0 }
variable "memory_target_value" { default = 80.0 }
Example values:
environment = "production"
app_name = "my-nextjs-app"
docker_image = "registry/app:prod-abc123"
domain_name = "example.com"
Part 2: Docker Optimization for Next.js
The Standalone Mode Revolution
Next.js 12.2+ introduced standalone mode, which dramatically reduces Docker image sizes:
// next.config.mjs (or next.config.js with "type": "module" in package.json)
export default {
output: 'standalone', // This is the magic line
swcMinify: true, // the default since Next.js 13; only needed on older versions
// ... rest of config
}
What happens:
- Next.js traces your dependencies and creates .next/standalone with only the required node_modules
- Before: 800MB+ Docker images with full node_modules
- After: 350-500MB Docker images with minimal dependencies
- Impact: Faster deploys, lower storage costs, quicker task startup
The Optimized Production Dockerfile
FROM node:lts-alpine
# Install dumb-init for proper signal handling
RUN apk add --no-cache dumb-init
WORKDIR /app
# Set production environment
ENV NODE_ENV=production
ENV NEXT_TELEMETRY_DISABLED=1
ENV HOSTNAME=0.0.0.0
# CRITICAL: Set keepAliveTimeout for ALB compatibility
# ALB idle timeout is 30s, so we set this HIGHER to ensure
# ALB closes connections before the app does
ENV KEEP_ALIVE_TIMEOUT=35000
# Copy standalone server (contains minimal node_modules)
COPY .next/standalone ./
# Copy static assets
COPY .next/static ./.next/static
COPY public ./public
# Copy .env.production for runtime environment variables
COPY .env.production ./.env.production
# Fix hostname to 0.0.0.0 (Docker overrides HOSTNAME env at runtime)
RUN sed -i "s/const hostname = process.env.HOSTNAME || '0.0.0.0'/const hostname = '0.0.0.0'/" server.js
# Set headersTimeout for ALB compatibility
# Inject after line 6 (const __dirname) using ES module import
RUN head -6 server.js > server.js.tmp && \
printf '\nimport http from '"'"'http'"'"';\n' >> server.js.tmp && \
printf 'const httpServer = http.Server.prototype;\n' >> server.js.tmp && \
printf 'const originalListen = httpServer.listen;\n' >> server.js.tmp && \
printf 'httpServer.listen = function(...args) {\n' >> server.js.tmp && \
printf ' const result = originalListen.apply(this, args);\n' >> server.js.tmp && \
printf ' this.headersTimeout = 36000;\n' >> server.js.tmp && \
printf ' return result;\n' >> server.js.tmp && \
printf '};\n\n' >> server.js.tmp && \
tail -n +7 server.js >> server.js.tmp && \
mv server.js.tmp server.js
EXPOSE 3000
# Health check for ECS (adjust port and path as needed)
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
CMD node -e "require('http').get('http://localhost:3000/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})"
# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "server.js"]
The ALB Timeout Problem (And Solution)
This is a critical issue that will cause mysterious 502 errors if not addressed:
The Problem:
- ALB has a configurable idle timeout (commonly 30-60 seconds, default 60s)
- Node.js default keepAliveTimeout is 5 seconds (too short!)
- If the backend keepAliveTimeout is SHORTER than ALB idle timeout, ALB will try to reuse closed connections
- This causes 502 errors when ALB sends requests on already-closed connections
The Solution:
- Set KEEP_ALIVE_TIMEOUT HIGHER than the ALB idle timeout (e.g., 35000ms for a 30s ALB timeout)
- Set headersTimeout slightly higher than keepAliveTimeout (e.g., 36000ms)
- This ensures ALB closes idle connections BEFORE the backend does
- The backend keeps connections open longer, preventing ALB from reusing closed connections
Impact:
- Before fix: Random 502 errors under load
- After fix: Zero timeout-related errors
Why the shell script patching?
- Next.js standalone server.js doesn't expose headersTimeout yet (as of late 2024)
- KEEP_ALIVE_TIMEOUT is officially supported via env var
- headersTimeout requires patching server.js
- This is a temporary workaround until Next.js adds official support
dumb-init: The Signal Handling Hero
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "server.js"]
Why dumb-init?
- Docker sends SIGTERM to PID 1 when stopping containers
- Node.js doesn't handle signals properly as PID 1
- Without dumb-init: 30s forced kill during deployments
- With dumb-init: Graceful shutdowns, in-flight requests complete
Impact: Zero dropped requests during deployments.
Part 3: CloudFront CDN Configuration
CloudFront is where the real performance magic happens. Our setup uses 5 different cache policies for different content types.
Cache Policy 1: HTML Pages (Aggressive Caching)
resource "aws_cloudfront_cache_policy" "html_cache_policy" {
name = "html-cache-policy-${var.environment}"
comment = "Cache policy for HTML pages - 1 year TTL since we invalidate on deploy"
default_ttl = 31536000 # 1 year
max_ttl = 31536000
min_ttl = 31536000 # Override origin max-age=0
parameters_in_cache_key_and_forwarded_to_origin {
enable_accept_encoding_brotli = true
enable_accept_encoding_gzip = true
headers_config {
header_behavior = "none" # Don't vary on geolocation for HTML
}
cookies_config {
cookie_behavior = "none"
}
query_strings_config {
query_string_behavior = "all" # UTM params, etc.
}
}
}
Why 1 year TTL for HTML?
- CloudFront caching is safe because we invalidate on every deploy
- Long TTL = high cache hit ratio = faster response times
- min_ttl = 31536000 overrides Next.js's Cache-Control: max-age=0
Impact:
- 95%+ cache hit ratio for HTML pages
- P50 response time: <50ms globally
- P99 response time: <120ms globally
Cache Policy 2: Static Assets (Immutable Content)
resource "aws_cloudfront_cache_policy" "static_assets_cache_policy" {
name = "static-assets-cache-policy-${var.environment}"
comment = "For /static/* (invalidated) and /_next/static/* (immutable)"
default_ttl = 31536000
max_ttl = 31536000
min_ttl = 0 # Allow immediate updates if needed
parameters_in_cache_key_and_forwarded_to_origin {
enable_accept_encoding_brotli = true
enable_accept_encoding_gzip = true
headers_config {
header_behavior = "none"
}
cookies_config {
cookie_behavior = "none"
}
query_strings_config {
query_string_behavior = "none"
}
}
}
Two types of static assets:
1. /_next/static/*: Content-addressed (contains BUILD_ID and hash)
- Files: /_next/static/[BUILD_ID]/[hash].js
- NEVER invalidate these
- Old clients need old bundles to prevent "chunk failed to load" errors
2. /static/*: Public folder assets
- Can be invalidated on deploy
- Use versioned filenames when possible
Cache Policy 3: Next.js Data Files
resource "aws_cloudfront_cache_policy" "nextjs_data_cache_policy" {
name = "nextjs-data-cache-policy-${var.environment}"
comment = "For /_next/data/* - invalidated on every deploy"
default_ttl = 31536000
max_ttl = 31536000
min_ttl = 0
parameters_in_cache_key_and_forwarded_to_origin {
enable_accept_encoding_brotli = true
enable_accept_encoding_gzip = true
query_strings_config {
query_string_behavior = "none"
}
}
}
What are data files?
- /_next/data/[BUILD_ID]/page.json
- Used by Next.js client-side navigation
- Must be invalidated on deploy for content updates
Cache Policy 4: Image Optimization
resource "aws_cloudfront_cache_policy" "image_cache_policy" {
name = "image-cache-policy-${var.environment}"
comment = "For Next.js image optimizer - NOT invalidated"
default_ttl = 31536000
max_ttl = 31536000
min_ttl = 31536000 # Override Next.js Cache-Control headers
parameters_in_cache_key_and_forwarded_to_origin {
enable_accept_encoding_brotli = true
enable_accept_encoding_gzip = true
headers_config {
header_behavior = "whitelist"
headers {
items = ["Accept"] # Distinguish WebP/AVIF vs JPEG
}
}
query_strings_config {
query_string_behavior = "all" # w, q, url parameters
}
}
}
Why vary on Accept header?
- Modern browsers send Accept: image/avif,image/webp,*/*
- Old browsers send Accept: */*
- Each format cached separately
Why override Cache-Control?
- Next.js Image Optimization sets short cache times
- We want aggressive CloudFront caching
- Images don't change (or use versioned URLs if they do)
Cache Policy 5: API Routes (No Cache)
resource "aws_cloudfront_cache_policy" "no_cache_policy" {
name = "no-cache-policy-${var.environment}"
comment = "Disable caching for API requests"
default_ttl = 0
max_ttl = 0
min_ttl = 0
parameters_in_cache_key_and_forwarded_to_origin {
headers_config {
header_behavior = "none"
}
cookies_config {
cookie_behavior = "none"
}
query_strings_config {
query_string_behavior = "none"
}
}
}
CloudFront Distribution with Ordered Cache Behaviors
resource "aws_cloudfront_distribution" "main" {
enabled = true
is_ipv6_enabled = true
http_version = "http2and3" # Enable HTTP/3
price_class = var.environment == "production" ? "PriceClass_All" : "PriceClass_100"
origin {
domain_name = var.environment == "production" ? "alb-prod.example.com" : "alb-staging.example.com"
origin_id = "origin-nextjs-${var.environment}"
origin_shield {
enabled = true
origin_shield_region = "${var.aws_region}"
}
custom_origin_config {
http_port = 80
https_port = 443
origin_protocol_policy = "https-only"
origin_ssl_protocols = ["TLSv1.2"]
origin_read_timeout = 45
origin_keepalive_timeout = 10
}
}
# Access logging for performance analysis
logging_config {
include_cookies = false
bucket = aws_s3_bucket.cloudfront_logs.bucket_domain_name
prefix = "cloudfront-logs/"
}
aliases = var.environment == "production" ? ["example.com"] : ["staging.example.com"]
# Default cache behavior for HTML pages
default_cache_behavior {
target_origin_id = "origin-nextjs-${var.environment}"
viewer_protocol_policy = "redirect-to-https"
allowed_methods = ["GET", "HEAD", "OPTIONS"]
cached_methods = ["GET", "HEAD"]
cache_policy_id = aws_cloudfront_cache_policy.html_cache_policy.id
compress = true
origin_request_policy_id = aws_cloudfront_origin_request_policy.geolocation_policy.id
}
# API routes (no cache)
ordered_cache_behavior {
path_pattern = "/api/*"
allowed_methods = ["GET", "HEAD", "OPTIONS", "POST", "PUT", "PATCH", "DELETE"]
cached_methods = ["GET", "HEAD"]
target_origin_id = "origin-nextjs-${var.environment}"
viewer_protocol_policy = "redirect-to-https"
cache_policy_id = aws_cloudfront_cache_policy.no_cache_policy.id
compress = true
origin_request_policy_id = aws_cloudfront_origin_request_policy.geolocation_policy.id
}
# Static assets
ordered_cache_behavior {
path_pattern = "/static/*"
allowed_methods = ["GET", "HEAD", "OPTIONS"]
cached_methods = ["GET", "HEAD", "OPTIONS"]
target_origin_id = "origin-nextjs-${var.environment}"
viewer_protocol_policy = "redirect-to-https"
cache_policy_id = aws_cloudfront_cache_policy.static_assets_cache_policy.id
compress = true
}
# Next.js immutable assets (NEVER invalidate)
ordered_cache_behavior {
path_pattern = "/_next/static/*"
allowed_methods = ["GET", "HEAD", "OPTIONS"]
cached_methods = ["GET", "HEAD", "OPTIONS"]
target_origin_id = "origin-nextjs-${var.environment}"
viewer_protocol_policy = "redirect-to-https"
cache_policy_id = aws_cloudfront_cache_policy.static_assets_cache_policy.id
compress = true
}
# Next.js data files (invalidate on deploy)
ordered_cache_behavior {
path_pattern = "/_next/data/*"
allowed_methods = ["GET", "HEAD", "OPTIONS"]
cached_methods = ["GET", "HEAD", "OPTIONS"]
target_origin_id = "origin-nextjs-${var.environment}"
viewer_protocol_policy = "redirect-to-https"
cache_policy_id = aws_cloudfront_cache_policy.nextjs_data_cache_policy.id
compress = true
}
# Images (aggressive caching)
ordered_cache_behavior {
path_pattern = "/_next/image*"
allowed_methods = ["GET", "HEAD", "OPTIONS"]
cached_methods = ["GET", "HEAD", "OPTIONS"]
target_origin_id = "origin-nextjs-${var.environment}"
viewer_protocol_policy = "redirect-to-https"
cache_policy_id = aws_cloudfront_cache_policy.image_cache_policy.id
compress = true
}
restrictions {
geo_restriction {
restriction_type = "blacklist"
locations = ["CN"] # Block China if needed
}
}
viewer_certificate {
acm_certificate_arn = "YOUR_ACM_CERTIFICATE_ARN"
ssl_support_method = "sni-only"
minimum_protocol_version = "TLSv1.2_2021"
}
web_acl_id = aws_wafv2_web_acl.main.arn
}
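The default and API cache behaviors above reference an origin request policy named geolocation_policy that isn't shown. A minimal sketch, assuming the intent is to forward CloudFront's viewer-location headers to the app (the exact header list is an assumption):
resource "aws_cloudfront_origin_request_policy" "geolocation_policy" {
name = "geolocation-policy-${var.environment}"
comment = "Forward CloudFront viewer-location headers to the Next.js origin"
headers_config {
header_behavior = "whitelist"
headers {
items = ["CloudFront-Viewer-Country"] # add more CloudFront-Viewer-* headers if the app uses them
}
}
cookies_config {
cookie_behavior = "all" # API routes may need cookies; the cache policies still exclude them from the cache key
}
query_strings_config {
query_string_behavior = "all"
}
}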
Origin Shield: Regional Caching Layer
origin_shield {
enabled = true
origin_shield_region = "${var.aws_region}"
}
What is Origin Shield?
- Additional caching layer between CloudFront edges and your origin
- All edge locations route cache misses through a single Origin Shield region first
- Origin Shield then fetches from your ALB if needed
Impact:
- Before: 400+ edge locations hitting your ALB
- After: a single Origin Shield region consolidates requests to your ALB
- Result: 80-90% reduction in origin requests
- Cost: ~$0.005/10,000 requests (~$10/month for 20M requests)
Server-Timing Headers for Performance Debugging
resource "aws_cloudfront_response_headers_policy" "security_headers_policy" {
name = "nextjs-security-headers-${var.environment}"
security_headers_config {
# ... security headers ...
}
server_timing_headers_config {
enabled = true
sampling_rate = 10 # 10% of requests
}
}
What you get:
Server-Timing: cdn-cache-miss;desc="cache miss"
Server-Timing: cdn-downstream-fbl;dur=50
Server-Timing: cdn-upstream-fbl;dur=100
Metrics provided:
- cdn-cache-hit or cdn-cache-miss: whether the request hit the cache
- cdn-downstream-fbl: first-byte latency from CloudFront to the client
- cdn-upstream-fbl: first-byte latency from the origin to CloudFront
Impact: Essential for diagnosing slow response times and cache hit ratios.
Part 4: WAF Configuration
resource "aws_wafv2_web_acl" "main" {
provider = aws.us_east_1 # Must be us-east-1 for CloudFront (global service)
name = "${var.app_name}-${var.environment}-waf"
scope = "CLOUDFRONT"
default_action {
dynamic "allow" {
for_each = var.environment == "production" ? [1] : []
content {}
}
dynamic "block" {
for_each = var.environment == "staging" ? [1] : []
content {}
}
}
# Staging: IP allowlist (office + CI/CD)
dynamic "rule" {
for_each = var.environment == "staging" ? [1] : []
content {
name = "StagingAllowedIPs"
priority = 1
action {
allow {}
}
statement {
ip_set_reference_statement {
arn = aws_wafv2_ip_set.staging_allowed_ips[0].arn
}
}
visibility_config {
cloudwatch_metrics_enabled = true
metric_name = "StagingAllowedIPs"
sampled_requests_enabled = true
}
}
}
# AWS Managed Rules
rule {
name = "AWSManagedRulesCommonRuleSet"
priority = 2
override_action {
none {}
}
statement {
managed_rule_group_statement {
name = "AWSManagedRulesCommonRuleSet"
vendor_name = "AWS"
}
}
visibility_config {
cloudwatch_metrics_enabled = true
metric_name = "AWSManagedRulesCommonRuleSet"
sampled_requests_enabled = true
}
}
rule {
name = "AWSManagedRulesSQLiRuleSet"
priority = 4
override_action {
none {}
}
statement {
managed_rule_group_statement {
name = "AWSManagedRulesSQLiRuleSet"
vendor_name = "AWS"
}
}
visibility_config {
cloudwatch_metrics_enabled = true
metric_name = "AWSManagedRulesSQLiRuleSet"
sampled_requests_enabled = true
}
}
# Bot Control with overrides
rule {
name = "AWSManagedRulesBotControlRuleSet"
priority = 6
override_action {
none {}
}
statement {
managed_rule_group_statement {
name = "AWSManagedRulesBotControlRuleSet"
vendor_name = "AWS"
managed_rule_group_configs {
aws_managed_rules_bot_control_rule_set {
inspection_level = "COMMON"
}
}
# Allow SEO bots, monitoring tools
rule_action_override {
name = "CategorySearchEngine"
action_to_use {
count {} # Monitor but don't block
}
}
rule_action_override {
name = "CategoryMonitoring"
action_to_use {
count {}
}
}
}
}
visibility_config {
cloudwatch_metrics_enabled = true
metric_name = "BotControl"
sampled_requests_enabled = true
}
}
visibility_config {
cloudwatch_metrics_enabled = true
metric_name = "nextjs-${var.environment}-waf"
sampled_requests_enabled = true
}
}
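The staging rule references aws_wafv2_ip_set.staging_allowed_ips[0], which isn't defined above. A minimal sketch (the CIDR blocks are placeholders):
resource "aws_wafv2_ip_set" "staging_allowed_ips" {
count = var.environment == "staging" ? 1 : 0
provider = aws.us_east_1
name = "${var.app_name}-staging-allowed-ips"
scope = "CLOUDFRONT"
ip_address_version = "IPV4"
addresses = [
"203.0.113.10/32", # office IP (placeholder)
"198.51.100.0/24", # CI/CD runner range (placeholder)
]
}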
WAF costs:
- Base: $5/month per web ACL
- Rules: $1/month per rule
- Requests: $0.60 per 1M requests
- Bot Control: $10/month + $1 per 1M requests
For 20M requests/month:
- Base + 5 rules: $10
- Requests: $12
- Bot Control: $30
- Total: ~$52/month
Impact: Blocks ~2-5% of malicious traffic automatically.
Part 5: CI/CD Pipeline Deep Dive
Build Optimization with S3 Artifacts
- step:
name: Build
script:
- npm install
- export NODE_ENV=production
- export NODE_OPTIONS="--max_old_space_size=49152" # 48GB heap
- npm run build
- export COMMIT_SHORT=$(echo $BITBUCKET_COMMIT | cut -c1-7)
# Compress and upload to S3
- tar czf next-build.tar.gz .next/standalone .next/static pages-json
- aws s3 cp next-build.tar.gz s3://${BUILD_ARTIFACTS_BUCKET}/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/next-build.tar.gz
artifacts:
- public/**
- package.json
Why S3 for artifacts?
- Build step runs on 16x instance with 128GB RAM
- Docker step runs on instance with Docker service
- S3 allows passing large artifacts (100MB+) between steps
- Artifacts are cleaned up after deploy
Why 48GB heap?
- Next.js build can use 4-8GB for large sites
- CI/CD 16x has 128GB RAM
- Setting max_old_space_size prevents OOM crashes
Docker Build Step
- step:
name: Dockerize
services:
- docker
script:
- export COMMIT_SHORT=$(echo $BITBUCKET_COMMIT | cut -c1-7)
# Download build artifacts
- aws s3 cp s3://build-artifacts/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/next-build.tar.gz .
- tar xzf next-build.tar.gz
# Build and push Docker image
- docker login -u ${container-registry_username} -p ${container-registry_password}
- docker build --no-cache -f ${ENVIRONMENT}.Dockerfile -t org/nextjs:${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER} .
- docker push org/nextjs:${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER}
# Clean up S3
- aws s3 rm --recursive s3://build-artifacts/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/
Why separate build and dockerize?
- Build needs lots of CPU/RAM
- Dockerize needs Docker service
- Can't run both on same step efficiently
- S3 is the glue
Terraform Apply
- step:
name: Terraform Apply
deployment: Production
script:
- curl -LO "https://releases.hashicorp.com/terraform/1.9.5/terraform_1.9.5_linux_amd64.zip"
- unzip terraform_1.9.5_linux_amd64.zip
- mv terraform /usr/local/bin/
- export COMMIT_SHORT=$(echo $BITBUCKET_COMMIT | cut -c1-7)
- terraform init -backend-config="key=state/${tf_aws_key}.tfstate" -backend-config="bucket=${tf_aws_bucket}"
- terraform plan -parallelism=10 \
-var="environment=${ENVIRONMENT}" \
-var="image=org/nextjs:${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER}" \
-out=terraform.tfplan
- terraform apply -parallelism=10 -auto-approve terraform.tfplan
parallelism=10: Create/update 10 resources at a time (default is 10, but being explicit helps)
ECS Stability Check (The Critical Step)
This bash script ensures deployment succeeded before invalidating CloudFront cache:
#!/bin/bash
set -e
ENVIRONMENT="${DEPLOYMENT_ENV}"
SERVICE_NAME="${REPO_NAME}"
REGION="${AWS_REGION}" # AWS region exported by the pipeline environment (e.g., us-east-1)
CLUSTER_NAME="ecs-nextjs-${ENVIRONMENT}-cluster"
SERVICE_NAME_FULL="${SERVICE_NAME}-${ENVIRONMENT}"
MAX_WAIT_SECONDS=1200 # 20 minutes
POLL_INTERVAL=30
check_circuit_breaker_rollback() {
local deployment_info=$(aws ecs describe-services \
--cluster "${CLUSTER_NAME}" \
--services "${SERVICE_NAME_FULL}" \
--region "${REGION}" \
--query 'services[0].deployments[0]' \
--output json)
local rollout_state=$(echo "$deployment_info" | jq -r '.rolloutState // "null"')
local rollout_reason=$(echo "$deployment_info" | jq -r '.rolloutStateReason // "null"')
if [ "$rollout_state" = "FAILED" ]; then
echo "❌ Deployment FAILED - rollout state: $rollout_state"
echo "❌ Rollout reason: $rollout_reason"
return 1
fi
if echo "$rollout_reason" | grep -i "circuit breaker" > /dev/null; then
echo "❌ Circuit breaker triggered rollback: $rollout_reason"
return 1
fi
return 0
}
check_expected_image() {
local expected_image_tag="${ENVIRONMENT}-${COMMIT_SHORT}-${BUILD_NUMBER}"
local current_image=$(aws ecs describe-services \
--cluster "${CLUSTER_NAME}" \
--services "${SERVICE_NAME_FULL}" \
--region "${REGION}" \
--query 'services[0].taskDefinition' \
--output text)
local task_image=$(aws ecs describe-task-definition \
--task-definition "$current_image" \
--region "${REGION}" \
--query 'taskDefinition.containerDefinitions[0].image' \
--output text)
if echo "$task_image" | grep "$expected_image_tag" > /dev/null; then
echo "✅ Expected image is deployed: $task_image"
return 0
else
echo "⚠️ Current image ($task_image) doesn't match expected ($expected_image_tag)"
return 1
fi
}
# Initial check
if ! check_circuit_breaker_rollback; then
echo "❌ Deployment already failed - aborting cache invalidation"
exit 1
fi
# Wait for stability
start_time=$(date +%s)
while true; do
current_time=$(date +%s)
elapsed=$((current_time - start_time))
if [ $elapsed -ge $MAX_WAIT_SECONDS ]; then
echo "❌ Timeout reached (${MAX_WAIT_SECONDS}s)"
exit 1
fi
# Check for rollback
if ! check_circuit_breaker_rollback; then
echo "❌ Deployment was rolled back"
exit 1
fi
# Check if stable
if timeout 60 aws ecs wait services-stable \
--cluster "${CLUSTER_NAME}" \
--services "${SERVICE_NAME_FULL}" \
--region "${REGION}" 2>/dev/null; then
echo "✅ ECS service is now stable"
break
else
echo "⏳ Still waiting for stability... (${elapsed}s elapsed)"
sleep $POLL_INTERVAL
fi
done
# Final verification
RUNNING_COUNT=$(aws ecs describe-services \
--cluster "${CLUSTER_NAME}" \
--services "${SERVICE_NAME_FULL}" \
--region "${REGION}" \
--query 'services[0].runningCount' \
--output text)
DESIRED_COUNT=$(aws ecs describe-services \
--cluster "${CLUSTER_NAME}" \
--services "${SERVICE_NAME_FULL}" \
--region "${REGION}" \
--query 'services[0].desiredCount' \
--output text)
if [ "$RUNNING_COUNT" != "$DESIRED_COUNT" ]; then
echo "❌ Running count ($RUNNING_COUNT) doesn't match desired ($DESIRED_COUNT)"
exit 1
fi
if ! check_expected_image; then
echo "❌ Expected image is not deployed"
exit 1
fi
echo "✅ All checks passed - ready for CloudFront invalidation"
Why this is critical:
- Circuit breaker might rollback due to failed health checks
- If we invalidate CloudFront cache before rollback, users see errors
- This script verifies deployment succeeded before invalidating cache
CloudFront Invalidation Strategy
#!/bin/bash
set -e
output_file="root_urls.txt"
> "$output_file"
# CRITICAL: Never invalidate /_next/static/* (immutable assets)
# This prevents "chunk failed to load" errors
echo "/_next/data/*" >> "$output_file"
echo "/static/*" >> "$output_file"
# Extract URLs from sitemaps
for sm in public/sitemap*.xml; do
grep '<loc>' "$sm" | \
sed -n 's|.*<loc>https://example.com\(/[^/]*\).*|\1|p' | \
sort -u | \
while read -r line; do
if [ "$line" = "/" ]; then
echo "$line" >> "$output_file"
else
echo "$line" >> "$output_file"
echo "${line}/" >> "$output_file"
echo "${line}/*" >> "$output_file"
fi
done
done
echo "/sitemap.xml" >> "$output_file"
echo "/robots.txt" >> "$output_file"
echo "/404" >> "$output_file"
Why three variations (path, path/, path/*):
- CloudFront caches /about, /about/, and /about/index.html separately
- Invalidating all three ensures consistent behavior
Batch invalidation with retry:
DISTRIBUTION_ID="YOUR_DISTRIBUTION_ID"
BATCH_SIZE=10
# Load the paths produced by the sitemap script above into an array
mapfile -t URIS < root_urls.txt
TOTAL_BATCHES=$(((${#URIS[@]} + BATCH_SIZE - 1) / BATCH_SIZE))
invalidate_batch() {
local batch=("$@")
for attempt in {1..5}; do
local wait_time=$((2 ** (attempt - 1) * 5)) # Exponential backoff
if output=$(aws cloudfront create-invalidation --distribution-id "$DISTRIBUTION_ID" --paths "${batch[@]}" 2>&1); then
echo "✅ Successfully invalidated batch on attempt $attempt"
return 0
else
echo "⚠️ Error on attempt $attempt. Waiting ${wait_time}s before retry"
sleep $wait_time
fi
done
echo "❌ Failed to invalidate batch after 5 attempts"
return 1
}
for ((i=0; i<${#URIS[@]}; i+=BATCH_SIZE)); do
batch=("${URIS[@]:i:BATCH_SIZE}")
invalidate_batch "${batch[@]}"
sleep 10 # Rate limiting between batches
done
Why small batches (10)?
- CloudFront has rate limits
- Small batches = more reliable
- Can retry individual batches without losing all progress
Cache Warmup
- step:
name: Warm Cache
atlassian-ip-ranges: true # Whitelist CI/CD IPs
script:
- cd cache-warmup
- python3 -m venv .venv
- .venv/bin/pip install -r requirements.txt
- .venv/bin/python cache-warmup.py \
--mode pages \
--site-url https://example.com \
--sitemap ../public/sitemap.xml \
--concurrent 50 \
--rps 100
What this does:
- Reads sitemap.xml
- Sends HEAD requests to all pages
- Configurable concurrency and rate limiting
- Warms CloudFront edge caches globally
Impact:
- First user request after deploy: Cache hit (warm)
- Without warmup: Cache miss (cold) = slower
Adjust concurrency/RPS based on your infrastructure capacity.
Part 6: Cost Optimization
Fargate Spot for Staging
capacity_provider_strategy {
base = var.environment == "production" ? 5 : 2
capacity_provider = var.environment == "production" ? "FARGATE" : "FARGATE_SPOT"
weight = 1
}
Fargate Spot savings:
- Regular Fargate: ~$0.04/vCPU/hour + ~$0.004/GB/hour (varies by region)
- Fargate Spot: ~70% cheaper
- For staging: Can reduce compute costs from ~$100/month to ~$30/month
When to use Spot:
- Staging, development environments
- Fault-tolerant workloads
- When you can handle occasional interruptions
When NOT to use Spot:
- Production (user-facing)
- When interruptions are unacceptable
CloudFront Price Class Optimization
price_class = var.environment == "production" ? "PriceClass_All" : "PriceClass_100"
Price classes:
- PriceClass_100: North America, Europe (~$0.085/GB)
- PriceClass_200: Above + Asia, Africa, Middle East (~$0.100/GB)
- PriceClass_All: All edge locations (~$0.120/GB)
For staging with PriceClass_100:
- Serves only the North America and Europe edge locations
- Saves ~30% on data transfer
- Perfectly fine for internal testing
S3 Lifecycle Policies
resource "aws_s3_bucket_lifecycle_configuration" "cloudfront_logs" {
bucket = aws_s3_bucket.cloudfront_logs.id
rule {
id = "delete-old-logs"
status = "Enabled"
expiration {
days = 15
}
}
}
Impact:
- CloudFront logs: ~10-50MB/day
- Without lifecycle: Grows indefinitely
- With 15-day retention: Cap at ~750MB
- Savings: Minimal cost, prevents bloat
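The lifecycle rule assumes the aws_s3_bucket.cloudfront_logs bucket already exists. If you create it in Terraform, keep in mind that CloudFront standard logging delivers objects via ACLs, so the bucket cannot use the default "bucket owner enforced" ownership setting. A sketch (the bucket name is illustrative):
resource "aws_s3_bucket" "cloudfront_logs" {
bucket = "${var.app_name}-${var.environment}-cf-logs" # must be globally unique
}
resource "aws_s3_bucket_ownership_controls" "cloudfront_logs" {
bucket = aws_s3_bucket.cloudfront_logs.id
rule {
object_ownership = "BucketOwnerPreferred" # anything except BucketOwnerEnforced, so ACLs still work
}
}
resource "aws_s3_bucket_acl" "cloudfront_logs" {
depends_on = [aws_s3_bucket_ownership_controls.cloudfront_logs]
bucket = aws_s3_bucket.cloudfront_logs.id
acl = "log-delivery-write"
}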
Resource Cleanup in Pipeline
# Clean up S3 artifacts after Docker push
aws s3 rm --recursive s3://build-artifacts/${ENVIRONMENT}/${COMMIT_SHORT}-${BUILD_NUMBER}/
Impact:
- Build artifacts: 100-500MB per build
- Without cleanup: 100 builds/month adds roughly 10-50GB of new artifacts every month, billed at $0.023/GB/month and compounding indefinitely
- With cleanup: $0
Part 7: Performance Metrics
After implementing this architecture, here are real-world performance metrics:
CloudFront Cache Hit Ratio
- HTML pages: 95-98%
- Static assets: 99%+
- Images: 98%+
- Overall: 96%+
What this means:
- Only 4% of requests hit origin
- 96% served from edge in <50ms
Response Time Distribution
P50 (Median):
- Cache hit: 30-50ms
- Cache miss: 150-300ms
P95:
- Cache hit: 80-120ms
- Cache miss: 400-600ms
P99:
- Cache hit: 50-120ms
- Cache miss: 800-1200ms
Deployment Metrics
- Build time: 5-8 minutes
- Docker build: 2-3 minutes
- Terraform apply: 1-2 minutes
- ECS rollout: 3-5 minutes
- Cache invalidation: 2-3 minutes
- Total: 13-21 minutes
Zero-downtime deployments: 100% success rate with circuit breaker
Cost Breakdown (Example: 20M requests/month)
| Service | Estimated Cost/Month |
|---|---|
| ECS Fargate (varies by task size/count) | $150-250 |
| Application Load Balancer | $20-30 |
| NAT Gateway | $30-40 |
| CloudFront (varies by data transfer) | $150-200 |
| WAF | $50-70 |
| S3 (logs) | $5-10 |
| CloudWatch Logs | $10-20 |
| Total | $415-620/month |
Comparable Vercel cost: $2000-3500/month for similar traffic
Note: Actual costs vary based on task sizing, data transfer, and traffic patterns.
Part 8: Lessons Learned & Best Practices
What Worked Extremely Well
1. Standalone Mode
- Single biggest win: 800MB → 350MB images
- Faster deployments, lower storage costs
- Enable it immediately
2. Circuit Breaker
- Saved us from multiple bad deployments
- Auto-rollback prevented user impact
- Always enable it with rollback = true
3. Multiple CloudFront Cache Policies
- Different content needs different caching
- HTML, static assets, images, API all optimized separately
- 95%+ cache hit ratio
4. Never Invalidate /_next/static/*
- Content-addressed assets are immutable
- Old clients need old bundles
- Prevents "chunk failed to load" errors
5. ECS Stability Check Before Cache Invalidation
- Prevents invalidating cache for failed deployments
- Catches rollbacks before users see errors
What We'd Do Differently
1. Start with CloudWatch Container Insights Enabled
- We added this later
- Essential for debugging performance issues
- Enable from day one
2. Implement Server-Timing Headers Earlier
- Took months to add
- Would have helped diagnose issues faster
- 10% sampling is plenty
3. Use Terraform Modules from the Start
- Our main.tf is 1500+ lines
- Should have split into modules earlier
- Network, ECS, CloudFront, WAF modules (see the sketch after this list)
4. Add CloudFront Logging from Day One
- Added later for performance analysis
- Invaluable for debugging
- S3 costs are minimal
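As referenced in the modules note above, a sketch of what the split might look like; the module names, inputs, and outputs are illustrative:
# Root main.tf becomes a thin composition layer
module "network" {
source = "./modules/network"
app_name = var.app_name
environment = var.environment
aws_region = var.aws_region
}
module "waf" {
source = "./modules/waf"
app_name = var.app_name
environment = var.environment
providers = { aws = aws.us_east_1 }
}
module "ecs" {
source = "./modules/ecs"
app_name = var.app_name
environment = var.environment
docker_image = var.docker_image
private_subnet_ids = module.network.private_subnet_ids
public_subnet_ids = module.network.public_subnet_ids
}
module "cloudfront" {
source = "./modules/cloudfront"
environment = var.environment
origin_domain_name = module.ecs.alb_dns_name
web_acl_arn = module.waf.web_acl_arn
}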
Common Pitfalls to Avoid
1. ALB Timeout Issues
- Symptom: Random 502 errors
- Cause: keepAliveTimeout < ALB idle timeout (backend closes connections before ALB)
- Solution: Set keepAliveTimeout HIGHER than ALB idle timeout (e.g., 35s for 30s ALB, or 65s for 60s ALB)
2. Invalidating /_next/static/*
- Symptom: "ChunkLoadError" for users
- Cause: Old clients request old bundles, but they're invalidated
- Solution: NEVER invalidate content-addressed assets
3. Not Checking Circuit Breaker Before Invalidation
- Symptom: Users see errors after "successful" deploy
- Cause: Deployment rolled back, but cache was invalidated
- Solution: Check ECS deployment status before invalidation
4. CloudFront Security Group Limits
- Symptom: "Security group rule limit exceeded"
- Cause: CloudFront has 60+ IP ranges, SG has 60-rule limit
- Solution: Split across multiple security groups
5. Not Setting Health Check Grace Period
- Symptom: Tasks marked unhealthy during startup
- Cause: Health checks start before app is ready
- Solution: Set grace period to 120s for Next.js
When to Use This Architecture
Use this when:
- You're serving 5M+ requests/month (cost savings justify complexity)
- You need full control over infrastructure
- You have compliance requirements (SOC2, HIPAA, etc.)
- You want to fine-tune every aspect of performance
- Your team has AWS/DevOps expertise
Stick with Vercel when:
- You're under 5M requests/month
- You want zero infrastructure management
- You need preview deployments for every PR
- Your team is small and focused on features
- You value convenience over cost optimization
Part 9: Security Best Practices
Environment-Specific WAF Rules
default_action {
dynamic "allow" {
for_each = var.environment == "production" ? [1] : []
content {}
}
dynamic "block" {
for_each = var.environment == "staging" ? [1] : []
content {}
}
}
Production: open by default, filtered by managed rules.
Staging: blocked by default, with an IP allowlist for office and CI/CD.
Security Headers via CloudFront
security_headers_config {
content_type_options {
override = true
}
frame_options {
frame_option = "DENY"
override = true
}
referrer_policy {
referrer_policy = "strict-origin-when-cross-origin"
override = true
}
xss_protection {
mode_block = true
protection = true
override = true
}
strict_transport_security {
access_control_max_age_sec = 31536000
include_subdomains = true
preload = true
override = true
}
}
Impact: A+ rating on securityheaders.com
Secrets Management
Never put secrets in:
- Dockerfile
- Environment variables in task definition
- Terraform files
Use:
- AWS Secrets Manager for sensitive values
- IAM roles for AWS service access
- Environment variables in .env.production (not committed to git)
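For the Secrets Manager route, ECS can inject secret values at task start via the secrets key of the container definition. A sketch, assuming a JSON secret named after the app (the secret name and key are illustrative):
data "aws_secretsmanager_secret" "app" {
name = "${var.app_name}/${var.environment}" # hypothetical secret holding a JSON blob of app secrets
}
locals {
# Merge this into the container definition's "secrets" key; ECS injects the value
# as an environment variable at task start, so it never lives in the task definition JSON
container_secrets = [
{
name = "DATABASE_URL"
valueFrom = "${data.aws_secretsmanager_secret.app.arn}:DATABASE_URL::"
}
]
}
# The task execution role also needs secretsmanager:GetSecretValue on that secret's ARN.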
Part 10: Monitoring & Alerts
Set up CloudWatch alarms for production. Example alarm for high CPU:
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
alarm_name = "${var.app_name}-ecs-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
threshold = 85
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ServiceName = aws_ecs_service.main.name
ClusterName = aws_ecs_cluster.main.name
}
}
Set up similar alarms for:
- ECS memory utilization (> 85%)
- ALB 5xx errors (> 10 per 5min)
- ALB response time (> 2s)
- CloudFront error rate (> 1%)
- CloudFront cache hit ratio (< 85%)
- ECS task count (< minimum)
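As an example from that list, the ALB 5xx alarm can follow the same pattern as the CPU alarm above; a sketch (threshold and period are starting points to tune):
resource "aws_cloudwatch_metric_alarm" "alb_5xx_high" {
alarm_name = "${var.app_name}-alb-5xx-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 1
period = 300
statistic = "Sum"
metric_name = "HTTPCode_Target_5XX_Count"
namespace = "AWS/ApplicationELB"
threshold = 10
treat_missing_data = "notBreaching" # no 5xx datapoints means healthy
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
LoadBalancer = aws_lb.main.arn_suffix
TargetGroup = aws_lb_target_group.main.arn_suffix
}
}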
SNS Topic for Alerts
resource "aws_sns_topic" "alerts" {
name = "${var.app_name}-alerts"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alerts.arn
protocol = "email"
endpoint = "devops@example.com"
}
Impact: Alarms detect issues in 5-10 minutes. Add a CloudWatch Dashboard to visualize ECS CPU/memory, ALB response times, and CloudFront cache metrics.
Part 11: Troubleshooting Guide
Issue 1: ECS Tasks Fail to Start
Common causes:
- Docker image pull errors (check ECS service events for "CannotPullContainerError")
- Insufficient CPU/memory allocation
- Health checks failing too quickly
Fix:
- Verify image exists and task execution role has ECR permissions
- Increase health_check_grace_period_seconds (e.g., to 180)
- Check NAT Gateway is working for private subnets
Issue 2: 502 Bad Gateway Errors
Most common cause: ALB timeout mismatch
Fix:
- Set KEEP_ALIVE_TIMEOUT higher than the ALB idle_timeout (e.g., 35s for a 30s ALB, 65s for a 60s ALB)
- Set headersTimeout slightly higher than keepAliveTimeout
- Check Container Insights for memory issues (increase task memory if needed)
Issue 3: CloudFront Serving Stale Content
Common causes:
- Expecting /_next/static/* files to update in place (they are content-addressed and get new filenames on each deploy; NEVER invalidate them)
- Missing path variations in invalidation (include path, path/, path/*)
Fix:
- Only invalidate HTML pages and /_next/data/*
- Wait for invalidation to complete before testing
- Clear browser cache when debugging
Part 12: Health Check Implementation
Create a simple health check endpoint for ALB:
// pages/api/health.ts
import type { NextApiRequest, NextApiResponse } from 'next';

export default function handler(req: NextApiRequest, res: NextApiResponse) {
res.status(200).json({
status: 'ok',
timestamp: new Date().toISOString()
});
}
Configure in Terraform:
health_check {
path = "/api/health"
interval = 15
timeout = 5
healthy_threshold = 3
unhealthy_threshold = 3
}
Keep it simple: Health check should respond in <100ms. For advanced monitoring (database checks, external APIs), create a separate /api/health/detailed endpoint.
Conclusion
Self-hosting Next.js on AWS is non-trivial, but with the right architecture, you can achieve:
- 99.99% uptime with multi-AZ deployment
- Sub-120ms P99 response times globally via CloudFront
- 60-70% cost savings vs Vercel at scale
- Zero-downtime deployments with circuit breaker protection
- Enterprise-grade security with WAF and managed rules
The key principles:
- Use Next.js standalone mode for minimal Docker images
- Aggressive CloudFront caching with smart invalidation
- Never invalidate content-addressed assets (/_next/static/*)
- Verify deployment success before cache invalidation
- Auto-scaling on multiple metrics (CPU, memory, requests)
- Circuit breaker for automatic rollback
- Origin Shield to reduce origin load
- Environment-specific optimizations (Spot for staging, different auto-scaling)
This architecture has served millions of requests daily for months with zero downtime and excellent performance. The upfront complexity pays dividends in control, performance, and cost savings.
Next Steps
If you're implementing this architecture:
- Start with Terraform modules for each component
- Implement the Docker optimization first (standalone mode)
- Set up CI/CD with stability checks
- Add CloudFront gradually (start with simple caching)
- Tune auto-scaling thresholds based on your traffic
- Monitor everything with CloudWatch and Server-Timing headers
- Iterate on cache policies based on real data
The code examples in this guide are production-tested and ready to adapt for your use case. Happy self-hosting!