Reference architecture. This repo is a portfolio piece.
terraform applyis wired but intentionally gated behind manualworkflow_dispatch— there is no live AWS account behind it. The CI pipeline (lint, validate, security scan) runs fully credential-less. To deploy it yourself, see ADR-005.
A production-grade AWS platform running a multi-tier e-commerce backend. Built to demonstrate the infrastructure patterns I reach for on day one: immutable infrastructure, GitOps delivery, defense-in-depth security, and operability as a first-class concern.
Simulated workload: 10,000 concurrent users, Black Friday traffic spikes (10×), sub-100ms API p99, 99.95% monthly uptime SLO.
graph TB
subgraph Internet
U[Users] --> CF[CloudFront]
CF --> ALB[Application Load Balancer]
end
subgraph VPC ["VPC (10.0.0.0/16) — 3 AZs"]
subgraph Public ["Public Subnets"]
ALB
NAT[NAT Gateways ×3]
end
subgraph Private ["Private Subnets — EKS Nodes"]
direction TB
EKS[EKS Cluster]
EKS --> API[Go API Pods]
EKS --> WEB[React Static Pods]
EKS --> PROM[Prometheus Stack]
end
subgraph Data ["Database Subnets — no route to internet"]
RDS[(Aurora PostgreSQL\nMulti-AZ)]
CACHE[(ElastiCache\nRedis cluster)]
end
end
subgraph AWS ["AWS Services"]
ECR[ECR — container images]
SM[Secrets Manager]
CW[CloudWatch Logs]
S3TF[S3 — Terraform state]
DDB[DynamoDB — state lock]
end
ALB --> API
ALB --> WEB
API --> RDS
API --> CACHE
API --> SM
EKS --> ECR
EKS --> CW
| Layer | Technology | Why |
|---|---|---|
| IaC | Terraform 1.9 | Mature ecosystem, provider breadth, team familiarity (ADR-002) |
| Compute | EKS 1.30 (managed node groups) | Kubernetes portability, better bin-packing than ECS (ADR-001) |
| Database | Aurora PostgreSQL 15 | Multi-AZ failover ~30s, storage auto-scaling, compatible with RDS Proxy |
| Networking | Custom VPC, 3-tier subnet design | Blast-radius isolation; database subnets have zero internet route |
| Ingress | AWS Load Balancer Controller + ALB | Native integration with WAF, ACM, and target group deregistration |
| Observability | kube-prometheus-stack | Prometheus + Grafana + Alertmanager for metrics and alerting; sub-second query latency vs. CloudWatch Metrics Insights. Application logs are stdout/stderr; centralized log aggregation is deferred (ADR-003) |
| CI/CD | GitHub Actions + OIDC | No static AWS keys anywhere; per-job short-lived credentials |
| Secret management | AWS Secrets Manager + IRSA | Pod-level IAM, no secret injection via env vars |
| Container registry | ECR with image scanning | Integrated vuln scanning; lifecycle policies keep costs down |
| CDN | CloudFront | Edge caching for static assets; WAF rules at the edge |
# Prerequisites: AWS CLI, Terraform 1.9+, kubectl, helm, go 1.22+, node 20+
make dev-up # Bootstrap dev environment end-to-end
make dev-down # Tear it all down
make plan ENV=staging # Preview staging changes
make apply ENV=prod # Apply to prod (requires manual approval in CI)- Validates AWS credentials and required tools
- Creates S3 backend bucket + DynamoDB lock table for dev
- Runs
terraform init && terraform apply -auto-approveinterraform/environments/dev - Builds and pushes Docker images to ECR
- Deploys kube-prometheus-stack via Helm
- Deploys the application via Helm
- Prints the ALB DNS name and Grafana URL
This models the infrastructure for an e-commerce backend serving ~10,000 concurrent users with these characteristics:
- Read-heavy: product catalog reads 95%, writes 5%
- Spiky: daily 8am–10pm traffic, flash sales at unpredictable times
- Latency-sensitive: checkout API must be < 200ms p99 or revenue drops
- Data-sensitive: PCI scope for payment metadata; GDPR for EU users
The Go API simulates catalog browsing, cart management, and order submission. Postgres handles orders and user accounts. Redis (not wired up in this scaffold) handles session state and catalog caching.
NAT Gateway per AZ vs. single NAT: Higher cost (~$100/mo extra) in exchange for no cross-AZ NAT traffic during an AZ failure. For a latency-sensitive checkout flow, cross-AZ hops during a partial outage are unacceptable.
Aurora over RDS single-instance: Aurora's shared storage architecture means failover is ~30s (vs. ~60–120s for standard Multi-AZ RDS). For checkout, that's the difference between a user retry and an abandoned cart.
kube-prometheus-stack over CloudWatch-only: CloudWatch Metrics Insights queries are slow (15s+) for incident response. Prometheus queries are sub-second. We alert entirely from Prometheus rather than logs — metric-based alerts are faster and lower-cardinality than log-based ones. Application logs are stdout/stderr (kubectl logs); a centralized log aggregator (Loki, CloudWatch Container Insights, or an OTLP collector) is the obvious next addition when log-based correlation becomes a frequent need. Full reasoning in ADR-003.
Spot instances for non-critical node groups: ~70% cost reduction for batch/observability workloads. API pods run exclusively on on-demand nodes with PodDisruptionBudgets. Spot interruptions are handled by the Node Termination Handler.
No service mesh (yet): Istio/Linkerd adds operational complexity that isn't justified until we need mutual TLS between every pod or weighted traffic splitting. We get mTLS at the ALB→pod boundary via ACM and plan to revisit at 50+ services.
RDS secret auto-rotation deferred: Auto-rotation requires a VPC-resident Lambda that can reach both the Aurora endpoint and the Secrets Manager API. Shipping a placeholder ZIP that doesn't exist on disk causes terraform apply to fail and misrepresents the security posture. Rotation is manual via the rotation runbook until the AWS SAR rotation template is wired in.
| Scenario | Detection | Automated Response | Manual Steps | RTO |
|---|---|---|---|---|
| EKS node failure | Node NotReady alert → PagerDuty | Cluster Autoscaler replaces node; pods reschedule | Verify replacement node healthy | < 5 min |
| AZ outage | CloudWatch AZ health + Prometheus node count | ALB health checks drain; EKS reschedules to remaining AZs | None for stateless; validate RDS failover | < 10 min |
| Aurora primary failure | RDS Events → SNS | Aurora auto-promotes replica; DNS CNAME flips | Update Secrets Manager endpoint if using static DNS | ~30s DB, < 5 min app |
| Ingress degraded (ALB 5xx spike) | Prometheus alert IngressHighErrorRate |
None — requires investigation | See runbook | 15 min |
| Bad deploy | Deployment rollout stalled alert | None automatic | kubectl rollout undo deployment/api or Helm rollback |
< 5 min |
| Secrets leak | GuardDuty + manual | None | Rotate in Secrets Manager; pods pick up on next restart via IRSA | < 30 min |
Runbooks: node failure · RDS failover · ingress degraded
The current design hits predictable ceilings. Here's the upgrade path, in order of ROI:
- Add Redis caching in front of Postgres (catalog reads drop from DB to cache; ~10× read throughput immediately)
- RDS Proxy (connection pooling; EKS pods × goroutines × 20 connections = DB exhaustion without it at scale)
- Read replicas for Aurora (route all GET queries to reader endpoint; writer handles only mutations)
- Horizontal pod autoscaling on API (CPU-based HPA already wired: min 2 / max 20 replicas, scales up at 70% average CPU utilization — adds 2 pods per 60s after a 30s stabilization window, removes 1 pod per 60s behind a 5-minute scale-down window to prevent flapping)
- CloudFront for API responses (cache product catalog at edge; TTL=60s is fine for catalog data)
- Multi-region active-passive (Route 53 health check failover to a second region; data replication via Aurora Global Database)
- Event-driven order processing (SQS → Lambda/EKS consumers; decouple checkout from fulfillment pipeline)
What doesn't scale: the current single Aurora writer for high-write workloads. At 1M users with significant write volume, we'd need to evaluate Vitess (MySQL sharding) or partition the data model (orders to Aurora, catalog to DynamoDB, sessions to ElastiCache).
.
├── .github/
│ └── workflows/ # CI/CD: terraform-plan, terraform-apply, app-build, tflint-and-checkov
├── app/
│ ├── api/ # Go HTTP API — health endpoint, structured logging, Dockerfile
│ ├── db/ # PostgreSQL migrations (Flyway-compatible naming)
│ └── web/ # React frontend — nginx-served static build, Dockerfile
├── diagrams/ # Mermaid source (.mmd) + rendered PNG architecture diagrams
├── docs/
│ ├── adr/ # Architecture Decision Records — ADR-001 through ADR-004
│ └── runbooks/ # Incident response playbooks (node, RDS failover, ingress, secret rotation)
├── kubernetes/
│ ├── app/ # Helm values for the application chart
│ └── monitoring/ # kube-prometheus-stack Helm values (Prometheus, Grafana, Alertmanager)
├── terraform/
│ ├── environments/ # Per-environment root modules: dev / staging / prod
│ └── modules/ # Reusable modules: vpc, eks, rds, monitoring
├── ARCHITECTURE.md # Narrative architecture deep-dive
├── CHANGELOG.md # Keep a Changelog format release history
├── CODEOWNERS # GitHub code-review ownership assignments
├── CONTRIBUTING.md # Branch strategy, commit conventions, local checks, PR checklist
├── LICENSE # MIT License
├── Makefile # Developer shortcuts: dev-up, dev-down, plan, apply, lint, checkov
└── SECURITY.md # Vulnerability disclosure policy and security design principles
- No static AWS credentials anywhere — GitHub Actions uses OIDC; pods use IRSA
- Secrets Manager for all secrets — no plaintext in Terraform state, env vars, or ConfigMaps
- Private subnets for all compute — EKS nodes and RDS have no public IPs
- Database subnets are air-gapped — no NAT, no internet route, security group allows only EKS node SG
- IMDSv2 enforced on all EC2 nodes — hop limit=1 blocks SSRF-based metadata theft
- tflint + Checkov run on every PR — policy violations block merge
- ECR image scanning on push — critical CVEs alert before deployment
Branch → PR → CI must pass → one approval → squash merge.
Terraform apply is manual — there is no real AWS account backing this portfolio repo, so terraform-apply.yml is triggered exclusively via workflow_dispatch (Actions → Terraform Apply → Run workflow). This keeps the lint/plan badges green without attempting OIDC auth against a non-existent account. When wiring this to a real account, set AWS_ACCOUNT_ID and AWS_ACCOUNT_PROD_ID as repository variables and the workflow is ready to go.
