SwiftDeploy: Building a Self-Writing Infrastructure Manager with Policy Enforcement — A Complete Technical Walkthrough

Abraham Acha · DEV Community

How I built a CLI tool that generates its own infrastructure configs, manages a full containerised stack, enforces deployment policies through OPA, exposes Prometheus metrics, and produces a live audit trail — all from a single YAML file.

In this post: The Problem We're Solving · Architecture Overview · The Manifest · The Python HTTP Service · The Dockerfile · The Templates · The CLI · The Two Deployment Modes · Prometheus Metrics · OPA Policy Engine · Gated Lifecycle · The Status Dashboard · The Audit Trail · The Debugging Sagas · Full Deployment Walkthrough · Key Lessons Learned

## The Problem We're Solving

Every time you spin up a new service in a real DevOps environment, you repeat the same manual work:

- Write an Nginx config
- Write a Docker Compose file
- Run Docker commands
- Check if things are healthy
- Hope nobody deploys when the disk is full
- Hope nobody promotes a canary that's throwing 60% errors

SwiftDeploy solves all of this. One YAML manifest describes your entire deployment. A CLI tool generates every config file from it, manages the container lifecycle, enforces safety policies before allowing deployments, exposes real-time metrics, and produces a full audit trail.

The golden rule: **the manifest is the single source of truth. Everything else is generated.**

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────────────────┐
│                  SwiftDeploy — Full System Architecture                   │
├─────────────────┬────────────────────────────────────┬───────────────────┤
│ ZONE 1          │ ZONE 2                             │ ZONE 3            │
│ Operator        │ Host Machine / Docker Engine       │ Generated Files   │
│                 │                                    │                   │
│ [Operator]      │ ┌── swiftdeploy-net (bridge) ────┐ │ nginx.conf        │
│      │          │ │ [nginx:8080]──────►[app:3000]  │ │ docker-compose    │
│      ▼          │ │   PUBLIC            INTERNAL   │ │ history.jsonl     │
│ manifest.yaml   │ │          └──[logs vol]         │ │ audit_report.md   │
│ (source of      │ │                                │ │                   │
│  truth)         │ │ [opa:8181]   localhost only    │ │                   │
│      │          │ │              NOT via nginx ✗   │ │                   │
│      ▼          │ └────────────────────────────────┘ │                   │
│ swiftdeploy CLI │                                    │                   │
│  ├─ init        │                                    │                   │
│  ├─ validate    │                                    │                   │
│  ├─ deploy ─────┼──► pre-deploy: OPA infra check     │                   │
│  ├─ promote ────┼──► pre-promote: OPA canary check   │                   │
│  ├─ status ─────┼──► scrapes /metrics every 5s ──────┼──► history.jsonl  │
│  ├─ audit ──────┼────────────────────────────────────┼──► audit_report.md│
│  └─ teardown    │                                    │                   │
└─────────────────┴────────────────────────────────────┴───────────────────┘
```

Stage 4A is the foundation. It answers one question: how do you build a tool that writes its own infrastructure configs?

```
swiftdeploy/
├── manifest.yaml                ← the ONLY file you edit
├── swiftdeploy                  ← CLI executable
├── Dockerfile                   ← app image definition
├── app/
│   └── main.py                  ← Python HTTP service
├── templates/
│   ├── nginx.conf.tmpl          ← nginx template
│   └── docker-compose.yml.tmpl  ← compose template
├── policies/                    ← Stage 4B addition
│   ├── infrastructure.rego
│   ├── canary.rego
│   └── data.json
├── nginx.conf                   ← generated (gitignored)
└── docker-compose.yml           ← generated (gitignored)
```

## The Manifest

`manifest.yaml` is the brain of the entire system. Every component reads from it directly or via generated files.

```yaml
services:
  image: swift-deploy-1-node:latest
  port: 3000
  mode: stable                 # stable or canary
  version: "1.0.0"
  restart_policy: unless-stopped
  log_volume: swiftdeploy-logs

nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30

opa:
  image: openpolicyagent/opa:latest-static
  port: 8181

network:
  name: swiftdeploy-net
  driver_type: bridge

contact: "[email protected]"
```

Every field propagates through the system. Change `nginx.proxy_timeout` here and it updates in `nginx.conf` on the next `init`.
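Under the hood, `init` is essentially "read the manifest, substitute its values into the `.tmpl` files, write the generated configs". A minimal sketch of that idea, assuming a `{{ dotted.path }}` placeholder syntax, a `render_template` helper, and a PyYAML dependency (all three are illustrative assumptions, not the project's actual template code):

```python
import re
import yaml  # assumption: a YAML loader such as PyYAML parses the manifest

def render_template(template_path: str, output_path: str, manifest: dict) -> None:
    """Replace {{ dotted.path }} placeholders with values looked up in the manifest."""
    with open(template_path) as f:
        text = f.read()

    def lookup(match: re.Match) -> str:
        # Walk the manifest dict along the dotted path, e.g. "nginx.port" -> 8080
        node = manifest
        for key in match.group(1).strip().split("."):
            node = node[key]
        return str(node)

    with open(output_path, "w") as f:
        f.write(re.sub(r"\{\{([^}]+)\}\}", lookup, text))

manifest = yaml.safe_load(open("manifest.yaml"))
render_template("templates/nginx.conf.tmpl", "nginx.conf", manifest)
render_template("templates/docker-compose.yml.tmpl", "docker-compose.yml", manifest)
```

Because the generated files are pure functions of the manifest, deleting them is harmless: the next `init` reproduces them byte for byte.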
Change `services.mode` and the entire deployment mode switches on the next `promote`.

## The Python HTTP Service

The app is a from-scratch HTTP server using only Python's stdlib — no Flask, no FastAPI. Three endpoints in Stage 4A, four in Stage 4B.

```python
MODE = os.environ.get("MODE", "stable")
APP_VERSION = os.environ.get("APP_VERSION", "1.0.0")
APP_PORT = int(os.environ.get("APP_PORT", "3000"))
START_TIME = time.time()
```

Configuration comes entirely from environment variables injected by Docker Compose at runtime. `START_TIME` is captured at module load — this is how `/healthz` calculates uptime without a database.

```python
chaos_lock = threading.Lock()
chaos_state = {"mode": None, "duration": None, "rate": None}

def get_chaos():
    with chaos_lock:
        return dict(chaos_state)  # returns a copy — callers can't mutate internal state

def set_chaos(state):
    with chaos_lock:
        chaos_state.update(state)
```

The `Lock` prevents race conditions when multiple requests read and write chaos state simultaneously. `dict(chaos_state)` returns a copy, so the caller never holds a reference to the mutable internal dict.

**GET / — welcome**

```python
self.send_json(200, {
    "message": "Welcome to SwiftDeploy API",
    "mode": MODE,
    "version": APP_VERSION,
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
})
```

**GET /healthz — liveness check**

```python
uptime = round(time.time() - START_TIME, 2)
self.send_json(200, {
    "status": "ok",
    "mode": MODE,
    "version": APP_VERSION,
    "uptime_seconds": uptime,
})
```

The `/healthz` endpoint does three jobs: it proves the server is alive (Docker healthcheck), reports the current mode (so `promote` can confirm the switch happened), and reports uptime (useful for debugging restart loops).

**POST /chaos — chaos injection (canary only)**

```python
if MODE != "canary":
    self.send_json(403, {"error": "chaos endpoint only available in canary mode"})
    return

length = int(self.headers.get("Content-Length", 0))
data = json.loads(self.rfile.read(length))
mode = data.get("mode")

if mode == "slow":
    set_chaos({"mode": "slow", "duration": data.get("duration", 2), "rate": None})
elif mode == "error":
    set_chaos({"mode": "error", "duration": None, "rate": data.get("rate", 0.5)})
elif mode == "recover":
    set_chaos({"mode": None, "duration": None, "rate": None})
```

Reading `Content-Length` before `rfile.read()` is mandatory HTTP protocol — otherwise `read()` blocks forever waiting for data that never arrives.

The chaos modes:

- **slow** — injects `time.sleep(N)` before responding, simulating a slow upstream
- **error** — uses `random.random()` against the configured rate to fail roughly that fraction of requests
- **recover** — clears the chaos state and returns the app to normal behaviour

## The Two Deployment Modes

Switching modes is the job of `./swiftdeploy promote`:

```python
# 1. Flip services.mode in manifest.yaml
content = re.sub(r"mode: \w+", f"mode: {target_mode}", content, count=1)
# 2. Regenerate docker-compose.yml with new MODE env var
# 3. Restart ONLY the app container — nginx stays up
run(compose_cmd("up -d --no-deps app"))
# 4. Confirm mode via /healthz
```

`--no-deps` is the key — it tells Compose to restart only the `app` service without touching nginx. Zero proxy downtime during the switch.

**Stable mode** — normal production behaviour. Clean responses. No special headers. `/chaos` returns 403.

**Canary mode** — test mode before full rollout:

```python
if MODE == "canary":
    self.send_header("X-Mode", "canary")  # on EVERY response
```

Canary mode:

- Adds `X-Mode: canary` to every response — callers can identify which mode they're hitting
- Unlocks `/chaos` — lets you simulate slow responses, random errors, then recover
- You promote with `./swiftdeploy promote canary`, stress test, then `./swiftdeploy promote stable` to roll back

Stage 4A built the engine. Stage 4B adds observability and policy enforcement. The stack now has eyes (metrics), a brain (OPA policies), and memory (audit trail).
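Concretely, `deploy` becomes a gated operation: the CLI gathers host facts, asks OPA for a decision, and only runs Compose when the policy allows it. A sketch of what the pre-deploy gate looks like, reusing the helpers (`query_opa`, `get_host_stats`, `run`, `compose_cmd`, `append_history`) that appear in the status command later; the function body itself is an illustration, not the project's verbatim code:

```python
def cmd_deploy(manifest):
    # Collect the facts the infrastructure policy needs (disk, CPU, memory)
    host_stats = get_host_stats()

    # Ask OPA for a decision document; the CLI never decides on its own
    decision, err = query_opa(manifest, "swiftdeploy.infrastructure", host_stats)
    if err or not decision or not decision.get("allow"):
        for reason in (decision or {}).get("reasons", []):
            print(f"  ✘ {reason}")
        print("Deploy blocked by infrastructure policy.")
        return 1

    # Policy passed: bring the stack up and record the event
    run(compose_cmd("up -d"))
    append_history({"event": "deploy", "mode": manifest["services"]["mode"]})
    return 0
```

`promote` follows the same shape, except the input document is the live error rate and P99 latency and the package is `swiftdeploy.canary`.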
## Prometheus Metrics

The app gains a `/metrics` endpoint exposing five metric types in Prometheus text format.

```python
metrics_lock = threading.Lock()

# Counter: {(method, path, status_code): count}
request_counts = {}

# Histogram state per path
HISTOGRAM_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
request_durations = {}

def record_request(method, path, status_code, duration_seconds):
    with metrics_lock:
        key = (method, path, str(status_code))
        request_counts[key] = request_counts.get(key, 0) + 1

        if path not in request_durations:
            request_durations[path] = {
                "buckets": {str(le): 0 for le in HISTOGRAM_BUCKETS},
                "sum": 0.0,
                "count": 0,
            }
        hist = request_durations[path]
        hist["sum"] += duration_seconds
        hist["count"] += 1
        for le in HISTOGRAM_BUCKETS:
            if duration_seconds <= le:
                hist["buckets"][str(le)] += 1  # buckets are cumulative, as Prometheus expects
```

Every request funnels through `record_request`, and `/metrics` renders these structures as Prometheus exposition text for the CLI to scrape.

## OPA Policy Engine

Policies live in `policies/` and run inside the OPA container. The infrastructure policy gates `deploy`:

```rego
package swiftdeploy.infrastructure
import future.keywords.if
import future.keywords.contains

default allow := false

allow if {
    disk_ok
    cpu_ok
    mem_ok
}

disk_ok if {
    input.disk_free_gb >= data.thresholds.min_disk_free_gb
}

cpu_ok if {
    input.cpu_load_1m <= data.thresholds.max_cpu_load_1m
}

mem_ok if {
    input.mem_free_percent >= data.thresholds.min_mem_free_percent
}

reasons contains msg if {
    not disk_ok
    msg := sprintf(
        "disk_free_gb is %.1f, minimum required is %.1f",
        [input.disk_free_gb, data.thresholds.min_disk_free_gb]
    )
}

# decision is what the CLI reads — never a bare boolean
decision := {
    "allow": allow,
    "reasons": reasons,
    "domain": "infrastructure",
    "checked": {
        "disk_free_gb": input.disk_free_gb,
        "cpu_load_1m": input.cpu_load_1m,
        "mem_free_percent": input.mem_free_percent,
    },
}
```

Why `import future.keywords`? The `openpolicyagent/opa:latest-static` image uses Rego v1, which requires explicit `if` and `contains` keywords. Without these imports, OPA crashes on startup. This was discovered the hard way during testing.

The canary policy gates `promote`:

```rego
package swiftdeploy.canary
import future.keywords.if
import future.keywords.contains

default allow := false

allow if {
    error_rate_ok
    latency_ok
}

error_rate_ok if {
    input.error_rate_percent <= data.thresholds.max_error_rate_percent
}

latency_ok if {
    input.p99_latency_ms <= data.thresholds.max_p99_latency_ms
}
```

## The Status Dashboard

The dashboard needs two numbers derived from the scraped metrics: an error rate and a P99 latency. The P99 calculation walks the cumulative histogram buckets and returns the first bucket bound that covers 99% of requests, converted to milliseconds (10 seconds is the fallback when nothing matches):

```python
    # inside calculate_p99_latency_ms: hist holds the parsed buckets for one path
    p99_threshold = 0.99 * hist["count"]
    for le in HISTOGRAM_BUCKETS:
        if hist["buckets"][str(le)] >= p99_threshold:
            return round(le * 1000, 2)  # seconds → milliseconds
    return 10000.0
```

P99 means: the smallest histogram bucket where 99% of requests have completed. If 99 out of 100 requests finished within 250ms, P99 = 250ms.

```python
def cmd_status():
    while True:
        raw = scrape_metrics(nginx_port)
        metrics = parse_prometheus(raw)
        error_rate = calculate_error_rate(metrics)
        p99_ms = calculate_p99_latency_ms(metrics)

        # Query OPA for live compliance
        infra_dec, _ = query_opa(manifest, "swiftdeploy.infrastructure", get_host_stats())
        canary_dec, _ = query_opa(manifest, "swiftdeploy.canary", {
            "error_rate_percent": error_rate,
            "p99_latency_ms": p99_ms,
            "window_seconds": 30,
        })

        os.system("clear")
        # ... render dashboard ...

        append_history({
            "event": "status_scrape",
            "error_rate_percent": error_rate,
            "p99_latency_ms": p99_ms,
            "mode": mode_str,
            "chaos": chaos_str,
            "policy_infra_pass": infra_dec.get("allow") if infra_dec else None,
            "policy_canary_pass": canary_dec.get("allow") if canary_dec else None,
        })
        time.sleep(5)
```

What the dashboard looks like with chaos active:

```
SwiftDeploy Status Dashboard          2026-05-05T21:00:37Z
────────────────────────────────────────────────
── Throughput ──────────────────────────────────
req/s        : 2.4
error rate   : 56.45%        ← red
P99 latency  : 5.0ms
── App State ───────────────────────────────────
mode         : canary
chaos        : error         ← red
uptime       : 316s
── Policy Compliance ───────────────────────────
✔ infrastructure   PASS
✘ canary           FAIL
  ! error_rate is 56.45%, maximum allowed is 1.00%

Refreshing every 5s — Ctrl+C to exit
```

This is exactly what real SRE dashboards do — they show you the current state AND whether it violates policy in real time.
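The `query_opa` helper that both the gated commands and the dashboard rely on is, at its core, a call to OPA's Data API: POST an input document to the policy's `decision` rule and read back the result. A minimal sketch under that assumption; the OPA port comes from the manifest, but this helper is an illustration, not the project's exact implementation:

```python
import json
import urllib.request

def query_opa(manifest, package, input_doc):
    """POST the input document to OPA's Data API and return (decision, error)."""
    # e.g. swiftdeploy.infrastructure -> /v1/data/swiftdeploy/infrastructure/decision
    url = (
        f"http://127.0.0.1:{manifest['opa']['port']}"
        f"/v1/data/{package.replace('.', '/')}/decision"
    )
    body = json.dumps({"input": input_doc}).encode()
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=3) as resp:
            return json.load(resp).get("result"), None
    except Exception as exc:  # OPA down, policy missing, malformed response, ...
        return None, str(exc)
```

The important property is that the CLI only ever reads the `decision` document (`allow`, `reasons`, `checked`) and never re-implements the thresholds itself.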
## The Audit Trail

Every event appends a JSON line to `history.jsonl`:

```json
{"timestamp":"2026-05-05T20:34:51Z","event":"deploy","mode":"stable"}
{"timestamp":"2026-05-05T20:55:22Z","event":"promote","target_mode":"canary"}
{"timestamp":"2026-05-05T20:55:23Z","event":"status_scrape","error_rate_percent":62.5,"chaos":"error","policy_canary_pass":false}
{"timestamp":"2026-05-05T21:01:01Z","event":"promote","target_mode":"stable"}
```

`swiftdeploy audit` reads this file and generates `audit_report.md`:

```markdown
## Mode Changes

| Timestamp | From | To |
|-----------|------|----|
| 2026-05-05T20:34:51Z | unknown | stable |
| 2026-05-05T20:55:22Z | stable | canary |
| 2026-05-05T21:01:01Z | canary | stable |

## Policy Violations

| Timestamp | Infrastructure | Canary | Error Rate | P99 |
|-----------|----------------|--------|------------|-----|
| 2026-05-05T20:55:23Z | ✔ PASS | ✘ FAIL | 62.5% | 5.0ms |
| 2026-05-05T20:55:28Z | ✔ PASS | ✘ FAIL | 63.6% | 5.0ms |
```

This report renders perfectly as GitHub Flavored Markdown — every table, every checkmark, every timestamp.

## The Debugging Sagas

No real DevOps project ships without war stories. Here are the ones that taught the most.

The app container kept showing `unhealthy` despite the server running fine. The debugging sequence:

- **Failure 1:** `${APP_PORT}` doesn't expand in a Dockerfile `HEALTHCHECK CMD`. Env vars evaluate at build time, not runtime. Fixed by hardcoding 3000.
- **Failure 2:** `localhost` doesn't resolve inside Alpine's healthcheck context on WSL2 + Docker Desktop. Fixed by using `127.0.0.1`.
- **Failure 3:** `wget` with `127.0.0.1` still failed. Confirmed the server WAS listening:

```bash
docker exec swiftdeploy-app ss -tlnp
# tcp LISTEN 0.0.0.0:3000

docker exec swiftdeploy-app python -c "import urllib.request; print(urllib.request.urlopen('http://127.0.0.1:3000/healthz').read())"
# b'{"status": "ok", ...}'   ← works via exec, not via healthcheck
```

  This is a known WSL2 + Docker Desktop network namespace issue. Fixed by using Python's `urllib` instead of `wget`.

- **Failure 4:** Docker's cache was serving the old image despite the Dockerfile fix. Fixed with `--no-cache`.
- **Failure 5:** The docker-compose.yml template had its own `healthcheck` block overriding the Dockerfile. The Compose healthcheck always wins. Fixed the template too.
- **Failure 6:** The healthcheck YAML block was indented with three spaces instead of four. A single space difference caused a YAML parse error. Fixed by carefully rewriting the block.

The `openpolicyagent/opa:latest-static` image enforces strict Rego v1 syntax. Our policies used the older syntax:

```rego
# OLD — crashes on latest OPA
allow {
    disk_ok
}

reasons[msg] {
    not disk_ok
    msg := "..."
}

# NEW — Rego v1 required syntax
allow if {
    disk_ok
}

reasons contains msg if {
    not disk_ok
    msg := "..."
}
```

Without `import future.keywords.if` and `import future.keywords.contains` at the top of each file, OPA refuses to start.

The project lived at `/mnt/c/Users/RAZER BLADE/Desktop/HNG/hng-swiftdeploy`. The space in `RAZER BLADE` caused `docker run -v {path}:...` to split the path at the space, making Docker interpret the second half as an image name:

```
docker: invalid reference format: repository name (Desktop/HNG/hng-swiftdeploy/nginx.conf) must be lowercase
```

Fixed by quoting all paths containing the project directory and using `subprocess.run` with a list instead of `shell=True` to avoid shell word-splitting entirely.
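The word-splitting fix in miniature: passing the command as a list means there is no shell involved, so the path containing a space reaches Docker as one intact argument. This is a sketch; `project_dir` and the exact `docker run` invocation are illustrative, not the CLI's verbatim code:

```python
import subprocess

project_dir = "/mnt/c/Users/RAZER BLADE/Desktop/HNG/hng-swiftdeploy"  # note the space

# No shell, no word-splitting: the -v argument stays a single string
subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{project_dir}/nginx.conf:/etc/nginx/nginx.conf:ro",
        "nginx:latest", "nginx", "-t",
    ],
    check=True,
)
```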
## Full Deployment Walkthrough

```bash
# 1. Build the image
docker build -t swift-deploy-1-node:latest .

# 2. Validate pre-flight checks
./swiftdeploy validate

# 3. Deploy (OPA policy check runs first)
./swiftdeploy deploy

# 4. Verify metrics
curl http://localhost:8080/metrics

# 5. Verify OPA isolation
curl http://127.0.0.1:8181/health      # works — internal
curl http://localhost:8080/v1/data     # 404 — nginx blocks it

# 6. Launch status dashboard
./swiftdeploy status

# 7. Promote to canary (OPA canary policy check runs first)
./swiftdeploy promote canary

# 8. Inject chaos
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "error", "rate": 0.5}'

# 9. Watch the status dashboard catch it — canary policy FAIL visible in real time

# 10. Recover
curl -X POST http://localhost:8080/chaos \
  -H "Content-Type: application/json" \
  -d '{"mode": "recover"}'

# 11. Promote back to stable
./swiftdeploy promote stable

# 12. Generate audit report
./swiftdeploy audit
cat audit_report.md

# 13. Teardown
./swiftdeploy teardown --clean
```

## Key Lessons Learned

1. **Docker Compose healthcheck overrides Dockerfile HEALTHCHECK.** Always check both places when healthchecks misbehave. Compose wins every time.
2. **WSL2 has a different network namespace for healthchecks than for `docker exec`.** If something works via exec but not via healthcheck, it's almost certainly a tooling or network namespace issue. Python's stdlib is more portable than `wget` in this environment.
3. **OPA Rego v1 requires explicit keywords.** `latest-static` means the latest OPA, which enforces Rego v1 syntax. Always import `future.keywords.if` and `future.keywords.contains`.
4. **`expose` vs `ports` is a security boundary, not documentation.** `expose` = container-to-container only. `ports` = host-facing. Binding OPA to 127.0.0.1 enforces isolation at the network level.
5. **The CLI should never make policy decisions.** Every time you add an if/else for a deployment condition in the CLI, you're doing OPA's job badly. Push all allow/deny logic into Rego. The CLI's job is to collect data and surface decisions.
6. **P99 latency is more useful than an average.** An average latency of 10ms can hide the fact that 1 in 100 requests takes 5 seconds. P99 exposes that tail. Always instrument histograms, not just averages.
7. **Declarative infrastructure pays off immediately.** The grader deletes generated files and re-runs `init`. Because the manifest is always there and regeneration is instantaneous, this is a non-issue. Manual configs would have been a problem.
8. **An audit trail is not optional.** `history.jsonl` made it trivial to answer "when did chaos start?", "which policy was failing?", and "how long was the canary running before we promoted?" These questions matter in production incidents.

SwiftDeploy started as a task requirement and became a complete mental model for how modern deployment tooling works. Every major concept is here:

- **Declarative infrastructure** — describe what you want, generate everything else
- **Immutable configs** — generated files are outputs, never inputs
- **Policy as code** — OPA enforces safety standards that can't be bypassed
- **Observability** — Prometheus metrics feed the dashboard and the policy engine
- **Audit trail** — every event recorded, every violation surfaced

The combination of Stage 4A and 4B forms a complete deployment lifecycle: generate → validate → deploy (gated) → promote (gated) → observe → audit → tear down.

The full source code is available at: https://github.com/AirFluke/hng-swiftdeploy

Tags: #devops #docker #nginx #python #opa #prometheus #infrastructure #hng