Docker Swarm Migration with Live VM Testing
Designed and tested a complete Docker Swarm deployment architecture from scratch: secrets management, multi-branch staging, CI/CD integration, and zero-downtime deploys, all verified on a real OrbStack VM.
The starting point was a simple problem with a complex solution: I was running GAIA with docker compose up and plaintext .env files on a physical Ubuntu machine. That’s fine for development but not for a production system handling real users. I wanted Swarm mode, proper secrets management, multi-branch staging environments where develop/feature-1/feature-2 could all be deployed simultaneously, GitHub Actions CI/CD, and zero-downtime deploys. Eight hours and 293 messages later, all of that was running on a real VM.
Session one: architecture
The agent’s first move was to nail down the topology before writing a single file. I had one physical machine and wanted to deploy multiple branches. The question was how to share infrastructure without having branches interfere with each other.
The design it landed on had two layers. A single “shared services” stack per VM ran MongoDB, PostgreSQL, Redis, Loki, Prometheus, and Grafana — these are expensive to run multiple times and don’t need isolation at the container level, just at the database level. Database isolation happened by naming convention: the develop branch used databases named gaia_develop_*, feature branches used gaia_feat1_*. Each branch then got its own “application stack” deploying only the backend, frontend, Nginx proxy, and bots — the things that actually change per-branch. Credentials went into Docker Swarm secrets instead of .env files, eliminating the entire class of “plaintext secret committed to git” incidents. GitHub Actions would build images on push, push them to GHCR, SSH into the VM, and run docker stack deploy with the branch-appropriate compose file. Traefik would route develop.gaia.app, feat1.gaia.app, etc. to their respective application stacks with automatic TLS.
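A minimal sketch of what one per-branch application stack could look like under this design (the file name, image path, secret name, and network name are illustrative, not taken from the actual repo):

```yaml
# stack.develop.yml -- application stack for the develop branch (names illustrative)
version: "3.8"

services:
  backend:
    image: ghcr.io/example/gaia-backend:develop   # built and pushed by CI
    environment:
      MONGO_DB: gaia_develop_main                 # isolation by naming convention
      POSTGRES_DB: gaia_develop_core
    secrets:
      - mongo_password                            # appears at /run/secrets/mongo_password
    networks:
      - shared                                    # joins the shared-services overlay
    deploy:
      replicas: 1
      labels:
        # Traefik in Swarm mode reads routing rules from deploy labels
        - traefik.http.routers.gaia-develop.rule=Host(`develop.gaia.app`)
        - traefik.http.services.gaia-develop.loadbalancer.server.port=8000

secrets:
  mongo_password:
    external: true        # created once with `docker secret create`

networks:
  shared:
    external: true        # overlay owned by the shared-services stack
    name: gaia-shared
```

A feature-branch stack would be the same file with the branch name, database prefix, and hostname swapped, which is what keeps the per-branch deploys cheap.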
Session two: real failures on a real VM
The second session was where the architecture met reality. The agent created an OrbStack VM with amd64 architecture — my development machine is Apple Silicon, and cross-architecture failures are exactly the kind of thing that only shows up when you actually deploy. It built all Docker images with --platform linux/amd64, transferred them to the VM, loaded them, initialized a Swarm cluster, created all secrets via docker secret create, and attempted the first full stack deployment.
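In outline, the bring-up ran commands along these lines (image names, the SSH host alias, and the secret variable are illustrative; the real session involved far more steps):

```shell
# Build for the VM's architecture, not the Apple Silicon host's
docker build --platform linux/amd64 -t gaia-backend:develop ./backend

# Ship the image to the VM without a registry (hypothetical host alias)
docker save gaia-backend:develop | ssh gaia-vm 'docker load'

# On the VM: single-node Swarm, secrets from stdin, then the stack
docker swarm init
printf '%s' "$MONGO_PASSWORD" | docker secret create mongo_password -
docker stack deploy -c stack.develop.yml gaia-develop
```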
Three things broke immediately.
LangGraph threw an ImportError: cannot import name 'ExecutionInfo' from 'langgraph.types'. The version of LangGraph installed on the amd64 VM was slightly behind my local environment: requirements.txt hadn't pinned it exactly, and the ExecutionInfo type had moved to a different submodule in a minor release. The agent identified the exact version mismatch, pinned langgraph==0.2.28, and rebuilt the image.
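The fix itself was a one-line change; a sketch of the relevant requirements.txt line (the version is from the session, the comment is mine):

```
# requirements.txt
langgraph==0.2.28   # exact pin: ExecutionInfo moved submodules between minor releases
```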
Promtail went into a crash loop. The config had __path__: /var/lib/docker/containers/*/*.log which works in compose mode but not in Swarm mode — Swarm uses a different logging driver and Docker container logs are no longer written to that path. The agent switched Promtail to the Docker socket approach: mount /var/run/docker.sock and use the docker_sd_configs discovery method instead of static file paths.
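The working shape of that Promtail config, roughly (relabeling trimmed; treat this as a sketch of the docker_sd_configs approach rather than the exact file):

```yaml
scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock   # requires mounting the socket into Promtail
        refresh_interval: 5s
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: container
```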
Nginx couldn’t resolve upstream service names on first boot and was returning 502s for about fifteen seconds after deployment. Docker Swarm’s internal DNS takes a few seconds to register new service names after docker stack deploy completes. The fix was two directives in the Nginx config: resolver 127.0.0.11 valid=5s, plus set $upstream http://gaia-backend:8000 so the upstream is a variable rather than a static upstream block. Nginx re-resolves the variable at request time instead of caching the DNS failure at startup.
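Sketched out, with the backend service name and port from the session (the rest of the server block trimmed):

```nginx
server {
    listen 80;

    # Re-resolve Swarm's internal DNS instead of caching one lookup at startup
    resolver 127.0.0.11 valid=5s;

    location / {
        # Using a variable forces nginx to resolve the name at request time,
        # so a service not yet registered fails softly instead of permanently.
        set $upstream http://gaia-backend:8000;
        proxy_pass $upstream;
    }
}
```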
Zero-downtime
Once the stack was stable, I asked about zero-downtime deploys. The current workflow ran docker stack rm followed by docker stack deploy — full downtime. The agent designed a blue/green strategy: the running stack is gaia-blue, the new deploy targets gaia-green. The GitHub Actions workflow deploys green, runs a health check loop hitting http://gaia-green-backend:8000/health until it gets three consecutive 200s, then updates the Nginx upstream config to point to green and reloads Nginx with nginx -s reload (a zero-downtime config reload). If the health check never passes within two minutes, the workflow rolls back automatically: blue keeps running and green is torn down. During the brief switchover window — the few hundred milliseconds between the Nginx reload and full routing — any in-flight requests get a 503 maintenance response with a Retry-After: 2 header, so BetterStack monitoring doesn’t generate false-alarm incidents.
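The health gate at the heart of that workflow can be sketched in bash. The function name, the 2-second poll interval, and the budget parameter are mine; the three-consecutive-200s rule and the two-minute default budget are from the workflow described above. In CI the check command would be something like curl -fsS http://gaia-green-backend:8000/health:

```shell
#!/usr/bin/env bash
# wait_healthy CHECK_CMD [BUDGET_SECONDS]
# Succeeds once CHECK_CMD passes 3 times in a row; fails when the budget
# (default 120s) runs out, which is the signal to keep blue and tear down green.
wait_healthy() {
  local check="$1" budget="${2:-120}" ok=0
  local deadline=$((SECONDS + budget))
  while (( SECONDS < deadline )); do
    if $check; then
      ok=$((ok + 1))
      (( ok >= 3 )) && return 0
    else
      ok=0            # require *consecutive* successes, reset on any failure
    fi
    sleep 2
  done
  return 1            # budget exhausted: caller leaves blue running
}
```

Only after this returns success does the workflow flip the Nginx upstream to green, so a deploy that never becomes healthy never receives traffic.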