Diffstat (limited to 'Omni/Deploy/PLAN.md')
| -rw-r--r-- | Omni/Deploy/PLAN.md | 299 |
1 files changed, 299 insertions, 0 deletions
diff --git a/Omni/Deploy/PLAN.md b/Omni/Deploy/PLAN.md
new file mode 100644
index 0000000..1870ebd
--- /dev/null
+++ b/Omni/Deploy/PLAN.md
@@ -0,0 +1,299 @@
+# Mini-PaaS Deployment System
+
+## Overview
+
+A pull-based deployment system that allows deploying Nix-built services without full NixOS rebuilds. Services are defined in a manifest, pulled from an S3 binary cache, and managed as systemd units with Caddy for reverse proxying.
+
+## Problem Statement
+
+Current deployment (`push.sh` + full NixOS rebuild) is slow and heavyweight:
+- Every service change requires rebuilding the entire NixOS configuration
+- Adding a new service requires modifying Biz.nix and doing a full rebuild
+- Deploy time from "code ready" to "running in prod" is too long
+
+## Goals
+
+1. **Fast deploys**: Update a single service in <5 minutes without touching others
+2. **Independent services**: Deploy services without NixOS rebuild
+3. **Add services dynamically**: New services via manifest, no NixOS changes needed
+4. **Maintain NixOS for base OS**: Keep NixOS for infra (Postgres, SSH, firewall)
+5. **Clear scale-up path**: Single host now, easy migration to Nomad later
+
+## Key Design Decisions
+
+1. **Nix closures, not Docker**: Deploy Nix store paths directly, not containers. Simpler, no Docker daemon needed. Use systemd hardening for isolation.
+
+2. **Pull-based, not push-based**: Target host polls S3 for manifest changes every 5 min. No SSH needed for deploys, just update manifest.
+
+3. **Caddy, not nginx**: Caddy has admin API for dynamic route updates and automatic HTTPS. No config file regeneration needed.
+
+4. **Separation of concerns**:
+   - `bild`: Build tool, adds `--cache` flag to sign+push closures
+   - `push.sh`: Deploy orchestrator, handles both NixOS and service deploys
+   - `deployer`: Runs on target, polls manifest, manages services
+
+5. **Out-of-band secrets**: Secrets stored in `/var/lib/biz-secrets/*.env`, manifest only references paths. No secrets in S3.
+
+6. **Nix profiles for rollback**: Each service gets a Nix profile, enabling `nix-env --rollback`.
+
+## Relevant Existing Files
+
+- `Omni/Bild.hs` - Build tool, modify to add `--cache` flag
+- `Omni/Bild.nix` - Nix build library, has `bild.run` for building packages
+- `Omni/Ide/push.sh` - Current deploy script, enhance for service deploys
+- `Biz.nix` - Current NixOS config for biz host
+- `Biz/Packages.nix` - Builds all Biz packages
+- `Biz/PodcastItLater/Web.nix` - Example NixOS service module (to be replaced)
+- `Biz/PodcastItLater/Web.py` - Example Python service (deploy target)
+- `Omni/Os/Base.nix` - Base NixOS config, add S3 substituter here
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                                 DEV MACHINE                                 │
+│                                                                             │
+│  ┌─────────────────────────────────────────────────────────────────────┐    │
+│  │  push.sh <target>                                                   │    │
+│  │                                                                     │    │
+│  │  if target.nix: (NixOS deploy - existing behavior)                  │    │
+│  │    bild <target>                                                    │    │
+│  │    nix copy --to ssh://host                                         │    │
+│  │    ssh host switch-to-configuration                                 │    │
+│  │                                                                     │    │
+│  │  else: (Service deploy - new behavior)                              │    │
+│  │    bild <target> --cache ──▶ sign + push closure to S3              │    │
+│  │    update manifest.json in S3 with new storePath                    │    │
+│  │    (deployer on target will pick up changes)                        │    │
+│  └─────────────────────────────────────────────────────────────────────┘    │
+│                                                                             │
+│  Separation of concerns:                                                    │
+│  - bild: Build + sign + push to S3 cache (--cache flag)                     │
+│  - push.sh: Orchestrates deploy, updates manifest, handles both modes       │
+└─────────────────────────────────────────────────────────────────────────────┘
+                                       │
+                                       ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                    DO SPACES (S3 BINARY CACHE) - PRIVATE                    │
+│                                                                             │
+│  /nar/*.nar.xz       ← Compressed Nix store paths                           │
+│  /*.narinfo          ← Metadata + signatures                                │
+│  /nix-cache-info     ← Cache metadata                                       │
+│  /manifest.json      ← Current deployment state                             │
+│  /manifests/         ← Historical manifests for rollback                    │
+│    manifest-<ts>.json                                                       │
+│                                                                             │
+│  Authentication: AWS credentials (Spaces access key)                        │
+│  - Dev machine: write access for pushing                                    │
+│  - Target host: read access for pulling                                     │
+└─────────────────────────────────────────────────────────────────────────────┘
+                                       │
+                               poll every 5 min
+                                       ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                              TARGET HOST (biz)                              │
+│                                                                             │
+│  ┌──────────────────────────────────────────────────────────────────────┐   │
+│  │                             biz-deployer                             │   │
+│  │         (Python systemd service, runs every 5 min via timer)         │   │
+│  │                                                                      │   │
+│  │  1. Fetch manifest.json from S3                                      │   │
+│  │  2. Compare to local state                                           │   │
+│  │  3. For changed services:                                            │   │
+│  │     - nix copy --from s3://... <storePath>                           │   │
+│  │     - Generate systemd unit file                                     │   │
+│  │     - Create GC root                                                 │   │
+│  │     - systemctl daemon-reload && restart                             │   │
+│  │  4. Update Caddy routes via API                                      │   │
+│  │  5. Save local state                                                 │   │
+│  └──────────────────────────────────────────────────────────────────────┘   │
+│                                                                             │
+│  Directories:                                                               │
+│  - /var/lib/biz-deployer/services/*.service (generated units)               │
+│  - /var/lib/biz-deployer/state.json         (local state)                   │
+│  - /var/lib/biz-secrets/*.env               (secret env files)              │
+│  - /nix/var/nix/gcroots/biz/*               (GC roots)                      │
+│                                                                             │
+│  NixOS manages:                                                             │
+│  - Base OS, SSH, firewall                                                   │
+│  - Caddy with admin API enabled                                             │
+│  - PostgreSQL, Redis (infra services)                                       │
+│  - biz-deployer service itself                                              │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+## Components
+
+### 1. S3 Binary Cache (DO Spaces)
+
+**Bucket**: `omni-nix-cache` (private)
+**Region**: `nyc3` (or nearest)
+
+**Credentials**:
+- Dev machine: `~/.aws/credentials` with `[digitalocean]` profile
+- Target host: `/root/.aws/credentials` with same profile
+
+**Signing key**:
+- Generate: `nix-store --generate-binary-cache-key omni-cache cache-priv-key.pem cache-pub-key.pem`
+- Private key: `~/.config/nix/cache-priv-key.pem` (dev machine only)
+- Public key: Added to target's `nix.settings.trusted-public-keys`
+
+**S3 URL format**:
+```
+s3://omni-nix-cache?profile=digitalocean&scheme=https&endpoint=nyc3.digitaloceanspaces.com
+```
+
+### 2. Manifest Schema (v1)
+
+```json
+{
+  "version": 1,
+  "generation": "2025-01-15T12:34:56Z",
+  "services": [
+    {
+      "name": "podcastitlater-web",
+      "artifact": {
+        "type": "nix-closure",
+        "storePath": "/nix/store/abc123-podcastitlater-web-1.2.3"
+      },
+      "hosts": ["biz"],
+      "exec": {
+        "command": "podcastitlater-web",
+        "user": "pil-web",
+        "group": "pil"
+      },
+      "env": {
+        "PORT": "8000",
+        "AREA": "Live",
+        "DATA_DIR": "/var/podcastitlater",
+        "BASE_URL": "https://podcastitlater.com"
+      },
+      "envFile": "/var/lib/biz-secrets/podcastitlater-web.env",
+      "http": {
+        "domain": "podcastitlater.com",
+        "path": "/",
+        "internalPort": 8000
+      },
+      "systemd": {
+        "after": ["network-online.target", "postgresql.service"],
+        "requires": [],
+        "restart": "on-failure",
+        "restartSec": 5
+      },
+      "hardening": {
+        "dynamicUser": false,
+        "privateTmp": true,
+        "protectSystem": "strict",
+        "protectHome": true
+      },
+      "revision": "abc123def"
+    }
+  ]
+}
+```
+
+### 3. Deployer Service (Omni/Deploy/Deployer.py)
+
+Python service that:
+- Polls manifest from S3
+- Pulls Nix closures
+- Generates systemd units
+- Updates Caddy via API
+- Manages GC roots
+- Tracks local state
+
+### 4. NixOS Module (Omni/Deploy/Deployer.nix)
+
+Configures:
+- biz-deployer systemd service + timer
+- Caddy with admin API
+- S3 substituter configuration
+- Required directories and permissions
+
+### 5. Bild Integration (Omni/Bild.hs)
+
+New `--cache` flag that:
+1. Builds the target
+2. Signs the closure with cache key (using NIX_CACHE_KEY env var)
+3. Pushes to S3 cache
+4. Outputs the store path for push.sh to use
+
+Does NOT update manifest - that's push.sh's responsibility.
+
+### 6. Push.sh Enhancement (Omni/Ide/push.sh)
+
+Detect deploy mode from target extension:
+- `.nix` → NixOS deploy (existing behavior)
+- `.py`, `.hs`, etc. → Service deploy (new behavior)
+
+For service deploys:
+1. Call `bild <target> --cache`
+2. Capture store path from bild output
+3. Fetch current manifest.json from S3
+4. Archive current manifest to manifests/manifest-<timestamp>.json
+5. Update manifest with new storePath for this service
+6. Upload new manifest.json to S3
+7. Deployer on target picks up change within 5 minutes
+
+## Migration Path
+
+### Phase 1: Infrastructure Setup
+1. Create DO Spaces bucket
+2. Generate signing keys
+3. Configure S3 substituter on target
+4. Deploy base deployer service (empty manifest)
+
+### Phase 2: Migrate First Service
+1. Choose non-critical service (e.g., podcastitlater-worker)
+2. Add to manifest with different port
+3. Verify via staging route
+4. Flip Caddy to new service
+5. Disable old NixOS-managed service
+
+### Phase 3: Migrate Remaining Services
+- Repeat Phase 2 for each service
+- Order: worker → web → storybook
+
+### Phase 4: Cleanup
+- Remove service-specific NixOS modules
+- Simplify Biz.nix to base OS only
+
+## Rollback Strategy
+
+1. Each deploy archives current manifest to `/manifests/manifest-<ts>.json`
+2. Rollback = copy old manifest back to `manifest.json`
+3. Deployer sees new generation, converges to old state
+4. GC roots keep old closures alive (last 5 versions per service)
+
+## Scale-up Path
+
+| Stage | Hosts | Changes |
+|-------|-------|---------|
+| Current | 1 | Full architecture as described |
+| 2-3 hosts | 2-3 | Add `hosts` filtering, each host runs deployer |
+| 4+ hosts | 4+ | Consider Nomad with nix-nomad for job definitions |
+
+## Security Considerations
+
+- S3 bucket is private (authenticated reads/writes)
+- Signing key never leaves dev machine
+- Secrets stored out-of-band in `/var/lib/biz-secrets/`
+- systemd hardening for service isolation
+- Deployer validates manifest schema before applying
+
+## File Locations
+
+```
+Omni/
+  Deploy/
+    PLAN.md              # This document
+    Deployer.py          # Main deployer service
+    Deployer.nix         # NixOS module
+    Manifest.py          # Manifest schema/validation
+    Systemd.py           # Unit file generation
+    Caddy.py             # Caddy API integration
+    S3.py                # S3 operations (for deployer)
+  Bild.hs                # Add --cache flag for sign+push
+  Ide/
+    push.sh              # Enhanced: NixOS deploy OR service deploy + manifest update
+```
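
The plan's "Generate systemd unit file" step (the planned `Systemd.py`) can be illustrated with a minimal sketch. This is not the committed implementation: the function name `render_unit`, the `<storePath>/bin/<command>` layout for `ExecStart`, and the `[Install]` section are assumptions made for illustration. The directive names themselves (`After=`, `Requires=`, `EnvironmentFile=`, `DynamicUser=`, `PrivateTmp=`, `ProtectSystem=`, `ProtectHome=`) are standard systemd options, chosen here because they match the manifest's `systemd` and `hardening` keys.

```python
from typing import Any

# Manifest "hardening" keys and the systemd directives they appear to map to.
HARDENING_DIRECTIVES = {
    "dynamicUser": "DynamicUser",
    "privateTmp": "PrivateTmp",
    "protectSystem": "ProtectSystem",
    "protectHome": "ProtectHome",
}


def _value(raw: Any) -> str:
    """Render booleans as yes/no (accepted by systemd); pass other values through."""
    if isinstance(raw, bool):
        return "yes" if raw else "no"
    return str(raw)


def render_unit(service: dict[str, Any]) -> str:
    """Render unit-file text for one entry of the manifest's `services` list."""
    exec_cfg = service["exec"]
    systemd = service.get("systemd", {})
    store_path = service["artifact"]["storePath"]

    unit = [f"Description={service['name']} (managed by biz-deployer)"]
    # Empty lists such as "requires": [] produce no directive at all.
    if systemd.get("after"):
        unit.append("After=" + " ".join(systemd["after"]))
    if systemd.get("requires"):
        unit.append("Requires=" + " ".join(systemd["requires"]))

    svc = [
        # ASSUMPTION: the executable lives at <storePath>/bin/<command>.
        f"ExecStart={store_path}/bin/{exec_cfg['command']}",
        f"User={exec_cfg['user']}",
        f"Group={exec_cfg['group']}",
    ]
    # Simplification: values are not escaped, so quotes or newlines in an
    # env value would need extra handling.
    for key, value in sorted(service.get("env", {}).items()):
        svc.append(f'Environment="{key}={value}"')
    if service.get("envFile"):
        svc.append(f"EnvironmentFile={service['envFile']}")
    svc.append(f"Restart={systemd.get('restart', 'on-failure')}")
    svc.append(f"RestartSec={systemd.get('restartSec', 5)}")
    for key, directive in HARDENING_DIRECTIVES.items():
        if key in service.get("hardening", {}):
            svc.append(f"{directive}={_value(service['hardening'][key])}")

    sections = [
        ("Unit", unit),
        ("Service", svc),
        # ASSUMPTION: the plan does not say how generated units are enabled.
        ("Install", ["WantedBy=multi-user.target"]),
    ]
    return "\n\n".join(
        f"[{name}]\n" + "\n".join(lines) for name, lines in sections
    ) + "\n"
```

Given the `podcastitlater-web` entry from the manifest schema, this would produce a `[Service]` section containing `ProtectSystem=strict` and `PrivateTmp=yes`, and it would omit `Requires=` because the list is empty. Keeping rendering as a pure function from manifest entry to text makes it easy to test without touching systemd.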
