author Ben Sima <ben@bensima.com> 2025-12-16 08:06:09 -0500
committer Ben Sima <ben@bensima.com> 2025-12-16 08:06:09 -0500
commit a7dcb30c7a465d9fce72b7fc3e605470b2b59814
tree 57a6436de34062773483dbd0cb745ac103c6bb48 /Omni/Deploy/PLAN.md
parent 4caefe45756fdc21df990b8d6e826c40db1b9c78
feat(deploy): Complete mini-PaaS deployment system (t-266)

- Add Omni/Deploy/ with Manifest, Deployer, Systemd, Caddy modules
- Manifest CLI: show, update, add-service, list, rollback commands
- Deployer: polls S3 manifest, pulls closures, manages systemd units
- Caddy integration for dynamic reverse proxy routes
- bild: auto-cache to S3, outputs STORE_PATH for push.sh
- push.sh: supports both NixOS and service deploys
- Biz.nix: simplified to base OS + deployer only
- Services (podcastitlater-web/worker) now deployer-managed
- Documentation: README.md with operations guide

Diffstat (limited to 'Omni/Deploy/PLAN.md')
-rw-r--r-- Omni/Deploy/PLAN.md 299
1 files changed, 299 insertions, 0 deletions
diff --git a/Omni/Deploy/PLAN.md b/Omni/Deploy/PLAN.md
new file mode 100644
index 0000000..1870ebd
--- /dev/null
+++ b/Omni/Deploy/PLAN.md
@@ -0,0 +1,299 @@
+# Mini-PaaS Deployment System
+
+## Overview
+
+A pull-based deployment system that allows deploying Nix-built services without full NixOS rebuilds. Services are defined in a manifest, pulled from an S3 binary cache, and managed as systemd units with Caddy for reverse proxying.
+
+## Problem Statement
+
+Current deployment (`push.sh` + full NixOS rebuild) is slow and heavyweight:
+- Every service change requires rebuilding the entire NixOS configuration
+- Adding a new service requires modifying Biz.nix and doing a full rebuild
+- Deploy time from "code ready" to "running in prod" is too long
+
+## Goals
+
+1. **Fast deploys**: Update a single service in <5 minutes without touching others
+2. **Independent services**: Deploy services without NixOS rebuild
+3. **Add services dynamically**: New services via manifest, no NixOS changes needed
+4. **Maintain NixOS for base OS**: Keep NixOS for infra (Postgres, SSH, firewall)
+5. **Clear scale-up path**: Single host now, easy migration to Nomad later
+
+## Key Design Decisions
+
+1. **Nix closures, not Docker**: Deploy Nix store paths directly, not containers. Simpler, no Docker daemon needed. Use systemd hardening for isolation.
+
+2. **Pull-based, not push-based**: Target host polls S3 for manifest changes every 5 min. No SSH needed for deploys, just update manifest.
+
+3. **Caddy, not nginx**: Caddy has admin API for dynamic route updates and automatic HTTPS. No config file regeneration needed.
+
+4. **Separation of concerns**:
+   - `bild`: Build tool, adds `--cache` flag to sign+push closures
+   - `push.sh`: Deploy orchestrator, handles both NixOS and service deploys
+   - `deployer`: Runs on target, polls manifest, manages services
+
+5. **Out-of-band secrets**: Secrets stored in `/var/lib/biz-secrets/*.env`, manifest only references paths. No secrets in S3.
+
+6. **Nix profiles for rollback**: Each service gets a Nix profile, enabling `nix-env --rollback`.
+
+## Relevant Existing Files
+
+- `Omni/Bild.hs` - Build tool, modify to add `--cache` flag
+- `Omni/Bild.nix` - Nix build library, has `bild.run` for building packages
+- `Omni/Ide/push.sh` - Current deploy script, enhance for service deploys
+- `Biz.nix` - Current NixOS config for biz host
+- `Biz/Packages.nix` - Builds all Biz packages
+- `Biz/PodcastItLater/Web.nix` - Example NixOS service module (to be replaced)
+- `Biz/PodcastItLater/Web.py` - Example Python service (deploy target)
+- `Omni/Os/Base.nix` - Base NixOS config, add S3 substituter here
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ DEV MACHINE │
+│ │
+│ ┌─────────────────────────────────────────────────────────────────────┐ │
+│ │ push.sh <target> │ │
+│ │ │ │
+│ │ if target.nix: (NixOS deploy - existing behavior) │ │
+│ │ bild <target> │ │
+│ │ nix copy --to ssh://host │ │
+│ │ ssh host switch-to-configuration │ │
+│ │ │ │
+│ │ else: (Service deploy - new behavior) │ │
+│ │ bild <target> --cache ──▶ sign + push closure to S3 │ │
+│ │ update manifest.json in S3 with new storePath │ │
+│ │ (deployer on target will pick up changes) │ │
+│ └─────────────────────────────────────────────────────────────────────┘ │
+│ │
+│ Separation of concerns: │
+│ - bild: Build + sign + push to S3 cache (--cache flag) │
+│ - push.sh: Orchestrates deploy, updates manifest, handles both modes │
+└─────────────────────────────────────────────────────────────────────────────┘
+ │
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ DO SPACES (S3 BINARY CACHE) - PRIVATE │
+│ │
+│ /nar/*.nar.xz ← Compressed Nix store paths │
+│ /*.narinfo ← Metadata + signatures │
+│ /nix-cache-info ← Cache metadata │
+│ /manifest.json ← Current deployment state │
+│ /manifests/ ← Historical manifests for rollback │
+│ manifest-<ts>.json │
+│ │
+│ Authentication: AWS credentials (Spaces access key) │
+│ - Dev machine: write access for pushing │
+│ - Target host: read access for pulling │
+└─────────────────────────────────────────────────────────────────────────────┘
+ │
+ poll every 5 min
+ ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│ TARGET HOST (biz) │
+│ │
+│ ┌──────────────────────────────────────────────────────────────────────┐ │
+│ │ biz-deployer │ │
+│ │ (Python systemd service, runs every 5 min via timer) │ │
+│ │ │ │
+│ │ 1. Fetch manifest.json from S3 │ │
+│ │ 2. Compare to local state │ │
+│ │ 3. For changed services: │ │
+│ │ - nix copy --from s3://... <storePath> │ │
+│ │ - Generate systemd unit file │ │
+│ │ - Create GC root │ │
+│ │ - systemctl daemon-reload && restart │ │
+│ │ 4. Update Caddy routes via API │ │
+│ │ 5. Save local state │ │
+│ └──────────────────────────────────────────────────────────────────────┘ │
+│ │
+│ Directories: │
+│ - /var/lib/biz-deployer/services/*.service (generated units) │
+│ - /var/lib/biz-deployer/state.json (local state) │
+│ - /var/lib/biz-secrets/*.env (secret env files) │
+│ - /nix/var/nix/gcroots/biz/* (GC roots) │
+│ │
+│ NixOS manages: │
+│ - Base OS, SSH, firewall │
+│ - Caddy with admin API enabled │
+│ - PostgreSQL, Redis (infra services) │
+│ - biz-deployer service itself │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
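Steps 2 and 3 of the deployer loop above come down to diffing the fetched manifest against local state. A minimal sketch of that comparison, assuming local state maps each service name to its last applied store path (the plan does not specify the `state.json` layout) and that a service applies to a host only when the host appears in its `hosts` list. The function name is hypothetical.

```python
def changed_services(manifest: dict, state: dict, host: str) -> list:
    """Return manifest entries for `host` whose storePath differs from local state.

    `state` is assumed to map service name -> last applied store path;
    the real state.json layout is not specified in the plan.
    """
    changed = []
    for svc in manifest.get("services", []):
        if host not in svc.get("hosts", []):
            continue  # service is not targeted at this host
        wanted = svc["artifact"]["storePath"]
        if state.get(svc["name"]) != wanted:
            changed.append(svc)
    return changed
```

Comparing store paths rather than the `generation` timestamp means that restoring an older manifest is treated like any other change, which matches the rollback strategy described later.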
+
+## Components
+
+### 1. S3 Binary Cache (DO Spaces)
+
+**Bucket**: `omni-nix-cache` (private)
+**Region**: `nyc3` (or nearest)
+
+**Credentials**:
+- Dev machine: `~/.aws/credentials` with `[digitalocean]` profile
+- Target host: `/root/.aws/credentials` with the same `[digitalocean]` profile name, backed by a read-only key
+
+**Signing key**:
+- Generate: `nix-store --generate-binary-cache-key omni-cache cache-priv-key.pem cache-pub-key.pem`
+- Private key: `~/.config/nix/cache-priv-key.pem` (dev machine only)
+- Public key: Added to target's `nix.settings.trusted-public-keys`
+
+**S3 URL format**:
+```
+s3://omni-nix-cache?profile=digitalocean&scheme=https&endpoint=nyc3.digitaloceanspaces.com
+```
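The cache URL is an ordinary `s3://` store URI with query parameters, so it can be assembled from its parts and handed to `nix copy --from`, as in the deployer diagram. A small sketch; the helper names `cache_url` and `pull_command` are hypothetical and not part of the plan.

```python
from urllib.parse import urlencode


def cache_url(bucket: str, profile: str, endpoint: str) -> str:
    """Build the S3 binary cache URL in the format shown above."""
    query = urlencode({"profile": profile, "scheme": "https", "endpoint": endpoint})
    return f"s3://{bucket}?{query}"


def pull_command(store_path: str, url: str) -> list:
    """Argument vector for pulling one closure from the cache (not executed here)."""
    return ["nix", "copy", "--from", url, store_path]
```

Returning an argument list rather than a shell string lets the caller pass it straight to `subprocess.run` without quoting the `&` characters in the URL.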
+
+### 2. Manifest Schema (v1)
+
+```json
+{
+  "version": 1,
+  "generation": "2025-01-15T12:34:56Z",
+  "services": [
+    {
+      "name": "podcastitlater-web",
+      "artifact": {
+        "type": "nix-closure",
+        "storePath": "/nix/store/abc123-podcastitlater-web-1.2.3"
+      },
+      "hosts": ["biz"],
+      "exec": {
+        "command": "podcastitlater-web",
+        "user": "pil-web",
+        "group": "pil"
+      },
+      "env": {
+        "PORT": "8000",
+        "AREA": "Live",
+        "DATA_DIR": "/var/podcastitlater",
+        "BASE_URL": "https://podcastitlater.com"
+      },
+      "envFile": "/var/lib/biz-secrets/podcastitlater-web.env",
+      "http": {
+        "domain": "podcastitlater.com",
+        "path": "/",
+        "internalPort": 8000
+      },
+      "systemd": {
+        "after": ["network-online.target", "postgresql.service"],
+        "requires": [],
+        "restart": "on-failure",
+        "restartSec": 5
+      },
+      "hardening": {
+        "dynamicUser": false,
+        "privateTmp": true,
+        "protectSystem": "strict",
+        "protectHome": true
+      },
+      "revision": "abc123def"
+    }
+  ]
+}
+```
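The security section says the deployer validates the manifest schema before applying it. One way that check could look is sketched below. Which keys are mandatory is an assumption made for illustration (the plan does not mark any field as required), and `validate_manifest` is a hypothetical name.

```python
# Keys treated as mandatory here; this choice is an assumption, not from the plan.
REQUIRED_SERVICE_KEYS = ("name", "artifact", "hosts", "exec")


def validate_manifest(manifest: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if manifest.get("version") != 1:
        errors.append("unsupported manifest version")
    seen = set()
    for i, svc in enumerate(manifest.get("services", [])):
        missing = [k for k in REQUIRED_SERVICE_KEYS if k not in svc]
        if missing:
            errors.append(f"services[{i}]: missing {', '.join(missing)}")
            continue
        if svc["name"] in seen:
            errors.append(f"services[{i}]: duplicate name {svc['name']}")
        seen.add(svc["name"])
        artifact = svc["artifact"]
        if artifact.get("type") != "nix-closure":
            errors.append(f"services[{i}]: unknown artifact type")
        if not str(artifact.get("storePath", "")).startswith("/nix/store/"):
            errors.append(f"services[{i}]: storePath is not under /nix/store/")
    return errors
```

Collecting every problem instead of raising on the first one gives a single, complete report when a hand-edited manifest has several mistakes.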
+
+### 3. Deployer Service (Omni/Deploy/Deployer.py)
+
+Python service that:
+- Polls manifest from S3
+- Pulls Nix closures
+- Generates systemd units
+- Updates Caddy via API
+- Manages GC roots
+- Tracks local state
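Generating a systemd unit is largely a field-by-field mapping from a manifest entry to unit directives. The sketch below assumes the executable lives at `<storePath>/bin/<command>` and that each `hardening` key maps to the systemd option of the same name. Neither mapping is spelled out in the plan, and `render_unit` is a hypothetical name.

```python
def _yesno(flag: bool) -> str:
    """systemd accepts yes/no for boolean options."""
    return "yes" if flag else "no"


def render_unit(svc: dict) -> str:
    """Render a unit file from one manifest entry (field mapping is an assumption)."""
    sd, hard, exe = svc.get("systemd", {}), svc.get("hardening", {}), svc["exec"]

    unit = [f"Description={svc['name']}"]
    if sd.get("after"):
        unit.append("After=" + " ".join(sd["after"]))
    if sd.get("requires"):
        unit.append("Requires=" + " ".join(sd["requires"]))

    service = [
        # Assumes the executable lives under <storePath>/bin/.
        f"ExecStart={svc['artifact']['storePath']}/bin/{exe['command']}",
        f"User={exe['user']}",
        f"Group={exe['group']}",
    ]
    service += [f'Environment="{k}={v}"' for k, v in svc.get("env", {}).items()]
    if svc.get("envFile"):
        # Secrets stay out-of-band; the unit only references the file.
        service.append(f"EnvironmentFile={svc['envFile']}")
    service.append(f"Restart={sd.get('restart', 'on-failure')}")
    service.append(f"RestartSec={sd.get('restartSec', 5)}")
    service.append(f"DynamicUser={_yesno(hard.get('dynamicUser', False))}")
    service.append(f"PrivateTmp={_yesno(hard.get('privateTmp', True))}")
    service.append(f"ProtectSystem={hard.get('protectSystem', 'strict')}")
    service.append(f"ProtectHome={_yesno(hard.get('protectHome', True))}")

    sections = [
        ("Unit", unit),
        ("Service", service),
        ("Install", ["WantedBy=multi-user.target"]),
    ]
    return "\n\n".join(f"[{name}]\n" + "\n".join(body) for name, body in sections) + "\n"
```

Empty `after` or `requires` lists are skipped so the generated unit never contains a bare `Requires=` line.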
+
+### 4. NixOS Module (Omni/Deploy/Deployer.nix)
+
+Configures:
+- biz-deployer systemd service + timer
+- Caddy with admin API
+- S3 substituter configuration
+- Required directories and permissions
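With the admin API enabled, a route can be added by sending a JSON route object to Caddy's config endpoint (by default `localhost:2019`, under `/config/apps/http/servers/<server>/routes`). The sketch below only builds that object from a manifest entry's `http` block. The `@id` naming scheme and the function name are assumptions, and the server name in the path depends on how Caddy is configured on the host.

```python
def caddy_route(svc: dict) -> dict:
    """Build a Caddy JSON route that reverse-proxies a service's domain to its port."""
    http = svc["http"]
    matcher = {"host": [http["domain"]]}
    path = http.get("path", "/")
    if path != "/":
        # Only add a path matcher when the service is mounted below the root.
        matcher["path"] = [path.rstrip("/") + "/*"]
    return {
        "@id": f"biz-{svc['name']}",  # hypothetical id scheme for later updates
        "match": [matcher],
        "handle": [
            {
                "handler": "reverse_proxy",
                "upstreams": [{"dial": f"localhost:{http['internalPort']}"}],
            }
        ],
    }
```

Tagging each route with an `@id` lets a later deploy replace or delete exactly that route through Caddy's `/id/<id>` endpoint without touching the others.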
+
+### 5. Bild Integration (Omni/Bild.hs)
+
+New `--cache` flag that:
+1. Builds the target
+2. Signs the closure with cache key (using NIX_CACHE_KEY env var)
+3. Pushes to S3 cache
+4. Outputs the store path for push.sh to use
+
+Does NOT update manifest - that's push.sh's responsibility.
+
+### 6. Push.sh Enhancement (Omni/Ide/push.sh)
+
+Detect deploy mode from target extension:
+- `.nix` → NixOS deploy (existing behavior)
+- `.py`, `.hs`, etc. → Service deploy (new behavior)
+
+For service deploys:
+1. Call `bild <target> --cache`
+2. Capture store path from bild output
+3. Fetch current manifest.json from S3
+4. Archive current manifest to manifests/manifest-<timestamp>.json
+5. Update manifest with new storePath for this service
+6. Upload new manifest.json to S3
+7. Deployer on target picks up change within 5 minutes
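The mode detection and the manifest update (steps 4 and 5) are small pure functions once the S3 I/O is set aside. `push.sh` itself is a shell script, so the Python below is only an illustration of the logic. The timestamp format in the archive key and all function names are assumptions.

```python
import copy
from datetime import datetime, timezone
from pathlib import PurePosixPath


def deploy_mode(target: str) -> str:
    """'.nix' targets are NixOS deploys; anything else is a service deploy."""
    return "nixos" if PurePosixPath(target).suffix == ".nix" else "service"


def archive_key(now: datetime) -> str:
    """Object key for the archived manifest; `now` must be in UTC.

    The timestamp format is an assumption; the plan only says `<timestamp>`.
    """
    return f"manifests/manifest-{now.strftime('%Y%m%dT%H%M%SZ')}.json"


def with_store_path(manifest: dict, service: str, store_path: str, now: datetime) -> dict:
    """Return a copy of `manifest` with one service's storePath replaced."""
    updated = copy.deepcopy(manifest)
    for svc in updated["services"]:
        if svc["name"] == service:
            svc["artifact"]["storePath"] = store_path
            break
    else:
        raise KeyError(f"service not in manifest: {service}")
    updated["generation"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return updated
```

Working on a deep copy keeps the fetched manifest intact, so the unmodified original can be uploaded to `manifests/` as the archive before the new `manifest.json` is written.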
+
+## Migration Path
+
+### Phase 1: Infrastructure Setup
+1. Create DO Spaces bucket
+2. Generate signing keys
+3. Configure S3 substituter on target
+4. Deploy base deployer service (empty manifest)
+
+### Phase 2: Migrate First Service
+1. Choose non-critical service (e.g., podcastitlater-worker)
+2. Add to manifest with different port
+3. Verify via staging route
+4. Flip Caddy to new service
+5. Disable old NixOS-managed service
+
+### Phase 3: Migrate Remaining Services
+- Repeat Phase 2 for each service
+- Order: worker → web → storybook
+
+### Phase 4: Cleanup
+- Remove service-specific NixOS modules
+- Simplify Biz.nix to base OS only
+
+## Rollback Strategy
+
+1. Each deploy archives current manifest to `/manifests/manifest-<ts>.json`
+2. Rollback = copy old manifest back to `manifest.json`
+3. Deployer sees new generation, converges to old state
+4. GC roots keep old closures alive (last 5 versions per service)
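Step 4 implies a pruning rule: keep GC roots for the five most recent distinct versions of a service and release the rest. A sketch of that rule, assuming the deployer records each service's deployed store paths in order, oldest first. That record and the name `prune_roots` are hypothetical.

```python
def prune_roots(history: list, keep: int = 5) -> tuple:
    """Split deploy history (oldest first) into (kept, dropped) store paths.

    Paths are de-duplicated so a rollback to an earlier version does not
    consume an extra slot.
    """
    newest_first = list(dict.fromkeys(reversed(history)))
    return newest_first[:keep], newest_first[keep:]
```

For example, after seven deploys of distinct versions the two oldest fall out of the kept set, so rolling back to either would require pulling the closure from the S3 cache again.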
+
+## Scale-up Path
+
+| Stage | Hosts | Changes |
+|-------|-------|---------|
+| Current | 1 | Full architecture as described |
+| 2-3 hosts | 2-3 | Add `hosts` filtering, each host runs deployer |
+| 4+ hosts | 4+ | Consider Nomad with nix-nomad for job definitions |
+
+## Security Considerations
+
+- S3 bucket is private (authenticated reads/writes)
+- Signing key never leaves dev machine
+- Secrets stored out-of-band in `/var/lib/biz-secrets/`
+- systemd hardening for service isolation
+- Deployer validates manifest schema before applying
+
+## File Locations
+
+```
+Omni/
+  Deploy/
+    PLAN.md          # This document
+    Deployer.py      # Main deployer service
+    Deployer.nix     # NixOS module
+    Manifest.py      # Manifest schema/validation
+    Systemd.py       # Unit file generation
+    Caddy.py         # Caddy API integration
+    S3.py            # S3 operations (for deployer)
+  Bild.hs            # Add --cache flag for sign+push
+  Ide/
+    push.sh          # Enhanced: NixOS deploy OR service deploy + manifest update
+```