# Mini-PaaS Deployment System

## Overview

A pull-based deployment system that allows deploying Nix-built services without full NixOS rebuilds. Services are defined in a manifest, pulled from an S3 binary cache, and managed as systemd units with Caddy for reverse proxying.

## Problem Statement

Current deployment (`push.sh` + full NixOS rebuild) is slow and heavyweight:
- Every service change requires rebuilding the entire NixOS configuration
- Adding a new service requires modifying Biz.nix and doing a full rebuild
- Deploy time from "code ready" to "running in prod" is too long

## Goals

1. **Fast deploys**: Update a single service in <5 minutes without touching others
2. **Independent services**: Deploy services without NixOS rebuild
3. **Add services dynamically**: New services via manifest, no NixOS changes needed
4. **Maintain NixOS for base OS**: Keep NixOS for infra (Postgres, SSH, firewall)
5. **Clear scale-up path**: Single host now, easy migration to Nomad later

## Key Design Decisions

1. **Nix closures, not Docker**: Deploy Nix store paths directly, not containers. Simpler: no Docker daemon needed. Use systemd hardening for isolation.

2. **Pull-based, not push-based**: Target host polls S3 for manifest changes every 5 min. No SSH needed for deploys; just update the manifest.

3. **Caddy, not nginx**: Caddy has admin API for dynamic route updates and automatic HTTPS. No config file regeneration needed.

4. **Separation of concerns**:
   - `bild`: Build tool, adds `--cache` flag to sign+push closures
   - `push.sh`: Deploy orchestrator, handles both NixOS and service deploys
   - `deployer`: Runs on target, polls manifest, manages services

5. **Out-of-band secrets**: Secrets stored in `/var/lib/biz-secrets/*.env`; the manifest only references paths. No secrets in S3.

6. **Nix profiles for rollback**: Each service gets a Nix profile, enabling `nix-env --rollback`.

## Relevant Existing Files

- `Omni/Bild.hs` - Build tool, modify to add `--cache` flag
- `Omni/Bild.nix` - Nix build library, has `bild.run` for building packages
- `Omni/Ide/push.sh` - Current deploy script, enhance for service deploys
- `Biz.nix` - Current NixOS config for biz host
- `Biz/Packages.nix` - Builds all Biz packages
- `Biz/PodcastItLater/Web.nix` - Example NixOS service module (to be replaced)
- `Biz/PodcastItLater/Web.py` - Example Python service (deploy target)
- `Omni/Os/Base.nix` - Base NixOS config, add S3 substituter here

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              DEV MACHINE                                     │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │                           push.sh <target>                           │   │
│  │                                                                      │   │
│  │  if target.nix:     (NixOS deploy - existing behavior)              │   │
│  │    bild <target>                                                     │   │
│  │    nix copy --to ssh://host                                         │   │
│  │    ssh host switch-to-configuration                                 │   │
│  │                                                                      │   │
│  │  else:              (Service deploy - new behavior)                 │   │
│  │    bild <target> --cache  ──▶  sign + push closure to S3           │   │
│  │    update manifest.json in S3 with new storePath                    │   │
│  │    (deployer on target will pick up changes)                        │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  Separation of concerns:                                                     │
│  - bild: Build + sign + push to S3 cache (--cache flag)                     │
│  - push.sh: Orchestrates deploy, updates manifest, handles both modes       │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                    DO SPACES (S3 BINARY CACHE) - PRIVATE                    │
│                                                                              │
│   /nar/*.nar.xz           ← Compressed Nix store paths                      │
│   /*.narinfo              ← Metadata + signatures                           │
│   /nix-cache-info         ← Cache metadata                                  │
│   /manifest.json          ← Current deployment state                        │
│   /manifests/             ← Historical manifests for rollback               │
│     manifest-<ts>.json                                                       │
│                                                                              │
│   Authentication: AWS credentials (Spaces access key)                       │
│   - Dev machine: write access for pushing                                   │
│   - Target host: read access for pulling                                    │
└─────────────────────────────────────────────────────────────────────────────┘
                                      │
                              poll every 5 min
                                      ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                              TARGET HOST (biz)                              │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                         biz-deployer                                  │  │
│  │  (Python systemd service, runs every 5 min via timer)                │  │
│  │                                                                       │  │
│  │  1. Fetch manifest.json from S3                                      │  │
│  │  2. Compare to local state                                           │  │
│  │  3. For changed services:                                            │  │
│  │     - nix copy --from s3://... <storePath>                          │  │
│  │     - Generate systemd unit file                                     │  │
│  │     - Create GC root                                                 │  │
│  │     - systemctl daemon-reload && restart                            │  │
│  │  4. Update Caddy routes via API                                      │  │
│  │  5. Save local state                                                 │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  Directories:                                                                │
│  - /var/lib/biz-deployer/services/*.service  (generated units)             │
│  - /var/lib/biz-deployer/state.json          (local state)                 │
│  - /var/lib/biz-secrets/*.env                (secret env files)            │
│  - /nix/var/nix/gcroots/biz/*                (GC roots)                    │
│                                                                              │
│  NixOS manages:                                                              │
│  - Base OS, SSH, firewall                                                   │
│  - Caddy with admin API enabled                                             │
│  - PostgreSQL, Redis (infra services)                                       │
│  - biz-deployer service itself                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
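Step 4 of the deployer loop above ("Update Caddy routes via API") can be sketched as a small helper that turns a manifest entry's `http` block into a Caddy JSON route. The `match`/`handle`/`reverse_proxy` shape follows Caddy's JSON config structure; the `@id` tag and the `svc-` naming are assumptions here, chosen so the deployer could later replace a route via `PUT /id/<id>` on the admin API.

```python
def caddy_route(name: str, http: dict) -> dict:
    """Build a Caddy JSON route object for one service's `http` block."""
    return {
        # Hypothetical id convention so routes can be replaced in place.
        "@id": f"svc-{name}",
        "match": [{"host": [http["domain"]], "path": [http["path"] + "*"]}],
        "handle": [{
            "handler": "reverse_proxy",
            # Services listen on localhost; Caddy terminates TLS.
            "upstreams": [{"dial": f"127.0.0.1:{http['internalPort']}"}],
        }],
        "terminal": True,
    }
```

The deployer would POST this object to the Caddy admin API (by default on `localhost:2019`) rather than regenerating a config file.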

## Components

### 1. S3 Binary Cache (DO Spaces)

**Bucket**: `omni-nix-cache` (private)
**Region**: `nyc3` (or nearest)

**Credentials**:
- Dev machine: `~/.aws/credentials` with `[digitalocean]` profile
- Target host: `/root/.aws/credentials` with same profile

**Signing key**:
- Generate: `nix-store --generate-binary-cache-key omni-cache cache-priv-key.pem cache-pub-key.pem`
- Private key: `~/.config/nix/cache-priv-key.pem` (dev machine only)
- Public key: Added to target's `nix.settings.trusted-public-keys`

**S3 URL format**:
```
s3://omni-nix-cache?profile=digitalocean&scheme=https&endpoint=nyc3.digitaloceanspaces.com
```

### 2. Manifest Schema (v1)

```json
{
  "version": 1,
  "generation": "2025-01-15T12:34:56Z",
  "services": [
    {
      "name": "podcastitlater-web",
      "artifact": {
        "type": "nix-closure",
        "storePath": "/nix/store/abc123-podcastitlater-web-1.2.3"
      },
      "hosts": ["biz"],
      "exec": {
        "command": "podcastitlater-web",
        "user": "pil-web",
        "group": "pil"
      },
      "env": {
        "PORT": "8000",
        "AREA": "Live",
        "DATA_DIR": "/var/podcastitlater",
        "BASE_URL": "https://podcastitlater.com"
      },
      "envFile": "/var/lib/biz-secrets/podcastitlater-web.env",
      "http": {
        "domain": "podcastitlater.com",
        "path": "/",
        "internalPort": 8000
      },
      "systemd": {
        "after": ["network-online.target", "postgresql.service"],
        "requires": [],
        "restart": "on-failure",
        "restartSec": 5
      },
      "hardening": {
        "dynamicUser": false,
        "privateTmp": true,
        "protectSystem": "strict",
        "protectHome": true
      },
      "revision": "abc123def"
    }
  ]
}
```
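A minimal validation sketch for this schema, roughly what `Manifest.py` might do before the deployer applies anything. The set of required keys and the error-list return style are assumptions; a real implementation might prefer a JSON Schema.

```python
# Keys the deployer cannot function without (an assumption; adjust as needed).
REQUIRED_SERVICE_KEYS = {"name", "artifact", "hosts", "exec"}

def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the manifest is usable."""
    errors = []
    if manifest.get("version") != 1:
        errors.append("unsupported manifest version")
    for svc in manifest.get("services", []):
        missing = REQUIRED_SERVICE_KEYS - svc.keys()
        if missing:
            errors.append(f"{svc.get('name', '?')}: missing {sorted(missing)}")
        path = svc.get("artifact", {}).get("storePath", "")
        if not path.startswith("/nix/store/"):
            errors.append(f"{svc.get('name', '?')}: bad storePath {path!r}")
    return errors
```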

### 3. Deployer Service (Omni/Deploy/Deployer.py)

Python service that:
- Polls manifest from S3
- Pulls Nix closures
- Generates systemd units
- Updates Caddy via API
- Manages GC roots
- Tracks local state
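The unit-generation step might look like the following (a hypothetical `render_unit` helper, not the actual `Systemd.py`). It assumes the closure exposes the command under `bin/`, and maps the manifest's `systemd` and `hardening` blocks onto the corresponding directives.

```python
def render_unit(svc: dict) -> str:
    """Render a systemd unit file from one manifest service entry."""
    exec_ = svc["exec"]
    sd = svc.get("systemd", {})
    hard = svc.get("hardening", {})
    store = svc["artifact"]["storePath"]
    env_lines = "\n".join(
        f"Environment={k}={v}" for k, v in sorted(svc.get("env", {}).items())
    )
    return f"""[Unit]
Description={svc['name']} (biz-deployer managed)
After={' '.join(sd.get('after', []))}

[Service]
ExecStart={store}/bin/{exec_['command']}
User={exec_['user']}
Group={exec_['group']}
{env_lines}
EnvironmentFile={svc.get('envFile', '')}
Restart={sd.get('restart', 'on-failure')}
RestartSec={sd.get('restartSec', 5)}
PrivateTmp={'yes' if hard.get('privateTmp') else 'no'}
ProtectSystem={hard.get('protectSystem', 'full')}
ProtectHome={'yes' if hard.get('protectHome') else 'no'}

[Install]
WantedBy=multi-user.target
"""
```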

### 4. NixOS Module (Omni/Deploy/Deployer.nix)

Configures:
- biz-deployer systemd service + timer
- Caddy with admin API
- S3 substituter configuration
- Required directories and permissions

### 5. Bild Integration (Omni/Bild.hs)

New `--cache` flag that:
1. Builds the target
2. Signs the closure with cache key (using NIX_CACHE_KEY env var)
3. Pushes to S3 cache
4. Outputs the store path for push.sh to use

Does NOT update manifest - that's push.sh's responsibility.
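The sign + push steps might be constructed like this (a sketch: the `cache_commands` helper and its signature are illustrative, but `nix store sign` and `nix copy` are the real CLI entry points, and the cache URL matches the format shown earlier).

```python
# S3 URL from the Binary Cache section above.
CACHE_URL = (
    "s3://omni-nix-cache?profile=digitalocean"
    "&scheme=https&endpoint=nyc3.digitaloceanspaces.com"
)

def cache_commands(store_path: str, key_file: str) -> list[list[str]]:
    """Commands bild would run after a successful build of store_path."""
    return [
        # Sign the whole closure with the private cache key.
        ["nix", "store", "sign", "--recursive", "--key-file", key_file, store_path],
        # Push the signed closure to the S3 binary cache.
        ["nix", "copy", "--to", CACHE_URL, store_path],
    ]
```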

### 6. Push.sh Enhancement (Omni/Ide/push.sh)

Detect deploy mode from target extension:
- `.nix` → NixOS deploy (existing behavior)
- `.py`, `.hs`, etc. → Service deploy (new behavior)

For service deploys:
1. Call `bild <target> --cache`
2. Capture store path from bild output
3. Fetch current manifest.json from S3
4. Archive current manifest to manifests/manifest-<timestamp>.json
5. Update manifest with new storePath for this service
6. Upload new manifest.json to S3
7. Deployer on target picks up change within 5 minutes
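Step 5, the manifest rewrite, can be sketched as a pure function (a hypothetical `bump_service` helper; the `generation` value is passed in rather than computed so the function stays deterministic and testable).

```python
import copy

def bump_service(manifest: dict, name: str, store_path: str,
                 revision: str, generation: str) -> dict:
    """Return a new manifest with one service pointed at a new store path."""
    out = copy.deepcopy(manifest)  # never mutate the archived copy
    out["generation"] = generation
    for svc in out["services"]:
        if svc["name"] == name:
            svc["artifact"]["storePath"] = store_path
            svc["revision"] = revision
            return out
    raise KeyError(f"service {name!r} not in manifest")
```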

## Migration Path

### Phase 1: Infrastructure Setup
1. Create DO Spaces bucket
2. Generate signing keys
3. Configure S3 substituter on target
4. Deploy base deployer service (empty manifest)

### Phase 2: Migrate First Service
1. Choose non-critical service (e.g., podcastitlater-worker)
2. Add to manifest with different port
3. Verify via staging route
4. Flip Caddy to new service
5. Disable old NixOS-managed service

### Phase 3: Migrate Remaining Services
- Repeat Phase 2 for each service
- Order: worker → web → storybook

### Phase 4: Cleanup
- Remove service-specific NixOS modules
- Simplify Biz.nix to base OS only

## Rollback Strategy

1. Each deploy archives the current manifest to `/manifests/manifest-<ts>.json`
2. Rollback = copy old manifest back to `manifest.json`
3. Deployer sees new generation, converges to old state
4. GC roots keep old closures alive (last 5 versions per service)
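Choosing which archived manifest to restore can be a one-liner, assuming the `<ts>` placeholder is an ISO-8601 timestamp (as in the `generation` field above), since those sort lexicographically in time order. The `rollback_key` helper is hypothetical.

```python
def rollback_key(archived: list[str], steps: int = 1) -> str:
    """Pick the archived manifest key to restore: the newest by default,
    or `steps` generations back."""
    ordered = sorted(k for k in archived if k.startswith("manifests/manifest-"))
    return ordered[-steps]
```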

## Scale-up Path

| Stage | Hosts | Changes |
|-------|-------|---------|
| Current | 1 | Full architecture as described |
| 2-3 hosts | 2-3 | Add `hosts` filtering, each host runs deployer |
| 4+ hosts | 4+ | Consider Nomad with nix-nomad for job definitions |
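The `hosts` filtering needed for the 2-3 host stage could be as simple as the following (hypothetical helper); each host's deployer converges only on the services that list it.

```python
def services_for_host(manifest: dict, hostname: str) -> list[dict]:
    """Services this host's deployer should manage, per the `hosts` field."""
    return [s for s in manifest["services"] if hostname in s.get("hosts", [])]
```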

## Security Considerations

- S3 bucket is private (authenticated reads/writes)
- Signing key never leaves dev machine
- Secrets stored out-of-band in `/var/lib/biz-secrets/`
- systemd hardening for service isolation
- Deployer validates manifest schema before applying

## File Locations

```
Omni/
  Deploy/
    PLAN.md              # This document
    Deployer.py          # Main deployer service
    Deployer.nix         # NixOS module
    Manifest.py          # Manifest schema/validation
    Systemd.py           # Unit file generation
    Caddy.py             # Caddy API integration
    S3.py                # S3 operations (for deployer)
  Bild.hs                # Add --cache flag for sign+push
  Ide/
    push.sh              # Enhanced: NixOS deploy OR service deploy + manifest update
```