# PodcastItLater MVP Implementation Prompt
You are implementing a two-service MVP system called "PodcastItLater" that converts web articles to podcast episodes via email submission. This follows a monorepo namespace structure where all files live under `Biz/PodcastItLater/`.
## Code Organization & Structure
- **Primary files**:
- `Biz/PodcastItLater/Web.py` - web service (ludic app, routes, webhook)
- `Biz/PodcastItLater/Worker.py` - background processor
- `Biz/PodcastItLater/Models.py` - database schema and data access
- **Keep code in as few files as possible following monorepo conventions**
- **Namespaces are always capitalized** (this is a Python project but follows the Haskell-style namespace hierarchy)
## Technical Requirements
### Core Libraries
```python
# Required dependencies
import os           # environment configuration
import sqlite3      # database
import ludic        # web framework (see provided docs)
import trafilatura  # article content extraction
import openai       # OpenAI TTS API
import boto3        # S3 uploads
import feedgen      # RSS generation
import pydub        # audio manipulation, if needed
```
### Database Schema
```sql
-- Queue table for job processing
CREATE TABLE IF NOT EXISTS queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT,
    email TEXT,
    status TEXT DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    error_message TEXT
);

-- Episodes table for completed podcasts
CREATE TABLE IF NOT EXISTS episodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    content_length INTEGER,
    audio_url TEXT NOT NULL,
    duration INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
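A minimal `Models.py` sketch applying this schema; the function names (`connect`, `enqueue`) are illustrative, not prescribed by this spec:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT,
    email TEXT,
    status TEXT DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    error_message TEXT
);
CREATE TABLE IF NOT EXISTS episodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    content_length INTEGER,
    audio_url TEXT NOT NULL,
    duration INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""

def connect(path: str = "podcastitlater.db") -> sqlite3.Connection:
    """Open the database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # access columns by name
    conn.executescript(SCHEMA)
    return conn

def enqueue(conn: sqlite3.Connection, url: str, email: str) -> int:
    """Insert a pending job and return its row id."""
    cur = conn.execute(
        "INSERT INTO queue (url, email) VALUES (?, ?)", (url, email)
    )
    conn.commit()
    return cur.lastrowid
```

Using `CREATE TABLE IF NOT EXISTS` keeps startup idempotent, so both services can call `connect()` without coordinating migrations.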
## Service 1: Web Frontend (`Biz/PodcastItLater/Web.py`)
### Responsibilities
- Serve ludic + htmx web interface
- Handle mailgun webhook for email submissions
- Provide manual article submission form
- Display processing queue status
- Serve RSS podcast feed
- Basic podcast player for testing
### Required Routes
```python
@app.route("/")
def index():
    # Simple form to submit article URL
    # Display recent episodes and queue status
    # Use htmx for dynamic updates
    ...

@app.route("/submit", methods=["POST"])
def submit_article():
    # Handle manual form submission
    # Insert into queue table
    # Return htmx response with status
    ...

@app.route("/webhook/mailgun", methods=["POST"])
def mailgun_webhook():
    # Verify webhook signature before trusting the payload
    # Parse email, extract URLs from body
    # Insert into queue table
    ...

@app.route("/feed.xml")
def rss_feed():
    # Generate RSS from episodes table
    # Use feedgen library
    ...

@app.route("/status")
def queue_status():
    # HTMX endpoint for live queue updates
    # Return current queue + recent episodes
    ...
```
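Mailgun signs each webhook delivery with an HMAC-SHA256 over the concatenated timestamp and token, keyed by your signing key (the `MAILGUN_WEBHOOK_KEY` from the environment config below). A sketch of the verification step:

```python
import hashlib
import hmac

def verify_mailgun_signature(
    signing_key: str, timestamp: str, token: str, signature: str
) -> bool:
    """Return True if the signature matches Mailgun's HMAC-SHA256 scheme."""
    expected = hmac.new(
        key=signing_key.encode(),
        msg=(timestamp + token).encode(),
        digestmod=hashlib.sha256,
    ).hexdigest()
    # constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, signature)
```

Reject the request with a 403 when this returns `False`, before parsing anything else from the payload.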
### RSS Feed Metadata (hardcoded, except `base_url`)
```python
RSS_CONFIG = {
    "title": "Ben's Article Podcast",
    "description": "Web articles converted to audio",
    "author": "Ben Sima",
    "language": "en-US",
    "base_url": "https://your-domain.com",  # configure via env var
}
```
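The feed itself should come out of feedgen, but the target XML shape is worth pinning down. A stdlib sketch of the minimal RSS 2.0 structure (field names taken from `RSS_CONFIG`; the `enclosure` element is what podcast clients use to find the audio):

```python
import xml.etree.ElementTree as ET

def build_feed_xml(config: dict, episodes: list[dict]) -> str:
    """Build a minimal RSS 2.0 document; production code should use feedgen."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = config["title"]
    ET.SubElement(channel, "link").text = config["base_url"]
    ET.SubElement(channel, "description").text = config["description"]
    ET.SubElement(channel, "language").text = config["language"]
    for ep in episodes:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = ep["title"]
        # Podcast apps require an enclosure with the audio URL and MIME type
        ET.SubElement(item, "enclosure", url=ep["audio_url"], type="audio/mpeg")
    return ET.tostring(rss, encoding="unicode")
```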
## Service 2: Background Worker (`Biz/PodcastItLater/Worker.py`)
### Responsibilities
- Poll queue table every 30 seconds
- Extract article content using trafilatura
- Convert text to speech via OpenAI TTS
- Upload audio files to S3-compatible storage
- Update episodes table with completed episodes
- Handle errors with retry logic (3 attempts max)
### Processing Pipeline
```python
def process_article(queue_item):
    """Complete article processing pipeline."""
    try:
        # 1. Extract title and content with trafilatura
        title, content = extract_article_content(queue_item.url)
        # 2. Generate audio with OpenAI TTS; track duration for the feed
        audio_file, duration = text_to_speech(content)
        # 3. Upload to S3 and get the public URL
        audio_url = upload_to_s3(audio_file)
        # 4. Create episode record
        create_episode(title, audio_url, duration)
        # 5. Mark queue item as complete
        mark_complete(queue_item.id)
    except Exception as e:
        handle_error(queue_item.id, str(e))
```
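The 30-second polling loop itself can stay trivially simple. A sketch, where `process` stands in for the pipeline above and `max_loops` exists only so the loop is testable:

```python
import time

POLL_INTERVAL = 30  # seconds

def fetch_pending(conn):
    """Oldest pending queue items first."""
    return conn.execute(
        "SELECT * FROM queue WHERE status = 'pending' ORDER BY created_at"
    ).fetchall()

def run_worker(conn, process, interval=POLL_INTERVAL, max_loops=None):
    """Poll the queue forever (or max_loops times), handing items to `process`."""
    loops = 0
    while max_loops is None or loops < max_loops:
        for item in fetch_pending(conn):
            process(item)
        loops += 1
        if max_loops is None or loops < max_loops:
            time.sleep(interval)
```

Because `process_article` marks each item complete or errored, the next poll naturally skips anything already handled.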
### Configuration via Environment Variables
```python
# Required environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
S3_ENDPOINT = os.getenv("S3_ENDPOINT") # Digital Ocean Spaces
S3_BUCKET = os.getenv("S3_BUCKET")
S3_ACCESS_KEY = os.getenv("S3_ACCESS_KEY")
S3_SECRET_KEY = os.getenv("S3_SECRET_KEY")
MAILGUN_WEBHOOK_KEY = os.getenv("MAILGUN_WEBHOOK_KEY")
```
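A fail-fast check at startup avoids half-configured runs. A sketch (the variable list mirrors the one above; `missing_env` is an illustrative name):

```python
import os

REQUIRED_ENV = [
    "OPENAI_API_KEY",
    "S3_ENDPOINT",
    "S3_BUCKET",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
    "MAILGUN_WEBHOOK_KEY",
]

def missing_env(env=os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]
```

Call it in both services' entry points and exit with a clear message listing whatever is missing.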
## Email Processing Logic
- Parse the email body for the first HTTP/HTTPS URL found
- If no URL is found, treat the entire email body as the article content
- Store the original email body in the queue record for debugging (this requires one extra column beyond the schema above, e.g. `original_email TEXT`)
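The first-URL rule reduces to one regex. A sketch, with a deliberately simple pattern (it stops at whitespace and common delimiters rather than fully validating URLs):

```python
import re

URL_RE = re.compile(r"https?://[^\s<>\"']+")

def extract_submission(body: str) -> dict:
    """Return the first URL in the email body, or fall back to the raw text."""
    match = URL_RE.search(body)
    if match:
        return {"url": match.group(0), "text": None}
    return {"url": None, "text": body.strip()}
```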
## Error Handling Strategy
- Log all errors but continue processing
- Failed jobs marked with 'error' status and error message
- Retry logic: 3 attempts with exponential backoff
- Graceful degradation when external services fail
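The retry policy (3 attempts, exponential backoff) can live in one small helper so the pipeline stays linear. A sketch; the delay values are illustrative:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure retry with exponential backoff, re-raising the last error."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller mark the job as errored
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Wrap only the flaky external calls (extraction, TTS, S3) rather than the whole pipeline, so a late failure does not repeat earlier successful steps.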
## Audio Configuration
- **Format**: MP3, 128kbps
- **TTS Voice**: a single fixed OpenAI voice (e.g. `alloy`); voice selection can come later
- **File naming**: `episode_{timestamp}_{id}.mp3`
## HTMX Frontend Behavior
- Auto-refresh queue status every 30 seconds
- Form submission without page reload
- Simple progress indicators for processing jobs
- Basic audio player for testing episodes
## Testing Requirements
Create tests covering:
- Article content extraction accuracy
- TTS API integration (with mocking)
- S3 upload/download functionality
- RSS feed generation and XML validation
- Email webhook parsing and security
- Database operations and data integrity
- End-to-end submission workflow
## Success Criteria
The MVP should successfully:
1. Receive article submissions via email webhook
2. Extract clean article content
3. Convert text to high-quality audio
4. Store audio in S3-compatible storage
5. Generate valid RSS podcast feed
6. Provide basic web interface for monitoring
7. Handle errors gracefully without crashing
## Implementation Notes
- Start with Web.py service first, then Worker.py
- Use simple polling rather than complex job queues
- Focus on reliability over performance for MVP
- Keep total code to roughly 300-400 lines
- Use reasonable defaults everywhere possible
- Prioritize working code over perfect code
Implement this as a robust, deployable MVP that can handle real-world article processing workloads while maintaining simplicity.