# PodcastItLater

A service that converts web articles to podcast episodes via email submission or a web interface. Users submit articles and receive them as audio episodes in their personal podcast feed.

## Current Implementation Status

### Architecture
- **Web Service** (`Biz/PodcastItLater/Web.py`) - Ludic web app with HTMX interface
- **Background Worker** (`Biz/PodcastItLater/Worker.py`) - Processes articles to audio
- **Core/Database** (`Biz/PodcastItLater/Core.py`) - Shared database operations

### Features Implemented

#### User Management
- Email-based registration/login (no passwords)
- Session-based authentication
- Personal RSS feed tokens
- User-specific data isolation
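
One way to mint the personal RSS feed tokens listed above is Python's `secrets` module. This is a minimal sketch, not the code in `Core.py`; the helper name `new_feed_token` and the 32-byte length are assumptions.

```python
import secrets


def new_feed_token(nbytes: int = 32) -> str:
    """Return a URL-safe random token (hypothetical helper, not from Core.py)."""
    # token_urlsafe draws from the OS CSPRNG and base64url-encodes the bytes,
    # so the result can be embedded in a feed URL without escaping.
    return secrets.token_urlsafe(nbytes)
```

The `UNIQUE` constraint on `users.token` would still be the final guard against a collision.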

#### Article Processing
- Manual URL submission via web form
- Content extraction with trafilatura
- LLM-powered text preparation for natural speech
- OpenAI TTS conversion with chunking for long articles
- S3-compatible storage (DigitalOcean Spaces)
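
The chunking step for long articles could look roughly like the sketch below, which splits on sentence boundaries so each TTS request stays under a size limit. It is not the logic in `Worker.py`; the name `chunk_text` and the 4096-character default are assumptions, and the real per-request limit should be checked against the TTS provider's documentation.

```python
import re

MAX_CHARS = 4096  # assumed per-request input limit; verify with the provider


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent to the TTS API separately and the resulting audio segments joined in order.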

#### Web Interface
- Login/logout functionality
- Submit article form
- Live queue status updates (HTMX)
- Recent episodes with audio player
- Personal RSS feed URL display
- Admin queue view with retry/delete actions

#### RSS Feeds
- Personalized feeds at `/feed/{user_token}.xml`
- User-specific episode filtering
- Customized feed titles based on user email
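
A personalized feed of this shape can be built with the standard library alone. The sketch below is illustrative only: `Episode`, `build_feed`, the title format, and the `audio_bytes` field (the enclosure's byte length, which the current schema does not store) are assumptions rather than the code in `Web.py`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime, timezone
from email.utils import format_datetime


@dataclass
class Episode:
    title: str
    audio_url: str
    audio_bytes: int  # hypothetical field: size of the audio file in bytes
    created_at: datetime


def build_feed(email: str, base_url: str, token: str, episodes: list[Episode]) -> str:
    """Return RSS 2.0 XML for one user's episodes, newest first."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"PodcastItLater: {email}"
    ET.SubElement(channel, "link").text = f"{base_url.rstrip('/')}/feed/{token}.xml"
    ET.SubElement(channel, "description").text = f"Articles converted to audio for {email}"
    for ep in sorted(episodes, key=lambda e: e.created_at, reverse=True):
        created = ep.created_at
        if created.tzinfo is None:
            # SQLite's CURRENT_TIMESTAMP is UTC, so treat naive values as UTC.
            created = created.replace(tzinfo=timezone.utc)
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = ep.title
        ET.SubElement(item, "pubDate").text = format_datetime(created)
        ET.SubElement(
            item,
            "enclosure",
            url=ep.audio_url,
            type="audio/mpeg",
            length=str(ep.audio_bytes),
        )
    return ET.tostring(rss, encoding="unicode")
```

Building the tree with `ElementTree` rather than string formatting means titles containing `&` or `<` are escaped automatically.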

### Database Schema
```sql
-- Users table
CREATE TABLE users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE NOT NULL,
    token TEXT UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Queue table with user support
CREATE TABLE queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT,
    email TEXT,
    user_id INTEGER REFERENCES users(id),
    status TEXT DEFAULT 'pending',
    retry_count INTEGER DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    error_message TEXT
);

-- Episodes table with user support
CREATE TABLE episodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    content_length INTEGER,
    audio_url TEXT NOT NULL,
    duration INTEGER,
    user_id INTEGER REFERENCES users(id),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
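
To illustrate how this schema supports user-specific episode filtering, here is a small sketch against an in-memory SQLite database. It uses an abridged copy of the `users` and `episodes` tables; the helpers `episodes_for_token` and `demo_db` are hypothetical and not taken from `Core.py`.

```python
import sqlite3

# Abridged copy of the users and episodes tables from the schema above.
SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT UNIQUE NOT NULL,
    token TEXT UNIQUE NOT NULL
);
CREATE TABLE episodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    audio_url TEXT NOT NULL,
    user_id INTEGER REFERENCES users(id),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""


def episodes_for_token(conn: sqlite3.Connection, token: str) -> list[str]:
    """Return episode titles owned by the feed token's user, newest first."""
    rows = conn.execute(
        """
        SELECT e.title
        FROM episodes AS e
        JOIN users AS u ON u.id = e.user_id
        WHERE u.token = ?
        ORDER BY e.created_at DESC, e.id DESC
        """,
        (token,),
    ).fetchall()
    return [row[0] for row in rows]


def demo_db() -> sqlite3.Connection:
    """Build an in-memory database with two users and three episodes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users (email, token) VALUES ('a@example.com', 'tok-a')")
    conn.execute("INSERT INTO users (email, token) VALUES ('b@example.com', 'tok-b')")
    conn.executemany(
        "INSERT INTO episodes (title, audio_url, user_id) VALUES (?, ?, ?)",
        [
            ("A1", "https://example.com/a1.mp3", 1),
            ("B1", "https://example.com/b1.mp3", 2),
            ("A2", "https://example.com/a2.mp3", 1),
        ],
    )
    return conn
```

Because `CURRENT_TIMESTAMP` has one-second resolution, the query breaks ties on `id` so that ordering stays deterministic.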

## Phase 3: Path to Paid Product

### Immediate Priorities

#### 1. Usage Limits & Billing Infrastructure
- Add usage tracking to users table (articles_processed, audio_minutes)
- Implement free tier limits (e.g., 10 articles/month)
- Add subscription status and tier to users
- Integrate Stripe for payments
- Create billing webhook handlers

#### 2. Enhanced User Experience
- Implement article preview/editing before conversion
- Add voice selection options
- Support for multiple TTS providers (cost optimization)
- Batch processing for multiple URLs

#### 3. Content Quality Improvements
- Better handling of different article types (news, blogs, research papers)
- Improved code block and technical content handling
- Table/chart description generation
- Multi-language support
- Custom intro/outro options

#### 4. Admin & Analytics
- Admin dashboard for monitoring all users
- Usage analytics and metrics
- Cost tracking per user
- System health monitoring
- Automated error alerting

### Technical Improvements Needed

#### Security & Reliability
- Add rate limiting per user
- Implement proper API authentication (not just session-based)
- Add request signing for webhook security
- Backup and disaster recovery for database
- Queue persistence across worker restarts
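
As a rough illustration of request signing for webhooks, the sketch below computes and verifies an HMAC-SHA256 signature over the raw request body. It is a generic scheme with hypothetical helper names, not Stripe's signature format; for Stripe events, the verification provided by Stripe's own SDK should be preferred.

```python
import hashlib
import hmac


def sign_payload(payload: bytes, secret: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(payload: bytes, signature: str, secret: bytes) -> bool:
    """Check a received signature using a constant-time comparison."""
    expected = sign_payload(payload, secret)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The signature must be checked against the exact bytes received, before any JSON parsing or re-serialization.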

#### Performance & Scalability
- Move from SQLite to PostgreSQL
- Implement proper job queue (Redis/RabbitMQ)
- Add caching layer for processed articles
- CDN for audio file delivery
- Horizontal scaling for workers

#### Code Quality
- Add comprehensive test suite
- API documentation
- Error tracking (Sentry)
- Structured logging with correlation IDs
- Configuration management (not just env vars)

### Pricing Model Considerations
- Free tier: 5-10 articles/month, basic voice
- Personal: $5-10/month, 50 articles, voice selection
- Pro: $20-30/month, unlimited articles, priority processing
- API access for developers
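
Enforcing these tiers could start from a simple lookup table. The sketch below assumes the upper end of the free range (10 articles/month) and the 50-article Personal cap from the list above; the tier keys, `Usage`, and `can_submit` are hypothetical names, not existing code.

```python
from dataclasses import dataclass
from typing import Optional

# Monthly article limits per tier; None means unlimited.
TIER_LIMITS: dict[str, Optional[int]] = {
    "free": 10,      # assumed: upper end of the 5-10 range
    "personal": 50,
    "pro": None,
}


@dataclass
class Usage:
    tier: str
    articles_this_month: int


def can_submit(usage: Usage) -> bool:
    """Return True if the user may submit another article this month."""
    limit = TIER_LIMITS[usage.tier]
    return limit is None or usage.articles_this_month < limit
```

A check like this would run in the submission handler before a job is added to the queue, with the counter reset at the start of each billing period.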

### MVP for Paid Launch
1. Stripe integration with subscription management
2. Usage tracking and enforcement
3. Email notifications
4. Basic admin dashboard
5. Improved error handling and retry logic
6. PostgreSQL migration
7. Basic API with authentication

### Environment Variables Required
```bash
# Current
OPENAI_API_KEY=
S3_ENDPOINT=
S3_BUCKET=
S3_ACCESS_KEY=
S3_SECRET_KEY=
BASE_URL=
DATA_DIR=  # Used by both Web and Worker services
SESSION_SECRET=
PORT=

# Needed for paid version
STRIPE_SECRET_KEY=
STRIPE_WEBHOOK_SECRET=
STRIPE_PRICE_ID_PERSONAL=
STRIPE_PRICE_ID_PRO=
SENDGRID_API_KEY=  # for transactional emails
SENTRY_DSN=
REDIS_URL=
```
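
A startup check can fail fast when any of the current variables is missing. This is a minimal sketch; `REQUIRED_VARS` mirrors the "Current" list above, and `missing_vars` is a hypothetical helper. Whether some of these (for example `PORT`) have defaults in the real services is not stated here.

```python
import os
from collections.abc import Mapping

# Mirrors the "Current" block above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "S3_ENDPOINT",
    "S3_BUCKET",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
    "BASE_URL",
    "DATA_DIR",
    "SESSION_SECRET",
    "PORT",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```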

### Next Implementation Steps
1. Create `Biz/PodcastItLater/Billing.py` for Stripe integration
2. Add usage tracking to Core.py database operations
3. Implement email notifications in Worker.py
4. Create admin interface endpoints in Web.py
5. Add comprehensive error handling and logging
6. Write test suite
7. Create deployment configuration

## Test Plan

### Overview
The test suite will ensure reliability and correctness of all components before launching the paid product. Tests will be organized into three main categories matching the architecture: Core (database), Web (frontend/API), and Worker (background processing).

### Test Structure
Tests will live in the same file as the code they test, following the pattern already established in the codebase. Each module will define its test classes next to the functionality they cover:

- `Biz/PodcastItLater/Core.py` - Contains database logic and TestDatabase, TestUserManagement, TestQueueOperations, TestEpisodeManagement classes
- `Biz/PodcastItLater/Web.py` - Contains web interface and TestAuthentication, TestArticleSubmission, TestRSSFeed, TestAdminInterface classes
- `Biz/PodcastItLater/Worker.py` - Contains background worker and TestArticleExtraction, TestTextToSpeech, TestJobProcessing classes

Each file will follow this pattern:
```python
# Main code implementation
class Database:
    ...

# Test class next to the class it is testing
class TestDatabase(Test.TestCase):
    """Test the Database class."""

    def test_init_db(self) -> None:
        """Verify all tables and indexes are created correctly."""
        ...
```

This keeps tests close to the code they test, making it easier to maintain and understand the relationship between implementation and tests.
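
For a fuller picture of the pattern, here is a runnable sketch using an in-memory SQLite database. It substitutes the standard library's `unittest.TestCase` for the project's `Test.TestCase`, and the `Database` class is a tiny stand-in rather than the real one in `Core.py`; both are assumptions made so the example is self-contained.

```python
import sqlite3
import unittest


class Database:
    """Tiny stand-in for the real Database class in Core.py."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)

    def init_db(self) -> None:
        # IF NOT EXISTS makes repeated calls safe (idempotent).
        self.conn.execute(
            """
            CREATE TABLE IF NOT EXISTS users (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                email TEXT UNIQUE NOT NULL,
                token TEXT UNIQUE NOT NULL,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
            """
        )

    def table_names(self) -> set[str]:
        # Skip SQLite's internal tables such as sqlite_sequence.
        rows = self.conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        ).fetchall()
        return {row[0] for row in rows}


class TestDatabase(unittest.TestCase):
    """Test the Database class."""

    def setUp(self) -> None:
        # Each test gets a fresh in-memory database.
        self.db = Database()
        self.addCleanup(self.db.conn.close)

    def test_init_db(self) -> None:
        """Verify the users table is created."""
        self.db.init_db()
        self.assertEqual(self.db.table_names(), {"users"})

    def test_migration_idempotency(self) -> None:
        """Verify init_db can run twice without error."""
        self.db.init_db()
        self.db.init_db()
        self.assertEqual(self.db.table_names(), {"users"})
```

A fresh connection per test keeps the tests independent of one another.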

### Core Tests (Core.py)

#### TestDatabase
- `test_init_db` - Verify all tables and indexes are created correctly
- `test_connection_context_manager` - Ensure connections are properly closed
- `test_migration_idempotency` - Verify migrations can run multiple times safely

#### TestUserManagement
- `test_create_user` - Create user with unique email and token
- `test_create_duplicate_user` - Verify duplicate emails return existing user
- `test_get_user_by_email` - Retrieve user by email
- `test_get_user_by_token` - Retrieve user by RSS token
- `test_get_user_by_id` - Retrieve user by ID
- `test_invalid_user_lookups` - Verify None returned for non-existent users
- `test_token_uniqueness` - Ensure tokens are unique and generated from a cryptographically secure source

#### TestQueueOperations
- `test_add_to_queue` - Add job with user association
- `test_get_pending_jobs` - Retrieve jobs in correct order
- `test_update_job_status` - Update status and error messages
- `test_retry_job` - Reset failed jobs for retry
- `test_delete_job` - Remove jobs from queue
- `test_get_retryable_jobs` - Find jobs eligible for retry
- `test_user_queue_isolation` - Ensure users only see their own jobs
- `test_status_counts` - Verify status aggregation queries

#### TestEpisodeManagement
- `test_create_episode` - Create episode with user association
- `test_get_recent_episodes` - Retrieve episodes in reverse chronological order
- `test_get_user_episodes` - Ensure user isolation for episodes
- `test_episode_metadata` - Verify duration and content_length storage

### Web Tests (Web.py)

#### TestAuthentication
- `test_login_new_user` - Auto-create user on first login
- `test_login_existing_user` - Login with existing email
- `test_login_invalid_email` - Reject malformed emails
- `test_session_persistence` - Verify session across requests
- `test_protected_routes` - Ensure auth required for user actions

#### TestArticleSubmission
- `test_submit_valid_url` - Accept well-formed URLs
- `test_submit_invalid_url` - Reject malformed URLs
- `test_submit_without_auth` - Reject unauthenticated submissions
- `test_submit_creates_job` - Verify job creation in database
- `test_htmx_response` - Ensure proper HTMX response format

#### TestRSSFeed
- `test_feed_generation` - Generate valid RSS XML
- `test_feed_user_isolation` - Only show user's episodes
- `test_feed_invalid_token` - Return 404 for bad tokens
- `test_feed_metadata` - Verify personalized feed titles
- `test_feed_episode_order` - Ensure reverse chronological order
- `test_feed_enclosures` - Verify audio URLs and metadata


#### TestAdminInterface
- `test_queue_status_view` - Verify queue display
- `test_retry_action` - Test retry button functionality
- `test_delete_action` - Test delete button functionality
- `test_user_data_isolation` - Ensure users only see own data
- `test_status_summary` - Verify status counts display

### Worker Tests (Worker.py)

#### TestArticleExtraction
- `test_extract_valid_article` - Extract from well-formed HTML
- `test_extract_missing_title` - Handle articles without titles
- `test_extract_empty_content` - Handle empty articles
- `test_extract_network_error` - Handle connection failures
- `test_extract_timeout` - Handle slow responses
- `test_content_sanitization` - Remove unwanted elements

#### TestTextToSpeech
- `test_tts_generation` - Generate audio from text
- `test_tts_chunking` - Handle long articles with chunking
- `test_tts_empty_text` - Handle empty input
- `test_tts_special_characters` - Handle unicode and special chars
- `test_llm_text_preparation` - Verify LLM editing
- `test_llm_failure_fallback` - Handle LLM API failures
- `test_chunk_concatenation` - Verify audio joining

#### TestJobProcessing
- `test_process_job_success` - Complete pipeline execution
- `test_process_job_extraction_failure` - Handle bad URLs
- `test_process_job_tts_failure` - Handle TTS errors
- `test_process_job_s3_failure` - Handle upload errors
- `test_job_retry_logic` - Verify exponential backoff
- `test_max_retries` - Stop after max attempts
- `test_concurrent_processing` - Handle multiple jobs
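
The retry behavior that `test_job_retry_logic` and `test_max_retries` target could be expressed as two small pure functions, which makes it easy to test without sleeping. The base delay, cap, and maximum of 3 retries below are assumptions, not values taken from `Worker.py`; `retry_count` corresponds to the column in the `queue` table.

```python
MAX_RETRIES = 3          # assumed limit
BASE_DELAY_SECONDS = 60  # assumed delay before the first retry
MAX_DELAY_SECONDS = 3600  # assumed cap on any single delay


def retry_delay(retry_count: int) -> int:
    """Seconds to wait before the next attempt: base * 2^retry_count, capped."""
    return min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * 2**retry_count)


def should_retry(retry_count: int) -> bool:
    """Return True while the job has attempts left."""
    return retry_count < MAX_RETRIES
```

With these values a failed job would wait 60, 120, and then 240 seconds before being marked as permanently failed.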

### Integration Tests

#### TestEndToEnd
- `test_web_to_podcast` - Full pipeline from web submission
- `test_multiple_users` - Concurrent multi-user scenarios
- `test_error_recovery` - System recovery from failures

### Test Infrastructure

#### Fixtures and Mocks
- Mock OpenAI API responses
- Mock S3/DigitalOcean Spaces
- In-memory SQLite for fast tests
- Test data generators for articles
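
Mocking the storage layer might look like the sketch below, where `unittest.mock.Mock` stands in for the S3 client. The `publish_audio` helper is hypothetical, and the `put_object(Bucket=..., Key=..., Body=...)` call mirrors boto3's S3 client; whether `Worker.py` actually uses boto3 is an assumption.

```python
from unittest.mock import Mock


def publish_audio(storage, bucket: str, key: str, audio: bytes, base_url: str) -> str:
    """Upload audio bytes and return the public URL (hypothetical helper)."""
    storage.put_object(Bucket=bucket, Key=key, Body=audio)
    return f"{base_url.rstrip('/')}/{key}"


def demo() -> str:
    """Exercise publish_audio with a mock, so no network access is needed."""
    fake_storage = Mock()
    url = publish_audio(
        fake_storage, "episodes", "ep/1.mp3", b"abc", "https://cdn.example.com/"
    )
    # The mock records the call, so the upload arguments can be asserted.
    fake_storage.put_object.assert_called_once_with(
        Bucket="episodes", Key="ep/1.mp3", Body=b"abc"
    )
    return url
```

Passing the client in as a parameter is what makes the substitution possible; a module-level client would need patching instead.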

#### Test Configuration
- Separate test database
- Mock external services by default
- Optional integration tests with real services
- Test coverage reporting
- Performance benchmarks for TTS chunking

### Testing Best Practices
1. Each test should be independent and idempotent
2. Use descriptive test names that explain the scenario
3. Test both happy paths and error conditions
4. Mock external services to avoid dependencies
5. Use fixtures for common test data
6. Measure test coverage (aim for >80%)
7. Run tests in CI/CD pipeline
8. Keep tests fast (< 30 seconds total)

### Pre-Launch Testing Checklist
- [x] All unit tests passing
- [ ] Integration tests with real services
- [ ] Load testing (100 concurrent users)
- [ ] Security testing (SQL injection, XSS)
- [ ] RSS feed validation
- [ ] Audio quality verification
- [ ] Error handling and logging
- [ ] Database backup/restore
- [ ] User data isolation verification
- [ ] Billing integration tests (when implemented)