From 0b005c192b2c141c7f6c9bff4a0702361814c21d Mon Sep 17 00:00:00 2001
From: Ben Sima
Date: Wed, 13 Aug 2025 13:36:30 -0400
Subject: Prototype PodcastItLater

This implements a working prototype of PodcastItLater. It basically just
works for a single user currently, but the articles are nice to listen to
and this is something that we can start to build with.
---
 Biz/PodcastItLater.md | 487 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 320 insertions(+), 167 deletions(-)

(limited to 'Biz/PodcastItLater.md')

diff --git a/Biz/PodcastItLater.md b/Biz/PodcastItLater.md
index bb65082..89fc9b5 100644
--- a/Biz/PodcastItLater.md
+++ b/Biz/PodcastItLater.md
@@ -1,198 +1,351 @@
-# PodcastItLater MVP Implementation Prompt
-
-You are implementing a two-service MVP system called "PodcastItLater" that converts web articles to podcast episodes via email submission. This follows a monorepo namespace structure where all files live under `Biz/PodcastItLater/`.
-
-## Code Organization & Structure
-- **Primary files**:
-  - `Biz/PodcastItLater/Web.py` - web service (ludic app, routes, webhook)
-  - `Biz/PodcastItLater/Worker.py` - background processor
-  - `Biz/PodcastItLater/Models.py` - database schema and data access
-- **Keep code in as few files as possible following monorepo conventions**
-- **Namespaces are always capitalized** (this is a Python project but follows the Haskell-style namespace hierarchy)
-
-## Technical Requirements
-
-### Core Libraries
-```python
-# Required dependencies
-import ludic  # web framework (see provided docs)
-import trafilatura  # content extraction
-import openai  # tts api
-import boto3  # s3 uploads
-import feedgen  # rss generation
-import sqlite3  # database
-import pydub  # audio manipulation if needed
-```
+# PodcastItLater
+
+A service that converts web articles to podcast episodes via email submission or web interface. Users can submit articles and receive them as audio episodes in their personal podcast feed.
+
+## Current Implementation Status
+
+### Architecture
+- **Web Service** (`Biz/PodcastItLater/Web.py`) - Ludic web app with HTMX interface
+- **Background Worker** (`Biz/PodcastItLater/Worker.py`) - Processes articles to audio
+- **Core/Database** (`Biz/PodcastItLater/Core.py`) - Shared database operations
+
+### Features Implemented
+
+#### User Management
+- Email-based registration/login (no passwords)
+- Auto-create users on first email submission
+- Session-based authentication
+- Personal RSS feed tokens
+- User-specific data isolation
+
+#### Article Processing
+- Email submission via Mailgun webhook
+- Manual URL submission via web form
+- Content extraction with trafilatura
+- LLM-powered text preparation for natural speech
+- OpenAI TTS conversion with chunking for long articles
+- S3-compatible storage (Digital Ocean Spaces)
+
+#### Web Interface
+- Login/logout functionality
+- Submit article form
+- Live queue status updates (HTMX)
+- Recent episodes with audio player
+- Personal RSS feed URL display
+- Admin queue view with retry/delete actions
+
+#### RSS Feeds
+- Personalized feeds at `/feed/{user_token}.xml`
+- User-specific episode filtering
+- Customized feed titles based on user email
 
 ### Database Schema
 ```sql
--- Queue table for job processing
-CREATE TABLE IF NOT EXISTS queue (
+-- Users table
+CREATE TABLE users (
+    id INTEGER PRIMARY KEY AUTOINCREMENT,
+    email TEXT UNIQUE NOT NULL,
+    token TEXT UNIQUE NOT NULL,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+);
+
+-- Queue table with user support
+CREATE TABLE queue (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     url TEXT,
     email TEXT,
+    user_id INTEGER REFERENCES users(id),
     status TEXT DEFAULT 'pending',
+    retry_count INTEGER DEFAULT 0,
     created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
     error_message TEXT
 );
 
--- Episodes table for completed podcasts
-CREATE TABLE IF NOT EXISTS episodes (
+-- Episodes table with user support
+CREATE TABLE episodes (
     id INTEGER PRIMARY KEY AUTOINCREMENT,
     title TEXT NOT NULL,
     content_length INTEGER,
     audio_url TEXT NOT NULL,
     duration INTEGER,
+    user_id INTEGER REFERENCES users(id),
     created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
 );
 ```
 
-## Service 1: Web Frontend (`Biz/PodcastItLater/Web.py`)
-
-### Responsibilities
-- Serve ludic + htmx web interface
-- Handle mailgun webhook for email submissions
-- Provide manual article submission form
-- Display processing queue status
-- Serve RSS podcast feed
-- Basic podcast player for testing
-
-### Required Routes
-```python
-@app.route("/")
-def index():
-    # Simple form to submit article URL
-    # Display recent episodes and queue status
-    # Use htmx for dynamic updates
-
-@app.route("/submit", methods=["POST"])
-def submit_article():
-    # Handle manual form submission
-    # Insert into queue table
-    # Return htmx response with status
-
-@app.route("/webhook/mailgun", methods=["POST"])
-def mailgun_webhook():
-    # Parse email, extract URLs from body
-    # Insert into queue table
-    # Verify webhook signature for security
-
-@app.route("/feed.xml")
-def rss_feed():
-    # Generate RSS from episodes table
-    # Use feedgen library
-
-@app.route("/status")
-def queue_status():
-    # HTMX endpoint for live queue updates
-    # Return current queue + recent episodes
-```
-
-### RSS Feed Metadata (hardcoded)
-```python
-RSS_CONFIG = {
-    "title": "Ben's Article Podcast",
-    "description": "Web articles converted to audio",
-    "author": "Ben Sima",
-    "language": "en-US",
-    "base_url": "https://your-domain.com"  # configure via env var
-}
+## Phase 3: Path to Paid Product
+
+### Immediate Priorities
+
+#### 1. Usage Limits & Billing Infrastructure
+- Add usage tracking to users table (articles_processed, audio_minutes)
+- Implement free tier limits (e.g., 10 articles/month)
+- Add subscription status and tier to users
+- Integrate Stripe for payments
+- Create billing webhook handlers
+
+#### 2. Enhanced User Experience
+- Implement article preview/editing before conversion
+- Add voice selection options
+- Support for multiple TTS providers (cost optimization)
+- Batch processing for multiple URLs
+
+#### 3. Content Quality Improvements
+- Better handling of different article types (news, blogs, research papers)
+- Improved code block and technical content handling
+- Table/chart description generation
+- Multi-language support
+- Custom intro/outro options
+
+#### 4. Admin & Analytics
+- Admin dashboard for monitoring all users
+- Usage analytics and metrics
+- Cost tracking per user
+- System health monitoring
+- Automated error alerting
+
+### Technical Improvements Needed
+
+#### Security & Reliability
+- Add rate limiting per user
+- Implement proper API authentication (not just session-based)
+- Add request signing for webhook security
+- Backup and disaster recovery for database
+- Queue persistence across worker restarts
+
+#### Performance & Scalability
+- Move from SQLite to PostgreSQL
+- Implement proper job queue (Redis/RabbitMQ)
+- Add caching layer for processed articles
+- CDN for audio file delivery
+- Horizontal scaling for workers
+
+#### Code Quality
+- Add comprehensive test suite
+- API documentation
+- Error tracking (Sentry)
+- Structured logging with correlation IDs
+- Configuration management (not just env vars)
+
+### Pricing Model Considerations
+- Free tier: 5-10 articles/month, basic voice
+- Personal: $5-10/month, 50 articles, voice selection
+- Pro: $20-30/month, unlimited articles, priority processing
+- API access for developers
+
+### MVP for Paid Launch
+1. Stripe integration with subscription management
+2. Usage tracking and enforcement
+3. Email notifications
+4. Basic admin dashboard
+5. Improved error handling and retry logic
+6. PostgreSQL migration
+7. Basic API with authentication
+
+### Environment Variables Required
+```bash
+# Current
+OPENAI_API_KEY=
+S3_ENDPOINT=
+S3_BUCKET=
+S3_ACCESS_KEY=
+S3_SECRET_KEY=
+MAILGUN_WEBHOOK_KEY=
+BASE_URL=
+DATABASE_PATH= # Used by both Web and Worker services
+SESSION_SECRET=
+PORT=
+
+# Needed for paid version
+STRIPE_SECRET_KEY=
+STRIPE_WEBHOOK_SECRET=
+STRIPE_PRICE_ID_PERSONAL=
+STRIPE_PRICE_ID_PRO=
+SENDGRID_API_KEY= # for transactional emails
+SENTRY_DSN=
+REDIS_URL=
+DATABASE_URL= # PostgreSQL
 ```
 
-## Service 2: Background Worker (`Biz/PodcastItLater/Worker.py`)
+### Next Implementation Steps
+1. Create `Biz/PodcastItLater/Billing.py` for Stripe integration
+2. Add usage tracking to Core.py database operations
+3. Implement email notifications in Worker.py
+4. Create admin interface endpoints in Web.py
+5. Add comprehensive error handling and logging
+6. Write test suite
+7. Create deployment configuration
 
-### Responsibilities
-- Poll queue table every 30 seconds
-- Extract article content using trafilatura
-- Convert text to speech via OpenAI TTS
-- Upload audio files to S3-compatible storage
-- Update episodes table with completed episodes
-- Handle errors with retry logic (3 attempts max)
+## Test Plan
 
-### Processing Pipeline
-```python
-def process_article(queue_item):
-    """Complete article processing pipeline"""
-    try:
-        # 1. Extract content with trafilatura
-        content = extract_article_content(queue_item.url)
+### Overview
+The test suite will ensure reliability and correctness of all components before launching the paid product. Tests will be organized into three main categories matching the architecture: Core (database), Web (frontend/API), and Worker (background processing).
 
-        # 2. Generate audio with OpenAI TTS
-        audio_file = text_to_speech(content)
+### Test Structure
+Tests will be placed in the same file as the code they test, following the pattern established in the codebase. Each module will contain its test classes nearby the functionality that class is testing:
 
-        # 3. Upload to S3
-        audio_url = upload_to_s3(audio_file)
+- `Biz/PodcastItLater/Core.py` - Contains database logic and TestDatabase, TestUserManagement, TestQueueOperations, TestEpisodeManagement classes
+- `Biz/PodcastItLater/Web.py` - Contains web interface and TestAuthentication, TestArticleSubmission, TestRSSFeed, TestWebhook, TestAdminInterface classes
+- `Biz/PodcastItLater/Worker.py` - Contains background worker and TestArticleExtraction, TestTextToSpeech, TestJobProcessing classes
 
-        # 4. Create episode record
-        create_episode(title, audio_url, duration)
+Each file will follow this pattern:
+```python
+# Main code implementation
+class Database:
+    ...
 
-        # 5. Mark queue item as complete
-        mark_complete(queue_item.id)
+# Test class next to the class it is testing
+class TestDatabase(Test.TestCase):
+    """Test the Database class."""
 
-    except Exception as e:
-        handle_error(queue_item.id, str(e))
-```
-
-### Configuration via Environment Variables
-```python
-# Required environment variables
-OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
-S3_ENDPOINT = os.getenv("S3_ENDPOINT")  # Digital Ocean Spaces
-S3_BUCKET = os.getenv("S3_BUCKET")
-S3_ACCESS_KEY = os.getenv("S3_ACCESS_KEY")
-S3_SECRET_KEY = os.getenv("S3_SECRET_KEY")
-MAILGUN_WEBHOOK_KEY = os.getenv("MAILGUN_WEBHOOK_KEY")
+    def test_init_db(self) -> None:
+        """Verify all tables and indexes are created correctly."""
+        ...
 ```
 
-## Email Processing Logic
-- Parse email body for first HTTP/HTTPS URL found
-- If no URL found, treat entire email body as article content
-- Store original email in queue record for debugging
-
-## Error Handling Strategy
-- Log all errors but continue processing
-- Failed jobs marked with 'error' status and error message
-- Retry logic: 3 attempts with exponential backoff
-- Graceful degradation when external services fail
-
-## Audio Configuration
-- **Format**: MP3, 128kbps
-- **TTS Voice**: OpenAI default voice (can add voice selection later)
-- **File naming**: `episode_{timestamp}_{id}.mp3`
-
-## HTMX Frontend Behavior
-- Auto-refresh queue status every 30 seconds
-- Form submission without page reload
-- Simple progress indicators for processing jobs
-- Basic audio player for testing episodes
-
-## Testing Requirements
-
-Create tests covering:
-- Article content extraction accuracy
-- TTS API integration (with mocking)
-- S3 upload/download functionality
-- RSS feed generation and XML validation
-- Email webhook parsing and security
-- Database operations and data integrity
-- End-to-end submission workflow
-
-## Success Criteria
-The MVP should successfully:
-1. Receive article submissions via email webhook
-2. Extract clean article content
-3. Convert text to high-quality audio
-4. Store audio in S3-compatible storage
-5. Generate valid RSS podcast feed
-6. Provide basic web interface for monitoring
-7. Handle errors gracefully without crashing
-
-## Implementation Notes
-- Start with Web.py service first, then Worker.py
-- Use simple polling rather than complex job queues
-- Focus on reliability over performance for MVP
-- Keep total code under 300-400 lines
-- Use reasonable defaults everywhere possible
-- Prioritize working code over perfect code
-
-Implement this as a robust, deployable MVP that can handle real-world article processing workloads while maintaining simplicity.
+This keeps tests close to the code they test, making it easier to maintain and understand the relationship between implementation and tests.
+
+### Core Tests (Core.py)
+
+#### TestDatabase
+- `test_init_db` - Verify all tables and indexes are created correctly
+- `test_connection_context_manager` - Ensure connections are properly closed
+- `test_migration_idempotency` - Verify migrations can run multiple times safely
+
+#### TestUserManagement
+- `test_create_user` - Create user with unique email and token
+- `test_create_duplicate_user` - Verify duplicate emails return existing user
+- `test_get_user_by_email` - Retrieve user by email
+- `test_get_user_by_token` - Retrieve user by RSS token
+- `test_get_user_by_id` - Retrieve user by ID
+- `test_invalid_user_lookups` - Verify None returned for non-existent users
+- `test_token_uniqueness` - Ensure tokens are cryptographically unique
+
+#### TestQueueOperations
+- `test_add_to_queue` - Add job with user association
+- `test_get_pending_jobs` - Retrieve jobs in correct order
+- `test_update_job_status` - Update status and error messages
+- `test_retry_job` - Reset failed jobs for retry
+- `test_delete_job` - Remove jobs from queue
+- `test_get_retryable_jobs` - Find jobs eligible for retry
+- `test_user_queue_isolation` - Ensure users only see their own jobs
+- `test_status_counts` - Verify status aggregation queries
+
+#### TestEpisodeManagement
+- `test_create_episode` - Create episode with user association
+- `test_get_recent_episodes` - Retrieve episodes in reverse chronological order
+- `test_get_user_episodes` - Ensure user isolation for episodes
+- `test_episode_metadata` - Verify duration and content_length storage
+
+### Web Tests (Web.py)
+
+#### TestAuthentication
+- `test_login_new_user` - Auto-create user on first login
+- `test_login_existing_user` - Login with existing email
+- `test_login_invalid_email` - Reject malformed emails
+- `test_session_persistence` - Verify session across requests
+- `test_protected_routes` - Ensure auth required for user actions
+
+#### TestArticleSubmission
+- `test_submit_valid_url` - Accept well-formed URLs
+- `test_submit_invalid_url` - Reject malformed URLs
+- `test_submit_without_auth` - Reject unauthenticated submissions
+- `test_submit_creates_job` - Verify job creation in database
+- `test_htmx_response` - Ensure proper HTMX response format
+
+#### TestRSSFeed
+- `test_feed_generation` - Generate valid RSS XML
+- `test_feed_user_isolation` - Only show user's episodes
+- `test_feed_invalid_token` - Return 404 for bad tokens
+- `test_feed_metadata` - Verify personalized feed titles
+- `test_feed_episode_order` - Ensure reverse chronological order
+- `test_feed_enclosures` - Verify audio URLs and metadata
+
+#### TestWebhook
+- `test_mailgun_signature_valid` - Accept valid signatures
+- `test_mailgun_signature_invalid` - Reject invalid signatures
+- `test_webhook_url_extraction` - Extract URLs from email body
+- `test_webhook_auto_create_user` - Create user on first email
+- `test_webhook_multiple_urls` - Handle emails with multiple URLs
+- `test_webhook_no_urls` - Handle emails without URLs gracefully
+
+#### TestAdminInterface
+- `test_queue_status_view` - Verify queue display
+- `test_retry_action` - Test retry button functionality
+- `test_delete_action` - Test delete button functionality
+- `test_user_data_isolation` - Ensure users only see own data
+- `test_status_summary` - Verify status counts display
+
+### Worker Tests (Worker.py)
+
+#### TestArticleExtraction
+- `test_extract_valid_article` - Extract from well-formed HTML
+- `test_extract_missing_title` - Handle articles without titles
+- `test_extract_empty_content` - Handle empty articles
+- `test_extract_network_error` - Handle connection failures
+- `test_extract_timeout` - Handle slow responses
+- `test_content_sanitization` - Remove unwanted elements
+
+#### TestTextToSpeech
+- `test_tts_generation` - Generate audio from text
+- `test_tts_chunking` - Handle long articles with chunking
+- `test_tts_empty_text` - Handle empty input
+- `test_tts_special_characters` - Handle unicode and special chars
+- `test_llm_text_preparation` - Verify LLM editing
+- `test_llm_failure_fallback` - Handle LLM API failures
+- `test_chunk_concatenation` - Verify audio joining
+
+#### TestJobProcessing
+- `test_process_job_success` - Complete pipeline execution
+- `test_process_job_extraction_failure` - Handle bad URLs
+- `test_process_job_tts_failure` - Handle TTS errors
+- `test_process_job_s3_failure` - Handle upload errors
+- `test_job_retry_logic` - Verify exponential backoff
+- `test_max_retries` - Stop after max attempts
+- `test_concurrent_processing` - Handle multiple jobs
+
+### Integration Tests
+
+#### TestEndToEnd
+- `test_email_to_podcast` - Full pipeline from email to RSS
+- `test_web_to_podcast` - Full pipeline from web submission
+- `test_multiple_users` - Concurrent multi-user scenarios
+- `test_error_recovery` - System recovery from failures
+
+### Test Infrastructure
+
+#### Fixtures and Mocks
+- Mock OpenAI API responses
+- Mock S3/Digital Ocean Spaces
+- Mock Mailgun webhooks
+- In-memory SQLite for fast tests
+- Test data generators for articles
+
+#### Test Configuration
+- Separate test database
+- Mock external services by default
+- Optional integration tests with real services
+- Test coverage reporting
+- Performance benchmarks for TTS chunking
+
+### Testing Best Practices
+1. Each test should be independent and idempotent
+2. Use descriptive test names that explain the scenario
+3. Test both happy paths and error conditions
+4. Mock external services to avoid dependencies
+5. Use fixtures for common test data
+6. Measure test coverage (aim for >80%)
+7. Run tests in CI/CD pipeline
+8. Keep tests fast (< 30 seconds total)
+
+### Pre-Launch Testing Checklist
+- [x] All unit tests passing
+- [ ] Integration tests with real services
+- [ ] Load testing (100 concurrent users)
+- [ ] Security testing (SQL injection, XSS)
+- [ ] RSS feed validation
+- [ ] Audio quality verification
+- [ ] Error handling and logging
+- [ ] Database backup/restore
+- [ ] User data isolation verification
+- [ ] Billing integration tests (when implemented)
--
cgit v1.2.3
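
Note on the TestWebhook cases in the patch above: `test_mailgun_signature_valid` and `test_mailgun_signature_invalid` imply a signature check on the Mailgun webhook. The patch does not show how Web.py performs it, so the following is only a minimal sketch. It assumes Mailgun's documented scheme, in which the signature is the hex HMAC-SHA256 of the timestamp concatenated with the token, keyed by the webhook signing key (the value that would come from `MAILGUN_WEBHOOK_KEY`). The function name `verify_mailgun_signature` is hypothetical and is not taken from the codebase.

```python
import hashlib
import hmac


def verify_mailgun_signature(
    signing_key: str, timestamp: str, token: str, signature: str
) -> bool:
    """Return True if `signature` equals HMAC-SHA256(signing_key, timestamp + token).

    Hypothetical helper for illustration; not the implementation in Web.py.
    """
    expected = hmac.new(
        key=signing_key.encode("utf-8"),
        msg=f"{timestamp}{token}".encode("utf-8"),
        digestmod=hashlib.sha256,
    ).hexdigest()
    # Constant-time comparison, so the check does not leak how many
    # leading characters of the signature matched.
    return hmac.compare_digest(expected, signature)
```

A handler would presumably pass the `timestamp`, `token`, and `signature` fields from the webhook payload and reject the request when this returns False. A production check might also reject stale timestamps to limit replay, which this sketch does not do.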