# PodcastItLater MVP Implementation Prompt

You are implementing a two-service MVP system called "PodcastItLater" that converts web articles to podcast episodes via email submission. This follows a monorepo namespace structure where all files live under `Biz/PodcastItLater/`.

## Code Organization & Structure
- **Primary files**:
  - `Biz/PodcastItLater/Web.py` - web service (ludic app, routes, webhook)
  - `Biz/PodcastItLater/Worker.py` - background processor
  - `Biz/PodcastItLater/Models.py` - database schema and data access
- **Keep code in as few files as possible, following monorepo conventions**
- **Namespaces are always capitalized** (this is a Python project but follows the Haskell-style namespace hierarchy)

## Technical Requirements

### Core Libraries
```python
# Required dependencies
import ludic        # web framework (see provided docs)
import trafilatura  # content extraction
import openai       # TTS API
import boto3        # S3 uploads
import feedgen      # RSS generation
import sqlite3      # database
import pydub        # audio manipulation, if needed
```

### Database Schema
```sql
-- Queue table for job processing
CREATE TABLE IF NOT EXISTS queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT,
    email TEXT,
    status TEXT DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    error_message TEXT
);

-- Episodes table for completed podcasts
CREATE TABLE IF NOT EXISTS episodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    content_length INTEGER,
    audio_url TEXT NOT NULL,
    duration INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

## Service 1: Web Frontend (`Biz/PodcastItLater/Web.py`)

### Responsibilities
- Serve the ludic + htmx web interface
- Handle the Mailgun webhook for email submissions
- Provide a manual article submission form
- Display processing queue status
- Serve the RSS podcast feed
- Basic podcast player for testing

### Required Routes
```python
@app.route("/")
def index():
    # Simple form to submit an article URL
    # Display recent episodes and queue status
    # Use htmx for dynamic updates
    ...

@app.route("/submit", methods=["POST"])
def submit_article():
    # Handle manual form submission
    # Insert into queue table
    # Return htmx response with status
    ...

@app.route("/webhook/mailgun", methods=["POST"])
def mailgun_webhook():
    # Parse email, extract URLs from body
    # Insert into queue table
    # Verify webhook signature for security
    ...

@app.route("/feed.xml")
def rss_feed():
    # Generate RSS from episodes table
    # Use feedgen library
    ...

@app.route("/status")
def queue_status():
    # htmx endpoint for live queue updates
    # Return current queue + recent episodes
    ...
```

### RSS Feed Metadata (hardcoded)
```python
RSS_CONFIG = {
    "title": "Ben's Article Podcast",
    "description": "Web articles converted to audio",
    "author": "Ben Sima",
    "language": "en-US",
    "base_url": "https://your-domain.com",  # configure via env var
}
```

## Service 2: Background Worker (`Biz/PodcastItLater/Worker.py`)

### Responsibilities
- Poll the queue table every 30 seconds
- Extract article content using trafilatura
- Convert text to speech via OpenAI TTS
- Upload audio files to S3-compatible storage
- Update the episodes table with completed episodes
- Handle errors with retry logic (3 attempts max)

### Processing Pipeline
```python
def process_article(queue_item):
    """Complete article processing pipeline."""
    try:
        # 1. Extract title and content with trafilatura
        title, content = extract_article_content(queue_item.url)

        # 2. Generate audio with OpenAI TTS
        audio_file, duration = text_to_speech(content)

        # 3. Upload to S3
        audio_url = upload_to_s3(audio_file)

        # 4. Create episode record
        create_episode(title, audio_url, duration)

        # 5. Mark queue item as complete
        mark_complete(queue_item.id)

    except Exception as e:
        handle_error(queue_item.id, str(e))
```

### Configuration via Environment Variables
```python
import os

# Required environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
S3_ENDPOINT = os.getenv("S3_ENDPOINT")  # DigitalOcean Spaces
S3_BUCKET = os.getenv("S3_BUCKET")
S3_ACCESS_KEY = os.getenv("S3_ACCESS_KEY")
S3_SECRET_KEY = os.getenv("S3_SECRET_KEY")
MAILGUN_WEBHOOK_KEY = os.getenv("MAILGUN_WEBHOOK_KEY")
```

## Email Processing Logic
- Parse the email body for the first HTTP/HTTPS URL found
- If no URL is found, treat the entire email body as article content
- Store the original email in the queue record for debugging

## Error Handling Strategy
- Log all errors but continue processing
- Mark failed jobs with 'error' status and an error message
- Retry logic: 3 attempts with exponential backoff
- Graceful degradation when external services fail

## Audio Configuration
- **Format**: MP3, 128 kbps
- **TTS voice**: OpenAI default voice (voice selection can be added later)
- **File naming**: `episode_{timestamp}_{id}.mp3`

## HTMX Frontend Behavior
- Auto-refresh queue status every 30 seconds
- Form submission without a page reload
- Simple progress indicators for processing jobs
- Basic audio player for testing episodes

## Testing Requirements

Create tests covering:
- Article content extraction accuracy
- TTS API integration (with mocking)
- S3 upload/download functionality
- RSS feed generation and XML validation
- Email webhook parsing and security
- Database operations and data integrity
- End-to-end submission workflow

## Success Criteria
The MVP should successfully:
1. Receive article submissions via email webhook
2. Extract clean article content
3. Convert text to high-quality audio
4. Store audio in S3-compatible storage
5. Generate a valid RSS podcast feed
6. Provide a basic web interface for monitoring
7. Handle errors gracefully without crashing

## Implementation Notes
- Start with the Web.py service first, then Worker.py
- Use simple polling rather than a complex job queue
- Focus on reliability over performance for the MVP
- Keep total code to roughly 300-400 lines
- Use reasonable defaults everywhere possible
- Prioritize working code over perfect code

Implement this as a robust, deployable MVP that can handle real-world article processing workloads while maintaining simplicity.
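
## Appendix: Implementation Sketches

The queue operations implied by `Models.py` and the worker's polling loop (enqueue, claim the oldest pending job, mark complete or failed) map directly onto the queue schema above. A minimal sketch using stdlib `sqlite3`; the function names are illustrative, not part of the spec:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS queue (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT,
    email TEXT,
    status TEXT DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    error_message TEXT
);
"""

def connect(path: str = "podcastitlater.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # access columns by name
    conn.executescript(SCHEMA)
    return conn

def enqueue(conn: sqlite3.Connection, url: str, email: str) -> int:
    cur = conn.execute("INSERT INTO queue (url, email) VALUES (?, ?)", (url, email))
    conn.commit()
    return cur.lastrowid

def claim_next(conn: sqlite3.Connection):
    """Fetch the oldest pending job and mark it processing, or return None."""
    row = conn.execute(
        "SELECT * FROM queue WHERE status = 'pending' ORDER BY created_at, id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE queue SET status = 'processing' WHERE id = ?", (row["id"],))
    conn.commit()
    return row

def mark_complete(conn: sqlite3.Connection, job_id: int) -> None:
    conn.execute("UPDATE queue SET status = 'complete' WHERE id = ?", (job_id,))
    conn.commit()

def mark_error(conn: sqlite3.Connection, job_id: int, message: str) -> None:
    conn.execute(
        "UPDATE queue SET status = 'error', error_message = ? WHERE id = ?",
        (message, job_id),
    )
    conn.commit()
```

Claiming a job inside a single connection is sufficient for one worker; multiple workers would need a transactional `UPDATE ... RETURNING` or similar to avoid double-claiming.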
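The signature check in the `mailgun_webhook` route needs only the standard library. Mailgun signs each webhook POST with an HMAC-SHA256 over `timestamp + token` using your webhook signing key; a sketch, assuming the key arrives via `MAILGUN_WEBHOOK_KEY` as configured above:

```python
import hashlib
import hmac

def verify_mailgun_signature(
    signing_key: str, timestamp: str, token: str, signature: str
) -> bool:
    """Check a Mailgun webhook signature: HMAC-SHA256 over timestamp + token."""
    expected = hmac.new(
        key=signing_key.encode(),
        msg=(timestamp + token).encode(),
        digestmod=hashlib.sha256,
    ).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(expected, signature)
```

Rejecting requests whose `timestamp` is more than a few minutes old is also worth adding, to block replay attacks.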
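The email-processing rule above (take the first HTTP/HTTPS URL, otherwise treat the whole body as article content) is a pure function and easy to pin down. A sketch; the helper name `extract_submission` is mine, not part of the spec:

```python
import re

# First http(s) URL in the text; trailing prose punctuation is trimmed below.
URL_RE = re.compile(r"https?://[^\s<>\"]+")

def extract_submission(body: str) -> dict:
    """Return {'url': ...} for the first link found, else {'content': body}."""
    match = URL_RE.search(body)
    if match:
        # Strip punctuation that often trails a URL in an email sentence.
        return {"url": match.group(0).rstrip(".,;:)")}
    return {"content": body}
```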
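The retry policy (3 attempts with exponential backoff) reduces to a pure delay schedule. A sketch; the 30-second base matches the worker's poll interval but is my assumption, not something the spec fixes:

```python
def backoff_delay(attempt: int, base_seconds: int = 30) -> int:
    """Seconds to wait before retry N (1-based): base * 2^(attempt - 1)."""
    return base_seconds * 2 ** (attempt - 1)

# Attempts 1, 2, 3 wait 30s, 60s, 120s; after the third failure the
# job is marked with 'error' status per the Error Handling Strategy.
```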
