From 100 videos uploaded in 3 years... to batch-processing 178 videos in days.
First version shipped in one week with Claude Code during my "vacation" from entrepreneurship.
Back in 2015-2016, I was the manager of FocusingsVlogs (Mel Dominguez), a Spanish YouTuber who had built an audience of 240,000+ subscribers. Life happened, and she deleted her channel for personal reasons before the pandemic.
I had a backup of almost all her videos—but not the thumbnails.
Fast forward to 2023. I reached out to Mel with an idea: let's re-upload her old videos as a digital memory archive. She agreed, as long as I handled everything so she could stay off the radar.
The problem? Creating thumbnails for 278 videos when:
- The original thumbnails were gone
- The videos were 720p with camera noise and low sharpness
- Mel had changed her appearance significantly since 2012-2016
- Each thumbnail required: watching the video, finding good frames, upscaling, retouching, designing in Canva
- AI image generation existed but couldn't reliably maintain her face
After 3 years of manual work (2023-2025), I had only uploaded 100 videos. 178 still waiting.
Then came the AI revolution of late 2025. Models could finally:
- Generate high-quality images with text
- Preserve facial identity from reference photos
- Understand complex creative prompts
So I shipped the first working version in 8 days during the holidays (December 27, 2025 - January 4, 2026) using Claude Code. Not even consecutive days—New Year's happened in the middle. And yet here we are: complete web interface, face detection pipeline, multi-provider AI integration. I'm genuinely amazed at how fast we can ship full applications now.
It turns what used to take me hours per video into a streamlined, AI-powered pipeline.
```
VIDEO FILE → Scene Detection → Face Extraction → Face Clustering → Transcription
                                                                        ↓
THUMBNAILS ← Image Generation ← Creative Prompts ← LLM Analysis ← Content Understanding
```
One click transforms any video into multiple AI-generated thumbnail options, each preserving the exact facial identity of the person you select.
The system uses strict identity preservation prompts that prevent AI models from "averaging" or "beautifying" faces. Your reference person appears in thumbnails looking exactly like themselves—not a similar-looking AI interpretation.
"A friend of this person should INSTANTLY recognize them in the generated image."
Switch between image generation providers based on your needs:
| Provider | Best Models | Max References | Strengths |
|---|---|---|---|
| Google Gemini | gemini-3-pro-image-preview | 14 images | Best identity preservation |
| OpenAI | gpt-image-1.5 | 16 images | High fidelity |
| Poe | nanobananapro, flux2pro | 14 images | Fast, reliable |
| Replicate | FLUX 1.1 Pro | 1 image | Open-source option |
No manual face labeling needed. The system:
- Detects all faces in the video using InsightFace
- Creates 512-dimensional embeddings for each face
- Clusters faces by identity using DBSCAN
- Separates clusters by scene (handles costumes/disguises)
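As a rough sketch of the clustering step, here is DBSCAN over synthetic L2-normalized embeddings (the `eps` and `min_samples` values are assumptions, not the pipeline's tuned parameters):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_faces(embeddings: np.ndarray, eps: float = 0.4) -> np.ndarray:
    """Group face embeddings by identity with DBSCAN.

    Embeddings are L2-normalized so cosine distance is meaningful; faces
    whose neighbors all lie farther than `eps` away get label -1
    (outliers: blurry crops, background strangers, etc.).
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return DBSCAN(eps=eps, min_samples=3, metric="cosine").fit_predict(normed)

# Synthetic demo: two identities (5 embeddings each) plus one outlier
rng = np.random.default_rng(0)
person_a = rng.normal(0, 1, 512) + rng.normal(0, 0.01, (5, 512))
person_b = rng.normal(0, 1, 512) + rng.normal(0, 0.01, (5, 512))
outlier = rng.normal(0, 1, (1, 512))
labels = cluster_faces(np.vstack([person_a, person_b, outlier]))
```

DBSCAN is a natural fit here because the number of people in a video is unknown in advance, and the `-1` noise label cleanly absorbs one-off background faces.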
Not all frames are equal. The system scores each face by:
- Quality: Sharpness, brightness, contrast
- Pose: Front-facing preferred over profiles
- Expression: Balanced selection (smiling, neutral, mouth closed)
- Size: Larger faces score higher
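The exact weights aren't documented, but the scoring idea can be sketched like this (the weights, normalization constants, and `score_face` helper are all hypothetical):

```python
import numpy as np

# Hypothetical weights — the real pipeline's values are not documented here.
WEIGHTS = {"sharpness": 0.4, "pose": 0.25, "size": 0.2, "brightness": 0.15}

def sharpness(gray: np.ndarray) -> float:
    """Variance of a Laplacian approximation: blurry crops score near 0."""
    lap = (-4 * gray[1:-1, 1:-1] + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

def score_face(gray_crop: np.ndarray, yaw_deg: float, frame_area: int) -> float:
    s = min(sharpness(gray_crop) / 500.0, 1.0)            # normalize, cap at 1
    pose = max(0.0, 1.0 - abs(yaw_deg) / 90.0)            # frontal=1, profile=0
    size = min(gray_crop.size / frame_area * 10, 1.0)     # larger faces score higher
    bright = 1.0 - abs(gray_crop.mean() / 255.0 - 0.5) * 2  # mid-tones preferred
    return (WEIGHTS["sharpness"] * s + WEIGHTS["pose"] * pose
            + WEIGHTS["size"] * size + WEIGHTS["brightness"] * bright)
```

A sharp, front-facing crop therefore outranks a blurry or profile shot of the same person, which is exactly the property the reference selection needs.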
Uses LLMs (Claude, GPT-4, Gemini) to analyze video transcripts and generate:
- Multiple creative concepts per video
- Suggested clickable titles
- Color palettes and mood descriptions
- Text overlay suggestions
Beyond thumbnails, the tool generates YouTube-ready metadata:
- Titles: Multiple options in different styles (neutral, SEO-optimized, clickbait)
- Descriptions: With optional timestamps, hashtags, and calls-to-action
- Title-aware thumbnails: When you select a title, the image generation takes it into account to create visually coherent thumbnails that complement your chosen title
Server-Sent Events (SSE) provide instant feedback:
- Watch scene detection progress
- See faces being extracted live
- Track thumbnail generation in real-time
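Under the hood, SSE is just a line-oriented text format. Here is a minimal, dependency-free sketch of the wire format (the event names and payload fields are illustrative, not the project's actual schema):

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Serialize one event in the SSE (text/event-stream) wire format."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def parse_sse(stream: str):
    """Parse a text/event-stream body back into (event, data) pairs."""
    for block in stream.strip().split("\n\n"):
        event, data = None, None
        for line in block.splitlines():
            if line.startswith("event: "):
                event = line[len("event: "):]
            elif line.startswith("data: "):
                data = json.loads(line[len("data: "):])
        yield event, data

# Two progress events, as the pipeline might emit them
feed = (sse_event("progress", {"stage": "scene_detection", "pct": 40})
        + sse_event("progress", {"stage": "face_extraction", "pct": 10}))
```

Because each event ends with a blank line, the browser's `EventSource` API can dispatch updates as they arrive, with no polling.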
- Python 3.10+ (3.11 recommended)
- FFmpeg installed and in PATH
  ```bash
  # Windows (PowerShell as admin)
  winget install FFmpeg

  # Linux
  sudo apt install ffmpeg

  # Mac
  brew install ffmpeg
  ```
- Redis (required for Web UI, not needed for CLI)
  ```bash
  # Windows: use Laragon (recommended), Docker, or WSL
  # https://laragon.org/ - includes Redis out of the box

  # Linux
  sudo apt install redis-server

  # Mac
  brew install redis
  ```
- NVIDIA GPU recommended (works on CPU, 5-10x slower)
- API keys for at least one provider (Gemini, OpenAI, or Poe)
```bash
# Clone the repository
git clone https://github.com/jordicor/youtube_thumbnail_generator.git
cd youtube_thumbnail_generator

# Create virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Linux/Mac

# Install dependencies
pip install -r requirements.txt

# Copy environment template
copy .env.example .env       # Windows
# cp .env.example .env       # Linux/Mac
```

Edit `.env` with your settings:
```ini
# Required: at least one image provider
GEMINI_API_KEY=your-gemini-key
# OPENAI_API_KEY=your-openai-key
# POE_API_KEY=your-poe-key

# Required: at least one LLM for prompts
ANTHROPIC_API_KEY=your-anthropic-key
# OPENAI_API_KEY=your-openai-key

# Directories
VIDEOS_DIR=C:/path/to/your/videos
OUTPUT_DIR=./output

# Image generation settings
IMAGE_PROVIDER=gemini
GEMINI_IMAGE_MODEL=gemini-3-pro-image-preview

# Transcription (local Whisper is free)
USE_LOCAL_WHISPER=true
WHISPER_MODEL=turbo
TRANSCRIPTION_LANGUAGE=en
```

The web interface uses Gran Sabio LLM for intelligent prompt generation: a multi-provider AI orchestration engine that handles all LLM calls with unified API key management.
1. Clone Gran Sabio LLM:

   ```bash
   git clone https://github.com/jordicor/GranSabio_LLM.git
   cd GranSabio_LLM
   pip install -r requirements.txt
   ```

2. Configure Gran Sabio (add your API keys):

   ```bash
   cp .env.template .env
   # Edit .env with your API keys (OpenAI, Anthropic, Google, etc.)
   ```

3. Start Gran Sabio server:

   ```bash
   python main.py
   # Server starts at http://localhost:8000
   ```

   Port conflict note: both Gran Sabio and this project default to port 8000. You have two options:

   - Run Gran Sabio on a different port:

     ```bash
     python main.py --port 8001
     ```

     then set `GRANSABIO_LLM_URL=http://localhost:8001` in this project's `.env`.

   - Or run this project on a different port:

     ```bash
     uvicorn api.main:app --port 8080
     ```

4. Configure this project to use Gran Sabio by adding to your `.env`:

   ```ini
   GRANSABIO_CLIENT_PATH=C:/path/to/GranSabio_LLM/client
   GRANSABIO_LLM_URL=http://localhost:8000
   ```

Why Gran Sabio? Instead of duplicating API integration code, Gran Sabio provides a unified interface to multiple AI providers (OpenAI, Anthropic, Google, xAI, OpenRouter) with features like multi-model QA, thinking modes, and automatic retries. Your API keys are configured once in Gran Sabio and shared across projects.
| Mode | Requirements | Best For |
|---|---|---|
| Web UI | Redis + Gran Sabio | Interactive use, face cluster selection, real-time progress |
| CLI | None (just Python deps) | Batch processing, automation, simpler setup |
The CLI processes videos synchronously without a job queue, so Redis and Gran Sabio are not needed. However, you lose the ability to manually select face clusters and monitor progress in real-time.
Make sure Redis and Gran Sabio are running first, then:
```bash
python -m uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```

Open http://localhost:8000 in your browser.
```bash
# Process all videos in configured VIDEOS_DIR
python main.py

# Process a single video
python main.py --single "path/to/video.mp4"

# Preview what would be processed
python main.py --dry-run
```

- Click the directory dropdown
- Add a new directory pointing to your video folder
- Click "Scan" to detect videos
- Select videos (or use "Select All")
- Click "Analyze"
- Wait for the pipeline to complete:
- Scene detection (~30s per 10min video)
- Face extraction (~2min per 1000 frames)
- Face clustering (~5s)
- Transcription (~1min per 10min audio)
- Click on an analyzed video
- Generate titles and descriptions using the AI tabs (optional but recommended)
- Review detected face clusters
- Star the cluster representing the person for thumbnails
- Select your preferred title—the AI will design thumbnails that complement it
- Optionally:
- Add custom instructions
- Upload a style reference image
- Adjust concepts and variations count
- Click "Generate Thumbnails"
- Watch as AI creates your options
- Download favorites or the full ZIP
Videos with costumes, disguises, or location changes can confuse simple face clustering. The "View by Person + Scene" mode subdivides each person into scene-specific clusters:
- Person 1 - Scene 1 (regular outfit)
- Person 1 - Scene 5 (pirate costume)
- Person 1 - Scene 12 (different location)
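Conceptually, this mode regroups faces by the composite key (person, scene) instead of by person alone. A minimal sketch with hypothetical face records:

```python
from collections import defaultdict

# Hypothetical face records: each detected face carries the identity cluster
# it was assigned to and the scene its frame belongs to.
faces = [
    {"person": 1, "scene": 1},   # regular outfit
    {"person": 1, "scene": 1},
    {"person": 1, "scene": 5},   # same person, pirate costume
    {"person": 2, "scene": 1},
]

def subdivide_by_scene(faces):
    """Regroup identity clusters by the composite key (person, scene)."""
    groups = defaultdict(list)
    for face in faces:
        groups[(face["person"], face["scene"])].append(face)
    return groups

groups = subdivide_by_scene(faces)
```

Splitting on scene boundaries means a costume change produces its own reference group instead of contaminating the person's main cluster.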
Auto-selected frames not ideal? The Reference Frame Manager lets you:
- Browse all extracted frames
- Filter by scene
- Drag to reorder priority
- Add/remove from AI references
Upload an image for style inspiration:
- The AI analyzes composition, colors, lighting
- Your style reference influences the output
- Identity preservation remains priority #1
For power users processing many videos:
```bash
# Process all videos in directory
python main.py

# Single video
python main.py --single "my_video.mp4"

# Custom generation settings
python main.py --num-prompts 5 --num-variations 3 --image-provider poe

# Force regeneration
python main.py --force-thumbnails
```

```
youtube_thumbnail_generator/
├── api/                      # FastAPI web server
│   ├── main.py               # App initialization
│   └── routes/               # REST endpoints
├── services/                 # Business logic
│   ├── analysis_service.py   # Analysis pipeline (1600+ lines)
│   └── generation_service.py # Generation pipeline
├── templates/                # Jinja2 HTML (dark theme UI)
├── static/                   # CSS + JavaScript
├── database/                 # SQLite with async access (auto-created on first run)
└── output/                   # Generated files per video
```
For detailed technical documentation, see ARCHITECTURE.md.
- Minimum: 8GB RAM, any modern CPU
- Recommended: 16GB RAM, NVIDIA GPU with 4GB+ VRAM
- Python 3.10-3.11
- FFmpeg (for audio extraction)
- Windows 10/11 (primary target), Linux/Mac should work
- Gemini: Free tier available, ~$0.01-0.05 per thumbnail
- OpenAI GPT Image: ~$0.02-0.08 per thumbnail
- Poe: Subscription-based, varies by model
- Whisper Local: Free (uses your GPU/CPU)
- YouTubers re-uploading old content without original thumbnails
- Channel managers handling multiple videos
- Content creators who forgot to shoot thumbnail photos
- Anyone who wants AI-generated thumbnails that actually look like them
- Acerting Art: 430K+ subscribers, relaxation/meditation music
- GranSabio LLM: Multi-layer QA system for LLM content generation
- VR Relaxation Space Room: VR meditation app (Unity/Maya)
- Neo Atlantis: RPG game with 200+ 3D models (1999)
- Security Research Archive: 6 vulnerabilities, 2 CVEs (2004-2006)
MIT License - Use freely, attribution appreciated.
- Mel (FocusingsVlogs) for trusting me with her content archive
- Claude Code for making the initial week-long dev sprint possible
- ElevenLabs for their excellent speech-to-text API with speaker diarization
- The open-source community behind InsightFace, PySceneDetect, and Whisper
Built with obsession, automation, and a healthy disregard for manual labor.