From 100 videos uploaded in 3 years... to batch-processing 178 videos in days.
First version shipped in one week with Claude Code during my "vacation" from entrepreneurship.
Back in 2015-2016, I was the manager of FocusingsVlogs (Mel Dominguez), a Spanish YouTuber who had built an audience of 240,000+ subscribers. Life happened, and she deleted her channel for personal reasons before the pandemic.
I had a backup of almost all her videos—but not the thumbnails.
Fast forward to 2023. I reached out to Mel with an idea: let's re-upload her old videos as a digital memory archive. She agreed, as long as I handled everything so she could stay off the radar.
The problem? Creating thumbnails for 278 videos when:
- The original thumbnails were gone
- The videos were 720p with camera noise and low sharpness
- Mel had changed her appearance significantly since 2012-2016
- Each thumbnail required: watching the video, finding good frames, upscaling, retouching, designing in Canva
- AI image generation existed but couldn't reliably maintain her face
After 3 years of manual work (2023-2025), I had only uploaded 100 videos. 178 still waiting.
Then came the AI revolution of late 2025. Models could finally:
- Generate high-quality images with text
- Preserve facial identity from reference photos
- Understand complex creative prompts
So I shipped the first working version in 8 days during the holidays (December 27, 2025 - January 4, 2026) using Claude Code. Not even consecutive days—New Year's happened in the middle. And yet here we are: complete web interface, face detection pipeline, multi-provider AI integration. I'm genuinely amazed at how fast we can ship full applications now.
It turns what used to take me hours per video into a streamlined, AI-powered pipeline.
```
VIDEO FILE → Scene Detection → Face Extraction → Face Clustering → Transcription
                                                                        ↓
THUMBNAILS ← Image Generation ← Creative Prompts ← LLM Analysis ← Content Understanding
```
One click transforms any video into multiple AI-generated thumbnail options, each preserving the exact facial identity of the person you select.
The system uses strict identity preservation prompts that prevent AI models from "averaging" or "beautifying" faces. Your reference person appears in thumbnails looking exactly like themselves—not a similar-looking AI interpretation.
"A friend of this person should INSTANTLY recognize them in the generated image."
Switch between image generation providers based on your needs:
| Provider | Best Models | Max References | Strengths |
|---|---|---|---|
| Google Gemini | gemini-3-pro-image-preview | 14 images | Best identity preservation |
| OpenAI | gpt-image-1.5 | 16 images | High fidelity |
| Poe | nanobananapro, flux2pro | 14 images | Fast, reliable |
| Replicate | FLUX 1.1 Pro | 1 image | Open-source option |
No manual face labeling needed. The system:
- Detects all faces in the video using InsightFace
- Creates 512-dimensional embeddings for each face
- Clusters faces by identity using DBSCAN
- Separates clusters by scene (handles costumes/disguises)
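As a rough sketch of the clustering step, here is DBSCAN over synthetic L2-normalized embeddings (the `eps` and `min_samples` values are assumptions, not the pipeline's tuned parameters):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_faces(embeddings: np.ndarray, eps: float = 0.4) -> np.ndarray:
    """Group face embeddings by identity with DBSCAN.

    Embeddings are L2-normalized so cosine distance is meaningful; faces
    whose neighbors all lie farther than `eps` away get label -1
    (outliers: blurry crops, background strangers, etc.).
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return DBSCAN(eps=eps, min_samples=3, metric="cosine").fit_predict(normed)

# Synthetic demo: two identities (5 embeddings each) plus one outlier
rng = np.random.default_rng(0)
person_a = rng.normal(0, 1, 512) + rng.normal(0, 0.01, (5, 512))
person_b = rng.normal(0, 1, 512) + rng.normal(0, 0.01, (5, 512))
outlier = rng.normal(0, 1, (1, 512))
labels = cluster_faces(np.vstack([person_a, person_b, outlier]))
```

DBSCAN is a natural fit here because the number of people in a video is unknown in advance, and the `-1` noise label cleanly absorbs one-off background faces.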
Not all frames are equal. The system scores each face by:
- Quality: Sharpness, brightness, contrast
- Pose: Front-facing preferred over profiles
- Expression: Balanced selection (smiling, neutral, mouth closed)
- Size: Larger faces score higher
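The exact weights aren't documented, but the scoring idea can be sketched like this (the weights, normalization constants, and `score_face` helper are all hypothetical):

```python
import numpy as np

# Hypothetical weights — the real pipeline's values are not documented here.
WEIGHTS = {"sharpness": 0.4, "pose": 0.25, "size": 0.2, "brightness": 0.15}

def sharpness(gray: np.ndarray) -> float:
    """Variance of a Laplacian approximation: blurry crops score near 0."""
    lap = (-4 * gray[1:-1, 1:-1] + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

def score_face(gray_crop: np.ndarray, yaw_deg: float, frame_area: int) -> float:
    s = min(sharpness(gray_crop) / 500.0, 1.0)            # normalize, cap at 1
    pose = max(0.0, 1.0 - abs(yaw_deg) / 90.0)            # frontal=1, profile=0
    size = min(gray_crop.size / frame_area * 10, 1.0)     # larger faces score higher
    bright = 1.0 - abs(gray_crop.mean() / 255.0 - 0.5) * 2  # mid-tones preferred
    return (WEIGHTS["sharpness"] * s + WEIGHTS["pose"] * pose
            + WEIGHTS["size"] * size + WEIGHTS["brightness"] * bright)
```

A sharp, front-facing crop therefore outranks a blurry or profile shot of the same person, which is exactly the property the reference selection needs.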
Uses LLMs (Claude, GPT-4, Gemini) to analyze video transcripts and generate:
- Multiple creative concepts per video
- Suggested clickable titles
- Color palettes and mood descriptions
- Text overlay suggestions
Beyond thumbnails, the tool generates YouTube-ready metadata:
- Titles: Multiple options in different styles (neutral, SEO-optimized, clickbait)
- Descriptions: With optional timestamps, hashtags, and calls-to-action
- Title-aware thumbnails: When you select a title, the image generation takes it into account to create visually coherent thumbnails that complement your chosen title
Server-Sent Events (SSE) provide instant feedback:
- Watch scene detection progress
- See faces being extracted live
- Track thumbnail generation in real-time
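Under the hood, SSE is just a line-oriented text format. Here is a minimal, dependency-free sketch of the wire format (the event names and payload fields are illustrative, not the project's actual schema):

```python
import json

def sse_event(event: str, data: dict) -> str:
    """Serialize one event in the SSE (text/event-stream) wire format."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def parse_sse(stream: str):
    """Parse a text/event-stream body back into (event, data) pairs."""
    for block in stream.strip().split("\n\n"):
        event, data = None, None
        for line in block.splitlines():
            if line.startswith("event: "):
                event = line[len("event: "):]
            elif line.startswith("data: "):
                data = json.loads(line[len("data: "):])
        yield event, data

# Two progress events, as the pipeline might emit them
feed = (sse_event("progress", {"stage": "scene_detection", "pct": 40})
        + sse_event("progress", {"stage": "face_extraction", "pct": 10}))
```

Because each event ends with a blank line, the browser's `EventSource` API can dispatch updates as they arrive, with no polling.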
- Python 3.10+ (3.11 recommended)
- FFmpeg installed and in PATH
  ```bash
  # Windows (PowerShell as admin)
  winget install FFmpeg

  # Linux
  sudo apt install ffmpeg

  # Mac
  brew install ffmpeg
  ```
- Redis (required for Web UI, not needed for CLI)
  ```bash
  # Windows: use Laragon (recommended), Docker, or WSL
  # https://laragon.org/ - includes Redis out of the box

  # Linux
  sudo apt install redis-server

  # Mac
  brew install redis
  ```
- NVIDIA GPU recommended (works on CPU, 5-10x slower)
- API keys for at least one provider (Gemini, OpenAI, or Poe)
```bash
# Clone the repository
git clone https://github.com/jordicor/youtube_thumbnail_generator.git
cd youtube_thumbnail_generator

# Create virtual environment
python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # Linux/Mac

# Install dependencies
pip install -r requirements.txt

# Copy environment template
copy .env.example .env       # Windows
# cp .env.example .env       # Linux/Mac
```

Edit `.env` with your settings:
```ini
# Required: at least one image provider
GEMINI_API_KEY=your-gemini-key
# OPENAI_API_KEY=your-openai-key
# POE_API_KEY=your-poe-key

# Required: at least one LLM for prompts
ANTHROPIC_API_KEY=your-anthropic-key
# OPENAI_API_KEY=your-openai-key

# Directories
VIDEOS_DIR=C:/path/to/your/videos
OUTPUT_DIR=./output

# Image generation settings
IMAGE_PROVIDER=gemini
GEMINI_IMAGE_MODEL=gemini-3-pro-image-preview

# Transcription (local Whisper is free)
USE_LOCAL_WHISPER=true
WHISPER_MODEL=turbo
TRANSCRIPTION_LANGUAGE=en
```

The web interface uses Gran Sabio LLM for intelligent prompt generation: a multi-provider AI orchestration engine that handles all LLM calls with unified API key management.
1. Clone Gran Sabio LLM:

   ```bash
   git clone https://github.com/jordicor/GranSabio_LLM.git
   cd GranSabio_LLM
   pip install -r requirements.txt
   ```

2. Configure Gran Sabio (add your API keys):

   ```bash
   cp .env.template .env
   # Edit .env with your API keys (OpenAI, Anthropic, Google, etc.)
   ```

3. Start Gran Sabio server:

   ```bash
   python main.py
   # Server starts at http://localhost:8000
   ```

   Port conflict note: both Gran Sabio and this project default to port 8000. You have two options:

   - Run Gran Sabio on a different port:

     ```bash
     python main.py --port 8001
     ```

     then set `GRANSABIO_LLM_URL=http://localhost:8001` in this project's `.env`.

   - Or run this project on a different port:

     ```bash
     uvicorn api.main:app --port 8080
     ```

4. Configure this project to use Gran Sabio by adding to your `.env`:

   ```ini
   GRANSABIO_CLIENT_PATH=C:/path/to/GranSabio_LLM/client
   GRANSABIO_LLM_URL=http://localhost:8000
   ```

Why Gran Sabio? Instead of duplicating API integration code, Gran Sabio provides a unified interface to multiple AI providers (OpenAI, Anthropic, Google, xAI, OpenRouter) with features like multi-model QA, thinking modes, and automatic retries. Your API keys are configured once in Gran Sabio and shared across projects.
| Mode | Requirements | Best For |
|---|---|---|
| Web UI | Redis + Gran Sabio | Interactive use, face cluster selection, real-time progress |
| CLI | None (just Python deps) | Batch processing, automation, simpler setup |
The CLI processes videos synchronously without a job queue, so Redis and Gran Sabio are not needed. However, you lose the ability to manually select face clusters and monitor progress in real-time.
Make sure Redis and Gran Sabio are running first, then:
```bash
python -m uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
```

Open http://localhost:8000 in your browser.
```bash
# Process all videos in configured VIDEOS_DIR
python main.py

# Process a single video
python main.py --single "path/to/video.mp4"

# Preview what would be processed
python main.py --dry-run
```

- Click the directory dropdown
- Add a new directory pointing to your video folder
- Click "Scan" to detect videos
- Select videos (or use "Select All")
- Click "Analyze"
- Wait for the pipeline to complete:
- Scene detection (~30s per 10min video)
- Face extraction (~2min per 1000 frames)
- Face clustering (~5s)
- Transcription (~1min per 10min audio)
- Click on an analyzed video
- Generate titles and descriptions using the AI tabs (optional but recommended)
- Review detected face clusters
- Star the cluster representing the person for thumbnails
- Select your preferred title—the AI will design thumbnails that complement it
- Optionally:
- Add custom instructions
- Upload a style reference image
- Adjust concepts and variations count
- Click "Generate Thumbnails"
- Watch as AI creates your options
- Download favorites or the full ZIP
Videos with costumes, disguises, or location changes can confuse simple face clustering. The "View by Person + Scene" mode subdivides each person into scene-specific clusters:
- Person 1 - Scene 1 (regular outfit)
- Person 1 - Scene 5 (pirate costume)
- Person 1 - Scene 12 (different location)
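Conceptually, this mode regroups faces by the composite key (person, scene) instead of by person alone. A minimal sketch with hypothetical face records:

```python
from collections import defaultdict

# Hypothetical face records: each detected face carries the identity cluster
# it was assigned to and the scene its frame belongs to.
faces = [
    {"person": 1, "scene": 1},   # regular outfit
    {"person": 1, "scene": 1},
    {"person": 1, "scene": 5},   # same person, pirate costume
    {"person": 2, "scene": 1},
]

def subdivide_by_scene(faces):
    """Regroup identity clusters by the composite key (person, scene)."""
    groups = defaultdict(list)
    for face in faces:
        groups[(face["person"], face["scene"])].append(face)
    return groups

groups = subdivide_by_scene(faces)
```

Splitting on scene boundaries means a costume change produces its own reference group instead of contaminating the person's main cluster.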
Auto-selected frames not ideal? The Reference Frame Manager lets you:
- Browse all extracted frames
- Filter by scene
- Drag to reorder priority
- Add/remove from AI references
Upload an image for style inspiration:
- The AI analyzes composition, colors, lighting
- Your style reference influences the output
- Identity preservation remains priority #1
For power users processing many videos:
```bash
# Process all videos in directory
python main.py

# Single video
python main.py --single "my_video.mp4"

# Custom generation settings
python main.py --num-prompts 5 --num-variations 3 --image-provider poe

# Force regeneration
python main.py --force-thumbnails
```

```
youtube_thumbnail_generator/
├── api/                      # FastAPI web server
│   ├── main.py               # App initialization
│   └── routes/               # REST endpoints
├── services/                 # Business logic
│   ├── analysis_service.py   # Analysis pipeline (1600+ lines)
│   └── generation_service.py # Generation pipeline
├── templates/                # Jinja2 HTML (dark theme UI)
├── static/                   # CSS + JavaScript
├── database/                 # SQLite with async access (auto-created on first run)
└── output/                   # Generated files per video
```
For detailed technical documentation, see ARCHITECTURE.md.
- Minimum: 8GB RAM, any modern CPU
- Recommended: 16GB RAM, NVIDIA GPU with 4GB+ VRAM
- Python 3.10-3.11
- FFmpeg (for audio extraction)
- Windows 10/11 (primary target), Linux/Mac should work
- Gemini: Free tier available, ~$0.01-0.05 per thumbnail
- OpenAI GPT Image: ~$0.02-0.08 per thumbnail
- Poe: Subscription-based, varies by model
- Whisper Local: Free (uses your GPU/CPU)
- YouTubers re-uploading old content without original thumbnails
- Channel managers handling multiple videos
- Content creators who forgot to shoot thumbnail photos
- Anyone who wants AI-generated thumbnails that actually look like them
- Acerting Art: 430K+ subscribers, relaxation/meditation music
- GranSabio LLM: Multi-layer QA system for LLM content generation
- VR Relaxation Space Room: VR meditation app (Unity/Maya)
- Neo Atlantis: RPG game with 200+ 3D models (1999)
- Security Research Archive: 6 vulnerabilities, 2 CVEs (2004-2006)
MIT License - Use freely, attribution appreciated.
- Mel (FocusingsVlogs) for trusting me with her content archive
- Claude Code for making the initial week-long dev sprint possible
- ElevenLabs for their excellent speech-to-text API with speaker diarization
- The open-source community behind InsightFace, PySceneDetect, and Whisper
Built with obsession, automation, and a healthy disregard for manual labor.