
A multi-model visual embedding pipeline generates several vector representations per image. InsightFace ArcFace produces 512-dimensional face identity embeddings capturing geometric facial features. CLIP ViT-L/14 produces 768-dimensional scene embeddings capturing visual context, style, and composition. DINO generates self-supervised features for structural understanding. Face regions are extracted before embedding, so identity vectors are independent of background and scene content.
Every processed image receives multiple embedding vectors, stored in PostgreSQL with the pgvector extension. An HNSW index on each embedding column enables sub-second approximate nearest-neighbour queries across libraries of tens of thousands of images. Combined queries weight distances from several vector spaces so that a similarity search can balance identity, style, and scene factors.
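The weighting of several vector spaces can be illustrated with a minimal sketch. It assumes cosine distance in every space and a brute-force, in-memory scan; the function names, the space labels ("face", "scene"), and the weights are hypothetical and not taken from the pipeline. In the database, a comparable approach would fetch candidates per column through the HNSW indexes and re-rank them with a weighted sum like this one.

```python
import math
from typing import Dict, List, Sequence, Tuple

Embeddings = Dict[str, Sequence[float]]  # space name -> vector (hypothetical layout)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def combined_distance(query: Embeddings, candidate: Embeddings,
                      weights: Dict[str, float]) -> float:
    """Weighted average of per-space cosine distances."""
    total_weight = sum(weights.values())
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        weight * cosine_distance(query[space], candidate[space])
        for space, weight in weights.items()
    )
    return weighted / total_weight


def rank(query: Embeddings, library: Dict[str, Embeddings],
         weights: Dict[str, float], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k closest image ids with their combined distances."""
    scored = sorted(
        (combined_distance(query, embeddings, weights), image_id)
        for image_id, embeddings in library.items()
    )
    return [(image_id, distance) for distance, image_id in scored[:k]]
```

Shifting weight toward "face" favours images of the same person regardless of setting, while shifting it toward "scene" favours images with similar context and composition.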
Batch processing is distributed across GPU nodes with coordinated dispatch. Embedding generation is compute-intensive: each image requires face detection, alignment, region extraction, and forward passes through several different models. The pipeline batches work efficiently, persists results incrementally, and checkpoints progress so that interrupted runs resume without reprocessing completed images.
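The resume behaviour can be sketched as follows. This is a single-process illustration under stated assumptions, not the pipeline's actual mechanism: the checkpoint is a hypothetical JSON-lines file of completed image ids, and `embed_batch` stands in for whatever runs the models and persists the vectors.

```python
import json
from pathlib import Path
from typing import Callable, Iterator, List, Sequence, Set


def load_checkpoint(path: Path) -> Set[str]:
    """Read the ids of images already completed; tolerate a torn last line."""
    done: Set[str] = set()
    if not path.exists():
        return done
    with path.open() as handle:
        for line in handle:
            try:
                done.add(json.loads(line)["image_id"])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # partial or malformed line from an interrupted write
    return done


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_with_checkpoint(image_ids: Sequence[str],
                        embed_batch: Callable[[List[str]], None],
                        checkpoint_path: Path,
                        batch_size: int = 32) -> int:
    """Process only images not yet recorded; return how many were processed."""
    done = load_checkpoint(checkpoint_path)
    pending = [image_id for image_id in image_ids if image_id not in done]
    processed = 0
    with checkpoint_path.open("a") as handle:
        for batch in chunks(pending, batch_size):
            embed_batch(batch)  # hypothetical: compute and persist embeddings
            # Record completion only after the batch has succeeded.
            for image_id in batch:
                handle.write(json.dumps({"image_id": image_id}) + "\n")
            handle.flush()
            processed += len(batch)
    return processed
```

Because ids are recorded only after a batch succeeds, a crash costs at most one batch of repeated work on the next run.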
Downstream systems consume these embeddings for different purposes: the clustering engine uses them for character group discovery, the curation system uses them for face-based search, the analysis pipeline combines embedding distance with perceptual hashing for duplicate detection, and the similarity API exposes them for ad-hoc queries. The embedding pipeline is the foundation layer: compute once, query from everywhere.
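As an illustration of pairing a perceptual hash with embedding distance for duplicate detection, here is a minimal sketch. The text does not say which hash is used, so this assumes a simple average hash over an already downscaled grayscale grid. The thresholds and the rule that both signals must agree are hypothetical choices, and the embedding distance is taken as precomputed.

```python
from typing import Sequence


def average_hash(gray: Sequence[Sequence[float]]) -> int:
    """Average hash: one bit per pixel, set when the pixel is above the mean.

    Assumes `gray` is a small grayscale grid (commonly 8x8); resizing and
    colour conversion are left out of this sketch.
    """
    flat = [pixel for row in gray for pixel in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for pixel in flat:
        bits = (bits << 1) | (1 if pixel > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_duplicate(hash_a: int, hash_b: int, embedding_distance: float,
                 max_hash_bits: int = 6,
                 max_embedding_distance: float = 0.1) -> bool:
    """Flag a pair only when both signals agree (hypothetical thresholds)."""
    return (hamming(hash_a, hash_b) <= max_hash_bits
            and embedding_distance <= max_embedding_distance)
```

Requiring both signals is the conservative option. Accepting either one would catch more near-duplicates, such as crops or re-encodes, at the cost of more false positives.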