Document Intelligence Pipeline
AI Pipeline

The ingestion layer for every text-based data source in the platform. Web pages, scraped content, PDFs, conversation logs: any text goes in, and structured intelligence comes out. A single processing pass runs multiple analysis stages. Named entity recognition extracts people, organisations, locations, dates, and products. Topic modelling with BERTopic identifies subject matter. Semantic embeddings are generated and indexed for vector retrieval. Geolocation mentions are validated against real coordinates.
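As a rough illustration of the single-pass, multi-stage idea, here is a minimal sketch in which each stage is a plain function that reads and writes a shared result dictionary. Everything in it is an assumption made for illustration: the `Document` class, the stage names, the regex stand-in for named entity recognition, and the tiny `GAZETTEER` lookup are hypothetical and are not the production models or data.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    """A document plus the results accumulated by each stage (hypothetical type)."""
    text: str
    results: Dict[str, object] = field(default_factory=dict)


Stage = Callable[[Document], None]

# Hypothetical gazetteer standing in for a real coordinate source.
GAZETTEER: Dict[str, Tuple[float, float]] = {
    "Paris": (48.8566, 2.3522),
    "Nairobi": (-1.2921, 36.8219),
}


def extract_capitalised_terms(doc: Document) -> None:
    """Toy stand-in for NER: treat capitalised words as candidate entities."""
    candidates = re.findall(r"\b[A-Z][a-z]+\b", doc.text)
    doc.results["entities"] = sorted(set(candidates))


def validate_locations(doc: Document) -> None:
    """Keep only candidate entities that resolve to known coordinates."""
    entities = doc.results.get("entities", [])
    doc.results["locations"] = {e: GAZETTEER[e] for e in entities if e in GAZETTEER}


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Run every configured stage once, in order, over the same document."""
    for stage in stages:
        stage(doc)
    return doc


doc = run_pipeline(
    Document("Alice flew from Paris to Nairobi."),
    [extract_capitalised_terms, validate_locations],
)
```

Because the stage list is just data, dropping or reordering a stage is a configuration change rather than a code change, which is one plausible way to get the configurable stages described later.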

Sentiment analysis goes beyond positive/negative classification. A custom multi-emotion lexicon scores text against a spectrum of distinct emotional states, mapped from transformer classification output through a semantic proximity layer. The result is emotional arc tracking — how sentiment shifts through a document section by section, paragraph by paragraph. For long-form content, this reveals narrative structure and tonal shifts that a single aggregate score would flatten into nothing.
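One way to picture the mapping from classifier output to a finer emotion spectrum is a nearest-neighbour lookup in a shared vector space. The sketch below assumes, purely for illustration, that both classifier labels and lexicon entries are placed at hand-picked 2-D (valence, arousal) coordinates; the label set, the lexicon entries, and the coordinates are all hypothetical, and the real proximity layer may work quite differently.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, float]

# Hypothetical (valence, arousal) anchors for coarse classifier labels.
CLASSIFIER_LABELS: Dict[str, Vector] = {
    "joy": (0.9, 0.6),
    "anger": (-0.7, 0.8),
    "sadness": (-0.8, -0.5),
}

# Hypothetical finer-grained lexicon placed in the same space.
LEXICON: Dict[str, Vector] = {
    "delighted": (0.95, 0.7),
    "frustrated": (-0.6, 0.6),
    "furious": (-0.8, 0.95),
    "melancholy": (-0.7, -0.6),
}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 when either vector has zero length."""
    norm = math.hypot(*a) * math.hypot(*b)
    return (a[0] * b[0] + a[1] * b[1]) / norm if norm else 0.0


def nearest_emotion(label_scores: Dict[str, float]) -> str:
    """Blend label anchors by score, then pick the closest lexicon entry."""
    x = sum(score * CLASSIFIER_LABELS[label][0] for label, score in label_scores.items())
    y = sum(score * CLASSIFIER_LABELS[label][1] for label, score in label_scores.items())
    return max(LEXICON, key=lambda name: cosine((x, y), LEXICON[name]))


def emotional_arc(paragraph_scores: List[Dict[str, float]]) -> List[str]:
    """One lexicon emotion per paragraph, in document order."""
    return [nearest_emotion(scores) for scores in paragraph_scores]


arc = emotional_arc([{"joy": 1.0}, {"anger": 1.0}, {"sadness": 1.0}])
```

Keeping the arc as an ordered list, rather than averaging it, is what preserves the tonal shifts that an aggregate score would lose.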

Aspect-based sentiment analysis (ABSA) with DeBERTa provides entity-level sentiment — not just 'this document is negative' but 'the sentiment toward [specific entity] in this paragraph is frustrated.' Combined with the emotion lexicon, this gives granular emotional context per entity per section.
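The per-entity, per-section view can be thought of as a two-level index over individual aspect predictions. The following is a small sketch of such an index under stated assumptions: the `AspectSentiment` record, its fields, and the rule of keeping the highest-scoring emotion per pair are hypothetical, and the model call that would produce these records is left out entirely.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AspectSentiment:
    """One prediction: an emotion toward an entity within a section (hypothetical)."""
    section: int
    entity: str
    emotion: str
    score: float


def index_by_entity_and_section(
    records: List[AspectSentiment],
) -> Dict[str, Dict[int, str]]:
    """Keep the highest-scoring emotion for each (entity, section) pair."""
    best: Dict[Tuple[str, int], AspectSentiment] = {}
    for record in records:
        key = (record.entity, record.section)
        if key not in best or record.score > best[key].score:
            best[key] = record

    index: Dict[str, Dict[int, str]] = defaultdict(dict)
    for (entity, section), record in best.items():
        index[entity][section] = record.emotion
    return dict(index)


records = [
    AspectSentiment(section=0, entity="Acme", emotion="hopeful", score=0.6),
    AspectSentiment(section=1, entity="Acme", emotion="frustrated", score=0.8),
    AspectSentiment(section=1, entity="Acme", emotion="neutral", score=0.3),
]
index = index_by_entity_and_section(records)
```

With this shape, `index["Acme"][1]` answers the question of how section 1 feels about one specific entity, independently of the tone of the document as a whole.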

The pipeline feeds every downstream system. Extracted entities populate the knowledge graph. Topic assignments drive content clustering. Embeddings enable semantic search. Emotional analysis informs conversation intelligence. Every document that enters the platform passes through this pipeline, and every system that needs structured data from unstructured text depends on its output.

The pipeline handles batch ingestion with checkpoint logging, deduplication against existing records, and configurable processing stages. It can run as a one-off import or as a continuous ingestion endpoint that accepts new content as it arrives from scrapers, crawlers, or manual submission.
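Checkpointed, deduplicating ingestion can be sketched with nothing more than SQLite and a content hash. The example below is a minimal illustration, not the production schema: the table names, the `open_store` and `ingest_batch` helpers, and the choice of SHA-256 over whitespace-stripped text as the deduplication key are all assumptions.

```python
import hashlib
import sqlite3
from typing import Sequence


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) document and checkpoint tables if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "content_hash TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "batch_id TEXT PRIMARY KEY, last_index INTEGER NOT NULL)"
    )
    return conn


def ingest_batch(conn: sqlite3.Connection, batch_id: str, texts: Sequence[str]) -> int:
    """Ingest a batch, resuming after the last checkpoint; return new-row count."""
    row = conn.execute(
        "SELECT last_index FROM checkpoints WHERE batch_id = ?", (batch_id,)
    ).fetchone()
    start = row[0] + 1 if row else 0

    inserted = 0
    for i in range(start, len(texts)):
        text = texts[i]
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        exists = conn.execute(
            "SELECT 1 FROM documents WHERE content_hash = ?", (digest,)
        ).fetchone()
        if exists is None:
            conn.execute(
                "INSERT INTO documents (content_hash, body) VALUES (?, ?)",
                (digest, text),
            )
            inserted += 1
        # Record progress after every item so a crash resumes at the next index.
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (batch_id, last_index) VALUES (?, ?)",
            (batch_id, i),
        )
        conn.commit()
    return inserted


conn = open_store()
first_run = ingest_batch(conn, "batch-1", ["alpha", "beta", "alpha"])
resumed_run = ingest_batch(conn, "batch-1", ["alpha", "beta", "alpha", "gamma"])
```

Committing the document and its checkpoint together keeps the two in step, so re-running the same batch after an interruption skips what was already processed and only picks up the new tail.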

// Tech stack

FastAPI, Python, spaCy, BERTopic, DeBERTa, Weaviate, nomic-embed-text-v2-moe, UMAP, HDBSCAN, SQLite
Live in production