AI Search Engine
Web Platform

Self-hosted AI-augmented search engine combining multi-engine query dispatch, neural re-ranking, and deep content extraction. User intent detection classifies queries before execution to optimise engine selection and result processing. Query augmentation generates expanded search terms to improve recall across engines that handle different content types.
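
A minimal sketch of the classify, augment, then dispatch flow described above, using only the standard library. The intent labels, regex patterns, engine names, and expansion suffixes are all hypothetical placeholders; the actual classifier and engine list are not described here and may look quite different.

```python
import asyncio
import re
from typing import Awaitable, Callable

# Hypothetical intent patterns; a production classifier would likely be model-based.
INTENT_PATTERNS = {
    "news": re.compile(r"\b(latest|today|breaking|news)\b", re.I),
    "code": re.compile(r"\b(error|exception|python|api|stack trace)\b", re.I),
    "academic": re.compile(r"\b(paper|study|arxiv|doi)\b", re.I),
}

# Hypothetical routing table: which engines to query for each intent.
ENGINES_BY_INTENT = {
    "news": ["news_engine", "general_engine"],
    "code": ["code_engine", "general_engine"],
    "academic": ["scholar_engine", "general_engine"],
    "general": ["general_engine"],
}

# Hypothetical expansion terms appended to the query to improve recall.
EXPANSIONS = {
    "news": ["today"],
    "code": ["example", "documentation"],
    "academic": ["pdf", "review"],
}

Engine = Callable[[str], Awaitable[list[str]]]


def detect_intent(query: str) -> str:
    """Return the first intent whose pattern matches, else 'general'."""
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(query):
            return intent
    return "general"


def augment_query(query: str, intent: str) -> list[str]:
    """Return the original query plus intent-specific expanded variants."""
    return [query] + [f"{query} {term}" for term in EXPANSIONS.get(intent, [])]


async def dispatch(query: str, engines: dict[str, Engine]) -> dict[str, list[str]]:
    """Send every query variant to every engine selected for the detected intent."""
    intent = detect_intent(query)
    variants = augment_query(query, intent)
    names = [name for name in ENGINES_BY_INTENT[intent] if name in engines]
    keys = [name for name in names for _ in variants]
    tasks = [engines[name](variant) for name in names for variant in variants]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    merged: dict[str, list[str]] = {name: [] for name in names}
    for name, result in zip(keys, results):
        if isinstance(result, Exception):
            continue  # one failing engine should not sink the whole query
        merged[name].extend(result)
    return merged
```

Each entry in `engines` is expected to be an async callable that takes a query string and returns a list of result URLs or IDs, so real engine clients can be swapped in without touching the routing logic.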

Multi-model neural re-ranking scores results beyond basic keyword relevance — transformer-based models evaluate semantic match between query intent and document content. BERTopic extracts topic clusters from result sets. KeyBERT pulls key phrases. spaCy runs named entity recognition. The ranking pipeline combines these signals with traditional relevance scoring to surface the most useful results, not just the best keyword matches.
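
One simple way to combine a traditional relevance score with semantic and entity signals is a weighted sum over min-max normalised scores. The sketch below assumes that approach; the field names and weights are illustrative assumptions, not the pipeline's actual configuration, and the scores themselves would come from the models named above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    keyword_score: float   # e.g. the relevance score returned by the source engine
    semantic_score: float  # e.g. a transformer similarity between query and document
    entity_overlap: float  # e.g. fraction of query entities found in the document


# Hypothetical weights; in practice these would be tuned against judged results.
WEIGHTS = {"keyword_score": 0.3, "semantic_score": 0.5, "entity_overlap": 0.2}


def min_max(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; a constant column contributes nothing."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rerank(candidates: list[Candidate]) -> list[tuple[Candidate, float]]:
    """Return candidates with fused scores, sorted best first."""
    if not candidates:
        return []
    fused = [0.0] * len(candidates)
    for field, weight in WEIGHTS.items():
        normalised = min_max([getattr(c, field) for c in candidates])
        for i, value in enumerate(normalised):
            fused[i] += weight * value
    order = sorted(range(len(candidates)), key=lambda i: fused[i], reverse=True)
    return [(candidates[i], fused[i]) for i in order]
```

Normalising each signal per result set keeps a signal with a large raw range, such as a keyword score, from drowning out similarity scores that live in [0, 1].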

Full content scraping with multiple extraction strategies: Playwright for JavaScript-rendered pages, Selenium for session-dependent content, Trafilatura and readability-lxml for article extraction, BeautifulSoup for structural parsing, and newspaper3k for news content. Handles CAPTCHAs (inline classifier), session cookies, infinite scroll, and complex page structures. OCR via Tesseract for text embedded in images. PDF extraction via PyMuPDF.
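
Multiple extraction strategies are commonly arranged as a fallback chain: try the most precise extractor first and fall through when it fails or returns too little text. The sketch below assumes that arrangement. The `min_chars` threshold and the standard-library fallback extractor are illustrative, not the project's actual code.

```python
from html.parser import HTMLParser
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


class _TextCollector(HTMLParser):
    """Collect visible text, skipping script and style content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def plain_text_fallback(html: str) -> Optional[str]:
    """Last-resort extractor: all visible text joined with spaces."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts) or None


def extract_with_fallback(
    html: str,
    extractors: list[tuple[str, Extractor]],
    min_chars: int = 200,  # hypothetical threshold for "enough content"
) -> Optional[tuple[str, str]]:
    """Return (extractor_name, text) from the first extractor that yields enough text.

    If none reaches min_chars, return the longest non-empty result, or None.
    """
    best: Optional[tuple[str, str]] = None
    for name, extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            continue  # a crashing extractor simply hands over to the next one
        if not text:
            continue
        if len(text) >= min_chars:
            return name, text
        if best is None or len(text) > len(best[1]):
            best = (name, text)
    return best
```

In a real chain the list would start with article-oriented extractors (for example a wrapper around `trafilatura.extract`) and end with a structural fallback like the one here.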

Multiple scraping modes: targeted single-URL extraction, multi-query batch scraping, lateral outward crawling from seed URLs, and internal site mapping that follows link structures to build complete site content indexes. Every scraped page gets readable content extraction, metadata preservation, keyword and topic tagging, and entity extraction — structured and ready for downstream ingestion into knowledge graphs, vector stores, or content pipelines.
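
Internal site mapping and lateral crawling can both be seen as a breadth-first traversal over links, differing only in whether off-host links are followed. The sketch below assumes that framing. `fetch_links` is a hypothetical callable that returns the raw `href` values found on a page, and `max_pages` is an illustrative safety limit.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def map_site(
    seed: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
    same_host_only: bool = True,
) -> dict[str, list[str]]:
    """Breadth-first crawl from seed, returning {page_url: [followed outlinks]}.

    same_host_only=True maps a single site; False crawls laterally outward.
    """
    seed = urldefrag(seed).url
    host = urlparse(seed).netloc
    seen = {seed}
    queue = deque([seed])
    graph: dict[str, list[str]] = {}
    while queue and len(graph) < max_pages:
        page = queue.popleft()
        outlinks = []
        for href in fetch_links(page):
            # Resolve relative links and drop #fragments so each page is visited once.
            url = urldefrag(urljoin(page, href)).url
            parsed = urlparse(url)
            if parsed.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar
            if same_host_only and parsed.netloc != host:
                continue
            outlinks.append(url)
            if url not in seen:
                seen.add(url)
                queue.append(url)
        graph[page] = outlinks
    return graph
```

Injecting `fetch_links` keeps the traversal independent of how a page is rendered, so the same loop can sit on top of a plain HTTP fetch or a browser-driven one.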

// Tech stack

FastAPI, Python, Playwright, Selenium, BeautifulSoup4, Trafilatura, newspaper3k, readability-lxml, PyTorch, Transformers, Sentence-Transformers, BERTopic, KeyBERT, spaCy, NLTK, scikit-learn, UMAP, SQLAlchemy, aiosqlite, Pandas, NumPy, Pillow, pytesseract, PyMuPDF, httpx, aiohttp, WebSockets, launchd
Live in production