AI Knowledge Crawler

Autonomous knowledge crawler that uses LLM reasoning as its decision engine. Given a topic or seed URL, it generates search queries, evaluates candidate links for topical relevance, and decides whether to follow a link deeper, pivot to a more promising direction, or stop. The crawl path is determined at each step by an LLM judging content quality and relevance, not by a fixed link-following algorithm.

Each crawled page is extracted using Trafilatura and BeautifulSoup for clean content, then summarised, chunked, and embedded into a vector knowledge base. Embeddings indexed in Weaviate enable semantic search across everything the crawler has collected. The knowledge base is cumulative — each crawl run builds on what previous runs discovered, expanding coverage without reprocessing known content.
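A sketch of the chunking step that sits between extraction and embedding. The overlap-window approach and the size parameters here are illustrative assumptions, not the project's actual settings; in the real pipeline the chunks would then be embedded with nomic-embed-text-v2-moe and indexed in Weaviate.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split extracted page text into overlapping character windows.

    Overlap keeps sentences that straddle a boundary retrievable from
    both neighbouring chunks. (Character counts are illustrative; real
    limits depend on the embedding model's token budget.)
    """
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text), 1), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks
```

For example, a 1000-character page with these defaults yields three chunks, with each pair of neighbours sharing a 50-character overlap.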

Scheduled execution via a system service: configure a topic domain and crawl parameters, then let it run autonomously on a recurring schedule. Each run checks what's already known, identifies gaps, and targets new content. Crawl depth, breadth limits, and relevance thresholds are configurable per topic.
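The per-topic parameters named above could be modelled as a small config object like the one below. The field names and default values are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TopicConfig:
    """Per-topic crawl parameters (field names are illustrative)."""
    topic: str
    max_depth: int = 3                # how many links deep a single branch may go
    max_breadth: int = 10             # candidate links evaluated per page
    relevance_threshold: float = 0.7  # minimum LLM score needed to follow a link
    schedule: str = "0 3 * * *"       # cron expression for the recurring run

# Example: two topics with different depth and strictness settings.
configs = [
    TopicConfig("vector databases", max_depth=4),
    TopicConfig("rust async runtimes", relevance_threshold=0.8),
]
```

Freezing the dataclass keeps a run's parameters immutable once scheduled, so a crawl can be reproduced from its config alone.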

Feeds directly into the content intelligence engine for knowledge graph construction and the RAG pipeline for retrieval-augmented generation. The crawler is the acquisition layer — it finds and structures new information so downstream systems can reason over it. Designed for building deep topical knowledge bases over time rather than broad shallow scraping.

// Tech stack

Python · httpx · Trafilatura · BeautifulSoup4 · APScheduler · SQLite · Weaviate · nomic-embed-text-v2-moe · systemd
Live in production