May 10, 2026
MindMap-AI — Semantic Research Graph for Grounded PDF Intelligence
MindMap-AI converts academic PDFs into a Neo4j-backed semantic graph and answers questions using graph-selected evidence instead of ungrounded document summaries. The system links answers back to citations, graph relationships, and source PDF passages.
- Role
- Full-Stack Engineer & System Designer
- Stack
- FastAPI · Python · Neo4j · OpenAI API · Next.js · React · TypeScript · Zustand · React Force Graph · React PDF · TailwindCSS
- Links
- Live · Repository
Problem
Traditional document RAG systems often treat PDFs as flat text chunks and send large portions of the document directly to an LLM. This approach makes source tracing difficult, weakens relationship awareness between concepts, and increases the risk of unsupported or hallucinated answers. Research documents also contain interconnected entities, citations, claims, and semantic relationships that are difficult to explore through linear retrieval pipelines alone.
Solution
MindMap-AI converts research documents into a Neo4j-backed semantic graph through a multi-stage ingestion pipeline that includes parsing, semantic extraction, normalization, evidence linking, and graph construction. Instead of relying on raw-context prompting, the system retrieves graph-selected evidence through semantic traversal and evidence ranking before composing grounded answers. The frontend connects graph exploration, citations, source passages, and PDF highlighting into a unified research workflow.
Decisions
- Used Neo4j because semantic traversal and relationship querying are core system requirements
- Used relation instances instead of simple graph edges to preserve evidence and provenance
- Separated extraction from parsing so document-structure changes stay isolated from the semantic pipeline
- Avoided full-document prompting to reduce hallucinated responses and improve traceability
- Used Zustand for lightweight interaction-heavy frontend state management
- Prioritized evidence-first retrieval before answer generation to improve answer grounding
Architecture highlights
- Separated PDF parsing, extraction, normalization, and graph writing into independent ingestion stages
- Implemented relation-instance graph modeling to attach evidence and provenance directly to claims
- Built graph-based retrieval before LLM synthesis to improve grounding and reduce hallucinations
- Designed evidence-aware query orchestration with ranking and clustering stages
- Connected semantic answers back to graph nodes, citations, and highlighted PDF passages
- Created an interactive frontend workflow for graph exploration and semantic inspection
Outcomes
- Built an end-to-end PDF-to-graph semantic ingestion workflow
- Implemented grounded semantic query answering with citation-aware retrieval
- Created interactive graph exploration connected to PDF evidence navigation
- Established backend, frontend, and e2e testing structure for semantic workflows
- Designed a modular architecture that allows independent evolution of parsing and retrieval layers
Overview
MindMap-AI is an evidence-grounded research system that converts academic PDFs into a semantic graph and enables graph-based question answering over extracted concepts, relationships, citations, and source passages.
Traditional PDF chat systems often summarize entire documents directly through an LLM, which can produce unsupported answers and weak source traceability. MindMap-AI approaches the problem differently by building a structured semantic graph from the document first, then using graph-selected evidence during query answering.
Instead of treating a PDF as a flat block of text, the system transforms documents into interconnected semantic entities, relationships, evidence nodes, citations, and passages stored inside Neo4j.
System Flow
PDF Upload
→ Parsing
→ Passage Extraction
→ Semantic Entity & Relation Extraction
→ Canonical Normalization
→ Neo4j Graph Construction
→ Evidence Linking
→ Semantic Query Pipeline
→ Grounded Answer Generation
→ PDF Citation Navigation
The ingestion pipeline separates parsing, extraction, normalization, graph construction, and query orchestration into independent stages. This allows the semantic layer to evolve independently from PDF parsing logic.
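The staged design above can be sketched as a chain of independent, composable functions. This is an illustrative sketch, not the project's actual API: the stage names (`parse_pdf`, `extract_passages`) and the `Document` container are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Illustrative container passed between ingestion stages."""
    pages: list[str] = field(default_factory=list)
    passages: list[str] = field(default_factory=list)


def parse_pdf(raw: bytes) -> Document:
    # Stage 1: parsing. Stubbed here; the real stage would use a
    # layout-aware PDF parser to produce page text.
    return Document(pages=[raw.decode(errors="ignore")])


def extract_passages(doc: Document) -> Document:
    # Stage 2: split page text into retrievable passages.
    doc.passages = [p.strip() for page in doc.pages
                    for p in page.split("\n\n") if p.strip()]
    return doc


def run_pipeline(raw: bytes, stages) -> Document:
    # Each later stage (extraction, normalization, graph writing) plugs in
    # here without the earlier stages knowing about it.
    doc = parse_pdf(raw)
    for stage in stages:
        doc = stage(doc)
    return doc
```

Because each stage only consumes and returns the shared document object, a parsing change cannot silently break extraction or graph construction.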
Semantic Graph Architecture
The system uses Neo4j as the primary graph database and stores extracted semantic structures as interconnected nodes and relations.
Instead of representing relationships as simple graph edges, the system reifies relations into explicit relation instances:
(source)-[:OUT_REL]->(RelationInstance)-[:TO]->(target)
(Evidence)-[:SUPPORTS]->(RelationInstance)
(Evidence)-[:FROM_PASSAGE]->(Passage)
This design allows the system to:
- attach evidence directly to relationships
- preserve provenance and citation traceability
- support confidence-aware retrieval
- connect answers back to exact document passages
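The reified pattern above maps directly onto parameterized Cypher. The query below is a sketch consistent with the node labels and relationship types shown; the helper name and parameter names are illustrative, not the project's actual schema code.

```python
def relation_instance_cypher() -> str:
    """Cypher that reifies a relation and attaches evidence provenance.

    Mirrors the pattern:
      (source)-[:OUT_REL]->(RelationInstance)-[:TO]->(target)
      (Evidence)-[:SUPPORTS]->(RelationInstance)
      (Evidence)-[:FROM_PASSAGE]->(Passage)
    """
    return """
    MERGE (s:Entity {canonical: $source})
    MERGE (t:Entity {canonical: $target})
    CREATE (r:RelationInstance {predicate: $predicate,
                                confidence: $confidence})
    CREATE (s)-[:OUT_REL]->(r)
    CREATE (r)-[:TO]->(t)
    MERGE (p:Passage {id: $passage_id})
    CREATE (e:Evidence {quote: $quote})
    CREATE (e)-[:SUPPORTS]->(r)
    CREATE (e)-[:FROM_PASSAGE]->(p)
    """

# With the official neo4j Python driver this would run roughly as:
#   with driver.session() as session:
#       session.run(relation_instance_cypher(), source=..., target=...,
#                   predicate=..., confidence=..., passage_id=..., quote=...)
```

Because the relation is a node rather than an edge, evidence nodes can point at it directly, which plain Neo4j relationships do not allow.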
Canonical normalization is applied during extraction so semantically similar entities can be merged into unified graph representations.
Query Pipeline
The query pipeline does not send the entire document directly to the LLM.
Instead, the system performs staged semantic retrieval:
Question
→ Candidate Selection
→ Traversal Planning
→ Evidence Ranking
→ Evidence Clustering
→ Answer Composition
→ Grounded Response
The query engine interprets the question, selects relevant graph candidates, traverses semantic relations, ranks supporting evidence, and only then composes an answer.
This architecture improves source grounding and reduces unsupported responses.
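The ranking and clustering stages can be sketched as follows. This is a minimal stand-in: the real pipeline combines graph distance and semantic similarity into the score, and the `Evidence` shape shown here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    quote: str
    score: float       # combined traversal + similarity relevance
    passage_id: str


def rank_and_cluster(evidence: list[Evidence],
                     top_k: int = 5) -> dict[str, list[Evidence]]:
    """Keep the top-k evidence items, grouped by source passage.

    Clustering by passage keeps answer composition tied to a small set
    of citable locations rather than scattered fragments.
    """
    ranked = sorted(evidence, key=lambda e: e.score, reverse=True)[:top_k]
    clusters: dict[str, list[Evidence]] = {}
    for e in ranked:
        clusters.setdefault(e.passage_id, []).append(e)
    return clusters
```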
The system also includes abstain behavior when evidence coverage is insufficient:
No evidence → no confident answer
This prevents the model from generating highly confident answers without supporting graph evidence.
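The abstain gate amounts to a coverage check before any generation call. The thresholds below are illustrative placeholders, not the system's tuned values:

```python
def has_sufficient_evidence(scores: list[float],
                            min_items: int = 2,
                            min_score: float = 0.5) -> bool:
    """Gate answer generation on evidence coverage.

    Returns True only when enough sufficiently strong evidence exists;
    otherwise the pipeline abstains instead of composing an answer.
    """
    strong = [s for s in scores if s >= min_score]
    return len(strong) >= min_items
```

Answer composition runs only when this gate passes; otherwise the system reports that evidence coverage is insufficient.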
Frontend Interaction Model
The frontend is designed as an interactive semantic exploration interface rather than a simple chatbot UI.
Users can:
- upload and manage research PDFs
- explore semantic graph structures visually
- inspect extracted entities and relationships
- navigate supporting evidence
- jump directly to highlighted PDF passages
- focus graph traversal based on query results
The interface connects answers, graph nodes, citations, and source passages into a unified exploration workflow.
Query results dynamically drive graph focus, allowing semantic relationships and supporting evidence to remain visually connected during exploration.
Engineering Decisions
Why Neo4j
The project is fundamentally relationship-heavy. Neo4j was chosen because semantic traversal, evidence linking, and citation connectivity are central to the retrieval model.
Why Relation Instances Instead of Direct Edges
Plain graph edges cannot be the target of other relationships, so evidence nodes cannot point at them. Reifying relations as nodes lets the system attach citations, confidence scores, and supporting passages directly to each claim.
Why Separate Parsing and Extraction
PDF parsing and semantic extraction are isolated into separate stages so changes to document-structure handling do not ripple into the extraction pipeline.
Why Graph Retrieval Before LLM Synthesis
The system retrieves and filters evidence before answer generation instead of relying on raw-context prompting. This improves grounding and reduces hallucinated responses.
Why Evidence-First Answering
The retrieval pipeline prioritizes evidence selection and ranking before answer composition so generated responses remain tied to verifiable document context.
Current State
MindMap-AI currently supports:
- academic PDF ingestion
- semantic graph construction
- evidence-grounded query answering
- citation-aware traversal
- graph visualization
- PDF evidence highlighting
- frontend semantic exploration workflows
The project also includes backend tests, frontend component tests, end-to-end smoke tests, and semantic evaluation fixtures for retrieval and answer quality validation.
Future improvements include stronger authentication, upload controls, rate limiting, and more advanced semantic retrieval optimization.