May 10, 2026
MindMap-AI — Semantic Research Graph for Grounded PDF Intelligence
MindMap-AI converts academic PDFs into a Neo4j-backed semantic graph and answers questions using graph-selected evidence instead of ungrounded document summaries. The system links answers back to citations, graph relationships, and source PDF passages.
- Role
- Full-Stack Engineer & System Designer
- Stack
- FastAPI · Python · Neo4j · OpenAI API · Next.js · React · TypeScript · Zustand · React Force Graph · React PDF · TailwindCSS
- Links
- Live · Repository
Problem
Traditional document RAG systems often treat PDFs as flat text chunks and send large portions of the document directly to an LLM. This approach makes source tracing difficult, weakens relationship awareness between concepts, and increases the risk of unsupported or hallucinated answers. Research documents also contain interconnected entities, citations, claims, and semantic relationships that are difficult to explore through linear retrieval pipelines alone.
Solution
MindMap-AI converts research documents into a Neo4j-backed semantic graph through a multi-stage ingestion pipeline that includes parsing, semantic extraction, normalization, evidence linking, and graph construction. Instead of relying on raw-context prompting, the system retrieves graph-selected evidence through semantic traversal and evidence ranking before composing grounded answers. The frontend connects graph exploration, citations, source passages, and PDF highlighting into a unified research workflow.
Decisions
- Used Neo4j because semantic traversal and relationship querying are core system requirements
- Used relation instances instead of simple graph edges to preserve evidence and provenance
- Separated extraction from parsing so document-structure changes stay isolated from the semantic pipeline
- Avoided full-document prompting to reduce hallucinated responses and improve traceability
- Used Zustand for lightweight interaction-heavy frontend state management
- Prioritized evidence-first retrieval before answer generation to improve answer grounding
Architecture highlights
- Separated PDF parsing, extraction, normalization, and graph writing into independent ingestion stages
- Implemented relation-instance graph modeling to attach evidence and provenance directly to claims
- Built graph-based retrieval before LLM synthesis to improve grounding and reduce hallucinations
- Designed evidence-aware query orchestration with ranking and clustering stages
- Connected semantic answers back to graph nodes, citations, and highlighted PDF passages
- Created an interactive frontend workflow for graph exploration and semantic inspection
Outcomes
- Built an end-to-end PDF-to-graph semantic ingestion workflow
- Implemented grounded semantic query answering with citation-aware retrieval
- Created interactive graph exploration connected to PDF evidence navigation
- Established backend, frontend, and e2e testing structure for semantic workflows
- Designed a modular architecture that allows independent evolution of parsing and retrieval layers
Overview
MindMap-AI is an evidence-grounded research system that converts academic PDFs into a semantic graph and enables graph-based question answering over extracted concepts, relationships, citations, and source passages.
Traditional PDF chat systems often summarize entire documents directly through an LLM, which can produce unsupported answers and weak source traceability. MindMap-AI approaches the problem differently by building a structured semantic graph from the document first, then using graph-selected evidence during query answering.
Instead of treating a PDF as a flat block of text, the system transforms documents into interconnected semantic entities, relationships, evidence nodes, citations, and passages stored inside Neo4j.
System Flow
PDF Upload
→ Parsing
→ Passage Extraction
→ Semantic Entity & Relation Extraction
→ Canonical Normalization
→ Neo4j Graph Construction
→ Evidence Linking
→ Semantic Query Pipeline
→ Grounded Answer Generation
→ PDF Citation Navigation
The ingestion pipeline separates parsing, extraction, normalization, graph construction, and query orchestration into independent stages. This allows the semantic layer to evolve independently from PDF parsing logic.
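The staged design above can be sketched as a chain of independent, composable functions. This is an illustrative sketch, not the project's actual API: the stage names (`parse_pdf`, `extract_passages`) and the `Document` container are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Illustrative container passed between ingestion stages."""
    pages: list[str] = field(default_factory=list)
    passages: list[str] = field(default_factory=list)


def parse_pdf(raw: bytes) -> Document:
    # Stage 1: parsing. Stubbed here; the real stage would use a
    # layout-aware PDF parser to produce page text.
    return Document(pages=[raw.decode(errors="ignore")])


def extract_passages(doc: Document) -> Document:
    # Stage 2: split page text into retrievable passages.
    doc.passages = [p.strip() for page in doc.pages
                    for p in page.split("\n\n") if p.strip()]
    return doc


def run_pipeline(raw: bytes, stages) -> Document:
    # Each later stage (extraction, normalization, graph writing) plugs in
    # here without the earlier stages knowing about it.
    doc = parse_pdf(raw)
    for stage in stages:
        doc = stage(doc)
    return doc
```

Because each stage only consumes and returns the shared document object, a parsing change cannot silently break extraction or graph construction.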
Semantic Graph Architecture
The system uses Neo4j as the primary graph database and stores extracted semantic structures as interconnected nodes and relations.
Instead of representing relationships as simple graph edges, the system reifies relations into explicit relation instances:
(source)-[:OUT_REL]->(RelationInstance)-[:TO]->(target)
(Evidence)-[:SUPPORTS]->(RelationInstance)
(Evidence)-[:FROM_PASSAGE]->(Passage)
This design allows the system to:
- attach evidence directly to relationships
- preserve provenance and citation traceability
- support confidence-aware retrieval
- connect answers back to exact document passages
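The reified pattern above maps directly onto parameterized Cypher. The query below is a sketch consistent with the node labels and relationship types shown; the helper name and parameter names are illustrative, not the project's actual schema code.

```python
def relation_instance_cypher() -> str:
    """Cypher that reifies a relation and attaches evidence provenance.

    Mirrors the pattern:
      (source)-[:OUT_REL]->(RelationInstance)-[:TO]->(target)
      (Evidence)-[:SUPPORTS]->(RelationInstance)
      (Evidence)-[:FROM_PASSAGE]->(Passage)
    """
    return """
    MERGE (s:Entity {canonical: $source})
    MERGE (t:Entity {canonical: $target})
    CREATE (r:RelationInstance {predicate: $predicate,
                                confidence: $confidence})
    CREATE (s)-[:OUT_REL]->(r)
    CREATE (r)-[:TO]->(t)
    MERGE (p:Passage {id: $passage_id})
    CREATE (e:Evidence {quote: $quote})
    CREATE (e)-[:SUPPORTS]->(r)
    CREATE (e)-[:FROM_PASSAGE]->(p)
    """

# With the official neo4j Python driver this would run roughly as:
#   with driver.session() as session:
#       session.run(relation_instance_cypher(), source=..., target=...,
#                   predicate=..., confidence=..., passage_id=..., quote=...)
```

Because the relation is a node rather than an edge, evidence nodes can point at it directly, which plain Neo4j relationships do not allow.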
Canonical normalization is applied during extraction so semantically similar entities can be merged into unified graph representations.
Query Pipeline
The query pipeline does not send the entire document directly to the LLM.
Instead, the system performs staged semantic retrieval:
Question
→ Candidate Selection
→ Traversal Planning
→ Evidence Ranking
→ Evidence Clustering
→ Answer Composition
→ Grounded Response
The query engine interprets the question, selects relevant graph candidates, traverses semantic relations, ranks supporting evidence, and only then composes an answer.
This architecture improves source grounding and reduces unsupported responses.
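The ranking and clustering stages can be sketched as follows. This is a minimal stand-in: the real pipeline combines graph distance and semantic similarity into the score, and the `Evidence` shape shown here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    quote: str
    score: float       # combined traversal + similarity relevance
    passage_id: str


def rank_and_cluster(evidence: list[Evidence],
                     top_k: int = 5) -> dict[str, list[Evidence]]:
    """Keep the top-k evidence items, grouped by source passage.

    Clustering by passage keeps answer composition tied to a small set
    of citable locations rather than scattered fragments.
    """
    ranked = sorted(evidence, key=lambda e: e.score, reverse=True)[:top_k]
    clusters: dict[str, list[Evidence]] = {}
    for e in ranked:
        clusters.setdefault(e.passage_id, []).append(e)
    return clusters
```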
The system also includes abstain behavior when evidence coverage is insufficient:
No evidence → no confident answer
This prevents the model from generating highly confident answers without supporting graph evidence.
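The abstain gate amounts to a coverage check before any generation call. The thresholds below are illustrative placeholders, not the system's tuned values:

```python
def has_sufficient_evidence(scores: list[float],
                            min_items: int = 2,
                            min_score: float = 0.5) -> bool:
    """Gate answer generation on evidence coverage.

    Returns True only when enough sufficiently strong evidence exists;
    otherwise the pipeline abstains instead of composing an answer.
    """
    strong = [s for s in scores if s >= min_score]
    return len(strong) >= min_items
```

Answer composition runs only when this gate passes; otherwise the system reports that evidence coverage is insufficient.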
Frontend Interaction Model
The frontend is designed as an interactive semantic exploration interface rather than a simple chatbot UI.
Users can:
- upload and manage research PDFs
- explore semantic graph structures visually
- inspect extracted entities and relationships
- navigate supporting evidence
- jump directly to highlighted PDF passages
- focus graph traversal based on query results
The interface connects answers, graph nodes, citations, and source passages into a unified exploration workflow.
Query results dynamically drive graph focus, allowing semantic relationships and supporting evidence to remain visually connected during exploration.
Engineering Decisions
Why Neo4j
The project is fundamentally relationship-heavy. Neo4j was chosen because semantic traversal, evidence linking, and citation connectivity are central to the retrieval model.
Why Relation Instances Instead of Direct Edges
Plain graph edges cannot be the target of other relationships, so evidence nodes cannot point at them. Reifying relations as nodes lets the system attach citations, confidence scores, and supporting passages directly to each claim.
Why Separate Parsing and Extraction
PDF parsing and semantic extraction are isolated into separate stages so changes to document-structure handling do not ripple into the extraction pipeline.
Why Graph Retrieval Before LLM Synthesis
The system retrieves and filters evidence before answer generation instead of relying on raw-context prompting. This improves grounding and reduces hallucinated responses.
Why Evidence-First Answering
The retrieval pipeline prioritizes evidence selection and ranking before answer composition so generated responses remain tied to verifiable document context.
Current State
MindMap-AI currently supports:
- academic PDF ingestion
- semantic graph construction
- evidence-grounded query answering
- citation-aware traversal
- graph visualization
- PDF evidence highlighting
- frontend semantic exploration workflows
The project also includes backend tests, frontend component tests, end-to-end smoke tests, and semantic evaluation fixtures for retrieval and answer quality validation.
Future improvements include stronger authentication, upload controls, rate limiting, and more advanced semantic retrieval optimization.