What if the way we build AI document chatbots right now is flawed? Most systems use RAG. They break documents into chunks, create embeddings, and retrieve answers using similarity search. It works in demos but often fails in real use. It misses obvious answers or picks the wrong context. Now there's a new approach called PageIndex. It doesn't use chunking, embeddings, or vector databases. Yet it reaches up to 98.7% accuracy on tough document Q&A tasks. In this article, we'll break down how PageIndex works, why it performs better on structured documents, and how you can build your own chatbot using it.
The Problem with Traditional RAG
Here's the classic RAG pipeline you've probably seen a hundred times.
- You take your document, which could be a PDF, a report, a contract, and you chop it into chunks. Maybe 512 tokens each, maybe with some overlap.
- You run each chunk through an embedding model to turn it into a vector: a long list of numbers that represents the "meaning" of that chunk.
- You store all those vectors in a vector database: Pinecone, Weaviate, Chroma, whatever your flavor is.
- When the user asks a question, you embed the question the same way, and you do a cosine similarity search to find the chunks whose vectors are closest to the question vector.
- You hand those chunks to the LLM as context, and it writes the answer.
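To make those five steps concrete, here is a toy sketch of the pipeline. This is an illustration, not production code: `embed` is a stand-in bag-of-words counter where a real system would call an embedding model, and the document string is invented.

```python
from collections import Counter
import math

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-3: chunk the document, embed each chunk, store the vectors
document = "Section 1: payment terms apply to all invoices. Section 14.3: dissolution of agreement and exit terms."
chunks = [document[i:i + 64] for i in range(0, len(document), 64)]  # naive fixed-size chunking
index = [(chunk, embed(chunk)) for chunk in chunks]  # the "vector database"

# Steps 4-5: embed the question, rank chunks by similarity, hand the winner to the LLM
question = "What are the payment terms?"
q_vec = embed(question)
best_chunk = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]
print(best_chunk)  # this chunk would become the LLM's context
```

Notice that the fixed-size split already cuts the word "dissolution" in half across two chunks, which is exactly the context destruction described next.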
Simple. Elegant. And completely riddled with failure modes.
Problem 1: Arbitrary chunking destroys context
When you slice a document at 512 tokens, you're not respecting the document's actual structure. A single table might get split across three chunks. A footnote that's critical to understanding the main text ends up in a completely different chunk. The answer you need might literally span two adjacent chunks, and the retriever picks only one of them.
Problem 2: Similarity is not the same as relevance
This is the big one. Vector similarity finds text that sounds like your question. But documents often don't repeat the question's phrasing when they answer it. Ask "What is the termination clause?" and the contract might just say "Section 14.3 — Dissolution of Agreement." Low cosine similarity. Missed entirely.
Problem 3: It's a black box
You get three chunks back. Why those three? You have no idea. It's pure math. There's no reasoning, no explanation, no audit trail. For financial documents, legal contracts, and medical records? That opacity is a serious problem.
Problem 4: It doesn't scale to long documents
A 300-page technical manual with complex cross-references? The sheer number of chunks makes retrieval noisy. You end up getting chunks that are vaguely related instead of the exact section you need.
These aren't edge cases. These are the everyday failures that RAG engineers spend most of their time fighting. And the reason they happen is actually quite simple: the entire architecture is borrowed from search engines, not from how humans actually read and understand documents.
When a human expert needs to answer a question from a document, they don't scan every sentence looking for the one that sounds most similar to the question. They open the table of contents, skim the chapter headings, navigate, and reason about where the answer should be before they even start reading.
That's the insight behind PageIndex.
What is PageIndex?
PageIndex was built by VectifyAI and open-sourced on GitHub. The core idea is deceptively simple:
Instead of searching a document, navigate it, the way a human expert would.
Here's the key mental shift. Traditional RAG asks: "Which chunks look most similar to my question?"
PageIndex asks: "Where in this document would a smart human look for the answer to this question?"
Those are two very different questions. And the second turns out to give dramatically better results.
PageIndex does this by building what it calls a Reasoning Tree. It's essentially an intelligent, AI-generated table of contents for your document.
Here's how to visualize it. At the top, you have a root node that represents the entire document. Below that, you have nodes for each major section or chapter. Each of those branches into subsections. Each subsection branches into specific topics or paragraphs. Every single node in this tree has two things:
- A title: what this section is about
- A summary: a concise AI-generated description of what's in this section
This tree is built once, when you first submit the document. It's your index.
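A sketch of what one node in such a tree might look like. The field names here are illustrative assumptions, not PageIndex's exact schema; check the project's documentation for the real one.

```python
# Hypothetical shape of one reasoning-tree node; field names are assumptions
node = {
    "node_id": "0003",
    "title": "Q3 Financial Results",
    "summary": "Quarterly revenue, expenses, and margin commentary for Q3.",
    "nodes": [  # children: subsections nested under this section
        {
            "node_id": "0012",
            "title": "Revenue Breakdown",
            "summary": "Revenue split by product line and region.",
            "nodes": [],
        }
    ],
}

print(node["nodes"][0]["title"])  # navigating one level down the tree
```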
Now here's where it gets clever. When you ask a question, PageIndex does two things:
1. Tree Search (Navigation)
It sends the question to an LLM along with the tree, but just the titles and summaries, not the full text. The LLM reads through the tree like a human reads a table of contents, and it reasons: "Okay, given this question, which branches of the tree are most likely to contain the answer?"
The LLM returns a list of specific node IDs, and you can see its reasoning. It literally tells you why it chose those sections. Full transparency.
2. Answer Generation
PageIndex fetches only the full text of those selected nodes, hands it to the LLM as context, and the LLM writes the final answer grounded entirely in the real document text.
Two LLM calls. No embeddings. No vector database. Just reasoning.
And because every answer is tied to specific nodes in the tree, you always know exactly which page, which section, which part of the document the answer came from. Full audit trail. Full explainability.
How it Works: Deep Dive
Let me go deeper into the mechanics, because this is the really fascinating part.
The Tree Index – Building Phase
When you call submit_document(), PageIndex reads your PDF or text file and does something remarkable. It doesn't just extract text; it also understands the structure. Using a combination of layout analysis and LLM reasoning, it identifies:
- What are the natural sections and subsections?
- Where does one topic end and another begin?
- How do the pieces relate to each other hierarchically?
It then constructs the tree and generates a summary for every node. Not just a title. An actual condensed description of what's in that section. This is what enables the smart navigation later.
The tree uses a numeric node ID system that mirrors the real document structure: 0001 might be Chapter 1, 0002 Chapter 2, 0003 the first section within Chapter 1, and so on. The hierarchy is preserved.
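One simple way to picture how sequential IDs can still preserve hierarchy is with explicit parent links. This is an illustrative reconstruction under assumed field names, not PageIndex's internal representation:

```python
# Sequential node IDs with explicit parent links preserve the hierarchy
nodes = {
    "0000": {"title": "Annual Report 2024", "parent": None},
    "0001": {"title": "Chapter 1", "parent": "0000"},
    "0002": {"title": "Chapter 2", "parent": "0000"},
    "0003": {"title": "Section 1.1", "parent": "0001"},
}

def path(node_id):
    # Walk parent links upward to produce a table-of-contents-style breadcrumb
    parts = []
    while node_id is not None:
        parts.append(nodes[node_id]["title"])
        node_id = nodes[node_id]["parent"]
    return " > ".join(reversed(parts))

print(path("0003"))  # → Annual Report 2024 > Chapter 1 > Section 1.1
```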
Why This Beats Chunking
Think about what chunking does to a 50-page financial report. You get maybe 300 chunks, each with zero awareness of whether it came from the executive summary or a footnote on page 47. The embedder treats them all equally.
The PageIndex tree, on the other hand, knows that node 0012 is the "Revenue Breakdown" subsection under the "Q3 Financial Results" section under "Annual Report 2024." That structural awareness is enormously useful when you're searching for something specific.
The Search Phase – Reasoning, Not Math
Here's the other thing that makes PageIndex special. The search step is not a mathematical operation. It's a cognitive operation performed by an LLM.
When you ask, "What were the main risk factors disclosed in this report?", the LLM doesn't measure cosine distance. It reads the tree, recognizes that the "Risk Factors" section is exactly what's needed, and selects those nodes, just like you would.
This means PageIndex handles semantic mismatch naturally. This is the kind of mismatch that kills vector search. The document calls it "Risk Factors." Your question calls it "major risks." A vector search might miss it. An LLM reading the tree structure will not.
The Numbers
PageIndex powered Mafin 2.5, VectifyAI's financial RAG system, which achieved 98.7% accuracy on FinanceBench. For those unfamiliar, this is a benchmark specifically designed to test AI systems on financial document questions, where the documents are long, complex, and full of tables and cross-references. That's the hardest environment for traditional RAG. It's where PageIndex shines most.
What is it Best For?
PageIndex is particularly powerful for:
- Financial reports: earnings statements, SEC filings, 10-Ks
- Legal contracts: where every clause matters and context is everything
- Technical manuals: complex cross-referenced documentation
- Policy documents: HR policies, compliance documents, regulatory filings
- Research papers: structured academic content
Basically: anywhere your document has real structure that chunking would destroy.
And the really exciting part? You can use it with any LLM. OpenAI, Anthropic, Gemini: the tree search and answer generation steps are just prompts. You're in full control.
Hands-on With Jupyter Notebook
Okay. You now know the theory. You know why PageIndex exists, what it does, and how it works under the hood. Now let's actually build something with it.
I'm going to open a Jupyter notebook and walk you through the entire PageIndex pipeline: uploading a document, getting the reasoning tree back, navigating it with an LLM, and asking questions. Every line of code is explained. No hand-waving.
Install PageIndex
%pip install -q --upgrade pageindex
First things first. We install the pageindex Python library. One line, done. No vector database to set up. No embedding model to download. This is already simpler than any traditional RAG setup.
Imports & API Setup
import os
from pageindex import PageIndexClient
import pageindex.utils as utils
from dotenv import load_dotenv

load_dotenv()
PAGEINDEX_API_KEY = os.getenv("PAGEINDEX_API_KEY")
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)
We import the PageIndexClient. This is our connection to the PageIndex API. All the heavy lifting of building the tree happens on their end, so we don't need a beefy machine. We also load API keys from a .env file; always keep your keys out of your code.
OpenAI Setup
import openai

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

async def call_llm(prompt, model="gpt-4.1-mini", temperature=0):
    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
    response = await client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
Here we define our LLM helper function. We're using GPT-4.1-mini for cost efficiency, but this works with any OpenAI model, and you can swap in Claude or Gemini with a one-line change. Temperature zero keeps the answers factual and consistent.
Submit the Document
pdf_path = "/Users/soumil/Desktop/PageIndex/HR Policies-1.pdf"
doc_id = pi_client.submit_document(pdf_path)["doc_id"]
print("Document Submitted:", doc_id)
This is the magic line. We point to our PDF, in this case an HR policy document, and submit it. PageIndex takes the file, reads its structure, and starts building the reasoning tree in the background. We get back a doc_id, a unique identifier for this document that we'll use in every subsequent call. Notice there's no chunking code, no embedding call, no vector database connection.
Wait for Processing & Get the Tree
import time

while not pi_client.is_retrieval_ready(doc_id):
    print("Still processing... retrying in 10 seconds")
    time.sleep(10)

tree = pi_client.get_tree(doc_id, node_summary=True)["result"]
utils.print_tree(tree)
PageIndex processes the document asynchronously; we just poll every 10 seconds until it's ready. Then we call get_tree() with node_summary=True, which gives us the full tree structure along with summaries.
Look at this output. This is the reasoning tree. You can see the hierarchy: the top-level HR Policies node, then Digital Communication Policy, Sexual Harassment Policy, Grievance Redressal Policy, each branching into its subsections. Every node has an ID, a title, and a summary of what's in it.
This is what traditional RAG throws away. The structure. The relationships. The hierarchy. PageIndex keeps all of it.
Tree Search with the LLM
import json

query = "What are the key HR policies and employee guidelines?"
tree_without_text = utils.remove_fields(tree.copy(), fields=["text"])

search_prompt = f"""
You are given a question and a tree structure of a document...
Question: {query}
Document tree structure: {json.dumps(tree_without_text, indent=2)}
Reply in JSON: {{ "thinking": "...", "node_list": [...] }}
"""

tree_search_result = await call_llm(search_prompt)
Now we search. For this, we build a prompt that includes the question and the entire tree, but crucially, without the full text content of each node. Just the titles and summaries. This keeps the prompt manageable while giving the LLM everything it needs to navigate.
The LLM is instructed to return a JSON object with two things: its thinking process and the list of relevant node IDs.
Look at the output. The LLM tells us exactly why it chose each section. It reasoned through the tree like a human would. And it gave us a list of 30 node IDs, every section of this HR document, because the question is broad.
This transparency is something you simply cannot get with cosine similarity.
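Since tree_search_result comes back as model-generated text, it pays to parse it defensively before indexing into it; models sometimes wrap the JSON in extra prose. A small sketch, where the raw string is a made-up example of what a reply might look like:

```python
import json
import re

# A made-up example of a model reply with stray prose around the JSON
raw = 'Sure! Here is my selection:\n{"thinking": "The question is broad, so both policy sections apply.", "node_list": ["0001", "0002"]}\nLet me know if you need more.'

def parse_tree_search(reply):
    # Grab the outermost {...} span so surrounding prose is ignored
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    return json.loads(match.group(0))

result = parse_tree_search(raw)
print(result["node_list"])  # → ['0001', '0002']
```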
Fetch Text and Generate Answer
# node_map is assumed to map each node ID to its node dict (including full text)
tree_search_result_json = json.loads(tree_search_result)
node_list = tree_search_result_json["node_list"]
relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list)

answer_prompt = f"""Answer the question based on the context:
Question: {query}
Context: {relevant_content}"""

answer = await call_llm(answer_prompt)
utils.print_wrapped(answer)
Step two. Now that we know which nodes are relevant, we fetch their full text, only those nodes, nothing else. We join the text and build a clean context prompt. One more LLM call, and we get our answer.
Look at this answer. Detailed, structured, accurate. And every single claim can be traced back to a specific node in the tree, which maps to a specific page in the PDF. Full audit trail. Full explainability.
The ask() Function
async def ask(query):
    # Full pipeline: tree search → text retrieval → answer generation
    ...

user_query = input("Enter your query: ")
await ask(user_query)
Now we bundle the entire pipeline into a single ask() function. Submit a question, get an answer; the tree search, retrieval, and generation all happen under the hood. Let me show you a couple of live examples.
Type a question: e.g., "What are the penalties for sexual harassment?"
Watch what happens. It searches the tree, identifies the Sexual Harassment Policy nodes specifically, pulls their text, and gives us a precise, cited answer in seconds. This is the experience you want to ship to your users.
Another one. Again, it finds exactly the right section. No confusion, no noise, no hallucination. Just the answer, from the document, with a clear trail showing where it came from.
Conclusion
Let's bring this together. Traditional RAG finds text that looks similar to a question. But the real goal is to find the right answer in a structured document. PageIndex solves this better. It builds a reasoning tree and lets the model navigate it intelligently. The result is accurate and explainable answers, with up to 98.7% accuracy on FinanceBench. It isn't perfect for every use case. Vector search still works well for large-scale semantic search. But for long, structured documents, PageIndex is a stronger approach. You can find all the code in the description. Add your API keys and get started.
I am a Data Science Trainee at Analytics Vidhya, passionately working on the development of advanced AI solutions such as Generative AI applications, Large Language Models, and cutting-edge AI tools that push the boundaries of technology. My role also involves creating engaging educational content for Analytics Vidhya's YouTube channels, developing comprehensive courses that cover the full spectrum from machine learning to generative AI, and authoring technical blogs that connect foundational concepts with the latest innovations in AI. Through this, I aim to contribute to building intelligent systems and share knowledge that inspires and empowers the AI community.