Chaining Prompts to Build a Usable Internal Knowledge Base

1. Why embedding-based search alone fails in real workflows

Vector search sounds fancy until you dump 4k-token blog posts into Pinecone and realize every query about pricing returns the same onboarding article. That happened to me inside an internal Notion-to-GPT retrieval setup — even with cosine similarity, it always decided the most relevant doc was a long FAQ labeled “basic subscription info (draft)”. Nobody on my team actually read it. Great.

The flaw: embeddings are blind to tone, recency, and context cues. If your query is “when did our refund policy change”, and the embedding says, “this pre-launch doc about returns is a 92% match”, you’re stuck. Plain semantic similarity isn’t granular enough when multiple pages say similar things 10 different ways. The lack of weighting for document metadata (like modification date, author certainty, linked products) is crippling. And yes, yes — you can bolt metadata filters onto vector search in some tools… but unless you’re using Pinecone with metadata filtering and sorting, it’s awkward to do dynamically with LLM calls.
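
For reference, here is roughly what a metadata-aware query looks like if you do lean on Pinecone’s filtering. This is a minimal sketch, not our production setup: the index name, metadata fields, and 18-month cutoff are all invented for illustration.

from datetime import datetime, timedelta
from openai import OpenAI
from pinecone import Pinecone

oai = OpenAI()
pc = Pinecone(api_key="PINECONE_KEY")        # placeholder key
index = pc.Index("internal-kb")              # hypothetical index name

# Embed the user question the same way the documents were embedded.
question = "when did our refund policy change"
query_embedding = oai.embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding

# Only consider docs modified in the last ~18 months and not marked as drafts.
cutoff = int((datetime.utcnow() - timedelta(days=540)).timestamp())
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    filter={
        "last_modified": {"$gte": cutoff},   # assumes a unix-timestamp metadata field
        "status": {"$ne": "draft"},          # assumes a status field on each record
    },
)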

Even worse, I once loaded 100+ Airtable records into an embedding-powered knowledge assistant where half the fields were blank — and that actually improved retrieval performance, just because there was less lexical noise. Which is backwards.

2. Using prompt chains instead of RAG for source consistency

Standard Retrieval Augmented Generation (RAG) is great for pulling source materials into prompts dynamically. But I kept running into this issue: when I queried my Notion-stored knowledge base through a basic RAG flow via Zapier and OpenAI, the LLM would often mix multiple sources together, then cite none. Lotta ghosts, no receipts.

The fix was actually a semi-manual prompt chain that splits the work into steps:

  1. Document chunks are ingested via the API (I piped them from Notion into a JSON array via Make)
  2. A top-level routing prompt identifies what kind of query this is (“product policy”, “internal team protocol”, “external reference”)
  3. A chained prompt then passes only the plausibly relevant documents to a summarizer, with metadata displayed inline
  4. The LLM responds with: answer + excerpt + source title (enforced via a JSON schema check)

This multi-prompt chain forced the model to quote from the right doc instead of hallucinating a merged summary. It’s slower (~3–6 seconds per response versus ~1 second for base GPT-3.5 alone), but in exchange, I can trust that what someone pastes into Slack isn’t completely made up. It came out roughly 15% more accurate in user testing, though we never formalized the measurement.
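
If you’re curious what that chain looks like outside of Make, here’s a stripped-down sketch in Python using the openai client. The function names, model choice, and chunk format are mine; the steps mirror the list above.

import json
from openai import OpenAI

client = OpenAI()

# Step 1 output: chunks ingested from Notion (via Make in my case), tagged by category.
all_chunks = [
    {"title": "Refund policy v3", "updated": "2024-02-10",
     "category": "product policy",
     "text": "Refunds are honored for 30 days. Policy last changed on 2024-02-01."},
]

def route_query(question):
    # Step 2: the top-level routing prompt classifies the query type first.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": (
            "Classify this question as one of: product policy, "
            "internal team protocol, external reference.\n"
            f"Question: {question}\n"
            "Answer with the label only."
        )}],
    )
    return resp.choices[0].message.content.strip().lower()

def answer_from_docs(question, docs):
    # Steps 3 and 4: pass only the shortlisted docs with metadata inline,
    # and force an answer + excerpt + source title shape.
    context = "\n\n".join(
        f"[title: {d['title']} | updated: {d['updated']}]\n{d['text']}" for d in docs
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": (
            "Answer using ONLY the documents below. Respond as JSON with "
            'keys "answer", "source", "excerpt".\n\n'
            f"{context}\n\nQuestion: {question}"
        )}],
    )
    return json.loads(resp.choices[0].message.content)

question = "When did our refund policy change?"
category = route_query(question)
shortlist = [c for c in all_chunks if c["category"] == category]
print(answer_from_docs(question, shortlist))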

3. Breaking down your corpus into micro-intent document slices

The biggest lift was figuring out how to chunk long PDFs or Notion pages into slices that aligned with the types of questions users ask. Cutting a fixed ~1K-token chunk every N characters gave noisy overlaps and pulled in headers stripped of their context. Also, OpenAI’s tokenizer behaves very differently on tables and unordered bullets — token counts there bear little relation to character counts, which collapses a character-based chunking strategy entirely.

I switched gears after watching some users interact with our knowledge tool in Slack. Example: they’d ask “Where do I submit proof of purchase?” — but that phrase never appears anywhere in the raw docs. The actual phrasing was hidden in a legal PDF: “Receipts must be submitted through portal A within 14 days of delivery acknowledgment.”

So I coded a secondary processing step that:

  • Extracts action verbs or obligations (like “submit”, “upload”, “approve”)
  • Pairs them with noun objects (“proof”, “receipt”, “expense form”, etc)
  • Generates simulated Q&A pairs using GPT-4 — e.g., “How do I submit proof of receipt?”
  • Tags each document slice with these simulated questions

Then I used these generated proxy questions for embedding and filtering rather than the raw documents themselves. Result: retrieval matches the structure of real questions ~20% better, which is the difference between useless and usable inside a Slackbot.
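
A rough sketch of that proxy-question step, in case it helps. The prompt wording, function name, and embedding model are illustrative rather than a fixed recipe.

import json
from openai import OpenAI

client = OpenAI()

def simulate_questions(slice_text, n=5):
    # Ask GPT-4 what a user might actually type when they need this slice.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": (
            f"Here is an excerpt from internal documentation:\n{slice_text}\n\n"
            f"Write {n} questions an employee might ask that this excerpt answers. "
            "Use everyday phrasing, not the document's wording. "
            "Return a JSON array of strings."
        )}],
    )
    return json.loads(resp.choices[0].message.content)

slice_text = ("Receipts must be submitted through portal A within 14 days "
              "of delivery acknowledgment.")
questions = simulate_questions(slice_text)   # e.g. "Where do I submit proof of purchase?"

# Embed the simulated questions instead of the raw slice, so question-shaped
# queries match question-shaped proxies at retrieval time.
vectors = client.embeddings.create(model="text-embedding-3-small", input=questions)
tagged_slice = {
    "text": slice_text,
    "proxy_questions": questions,
    "vectors": [d.embedding for d in vectors.data],
}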

4. Avoiding hallucinated factual merges by enforcing JSON return formats

The first time someone showed me a summary from our internal bot that said “payouts are processed weekly via Wise or Payoneer depending on region,” I was confused because we hadn’t used Wise in months. The doc it hallucinated that from? A changelog that said, “We used to use Wise but stopped in July.”

The solution turned out to be really strict formatting — using a forced JSON structure in the final prompt in the chain. Example:

{
  "answer": <TEXT>,
  "source": <FILENAME>,
  "excerpt": <SUBSTRING FROM FILE>
}

This format makes LLM outputs parseable by validators (I use a custom GPT parser in Make to check key completeness). But more importantly, it forces the LLM to anchor its response in a specific file name and snippet. Once we did this, nearly all hallucinations disappeared unless the underlying source was wrong too — which is a better failure mode.

One edge case: if the prompt is too long (especially when passing multiple slices), GPT-3.5 or GPT-4 sometimes silently drops the field structure and responds like it always did — as a blob of helpful/nebulous text. To detect this, I added a “number of fields checked” counter in the JSON validator. If it comes back with <2 fields, we inject a retry prompt with stricter instructions.
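
If you would rather run that check in code than in a Make parser step, the validator is small. A sketch against the schema above; the two-field threshold and names are mine.

import json

REQUIRED_FIELDS = ("answer", "source", "excerpt")

def count_valid_fields(raw_output):
    # How many of the required keys came back as non-empty strings.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0
    return sum(
        1 for f in REQUIRED_FIELDS
        if isinstance(data.get(f), str) and data[f].strip()
    )

RETRY_INSTRUCTION = (
    "Your previous reply did not follow the required format. Respond again as "
    'JSON with exactly these keys: "answer", "source", "excerpt". No other text.'
)

def needs_retry(raw_output):
    # Fewer than 2 usable fields usually means the model fell back to a prose blob.
    return count_valid_fields(raw_output) < 2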

5. Handling time-sensitive knowledge with prompt date injection

Internal documentation gets outdated fast. There was one infamous Friday where someone asked the bot about our annual pricing tiers, and it referenced packages from 2021 — because the changelog hadn’t been updated in the ‘main’ docs folder. The LLM had no concept of staleness, so it sounded legit. And a client got sent that pricing in an email later. Yeah.

The hack that finally worked: inject the current system date + last modified timestamp into the prompt explicitly:

You are responding on 2024-04-19.
The following document was last updated on 2021-12-05.
Only respond with info if the source is less than 18 months old.

Document: <SLICE>

This pushed GPT-4 to self-filter a bunch of irrelevant sources just based on timestamp deltas. In test runs, it ruled out ~30% of legacy pricing files that used to get misquoted casually. As a side effect, it also started hedging: “As of our latest documentation (dated October 2022)…” — which is a huge win for transparency, even when the actual answer is wrong-ish.

One open bug: the model occasionally miscalculates time spans. It once told me a doc from 2021 was “less than 6 months old” in April 2024. I still can’t tell if that was a model quirk or bad prompt phrasing, but it happened more often when the document date line was buried in the middle of the prompt block rather than at the top.
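
The injection itself is just string assembly, which also makes it easy to keep the date lines pinned at the top of the prompt. A minimal sketch; the function name is mine and the cutoff matches the prompt above.

from datetime import date

def build_dated_prompt(doc_text, last_updated, max_age_months=18):
    # Keep both date lines at the very top so they never get buried mid-prompt.
    return (
        f"You are responding on {date.today().isoformat()}.\n"
        f"The following document was last updated on {last_updated}.\n"
        f"Only respond with info if the source is less than {max_age_months} months old.\n\n"
        f"Document: {doc_text}"
    )

prompt = build_dated_prompt("Annual pricing tiers: ...", "2021-12-05")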

6. Using simulated queries to stress test your retrieval logic

I wasted a solid week trying to get our knowledge base chatbot to answer basic finance questions reliably. No matter how I sliced the documents, GPT kept referencing irrelevant vendor onboarding guides. The docs were fine — it was the queries that were weird. Turns out, people ask these things in completely different ways than you’d expect.

So I wrote a prompt that went:

You are a compliance officer asking questions about finance policy.
Generate 25 weird or edge-case questions someone might ask.
Avoid reusing phrasing from the source document.
Give output as a JSON array.

Then I dumped those into my Make flow as test cases, logged failed matches, and used them as training labels. After running this, I discovered that most failure points came from:

  • Two policies using extremely similar phrasing with different rules
  • Bullet lists with ambiguous reasons for exclusions
  • Nested legal clauses triggering false positives

Generated queries are the best source of failure audits. Also, some of them were gold:

“If I purchase software with a personal card in Swaziland, what’s the reimbursement delay?”

Nothing in our docs says Swaziland. But it forced better location-based parsing AND led to adding country-specific policy variants.
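
If you want to run the same audit outside Make, the harness is short. In this sketch, retrieve() and is_relevant() are placeholders for whatever your retrieval step and pass/fail check actually are.

import json
from openai import OpenAI

client = OpenAI()

# Generate the adversarial test queries, mirroring the prompt above.
resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": (
        "You are a compliance officer asking questions about finance policy. "
        "Generate 25 weird or edge-case questions someone might ask. "
        "Avoid reusing phrasing from the source document. "
        "Give output as a JSON array."
    )}],
)
test_queries = json.loads(resp.choices[0].message.content)

# Run each one through retrieval and log the misses for later review.
failures = []
for q in test_queries:
    docs = retrieve(q)                              # placeholder for your retrieval step
    if not any(is_relevant(q, d) for d in docs):    # placeholder for your pass/fail check
        failures.append({"query": q, "returned": [d["title"] for d in docs]})

print(f"{len(failures)} of {len(test_queries)} simulated queries missed")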

7. When retry logic backfires and returns a worse answer

I had added retry-on-failure logic to my LLM flow: if the output didn’t match the schema (missing “source” or “excerpt” fields), it triggered a semi-polite fallback prompt: “Please retry the same response using the expected JSON format.”

Worked fine until one day, someone asked “Where do I report abuse from a client?” and the first response pulled a valid HR policy excerpt. But the output had a minor formatting bug (extra comma in JSON). So the retry prompt fired — and the second output was a cheerful note about respectful language protocols. No mention of abuse reporting. Totally useless.

This stuff happens because retries are blind: the model doesn’t know what answer it just gave, and including the full history makes prompts too long. What I ended up doing was accepting partial failures: instead of retrying, keep an output if at least two of the three fields are present and valid. Cleaner than letting barely-correct spam overwrite a great answer that was only missing a brace.
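
In code, that acceptance rule is one line on top of the field counter from the validator sketch earlier; the threshold of two is simply what worked for us.

def accept_output(raw_output):
    # Keep a response if at least 2 of the 3 schema fields are usable,
    # rather than firing a blind retry that may come back worse.
    return count_valid_fields(raw_output) >= 2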

8. Best practices I still force myself to follow every time

There are a few rules I always return to, because when I don’t, the whole chain inevitably breaks:

  • Keep prompts short and repeat key instructions twice — especially JSON constraints
  • Use internal function calls or Make parser steps to enforce schema instead of Regex
  • Never rely on model memory between prompt steps — always re-include context
  • If something worked great Tuesday, it might not Thursday — version your prompts if possible
  • Custom logs help: I track prompt length, response duration, and retry counts in Airtable (see the sketch after this list)
  • Add a debug mode that outputs the final LLM prompt text for review — users will ask
  • Avoid lookup tables in the prompt if they get longer than 5 items — too sparse
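
For the logging bullet, here is how little it takes. A sketch assuming the pyairtable client; the base ID, table, and column names are made up.

import time
from pyairtable import Api

api = Api("AIRTABLE_TOKEN")                              # placeholder token
log_table = api.table("appXXXXXXXXXXXXXX", "LLM Logs")   # placeholder base and table

def log_run(prompt, response, started_at, retry_count):
    # One row per LLM call: enough to spot bloated prompts and retry storms.
    log_table.create({
        "Prompt Length": len(prompt),
        "Response Duration (s)": round(time.time() - started_at, 2),
        "Retry Count": retry_count,
        "Response Preview": response[:200],
    })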

Breaking even one of these usually doesn’t kill the system immediately. It’ll keep working — just slightly worse — until someone trusts the wrong answer in a live client meeting.