Why Your AI Chatbot Prompt Worked Once But Never Again

1. Token length limits silently sabotage conditional prompt branches

You’d think a prompt that works with test data would work in production. Except it doesn’t. When I first set up my support routing bot with GPT-4, I had a pretty clean system that categorized queries into topics like “billing,” “technical issue,” or “feature request.” Worked beautifully in Playground. Then I added logging, real user data started hitting it, and suddenly it responded with half-finished sentences or hallucinated categories like “potato.”

The issue? A hidden length ceiling. My prompt + system message + user input + response all together were quietly exceeding the token cap. But only sometimes. A five-line input from a user with a forwarded email thread exploded the context length — so my carefully written prompt got silently truncated mid-IF-statement. No warning. The bot still replied, but it mangled the logic because the actual instructions got cut off right before the line that said If billing-related, respond with "Billing" only.

I only figured this out by copying the full chat history into OpenAI’s tokenizer.

“Your message exceeded maximum context length. The model attempted to recover.”

That recovery? Random garbage.
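
A rough token count before anything hits the API catches this early. Below is a minimal sketch, assuming a Node backend and the js-tiktoken package; the context size, response reserve, and truncation strategy are placeholders you'd tune for your own model.

import { encodingForModel } from "js-tiktoken";

const MAX_CONTEXT = 8192;      // total context window for the model (placeholder)
const RESPONSE_RESERVE = 500;  // tokens kept free for the reply (placeholder)

const enc = encodingForModel("gpt-4");

// Returns user input that fits alongside the full instructions, trimming the user
// content (never the prompt) when something like a forwarded email thread blows
// past the budget.
function fitUserInput(systemPrompt: string, userInput: string): string {
  const promptTokens = enc.encode(systemPrompt).length;
  const inputTokens = enc.encode(userInput);
  const budget = MAX_CONTEXT - RESPONSE_RESERVE - promptTokens;

  if (inputTokens.length <= budget) return userInput;

  console.warn(`User input over budget (${inputTokens.length} > ${budget}), truncating input, not instructions`);
  return enc.decode(inputTokens.slice(0, Math.max(budget, 0)));
}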

2. Prompt formatting changes model behavior more than you expect

There was a week where our bot kept replying with “Sure, here’s a poem about refund requests!” instead of actually tagging the issue. Turns out the culprit was the formatting itself. I had prettified the prompt into a nicely formatted Markdown-styled block with bullets and headings. It looked clean in my IDE. But the shift from plain text to structure changed how the model interpreted some parts as “instructions” and others as “context.”

Removing the Markdown formatting and just hammering every instruction into plain numbered lines fixed it instantly. The reason? Models like GPT guess intent from text patterns. A heading like ### Refund Requests got treated as a content label rather than a directive, so the LLM started improvising content under that heading instead of using it as the category to assign.
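
For reference, that flattened style looks roughly like this as a string constant; the wording below is a reconstruction, not the production prompt.

// Illustrative reconstruction of the de-Markdowned system prompt: numbered plain
// lines only, so nothing reads as a content heading the model might write under.
const SYSTEM_PROMPT = [
  "1. You are a support routing assistant.",
  "2. Read the user's message and assign exactly one category: Billing, Technical Issue, Feature Request, Other.",
  "3. If billing-related, respond with \"Billing\" only.",
  "4. Do not add headings, bullets, or any extra text.",
].join("\n");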

In the ticket logs, I literally have a customer asking about updating credit card info and getting back —

Here's a haiku:
Refunds may come slow
If your patience wears too thin
Use our contact form

I didn’t deploy poetry mode intentionally, but prompt formatting did.

3. Stop passing JSON examples through the system prompt

I used to build prompts like:

You will return a JSON object like this:
{
  "category": "billing",
  "priority": "high",
  "nextStep": "email support"
}

This worked for like a month, and then suddenly broke on a Friday night during a support campaign launch. I was getting malformed JSON, trailing commas, sometimes it would answer the user inside the JSON — pure chaos.

The fix — and I found this buried in a Zapier forum thread — was to move the example JSON into the user message part of the prompt, not the system prompt. Something about how the model interprets intent makes it way more likely to produce clean JSON if the structure appears inside the user input. Probably because it sees the user message as a request to respond, and the system prompt as configuration rather than guidance.

Once I moved the example below the user’s query like this:

Question: I need help with getting a refund

Example format:
{
  "category": "billing",
  "priority": "high",
  "nextStep": "email support"
}

The bot started returning well-structured JSON again, even with long inputs. Haven’t touched it since. I still don’t know why the original setup worked at first and then stopped; my best guess is that a quiet model update changed how strictly the system prompt gets followed.
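
In API terms, the change is just where the example lands in the messages array. A minimal sketch with the official openai Node SDK; the system-prompt wording and the classify wrapper are illustrative, not our exact production code.

import OpenAI from "openai";

const client = new OpenAI();

// The system prompt stays pure configuration; the JSON example travels with the
// user's question, where the model treats it as part of the request to answer.
const EXAMPLE_FORMAT = `Example format:
{
  "category": "billing",
  "priority": "high",
  "nextStep": "email support"
}`;

async function classify(question: string) {
  const completion = await client.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "You categorize support questions. Reply with JSON only." },
      { role: "user", content: `Question: ${question}\n\n${EXAMPLE_FORMAT}` },
    ],
  });
  return completion.choices[0].message.content;
}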

4. Temperature settings distort replicability more than you think

One day, the team asked why the FAQ-bot answered the exact same question three totally different ways. I swore up and down that the logic was sound. So they sent me two ChatGPT logs: one said “Yes you can export reports,” the other said, “I’m not able to help with exports.”

I finally opened the playground config, which had been copied from an earlier version, and saw temperature: 0.8.

The randomness made it… creative. And worse, it was specifically hallucinating limitations — like claiming export required a Pro license, which doesn’t exist. I dropped it to 0.2 and added this blunt line to the prompt:

Only respond with factual yes or no answers. Do not speculate.

That fixed it. Mostly. But the thing I didn’t expect: different wrappers around the OpenAI API reset or tweak temperature defaults without warning. Our Make.com scenario and our Node backend were both calling GPT-4 with different settings. Nobody noticed until both ran live simultaneously.

Now every prompt file has temperature and top_p hardcoded, and I don’t trust platform GUIs to reflect the actual request.
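
On the Node side, hardcoding just means the sampling settings live next to the prompt and go out with every request. A small sketch with the openai SDK; the values are the ones discussed above, the wrapper name is made up.

import OpenAI from "openai";

const client = new OpenAI();

// Sampling settings pinned in code so no platform GUI or wrapper default can drift.
const SAMPLING = { temperature: 0.2, top_p: 1 } as const;

async function answerFaq(systemPrompt: string, question: string) {
  const completion = await client.chat.completions.create({
    model: "gpt-4",
    temperature: SAMPLING.temperature,
    top_p: SAMPLING.top_p,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content;
}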

5. Strategic repetition inside prompts sets anchors the model obeys

This felt gross to do at first but ended up being a lifesaver. While I was building a GPT-based refund routing tool for a SaaS client, the model kept drifting mid-response, tagging multiple reasons instead of picking one.

Tried enumeration, labels, even shouty all-caps instructions. What finally worked was repeating the final key instruction twice at the end:

Respond with only ONE of the categories below.
Only respond with ONE category.

Once I did that, response consistency jumped by something like 80%. It’s as if the model needs a second wakeup hit to snap into shape. I checked a bunch of test prompts from public Colab notebooks, and it turns out this isn’t uncommon: devs sneak in repetition right before critical junctions.

Oddly, putting the duplicate too far away makes it ineffective. Nest it inside the same paragraph block, or within 3-4 lines at the most.
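
If the prompt is assembled in code, one way to keep the duplicate close is to append both copies at the very end; a tiny sketch, with the function name made up.

// Keeps the repeated instruction within a couple of lines of the original by
// appending both copies as the final lines of the assembled prompt.
function withAnchoredInstruction(promptBody: string, instruction: string): string {
  return `${promptBody.trim()}\n\n${instruction}\n${instruction}`;
}

// Usage:
// const prompt = withAnchoredInstruction(categoryPrompt, "Respond with only ONE of the categories below.");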

6. Webhook-triggered prompts misfire on resend without state tracking

This is a nasty one. You build a bot where a webhook triggers a prompt → the bot does something with the response → saves output to Airtable. Smooth. Until someone hits “Replay” on a failed webhook in Make, and suddenly that perfectly testable flow decides to output something totally new. Like last week, a webhook resend routed a user as “Angry complaint” when the first time it had listed them as “Neutral inquiry.”

Same input, different result. I’d forgotten to include the prior classification as part of the prompt context — so reattempts weren’t grounded in what was already chosen. The resend just treated it as a fresh input.

Hard lesson: for any webhook-triggered LLM automation, persist the last classification result and include it in the new request like this:

Previous classification: Neutral inquiry
Reevaluating after webhook error. Confirm or revise based on above.

Also learned that Make stores last outputs in the bundle metadata, but doesn’t make them visible unless you map them explicitly in the scenario. That visibility friction caused a big misfire.
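
The wiring is basically one lookup before the LLM call. Here is a sketch of one way to ground resends, assuming the prior result is persisted somewhere like Airtable; fetchPreviousClassification is a hypothetical helper, not a real Make or Airtable API.

// Hypothetical lookup; in this setup the last classification is stored per ticket
// and written back after every successful run.
async function fetchPreviousClassification(ticketId: string): Promise<string | null> {
  // ...query your store here...
  return null;
}

// Builds the prompt for a (possibly replayed) webhook, grounding it in the earlier
// decision so a resend confirms or revises instead of starting fresh.
async function buildClassificationPrompt(ticketId: string, userMessage: string): Promise<string> {
  const previous = await fetchPreviousClassification(ticketId);
  const grounding = previous
    ? `Previous classification: ${previous}\nReevaluating after webhook error. Confirm or revise based on above.\n\n`
    : "";
  return grounding + userMessage;
}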

7. Uppercased prompt commands weirdly reduce error rates in GPT

After a two-hour debugging spiral where prompts were giving inconsistent JSON keys (sometimes “priorityLevel”, sometimes “Priority_Level”), I tried something I now swear by: shouting the command.

RETURN JSON USING EXACTLY THESE KEYS:

Not sure why — maybe uppercase shifts the token weight — but the model stuck to the keys way more reliably. Even better when followed immediately with a JSON block it should mimic. I now use all-caps for any part of the prompt that controls format or where slipping up breaks a downstream tool (like Webflow CMS fields or Notion relations).
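
Put together, that pattern looks something like this; the keys are borrowed from the examples above, not an exact production schema.

// All-caps format command placed immediately before the structure it has to mimic.
const FORMAT_INSTRUCTION = `RETURN JSON USING EXACTLY THESE KEYS:
{
  "category": "",
  "priorityLevel": "",
  "nextStep": ""
}`;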

Funny enough, if you place uppercased commands too early in the prompt, they get ignored. The best position is immediately before the output expectations. It’s more effective than you’d think. Try this:

CLASSIFY THE QUESTION INTO ONE OF THE FOLLOWING:

- Billing
- Technical Issue
- Cancellation
- Other

Success rate, in my testing at least, went from maybe 60-something percent to almost always right.

8. Model memory in chat UIs can taint later user submissions

This one’s from a weird late-night debugging panic where our internal chatbot suddenly started calling users by the wrong names. Turns out, when you use ChatGPT’s chat UI or embed a continuous-thread bot, prior entries really matter. We had test users inputting random names, locations, and test emails — and the model would retain those assumptions for future chats.

That meant when a real user named Sam submitted a query, the bot said “Thanks, David.” There was nothing in the prompt to override this, because the assumption wasn’t coming from the prompt, but from what the chat context carried over.

Fix was two parts:

  1. Make each session stateless — either via new threads or by clearing history before sending prompts
  2. Inject a reset prompt like “This is a new user query with no prior assumptions” at the start of every message

In API setups, it’s easy — just don’t thread messages. But in UI-based sandbox testing (Playground, ChatGPT), you need to be annoyingly vigilant. Or you’ll be redacting logs and explaining to people why GPT keeps hallucinating personalized greetings.
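
For the API case, “don’t thread messages” means each request carries only the system prompt and the current query, nothing else. A minimal sketch; the system-prompt wording is illustrative.

// Every request is stateless: only the system prompt and the current query go out,
// so nothing a test user typed last week can leak into a real user's session.
function statelessMessages(userQuery: string) {
  return [
    { role: "system" as const, content: "You are a support assistant. Ignore any prior context." },
    { role: "user" as const, content: `This is a new user query with no prior assumptions.\n\n${userQuery}` },
  ];
}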