Fixing Broken Support Prompts When Labels Change Midstream
1. When tags change but your AI model still thinks old thoughts
We had two Notion databases feeding into an automated support reply system using OpenAI and Zapier. One held incoming questions, the other held templated responses. Everything was running fine until someone renamed a tag from “Billing – Cancel” to just “Cancel Request” to make it sound less aggressive. It was supposed to be harmless — more accurate, even. But that new label? It nuked half the prompt logic.
The GPT prompt involved a section like this:
If the tag is "Billing - Cancel" then return Response Template #12, otherwise check for Upsell opportunities.
And I had forgotten that the string match on the tag label wasn’t happening in a fuzzy way — it was an exact match. So it just… silently failed. Zapier ran the GPT-4 prompt, didn’t find the expected branching condition, and defaulted to an upsell reply. Cue several confused customers asking why we were offering them gift cards after they said they wanted to cancel.
The fix was simple, but the time lost debugging wasn’t. Renamed labels weren’t even included in the Zapier task history payload until I clicked “Show Raw”.
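What I do now is normalize the label before branching and treat anything unrecognized as a hard stop instead of a silent default to upsell. A minimal Python sketch of that idea; the tag names and template identifiers here are stand-ins, not our real values:

```python
# Minimal sketch: normalize tag labels before branching, and fail loudly
# on anything unrecognized instead of silently falling through to upsell.
# Tag names and template identifiers are illustrative stand-ins.
CANCEL_TAGS = {"billing - cancel", "billing – cancel", "cancel request"}
UPSELL_TAGS = {"upsell interest"}

def pick_template(tag_label: str) -> str:
    normalized = tag_label.strip().lower()
    if normalized in CANCEL_TAGS:
        return "response_template_12"      # the cancel template
    if normalized in UPSELL_TAGS:
        return "response_template_upsell"
    # Unknown label: route to a human instead of defaulting to an upsell reply.
    raise ValueError(f"Unrecognized tag label: {tag_label!r}")

print(pick_template("Cancel Request"))  # -> response_template_12
```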
2. Skipping error handling on label changes breaks every fallback
I’d built a fallback route in Make (Integromat) using a router with three branches: known tag, unknown tag, and empty tag. Renaming a label put it into the unknown tag bucket, which was fine — until someone added an automation that pre-filled empty form fields with default labels. Suddenly, support tickets that should’ve triggered a reply draft were routed into limbo because they looked labeled — but the label was garbage from a UX test tag: “Test Label Five”.
What made it worse: the Make scenario thought it handled the case, because technically the label wasn’t missing. So the Make run said “success”, nothing crashed, everything looked green. But users got nothing. Not even a “we’ve received your message” reply.
“Success” in the scenario doesn’t mean the customer got a response — it just means all paths ran without exceptions.
This led to two entire days of missed SLA alerts because we assumed volume was just low. It wasn’t. We weren’t answering them.
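The scenario now validates labels against an explicit allowlist, so a pre-filled test label lands in the unknown branch rather than counting as handled. A rough sketch of that routing check, with made-up label names:

```python
# Rough sketch: only allowlisted labels count as "known"; anything else,
# including pre-filled junk like "Test Label Five", is treated as unknown.
from typing import Optional

KNOWN_LABELS = {"Cancel Request", "Upsell Interest", "Onboarding"}

def route(label: Optional[str]) -> str:
    if not label or not label.strip():
        return "empty"    # branch 3: empty tag
    if label.strip() in KNOWN_LABELS:
        return "known"    # branch 1: known tag
    return "unknown"      # branch 2: everything else

assert route("Cancel Request") == "known"
assert route("Test Label Five") == "unknown"
assert route("") == "empty"
```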
3. Narrow filters in Zapier get ignored when data shape morphs
Filtering by label-based conditions in Zapier is weirdly brittle, especially with nested data from sources like Intercom or Front. I was using a Formatter step to extract the primary label, then gating replies based on that value. But when the payload shape changed (probably due to an API update or a changed field mapping), the Formatter step failed silently. Instead of one label string, it became an array.
The filter still looked correct in the Zap editor, which hid the actual incoming shape under the hood. You had to expand the Data In → Raw field log to see that it now returned:
"labels": ["Billing", "Cancel Request"]
So instead of “Cancel Request”, I was passing Billing, Cancel Request (as a single string) into the GPT prompt, which killed the matching logic. Surprisingly, GPT didn’t complain — it just hallucinated a new class of replies. A few customers got refund notices they didn’t ask for.
This is one of those quiet failures where your automation slowly stops obeying your intent while looking like it’s still helping.
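The durable fix was to normalize the label field before it reaches any filter or prompt, accepting either a single string or a list. A sketch of that normalization step (the comma-joining behavior is just how it showed up in my Zap, not a documented guarantee):

```python
# Sketch: accept labels as either one string or a list and always hand
# downstream logic a clean list of individual label strings.
def normalize_labels(raw) -> list[str]:
    if raw is None:
        return []
    if isinstance(raw, str):
        # Comma-joined strings like "Billing, Cancel Request" get split apart.
        return [part.strip() for part in raw.split(",") if part.strip()]
    if isinstance(raw, (list, tuple)):
        return [str(item).strip() for item in raw if str(item).strip()]
    return [str(raw).strip()]

print(normalize_labels("Billing, Cancel Request"))       # ['Billing', 'Cancel Request']
print(normalize_labels(["Billing", "Cancel Request"]))   # ['Billing', 'Cancel Request']
```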
4. Better fallback prompts saved me from worse hallucinations
Once I accepted that content categorization will always drift, I added fallback prompts for GPT completions. Instead of hardcoding reply types off exact labels, I started injecting the raw message plus any available tags and asking GPT to classify the user’s intent.
Given the message and the following optional tags: [“Cancel Request”, “Upsell Interest”]
Pick the most likely intent:
- Cancel
- Upgrade
- General Question
- Unknown
There’s still error risk — but it at least shifts things into probabilities rather than fake certainty. One thing that helped: logging every major prompt and its GPT completion verbatim into a Notion table. Not pretty, but it gave visibility into whether intents matched outputs. You start to notice when responses drift subtly off-label.
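In code, the fallback classification looks roughly like this. It is a sketch using the OpenAI Python client; the model name, temperature, and intent list are my choices here, not anything canonical:

```python
# Sketch: classify intent from the raw message plus optional tags, and
# fall back to "Unknown" rather than guessing a reply template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
INTENTS = ["Cancel", "Upgrade", "General Question", "Unknown"]

def classify_intent(message: str, tags: list[str]) -> str:
    prompt = (
        f"Given the message and the following optional tags: {tags}\n"
        f"Message: {message}\n"
        f"Pick the most likely intent from: {', '.join(INTENTS)}.\n"
        "Answer with the intent name only."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = (response.choices[0].message.content or "").strip()
    return answer if answer in INTENTS else "Unknown"
```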
Here’s what started performing better consistently:
- Always base logic on internal tags, not display labels
- Watch for hidden arrays — Zapier loves to convert strings to lists silently
- GPT prompt logs stored verbatim in Notion let you trace failures (see the logging sketch after this list)
- Use GPT to determine intent from message + context instead of label logic
- Add hooks for humans to downgrade or override AI-generated replies
- Don’t treat success logs in Make or Zapier as proof that a message sent
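For the Notion logging bullet above, here is a minimal sketch using the official notion-client library; the token, database ID, and property names are placeholders for whatever your own log table uses:

```python
# Sketch: append each prompt/completion pair to a Notion database so you
# can later check whether intents matched outputs. Property names must
# match the columns in your own log table.
from notion_client import Client

notion = Client(auth="secret_xxx")   # placeholder integration token
LOG_DB_ID = "your-database-id"       # placeholder database ID

def log_completion(intent: str, prompt: str, completion: str) -> None:
    notion.pages.create(
        parent={"database_id": LOG_DB_ID},
        properties={
            "Intent": {"title": [{"text": {"content": intent}}]},
            # Notion caps a single rich text item at 2000 characters.
            "Prompt": {"rich_text": [{"text": {"content": prompt[:2000]}}]},
            "Completion": {"rich_text": [{"text": {"content": completion[:2000]}}]},
        },
    )
```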
5. UI quirks hide whether the AI got the right context
This one bit me during a permissions shift. A teammate duplicated a shared prompt block in Notion into their personal space, which severed the link to the prompt template our GPT-4 call was pulling from via the OpenAI API. Replies started coming back in a super generic tone, clearly missing the context we had formatted in the original block.
But because the reference still resolved (the duplicated version of the block existed), the API never threw. Totally green. Except GPT started using weird phrasing like “As per recent developments” and “We appreciate your valued patronage” out of nowhere. It wasn’t hallucinating — it had just lost the grounding examples we’d spent hours tweaking.
The dead giveaway was this log snippet:
prompt:
"Context: [PERSONAL BLOCK MENTIONED INSTEAD OF TEAM BLOCK]"
Solution: Created a fixed version of the shared prompt block that couldn’t be moved or duplicated without breaking access. Not perfect, but it stopped the context leak. Real lesson: even a successful API call can reference the wrong ghost of your original text.
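One cheap guard I added afterwards: fetch the shared context block by its known ID before each run and refuse to call GPT if it comes back empty. A sketch along those lines, with a placeholder token and block ID:

```python
# Sketch: pull the team context block by its fixed ID and fail loudly if
# it is empty or unreachable, instead of letting GPT run ungrounded.
from notion_client import Client

notion = Client(auth="secret_xxx")        # placeholder integration token
TEAM_CONTEXT_BLOCK_ID = "block-id-here"   # placeholder block ID

def load_team_context() -> str:
    children = notion.blocks.children.list(block_id=TEAM_CONTEXT_BLOCK_ID)
    texts = []
    for block in children["results"]:
        rich = block.get(block["type"], {}).get("rich_text", [])
        texts.append("".join(part["plain_text"] for part in rich))
    context = "\n".join(t for t in texts if t)
    if not context.strip():
        raise RuntimeError("Team context block is empty or unreachable; aborting GPT call")
    return context
```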
6. Prompt tokens miscounted when sending data through Airtable lookup
The original setup used an Airtable lookup to pull canned responses based on categories derived from incoming tags. But since I needed more dynamic behavior, I started attaching sample replies to records as long-form markdown notes (inside a rich text field). When sending that to GPT, the token count looked small in the UI — Airtable only previewed five lines unless you clicked into the expanded field.
Turned out the field contained some very old, very long training text — like 2000 tokens’ worth — buried on purpose by a repl.it intern during a save cycle. GPT started timing out or truncating replies even though everything looked synced. The Airtable API returned the full field, but you couldn’t tell in the UI.
The GPT log said:
{ "error": "context_length_exceeded" }
This kicked me hard because the automation in Make passed that response to a Slack channel saying: “Reply failed – check content alignment”. Nobody thought to check the token count.
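Since then I count tokens with tiktoken before the prompt is ever sent, and flag anything over budget in the same Slack alert. A sketch; the budget number is arbitrary:

```python
# Sketch: measure the prompt with tiktoken and flag oversized payloads
# before calling the API, instead of learning about it from an error.
import tiktoken

MAX_PROMPT_TOKENS = 6000  # arbitrary budget, leaving headroom for the reply

def check_token_budget(prompt: str, model: str = "gpt-4") -> int:
    encoding = tiktoken.encoding_for_model(model)
    n_tokens = len(encoding.encode(prompt))
    if n_tokens > MAX_PROMPT_TOKENS:
        raise ValueError(
            f"Prompt is {n_tokens} tokens, over the {MAX_PROMPT_TOKENS} budget; "
            "check for hidden long-form text in the Airtable field."
        )
    return n_tokens
```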
7. Using category color as a logic trick works until it quietly breaks
This was one of those dumb hacks that worked well for two weeks. Instead of building proper categorization logic, I just used Notion tag color as a decision point. Yellow tags meant billing; red ones meant cancel; blue meant onboarding. Then based on the color string passed in the webhook, I routed to different GPT prompts.
Worked beautifully. Until Notion changed their internal color naming — what showed as “Yellow” in the UI became “Amber” in the API response. That change rolled out silently.
So when a user picked the new Amber tag, the GPT prompt still looked like:
If the tag color is Yellow...
Which now never triggered. Even worse, color names weren’t documented in Notion’s API explorer (at the time), and old cached webhook samples still said “Yellow”. It made for a confusing debugging session where the logs disagreed depending on when they were recorded.
I eventually fixed it by switching to internal tag ID values instead of names or colors. I still check occasionally whether they’ve slipped in a new color.
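The ID-based routing ends up looking something like this. Notion’s API returns each multi-select option with a stable id alongside its name and color, so the keys in the map below are placeholder IDs:

```python
# Sketch: route on the stable option ID Notion returns for a tag, not on
# its display name or color. The IDs below are placeholders.
ROUTES_BY_TAG_ID = {
    "a1b2c3d4": "billing_prompt",
    "e5f6a7b8": "cancel_prompt",
    "c9d0e1f2": "onboarding_prompt",
}

def route_from_tag_property(tag_property: dict) -> str:
    # tag_property is the property payload from the Notion page object, e.g.
    # {"multi_select": [{"id": "...", "name": "Cancel", "color": "yellow"}]}
    for option in tag_property.get("multi_select", []):
        route = ROUTES_BY_TAG_ID.get(option["id"])
        if route:
            return route
    return "human_review"  # unknown tag IDs go to a person, not a guess
```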
8. Renaming custom fields mid-week caused GPT to mix up fields
We had set up a dynamic prompt builder that used Airtable fields like {{user_reason}} and {{plan_type}}. Everything worked fine — until someone renamed “user_reason” to “reason_for_contact” because it felt more readable in the UI. Normal behavior, something you’d expect not to destroy the planet. But yeah, the webhook still sent data using the old field ID — so GPT thought user_reason was blank.
Instead of writing:
“It sounds like you’re looking to cancel because of high pricing.”
it just said:
“It sounds like you’re looking to cancel because of.”
At first glance, it looked like a GPT hallucination. But digging into the payload showed the prompt had an empty {{user_reason}}, not a misparse. The fix was going back into the Airtable column settings, copying the new field ID, and reloading the mapping. There was no warning or error, despite the mapping name being invalid. It just passed an empty string.
It’s like watching a perfectly clean system pass garbage through beautifully formatted pipes.
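The habit I took away: build the prompt from a mapping you validate first, so a renamed field surfaces as a hard error instead of an empty string. A sketch with made-up variable names:

```python
# Sketch: refuse to build a prompt if any placeholder resolved to an
# empty value, so a silently renamed field fails loudly.
TEMPLATE = (
    "It sounds like you're looking to cancel because of {user_reason}. "
    "Your current plan is {plan_type}."
)
REQUIRED_VARS = ("user_reason", "plan_type")

def build_prompt(values: dict[str, str]) -> str:
    missing = [key for key in REQUIRED_VARS if not values.get(key, "").strip()]
    if missing:
        raise ValueError(f"Empty prompt variables: {missing}; check the Airtable field mapping")
    return TEMPLATE.format(**values)

print(build_prompt({"user_reason": "high pricing", "plan_type": "Pro"}))
```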