Fixing Fragile GPT Prompts That Keep Breaking OKR Tracking

1. Rewriting vague OKR descriptions with structured GPT prompts

At some point, someone on the team will type an OKR like: “Drive more user engagement across key segments.” That’s not trackable. That’s not even measurable unless you’re building a vibes dashboard. One thing that made my life slightly easier was using GPT to rephrase these into something formulaic. I’d dump raw bullet points into a prompt like this:

"Turn these vague OKRs into measurable ones. Use a format like 'Increase X metric from Y to Z by date'. Keep subject names intact. Don’t invent data."

That mostly works, until it doesn’t. For example, if someone included a stray date (“by Q2”), the model kept interpreting it as a hard goal deadline, even though we weren’t targeting that quarter anymore. Little things like that break automated tracking because you end up marking goals complete too early.

Fun side issue: for a while, GPT kept replacing product names with its own hallucinated SaaS tools (e.g. turned “Streamline onboarding in PortalX” into “Create smoother sign-up in OnboardHero Pro”). Adding a sentence like “Do not rename product names under any circumstance.” fixed it… most of the time.
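
Since the instruction only holds most of the time, a dumb string check downstream catches the stragglers. A minimal sketch, assuming a hand-maintained whitelist of product names; the names and helper here are made up, not anything Zapier or OpenAI ship:

```typescript
// Guard: make sure GPT kept product names intact after a rewrite.
// PRODUCT_NAMES is a whitelist you'd maintain yourself; this one is an example.
const PRODUCT_NAMES = ["PortalX"];

function keepsProductNames(input: string, output: string): boolean {
  // Every product mentioned in the input must appear verbatim in the output.
  const mentioned = PRODUCT_NAMES.filter((name) => input.includes(name));
  return mentioned.every((name) => output.includes(name));
}

// Example: this rewrite gets flagged instead of flowing into tracking.
const raw = "Streamline onboarding in PortalX";
const rewritten = "Create smoother sign-up in OnboardHero Pro";
if (!keepsProductNames(raw, rewritten)) {
  console.warn("GPT renamed a product; holding this OKR for review");
}
```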

Also, longer inputs made the model quietly bail partway through, especially once the list had more than five bullets. It’d give you three rewritten OKRs, then stop. No error, just vibes. So now I send chunks of 3–4 OKRs at a time, using a looping sub-Zap (Looping by Zapier) with GPT and Airtable update steps.
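
If you ever move the chunking out of the sub-Zap into a script, it’s only a few lines. A rough sketch, assuming a hypothetical callGPT() wrapper around whatever completions call you already make:

```typescript
// Split raw OKR bullets into chunks of 3 before sending them to GPT.
// callGPT() is a stand-in for your own wrapper around the chat completions call.
const PROMPT =
  "Turn these vague OKRs into measurable ones. Use a format like " +
  "'Increase X metric from Y to Z by date'. Keep subject names intact. Don't invent data.";

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function rewriteOkrs(bullets: string[], callGPT: (p: string) => Promise<string>) {
  const results: string[] = [];
  for (const batch of chunk(bullets, 3)) {
    // Keeping batches small avoids the silent truncation on long lists.
    const reply = await callGPT(`${PROMPT}\n\n${batch.join("\n")}`);
    results.push(reply);
  }
  return results;
}
```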

2. Extracting actual metrics from fluffy GPT reformatted goals

OK, even when GPT gives you usable goals (“Increase NPS from 42 to 55”), you still need the numbers. This part shouldn’t be hard. But the model, when asked to “extract the metric and its target as a JSON object,” absolutely refused to output reliably parsable JSON. Quotes missing. Extra commas. Sometimes the model dropped the starting “{”… just went straight into keys.

A fix I found buried in a forum was this: ask for markdown-formatted JSON enclosed in triple backticks. And in your Zap (or script), regex parse whatever sits between the triple-backticks as a string before trying to treat it like JSON. Dirty, but works.

GPT Output (Actual):
```json
{metric: "NPS", current: 42, target: 55}
```

See it? The keys aren’t quoted. You can’t JSON.parse() that unless you pre-clean it. I ended up using Make.com’s route path with a regex block specifically to add quotes, then push that cleaned object to Notion rows where we visualize target vs. progress. Airtable was fine too, but I got tired of setting up calculated fields for every new goal.
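
If you’d rather do the cleanup in a script than in a Make regex block, the whole thing fits in one function. A sketch of the general idea, not the exact regex I run: pull whatever sits inside the backticks, straighten smart quotes, quote bare keys, drop trailing commas, then parse:

```typescript
// Pull the fenced block out of a GPT reply and coerce it into parseable JSON.
// The regex uses `{3} rather than three literal backticks so it doesn't
// collide with the code fence around this snippet.
const FENCED = /`{3}(?:json)?\s*([\s\S]*?)`{3}/;

function extractJson(reply: string): Record<string, unknown> | null {
  const match = reply.match(FENCED);
  if (!match) return null;

  const cleaned = match[1]
    .replace(/[\u201C\u201D]/g, '"') // smart double quotes -> straight
    .replace(/[\u2018\u2019]/g, "'") // smart single quotes -> straight
    .replace(/(['"])?([A-Za-z_][A-Za-z0-9_]*)\1?\s*:/g, '"$2":') // quote bare keys
    .replace(/,\s*([}\]])/g, "$1");  // drop trailing commas

  try {
    return JSON.parse(cleaned);
  } catch {
    return null; // still garbage: route it to a review queue, not Notion
  }
}

// Given the reply shown above, this returns { metric: "NPS", current: 42, target: 55 }.
```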

Once, I had GPT throw in a full paragraph inside the triple backticks — like an essay on NPS strategy, not even data — because one upstream prompt said: “Include context where relevant.”

3. Handling quarterly OKR versions triggered by recursive scheduling

I had this Zap that re-generates OKRs each quarter, based on the previous quarter’s completions, with GPT rewriting the next versions. It runs on a Schedule trigger plus an Airtable filtered view. One day it fired twice. Then twice again. Blamed Airtable. Nope: the issue was that Zapier’s caching didn’t reset when I bulk-edited records to change end dates, which briefly made two filters match within the same run window.

The GPT part was whatever. But the downstream effect? I got two entries called “Expand LATAM market share to 30%+” and “Expand LATAM market share to 33%” on the same day. The second run had picked up auto-incremented values from a previous GPT formatting step that wasn’t meant to update inside a loop.

Eventually I isolated it to a misused Formatter step: I had a date-diff calculation that defaulted to 0 whenever the comparison failed. GPT saw a 0 and read it as “no progress last quarter,” so it aggressively updated the metric. Now I run a Make scenario earlier in the process to flatten all the date logic and only let GPT see the final cleaned fields.
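
The heart of that fix is refusing to let a failed calculation masquerade as a real zero. A sketch of the guard in TypeScript rather than Make’s own formula syntax; the field names are invented:

```typescript
// Don't let a failed lookup default to 0 -- a 0 reads as "no progress" to GPT.
// Field names here are illustrative, not from any real base.
interface QuarterSnapshot {
  metricValue: number | null; // null when the record or field was missing
  capturedAt: string | null;  // ISO date, or null if the comparison failed
}

function progressDelta(prev: QuarterSnapshot, curr: QuarterSnapshot): number | null {
  if (prev.metricValue == null || curr.metricValue == null) return null;
  if (!prev.capturedAt || !curr.capturedAt) return null;
  return curr.metricValue - prev.metricValue;
}

// Only pass the field to GPT when the delta is real; otherwise say so explicitly.
function deltaForPrompt(prev: QuarterSnapshot, curr: QuarterSnapshot): string {
  const delta = progressDelta(prev, curr);
  return delta == null ? "progress data unavailable for last quarter" : String(delta);
}
```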

4. Filtering GPT outputs before pushing into Airtable or Notion

Don’t trust the output — validate before ingest

Learned this the hard way the moment GPT decided that “Drive innovation across teams” was a quantifiable OKR and made up a metric called “Team Synergy Index”. It was fake. It looked great. People believed it. Until someone asked, “Where does this number come from?”

Now I do a quick check before anything hits Airtable or Notion, usually with a webhook into a tiny Node script or a Make iterator. The checks (a sketch of the script follows the list):

  • If metric name is a known column in our DB or metrics schema
  • If any values include non-numeric data in supposed numeric slots
  • Regex: Reject any metric name with more than 4 consecutive capitalized words
  • Flag if target > 5000% increase compared to current
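
Roughly what that script does, in TypeScript. The known-metric list and thresholds are placeholders for whatever your own schema defines:

```typescript
// Rough version of the pre-ingest checks; KNOWN_METRICS is whatever your
// own metrics schema exposes -- these names are examples.
const KNOWN_METRICS = new Set(["NPS", "Churn", "LATAM market share"]);
const SHOUTY_NAME = /(?:\b[A-Z][\w-]*\s+){4,}[A-Z][\w-]*/; // 5+ consecutive capitalized words

interface CandidateOkr {
  metric: string;
  current: unknown; // values come straight from GPT, so don't trust the types
  target: unknown;
}

function validate(okr: CandidateOkr): string[] {
  const problems: string[] = [];
  if (!KNOWN_METRICS.has(okr.metric)) problems.push("unknown metric name");
  if (typeof okr.current !== "number" || typeof okr.target !== "number") {
    problems.push("non-numeric value in a numeric slot");
  } else if (okr.current !== 0 && (okr.target - okr.current) / Math.abs(okr.current) > 50) {
    problems.push("target is more than a 5000% increase over current");
  }
  if (SHOUTY_NAME.test(okr.metric)) problems.push("metric name looks hallucinated");
  return problems; // empty array means it's safe to push to Airtable or Notion
}
```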

That third one caught some wild hallucinations. GPT once gave me: “Unified Cross-Team Alignment Momentum Metric” as the KPI. Looked legit. Had a number. Didn’t exist.

We auto-suppress those so they never push into the tools with visualizations. I only realized this needed more enforcement after a team planning session showed a dashboard with one completion bar hilariously stretching way past the rest, all from a hallucinated baseline.

5. Instruction tuning the prompt chain for consistent quarterly updates

One thing I didn’t expect: GPT output quality drifts over time in your own use case unless you lock your prompts down. I had hand-tuned a nice little chain that fed in the current OKR, last quarter’s status, and the latest numbers, then asked GPT to suggest the next version of the key result.

Ran great for six weeks. Then suddenly, instead of tightening targets by a reasonable margin, it started recommending targets that moved the wrong way. It went from “Reduce churn from 4% to 3%” to “Reduce churn from 4% to 8%”, as if we wanted churn to go up.

I traced it back to a small edit I made in the helper prompt: I added the word “trend-aware”, thinking it would make GPT weigh recent changes more realistically. It worked too literally. When it detected an upward trend in churn, it assumed we were embracing it. I had to revert to templated prompts with exact wording tokens and re-copy them every time we re-provision the Zap.

Stupid thing is: there’s no one-click way to version your prompt chains in Zapier or Make. You either store them in Google Docs and copy-paste, or embed hidden JSON blocks with prompt versions inside Airtable cells. I do the latter, and log final outputs into another field, just to compare how outputs from the same prompt drift over time.
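
For what it’s worth, the hidden JSON block is nothing fancier than this shape; the field names are mine, not an Airtable feature:

```typescript
// Shape of the prompt-version blob stashed in a long-text Airtable cell.
// Field names are illustrative; the point is a version tag plus the exact wording.
interface PromptVersion {
  version: string;   // e.g. "q3-v2"
  template: string;  // the exact prompt text, wording tokens and all
  updatedAt: string; // ISO timestamp of the last edit
  notes?: string;    // why the wording changed
}

const currentPrompt: PromptVersion = {
  version: "q3-v2",
  template: "Turn these vague OKRs into measurable ones. Use a format like ...",
  updatedAt: new Date().toISOString(),
  notes: "reverted the 'trend-aware' experiment",
};

// Serialize into the cell on provision; parse it back out before every run,
// and log the final GPT output into a sibling field for comparison.
const cellValue = JSON.stringify(currentPrompt);
```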

6. Logging GPT input and output pairs for traceable rollbacks

This was the missing piece when a VP asked, “Wait, why did our Q2 goal say 270% growth?” I couldn’t remember. And GPT doesn’t log unless you make it.

Added a Make router that captures both sides of each GPT call, the raw input prompt and the exact output text, and logs timestamped records right back into a hidden table inside Airtable. Basically (a script-level sketch follows the list):

  • Scenario triggers on goal creation or update
  • Captures all preprocessed prompt inputs, including current metric baselines
  • Calls GPT and stores both result and any errors
  • Appends everything into a logging base with autogenerated version ID
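
If the logging ever moves out of Make into a script, the same idea is a thin wrapper around Airtable’s REST records endpoint. A sketch; the base ID, table, and field names are placeholders:

```typescript
// Thin logging wrapper: record both sides of every GPT call in a hidden table.
// AIRTABLE_* values and field names are placeholders for your own base.
const AIRTABLE_BASE = "appXXXXXXXXXXXXXX";
const AIRTABLE_TABLE = "GPT Log";
const AIRTABLE_TOKEN = process.env.AIRTABLE_TOKEN!;

async function logGptCall(prompt: string, output: string, error?: string) {
  const versionId = `${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
  await fetch(
    `https://api.airtable.com/v0/${AIRTABLE_BASE}/${encodeURIComponent(AIRTABLE_TABLE)}`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${AIRTABLE_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        records: [
          {
            fields: {
              "Version ID": versionId,
              "Prompt": prompt,
              "Output": output,
              "Error": error ?? "",
              "Logged At": new Date().toISOString(),
            },
          },
        ],
      }),
    }
  );
  return versionId; // stamp this onto the goal record so rollbacks are traceable
}
```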

I also added GPT-generated comments (“Reason for suggested update”) — but those were even more unhinged than the outputs. One said: “Because alignment is destiny.” We don’t log those anymore.

Anyway, this saved me more than once. Especially when a goal progress bar in Notion slid backwards and I got a Slack message at 8am saying “wtf happened here.”

7. Breaking changes when GPT version updates without notice

Weird thing about depending on GPT for OKRs: you don’t own the model behavior. At one point, mid-March, the model started responding differently to the "Rephrase using a consistent format" prompt: suddenly using past tense, and dropping the quantifiable target (“We worked to improve engagement” vs. “Improve engagement by 15%”).

Turns out OpenAI had pushed an update to GPT-4 that slightly altered system behavior for chat completions. No warning. Broke the formatting-based matchers I had in Notion automation scripts that compared OKRs to historical records.

Nothing official showed up, but someone on a Discord thread mentioned that model response length behavior had shifted. My own logs backed it up — prompts that used to return ~160 tokens were now closer to ~110. Clipped outputs, cutoff goals, and missing numerical values across the board.

I hardcoded a fallback: if the output string is under 100 characters, force a retry with a longer, more heavily supervised prompt. Not ideal, burns tokens, but it caught about 80% of the nonsense. It still happens sometimes that the result is truncated right after “Improve revenue by” and then… nothing.
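
The fallback itself is nothing more than a length check wrapped around the call. A sketch, using the same hypothetical callGPT() wrapper as earlier:

```typescript
// Retry once with a more explicit prompt when the output looks truncated.
// callGPT() is a stand-in for your own completion wrapper.
const MIN_LENGTH = 100;

async function rewriteWithFallback(
  prompt: string,
  fallbackPrompt: string,
  callGPT: (p: string) => Promise<string>
): Promise<string> {
  const first = await callGPT(prompt);
  if (first.trim().length >= MIN_LENGTH) return first;

  // Suspiciously short: likely clipped mid-sentence ("Improve revenue by" ... nothing).
  const second = await callGPT(fallbackPrompt);
  return second.trim().length >= MIN_LENGTH ? second : first; // give up after one retry
}
```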