Prompt Engineering Pitfalls in UX Research Auto-Summarizers

1. Why most UX research prompts end up summarizing the wrong thing

You run a user interview, feed the transcript into ChatGPT, hit enter—and it spits out three bullet points about button placement and colors. Meanwhile, the user spent five minutes explaining how lost they felt during onboarding. That’s the failure point: most prompt frameworks treat UX summaries like news recaps. The structure favors recency and frequency, not gravity.

I spent way too long trying to structure summaries with bullets like Key Insights, Feature Requests, and Frustrations, hoping LLMs would sort observations into tidy boxes. That backfired hard when I had three transcripts return nearly identical summaries. Turns out the model was just pattern-matching phrasings (“I didn’t like,” “I wish,” “it was confusing”) and ignoring any mixed signals. Nuance? Flattened. Contradictions? Muted.

Here’s what visibly broke it: if a user said something like “It wasn’t super confusing, but also wasn’t clear,” it never landed in the “Frustration” bucket. The LLM got paralyzed by hedge words and skipped it entirely. That phrasing shows up a lot in moderated interviews. Stack 10 of those and your summary ends up weirdly positive. I actually side-eyed my own study at first thinking, Oh cool, maybe onboarding is fine now. (It was not.)

So yeah, prompt frameworks that rely on fixed output categories will quietly fail by omission—the worst kind of bug.

2. Trying to force structure with JSON hallucinations and keyword traps

At some point I decided to fight LLM entropy with JSON order. I wrote a prompt like this:

"For the following transcript, return a JSON object with three keys: \"Problems\", \"Quotes\", and \"Suggestions\". Nest each key with an array of items derived from the text. Do not fabricate entries."

Seemed elegant. Until I tested it on a seven-minute transcript where the user talked about just one core problem. The model still invented two more filler items to complete the array. It even pulled in off-topic comments like “I liked the designer’s voice” in the Problems list. The tone fit, but the content didn’t.

Turns out models prefer obeying structure to telling the truth, especially when every example you give has the same number of entries. I never specified how many items each array should hold, but the model inferred a fixed-size schema anyway because it assumed that’s what I wanted. Just because the output looks like valid JSON doesn’t mean anything inside it actually happened in the session.
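
One way to keep the structure without letting it invent content: tell the model outright that empty arrays are acceptable, then validate the output instead of trusting it. A minimal post-processing sketch, assuming a prompt like the one above plus an added line such as “If a category has no supporting evidence, return an empty array for it” (the function name and checks are mine):

    import json

    def clean_structured_summary(raw_output: str, transcript: str) -> dict:
        """Keep the JSON structure, but stop structure from outranking truth."""
        data = json.loads(raw_output)  # fail loudly if the model returned non-JSON
        lowered = transcript.lower()

        # Quotes must appear verbatim in the transcript; anything the model
        # "remembered" that is not actually there gets dropped.
        data["Quotes"] = [
            q for q in data.get("Quotes", [])
            if q.strip().strip('"').lower() in lowered
        ]

        # Problems and Suggestions are paraphrases, so they cannot be
        # string-matched, but whitespace-only padding can still go.
        for key in ("Problems", "Suggestions"):
            data[key] = [item for item in data.get(key, []) if item.strip()]

        return data

It won’t catch a paraphrased Problem the user never raised, but it reliably kills the padded third bullet.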

Also worth noting: if the user says something super important early—with no repeat mention—it often gets lost unless you explicitly prime the LLM to weigh time-based patterns. I had to inject a ridiculous clarification: “Earlier statements may signal root issues—do not preference recency.” Even then, it only worked half the time.

3. Real-world friction when team members read the generated summaries

We ran a few test summaries through our research repo in Notion. A few engineers skimmed them and said “seems fine” without clicking through to the full transcript. That’s when I realized the summaries were becoming UX gaslighting loops. They conveyed tone, not weight. One even included the sentence “Users generally found the interface intuitive.” Nobody said that out loud—it was inferred from a lack of screaming. Not helpful.

One PM actually added a Jira ticket based on a paraphrased quote that never existed. The model condensed two adjacent phrases into a quote that sounded real but was subtly different—and way more optimistic. When I dug back into the original transcript, it was nowhere to be found. Like, not even close.
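
A cheap guardrail for exactly this failure: pull every quoted span out of the generated summary and fuzzy-match it against the transcript before anyone reads it. A sketch, assuming the summary wraps user quotes in double quotation marks; the 15-character minimum and 0.8 cutoff are guesses, not tuned values:

    import difflib
    import re

    def suspicious_quotes(summary: str, transcript: str, cutoff: float = 0.8):
        """Flag quoted spans in a summary that nothing in the transcript backs up."""
        # Normalize curly quotes so the regex below sees them.
        summary = summary.replace("\u201c", '"').replace("\u201d", '"')
        transcript_lower = transcript.lower()
        lines = [l.strip().lower() for l in transcript.splitlines() if l.strip()]

        flagged = []
        for quote in re.findall(r'"([^"]{15,})"', summary):
            q = quote.strip().lower()
            if q in transcript_lower:
                continue  # verbatim match, nothing to worry about
            if not difflib.get_close_matches(q, lines, n=1, cutoff=cutoff):
                flagged.append(quote)  # nothing in the session comes close
        return flagged

Anything flagged goes back to a human before it turns into a Jira ticket.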

So now we prepend every autogenerated summary with a red banner: “This is an AI-assisted summary. Always verify core insights.” Not elegant, but it stops people from over-trusting a sentence that’s grammatically perfect but contextually wrong.

4. What changed when switching from task-based to persona-based prompts

Eventually I switched the prompt format to take the perspective of the user rather than the observer. Instead of asking the model to label what was said, I started asking:

“Reconstruct the user’s internal experience during this session. What were they trying to do, and what influenced their emotional state as they navigated the product?”

That single shift changed everything. Suddenly the summaries picked up on goal confusion mid-session. One described the user hesitating three times before clicking something, a moment that never showed up in the old bullet-style output. But when the framing was “what is this user trying to accomplish,” the model zeroed in. Way more signal, way less fluff.

Still not perfect. Sometimes it over-theorizes and invents intentions. Like, a user might say “I guess I’ll try this…” and the model reads that as high confidence. So I had to start injecting qualifiers like “Do not make assumptions not grounded in quotes—prioritize observed actions.” That helped about half the time. I eventually added a system message to run before the user prompt just to keep things consistent across runs:

"You are a UX researcher assistant. You do not infer user motivation unless it is explicitly stated or strongly implied by sequential actions. Be concrete."

Yes, it’s overkill. But so is cleaning up hallucinated summaries across twenty sessions.
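
For what it’s worth, wiring the system message and the persona prompt together is only a few lines. A sketch assuming the OpenAI Python SDK, with the model name as a placeholder:

    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    SYSTEM_MSG = (
        "You are a UX researcher assistant. You do not infer user motivation "
        "unless it is explicitly stated or strongly implied by sequential "
        "actions. Be concrete."
    )

    PERSONA_PROMPT = (
        "Reconstruct the user's internal experience during this session. "
        "What were they trying to do, and what influenced their emotional "
        "state as they navigated the product?\n\nTranscript:\n{transcript}"
    )

    def summarize_session(transcript: str, model: str = "gpt-4o") -> str:
        """One call per session, so the system message applies on every run."""
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # helps with run-to-run drift; see section 6
            messages=[
                {"role": "system", "content": SYSTEM_MSG},
                {"role": "user", "content": PERSONA_PROMPT.format(transcript=transcript)},
            ],
        )
        return response.choices[0].message.content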

5. When GPT refuses to summarize because your transcript is too polite

This is probably the most ridiculous failure mode I hit: I ran transcripts from a moderated usability study—textbook boring, lots of “Sounds good,” “Yep,” “Okay sure”—and GPT-4 refused to summarize anything critical. The tone was just… too polite. The user clearly had friction but wasn’t voicing it directly. The transcript read clean. Too clean.

I even got this gem once:

"This user found the interface satisfactory and had no major complaints."

Except at 16:23 they literally tried clicking the same button three times and said, “Not sure what’s happening here.” But because they wrapped that in polite fillers, GPT didn’t flag it. I had to prompt with:

"Highlight all moments of hesitation, repetition, or indirect confusion regardless of user tone."

After that, boom—it caught the triple-click breakdown. So now that line is permanently in my snippet bank. If someone says “I guess…?” three times, and your model doesn’t flag it, you’re just summarizing vibes.
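
That prompt works even better with a deterministic pre-pass that finds the polite-confusion markers first, so the model gets handed the exact moments it keeps skipping. A rough sketch; the phrase list is mine and deliberately incomplete:

    import re

    # Markers that tend to signal indirect confusion in otherwise polite sessions.
    HEDGE_PATTERNS = [
        r"\bi guess\b",
        r"\bnot sure\b",
        r"\bi think\b",
        r"\bkind of\b",
        r"\bwasn'?t (super|really|that) \w+, but\b",
    ]

    def flag_hesitation(transcript_lines):
        """Return lines to inject into the prompt as 'address each of these'."""
        return [
            line for line in transcript_lines
            if any(re.search(p, line, re.IGNORECASE) for p in HEDGE_PATTERNS)
        ]

Paste the flagged lines under the prompt above so the model has to speak to each one explicitly.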

6. Dealing with repeating the same prompt and getting different outputs

This one’s deceptively basic: you run the same prompt on the same transcript, and get two different summaries. Not just rearranged—completely divergent weightings. One says “Main pain point: dashboard flow.” Another says “Users appreciated the dashboard.” No in-between. This is not a creativity feature. It’s a stability flaw.

And yeah, setting temperature to 0 helped a bit, but didn’t solve it entirely. Even at zero temperature the outputs aren’t fully deterministic, and small differences in phrasing get re-interpreted from run to run. One time I duct-taped the session with

{"log_id": "ux2024_07a", "sequence": 3, "variant": "B"}

…hardcoded into the top of each prompt block. Not because the model used it, but so I could track which runs were hallucinating tone positivity based on transcript order. Eventually I split long transcripts and processed chunks with overlapping timestamps to reduce fragmentation, which noticeably reduced run-to-run variance.
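
The chunking itself doesn’t need to be clever. A sketch, assuming the transcript arrives as a list of (start, end, text) segments with times in seconds; the five-minute window and 15-second overlap are arbitrary defaults:

    def chunk_segments(segments, chunk_seconds=300, overlap_seconds=15):
        """Split timestamped segments into windows with a small overlap so
        moments that straddle a boundary stay visible to both chunks."""
        if not segments:
            return []
        chunks, window_start = [], 0.0
        session_end = segments[-1][1]
        while window_start < session_end:
            window_end = window_start + chunk_seconds
            window = [s for s in segments if s[0] < window_end and s[1] > window_start]
            if window:
                chunks.append(window)
            window_start = window_end - overlap_seconds
        return chunks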

If you’re not diffing your LLM summaries across runs, you’re just assuming probabilistic output is consistent. It’s not.
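
The diff itself can be dumb; a quick difflib ratio is enough to know when two runs disagree badly enough that a human should look (the 0.85 cutoff is arbitrary):

    import difflib

    def run_similarity(summary_a: str, summary_b: str, cutoff: float = 0.85) -> float:
        """Compare two summaries generated from the same transcript."""
        ratio = difflib.SequenceMatcher(None, summary_a, summary_b).ratio()
        if ratio < cutoff:
            print(f"[warn] run-to-run similarity {ratio:.2f}; review before sharing")
        return ratio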

7. Seven things that actually improved summary reliability inside workflows

After rebuilding this pipeline half a dozen times (shoutout to my brittle Zapier chain triggering twice on Airtable status change), these are the knobs that worked:

  • Start all prompts with a unique session identifier and explicit user role context
  • Disable summary autocorrect features in whatever model or wrapper you’re using
  • Preprocess transcripts to retain filler words and incomplete sentences—don’t clean too much
  • Add chunk overlap of 10–15 seconds when breaking long interviews
  • Label clicks, pauses, and non-verbal cues in square brackets to aid inference
  • Use persona-aware prompting with grounded tasks, not research objectives
  • Always prompt for a confidence level per insight; it forces hedging instead of flat statements (see the sketch after this list)
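
As a rough illustration, here is how the first and last knobs can be combined into a single prompt header; the wording and names are mine, not a standard:

    # Hypothetical prompt suffix; tune the wording to your own repo conventions.
    CONFIDENCE_SUFFIX = (
        "For each insight, add a confidence rating of high, medium, or low, "
        "plus the timestamp or verbatim quote that supports it. If an insight "
        "has no direct support in the transcript, label it 'inferred'."
    )

    def build_prompt(base_prompt: str, session_id: str, participant_role: str) -> str:
        """Prepend session and role context, append the confidence requirement."""
        header = f"[session: {session_id}] [participant role: {participant_role}]\n\n"
        return header + base_prompt + "\n\n" + CONFIDENCE_SUFFIX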

This might sound like overengineering. But when your PM reads a weekly rollup with made-up optimism, you start to care about where the hallucinations creep in.

8. The unfixable bug hiding inside voice transcript models

I thought I was being clever by using Whisper to transcribe our sessions automatically. Worked great for twelve files. Then a single interview tanked—completely mis-transcribed a non-native speaker’s answer into something that sounded generically positive. The user said something like “I kept clicking, but it wouldn’t do what I wanted,” and Whisper garbled it into “It kept working like I expected.”

It read like a happy user. The LLM echoed that vibe. No flags were raised.

Only caught it when we re-listened manually for a highlight reel. So now we don’t feed raw Whisper transcripts into models. Every auto-transcribed session first goes through a quick manual markup—a ten-minute pass to fix tone-critical errors. Because if the input lies, nothing downstream will save you.
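
If you want that markup pass to start in the right places, Whisper’s per-segment confidence fields are a decent tripwire. A sketch using the open-source whisper package; the thresholds are rough guesses, not calibrated values:

    import whisper

    def transcribe_with_flags(path: str, logprob_floor: float = -1.0,
                              no_speech_ceiling: float = 0.5):
        """Transcribe a session and list the segments worth re-listening to."""
        model = whisper.load_model("medium")
        result = model.transcribe(path)
        flagged = [
            (seg["start"], seg["end"], seg["text"])
            for seg in result["segments"]
            if seg["avg_logprob"] < logprob_floor
            or seg["no_speech_prob"] > no_speech_ceiling
        ]
        return result["text"], flagged

Low-confidence segments won’t catch every confidently wrong transcription, but they tell the ten-minute pass where to listen first.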