AI x Customer Research - August '25
The GPT-5 Issue | What’s better, what’s not, what to do with Agent Mode + a study on context windows.

Read time: 20 minutes
I had this edition all mapped out and a few tests already run. But…OpenAI just had to launch a little thing called GPT-5. 🙄
I took this month to run as many mini tests as I could fit around client workshops.
This edition is a summary of a lot of exploring this month:
What I think is actually better in GPT-5 so far, what still needs babysitting, an example of an Agent Mode workflow for research, and a study that helps us understand whether GPT-5’s larger context window is a good thing or not.
Let’s dive in —
In this edition:
⚛️ An Agent Mode use case: A workflow worth using an Agent for in customer research, with example results.
🦾 GPT-5 Tests + Results: What’s improved, what’s not, what to think about.
👩🔬 A Study: How increasing input tokens impacts LLM performance (and what it means for customer research).
The list looks short, but there’s a ton to dig into below -
WORKFLOW UPGRADES
⚛️ An Agent Mode Use Case
What is Agent Mode?
A toggle in ChatGPT that lets you build assistants that run multi-step tasks - linking various tools, accessing memory, and behaving with a bit more logic and persistence than a regular chat. Where you find it 👇

That’s the idea, anyway. The truth is, a good Agent still requires you to be really clear about what you want it to do - down to the precise steps, the tools, instructions for how to use them, and how you’ll give ChatGPT access to those tools.
〰️
My “agent” workflow - What I asked it to do:
I pretended to work for the fitness tech company Whoop, and said:
Collect public reviews for Whoop (from Reddit, Trustpilot, blogs, etc.)
Log every individual comment in a Google Sheet (live Drive doc, not a CSV)
Label and categorize all feedback using inductive reasoning in the Sheet
Synthesize key complaints that negatively impact retention + recommendation
Put all findings into Google Slides with:
Verbatim quotes and sources
Detailed analysis of emerging patterns
Design matching a provided template/guidelines
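To make the “log every individual comment” step concrete, here’s a sketch of one possible per-comment row structure (the field names are my own illustration, not a requirement of Agent Mode or of the Sheet I actually used):

```python
# One row per individual comment, so labels and quotes stay traceable to sources.
row = {
    "source": "Reddit",                      # where the comment was found
    "url": "https://example.com/thread",     # placeholder URL
    "quote": "Strap broke after a month.",   # the verbatim comment
    "category": "hardware durability",       # inductive label, added in a later pass
    "sentiment": "negative",
}
print(list(row.keys()))
# → ['source', 'url', 'quote', 'category', 'sentiment']
```

Keeping one comment per row (rather than pre-summarized themes) is what makes the later inductive labeling and verbatim-quote slides possible.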
Verdict: This worked pretty well. After a few rounds of refining the prompts, the result was good enough that I’d definitely use this workflow for slide creation from desk research next time (in a real scenario). I can tell it to go do this for me like an intern and work on something else while it runs (see video below).
〰️
Here’s some of the Slides process + results -
Resulting slides from mid-test - they need a little work, but it’s a good start. With 15 slides like the one on the right here, it saved me a lot of copy-pasting. Font is correct (Inter), colors are correct once I sent the HEX codes, logo and other details were added to the template.
What’s worth considering:
It took Agent Mode much longer to turn the review-mining findings into slides than a specialty AI slide tool (Beautiful AI, Gamma, etc.) takes to create slides from a findings doc. But if your team already uses Enterprise-level ChatGPT, adding another AI slide-maker increases your data-privacy exposure - you’re putting customer data and possibly confidential findings into yet another tool.
🦾 GPT-5 Tests + Results: What’s Actually Improved?
Here are some fast results I got from hands-on testing across a bunch of typical research tasks:
🟡 Better accuracy… kinda.
It’s less hallucination-prone in general - but still unreliable with verbatim quotes. If you’re summarizing interviews or trying to lift exact language from transcripts, expect to double-check and be persistent in your prompts - e.g. “verbatim quotes ONLY!”.
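One way to make the double-checking less painful (a minimal sketch of my own, not a GPT-5 feature): after the model returns “verbatim” quotes, programmatically verify each one against the original transcript, flagging anything it paraphrased.

```python
import re

def verify_quotes(quotes, transcript):
    """Return quotes the model claimed were verbatim but that
    don't actually appear in the transcript (whitespace-normalized)."""
    normalize = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    clean_transcript = normalize(transcript)
    return [q for q in quotes if normalize(q) not in clean_transcript]

transcript = "I love the sleep tracking, but the strap irritates my skin after a week."
quotes = [
    "the strap irritates my skin",   # real
    "the strap ruined my skin",      # paraphrased / hallucinated
]
print(verify_quotes(quotes, transcript))
# → ['the strap ruined my skin']
```

Anything this returns needs a manual look before it goes in a report.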

〰️
🟡 Math logic is finally saner
GPT-5 now consistently uses Python behind the scenes for math tasks, without needing to be told to. That annoying habit of “reasoning out” math in natural language (which I mentioned in June) - I haven’t seen it happen in GPT-5.
I tested this with basic survey calculations and custom metrics from my synthetic data where I’ve seen previous GPT models, Claude and Gemini make funny math mistakes. GPT-5 handled calculations well as long as the prompt was clear. Messy requests = messy math. Still true.
〰️
🟡 Long, multi-step tasks: roughly the same (you need to be structured)
GPT-5 is supposed to be able to handle a long sequence of steps with a little less hand-holding in certain situations. For example, it’s not supposed to forget the earlier parts of a prompt chain as easily, and should remember context better.
I’ve seen this hold true in some of my step-by-step workflows in a normal chat window. But with Agent Mode, I’ve seen GPT-5 completely drop the ball.
Ex: In my prompt #2: “…put the findings from the review mining into Google Slides.”
(5 minutes later) ChatGPT: “Here’s your PowerPoint presentation. You can download it here [link].” 👿
What can we take from this? Structure, clarity and breaking up complex tasks into smaller chunks are still required. Vague instructions or too-complex workflows still derail it.
〰️
🟡 Big files - Seemingly less of a problem
I’ve tested heavily with two synthetic datasets from Kaggle:
20,000 rows of quant + geo data (partial dataset)
12,000 rows of data with open-text feedback
No issues so far. GPT-5 handled both with full retention of earlier prompts and instructions - even across multi-turn follow-up queries. No upload issues experienced, no missing or inaccurately retrieved data.
What this means:
This potentially opens up more ambitious workflows for cleaning, segmenting, or running hybrid qual/quant analysis without breaking the tool or your brain. I’m getting more hopeful on this front, even though there’s still a lot to think about in terms of how much content you upload in one go (see the study below 👇)…
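Even so, the pattern I’d still lean on for those more ambitious workflows (a sketch of my own practice, not an OpenAI recommendation): split open-text feedback into fixed-size batches and process them one prompt at a time, rather than one mega-upload.

```python
def batch_rows(rows, batch_size=500):
    """Yield fixed-size batches so each prompt stays well under the context limit."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

rows = [f"feedback #{n}" for n in range(12_000)]
batches = list(batch_rows(rows))
print(len(batches), len(batches[0]))  # → 24 500
```

Each batch gets its own analysis pass, and you synthesize across batch outputs at the end - slower, but far easier to audit than one giant prompt.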
AI STUDIES
👩🔬 How Increasing Input Tokens Impacts LLM Performance
TL;DR - As you add more text, reliability often drops - especially with semantic questions, and when key evidence sits in the middle of the input. Bigger windows let you load more info, but they don’t guarantee the model processes it accurately.
What they tested -
18 models, controlled long-context tasks
They made the inputs longer, moved the answer earlier or later, and added extra lines that looked right but weren’t.
Why it matters for research -
Dumping entire interview banks into one prompt can hurt reliability - you’re far more likely to get plausible-but-wrong answers and missed references.
⚠️ Why this is in the GPT-5 issue -
Some people are excited about the larger context window in GPT-5. But a larger context means more capacity, not more accuracy. A larger window lets you paste or upload more text; it doesn’t guarantee better answers. Use the extra space to organize and curate, not to dump everything in and hope GPT-5 is smart enough to pull out the right things on its own (hint: it won’t).
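To make “curate, don’t dump” concrete, here’s a minimal sketch (my own illustration - the keyword filter is a stand-in for whatever relevance check you actually use) that pulls only the excerpts related to the question at hand before building the prompt:

```python
def curate(excerpts, topic_terms, limit=20):
    """Keep only excerpts that mention the topic, instead of pasting everything."""
    hits = [e for e in excerpts if any(t in e.lower() for t in topic_terms)]
    return hits[:limit]

excerpts = [
    "Battery died after two days, had to return it.",
    "Love the app design, very clean.",
    "Customer support never answered my battery complaint.",
]
relevant = curate(excerpts, topic_terms=["battery", "charge"])
prompt = "Summarize these complaints:\n" + "\n".join(relevant)
print(len(relevant))  # → 2
```

A curated prompt like this sidesteps exactly the failure mode the study found: key evidence buried in the middle of a huge, mostly irrelevant input.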
—
Best of luck until September. ✌️
-Caitlin