Read time: ≈20 minutes
It's been a wild year with AI.
Despite many still claiming that AI doesn't do research well, I've seen huge gains in 2025 across all my tests, from simple things that were consistently done poorly in the past (can AI count correctly?) to more complicated things (recognizing facial expressions and body language in videos).
It's not the end of the year yet, but with holidays around the corner and that end-of-year crunch time, I want to wrap things up for you while I still have your attention. (December's edition will be a little more relaxed).
I’ve highlighted what I believe are the most meaningful changes we've seen in AI for customer research, and how I think 2026 is shaping up.
In this edition:
🔄 Model updates: ChatGPT 4o → 5.1, Claude Sonnet 3.5 → Opus 4.5: What's the big deal for research tasks?
🎥 Gemini 3: Holy moly, this video analysis works (so far).
🤖 AI Moderators: Can we trust them now? A comparison vs. my Dec'24 study
⚙️ Agents: Are we any further than in 2024?
👩🔬 Studies: Synthetic users. Where are we now, and what's reliable?
🔮 Looking ahead: What 2026 already promises.
Let’s get into it —
MODEL UPGRADES
🔄 ChatGPT + Claude
ChatGPT 4o → GPT 5.1: Is this a big deal?
Yes. This one actually matters for research workflows.
The upgrade from GPT-4o (what we had in January) to GPT-5.1 (released November 2025) has two changes I particularly care about:
1. Context windows ≈tripled.
The API now supports 400K tokens, which is roughly 300,000 words. The actual upload limit is 272K tokens, but that's still a lot: you can upload multiple hour-plus interview transcripts without chunking them up.
Fragmented analysis where AI "forgets" what was said earlier in the conversation has been noticeably reduced. The model’s ability to continue performing its best over a long analysis chat has improved.
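If you want a quick sanity check on whether a batch of transcripts fits before you upload, a rough token count does the job. Here's a minimal sketch using the tiktoken library (the "transcripts" folder is a placeholder, and I'm assuming the o200k_base encoding GPT-4o uses; newer models may tokenize a bit differently):

```python
# Rough check: will these transcripts fit in one context window?
# Assumes the o200k_base encoding (used by GPT-4o); newer models may differ slightly.
import pathlib
import tiktoken

INPUT_LIMIT = 272_000  # approximate upload/input limit mentioned above

enc = tiktoken.get_encoding("o200k_base")

total = 0
for path in pathlib.Path("transcripts").glob("*.txt"):
    tokens = len(enc.encode(path.read_text(encoding="utf-8")))
    total += tokens
    print(f"{path.name}: ~{tokens:,} tokens")

print(f"Total: ~{total:,} of {INPUT_LIMIT:,} tokens "
      f"({'fits' if total < INPUT_LIMIT else 'needs chunking'})")
```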
2. Hallucinations dropped.
OpenAI reports 45% fewer factual errors versus GPT-4o. On open-ended factuality benchmarks, GPT-5 achieves just 1-2.8% hallucination rates versus 5-23% for predecessors.
(Keep in mind these are OpenAI's own measurements, so there's probably a bit of bias baked in, but I've seen improvements in all my tests.)
〰️
My verdict: These improvements have meaningful compound effects for research tasks, where performance consistency and factuality are essential. I've consistently seen ChatGPT maintain accuracy and stay focused on the right task details over longer chats than it did before.
Claude Sonnet 3.5 → Opus 4.5: Is this a big deal?
Also yes, but for slightly different reasons.
Anthropic's progression through 2025—from Sonnet 3.5 to Opus 4.5 (released this month)—brings real improvements for qualitative analysis workflows.
1. Context windows expanded significantly here, too.
While Opus 4.5 maintains a 200K token standard window, Claude Sonnet 4.5 now offers 1 million tokens in beta—approximately 750,000 words. That's dozens of interview transcripts in a single analysis.
The new Memory function improves the ability to build knowledge bases that maintain consistency across conversations—directly applicable to continuous insights processes.
2. Deep research improved by ~15 percentage points.
That gain comes from combining Opus 4.5 with Anthropic's new agentic features (effort control, context compaction, and advanced tool use).
The model specifically targets complex tasks, achieving state-of-the-art results on multi-step reasoning tasks that combine information retrieval, tool use, and deep analysis.
3. Hallucination findings are mixed.
Anthropic positions Sonnet 4.5 as having "lower rates of hallucination".
However, one academic analysis found Claude Opus 4 exhibited a ~10% hallucination rate versus under 5% for Claude 3.7. Bigger models don't automatically mean fewer errors.
🙋‍♀️ Tip: Always instruct Claude to use only the provided documents and to cite specific quotes. That's still the best mitigation, regardless of model.
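If you're scripting this rather than working in the chat UI, here's a minimal sketch of that instruction using Anthropic's Python SDK. The model ID, file name, and question are placeholders, so check the current model names in Anthropic's docs before running it.

```python
# Minimal sketch: constrain Claude to the provided transcript and require quotes.
# Model ID, file path, and question are placeholders; adjust to your setup.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

transcript = open("interview_07.txt", encoding="utf-8").read()

response = client.messages.create(
    model="claude-opus-4-5",  # placeholder; use the current Opus model name
    max_tokens=1500,
    system=(
        "You are assisting with qualitative analysis. "
        "Use ONLY the transcript provided by the user. "
        "Support every claim with a direct quote from the transcript. "
        "If the transcript does not contain the answer, say so explicitly."
    ),
    messages=[{
        "role": "user",
        "content": f"<transcript>\n{transcript}\n</transcript>\n\n"
                   "What frustrations does the participant mention about onboarding?",
    }],
)

print(response.content[0].text)
```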
MODEL UPGRADES
🎥 Gemini 3: Holy moly, this video analysis works (so far)
I just wrapped up my final AI Analysis course cohort for the year, and got quite a few questions about Gemini 3.
But I hadn't tested it yet. So I did that this weekend.
Let me tell you (1) what Gemini did that impressed me, and (2) how I repeated one of my regular tests in just 25 minutes. 😄
1. Gemini can actually see your expressions on video.
Other models and research-specific platforms have claimed this before, but it never panned out: every time I uploaded a video to any of the major LLMs, the responses about expressions were entirely fabricated.
Gemini 3 got observations right. It nailed:
Hand positions
Facial expressions
The likely meaning behind those expressions (its guesses were similar to my own, jotted down quickly beforehand)
Consistency across 10 different points in a single video, repeated with 6 videos
2. How I test this (in all the models):
I have a core set of participant videos I use repeatedly for AI video tasks.
Every few months, or when a hyped new model launches, I re-test:
Same videos
Same requests—tell me what the person is doing, describe their facial expression, tell me what it means.
Example request:
“At timestamp 9:34…what is the participant doing? Observe their body language, facial expression, and anything else you notice. Describe them, then interpret what you think they mean. Compare with what they are saying at that timestamp.”
None of the models have done this successfully until now.
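If you want to run a similar spot-check yourself, here's a rough sketch of how it could be scripted with Google's google-genai Python SDK. The model name, file paths, and timestamps are placeholders, and exact method names can differ between SDK versions, so treat it as a starting point rather than a recipe.

```python
# Rough sketch: repeat the same non-verbal observation question across videos/timestamps.
# Model name, paths, and timestamps are placeholders; SDK details may vary by version.
from google import genai

client = genai.Client()  # reads the Gemini API key from the environment

PROMPT = (
    "At timestamp {ts}, what is the participant doing? Observe their body language, "
    "facial expression, and anything else you notice. Describe them, then interpret "
    "what you think they mean. Compare with what they are saying at that timestamp."
)

checks = {
    "videos/interview_01.mp4": ["1:05", "9:34"],
    "videos/interview_02.mp4": ["0:42", "6:10"],
}

for path, timestamps in checks.items():
    video = client.files.upload(file=path)  # large files may need a moment to finish processing
    for ts in timestamps:
        response = client.models.generate_content(
            model="gemini-3-pro-preview",  # placeholder; use the current Gemini 3 model name
            contents=[video, PROMPT.format(ts=ts)],
        )
        print(f"{path} @ {ts}\n{response.text}\n")
```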
This is me, talking to a friend about parenting + business. Video frame at 1:05.

I prompted:
“What do you observe about this participant's facial expression and body language at 1:05? Compare non-verbal behavior with what they are saying at that time.”
Gemini 3’s Response

Whether you agree with Gemini’s interpretation of what my expression/behavior means or not, it correctly identified my position and facial expression in the video every time I asked for non-verbal observations ✅
This held true across 60 non-verbal observation requests over 6 videos so far.
⚠️ The reality check:
Gemini 3 achieves 87.6% on Video-MMMU (state-of-the-art), but this benchmark tests knowledge acquisition from educational videos—lectures and tutorials—not behavioral observation in customer research.
There is no standardized benchmark for observing participant body language in research videos.
WORKFLOW UPGRADES
🤖 AI Moderators: Can we trust them in 2026? Comparing today vs. my Dec.’24 study.
Quick refresher on what I found then:
Tools were best used as a "middle ground" between surveys and interviews ✅
AI moderators’ follow-up questions were hit-or-miss 🫤
Some tools allowed follow-up instructions while others didn't; giving guidance worked best
Participant experience was mixed—repetitive questions, canned responses, abrupt endings.
AI still needed heavy guidance for study setup or it wouldn’t do nearly as well as a senior researcher.
〰️
What's changed in 12 months?
1. True voice-to-voice interviews are now standard.
Platforms like Listen Labs, Userology, and others now let participants speak with a voice that sounds human. In case you missed it, I updated my overview of "23 AI moderator tools to know", marking which tools do voice-to-voice and which don't.
2. Follow-up capabilities + study setup got upgrades
These were some of the biggest pain points in my Dec'24 study. Let me compare:

Some best practices guides now warn about over-probing—the concern shifted from "not enough follow-up" to "AI can and will dig in incessantly if you let it." 😅
〰️
🙋♀️ Tip: If you set up specific areas to probe in, make sure they are clearly a level deeper or in some way different from the main question they follow. Otherwise, the participant will feel like they’re getting the same question four times.
AI AGENTS
⚙️ Agents: Are we further than in 2024?
Short answer: Yes, but unevenly. Fully autonomous end-to-end research workflows remain largely out of reach.
What's emerging but not proven
Dovetail announced AI Agents in a closed beta that can perform automated actions like sending monthly Voice of Customer summaries, flagging issues, and posting alerts.
But this is an example of predefined workflows, not truly autonomous orchestration.
Computer-use agents from Anthropic and OpenAI exist but remain slow and struggle with common interface interactions like scrolling, dragging and navigating some UIs.
Most custom agents you would build without a ready-made tool rely heavily on your ability to give clear, specific instructions with built-in feedback loops and the right task scope (i.e., not giving the AI more than it can handle at once); see the sketch below.
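To make "clear instructions with built-in feedback loops and the right task scope" concrete, here's a toy sketch of the pattern: a narrowly scoped task list, one step at a time, with a human checkpoint before anything moves forward. The run_step function is a stand-in for whatever your agent framework or API call actually does.

```python
# Toy sketch of a scoped agent loop with a human checkpoint between steps.
# run_step() is a stand-in for whatever your agent framework or API call does.

def run_step(step: str, context: dict) -> str:
    """Placeholder: call your model/agent here with a tightly scoped instruction."""
    return f"[draft output for: {step}]"

steps = [
    "Draft a screener from the criteria in context['criteria']",
    "Summarize yesterday's interview transcripts into 5 themes",
    "Draft (but do not send) a follow-up email to flagged participants",
]

context = {"criteria": "B2B admins, 50+ seat accounts, churn risk flagged"}

for step in steps:
    draft = run_step(step, context)
    print(f"\nSTEP: {step}\n{draft}")
    if input("Approve this step? [y/N] ").strip().lower() != "y":
        print("Stopping here; revise the instruction and re-run.")
        break
    context[step] = draft  # feed the approved output into the next step
```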
What's still hype
True end-to-end automation (plan → recruit → conduct → analyze → report, without human oversight) isn't happening reliably without exceptional prompting and AI skills.
The reliable capability hierarchy today:
More mature: Transcription, some sentiment analysis, some theme clustering, AI-moderated interviews (to a certain extent)
Emerging: Semi-automated reporting, integration-triggered workflows, research screening with agent handling checkpoints
Still not there: Fully autonomous multi-step research, end-to-end orchestration, agent-led strategic decisions without human guidance
AI STUDIES
👩‍🔬 Synthetic Users: Can they work, and what can we use them for?
PMs are wondering "can we skip recruiting tricky users now?"
Researchers are thinking, "are you trying to avoid talking to real customers?"
I want to do a reality check that is as accurate as I can manage. No agenda here - just calling out what studies say, and what they do not tell us.
〰️
The Stanford study everyone cites
The Stanford HAI "Generative Agent Simulations of 1,000 People" study achieved 85% accuracy on General Social Survey responses.
Impressive…until you understand the methodology:
Each participant underwent a 2-hour in-depth qualitative interview generating transcripts averaging 6,491 words per person
Full transcripts were injected into model prompts
An "expert reflection" module analyzed each interview through psychologist, economist, political scientist, and demographic expert lenses
The 85% accuracy was measured against the same individuals' own responses two weeks later
👆 This methodology bears no resemblance to how a product team would use synthetic users with generic persona descriptions of unknown customers. When was the last time you ran 2-hour interviews with a highly specific, standardized question set across 1000 customers?
Source: Generative Agent Simulations of 1,000 People (Stanford HAI, April 2025)
The Colgate-Palmolive study that's actually about product research
PyMC Labs partnered with Colgate-Palmolive to test synthetic consumers against 9,300 real human responses across 57 personal care product concept surveys.
Their "Semantic Similarity Rating" method achieved 90% of human test-retest reliability with realistic response distributions (KS similarity >0.85).
But here's what the methodology required:
They couldn't just ask LLMs "rate this product 1-5"—direct numerical prompting produced unrealistic distributions
Instead, they elicited free-text responses from GPT-4o and Gemini-2.0-flash, then mapped those to Likert scales using embedding similarity (see the sketch below)
They tested against supervised machine learning models trained on actual survey data—and the zero-shot LLM approach outperformed them (90% vs 65% correlation attainment)
The products were familiar personal care categories, not novel or complex offerings
👆 Notice: this study required a specific technical approach most teams won't implement, and it worked for well-understood consumer goods categories—not your novel B2B SaaS feature.
Source: LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings (arXiv, October 2025)
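The "map free text to a Likert scale with embedding similarity" step is easier to picture in code. Here's a minimal sketch of the general idea using sentence-transformers; the anchor wordings, model choice, and the simple argmax at the end are my own stand-ins, not the paper's exact method.

```python
# Minimal sketch of semantic-similarity Likert mapping (not the paper's exact method).
# Anchor wordings and the embedding model are illustrative stand-ins.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# One anchor statement per Likert point (1 = definitely not buy ... 5 = definitely buy)
anchors = [
    "I would definitely not buy this product.",
    "I would probably not buy this product.",
    "I might or might not buy this product.",
    "I would probably buy this product.",
    "I would definitely buy this product.",
]
anchor_emb = model.encode(anchors, convert_to_tensor=True)

# Free-text response elicited from an LLM persona instead of asking it to "rate 1-5"
free_text = "It sounds useful, and I'd likely pick it up next time I'm shopping."
response_emb = model.encode(free_text, convert_to_tensor=True)

scores = util.cos_sim(response_emb, anchor_emb)[0]
rating = int(scores.argmax()) + 1
print(f"Mapped Likert rating: {rating}")  # e.g. 4
```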
〰️
What studies actually prove ✅
With sophisticated methodology (semantic similarity mapping, not direct prompting), LLMs can achieve 90% of human test-retest reliability for purchase intent on familiar consumer goods
With 2-hour interviews of specific known individuals, agents can match that person's survey responses 85% as accurately as they match their own responses
Zero-shot LLM approaches can outperform supervised ML models trained on actual survey data—but only with the right elicitation method
What studies do NOT prove ❌
Do not assume that the studies say:
Generic persona-based synthetic users match real customer behavior (both studies required either deep individual data OR technical methodology sophistication)
Results generalize to novel products, complex B2B offerings, or niche audiences
Standard prompting achieves academic-quality results (it doesn't—direct "rate 1-5" prompts produce unrealistic distributions)
Synthetic users can replace real customer research for final product decisions
These methods work without validation against your actual customer data
Why you cannot simply say "it works"
The Stanford 85% was achieved by simulating specific individuals with detailed life stories and measuring them against a standardized survey protocol; corporate teams are trying to simulate generic customer segments they've interviewed few people from, based on assumed profiles.
These are fundamentally different tasks.
Multiple studies found:
Mode collapse (narrower response distributions than real humans)
Typicality bias (stereotypical completions over diverse responses)
Positive bias (synthetic users are "much more positive than real humans")
To be clear: I am really excited about this particular topic and not anti-synthetic users. But we need to know where they work, and what studies don’t tell us yet.
ON TO 2026
🔮 What 2026 might look like
Based on this year's progress and roadmaps announced by teams like OpenAI, Anthropic, and Google DeepMind, here's what I expect:
Transcript analysis will approach near-human accuracy
I expect 95%+ accuracy for clear, single-speaker recordings and 90%+ for standard interview conditions (for top languages in training data) by late 2026.
Agent autonomy will be semi-autonomous for most of us, not fully autonomous
OpenAI is developing specialized agents priced at $2,000-$20,000/month for knowledge work. Claude 4 can work continuously for hours on complex tasks. One of the biggest blockers is still our ability to communicate what we want and tell the AI how to do it our way (rather than whatever way it would otherwise reason through on its own).
I’m looking forward to automating more things reliably with improved models, like participant recruitment where we have clear screener criteria and qualifying characteristics.
Video analysis will improve but require human validation
By late 2026, AI will handle many more video analysis tasks for customer research and do them well: key moment extraction (based on image recognition, not just transcripts), emotional reaction identification, and on-screen actions.
But human review remains essential for nuanced behavioral interpretation and catching cultural differences or reactions that aren’t highly represented in LLMs’ training data.
Synthetic user tools will remain supplements, not replacements
Appropriate by end of 2026: Testing highly specific use cases (ex: standard usability audits, pricing changes) where we have exactly the right kind of data for the application.
Still inappropriate: Final product decisions, niche audience research, emotional topics, replacement for qualitative research depth.
WHAT’S COMING NEXT?
December will be a lighter edition with a tighter focus.
I'll share some tests I've been kicking off (with hundreds of participants) in early 2026
And some thoughts on planning your 2026 AI-augmented research model
Best of luck kicking off the new month. ✨
-Caitlin