AI x Customer Research - July '25

Transcription evaluations head-to-head, and NotebookLM's "hidden" transcripts - are they any good? Plus a speedy test protocol...

Read time: 16 minutes

Hey all!

I’ve been surprised lately:
A lot of you are running your entire analysis process through NotebookLM - using its hidden transcripts as the foundation for all your AI insights.

People are uploading recordings straight into NotebookLM and relying on transcripts that, as several of you put it, “you actually can’t see.”

Yikes. 🫣

Instead of telling you this feels like a risky move, I decided to run some tests.

Is it really that bad? What are NotebookLM’s transcripts like? And how can you tell if they’re hidden?

Plus, comparing NotebookLM transcripts with those from Grain and Whisper, and how to test 3 transcript tools in <1 hour (with an AI evaluation in the mix).

Let’s do this —

In this edition:

A bunch about transcription…

  1. ⚠️ NotebookLM vs. Grain vs. Whisper: A few head-to-head battle findings across English and German transcripts.

  2. 👀 NotebookLM’s “secret” transcripts: Are they any good? Get NotebookLM to show them to you.

  3. ⚖️ Run a transcript evaluation in Claude: Get it to comb transcripts from multiple tools, and tell you which are worth trusting.

  4. 👩‍🔬 Run your own multi-tool transcript test in <1 hour: Steal my simple test protocol that starts small and scales.

WORKFLOW UPGRADES


⚠️ NotebookLM vs. Grain vs. Whisper

While testing NotebookLM’s transcription quality for my course students, I decided to compare it with a few popular options: Grain (common among those of you I’ve chatted with) and Whisper (via MacWhisper Pro), widely considered a gold standard for highly accurate transcripts.

Top findings

  1. 👎 Whisper underperformed - don’t assume even “highly accurate” tools will work for your data - test them on your toughest transcripts first.

  2. 👍 Grain surprised me - I’d had some underwhelming results from the tool previously, but it aced recent tests.

  3. 👇 NotebookLM is at the bottom of this list - consistently missing brand and tool names, and even key terms that were stated clearly by participants.

Examples from my test notes of how the results compared:

〰️

  • Grain was smart enough to realize that my participant’s pronunciation of “Myro” actually meant the workshop/design platform “Miro” - the other two didn’t catch this

  • NotebookLM consistently synthesized things just enough to make the transcript noticeably less accurate, and skipped or misunderstood small but important words here and there - e.g. German “so” (very) vs. “zu” (too much).

  • Whisper results were a mixed bag. On transcripts where German speakers were interviewed in English, there was a 50/50 chance of error across some sentences where accents made pronunciation of key words different from the way a US-English speaker would have said them.


Verdict: Of these three tools, I’d easily choose Grain if I needed a tool that reliably catches brand/tool names and understands English spoken with European accents.

More details about my test setup are in the “test protocols” section below.

PROMPTING PLUS


👀 NotebookLM’s “secret” transcripts

TL;DR: If you can’t see the words, you can’t trust the insights.

What are NotebookLM transcripts even like?

NotebookLM can turn audio files straight into analysis, but many people think you can’t see the transcripts. That’s a problem - because every quote list or statement about a pattern is built on that invisible text.

I didn’t believe that NotebookLM would be so secretive, so I found the simplest prompt that reveals the transcripts - and lets us check how solid they are. Please use this - don’t rely on hidden transcripts.

PROMPT THIS after uploading your audio file

“Give me a verbatim transcript of the audio you just processed for [file name]. Include Speaker labels.”

That’s it! It worked in 15/15 tests so far on transcripts of various kinds.

〰️

⚖️ Run a Transcript Evaluation in Claude

A WER score = an instant indication about whether you should trust the tool’s transcript capabilities.

What’s “WER”? Word Error Rate. 

It measures how far a test transcript diverges from a control (reference) transcript - counting substituted, deleted, and inserted words relative to the reference’s word count. Lower is better.

If you want to use a control, use a transcript you know is highly reliable, or one you painstakingly wrote out manually.
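Under the hood, WER is just word-level edit distance divided by the reference’s word count. Here’s a minimal stdlib sketch of that formula (the jiwer library computes the same thing, with alignment details handled for you) - the sample sentences are made-up stand-ins:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

reference = "I use Miro for remote workshops because it is so flexible"
candidate = "I use Myro for remote workshops because it is too flexible"
print(f"WER: {wer(reference, candidate):.2f}")  # 2 errors in 11 reference words
```

Two wrong words (“Myro”, “too”) out of an 11-word reference gives a WER of about 0.18 - exactly the kind of small-but-important misses described above.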

Want to know how transcripts from 2+ tools compare really fast?

  • Use Python and the jiwer library to compare word error rates between transcripts in Claude

  • Upload transcripts with the prompt below

PROMPT THIS - Ideally in Claude (handles JiWER best)

“Act as a master of transcript accuracy and comparison.

Run Python with the jiwer library to compute the Word Error Rate between the [#] transcripts uploaded.”
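If you want a quick local sanity check before (or instead of) handing transcripts to Claude, Python’s standard-library difflib can rank tools by word-level similarity to a reference - not WER, but directionally similar. The tool names and transcripts below are hypothetical stand-ins:

```python
import difflib

# Trusted reference transcript (e.g., one you wrote out manually)
reference = "I use Miro for remote workshops because it is so flexible".split()

# Hypothetical output from each tool under test
candidates = {
    "tool_a": "I use Myro for remote workshops because it is too flexible",
    "tool_b": "I use Miro for remote workshops because it is so flexible",
}

# Rank tools by word-level similarity to the reference (1.0 = identical)
ranked = sorted(
    ((difflib.SequenceMatcher(None, reference, hyp.split()).ratio(), tool)
     for tool, hyp in candidates.items()),
    reverse=True,
)
for score, tool in ranked:
    print(f"{tool}: similarity {score:.2f}")
```

The most accurate tool lands at the top of the list; anything scoring well below the pack is worth a closer manual look.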

〰️

By the way…

If you’re still struggling to get reliable analysis results from AI

  • Wishing you could cut analysis time measurably, but…

  • feeling like anything coming from AI could be hallucinated…

  • and spending more time fixing outputs than is saved using AI…

We fix all of that in my AI Analysis course!

⚡️ Enrollment for September is now open ⚡️

  • 4 week course

  • September 15 - October 10

  • Access to course content forever

  • Content kept updated for an additional 6 months

More details below 👇

TEST PROTOCOLS

💨 Want to repeat my transcript test? Do it in <1 hour

Try this: Pick one tricky interview recording, generate transcripts in 2+ tools, and run a WER comparison. The tiny effort pays for itself the next time you’re tempted to trust a black-box summary from a tool like NotebookLM or something your colleagues are using.

How I ran my complete test round using this -

  • Chose ONE audio interview recording per language (EN, DE)

  • Ran the one file through all three tools → First results, in well under 1 hour

  • Repeated the process across 10x English audio files and 10x German files - results were consistent.

  • Repeated with 5x Swedish audio files

Get the full test protocol in Notion to guide you

WHAT’S COMING NEXT?

  • How can AI help us continue learning from existing research? Your AI repository options - coming in August

  • Synthetic users are a hot topic and I’m working on something there…

See you in August!

-Caitlin