Read time: ≈15 minutes

How well can Gemini 3 understand and analyze videos?

In November, I shared a snippet of the early findings from some recent tests.

I’ve been pretty excited about how accurately Gemini 3 read the video files I gave it last month and described what was actually happening in them.

There’s been a big gap in most people’s analysis processes: getting AI to notice customers’ body language and expressions during user testing, as feedback that goes beyond the transcript.

I think everyone reading this newsletter knows that’s a really important piece for measuring real signal: capturing not just what people say, but what they do in key moments, like their first look at an early prototype of a big product change.

After running Gemini 3 tests on six videos in November, I kept going.

I ran 60+ micro-tests with 20 additional videos last week, ranging from Hotjar screen recordings and clips of user tests with hard-to-read text or fast on-screen interactions to long-form podcast interviews borrowed from the internet.

Before many of you go on a break, I want to catch you up on Gemini 3’s video analysis capabilities. In ≈15 minutes, you’ll know exactly what to use it for (and how to test it on your own).

In this edition:

  1. 🎥 Gemini 3: What works, what doesn’t, how to get the best out of it.

  2. 🧑‍🔬 Rapid-testing new LLM features: How I ran 60+ Gemini 3 tests in <2 hours (including documenting it all).

Let’s dive in —

WORKFLOW UPGRADES

🔄 Gemini 3: What works, what doesn’t, how to get the best out of it.

Can it actually figure out what’s happening in your user videos for you?

Short answer: Only if you keep videos short.

I ran 60+ micro-tests on Gemini 3 with a mix of video types to figure out if it could:

  1. Identify what people were doing (not just saying): body language and facial expressions

  2. Accurately describe what was happening in their context: the environment, test materials (e.g., app views, landing pages), and mouse movements

  3. Interpret people’s behaviors, and whether its interpretation aligned with mine (less essential here, and highly subjective)

My test set was diverse:

  • 20+ unique videos: user tests, Hotjar screen recordings, interviews, and podcasts - many covering completely different subject matter and test materials

  • Ranging from <15 minutes to 1+ hour

The promise 💭

In early tests, I was impressed. Gemini 3 picked up facial expressions, on-screen behavior, and even background details with surprising accuracy.

I thought: this could really help us move faster through user tests with fewer human hours — like mapping out problematic user paths to compare across tests and find gold nuggets faster.

Here’s what I found —

〰️

What worked — and what didn’t

Short videos (<20 mins): Surprisingly good results

Even when text on screen was grayed out and video quality was mediocre, Gemini’s ability to read and understand the content in short videos was consistently high.

  • ≈85% accuracy on observations across short video tests

  • Gemini 3 caught things like:

    • Content on-screen (prototypes, detailed text, browser tabs vs. apps)

    • Facial expressions and body language

    • Background context (where the person was sitting, lighting, movement)

  • Interpretation of tone and emotion was hit-or-miss — but my interpretation is also subjective. It’s not a hardcoded eval.

→ If your video is short, Gemini 3 can save you time by pinpointing specific user reactions, mapping journeys and finding noteworthy events in tests.

Long videos (20+ mins): Not just yet.

  • Every long video I tested produced obvious hallucinations

  • Gemini fabricated behaviors, scenes, or whole parts of the conversation

  • One response claimed a participant was “gesturing with open palms” — the guy was literally sitting still the whole time

  • While you can technically upload a 45-minute video, the output consistently contained hallucinations for clips over 20 minutes.

  • 1+ hour videos wouldn’t even upload.

〰️

🖼️ An example

Let me show you the difference between the outputs from a 45-minute version of a Rich Roll video podcast episode and from a 15-minute snippet containing the same content, given the same request from me.

Q: “What changes at around {same specific point in the video}?”

Here’s what was actually going on in the video:

Australian pro swimmer and musician Cody Simpson shifted from interview mode at the table with podcast interviewer Rich Roll to playing guitar on a sofa.

Pretty obvious context shift.

Gemini’s output from the 45-minute video 👇

This is obviously wrong. It’s generating what would be reasonable to expect at that point in the conversation, not what actually happens in the video.

〰️

Gemini’s output from the same video — a 15-minute clip 👇

Compared to what I saw and wrote down before the test, this is correct.

Bottom line: Gemini 3 is not a tool for long-form video analysis yet.

So, should you use it?

If you’re a Researcher, Designer or PM who wants to:

  • Skim user test reactions quickly

  • Map key issues across some prototype views

  • Pull out screens, behaviors, facial cues, tone


    — then yes, Gemini 3 can help, but only if your videos are under 20 minutes.

Want to run a full 1-hour user test and have Gemini 3 write the highlights?
Don’t. It’ll make stuff up.
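
If the only recording you have is long, cut it into clips under 20 minutes before you upload anything; the 15-minute podcast snippet in the example above is exactly this idea. Here’s a minimal sketch of one way to do the cutting, assuming a local file and ffmpeg installed (the paths and timestamps are placeholders):

```python
import subprocess

def cut_clip(src: str, start: str, end: str, dest: str) -> None:
    """Copy a time range out of a recording without re-encoding.

    Cut points may shift slightly to the nearest keyframe because the
    streams are copied rather than re-encoded.
    """
    subprocess.run(
        ["ffmpeg", "-i", src, "-ss", start, "-to", end, "-c", "copy", dest],
        check=True,
    )

# Pull an ~18-minute window out of a longer session (placeholder file names).
cut_clip("full_user_test.mp4", "00:06:00", "00:24:00", "clip_06-24.mp4")
```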

TESTING AI TOOLS

🤖 Your turn: Run 50+ micro-tests in Gemini 3 in <2 hours

Keep your test simple.

The key: You’re not testing whether Gemini 3 can do 10 tasks with videos. You’re testing one workflow question:

“Can Gemini accurately describe what is observable in the video at specific timestamps and around specific topic discussions?”

That’s it. If it can’t do that consistently, we can’t use it for decisions based on non-verbal observations without a lot of human involvement.

Here’s how I run tests like this as efficiently as possible:

〰️

Pick your test data set

Examples:

  • 5 short videos under 20 minutes

  • 5 long videos (40m+)

How you get 50+ tests fast

  • 10 videos × 5 moments each = 50 micro-tests

  • A “moment” = a timestamp window where something meaningful happens

    Alt moment: a specific point in the test or conversation that you hope it can spot and describe correctly when you ask about topics, prototype views, etc.

〰️

Protocol (the exact loop)

Step 1 — Skim and mark moments (≈30 min)

For each short video:

  1. Skim (don’t watch everything)

  2. Pick 3–5 moments where:

    • you can observe a clear facial expression or behavior (laughter / tension / confusion)

    • a UI screen or specific text is visible

    • text or visuals are less clear due to video quality or grayed-out UI

  3. For each moment, write:

    • Timestamp range (e.g., 06:28–06:35)

    • What you see (facts) (e.g., “smiles, looks away”, “clicks green button”, “privacy policy page, hovers over specific text: {note}”)

    • Your interpretation (optional) (e.g., “seems unsure / amused”, “hesitated, unsure which nav option to click”)

💡 Speed rule: if you can’t describe the moment in 1–2 lines, pick a different moment.
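
You can keep this log anywhere: a doc, a spreadsheet, or a file you can score programmatically later. If you go the file route, here’s a minimal sketch of the same structure in Python; the field names and example values are mine, not a required schema.

```python
import csv

# One row per moment, written BEFORE you run any prompts.
# Field names and example values are illustrative, not a required schema.
moments = [
    {
        "video": "user_test_checkout_v2.mp4",       # placeholder file name
        "start": "06:28",
        "end": "06:35",
        "facts": "smiles, looks away; clicks green button",
        "interpretation": "seems unsure / amused",  # optional, subjective
    },
    {
        "video": "user_test_checkout_v2.mp4",
        "start": "11:02",
        "end": "11:10",
        "facts": "privacy policy page visible, hovers over consent text",
        "interpretation": "",
    },
]

with open("moment_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=moments[0].keys())
    writer.writeheader()
    writer.writerows(moments)
```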

〰️

Step 2 — Run micro-tests in Gemini (≈60–70 min)

You’ll run one prompt for up to 5 moments, one video at a time (fast, consistent, comparable). If you’d rather script the loop, there’s a sketch after the question types below.

Micro-test question types (pick one per moment):

  • “What is happening between them at [timestamp]?”

  • “What is [person]’s facial expression at [timestamp] and what might it signal?”

  • “What changed right after [timestamp]?”

  • “What UI/text is visible at [timestamp]?”

  • “Are they aligned or talking over each other at [timestamp]?”

This is how you keep each test tight and scoreable.
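
Most of you will run these straight in the Gemini app, which is the fastest route. If you want to script the loop instead, here’s a rough sketch using the google-generativeai Python SDK: upload each video once, then send one prompt covering that video’s moments. Treat it as a starting point, not a tested pipeline; the model ID below is a placeholder, so swap in whatever Gemini 3 model name your account exposes.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # better: read this from an env var

def run_micro_tests(video_path: str, questions: list[str]) -> str:
    """Upload one video and ask one prompt covering up to 5 moment questions."""
    video = genai.upload_file(video_path)
    # Video files are processed asynchronously; wait until the upload is ready.
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = genai.get_file(video.name)

    prompt = (
        "Answer each question using only what is observable in the video. "
        "If you cannot tell from the video, say so.\n"
        + "\n".join(f"- {q}" for q in questions)
    )
    # Placeholder model ID: check the current Gemini model names before running.
    model = genai.GenerativeModel("gemini-3-pro-preview")
    response = model.generate_content([video, prompt])
    return response.text

print(run_micro_tests(
    "user_test_checkout_v2.mp4",  # placeholder file name from the Step 1 sketch
    [
        "What is the participant's facial expression at 06:28-06:35, and what might it signal?",
        "What UI/text is visible at 11:02-11:10?",
    ],
))
```

Either way, keep the questions identical to what you’d type in the app so the results stay comparable.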

Step 3 — Scoring it: Check what holds up (≈20–30 min)

For each micro-test, grade Gemini’s answer in a few seconds:

  • PASS = key facts match what you noted (behavior + scene + interaction)

  • ⚠️ PARTIAL = mostly right but misses/warps 1 important detail

  • FAIL = hallucination / wrong scene / wrong behavior / wrong interaction

Hard rule: if it invents something big (like gestures, phase changes, “wrap-up,” etc.) → FAIL, even if some details are correct.
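
If you log each grade next to its moment (one row per micro-test), a few lines of Python will roll the results up into a pass rate per video-length bucket. The column names below are illustrative, an extension of the hypothetical log from the Step 1 sketch, not anything official:

```python
import csv
from collections import Counter

# grades.csv: one row per micro-test, e.g.
#   video,length_bucket,grade
#   user_test_checkout_v2.mp4,short,PASS
with open("grades.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for bucket in ("short", "long"):
    counts = Counter(r["grade"] for r in rows if r["length_bucket"] == bucket)
    total = sum(counts.values())
    if total:
        print(
            f"{bucket}: {counts.get('PASS', 0)}/{total} pass, "
            f"{counts.get('PARTIAL', 0)} partial, {counts.get('FAIL', 0)} fail"
        )
```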


SHIFTING INTO 2026

👩‍🏫 AI Analysis results that hold up for tough decisions

I’m kicking off 2026 with two versions of my highly rated AI Analysis course:

  • AI Analysis for Researchers & Designers

  • AI Analysis for PMs

They’re built for the realities of each role and the decisions you’re responsible for.

I’ve run the course 5x in 2025 and revamped the curriculum to deliver even more progress in just a few weeks.

The best part of each one:

Every live session is a working session — bring your data, and we’ll turn it into insights during the call with a system you can repeat.

You end every week with progress.

P.S. The PM course version is new. If you’re a PM and you want in, hit reply with “PM” and I’ll send you a private early rate (limited spots).

Until next time. Have a smooth slide into the new year.

-Caitlin