
Storytelling Structure for Explainer Videos

Who this is for: Anyone who wants to understand why the best explainer videos work — the narrative principles behind them — before applying those principles in ACT3 AI.

If you just want the step-by-step production process, go to Idea to Script or How-It-Works Video. This page is for understanding the foundation those steps are built on.


Why storytelling matters in an explainer

An explainer video is not a presentation deck with a voice-over. It is not a list of product features read aloud. The videos that work are the ones that tell a miniature story — they have a protagonist (the viewer), a problem (what the viewer doesn't understand or can't do), a transformation (the explanation), and a resolution (the viewer now understands or can act).

Without that story structure, the viewer has no emotional reason to keep watching. They may technically follow the explanation but they won't remember it, share it, or act on it.

Research on video retention consistently shows that viewers who engage with a narrative structure watch 2–3× longer than viewers watching the same information presented as a list, and they remember the core message significantly better.

Some data points worth knowing:

  • 95% vs. 10%: Viewers retain 95% of a message delivered via video versus 10% from text alone
  • 2.6× longer attention: Videos hold viewer attention 2.6× longer than text-only content
  • 22% higher brand recall: Narrative-structured video produces 22% higher brand recall versus non-narrative advertising
  • 70% viewer retention past 10 seconds: A strong hook in the first 5 seconds retains 70% of viewers past the 10-second mark

The four-element story structure

Every effective explainer video contains four elements, in order:

1. The Hook — establishing the viewer's problem

The hook answers the question: "Why should I keep watching?"

A great hook does not start with the product, the company, or the solution. It starts with a scene the viewer recognizes from their own experience. The friction. The confusion. The gap between where they are and where they want to be.

Example hooks:

  • "Every month, a third of small businesses make the same accounting mistake — and most of them don't find out until tax season."
  • "You've been told that compound interest is powerful. But if you're like most people, no one has ever shown you what that actually looks like over time."
  • "New hires at most companies spend their first week asking the same ten questions. This video answers nine of them."

In ACT3 AI, the hook is Scene 1 of your project. It's typically 10–20 seconds and 2–3 shots — enough to establish the problem and create forward momentum.

2. The Empathy Bridge — showing the viewer their current state

The empathy bridge is a brief moment — often visual, rarely narrated — where the viewer sees themselves in the situation being described. A character who looks like them, doing what they do, experiencing the problem that will be solved.

This is not a feature demo. It is a mirror. The viewer should think: "That's me."

In ACT3 AI, the empathy bridge is usually 1–2 shots in Scene 1 or at the opening of Scene 2. It uses Cinematic Realism for maximum relatability, and a character created to match the viewer demographic.

3. The Explanation — the transformation

The explanation is the payload: the concept, the product, the process. It is structured in layers, from simple to complex:

  1. The core idea — state it simply. One sentence, no jargon.
  2. The visual metaphor — show it physically (see Idea to Script for how to find your metaphor)
  3. The step-by-step — if the concept has a process, walk through each step individually
  4. The proof — show the concept working. The result.

Each layer is a Scene or a beat within a scene. Each step in the process is a separate shot. The rule: one idea per shot.

4. The Resolution — the viewer's transformed state

The resolution shows the viewer after the transformation. They have the knowledge. They can act. The problem is solved.

The resolution is short (5–15 seconds) and closes with one of the following:

  • A visual image of the outcome (the employee submits the form correctly; the snowball has become an avalanche of compound interest)
  • A single clear call to action (not multiple CTAs — one)
  • A "now you know" closing that reinforces the core message

How this maps to ACT3 AI's structure

| Story element | ACT3 AI structure | Typical length |
| --- | --- | --- |
| Hook | Scene 1, Act 1 beginning | 10–20 seconds, 2–3 shots |
| Empathy bridge | Scene 1 end / Scene 2 open | 5–10 seconds, 1–2 shots |
| Explanation (core idea) | Scene 2 | 15–30 seconds, 3–6 shots |
| Explanation (steps) | Scene 2–3 | 20–40 seconds, 4–8 shots |
| Explanation (proof) | Scene 3 open | 10–15 seconds, 2–3 shots |
| Resolution | Scene 3, Act 1 close | 10–20 seconds, 2–3 shots |
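
To see what these ranges add up to, the table can be summed column by column. The sketch below is only an arithmetic aid: the ranges are copied from the table, while the names `BEATS` and `budget` are hypothetical and not part of ACT3 AI.

```python
# Illustrative only: ranges copied from the mapping table.
# BEATS and budget are hypothetical names, not ACT3 AI features.
BEATS = {
    "Hook": {"seconds": (10, 20), "shots": (2, 3)},
    "Empathy bridge": {"seconds": (5, 10), "shots": (1, 2)},
    "Explanation (core idea)": {"seconds": (15, 30), "shots": (3, 6)},
    "Explanation (steps)": {"seconds": (20, 40), "shots": (4, 8)},
    "Explanation (proof)": {"seconds": (10, 15), "shots": (2, 3)},
    "Resolution": {"seconds": (10, 20), "shots": (2, 3)},
}


def budget(beats, key):
    """Sum the low and high ends of one column across all beats."""
    low = sum(row[key][0] for row in beats.values())
    high = sum(row[key][1] for row in beats.values())
    return low, high


seconds_low, seconds_high = budget(BEATS, "seconds")
shots_low, shots_high = budget(BEATS, "shots")
print(f"Total runtime: {seconds_low}-{seconds_high} seconds")  # 70-135 seconds
print(f"Total shots:   {shots_low}-{shots_high}")              # 14-25
```

Summed this way, the typical lengths cover 70 to 135 seconds and 14 to 25 shots, so a 90-second video sits in the lower half of the range and needs most beats near their shorter end.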

A 90-second explainer video has approximately:


Pacing: the underrated element

Most explainer videos fail on pacing, not on content. Too many words, too few visuals, too little breathing room.

The 1:2 rule: For every second of spoken narration, there should be at least 2 seconds of visual — either a shot that shows what is being said, or a shot that reinforces it with context. If the ratio inverts (more talk than show), the video feels like a slide deck.

Cut before the viewer is ready: The ideal shot length in an explainer is 1.5–4 seconds. Most creators hold shots too long. Each cut creates a micro-jolt of attention. Short cuts keep the viewer's brain engaged.

Silence is not wasted time: Leave 0.5–1 second of silence between major explanation points. The viewer's brain needs that gap to process and file what was just said. Narration over narration is one of the most common mistakes.

In ACT3 AI, control pacing with shot timing and the timeline editor.
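
The three pacing guidelines above can be checked against a draft shot list before you start editing. The following is a minimal sketch under stated assumptions: `Shot`, `check_shot_lengths`, `check_silences`, and `visual_to_narration_ratio` are hypothetical helpers, not ACT3 AI functions; each narration tuple is assumed to be one major explanation point; and the 1:2 rule is read as total on-screen time versus total narration time, which is only one possible interpretation.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    duration: float  # seconds the shot stays on screen


def check_shot_lengths(shots, low=1.5, high=4.0):
    """Return indexes of shots outside the 1.5-4 second guideline."""
    return [i for i, s in enumerate(shots) if not low <= s.duration <= high]


def check_silences(narration, min_gap=0.5):
    """narration: ordered list of (start, end) times in seconds.

    Return indexes of points that begin less than min_gap seconds
    after the previous point ends.
    """
    return [
        i
        for i in range(1, len(narration))
        if narration[i][0] - narration[i - 1][1] < min_gap
    ]


def visual_to_narration_ratio(shots, narration):
    """Assumed reading of the 1:2 rule: on-screen seconds per spoken second."""
    visual = sum(s.duration for s in shots)
    spoken = sum(end - start for start, end in narration)
    return visual / spoken


# Made-up draft data, purely for illustration.
shots = [Shot(3.0), Shot(5.5), Shot(2.0), Shot(1.0), Shot(3.5)]
narration = [(0.0, 3.0), (3.2, 6.0), (7.0, 9.0)]

print(check_shot_lengths(shots))  # [1, 3]: one shot too long, one too short
print(check_silences(narration))  # [1]: only a 0.2 s pause before the second point
print(visual_to_narration_ratio(shots, narration) >= 2)  # False
```

A flagged shot or gap is a prompt to look again, not an automatic error; a deliberately long establishing shot may still be the right choice.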


Narration style: the voice that carries the story

The voice-over is the through-line. It is what makes the video feel like a single coherent experience rather than a series of clips. Choose a voice and a tone that matches your viewer, not your product.

For business explainers: Clear, warm, confident. Not formal, not salesy. The voice of a knowledgeable colleague.

For educational content: Patient, precise, encouraging. Slower than you think — learners need time.

For marketing/advertising explainers: Energetic, credible, brief. Every word earns its place.

In ACT3 AI, assign a voice using Azure Neural TTS, or upload your own recorded narration (see How to Upload a Recorded Voice). Set the delivery style (pacing, warmth, energy) in the voice delivery settings.


The prototypical explainer: a worked example

Here is the full story structure for a 90-second product explainer for a project management tool:


Hook (0:00–0:15)

Visual: A project manager at their desk, surrounded by three different chat windows, two spreadsheets, and a sticky-note-covered monitor. Overwhelmed expression.

Narration: "Most teams today are managing projects across five different tools. Nothing is in one place. Deadlines slip. Nobody knows what's done."

ACT3 AI setup: Scene 1, Cinematic Realism, office set. Medium shot → close-up of the cluttered monitor.


Empathy Bridge (0:15–0:22)

Visual: Close-up of an email chain with 47 replies. Someone frantically scrolling.

Narration: "If you've ever missed a deadline because the update was buried in an email — this is for you."

ACT3 AI setup: 2 shots in Scene 1. Tight close-up + insert shot.


Explanation (0:22–1:05)

Visual: Screen transitions to a clean, organized dashboard. Same project manager, now relaxed. Walking through the interface.

Narration: "With [Product], all your tasks, updates, and timelines live in one place. Your team checks in where the work is — not in an email chain."

Walk through 3 features: Each gets 1–2 shots showing the interaction. Clean, well-lit office set.


Resolution (1:05–1:25)

Visual: The project manager in a team meeting. Everyone is aligned. Nobody is scrambling.

Narration: "Your team knows what to do. Deadlines are visible. Nothing falls through the cracks."

CTA: "Start free at [product website]."

ACT3 AI setup: Scene 3, 3 shots: wide meeting room shot → medium two-shot of satisfied team → product URL on screen.


Total: 3 Scenes, 18 Shots, 1 Act, ~150 words of narration, 90 seconds.
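
As a quick arithmetic check on this example, the timestamps can be converted into per-beat durations and a narration pace. This is a rough sketch: `to_seconds` and `WORKED_EXAMPLE` are illustrative names, the timestamps and the 90-second, ~150-word totals are taken from the example, and nothing here reflects an ACT3 AI API.

```python
def to_seconds(stamp):
    """Convert an 'm:ss' timestamp into whole seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


# Timestamps copied from the worked example.
WORKED_EXAMPLE = [
    ("Hook", "0:00", "0:15"),
    ("Empathy Bridge", "0:15", "0:22"),
    ("Explanation", "0:22", "1:05"),
    ("Resolution", "1:05", "1:25"),
]

durations = {
    name: to_seconds(end) - to_seconds(start)
    for name, start, end in WORKED_EXAMPLE
}
timed_total = sum(durations.values())

for name, secs in durations.items():
    print(f"{name}: {secs}s ({secs / timed_total:.0%} of the timed beats)")

runtime = 90  # stated total runtime, in seconds
words = 150   # approximate narration length
print(f"Timed beats end at {timed_total}s")
print(f"Narration pace: about {words / runtime * 60:.0f} words per minute")
```

The timed beats end at 1:25 (85 seconds), with the Explanation taking about half of that, and ~150 words over 90 seconds works out to roughly 100 words per minute.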



Choosing a script framework

The four-element structure above is a universal template. The frameworks below are specific ways to fill it in, each suited to a different type of content and audience:

| Framework | What it does | Best for |
| --- | --- | --- |
| Problem-Agitation-Solution (PAS) | State the problem → amplify the emotional cost of not solving it → introduce the product as relief | Most common; works for nearly any product |
| Before-After-Bridge (BAB) | Show the frustrating present → paint the desired future → the product bridges the gap | Transformation stories, lifestyle products |
| AIDA | Attention → Interest → Desire → Action | Universal conversion-focused content |
| Star-Story-Solution | Follow a relatable character through the problem → the product resolves it | Humanizing abstract or technical products |
| How It Works | Step-by-step walkthrough of the product's mechanics | Technical SaaS, process-heavy services, skeptical audiences |
| Before / After | Visual contrast between state A (problem) and state B (solution) | Product demos, before/after transformations |
| Question and Answer | Pose the questions viewers are already asking, then answer them | FAQ-style content, decision-stage buyers |

For most explainer videos, PAS or BAB is the right starting point. Pick the one whose first movement (Problem or Before) you can write most vividly in terms of your viewer's specific lived experience.

