Storytelling Structure for Explainer Videos
Who this is for: Anyone who wants to understand why the best explainer videos work — the narrative principles behind them — before applying those principles in ACT3 AI.
If you just want the step-by-step production process, go to Idea to Script or How-It-Works Video. This page is for understanding the foundation those steps are built on.
Why storytelling matters in an explainer
An explainer video is not a presentation deck with a voice-over. It is not a list of product features read aloud. The videos that work are the ones that tell a miniature story — they have a protagonist (the viewer), a problem (what the viewer doesn't understand or can't do), a transformation (the explanation), and a resolution (the viewer now understands or can act).
Without that story structure, the viewer has no emotional reason to keep watching. They may technically follow the explanation, but they won't remember it, share it, or act on it.
Research on video retention consistently shows that viewers who engage with a narrative structure watch 2–3× longer than viewers watching the same information presented as a list. Their recall of the core message is also significantly higher.
Some data points worth knowing:
- 95% vs. 10%: Viewers retain 95% of a message delivered via video versus 10% from text alone
- 2.6× longer attention: Videos hold viewer attention 2.6× longer than text-only content
- 22% higher brand recall: Narrative-structured video produces 22% higher brand recall versus non-narrative advertising
- 70% viewer retention past 10 seconds: A strong hook in the first 5 seconds retains 70% of viewers past the 10-second mark
The four-element story structure
Every effective explainer video contains four elements, in order:
1. The Hook — establishing the viewer's problem
The hook answers the question: "Why should I keep watching?"
A great hook does not start with the product, the company, or the solution. It starts with a scene the viewer recognizes from their own experience. The friction. The confusion. The gap between where they are and where they want to be.
Example hooks:
- "Every month, a third of small businesses make the same accounting mistake — and most of them don't find out until tax season."
- "You've been told that compound interest is powerful. But if you're like most people, no one has ever shown you what that actually looks like over time."
- "New hires at most companies spend their first week asking the same ten questions. This video answers nine of them."
In ACT3 AI, the hook is Scene 1 of your project. It's typically 10–20 seconds and 2–3 shots — enough to establish the problem and create forward momentum.
2. The Empathy Bridge — showing the viewer their current state
The empathy bridge is a brief moment — often visual, rarely narrated — where the viewer sees themselves in the situation being described. A character who looks like them, doing what they do, experiencing the problem that will be solved.
This is not a feature demo. It is a mirror. The viewer should think: "That's me."
In ACT3 AI, the empathy bridge is usually 1–2 shots at the end of Scene 1 or the opening of Scene 2. It typically uses the Cinematic Realism style for maximum relatability, with a character created to match the viewer demographic.
3. The Explanation — the transformation
The explanation is the payload: the concept, the product, the process. It is structured in layers, from simple to complex:
- The core idea — state it simply. One sentence, no jargon.
- The visual metaphor — show it physically (see Idea to Script for how to find your metaphor)
- The step-by-step — if the concept has a process, walk through each step individually
- The proof — show the concept working. The result.
Each layer is a Scene or a beat within a scene. Each step in the process is a separate shot. The rule: one idea per shot.
4. The Resolution — the viewer's transformed state
The resolution shows the viewer after the transformation. They have the knowledge. They can act. The problem is solved.
The resolution is short (5–15 seconds) and closes with one of:
- A visual image of the outcome (the employee submits the form correctly; the snowball has become an avalanche of compound interest)
- A single clear call to action (not multiple CTAs — one)
- A "now you know" closing that reinforces the core message
How this maps to ACT3 AI's structure
| Story element | ACT3 AI structure | Typical length |
|---|---|---|
| Hook | Scene 1, Act 1 beginning | 10–20 seconds, 2–3 shots |
| Empathy bridge | Scene 1 end / Scene 2 open | 5–10 seconds, 1–2 shots |
| Explanation (core idea) | Scene 2 | 15–30 seconds, 3–6 shots |
| Explanation (steps) | Scene 2–3 | 20–40 seconds, 4–8 shots |
| Explanation (proof) | Scene 3 open | 10–15 seconds, 2–3 shots |
| Resolution | Scene 3, Act 1 close | 10–20 seconds, 2–3 shots |
A 90-second explainer video has approximately:
- 1 Act
- 3 Scenes
- 15–25 Shots
- 1 visual style
- 1 host/narrator digital actor
- 2–3 Sets (one per scene, or reuse the same set)
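The ranges in the table above can be added up to see how much room a 90-second target leaves. The following is a minimal Python sketch for illustration only, not part of ACT3 AI: the `Segment` class and `total_range` helper are hypothetical names, and the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One row of the story-element table (hypothetical structure)."""
    name: str
    seconds: tuple[int, int]  # (min, max) typical length in seconds
    shots: tuple[int, int]    # (min, max) typical shot count


# Values copied from the table above.
SEGMENTS = [
    Segment("Hook", (10, 20), (2, 3)),
    Segment("Empathy bridge", (5, 10), (1, 2)),
    Segment("Explanation (core idea)", (15, 30), (3, 6)),
    Segment("Explanation (steps)", (20, 40), (4, 8)),
    Segment("Explanation (proof)", (10, 15), (2, 3)),
    Segment("Resolution", (10, 20), (2, 3)),
]


def total_range(segments, field):
    """Sum the low and high ends of a (min, max) field across all segments."""
    low = sum(getattr(s, field)[0] for s in segments)
    high = sum(getattr(s, field)[1] for s in segments)
    return low, high


seconds_range = total_range(SEGMENTS, "seconds")
shots_range = total_range(SEGMENTS, "shots")

print(f"Runtime: {seconds_range[0]}-{seconds_range[1]} seconds")
print(f"Shots:   {shots_range[0]}-{shots_range[1]}")
```

The totals come to 70 to 135 seconds and 14 to 25 shots. A 90-second target sits inside that range, but not every segment can run at its upper bound: if the steps need the full 40 seconds, something else has to be trimmed.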
Pacing: the underrated element
Most explainer videos fail on pacing, not on content. Too many words, too few visuals, too little breathing room.
The 1:2 rule: For every second of spoken narration, there should be at least 2 seconds of visuals: either a shot that shows what is being said or a shot that reinforces it with context. If the ratio inverts (more talk than show), the video feels like a slide deck.
Cut before the viewer is ready: The ideal shot length in an explainer is 1.5–4 seconds. Most creators hold shots too long. Each cut creates a micro-jolt of attention, so frequent cuts keep the viewer's brain engaged.
Silence is not wasted time: Leave 0.5–1 second of silence between major explanation points. The viewer's brain needs that gap to process and file what was just said. Narration that runs back to back with no pause is one of the most common mistakes.
In ACT3 AI, control pacing with shot timing and the timeline editor.
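The three pacing rules above can be turned into quick checks on a draft shot list. This is a rough Python sketch, not an ACT3 AI feature: the function names and sample data are hypothetical, and the 1:2 rule is read here as "total visual time is at least twice total narration time", which is one possible interpretation.

```python
def check_shot_lengths(shot_seconds, min_shot=1.5, max_shot=4.0):
    """Flag shots outside the 1.5-4 second range suggested above."""
    warnings = []
    for index, length in enumerate(shot_seconds, start=1):
        if length < min_shot:
            warnings.append(f"Shot {index}: {length}s is under {min_shot}s")
        elif length > max_shot:
            warnings.append(f"Shot {index}: {length}s is over {max_shot}s")
    return warnings


def check_visual_ratio(shot_seconds, narration_seconds, ratio=2.0):
    """Apply the 1:2 rule, read as: total visual time >= ratio x narration time."""
    return sum(shot_seconds) >= ratio * narration_seconds


def check_silence(narration_cues, min_gap=0.5):
    """Flag consecutive narration cues separated by less than min_gap seconds.

    narration_cues is an ordered list of (start, end) times in seconds.
    """
    warnings = []
    for (_, prev_end), (next_start, _) in zip(narration_cues, narration_cues[1:]):
        gap = next_start - prev_end
        if gap < min_gap:
            warnings.append(f"Only {gap}s of silence before the cue at {next_start}s")
    return warnings


# Made-up sample data for illustration.
shots = [3.0, 2.5, 6.0, 1.0]
cues = [(0.0, 2.0), (2.5, 4.0), (4.25, 5.5)]
narration_total = sum(end - start for start, end in cues)

shot_warnings = check_shot_lengths(shots)
ratio_ok = check_visual_ratio(shots, narration_total)
silence_warnings = check_silence(cues)

for message in shot_warnings + silence_warnings:
    print(message)
print("1:2 rule satisfied:", ratio_ok)
```

With this sample data, the third shot is flagged as too long, the fourth as too short, and the last narration cue as starting too soon after the previous one.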
Narration style: the voice that carries the story
The voice-over is the through-line. It is what makes the video feel like a single coherent experience rather than a series of clips. Choose a voice and a tone that matches your viewer, not your product.
For business explainers: Clear, warm, confident. Not formal, not salesy. The voice of a knowledgeable colleague.
For educational content: Patient, precise, encouraging. Slower than you think — learners need time.
For marketing/advertising explainers: Energetic, credible, brief. Every word earns its place.
In ACT3 AI, assign a voice using Azure Neural TTS, or upload your own recorded narration (see How to Upload a Recorded Voice). Set the delivery style (pacing, warmth, energy) in the voice delivery settings.
The prototypical explainer: a worked example
Here is the full story structure for a 90-second product explainer for a project management tool:
Hook (0:00–0:15)
Visual: A project manager at their desk, surrounded by three different chat windows, two spreadsheets, and a sticky-note-covered monitor. Overwhelmed expression.
Narration: "Most teams today are managing projects across five different tools. Nothing is in one place. Deadlines slip. Nobody knows what's done."
ACT3 AI setup: Scene 1, Cinematic Realism, office set. Medium shot → close-up of the scattered monitor.
Empathy Bridge (0:15–0:22)
Visual: Close-up of an email chain with 47 replies. Someone frantically scrolling.
Narration: "If you've ever missed a deadline because the update was buried in an email — this is for you."
ACT3 AI setup: 2 shots in Scene 1. Tight close-up + insert shot.
Explanation (0:22–1:05)
Visual: Screen transitions to a clean, organized dashboard. Same project manager, now relaxed. Walking through the interface.
Narration: "With [Product], all your tasks, updates, and timelines live in one place. Your team checks in where the work is — not in an email chain."
Walk through 3 features: Each gets 1–2 shots showing the interaction. Clean, well-lit office set.
Resolution (1:05–1:25)
Visual: The project manager in a team meeting. Everyone is aligned. Nobody is scrambling.
Narration: "Your team knows what to do. Deadlines are visible. Nothing falls through the cracks."
CTA: "Start free at [product website]."
ACT3 AI setup: Scene 3, 3 shots: wide meeting room shot → medium two-shot of satisfied team → product URL on screen.
Total: 3 Scenes, 18 Shots, 1 Act, ~150 words of narration, 90 seconds.
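The timing of the worked example can be checked with simple arithmetic. The Python sketch below is illustrative only: the `to_seconds` helper is a hypothetical name, and the timecodes, word count, and shot count are taken from the example above.

```python
def to_seconds(timecode):
    """Convert an 'M:SS' timecode into whole seconds."""
    minutes, seconds = timecode.split(":")
    return int(minutes) * 60 + int(seconds)


# Beats and timecodes from the worked example above.
BEATS = [
    ("Hook", "0:00", "0:15"),
    ("Empathy bridge", "0:15", "0:22"),
    ("Explanation", "0:22", "1:05"),
    ("Resolution", "1:05", "1:25"),
]
NARRATION_WORDS = 150  # approximate word count stated in the example
SHOTS = 18

durations = {name: to_seconds(end) - to_seconds(start) for name, start, end in BEATS}
scripted_seconds = sum(durations.values())
words_per_minute = NARRATION_WORDS / scripted_seconds * 60
average_shot_seconds = scripted_seconds / SHOTS

for name, length in durations.items():
    print(f"{name}: {length}s")
print(f"Scripted time: {scripted_seconds}s")
print(f"Narration pace: {words_per_minute:.0f} words per minute")
print(f"Average shot length: {average_shot_seconds:.1f}s")
```

The beats run 15, 7, 43, and 20 seconds, so the timecodes cover 85 of the stated 90 seconds. At roughly 150 words, the narration averages about 106 words per minute across the scripted time.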
Choosing a script framework
The four-element structure above is a universal template. The frameworks below are specific ways to fill it in, each suited to a different type of content and audience:
| Framework | What it does | Best for |
|---|---|---|
| Problem-Agitation-Solution (PAS) | State the problem → amplify the emotional cost of not solving it → introduce the product as relief | Most common; works for nearly any product |
| Before-After-Bridge (BAB) | Show the frustrating present → paint the desired future → the product bridges the gap | Transformation stories, lifestyle products |
| AIDA | Attention → Interest → Desire → Action | Universal conversion-focused content |
| Star-Story-Solution | Follow a relatable character through the problem → the product resolves it | Humanizing abstract or technical products |
| How It Works | Step-by-step walkthrough of the product's mechanics | Technical SaaS, process-heavy services, skeptical audiences |
| Before / After | Visual contrast between state A (problem) and state B (solution) | Product demos, before/after transformations |
| Question and Answer | Pose the questions viewers are already asking, then answer them | FAQ-style content, decision-stage buyers |
For most explainer videos, PAS or BAB is the right starting point. Pick the one whose first movement (Problem or Before) you can write most vividly in terms of your viewer's specific lived experience.
What's next
- Idea to Script — Turn this structure into a written script
- How-It-Works Video — Production guide for product explainers
- Concept Explainer — Making abstract ideas visual
- Step-by-Step Process Video — Walkthrough format for procedures