Essay 04 · Watch module

How Watching TV Can Actually Teach You Spanish

There is research behind watching television as a language learning tool. Here is what the evidence says about video, subtitles, and how to make passive viewing work for you.

By Habla · 16 April 2026 · 6 min read

Put on a Spanish show and you will pick up the language. That claim gets made a lot, and it is not entirely wrong. But the mechanism matters. Watching television works under specific conditions. Outside those conditions, you will spend three hours entertained and learn approximately nothing.

Here is what the research actually shows, and why Habla built a Watch module around it.

The level question: Krashen's i+1

In 1982, Stephen Krashen set out a theory that would become one of the most debated ideas in applied linguistics. His Input Hypothesis proposed that people acquire language when they understand messages that are slightly beyond their current level. He called this "comprehensible input" and abbreviated it as i+1, where i is what you already know and the +1 is the next attainable step.

The critical word is comprehensible. Krashen argued that incomprehensible input, the kind where you understand fewer than 60 to 70 per cent of the words, produces very little acquisition at all. You are not swimming in the deep end and picking things up. You are drowning in noise.

We acquire language by understanding messages slightly beyond our current level of competence.

Krashen, S. (1982). Principles and Practice in Second Language Acquisition. Pergamon Press.

This has a direct implication for television. Putting on La Casa de Papel in week two of learning Spanish is not comprehensible input. It is stress. The show moves fast, the characters speak in regional slang, and the plot gives you very little visual scaffolding for the vocabulary. You will follow the guns and the dramatic music. You will not follow the grammar.
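One rough way to make the comprehensibility threshold concrete is to measure how many of the words in a transcript you already know. The sketch below is illustrative only. The function names, the exact-match treatment of word forms, and the 0.7 default (taken from the upper end of the 60 to 70 per cent range mentioned above) are all assumptions, not a description of how Habla levels its clips.

```python
import re


def lexical_coverage(transcript: str, known_words: set[str]) -> float:
    """Return the share of word tokens in `transcript` found in `known_words`."""
    # Lowercase, then keep runs of letters, including Spanish accented characters.
    tokens = re.findall(r"[a-záéíóúüñ]+", transcript.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


def is_comprehensible(
    transcript: str, known_words: set[str], threshold: float = 0.7
) -> bool:
    """Check coverage against a threshold (0.7 is an assumed default)."""
    return lexical_coverage(transcript, known_words) >= threshold


# Hypothetical learner vocabulary and a short line of dialogue.
known = {"tengo", "hambre", "quiero", "comer"}
line = "Tengo hambre. Quiero comer algo."
print(lexical_coverage(line, known))
print(is_comprehensible(line, known))
```

A real system would need to handle verb conjugations and other inflected forms, which this exact-match version ignores.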

What makes video better than audio alone

Video has a genuine structural advantage over audio. Researchers call it multimodal input. When you watch rather than listen, you get visual context alongside the spoken language: facial expressions, gestures, object references, scene setting. That visual layer is not decoration. It actively supports comprehension.

A 2018 meta-analysis by Montero Perez, Peters, and Desmet reviewed 29 studies on video-based language input and found consistent evidence that visual context aids vocabulary acquisition and retention, particularly for learners at intermediate and lower levels. The images give meaning to words that would otherwise slide past unprocessed.

Webb and Nation (2017) noted in their comprehensive review of vocabulary learning that incidental acquisition, picking up words from context without deliberate study, is significantly more effective when the context is rich. A video scene where someone walks into a kitchen, opens a fridge, and says "tengo hambre" gives you everything you need to infer that phrase. Audio alone gives you less.

The subtitle question

This is where people get strong opinions. Native language subtitles, target language subtitles, or no subtitles?

Robert Vanderplank has spent decades studying captioned media and language learning. His 2016 review of the evidence concluded that target language subtitles (reading Spanish while hearing Spanish) support acquisition more consistently than native language subtitles. The reason is straightforward: when the subtitles match the audio, you are reading and hearing the same language simultaneously. Your brain connects the phonetic form of a word to its written form while processing meaning from context.

Native language subtitles, by contrast, tend to pull your attention onto the written English rather than the spoken Spanish. You follow the story. You do not process the audio.

Peters and Webb (2018) looked specifically at incidental vocabulary acquisition from L2 television and found that learners encountered target words significantly more often when watching with target language subtitles than without. Repeated encounters with a word, spaced across an episode, are exactly the kind of repetition that drives retention.

Watching without subtitles works best at higher levels, once comprehension is high enough that the audio alone is sufficient and subtitles have become a crutch rather than a scaffold.
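The guidance in this section can be summarised as a small decision rule: target language subtitles until the learner reaches a higher level, then none. The sketch below assumes CEFR-style level labels and an arbitrary cutoff at B2. Neither the cutoff nor the function name comes from the research cited above; both are placeholders to adjust.

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def suggest_subtitle_mode(level: str, no_subtitles_from: str = "B2") -> str:
    """Suggest "target" (Spanish subtitles) or "none" for a given level.

    The B2 cutoff is an assumed default, not a research finding.
    Native language subtitles are never suggested, in line with the
    argument that they pull attention away from the spoken Spanish.
    """
    if level not in CEFR_LEVELS or no_subtitles_from not in CEFR_LEVELS:
        raise ValueError(f"unknown level: {level!r} or {no_subtitles_from!r}")
    if CEFR_LEVELS.index(level) >= CEFR_LEVELS.index(no_subtitles_from):
        return "none"
    return "target"


for level in ("A1", "B1", "C1"):
    print(level, suggest_subtitle_mode(level))
```

Treating the cutoff as a parameter rather than a constant reflects the point that the right moment to drop subtitles depends on comprehension, not on a fixed label.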

Passive vs active watching

There is a difference between watching a show and watching it as a learner. Passive watching, where you sit back and let the content wash over you, produces the lowest acquisition rates. You understand what you understand, and the rest disappears.

Active watching means engaging with the content deliberately. Noticing unfamiliar words. Pausing to replay a line. Connecting a word you just heard to something you encountered in a prior session. This is not the same as stopping every thirty seconds with a dictionary. It is a lighter touch, a degree of intentional attention rather than fully passive consumption.

The distinction matters because it is why "just watch Spanish TV" rarely works for beginners. Without the prior vocabulary and comprehension base to recognise what is unknown, active noticing is not possible. You cannot notice a word you have never encountered in any form. This is why the sequence matters: build some vocabulary first, then watch. The watch session consolidates and expands what you already have.

How Habla's Watch module handles this

Every clip in the Watch module is levelled. A1 clips use slow, clear speech with high-frequency vocabulary. A2 clips introduce more complex sentence structures, always within reach of someone who has completed A1 material. The level system is not decorative. It exists because Krashen was right: the comprehensibility of the input is the single most important variable.

Each clip is accompanied by target language subtitles by default. You can turn them off as your level rises. Habla does not force you to pick one mode for the whole session. Within a session, you can adjust per clip based on how familiar the material feels.

Content is selected for visual richness. A cooking segment where the host names ingredients as she picks them up. A street interview where the visual context anchors every exchange. A documentary where the images and the narration are closely tied. These are not just watchable. They are structured to maximise the chance that unfamiliar words arrive with enough surrounding context to be acquired rather than lost.

The Watch module sits inside a 15-minute session, which means clip selection is deliberate. Two or three clips, chosen for your current level, with an optional review prompt at the end to surface the words you encountered. Incidental acquisition from a well-levelled clip. Deliberate consolidation in the review step. That sequence is what turns passive entertainment into measurable progress.
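As a rough illustration of that kind of selection, the sketch below picks up to three clips at the learner's level that fit inside a time budget. It is not Habla's actual selection logic. The `Clip` fields, the first-fit strategy, and the use of the full 15 minutes as the budget are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    title: str
    level: str    # e.g. "A1" or "A2"
    seconds: int  # clip duration


def build_watch_session(
    clips: list[Clip],
    level: str,
    budget_seconds: int = 15 * 60,  # assumed upper bound: the whole session
    max_clips: int = 3,
) -> list[Clip]:
    """Pick clips at `level`, in catalogue order, that fit the time budget."""
    chosen: list[Clip] = []
    remaining = budget_seconds
    for clip in clips:
        if len(chosen) == max_clips:
            break
        if clip.level == level and clip.seconds <= remaining:
            chosen.append(clip)
            remaining -= clip.seconds
    return chosen


# Hypothetical catalogue for illustration.
catalogue = [
    Clip("Cooking segment", "A1", 240),
    Clip("Street interview", "A2", 200),
    Clip("Market visit", "A1", 300),
    Clip("Documentary excerpt", "A1", 400),
    Clip("Kitchen tour", "A1", 150),
]
for clip in build_watch_session(catalogue, "A1"):
    print(clip.title, clip.seconds)
```

With this catalogue the documentary excerpt is skipped because it no longer fits after the first two A1 clips, and the shorter kitchen tour takes the third slot.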