When Games Learned to Speak

The transition from text to voiced dialogue changed what games could say, who could make them, and how much they cost — and it happened faster than anyone anticipated

Text as the default

Text was the default medium for game narrative not by design preference but because of storage constraints. A text string requires approximately one byte per character; a second of uncompressed audio requires 8,000 to 44,100 bytes at one byte per sample, depending on sample rate, and more again for 16-bit or stereo sound. Twenty minutes of voiced dialogue was therefore far beyond what a machine with an 8 MHz processor and a 128-kilobyte RAM budget could store or process. Text was the only viable option for any game that wanted to communicate more than a few words of narrative, and the text adventure, the RPG, and the graphic adventure all developed sophisticated text-based narrative conventions that became genre standards: practical constraints that came to be treated as aesthetic decisions.
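The scale of the gap is easier to see with the arithmetic written out. The following is a minimal sketch, not a model of any particular machine: it uses the byte rates and the 128-kilobyte budget quoted above, and the speaking pace (150 words per minute) and average word length (six bytes including the space) are illustrative assumptions introduced here for the text comparison.

```python
# Rough storage comparison: uncompressed audio versus plain text.
# Byte rates and the RAM budget come from the paragraph above; the
# speaking pace and bytes-per-word figures are assumptions.

def raw_audio_bytes(seconds, sample_rate_hz, bytes_per_sample=1, channels=1):
    """Size of uncompressed PCM audio in bytes."""
    return seconds * sample_rate_hz * bytes_per_sample * channels

TWENTY_MINUTES = 20 * 60          # seconds
RAM_BUDGET = 128 * 1024           # 128 kilobytes

audio_low = raw_audio_bytes(TWENTY_MINUTES, 8_000)     # 9,600,000 bytes
audio_high = raw_audio_bytes(TWENTY_MINUTES, 44_100)   # 52,920,000 bytes

# Assumed: about 150 spoken words per minute, about 6 bytes per word.
ASSUMED_WORDS_PER_MINUTE = 150
ASSUMED_BYTES_PER_WORD = 6
text_bytes = 20 * ASSUMED_WORDS_PER_MINUTE * ASSUMED_BYTES_PER_WORD  # 18,000 bytes

print(f"audio at 8 kHz:    {audio_low:>12,} bytes")
print(f"audio at 44.1 kHz: {audio_high:>12,} bytes")
print(f"same words as text:{text_bytes:>12,} bytes")
print(f"RAM budget:        {RAM_BUDGET:>12,} bytes")
```

Under these assumptions the same twenty minutes of dialogue fits comfortably in memory as text, while even the lowest-rate audio version is dozens of times larger than the whole RAM budget.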

Infocom's text adventures (Zork, The Hitchhiker's Guide to the Galaxy, Deadline) conveyed character, atmosphere, and narrative complexity entirely through text, and they remain impressive as writing independent of their status as games. The writing quality was a response to the constraint: if text was all you had, the text had to be exceptional. When voice acting became technically feasible, exceptional prose became optional rather than necessary, and the average quality of game narrative writing declined as production attention moved from prose to audio performance.

CD-ROM and the voice experiment

Wing Commander III: Heart of the Tiger (1994) was among the first major games to use full voice acting throughout. Its cast, which included Mark Hamill, Malcolm McDowell, and John Rhys-Davies, performed in full-motion video (FMV) cutscenes at a production cost unprecedented for games at the time. The game required four CD-ROMs to hold its video and audio content, significantly more than the single disc that most games of the era used, and the production investment was approximately $4 million, the largest game budget of its year.

The FMV approach (filming actors in real locations or on sets, then digitising the footage and compressing it to run from CD-ROM) was technically impressive and aesthetically dated almost immediately. Compression artifacts often made the footage look worse than the game's rendered graphics, and producing FMV content had more in common with low-budget television production than with game development, requiring facilities and expertise that game studios rarely had in-house. By the late 1990s, in-engine cutscenes (animated sequences rendered by the game engine with voice acting overlaid) had replaced FMV in most games because they produced more consistent visual quality and were easier to revise as game design changed during development.

Motion capture and convergence with film

Motion capture technology, which records actor movement through markers tracked by multiple cameras and translates the captured data into digital character animation, entered game production in the mid-1990s and became standard for high-budget character animation by the mid-2000s. Voice recording and motion capture then converged into performance capture, in which facial performance, body movement, and voice are recorded together. Games including Heavy Rain (2010), L.A. Noire (2011), and The Last of Us (2013) used variations on this approach to produce character performances, rendered in real time, that matched the expressiveness of film acting.

The Last of Us's performances — Troy Baker as Joel, Ashley Johnson as Ellie — established a benchmark for acted character performance in games that critics from outside the games press acknowledged was comparable to television drama. The comparison was relevant: Neil Druckmann's creative direction and Amy Hennig's earlier narrative design work at Naughty Dog had produced games that were explicitly competing with television and film for narrative quality, and The Last of Us demonstrated that the comparison was not aspirational but accurate. Games could produce the emotional impact that drama produced, given sufficiently high-quality performance and writing.

The cost implication of professional voice acting at scale was significant. A game with ten hours of voiced dialogue required scripting ten hours of dialogue (approximately 90,000 to 100,000 words at a typical speaking pace of around 150 words per minute), recording it with actors (at union or non-union rates, with associated studio costs), editing and directing the performances, and integrating the audio into the game's dialogue system. For major studio productions, voice acting budgets grew from tens of thousands of dollars in the CD-ROM era to millions of dollars for games with large casts and extensive dialogue. Independent developers who couldn't afford professional voice acting returned to text-based dialogue out of practical necessity. That is why many independent games of the 2010s and 2020s use text dialogue rather than voice: not because the designers prefer it, but because the cost of quality voice production remained beyond most independent studios' budgets.
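A script-length estimate of this kind can be sanity-checked by converting between recorded hours and word count. The sketch below assumes continuous speech at a fixed pace; the function names and the 150 words-per-minute figure are illustrative assumptions, and a different pace changes the count proportionally.

```python
# Convert between hours of recorded dialogue and script length.
# Assumes continuous speech at a constant pace, which real sessions
# (pauses, retakes, branching lines) will not match exactly.

def words_for_hours(hours, words_per_minute):
    """Approximate script length for a given amount of recorded speech."""
    return round(hours * 60 * words_per_minute)

def implied_words_per_minute(total_words, hours):
    """Speaking pace implied by a word count spread over recorded hours."""
    return total_words / (hours * 60)

ASSUMED_PACE = 150  # words per minute; an assumed conversational pace

script_words = words_for_hours(10, ASSUMED_PACE)
print(f"10 hours at {ASSUMED_PACE} wpm is about {script_words:,} words")
print(f"implied pace: {implied_words_per_minute(script_words, 10):.0f} wpm")
```

Running the check in both directions is a quick way to see whether a quoted word count and a quoted running time are consistent with each other.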