Teaching Vision-Language Models to Speak Cinema
A year of building a video caption pipeline with 100+ professional creators, and what it taught us about scaling supervision instead of models.

By Zhiqiu Lin and Chancharik Mitra. Based on our CVPR 2026 work, Building a Precise Video Language with Human-AI Oversight (Highlight, Top 3%).

How close is today's video generator to a Hollywood cinematographer?

Hollywood directors reach for certain shots because they make a scene land. They cue a specific feeling in the viewer that flat coverage cannot. Open your favorite video generator (Veo 3.1, Seedance 2, or any of the latest open-source models) and ask it for a dolly zoom of a man standing in the middle of a bustling street, the way Hitchcock used the shot to make the world feel like it is collapsing inward. Or a rack focus pulling from a coffee cup to the woman behind it, the kind of focus pull that quietly tells the audience where to look. Or a Dutch-angle shot of a nervous person staring into the void, a tilted frame that puts the viewer on edge. Most generators will hand back something close to a generic dolly-in, or a slow-motion clip with the wrong focal subject. The output […]
