Video models are zero-shot learners and reasoners

Google Google DeepMind

TL;DR

Veo 3 shows emergent zero-shot abilities across many visual tasks, indicating that video models are on a path to becoming vision foundation models—just like LLMs became foundation models for language.

Perception

Modeling

Manipulation

Reasoning

Abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding?

This research demonstrates that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo 3's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

Podcast Overview

Listen to a generated summary of the research paper.

Perception

Edge detection

Segmentation

Keypoint localization

Super-resolution

Blind deblurring

Blind denoising

Low-light enhancement

Conjunctive search

Dalmatian illusion understanding

Shape cue-conflict understanding

Rorschach blot interpretation

Modeling

Material properties (flammability)

Rigid body transform

Soft body transform

Gravity (earth)

Gravity (moon)

Buoyancy (bottle cap)

Buoyancy (rock)

Visual Jenga

Object packing

Material optics (glass)

Material optics (mirror)

Color mixing (additive)

Color mixing (subtractive)

Categorizing objects

Omniglot (recognition)

Omniglot (generation)

Omniglot (parsing)

Memory of world states

Manipulation

Background removal

Style transfer

Colorization

Inpainting

Outpainting

Text manipulation

Image editing with doodles

Scene composition

Novel view synthesis

3D-aware reposing

Transfiguration

Professional headshot

Dexterous manipulation (jar)

Dexterous manipulation (throw/catch)

Dexterous manipulation (baoding balls)

Affordance recognition

Drawing

Visual instruction

Reasoning

Graph traversal

Tree BFS

Sequence (dots)

Sequence (arrows)

Sequence (circles)

Sequence (squares)

Connecting colors

Shape fitting

Sorting numbers

Tool use

Simple sudoku completion

Water puzzle solving

Maze solving (mouse)

Robot navigation

Rule extrapolation

Analogy (color)

Analogy (resize)

Analogy (reflect)

Analogy (rotate)

Maze (5x5)

Maze (7x7)

Maze (9x9)

Maze (irregular)

Symmetry (shape)

Symmetry (random)