Video models are zero-shot learners and reasoners

Google DeepMind

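TL;DR

Veo 3 shows zero-shot abilities across many visual tasks, indicating that video models are on a path to becoming vision foundation models, just as LLMs became foundation models for language.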

Perception

Modeling

Manipulation

Reasoning

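Abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding?

This work demonstrates that Veo 3 can solve, zero-shot, a broad variety of tasks it was not explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning such as maze and symmetry solving. Veo 3's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.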

Perception

Edge detection

Segmentation

Keypoint localization

Super-resolution

Blind deblurring

Blind denoising

Low-light enhancement

Conjunctive search

Dalmatian illusion understanding

Shape cue-conflict understanding

Rorschach blot interpretation

Modeling

Material properties (flammability)

Rigid body transform

Soft body transform

Gravity (earth)

Gravity (moon)

Buoyancy (bottle cap)

Buoyancy (rock)

Visual Jenga

Object packing

Material optics (glass)

Material optics (mirror)

Color mixing (additive)

Color mixing (subtractive)

Categorizing objects

Omniglot (recognition)

Omniglot (generation)

Omniglot (parsing)

Memory of world states

Manipulation

Background removal

Style transfer

Colorization

Inpainting

Outpainting

Text manipulation

Image editing with doodles

Scene composition

Novel view synthesis

3D-aware reposing

Transfiguration

Professional headshot

Dexterous manipulation (jar)

Dexterous manipulation (throw/catch)

Dexterous manipulation (baoding balls)

Affordance recognition

Drawing

Visual instruction

Reasoning

Graph traversal

Tree BFS

Sequence (dots)

Sequence (arrows)

Sequence (circles)

Sequence (squares)

Connecting colors

Shape fitting

Sorting numbers

Tool use

Simple sudoku completion

Water puzzle solving

Maze solving (mouse)

Robot navigation

Rule extrapolation

Analogy (color)

Analogy (resize)

Analogy (reflect)

Analogy (rotate)

Maze (5x5)

Maze (7x7)

Maze (9x9)

Maze (irregular)

Symmetry (shape)

Symmetry (random)