INTERNVIDEO2

2024/9/14 Paper-reading series: large video models

Paper release date: 2024-08-14

ABSTRACT

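The paper introduces InternVideo2, a new video foundation model (ViFM) that achieves SOTA results on video understanding, video-text tasks, and video dialogue.
The core design is a progressive learning framework that combines three tasks (masked video modeling, cross-modal contrastive learning, and next token prediction) and scales the video encoder to 6B parameters.
On the data side, videos are semantically segmented and audio, speech, and video captions are generated for them, which prioritizes spatiotemporal consistency and improves video-text alignment.
The model significantly outperforms other models on video dialogue and long video understanding, highlighting its long-context reasoning ability.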

Introduction

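Goal
Learning transferable spatiotemporal representations is a key research area in CV. The emergence of LLMs and their variants, MLLMs, has had a profound impact on vision research and many other fields. Effectively embedding videos into these large models and using their capabilities for video understanding has become a key task.
Motivation
- Prior work has identified several effective ways to learn video representations: reconstructing videos with masked inputs, aligning videos with languages, and predicting the next token using videos.
- These methods are complementary and can be unified through a progressive training scheme. InternVideo, UMT, and VideoPrism already use two-stage training, combining masked reconstruction and multimodal contrastive learning, and achieve strong downstream performance. Following this direction, the authors aim to bring video-based next token prediction into the progressive learning framework and to scale up the whole training process, including both model and data.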
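Method
- A progressive learning framework with three stages
  1. capturing spatiotemporal structure via unmasked video token reconstruction
  2. aligning with semantics from other modalities
  3. enhancing its open-ended dialogue power through next token prediction
- A very large dataset
  402M data entries: 2M videos, 50M video-text pairs, 50M video-audio-speech-text pairs, and 300M image-text pairs
- Multimodal alignment
  The authors split videos into semantically coherent clips and focus on realigning the semantic descriptions of these clips using audio, video, and speech.
  They first generate a caption for each of the three modalities separately, then fuse the captions into a more comprehensive description.
Contributions
- Proposes InternVideo2, which uses the three tasks to make the model better at reasoning in video understanding
- InternVideo2 achieves SOTA performance on 60 video/audio tasks and handles long contexts well
- Provides an enhanced dataset, which includes validating and incorporating audio data during training and an improved captioning method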

Related Work

Video Foundation Models

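Typical ways to build a ViFM
- video-text contrastive learning
- masked video modeling
- next token prediction
Prior work
- All-in-one uses a single backbone with multiple pretraining objectives
- UMT combines masked modeling with video-text contrastive learning
- mPLUG-2 introduces a new design for multimodal modeling: a universal module strengthens the correlation between modalities, while modality-specific modules keep them distinct. Beyond video-text pretraining, it also uses the audio information in videos
- MERLOT Reserve learns video representations from large-scale video-speech-transcript pairs
- VALOR uses separate video, audio, and text encoders and trains a joint visual-audio-text representation
- VAST builds an audio-visual-speech dataset and develops a multimodal backbone that performs strongly on video-audio tasks
- VideoPrism combines video-text contrastive learning and video token reconstruction on a mix of public and proprietary videos, and achieves leading results on video tasks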

Multimodal Large Language Models

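Prior work
- Flamingo shows strong zero/few-shot performance across a wide range of multimodal tasks
- Public MLLMs such as LLaVA and InstructBLIP propose visual instruction-tuning data to improve visual dialogue ability
- Video-centric MLLMs such as VideoChat, VideoChatGPT, and Valley use instruction data to connect video encoders to LLMs for open-world video understanding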

Method

Video Encoder:

The video encoder is a ViT with additional projection layers for distillation; it also introduces attention pooling.

For each input video, 8 frames are sparsely sampled and a 14×14 (h × w) spatial downsampling is applied. The resulting spatiotemporal tokens are concatenated with a class token and combined with 3D position embeddings.
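
To make the token layout concrete, the short sketch below counts the tokens this patchification would produce. The 224×224 input resolution and the function name `count_video_tokens` are assumptions for illustration; the notes only state 8 frames and a 14×14 patch size.

```python
def count_video_tokens(frames=8, height=224, width=224, patch=14, cls_tokens=1):
    """Count ViT input tokens for a sparsely sampled clip.

    frames and patch follow the notes; height and width are assumed values.
    """
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    per_frame = (height // patch) * (width // patch)  # spatial tokens per frame
    return frames * per_frame + cls_tokens            # plus the class token


print(count_video_tokens())  # 8 * 16 * 16 + 1 = 2049
```

Under these assumptions each frame contributes 16 × 16 = 256 tokens, so doubling the frame count roughly doubles the sequence length.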

Stage 1: Reconstructing Unmasked Video Tokens

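Two expert (teacher) models are used:
- InternVL-6B
- VideoMAEv2-g
They guide the video encoder to reconstruct the unmasked regions at the token level.
Training process:
- Feed the full video into the different teacher models
- Mask 80% of the tokens frame by frame
- InternVL provides semantic guidance
- VideoMAEv2 provides motion-aware guidance
- Simple projection layers are used to transfer the knowledge of the unmasked regions (?)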
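
The frame-by-frame masking step can be sketched as follows. This is a minimal illustration that picks the kept tokens uniformly at random; the actual selection rule may differ, and the function name, seed, and the 256 tokens per frame are assumptions.

```python
import random


def framewise_mask(num_frames, tokens_per_frame, mask_ratio=0.8, seed=0):
    """Return, for each frame, the sorted indices of tokens that stay unmasked."""
    rng = random.Random(seed)
    num_masked = int(round(tokens_per_frame * mask_ratio))
    keep = []
    for _ in range(num_frames):
        order = list(range(tokens_per_frame))
        rng.shuffle(order)                       # independent shuffle per frame
        keep.append(sorted(order[num_masked:]))  # the first num_masked are masked
    return keep


kept = framewise_mask(num_frames=8, tokens_per_frame=256)
print(len(kept), len(kept[0]))  # 8 frames, 51 unmasked tokens each
```

Masking each frame separately keeps the same number of visible tokens in every frame, so no frame is dropped entirely.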
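Alignment:
- Only the unmasked tokens are aligned
- Alignment is done by minimizing the mean squared error (MSE) between the student model and the teacher models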
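Objective:

$$\mathcal{L} = \frac{1}{Z} \sum_{p} \left( \alpha_1 \left| f^{V}(\mathbf{V}_p) - h(\mathbf{V}_p) \right|^2 + \alpha_2 \left| f^{V}(\mathbf{V}_p) - g(\mathbf{V}_p) \right|^2 \right)$$

where $f^{V}$, $h$, and $g$ are our video encoder, InternViT-6B, and the ViT-g part of VideoMAEv2, respectively. $p$ is the token index, and $f^{V}(\mathbf{V}_p)$ is the corresponding token that InternVideo2 extracts from the input video $\mathbf{V}$. $Z$ is the normalization factor. $\alpha_1$ and $\alpha_2$ balance the influence of the two teacher models.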
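
As a rough illustration of this objective, the sketch below computes a two-teacher MSE over unmasked tokens only, using plain Python lists. It assumes the features have already been projected to a common dimension, and it takes the normalization factor to be (number of unmasked tokens × feature dimension); both choices, like the function name, are assumptions rather than details from the paper.

```python
def unmasked_distill_loss(student, teacher_h, teacher_g, keep_idx,
                          alpha1=1.0, alpha2=1.0):
    """Weighted squared error against two teachers, over unmasked tokens only.

    student, teacher_h, teacher_g: lists of token vectors of equal shape.
    keep_idx: indices of the unmasked tokens.
    """
    total = 0.0
    for p in keep_idx:
        err_h = sum((s - t) ** 2 for s, t in zip(student[p], teacher_h[p]))
        err_g = sum((s - t) ** 2 for s, t in zip(student[p], teacher_g[p]))
        total += alpha1 * err_h + alpha2 * err_g
    z = len(keep_idx) * len(student[0])  # assumed normalization factor
    return total / z


student = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
teacher_h = [[0.0, 0.0], [1.0, 3.0], [9.0, 9.0]]
teacher_g = [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]]
# Token 2 is masked, so its large error does not contribute.
print(unmasked_distill_loss(student, teacher_h, teacher_g, keep_idx=[0, 1]))  # 1.0
```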

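Experiments:
- The video encoder is randomly initialized, and its outputs at different layers are aligned with the teachers:
  1. the last 6 layers with InternVL
  2. the last 4 layers with VideoMAEv2
  3. the final output token with InternVL
- The final loss is simply the sum of the individual loss terms
- After pretraining, the projection layers are discarded and only the base encoder is used
- This stage improves multimodal friendliness and the temporal sensitivity needed for action modeling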

Stage 2: Aligning Video to Audio-Speech-Text

Model architecture:
- InternVideo2 has a large video encoder plus relatively lightweight audio and text encoders.
- The audio encoder is a 12-layer transformer initialized with BEATs (90M). Its input is 64-dimensional log Mel filterbank spectrograms, computed with a 25 ms Hamming window from 10-second clips (zero-padded).
- The text encoder uses the first 19 layers of BERT-Large; the multimodal decoder uses the last 5 layers of BERT-Large, equipped with cross-attention.
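
The audio front end described above starts by padding each clip to 10 seconds and cutting it into Hamming-windowed frames. The sketch below shows only that step; the Mel filterbank projection is omitted. The 16 kHz sample rate and the 10 ms hop are assumptions not stated in the notes, and `frame_audio` is a hypothetical helper name.

```python
import math


def frame_audio(samples, sample_rate=16000, clip_seconds=10, win_ms=25, hop_ms=10):
    """Zero-pad (or trim) to a fixed clip length, then cut Hamming-windowed frames."""
    target = sample_rate * clip_seconds
    clip = list(samples[:target]) + [0.0] * max(0, target - len(samples))
    win = sample_rate * win_ms // 1000   # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16 kHz (assumed hop)
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (win - 1)) for n in range(win)]
    frames = []
    for start in range(0, target - win + 1, hop):
        frames.append([clip[start + n] * hamming[n] for n in range(win)])
    return frames


frames = frame_audio([1.0] * 16000)   # a 1-second clip, padded to 10 seconds
print(len(frames), len(frames[0]))    # 998 400
```

Each frame would then go through a power spectrum and a 64-band Mel filterbank before the log is taken.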
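Training objective
Alignment between the different modalities (video, audio, image, speech) is established through text, using a cross-modal contrastive loss, a cross-modal matching loss, and a masked language modeling loss.
$\mathcal{L}_{\text{CON}}$ is the cross-modal contrastive loss, $\mathcal{L}_{\text{MAC}}$ is the cross-modal matching loss, and $\mathcal{L}_{\text{MLM}}$ is the masked language modeling loss.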
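- Cross-modal contrastive loss

  $$\mathcal{L}_{\text{CON}} = -\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(f^{M}_i, f^{T_M}_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(f^{M}_i, f^{T_M}_j)/\tau\right)} - \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(f^{T_M}_i, f^{M}_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(f^{T_M}_i, f^{M}_j)/\tau\right)}$$

  where $f^{V}$ and $f^{T}$ denote the learned video and text embeddings (written $f^{M}$ and $f^{T_M}$ for a general modality). $M$ and $T_M$ denote the modality of the input signal and the text describing that signal. $\mathrm{sim}(\cdot)$ computes the cosine similarity between two features. $\tau$ is a learnable temperature.
- Cross-modal matching loss

  $$\mathcal{L}_{\text{MAC}} = -y \log f_p(\mathbf{V}, \mathbf{T}) - (1 - y) \log\left(1 - f_p(\mathbf{V}, \mathbf{T})\right)$$

  where $f_p(\mathbf{V}, \mathbf{T})$ computes the matching likelihood between $\mathbf{V}$ and $\mathbf{T}$. $y$ indicates whether the given video and text are paired ($y = 1$) or not ($y = 0$).
- Masked language modeling loss

  $$\mathcal{L}_{\text{MLM}} = -\log f^{T}_p(\mathbf{T}_j \mid \mathbf{T}_{<j})$$

  where $f^{T}_p(\mathbf{T}_j \mid \mathbf{T}_{<j})$ computes the likelihood of the $j$-th text token given the preceding ones. Here $\mathbf{T}$ refers to video captions.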
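
The three loss terms can be sketched in plain Python as below. This is a toy version under stated assumptions: the contrastive term is written as a standard symmetric InfoNCE over a batch, the temperature is fixed at 0.07 rather than learned, and the matching and token probabilities are passed in directly instead of coming from a model. All function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(feats_m, feats_t, tau=0.07):
    """Symmetric InfoNCE: entry i of each list is a matched (signal, text) pair."""
    n = len(feats_m)
    sim = [[cosine(feats_m[i], feats_t[j]) / tau for j in range(n)] for i in range(n)]
    loss = 0.0
    for i in range(n):
        row = [sim[i][j] for j in range(n)]  # signal i against every text
        col = [sim[j][i] for j in range(n)]  # text i against every signal
        loss -= sim[i][i] - math.log(sum(math.exp(s) for s in row))
        loss -= sim[i][i] - math.log(sum(math.exp(s) for s in col))
    return loss


def matching_loss(p_match, y):
    """Binary cross-entropy on the predicted matching probability (0 < p < 1)."""
    return -(y * math.log(p_match) + (1 - y) * math.log(1 - p_match))


def mlm_loss(token_probs):
    """Negative log-likelihood of the target caption tokens, summed over positions."""
    return -sum(math.log(p) for p in token_probs)


aligned = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(aligned, aligned) < contrastive_loss(aligned, aligned[::-1]))  # True
```

Correctly paired embeddings give a much smaller contrastive loss than swapped pairs, which is the behavior the alignment stage relies on.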
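Masked learning strategy
- Aligning Masked Visual-Language-Audio
  - Freeze the audio encoder
  - Focus on aligning visual, audio, and text features
  - Pretrain on the full image, video, and audio-video datasets
  - The modality combinations are given as $\{M, T_M\}$ pairs (a modality and the text describing it), where each pair denotes the concatenated features from the corresponding modalities; one of the text inputs covers the video together with its audio and speech transcript
- Unmasked Visual-Audio-Language Post-Pretraining
  - Freeze the vision encoder
  - Jointly align audio, visual, and text features
  - Use a smaller subset of the image and video data (25M), plus the full audio (0.5M) and audio-video (50M) datasets
  - No masking is used, to stay consistent with inference and minimize the performance drop on downstream tasks
  - The modality combinations are again given as $\{M, T_M\}$ pairs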
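
The two phases differ mainly in which encoder is frozen and whether masking is applied. A small configuration sketch makes that contrast explicit; the module names and the assumption that the text encoder is trainable in both phases are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage2Phase:
    name: str
    frozen: tuple       # encoders kept fixed in this phase
    use_masking: bool   # whether video tokens are masked during training


PHASES = (
    Stage2Phase("masked alignment", frozen=("audio_encoder",), use_masking=True),
    Stage2Phase("unmasked post-pretraining", frozen=("vision_encoder",), use_masking=False),
)


def trainable(phase, modules=("vision_encoder", "audio_encoder", "text_encoder")):
    """List the modules that receive gradient updates in a phase."""
    return [m for m in modules if m not in phase.frozen]


for phase in PHASES:
    print(phase.name, trainable(phase), phase.use_masking)
```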

Stage 3: Predicting Next Token with Video-Centric Inputs

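Connecting and tuning the model:
- InternVideo2 is connected to a large language model (LLM) using a QFormer design.
- A progressive learning scheme is used: with InternVideo2 as the video encoder, a video BLIP model is trained to communicate with an open-source LLM.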
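
The key property of a QFormer-style connector is that a fixed set of query vectors cross-attends to the video tokens, so the LLM receives the same number of tokens regardless of video length. The sketch below is a heavily simplified illustration of that idea: one head, one layer, and no learned projections, none of which reflects the real QFormer configuration.

```python
import math


def cross_attend(queries, tokens):
    """Single-head cross-attention without learned projections (simplified)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(d) for t in tokens]
        m = max(scores)                           # subtract max for stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * t[k] for w, t in zip(weights, tokens))
                    for k in range(d)])
    return out


queries = [[1.0, 0.0], [0.0, 1.0]]          # 2 query vectors (learned in practice)
video_tokens = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.3], [0.8, 0.4]]
print(len(cross_attend(queries, video_tokens)))  # 2, independent of token count
```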
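HD post-training stage:
- The goal is to improve the model's fine-grained and long spatiotemporal capabilities.
- Input video processing:
  - The video is split into at most 6 sub-videos, each at 224x224 resolution.
  - A globally resized sub-video at the same resolution is kept as well.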
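
One plausible way to realize this splitting is to pick a tile grid of at most 6 cells that best matches the video's aspect ratio, resize each frame to that grid, and crop 224×224 tiles. The grid-selection rule below is an assumption for illustration; the notes only state the tile limit, the tile size, and the extra global view.

```python
def choose_grid(height, width, max_tiles=6):
    """Pick (rows, cols) with rows * cols <= max_tiles closest to the aspect ratio."""
    target = width / height
    best = None
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            # Prefer the closest aspect ratio, then the larger number of tiles.
            score = (abs(cols / rows - target), -rows * cols)
            if best is None or score < best[0]:
                best = (score, (rows, cols))
    return best[1]


def tile_boxes(height, width, tile=224, max_tiles=6):
    """Crop boxes (top, left, bottom, right) on a frame resized to the chosen grid."""
    rows, cols = choose_grid(height, width, max_tiles)
    return [(r * tile, c * tile, (r + 1) * tile, (c + 1) * tile)
            for r in range(rows) for c in range(cols)]


print(choose_grid(720, 1280))       # (1, 2)
print(len(tile_boxes(720, 1280)))   # 2 local tiles, plus one global 224x224 view
```

The global view would be produced separately by resizing the whole frame to 224×224.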
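Additional training:
- Two extra training epochs are run:
  - the first epoch uses 8-frame video inputs
  - the second epoch uses 16-frame video inputs
- What gets updated:
  - the video encoder and the BLIP QFormer
  - the LLM, via LoRA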
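
For reference, LoRA keeps the original weight matrix frozen and learns a low-rank update next to it. The sketch below shows the forward pass for one linear layer; the rank, the scaling value `alpha`, and the zero-initialized `B` matrix follow common LoRA practice and are not taken from the paper.

```python
def lora_forward(x, w, a, b, alpha=16.0):
    """Compute y = W x + (alpha / r) * B (A x); W is frozen, only A and B train.

    w: out x in matrix, a: r x in matrix, b: out x r matrix.
    """
    r = len(a)
    ax = [sum(ai * xi for ai, xi in zip(row, x)) for row in a]      # A x
    base = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]    # W x
    delta = [sum(bi * h for bi, h in zip(row, ax)) for row in b]    # B (A x)
    return [y + (alpha / r) * d for y, d in zip(base, delta)]


w = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight
a = [[1.0, 1.0]]               # rank-1 down-projection
b_init = [[0.0], [0.0]]        # zero init, so the layer starts unchanged
print(lora_forward([1.0, 2.0], w, a, b_init))  # [1.0, 2.0]
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the small `A` and `B` matrices need gradients.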

Multimodal Video Data

Video-only Data for Masked Autoencoders
K-Mash: a new unlabeled video set curated from action recognition datasets.
Videos with Audio-Speech Modalities
InternVid2: a multimodal video dataset with video-audio-speech information and their descriptions, used to strengthen video perception via other modalities.
A video multimodal annotation system, VidCap, gives proper unimodal and crossmodal descriptions for textualizing videos from different perceptions.
Instruction-Tuning Data for Video Dialogue
An updated training version of MVBench. This training data covers key features of image and video understanding across crucial tasks, including 1) conversation, 2) caption, 3) visual question answer, 4) reasoning, and 5) classification.
