where
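Experiments:
The video encoder is randomly initialized, and its outputs at different layers are then aligned.
The final loss is simply the sum of the individual loss terms, which are optimized jointly.
After pretraining, the projection layers are discarded and only the base encoder is used.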
This makes the encoder more multimodal-friendly and more temporally sensitive for action modeling.
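The layer-wise alignment described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`alignment_loss`, `token_mse`, `project`), the use of an MSE on L2-normalized tokens, and the equal weighting of layers are all assumptions made for the example. It only shows the general pattern of projecting student features, comparing them with reference features layer by layer, and summing the terms.

```python
import math

def l2_normalize(vec):
    # Guard against a zero vector so the division is always defined.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def project(vec, weight):
    # Linear projection head; `weight` is a list of output rows.
    # (Hypothetical stand-in for the projection layers dropped after pretraining.)
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def token_mse(student_tok, teacher_tok):
    # Assumption: compare L2-normalized tokens with a mean squared error.
    s, t = l2_normalize(student_tok), l2_normalize(teacher_tok)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

def alignment_loss(student_layers, teacher_layers, projections):
    """Sum the per-layer alignment terms (unweighted sum, as an assumption).

    student_layers[i], teacher_layers[i]: token vectors of aligned layer pair i
    projections[i]: projection weight mapping student dim to teacher dim
    """
    total = 0.0
    for s_tokens, t_tokens, w in zip(student_layers, teacher_layers, projections):
        terms = [token_mse(project(s, w), t) for s, t in zip(s_tokens, t_tokens)]
        total += sum(terms) / len(terms)  # mean over tokens, then add to the sum
    return total
```

At inference time only the encoder that produced `student_layers` would be kept; the `projections` exist solely to make the comparison possible during pretraining.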
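Model architecture:
Training objective
The different modalities (video, audio, image, speech) are aligned with one another through text, using a cross-modal contrastive loss, a cross-modal matching loss, and a masked language modeling loss.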
where
Cross-modal matching loss
where
Masked language modeling loss
where
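The three objectives named in this part (cross-modal contrastive, cross-modal matching, masked language modeling) can be sketched in their commonly used textbook forms. This is an illustrative sketch only: the exact formulas in the original notes are not recoverable here, so the symmetric InfoNCE form, the binary cross-entropy for matching, the temperature of 0.07, and the default unit weights in `total_loss` are all assumptions, and every function name is hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(media_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i is the positive, other batch items are negatives.
    Embeddings are assumed to be L2-normalized already."""
    n = len(media_embs)
    sim = [[dot(m, t) / temperature for t in text_embs] for m in media_embs]
    media_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_media = -sum(
        log_softmax([sim[j][i] for j in range(n)])[i] for i in range(n)
    ) / n
    return 0.5 * (media_to_text + text_to_media)

def matching_loss(match_probs, labels):
    """Binary cross-entropy on predicted match probabilities (label 1 = matched pair)."""
    eps = 1e-12  # avoids log(0)
    terms = [
        -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
        for p, y in zip(match_probs, labels)
    ]
    return sum(terms) / len(terms)

def mlm_loss(token_logits, target_ids, masked_positions):
    """Cross-entropy over the vocabulary, evaluated only at masked positions."""
    terms = [
        -log_softmax(token_logits[pos])[target_ids[pos]] for pos in masked_positions
    ]
    return sum(terms) / len(terms)

def total_loss(l_con, l_mac, l_mlm, weights=(1.0, 1.0, 1.0)):
    # Assumption: a weighted sum with unit weights by default.
    return weights[0] * l_con + weights[1] * l_mac + weights[2] * l_mlm
```

Because text is the common anchor, `media_embs` could hold video, audio, image, or speech embeddings; the same three functions would apply to each media-text pairing.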
masked learning strategy
Video-only Data for Masked Autoencoders
A new unlabeled video set, named K-Mash, built from action recognition datasets.
Videos with Audio-Speech Modalities
A multimodal video dataset, coined InternVid2, with video, audio, and speech information and their descriptions, used to strengthen video perception via other modalities.
We design a multimodal video annotation system, VidCap, to give proper unimodal and cross-modal descriptions for textualizing videos from different perceptions.
Instruction-Tuning Data for Video Dialogue
An updated training version of MVBench. This training data encompasses key features of image and video understanding across crucial tasks, including 1) conversation, 2) captioning, 3) visual question answering, 4) reasoning, and 5) classification.