We recently released Owlet-Phi-2, an efficient and powerful video-text model for critical video-related tasks. Going a step further, to also make use of the audio signal in videos, we propose a model architecture that handles audio-visual inputs explicitly. We train our model, Owlet-Phi-2-Audio, with both the audio and visual data from a video instruction-tuning dataset. Comparison with visual-only models shows that training on audio data indeed leads to more accurate responses.
Model Architecture
Following the idea of fusing modality inputs into the LLM, we build a video-text MLLM architecture with two separate branches for audio and visual inputs. Each branch consists of a modality encoder and projector layers that map the encoder representations into the LLM embedding space, followed by the shared backbone LLM.
We use Whisper as the audio encoder and take its last hidden state as the audio representation. We use the SigLIP image encoder to encode the video: we treat the video as a sequence of images and compute frame representations with SigLIP. We then average these representations spatially and temporally across 100 uniformly sampled frames and use the result as the video representation. This process of encoding multiple image frames into a video representation is explained in detail in our previous blog, Owlet-Phi-2. We rely on a low-cost, efficient, lightweight LLM backbone with 2.7 billion parameters, phi-2. The projector layer for both the vision and audio branches is an mlp2x-gelu (a two-layer MLP with GELU activation).
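To make the layout concrete, here is a minimal PyTorch sketch of the two-branch design. The module names, hidden sizes, and pooling details are illustrative assumptions rather than our exact implementation; the previous Owlet-Phi-2 blog describes the video encoding in full.

```python
# Illustrative sketch only: dimensions (Whisper 1280, SigLIP 1152, phi-2 2560)
# and attribute names are assumptions, not the released implementation.
import torch
import torch.nn as nn


def mlp2x_gelu(in_dim: int, out_dim: int) -> nn.Sequential:
    # Two-layer MLP with a GELU in between, i.e. the "mlp2x-gelu" projector pattern.
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))


class AudioVisualBranches(nn.Module):
    """Two projector branches feeding a shared LLM embedding space."""

    def __init__(self, audio_encoder, vision_encoder,
                 audio_dim=1280, vision_dim=1152, llm_dim=2560, num_frames=100):
        super().__init__()
        self.audio_encoder = audio_encoder    # e.g. a frozen Whisper encoder
        self.vision_encoder = vision_encoder  # e.g. a frozen SigLIP vision tower
        self.audio_proj = mlp2x_gelu(audio_dim, llm_dim)
        self.vision_proj = mlp2x_gelu(vision_dim, llm_dim)
        self.num_frames = num_frames

    def encode_audio(self, audio_features):
        # Last hidden state of the audio encoder, projected into the LLM space.
        hidden = self.audio_encoder(audio_features).last_hidden_state  # (B, T, audio_dim)
        return self.audio_proj(hidden)

    def encode_video(self, frames):
        # frames: (num_frames, 3, H, W), sampled uniformly from the video.
        tokens = self.vision_encoder(frames).last_hidden_state   # (F, P, vision_dim)
        per_frame = tokens.mean(dim=1)                            # spatial average per frame
        pooled = per_frame.mean(dim=0, keepdim=True)              # temporal average
        return self.vision_proj(torch.cat([per_frame, pooled], dim=0))
```

The projected audio and video tokens are then placed alongside the text tokens in the phi-2 input sequence.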
Pretraining
Pretraining aims to align the different modalities with the text LLM space by training on a generic modality-to-text task. Only the projector layer weights are trained during this phase, while the encoder and LLM weights are frozen. We pretrain our audio projector layers on a combination of a Speech-to-Text (STT) dataset and an audio captioning dataset, with 50K samples each. We freeze the vision branch while pretraining the audio projector layers, and vice versa.
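The sketch below illustrates this freezing scheme for the audio pretraining stage, reusing the hypothetical module names from the architecture sketch above.

```python
# Sketch of the pretraining freezing scheme: only the audio projector is trainable.
# Attribute names follow the illustrative AudioVisualBranches sketch above.
def freeze_for_audio_pretraining(model, llm):
    # Freeze both encoders, the vision projector, and the LLM backbone...
    for module in (model.audio_encoder, model.vision_encoder, model.vision_proj, llm):
        for p in module.parameters():
            p.requires_grad = False
    # ...and leave only the audio projector trainable.
    for p in model.audio_proj.parameters():
        p.requires_grad = True
    # Swap the roles of the two projectors to pretrain the vision branch instead.
```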
Finetuning
Finetuning, or instruction tuning, aims to train the LLM to follow the exact requests or questions in the user prompt. Unlike previous works, we explicitly train both the audio and visual branches of the model simultaneously. We use a video instruction-tuning dataset containing both audio and visual data, extracting the audio tracks (wav format) from the videos (mp4 format) for our use case.
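As a rough illustration, the audio tracks can be extracted with ffmpeg along these lines. The 16 kHz mono settings match Whisper's expected input; the paths and helper name are only for illustration and are not part of the released code.

```python
# Extract a mono 16 kHz wav track from each mp4 so it can be fed to the audio branch.
import subprocess
from pathlib import Path


def extract_audio(video_path: str, out_dir: str = "audio") -> Path:
    out = Path(out_dir) / (Path(video_path).stem + ".wav")
    out.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,   # overwrite output, read the mp4
         "-vn", "-ac", "1", "-ar", "16000",  # drop video, mono, 16 kHz
         str(out)],
        check=True,
    )
    return out
```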
Benchmark Dataset & Evaluation
Existing benchmarks do not consider audio information when creating question-answer pairs based on videos, which makes it challenging to evaluate a model's ability to attend to both the audio and visual signals while generating its output. Therefore, we annotate an audio-visual instruction-tuning dataset that contains question-answer pairs based on both the audio and visual information in the video. We release a set of 120 such samples, and we intend to scale the size and quality of the data in the future.
We rely on LLM-based evaluation (using GPT-3.5), which rates each output on a scale of 1-5. The model is compared on 5 key metrics as stated here. We compare our audio-visual model with the visual-only baseline that we trained, as well as another audio-visual model, Video-LLaMA.
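A simplified sketch of this LLM-as-judge setup is shown below. The prompt wording, the single-score rubric, and the `rate_response` helper are illustrative assumptions, not the exact evaluation prompt or metrics used.

```python
# Hedged sketch: ask GPT-3.5 to score a model answer against a reference on a 1-5 scale.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rate_response(question: str, reference: str, prediction: str) -> int:
    prompt = (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Rate the model answer on a scale of 1-5 for correctness. Reply with the number only."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(reply.choices[0].message.content.strip())
```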
How to use our model?
To democratize research in the field of multimodal AI, we have released Owlet-Phi-2-Audio as an open-weights model on the HuggingFace hub.
You can access the model here. We also provide a simple Python script to run inference and get you started easily.
We hope this proves useful to a wide range of AI developers and researchers working towards better multimodal AI systems.