Conversational agents powered by Large Language Models (LLMs) are providing a new way to interact with visual data. We are excited to announce our new model, Owlet-phi-2, built for conversations with videos. It is the first release in our long-term effort to build Owlet, a family of lightweight and efficient multimodal models. This blog highlights the following key points about our latest release:
- Existing image-text LLMs
- Owlet architecture: extending to videos
- Model training & evaluation
- How to use the model?
Existing image-text LLMs
Recently, models like GPT-4V, LLaVA, and MiniGPT-4 have displayed impressive multimodal chat capabilities using image-text data. The typical architecture of these models is shown in the figure below.
We explore the possibility of extending these image-text capabilities to video, which can essentially be treated as a sequence of image frames.
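To make this concrete, here is a minimal PyTorch sketch of that typical pipeline: a frozen vision encoder produces patch features, a small connector MLP projects them into the LLM's embedding space, and the projected visual tokens are concatenated with the text tokens before being passed to the LLM. All module names and dimensions below are illustrative placeholders, not the internals of any particular model.

```python
import torch
import torch.nn as nn

class ImageTextLLMSketch(nn.Module):
    """Illustrative pipeline: vision encoder -> connector -> LLM embedding space."""

    def __init__(self, vision_dim=1152, llm_dim=2560):
        super().__init__()
        # Stand-in for a pretrained vision encoder (frozen in practice).
        self.vision_encoder = nn.Identity()      # e.g. a ViT returning patch features
        # A small projector, often a 2-layer MLP with GELU, maps vision features
        # into the LLM's token embedding space.
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features, text_embeddings):
        # patch_features: (batch, num_patches, vision_dim)
        visual_tokens = self.connector(self.vision_encoder(patch_features))
        # Prepend projected visual tokens to the text token embeddings,
        # then feed the combined sequence to the LLM (omitted here).
        return torch.cat([visual_tokens, text_embeddings], dim=1)

# Example shapes only: 729 patches of dim 1152, LLM hidden size 2560.
fused = ImageTextLLMSketch()(torch.randn(1, 729, 1152), torch.randn(1, 16, 2560))
print(fused.shape)  # torch.Size([1, 745, 2560])
```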
Owlet architecture: extending to videos
Owlet-phi-2 relies on the following components:
- Pretrained vision encoder: SigLIP
- Pretrained LLM: Phi-2
- Connector: mlp2x_gelu
The key feature of the Owlet-phi-2 architecture is its video-encoding process. We uniformly sample 100 frames from a video and compute their image encodings with SigLIP. We then apply temporal and spatial pooling to build a representation of the whole video. This process is illustrated in the diagram below.
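The diagram is the authoritative description, but a simplified sketch of the idea follows: per-frame SigLIP features are averaged across time (temporal pooling) and across spatial patches (spatial pooling), and the resulting tokens are what the mlp2x_gelu connector projects into Phi-2's input space. The exact pooling combination and all shapes below are illustrative assumptions, not the precise Owlet implementation.

```python
import torch

def uniform_sample_indices(total_frames, num_samples=100):
    """Pick num_samples frame indices evenly spaced across the video."""
    return torch.linspace(0, total_frames - 1, num_samples).long()

def pool_video_features(frame_features):
    """frame_features: (frames, patches, dim) per-frame patch encodings
    (e.g. SigLIP-like features). Returns a compact set of video tokens."""
    # Temporal pooling: average each patch position over time -> (patches, dim).
    temporal_tokens = frame_features.mean(dim=0)
    # Spatial pooling: average the patches within each frame -> (frames, dim).
    spatial_tokens = frame_features.mean(dim=1)
    # Concatenate both views; the connector then projects these tokens
    # into the LLM's input embedding space.
    return torch.cat([temporal_tokens, spatial_tokens], dim=0)

# Example: a 2400-frame video -> 100 sampled frames, 729 patches, feature dim 1152.
idx = uniform_sample_indices(2400)                        # which frames to encode
video_tokens = pool_video_features(torch.randn(100, 729, 1152))
print(video_tokens.shape)                                 # torch.Size([829, 1152])
```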
Model training & evaluation
Owlet-phi-2 is trained on 40K samples of video instruction-tuning data. Only the connector layer and the LLM weights were updated during training, while the vision encoder was kept frozen. To train the LLM weights efficiently, we employ LoRA.
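For readers unfamiliar with this setup, the sketch below shows how LoRA adapters are typically attached to a Phi-2 style LLM with the peft library while the base weights stay frozen. The hyperparameters and target modules here are illustrative, not the ones used to train Owlet-phi-2, and the connector and vision-encoder wiring is omitted.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base LLM; the vision encoder (SigLIP) stays frozen and is not shown here.
llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Attach LoRA adapters to the attention projections (hyperparameters are illustrative).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()  # only the LoRA weights (and, in Owlet, the connector) train
```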
Owlet-phi-2 performs competitively across a wide range of metrics, despite being lightweight and trained on less data than its competitors. The following table shows the evaluation results on the video instruction-tuning benchmark released here. The metrics are scored on a scale of 0–5 (higher is better) using GPT-3.5. In particular, our model is better at understanding the context of a video and maintaining consistency in its outputs.
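The benchmark's own evaluation scripts define the exact judging prompts; the sketch below only illustrates the general GPT-as-judge idea behind the 0–5 scores. The prompt wording and the judge_score helper are our own illustration, not the benchmark's code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_score(question, reference, prediction):
    """Ask GPT-3.5 to rate a model answer against the reference on a 0-5 scale."""
    prompt = (
        "Rate the predicted answer against the reference on a scale of 0 to 5, "
        "where 5 means fully correct and consistent. Reply with a single number.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())
```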
How to use the model?
We have uploaded the merged model weights to the Hugging Face Hub (link). They come with a Python inference script (see the model card) to get you started easily. Stay tuned for more updates!