
Owlet: a family of lightweight models for video understanding

Published on
22 January 2021

Conversational agents powered by Large Language Models (LLMs) are providing a new way to interact with visual data. We are excited to announce our new model, Owlet-phi-2, built for conversations about videos. It is the first release in our long-term effort to build Owlet, a family of lightweight and efficient multimodal models. This blog highlights the following key points about our latest release:

  1. Existing image-text LLMs
  2. Owlet architecture: extending to videos
  3. Model training & evaluation
  4. How to use the model?

Existing image-text LLMs

Recently, models like GPT-4V, LLaVA, and MiniGPT-4 have demonstrated impressive multimodal chat capabilities using image-text data. The typical architecture of these models is shown in the figure below.

A typical multimodal LLM architecture. The connector transforms the output of the encoders into tokens, which are then concatenated with the text tokens before being fed to the LLM.

We explore extending these image-text capabilities to video, which can essentially be treated as a sequence of image frames.
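To make the connector idea concrete, here is a minimal PyTorch-style sketch of how encoder outputs could be projected into the LLM's token space and concatenated with text embeddings. The module name (MLPConnector) and the dimensions used are illustrative assumptions, not the exact Owlet implementation.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Illustrative connector: projects vision-encoder features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_visual_tokens, vision_dim)
        return self.proj(vision_feats)  # (batch, num_visual_tokens, llm_dim)

# Example: concatenate projected visual tokens with the embedded text prompt
connector = MLPConnector(vision_dim=1152, llm_dim=2560)   # illustrative sizes
vision_feats = torch.randn(1, 256, 1152)                  # output of the image encoder
text_embeds = torch.randn(1, 32, 2560)                    # embedded text prompt tokens
visual_tokens = connector(vision_feats)
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # sequence fed into the LLM
```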

Owlet architecture: extending to videos

Owlet-phi-2 combines three components: a SigLIP vision encoder, a connector layer, and the phi-2 LLM.

The key feature of the Owlet-phi-2 architecture is its video encoding process. We uniformly sample 100 frames from a video and compute their image encodings using SigLIP. We then apply temporal and spatial pooling to compute a representation of the whole video. This process is illustrated in the diagram below.

The key feature of the Owlet architecture: extending the image encoder to video encoding. The tensor dimensions are shown as the video is processed and fed to the LLM.
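As a rough illustration of this step, the sketch below uniformly samples 100 frame indices and applies average pooling over the temporal and spatial dimensions of per-frame SigLIP features. The exact way the two pooled views are combined, and the feature sizes shown, are simplifying assumptions; only the overall flow follows the description above.

```python
import numpy as np
import torch

def sample_frame_indices(num_video_frames: int, num_samples: int = 100) -> np.ndarray:
    """Uniformly sample frame indices across the whole video."""
    return np.linspace(0, num_video_frames - 1, num_samples).round().astype(int)

def pool_video_features(frame_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (num_frames, num_patches, dim) per-frame encoder outputs.

    Returns a compact sequence of video tokens via temporal and spatial average pooling.
    """
    temporal = frame_feats.mean(dim=0)   # (num_patches, dim): average over frames
    spatial = frame_feats.mean(dim=1)    # (num_frames, dim): average over patches
    # Concatenate both pooled views into one token sequence for the connector / LLM
    return torch.cat([temporal, spatial], dim=0)

indices = sample_frame_indices(num_video_frames=1500)   # e.g. a 1-minute clip at 25 fps
frame_feats = torch.randn(len(indices), 729, 1152)      # illustrative SigLIP outputs
video_tokens = pool_video_features(frame_feats)         # (729 + 100, 1152)
```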

Model training & evaluation

Owlet-phi-2 is trained on 40K samples of video instruction-tuning data. Only the connector layer and the LLM weights were updated during training, while the vision encoder was kept frozen. To train the LLM weights efficiently, we employ LoRA.
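For readers curious how such a parameter-efficient setup can look in code, below is a minimal sketch using the peft library to wrap the LLM with LoRA adapters. The rank, dropout, and target module names are placeholders (the module names assume the transformers Phi implementation), not the settings used for Owlet-phi-2; the vision encoder would be loaded separately and frozen, with only the connector trained in full.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base LLM; the SigLIP vision encoder is loaded separately and frozen with
# requires_grad_(False), while the connector remains fully trainable.
llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

lora_config = LoraConfig(
    r=16,                    # rank and other values are illustrative, not the Owlet settings
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # attention projections in phi-2
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()  # confirms only the LoRA adapters are trainable
```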

Owlet-phi-2 performs competitively across a wide range of metrics, despite being lightweight and trained on less data than its competitors. The following table showcases the evaluation results on the video instruction-tuning benchmark released here. The metrics are scored on a scale of 0–5 (higher is better) using GPT-3.5. In particular, our model is better at understanding the context of a video and maintaining consistency in its outputs.

Results on the video_instruct benchmark. Owlet-phi-2 performs competitively despite being lightweight and trained on fewer samples
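For context on how this kind of GPT-assisted scoring is typically run, here is a hedged sketch of a judge call using the OpenAI Python client. The prompt wording, parsing, and function name are our own illustration, not the benchmark's official evaluation script.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def judge_answer(question: str, reference: str, prediction: str) -> float:
    """Ask GPT-3.5 to rate a predicted answer against the reference on a 0-5 scale."""
    prompt = (
        "Rate the predicted answer against the reference answer on a scale of 0 to 5, "
        "where 5 means fully correct and consistent. Reply with a single number.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())
```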

How to use the model?

We have uploaded the merged model weights to the Hugging Face Hub (link). We also provide a Python inference script (see the model card) to help you get started. Stay tuned for more updates!
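Until you open the full inference script on the model card, here is a minimal, hedged sketch of loading the merged weights with transformers. The repository id below is a placeholder for the linked one, and the video preprocessing and chat template come from the accompanying script rather than this snippet.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/owlet-phi-2"  # placeholder: use the repository linked above

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to("cuda")

# Frames are sampled and encoded as described earlier; see the Python inference
# script on the model card for the exact video preprocessing and prompt format.
```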