
Announcing Our New Video Search Platform

Published on 22 January 2021

The generation and consumption of video content are on the rise, encompassing everything from social media posts to CCTV surveillance footage. Videos carry a wealth of multimodal signals, including visual, temporal, and audio elements, which makes searching through them effectively a challenge. To address this, we introduce a video search platform designed to help you navigate a collection of videos using multimodal queries, whether text- or image-based.

High-level architecture

Similar to traditional search, we have two main processes: indexing and querying. The indexing process interprets the various modalities of video content, extracting different features and storing them in a vector database. Additionally, we extract and store relevant textual information to enhance keyword-based search relevance. At query time, we use AI-based routing to decide which indexes are best suited to serve the given query. If multiple indexes are used, the search results are combined and re-ranked to deliver the most relevant results.
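To make this flow concrete, below is a minimal sketch of the query path in Python. The router heuristic, index names, and result structure are illustrative assumptions only, not the platform's actual implementation, and the vector-database lookup is left as a stub.

```python
# Minimal sketch of the query-time flow described above. All names here are
# illustrative assumptions, not the platform's actual API: route_query stands
# in for the AI-based router, and search_index for a vector-database lookup.
from dataclasses import dataclass


@dataclass
class Hit:
    video_id: str
    start_sec: float
    end_sec: float
    score: float


def route_query(query: str) -> list[str]:
    """Stand-in for the AI-based router that picks the relevant indexes."""
    indexes = ["visual"]                      # default: visual/frame index
    if any(w in query.lower() for w in ("says", "said", "talks about")):
        indexes.append("speech")              # also consult the speech index
    return indexes


def search_index(index_name: str, query: str, k: int) -> list[Hit]:
    """Stand-in for an approximate-nearest-neighbour query on one index."""
    return []                                 # replace with a real ANN lookup


def fused_search(query: str, k: int = 10) -> list[Hit]:
    """Query every routed index, then combine and re-rank the results."""
    hits: list[Hit] = []
    for index_name in route_query(query):
        hits.extend(search_index(index_name, query, k))
    return sorted(hits, key=lambda h: h.score, reverse=True)[:k]
```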

Key features of our platform

  • Multimodal Support: Our platform meticulously extracts rich, detailed information from both audio and visual frames within videos. By indexing this multimodal data, we ensure comprehensive and highly effective search capabilities.
  • Versatile Query Options: Effortlessly search using natural language text queries to describe scenes, activities, or speech topics, or upload an image to find specific faces or objects within videos. Future updates will include support for video and audio queries.
  • Natural Language Interaction: Our platform uses AI to understand the intent behind user queries and directs them to the relevant facets, providing a smooth and intuitive search experience.
  • Enhanced Navigation: Our platform retrieves and highlights the most relevant segments of each video, making it easy to navigate through search results.
  • Extensibility: Our scalable platform architecture allows it to be extended to additional aspects of video data, both during indexing and at query time.

Example use-cases

Examples of the use-cases supported by the platform are listed below. We are constantly adding more functionality to the search.

Activity search

We index the activities happening in a video, allowing the user to search for a particular activity by querying in natural language. A sketch of how such a query could be served is shown below.
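As a rough illustration, the sketch below embeds the query text with an off-the-shelf CLIP encoder and assumes that video frames were embedded with the same model at indexing time. The model choice and the example query are assumptions for illustration, not necessarily what the platform uses.

```python
# Illustrative sketch only: embed a natural-language activity query with a
# CLIP text encoder and match it against frame embeddings stored at indexing
# time.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def embed_text_query(query: str) -> torch.Tensor:
    """Return a unit-norm text embedding for a query such as
    'a person doing push-ups in a gym' (hypothetical example query)."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

# The resulting vector is compared (e.g. by cosine similarity) against the
# indexed frame embeddings to find the videos and segments showing the activity.
```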

Audio search

Several business use-cases require filtering videos based on the speech they contain. In such scenarios, you can simply enter a text query and retrieve the videos in which that text is spoken about. A sketch of how the speech could be indexed for this is shown below.
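The sketch below shows one way the speech could be made searchable: transcribe the audio track with an ASR model and index the timestamped transcript segments. This is an assumption for illustration, not a description of the platform's internals.

```python
# Sketch of speech indexing with an off-the-shelf ASR model (illustrative
# assumption, not the platform's actual pipeline).
import whisper  # pip install openai-whisper

asr_model = whisper.load_model("base")


def index_speech(video_path: str) -> list[dict]:
    """Return timestamped transcript segments ready to be indexed."""
    result = asr_model.transcribe(video_path)
    return [
        {
            "text": segment["text"].strip(),
            "start_sec": segment["start"],
            "end_sec": segment["end"],
        }
        for segment in result["segments"]
    ]

# A text query is then matched against these segments (by keywords and/or
# embeddings), returning the videos and timestamps where the topic is spoken about.
```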

Face search

There are times when you remember the face of a person but have no other details about them. In such cases, you can rely on our face-based search: simply upload an image of the person, and Video Search will retrieve the videos in which that person appears. A sketch of how such an image query could be served is shown below.
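The sketch below illustrates the image-as-query flow, assuming an off-the-shelf face-embedding library and a vector index of faces built at indexing time; both are assumptions for illustration rather than the platform's actual pipeline.

```python
# Sketch of the image-as-query flow for face search (illustrative assumption,
# not the platform's actual pipeline).
import face_recognition  # pip install face_recognition


def embed_query_face(image_path: str):
    """Return a 128-d embedding for the face in the uploaded query image."""
    image = face_recognition.load_image_file(image_path)
    encodings = face_recognition.face_encodings(image)
    if not encodings:
        raise ValueError("No face detected in the query image")
    return encodings[0]

# This embedding is then matched against face embeddings extracted from video
# frames at indexing time, and the videos (with timestamps) containing the
# closest matches are returned.
```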

Object search

You can also retrieve videos that contain a particular object. Next time, you can use this feature to filter your favourite cat videos or hamburger recipes! A sketch of how objects could be indexed for this is shown below.
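The sketch below shows one way objects could be indexed for this kind of search, using an off-the-shelf detector on sampled frames; the detector and sampling rate are illustrative assumptions, not the platform's internals.

```python
# Sketch of object indexing: run an off-the-shelf detector on sampled frames
# and store the detected labels with timestamps (illustrative assumption only).
import cv2                      # pip install opencv-python
from ultralytics import YOLO    # pip install ultralytics

detector = YOLO("yolov8n.pt")


def index_objects(video_path: str, every_n_frames: int = 30) -> list[dict]:
    """Return (label, timestamp) records to store alongside the video."""
    records, cap = [], cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % every_n_frames == 0:
            for result in detector(frame, verbose=False):
                for cls_id in result.boxes.cls.tolist():
                    records.append({
                        "label": result.names[int(cls_id)],
                        "timestamp_sec": frame_idx / fps,
                    })
        frame_idx += 1
    cap.release()
    return records

# A query such as "cat" or "hamburger" is then matched against these labels to
# retrieve the videos (and moments) containing that object.
```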

Video RAG: extending video search

An effective use of video search is to enhance the performance of multimodal question-answering systems. Multimodal LLMs are limited by the number of frames they can process as context, which is problematic for long videos. Sampling a fixed number of frames uniformly may not provide enough information for accurate answers, leading to hallucinations. This issue is particularly challenging for "needle-in-a-haystack" questions over lengthy videos. By using retrieval-augmented generation (powered by our video search platform) with multimodal LLMs on long videos, we observed a qualitative improvement over multimodal QA without RAG. Sample results are shown below, followed by a sketch of the flow. Also, stay tuned for more updates in this area in the coming weeks.

Video: Yellow chickens
Question: What colour are the chickens in the video?
Answer without RAG: In the video, there are no chickens. The scene appears to be a still from a video game or a simulation... [hallucination]
Answer with RAG: The chickens in the video are yellow.

Video: Skeleton revolving
Question: What is the skeleton doing in the video?
Answer without RAG: The skeleton in the video is holding a piece of paper with the text "I want to be alone" and is also holding a piece of paper with the word "ice cream." The skeleton seems to be interacting with a person who is sitting on a couch, and there is a cat nearby... [hallucination]
Answer with RAG: The skeleton in the video appears to be performing a dance or a playful action, as it is shown in a dynamic pose with one arm raised and the other bent at the elbow. The background suggests an indoor setting, possibly a room with a window and a door.
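For concreteness, here is a minimal sketch of the Video RAG flow described above. The three callables it takes are hypothetical placeholders, not part of any published API: a segment-level video search, a frame sampler, and a multimodal LLM call.

```python
# Minimal sketch of retrieval-augmented multimodal QA over a long video.
# The callables are hypothetical placeholders supplied by the caller.
from typing import Callable


def video_rag_answer(
    question: str,
    video_id: str,
    search_segments: Callable[[str, str, int], list[dict]],
    sample_frames: Callable[[str, float, float, int], list],
    ask_multimodal_llm: Callable[[str, list], str],
    top_k: int = 3,
    frames_per_segment: int = 4,
) -> str:
    """Answer a question about a long video using retrieved segments only."""
    # 1. Retrieve the segments most relevant to the question.
    segments = search_segments(video_id, question, top_k)

    # 2. Sample frames only from those segments, instead of uniformly over
    #    the whole video.
    frames = []
    for seg in segments:
        frames.extend(
            sample_frames(video_id, seg["start_sec"], seg["end_sec"],
                          frames_per_segment)
        )

    # 3. Let the multimodal LLM answer with this focused visual context.
    return ask_multimodal_llm(question, frames)
```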

How do I sign up?

Please register your interest in the platform at https://videosearch.phronetic.ai.

We are hiring!

We hope you enjoyed this post. If you are enthusiastic about tackling these kinds of challenges, Phronetic AI is hiring passionate ML scientists and engineers. Please check out our careers and research pages.