Breaking Language Barriers: Introducing Our Advanced Video Dubbing Pipeline for Seamless Multilingual Content
Motivation
Video dubbing replaces the original audio track of a video with new recordings translated into a different language. The technique is widely used in the film, television, and gaming industries to make content accessible to a broader audience, ensuring that language barriers do not impede the distribution of visual media. Towards this end, we release a video dubbing pipeline that dubs the audio of a video into a specified target language while re-mapping the lip movements in the video to match the translated audio.
Model Architecture and Pipeline

Our video dubbing pipeline makes use of the following components:
- ffmpeg for audio extraction.
- Seamless Communication for audio translation.
- Video Re-talking for lip-syncing with the translated audio.
ffmpeg
Using ffmpeg, we extract the audio from the input video and save it as a separate audio file, mapping only the audio stream to the output so that the original audio quality is preserved as far as possible. We empirically chose the .wav format for the extracted audio.
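As a rough sketch, this step can be driven from Python via subprocess, assuming ffmpeg is available on the PATH (the flags shown and the helper name `extract_audio` are illustrative, not the pipeline's exact invocation):

```python
import subprocess

def extract_audio(input_video: str, output_audio: str = "source_audio.wav") -> str:
    """Extract only the audio stream of the input video into a separate .wav file."""
    subprocess.run(
        [
            "ffmpeg",
            "-y",               # overwrite the output file if it already exists
            "-i", input_video,  # input video
            "-vn",              # drop the video stream
            "-map", "0:a:0",    # map only the first audio stream to the output
            output_audio,       # .wav container (PCM audio by default)
        ],
        check=True,
    )
    return output_audio

if __name__ == "__main__":
    extract_audio("input.mp4")
```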
Seamless Communication
We use the SeamlessExpressive module within Seamless Communication, which combines a prosody-aware encoder with a PRETSSEL decoder. It incorporates an expressivity embedding, which allows the translation model to preserve the speaker's expressivity while maintaining high semantic translation performance. The encoder couples a speech encoder with an expressivity encoder and a non-autoregressive (NAR) text-to-unit encoder to generate prosodic units. The decoder applies a textless acoustic model to these prosodic units and concatenates the target-language embedding with the expressivity encoder outputs to produce the audio output in the target language.
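As an illustration of this stage, the sketch below performs speech-to-speech translation with the generic `Translator` interface from the `seamless_communication` package; the model and vocoder cards and the helper name `translate_speech` are assumptions, and the expressive PRETSSEL path in the actual pipeline uses its own inference entry point.

```python
import torch
import torchaudio
from seamless_communication.inference import Translator

def translate_speech(input_audio: str, tgt_lang: str, output_audio: str) -> str:
    """Translate the speech in input_audio into tgt_lang and save it to output_audio."""
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    translator = Translator(
        "seamlessM4T_v2_large",  # assumed model card; the expressive pipeline loads PRETSSEL weights instead
        "vocoder_v2",            # assumed vocoder card
        device=device,
        dtype=torch.float16 if device.type == "cuda" else torch.float32,
    )
    # Speech-to-speech translation (S2ST) into the target language.
    _text_output, speech_output = translator.predict(
        input=input_audio,
        task_str="S2ST",
        tgt_lang=tgt_lang,
    )
    torchaudio.save(
        output_audio,
        speech_output.audio_wavs[0][0].to(torch.float32).cpu(),
        sample_rate=speech_output.sample_rate,
    )
    return output_audio
```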
Video Re-talking
The lip-sync network combines cross-attention, fast Fourier convolutions, and an AdaIN layer to generate the output video. The model masks the lower half of the face in the input frames and applies cross-attention between the masked frames and reference frames from the video. The cross-attention output tokens are then combined with the mel-spectrogram of the input audio through the Fourier convolutions and the AdaIN layer to generate the output video.
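In our pipeline this stage is driven by the VideoReTalking inference script. A minimal sketch of the invocation is below; the checkout directory and the helper name `lip_sync` are assumptions, while the `--face`, `--audio`, and `--outfile` flags follow the project's published usage.

```python
import os
import subprocess

def lip_sync(input_video: str, translated_audio: str, output_video: str,
             retalking_dir: str = "video-retalking") -> str:
    """Re-sync the lips in input_video to the translated audio using VideoReTalking."""
    subprocess.run(
        [
            "python3", "inference.py",
            "--face", os.path.abspath(input_video),        # source video whose lower face is regenerated
            "--audio", os.path.abspath(translated_audio),  # translated speech driving the lip movements
            "--outfile", os.path.abspath(output_video),    # final dubbed video
        ],
        cwd=retalking_dir,  # assumed local checkout of the VideoReTalking repository
        check=True,
    )
    return output_video
```

Chaining `extract_audio`, `translate_speech`, and `lip_sync` in that order reproduces the end-to-end flow described above.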
Our pipeline currently supports 100 input languages, including regional languages like Tamil, Kannada, and Telugu, for single-channel input audio with a single speaker. You can find the list of supported languages here.
Qualitative results
Spanish to English
https://aihub-general.s3.ap-south-1.amazonaws.com/blogs/videos/video-dubbing/spa_to_eng_input.mp4
https://aihub-general.s3.ap-south-1.amazonaws.com/blogs/videos/video-dubbing/spa_to_eng_output.mp4
English to Spanish
https://aihub-general.s3.ap-south-1.amazonaws.com/blogs/videos/video-dubbing/eng_spa_input.mp4
https://aihub-general.s3.ap-south-1.amazonaws.com/blogs/videos/video-dubbing/eng_spa_output.mp4
Hindi to English
https://aihub-general.s3.ap-south-1.amazonaws.com/blogs/videos/video-dubbing/hin_to_eng_input_2.mp4
https://aihub-general.s3.ap-south-1.amazonaws.com/blogs/videos/video-dubbing/hin_to_eng_output_2.mp4
Limitations
Our current pipeline cannot clone the voice in the source video into the translated output. A possible solution would be to apply a form of cross-attention between the embeddings of a voice-cloning network and the expressivity embeddings, so that the translated voice retains the qualities of the source speaker's voice.