A recent study on arXiv.org proposes a novel task: automatic video dubbing. It calls for synthesizing human speech that is temporally synchronized with a given silent video, according to the corresponding text. A multi-modal model named Neural Dubber is proposed to solve the task.
In order to control the duration of the generated speech and synchronize it with the speaker's lip movements, a text-video aligner applies an attention module between the video frames and the phonemes, and upsamples the resulting sequence according to the length ratio between the mel-spectrogram and video frame sequences.
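The mechanism can be illustrated with a minimal numerical sketch. This is an assumption-laden toy, not the paper's implementation: the attention here is plain scaled dot-product attention over random features, and the upsampling simply repeats video-rate frames by the mel/video length ratio (e.g. 25 fps video vs. ~80 mel frames per second gives a ratio of about 3.2).

```python
import numpy as np

def text_video_align(phoneme_feats, video_feats, mel_len):
    """Toy sketch of a text-video aligner (hypothetical, not the paper's code).

    phoneme_feats: (T_p, d) phoneme embeddings
    video_feats:   (T_v, d) video frame features
    mel_len:       target number of mel-spectrogram frames
    Returns an aligned feature sequence of length mel_len.
    """
    # Scaled dot-product attention: each video frame attends to all phonemes.
    d = phoneme_feats.shape[1]
    scores = video_feats @ phoneme_feats.T / np.sqrt(d)   # (T_v, T_p)
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)         # softmax over phonemes
    aligned = weights @ phoneme_feats                     # (T_v, d), video rate

    # Upsample from video rate to mel rate according to the length ratio.
    idx = np.floor(np.arange(mel_len) * len(video_feats) / mel_len).astype(int)
    return aligned[idx]                                   # (mel_len, d)

T_p, T_v, d, mel_len = 12, 25, 8, 80
rng = np.random.default_rng(0)
out = text_video_align(rng.normal(size=(T_p, d)),
                       rng.normal(size=(T_v, d)), mel_len)
print(out.shape)  # (80, 8)
```

Because the output length is tied to the video length, the duration of the synthesized speech is controlled by the video rather than predicted from text alone.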
To simulate realistic conditions of the task, the researchers propose an image-based speaker embedding (ISE) module, which aims to synthesize speech with different timbres conditioned on the speakers' faces in the multi-speaker setting.
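The idea of conditioning timbre on a face image can be sketched as follows. Everything here is an illustrative assumption: the "face encoder" is a stand-in pooling step plus a random projection, where a real system would use a learned network.

```python
import numpy as np

def image_speaker_embedding(face_image, proj):
    """Hypothetical ISE sketch: map a face image to a speaker embedding
    that conditions the TTS model's timbre. `proj` stands in for a
    learned projection; here it is just a random matrix."""
    feats = face_image.mean(axis=(0, 1))        # stand-in for a CNN face encoder
    emb = feats @ proj                          # project to embedding space
    return emb / (np.linalg.norm(emb) + 1e-8)   # unit-norm speaker embedding

rng = np.random.default_rng(1)
face = rng.random((32, 32, 3))                  # dummy 32x32 RGB face crop
proj = rng.normal(size=(3, 8))                  # 8-dim embedding for illustration
emb = image_speaker_embedding(face, proj)
print(emb.shape)  # (8,)
```

In a multi-speaker TTS model, such an embedding would typically be added to or concatenated with the phoneme features so that the decoder produces speech in that speaker's timbre.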
The experimental results show that, in terms of speech quality, Neural Dubber is on par with state-of-the-art text-to-speech models.
Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given silent video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that uses the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and the LRS2 multi-speaker dataset show that Neural Dubber can generate speech audio on par with state-of-the-art TTS models in terms of speech quality. More importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of the synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.
Research paper: Hu, C., Tian, Q., Li, T., Wang, Y., Wang, Y., and Zhao, H., "Neural Dubber: Dubbing for Silent Videos According to Scripts", 2021. Link: https://arxiv.org/abs/2110.08243