VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

Maria J. Danford

Video-to-Speech (VTS) synthesis is the task of reconstructing speech signals from silent video by exploiting their bi-modal correspondences. A recent study published on arXiv.org proposes a novel multi-speaker VTS system, Voice Conversion-based Video-To-Speech (VCVTS).

Video editing. Image credit: TheArkow via Pixabay, free license

Previous methods map cropped lips directly to speech, which leaves the representations learned by the model hard to interpret. The paper instead provides a more legible mapping from lips to speech. First, lips are converted to intermediate phoneme-like acoustic units. Then, the spoken content is accurately restored from those units. The system can also generate high-quality speech with flexible control of the speaker identity.

Quantitative and qualitative results show that state-of-the-art performance can be achieved under both constrained and unconstrained conditions.

Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units. The Lip2Ind network can then substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker encoder to produce speaker representations to effectively control the speaker identity of generated speech. Extensive evaluations verify the effectiveness of the proposed approach, which can be applied in both constrained vocabulary and open vocabulary conditions, achieving state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility and speaker similarity. Our demo page is released here: this https URL
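To make the idea of discrete acoustic units and index sequences more concrete, here is a minimal sketch of the vector quantization step only. It is not the authors' implementation: the function names, the toy two-dimensional codebook, and the nearest-neighbour rule with squared Euclidean distance are illustrative assumptions. In the paper the codebook is learned together with contrastive predictive coding, and the Lip2Ind network would predict the index sequence from silent video rather than from audio frames.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector.

    Hypothetical stand-in for the content encoder's quantization step.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def lookup_units(indices: List[int], codebook: List[Vector]) -> List[Vector]:
    """Recover the acoustic units that an index sequence refers to.

    A lip-to-index model (assumed here) would supply `indices` directly.
    """
    return [codebook[i] for i in indices]


# Toy codebook of three "acoustic units" (made-up values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

# Toy continuous frames, as a content encoder might produce them from audio.
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8]]

indices = quantize(frames, codebook)
units = lookup_units(indices, codebook)
print(indices)
print(units)
```

Because the units are identified by an index into a shared codebook, a video-side model only has to solve a classification problem over codebook entries, and the rest of the voice conversion pipeline can consume the looked-up units unchanged.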

Research paper: Wang, D., Yang, S., Su, D., Liu, X., Yu, D., and Meng, H., “VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion”, 2022. Link: https://arxiv.org/abs/2202.09081

