ParrotTTS: Text-to-Speech Synthesis Exploiting Disentangled Self-Supervised Representations

Neil Shah1,2,*, Saiteja Kosgi1,*, Vishal Tambrahalli1, Neha Sahipjohn1, Anil Nelakanti3, Vineet Gandhi1

1Kohli Centre on Intelligent Systems, CVIT, IIIT Hyderabad

2TCS Research, Pune, India

3Amazon, Bengaluru, India

Accepted to the Findings of the ACL: EACL 2024

More samples for Indian languages (Bhojpuri, Hindi, Kannada, Gujarati) can be found in our GitHub repository.


We present ParrotTTS, a modularized text-to-speech synthesis model that leverages disentangled self-supervised speech representations. It can effectively train a multi-speaker variant using transcripts from a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker’s voice and accent. We present extensive results in monolingual and multilingual scenarios. ParrotTTS outperforms state-of-the-art multilingual TTS models while using only a fraction of the paired data that they require.

Proposed Architecture

(a) Traditional TTS and (b) ParrotTTS
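
As a rough illustration of the modular design described in the abstract, the sketch below separates a text-side module that produces speaker-independent content units from a synthesis module that is the only place where speaker identity enters. Everything in it is an assumption made for illustration: the function names, the codebook size, and the toy arithmetic are hypothetical stand-ins, not the authors' implementation or any real library API.

```python
from typing import List, Sequence

CODEBOOK_SIZE = 100  # assumed number of discrete content units (illustrative only)


def text_to_units(phonemes: Sequence[str]) -> List[int]:
    """Toy text-side module: phonemes -> discrete content units.

    It takes no speaker argument, mirroring the idea that the content
    representation carries no speaker identity.
    """
    return [sum(ord(ch) for ch in p) % CODEBOOK_SIZE for p in phonemes]


def units_to_speech(units: Sequence[int], speaker_id: int) -> List[float]:
    """Toy synthesis module: content units + speaker identity -> fake 'waveform'.

    Speaker identity enters only here; a real system would use a neural vocoder.
    """
    offset = 0.01 * speaker_id  # hypothetical speaker-dependent shift
    return [u / CODEBOOK_SIZE + offset for u in units]


def synthesize(phonemes: Sequence[str], speaker_id: int) -> List[float]:
    """Chain the two stages: text -> units -> speech."""
    return units_to_speech(text_to_units(phonemes), speaker_id)


# The same text rendered in two voices shares one unit sequence.
phonemes = ["HH", "AH", "L", "OW"]
units = text_to_units(phonemes)
wave_a = synthesize(phonemes, speaker_id=1)
wave_b = synthesize(phonemes, speaker_id=2)
```

Because the text-side stage never sees a speaker, it could in principle be trained on one speaker's transcripts and reused for other voices or for text in a different language; this toy code only mirrors that interface, not the training procedure.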

Demo of Single Speaker TTS

Source Text | SS-Tacotron2 (Baseline) | SS-FastSpeech2 (Baseline) | SS-FastSpeech2-SupASR (Baseline) | SS-Tacotron2-UnsupASR (Baseline) | SS-WavThruVec (Baseline) | SS-VQVAE (Baseline) | ParrotTTS: NAR-TTE (Ours-Main) | ParrotTTS: NAR-TTE (Ours-Half-Transcripts) | ParrotTTS: AR-TTE (Ours)
Under the conditions referred to in the previous chapter.
Modern printers understand this, but it is only practiced in the very best establishments.
And therefore approximately one point six seconds after the president was shot in the head.
A formal and thorough description of the responsibilities of the advance agent is now in preparation by the service.
And investigative capabilities of the agencies now operating in this field but will continue.
The department hopes to design a practical system which will fully meet the needs of the protective research section of the secret service.

Demo of Multi-Speaker TTS

Source Text | GT-Mel+Vocoder | MS-FastSpeech2 (Baseline) | MS-FastSpeech2-SupASR (Baseline) | VC-FastSpeech2 (Baseline) | MS-WavThruVec (Baseline) | ParrotTTS: NAR-TTE (Ours-Main)
Until recently she worked as a recruitment consultant in London.
No one in downing street can speak the language of the unions.
The key player in this matter is now the prime minister.
I tried to be cautious but its hard in that role.
The advisors had some discussions but no conclusion was reached.

Comparing samples for seen speakers

Source Text | Language | Ground Truth | FastSpeech2 | Meta-TTS | ParrotTTS-CTE-Model (Ours) | ParrotTTS-PTE-Model (Ours)
The uncle claimed her. The husband resisted. | English
When Oswald was arrested in the Texas Theatre, he was wearing a brown sport shirt with a hole in the right sleeve at the elbow. | English
शिष्य सदा गुरु के पदचिह्नों पर चलता है, लेकिन उसका उद्देश्य केवल ज्ञान प्राप्त करना ही होता है। | Hindi
पाँचवीं पंचवर्षीय योजना में इसके कार्यक्षेत्र को और विस्तृत किया गया। | Hindi
--En efecto--continuó el mentiroso--, y si aquel hombre eminente defendió con tanto calor la paz con los republicanos, | Spanish
Entonces me dijeron que habiendo salido otra balandra a reconocer los restos del Rayo, y los de un navío francés que corrió igual suerte, me encontraron junto a Marcial, y pudieron salvarme la vida. | Spanish
Je leur ai donc probablement fait manger non pas du bœuf, mais du cheval. | French
Un bon impôt, c’est un impôt stable, identifié, connu. | French

Comparing samples for unseen speakers

Source Text | Language | Ground Truth | FastSpeech2 | Meta-TTS | ParrotTTS-CTE-Model (Ours) | ParrotTTS-PTE-Model (Ours)
मेजर ध्यानचंद हे भारतीय हॉकीचे खेळाडू आणि संघनायक होते. | Marathi
भालचंद्र नेमाडे यांची हिंदू ही कादंबरी अपेक्षेपेक्षा फारच कमी खपली. हिंदूच्या एकूण चार कादंबऱ्यांची मालिका निघणार. | Marathi
Aber nicht zu viel, stotterte die kleine Prezel, sonst bleibt ja garnichts mehr von mir übrig, und ich muß doch meine große Prezel suchen! | German
Wenn du jetzt nicht dein goldnes Kleid anhättest, müßtest du erfrieren; | German
la música infinita del espacio. | Spanish
Yo sí. Será porque estoy muy nerviosa. | Spanish

Comparing samples for cross-lingual speech synthesis

Source Text | Text Language | Speaker Language | Reference Voice | FastSpeech2 | Meta-TTS | ParrotTTS (Ours)
Capítulo Tercero de Bailen. | Spanish | Hindi
पाँचवीं पंचवर्षीय योजना में इसके कार्यक्षेत्र को और विस्तृत किया गया। | Hindi | French
पीक कर्ज वितरण कमी असलेल्या बँकांना जिल्हा प्रशासनाने कारणे दाखवा नोटीस बजवाव्यात. | Marathi | German
Je leur ai donc probablement fait manger non pas du bœuf, mais du cheval. | German | French
The middle yard, as far as its limits would permit, was appropriated to felons and transports. The wards here were generally very crowded. | English | Marathi
पाँचवीं पंचवर्षीय योजना में इसके कार्यक्षेत्र को और विस्तृत किया गया। | Hindi | English
पीक कर्ज वितरण कमी असलेल्या बँकांना जिल्हा प्रशासनाने कारणे दाखवा नोटीस बजवाव्यात. | Marathi | Spanish
Je leur ai donc probablement fait manger non pas du bœuf, mais du cheval. | French | Spanish