Hacker News Clone

Translatotron: An End-to-End Speech-to-Speech Translation Model

by bkudria on 5/17/2019, 1:56 AM with 39 comments

by Nr7 on 5/17/2019, 10:35 AM
Cool stuff. I wonder how long it will take until speech synthesis is at the point where it can be used to create dialogue for video games. Obviously it would (at least initially) replace just the less important stuff spoken by background characters or less important NPCs, but even that could be a big improvement.
Just imagine how much more immersive a game like Skyrim would be if the writers could just write the lines and then run them through a speech synthesizer to get finished dialogue. No need to hire multiple actors and get them to a studio to record their lines. It would be so much easier and faster to create a massive amount of unique dialogue, and you wouldn't have to listen to the same "arrow to the knee" line spoken by the same couple of actors over and over again everywhere you go.
It would improve user-created mods as well, since just about anyone could then create new characters with completely custom voices and dialogue.
by bowmessage on 5/17/2019, 6:00 AM
Is it just me or are the final results (Translatotron translation) not playable, while the initial audio samples work fine?
by kozak on 5/17/2019, 8:04 AM
A perfect illustration of how Chromium's monopoly is killing the open web. The sample audio doesn't play in Firefox, because who cares about anything other than Chrome, right?
by akie on 5/17/2019, 9:45 AM
I'm just going to say that this is flat-out amazing. Really super impressive.
by ttctciyf on 5/17/2019, 10:27 AM
Am I reading this correctly: there's no explicit semantic representation at any stage, and it's purely audio input frequencies -> ML -> audio output frequencies? If so, that's ... so much for Jerry Fodor and his Language of Thought Hypothesis, eh? (Yeah, mostly joking, but still...)
by lostmsu on 5/23/2019, 2:15 AM
Is the model available yet?
by londons_explore on 5/17/2019, 8:54 AM
There seem to be lots of questionable translations in their source data - missing emphasis, wrong words, and sometimes stuttering or mistaken vocalisations.
If they could fix that, I think results could be much better.