Amazon Polly – Text to Speech in 47 Voices and 24 Languages

  • > "I can't embed sound clips in this post"

    You can read this as: We're pleased to announce our really cool TTS feature that took a lot of engineering know-how and effort ... but you'll have to click through, because we can't seem to get around the limitations of our CMS to embed audio content in a blog post.

  • I was trying to figure out an affordable way to send "Read Later" articles as audio to a mobile device, either as a podcast or in another format, to keep myself relevant while driving to/from the office.

    I realized this tool might not be cheap, since it could take a voice actor or actress 2 hours per day to produce my content (I commute by car about 2 hours per day). For a familiar local accent, it costs ~$36 in Australia, and maybe slightly less for a US accent. The value it brings me can hardly justify the cost.

    Now, with Polly, things have changed: it produces a reasonable voice, and 2 hours of content would cost only ~$0.3. I decided to launch my service as soon as Instapaper approves my API request.

    In the meantime, you can leave your email here: http://readlater.launchrock.co/
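
A rough back-of-the-envelope check of the cost figure in the comment above, as a minimal sketch. Every number here is an assumption rather than something stated in the thread: a speaking rate of 150 words per minute, about 6 characters per word (including the trailing space), and a price of $4 per 1 million characters. The function name `estimate_cost_usd` is hypothetical.

```python
def estimate_cost_usd(minutes, words_per_minute=150, chars_per_word=6,
                      usd_per_million_chars=4.0):
    """Estimate the synthesis cost for `minutes` of speech.

    All defaults are assumptions for illustration; check the current
    pricing page and measure your own text before relying on them.
    """
    # Characters needed to fill the requested listening time.
    chars = minutes * words_per_minute * chars_per_word
    # Price is quoted per million characters.
    return chars * usd_per_million_chars / 1_000_000


if __name__ == "__main__":
    # Two hours of commute audio per day.
    print(f"{estimate_cost_usd(120):.2f}")
```

Under these assumptions the estimate comes out to well under a dollar per day, the same order of magnitude as the ~$0.3 quoted above; a slower speaking rate or shorter words would bring it closer.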

  • One thing I wish more services like this offered is non-speech sounds. CereVoice, for example, lets you insert laughs, coughs, sighs, etc., and it can really enhance the output in some cases. Google's WaveNet also manages to simulate the catching of one's breath during particularly long utterances, although I realize it uses a completely different technique (neural net vs. concatenative synthesis).

    My biggest problem with CereVoice, though, has been its terrible web API. It doesn't support streaming output, so it renders the audio to an Amazon S3 bucket and then returns a URL, which is pretty inconvenient (and slow). You have to do the same for transcripts, too. So, if you want everything, you have to make 3 separate HTTP requests and parse 2 XML documents for one round of synthesis.

    IBM Watson's TTS API gets it right, imo. Its streaming mode returns audio frames and transcripts over a WebSocket connection.

  • After that WaveNet speech synthesis demo, none of these sound even remotely good.

    https://deepmind.com/blog/wavenet-generative-model-raw-audio...

  • Wish we had seen this a few days ago before dropping funds on a human to record for us. I played with several different voices and ran it on a text corpus that we gave to the human, and in some cases I would say this even sounds better.

    Computer-generated voices feel most robotic when their intonation of a word is abnormal or their pauses between words make the sentence feel choppy. The intonation and natural pauses between words are very good for all of the main voices.

    The Japanese voice Mizuki was the most comical addition, since I can't think of a real situation where she would ever actually be used. Mizuki speaks Engarish (the Japanese version of English) beautifully, but any Japanese person who can understand Engarish will also understand English. Also, Mizuki doesn't add the correct vowel ending to all words, e.g. she correctly says "cheezu" for "cheese", but says "steku" instead of "steki" for "steak".

  • Audio examples are available here: https://aws.amazon.com/polly/

  • I tried a few random sentences and some article paragraphs with both Vitória and Ricardo (Brazilian Portuguese voices), and Ricardo did pretty well. I was impressed, really. Vitória, on the other hand, was not much better (as in "fluent", with rhythm and the right intonation) than the other female voices available out there for pt_BR.

    EDIT: oh, I had no idea they used Ivona

  • Did they use Ivona for this? Probably. Ivona Amy is awesome!

  • This sounds better than OSX text-to-speech for audiobook purposes, but the 1500 character limit per API call is annoying. Instead of sending the ebook text in full, I have to split by paragraph and (occasionally) sentence and then stitch everything back together with manually inserted pauses, making the audio a bit uneven.
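
The splitting step described in the comment above could look roughly like the following sketch. It is not the commenter's code: the function name `split_text` is hypothetical, the 1500-character limit is taken from the comment, and the sentence detection is a naive regex that will stumble on abbreviations. It keeps whole paragraphs where they fit, falls back to packing sentences, and hard-splits only when a single sentence is still too long.

```python
import re

MAX_CHARS = 1500  # per-request limit mentioned in the comment above


def split_text(text, limit=MAX_CHARS):
    """Split text into chunks of at most `limit` characters.

    Prefers paragraph boundaries, then sentence boundaries, and only
    cuts mid-sentence when one sentence alone exceeds the limit.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= limit:
            chunks.append(paragraph)
            continue
        # Paragraph is too long: pack sentences greedily into chunks.
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            # A single oversized sentence gets cut at the limit.
            while len(sentence) > limit:
                if current:
                    chunks.append(current)
                    current = ""
                chunks.append(sentence[:limit])
                sentence = sentence[limit:]
            candidate = f"{current} {sentence}" if current else sentence
            if len(candidate) <= limit:
                current = candidate
            else:
                chunks.append(current)
                current = sentence
        if current:
            chunks.append(current)
    return chunks
```

Each chunk would then go out as its own synthesis request. Packing several sentences into one chunk, rather than sending one sentence per request, reduces the number of seams where pauses have to be inserted by hand.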

  • I wonder why they pulled the Ivona text2speech android app. I'm still happily using it. It's quite comfortable to listen to articles in pocket while on the train (and to keep listening while changing trains).

    edit: ah, it was just the German version that is not available any more; the English one still seems to be in the store: https://play.google.com/store/apps/details?id=com.ivona.tts

  • Does anyone know if NPR recently used this (or something similar) on a story? I recall listening to a story in my car (hardly paying attention), and the person narrating it sounded unusual. I thought it was probably computer generated, but I couldn't tell for sure. I guess the general public had better get ready to tell the difference?

  • Just me, or do the foreign voices sound much more realistic than the English-speaking ones do? It could be that I am not a native speaker of Icelandic, French, etc., so perhaps to a native speaker they still seem robotic and sterile, but to me the inflections and cadence sound much more natural in the non-English synthetic speech.

  • Surprised that Icelandic is being offered. As a native Icelandic speaker, to me it sounds about as good as can be expected from such a service. Isochrony (had to search the dictionary for that one) is a bit off, but that is expected given the context of the phrases / words used to create the samples.

  • I wonder if they support speech marks like Ivona, for allowing synchronization of text with audio, useful for text highlighting.
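
To illustrate what the comment above means by synchronizing text with audio, here is a minimal sketch. It assumes a hypothetical speech-mark format of one JSON object per line, with a `time` offset in milliseconds and `start`/`end` character offsets into the source text; the thread does not confirm the actual format of either service, and the names `parse_marks` and `mark_at` are made up for this example.

```python
import bisect
import json

# Hypothetical speech-mark output: one JSON object per line.
SAMPLE = """{"time": 0, "start": 0, "end": 5, "value": "Hello"}
{"time": 420, "start": 6, "end": 11, "value": "world"}"""


def parse_marks(raw):
    """Parse newline-delimited JSON speech marks into a list of dicts."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def mark_at(marks, t_ms):
    """Return the mark being spoken at playback time t_ms, or None.

    Assumes `marks` is sorted by ascending "time".
    """
    times = [m["time"] for m in marks]
    # Index of the last mark whose start time is <= t_ms.
    i = bisect.bisect_right(times, t_ms) - 1
    return marks[i] if i >= 0 else None
```

A player would call `mark_at` with the current playback position and highlight `text[mark["start"]:mark["end"]]` in the displayed article.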

  • The link to the Polly service console is broken. Polly doesn't even show up in the list of services on my AWS account.

  • Hope someone can post an audio clip for those of us lacking an account, curious to see how this sounds!

  • I need the inverse of this. Is anyone selling that?

  • Does AWS offer a speech recognition service too?

  • Seems like a great opportunity for a tool to aid in learning the pronunciation of foreign languages.

  • There's already a well known .NET library called Polly that's under the purview of the .NET Foundation. See http://www.thepollyproject.org/

    This could get a bit confusing for some folks.