Firefox Voice

  • Had to read the privacy policy to see they use Google.

    >We share your audio recording with Google Cloud’s speech-to-text service to assist us in processing and carrying out your commands. Audio recordings are shared without personally identifiable metadata, and we’ve instructed Google’s service not to retain the audio or transcript associated with a command after it processes the command

  • I've been working on this exact thing for Chrome the last 3 years: https://www.lipsurf.com Anyone can make an open source plugin for it to do anything with voice (https://github.com/LipSurf/plugins)

    I've wanted to port it to Firefox, but the HTML5 SpeechRecognition API (https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...) is still not available. Why not just make the API available and leave this in addon territory for all developers?

  • You should try Rhasspy. It's open source, and it respects your privacy by running offline. It's fully customizable (each service can be replaced by another), and all the services are containerized for easy installation, with builds available for several architectures such as ARM (e.g. on a Raspberry Pi). There is even an option to use Mozilla's DeepSpeech speech-to-text service.

    https://rhasspy.readthedocs.io/en/latest/

  • I'm sorry, but cloud based speech recognition in itself would already be a red flag, even if Mozilla was doing it in-house. Outsourcing it to Google though? I feel like a company as ostensibly privacy-focused as Mozilla should really know better by now...

  • "We’ve instructed the Google Speech-to-Text engine to NOT save any recordings."

    Hahaha! :D Thanks for the good laugh.

  • I guess Mozilla's own speech to text (https://github.com/mozilla/DeepSpeech) isn't good enough, so they have to use Google's?

  • Mozilla needs to come more around to Apple's way of thinking. These things need to be done locally on the device, not farmed out to some cloud. Use the cloud (CDN) to deploy the software, but run the software locally.

  • Alright, gave it a shot. First impressions:

    * "Make me laugh" always brings me to the same YouTube video.

    * Had pretty much no issues with the default prompts. It was able to find some challenging Spotify playlists and open random websites (including ones with non-standard English domain names, when I spelled them out).

    * "Read this page" uses an awful TTS engine, which is a shame considering that I might actually use this feature on a somewhat regular basis. I'm assuming it uses whatever it detects at the OS level, and so far I haven't bothered with finding a better one (I'm on Ubuntu; if you know of one, please suggest).

    * "Set a timer for X min" works just fine, which is probably the only thing I use Google's assistant for on my phone (or whatever its name might be now).

    * I like the idea of routines in the app settings, which are supposed to tie multiple queries together. I could see myself using it for something like a morning routine (tell me what time it is, give me weather info, read me the news, etc.)

  • Google-worries aside, judging from the preview it's pretty slow. I'm not a super-fast typist, but these delays sure look like something that would discourage me from actually using it. Maybe it's not even that it's slow, just that the delays are made super-obvious somehow by all these disruptive animations and such.

  • Speech to text in the cloud is a hard no from me. Especially if it’s Google’s speech to text.

  • I don't understand the utility of it. Yes, I can see how this might be considered cool and hip, but... which of my problems as a user does it solve, exactly?

  • So do I yell out my password to log in to websites?

  • Voice browsing on Windows is exactly what I don't need. I'd have a lot of use for being able to search the internet by voice and have the browser read an article to me while my phone is mounted to my dashboard. Without having to configure my whole phone for visual impairment, that is.

  • Every time I use a voice interface I regret it. The only time it works is when I ask Google a single short sixth-grade-level question when nobody else is in the room talking, and I'm not otherwise occupied by anything that would prevent me from just using my phone. I get that there are some people who can't use their thumbs, and I pity those people, because voice interfaces are the most frustrating things on this planet.

    Why isn't Firefox implementing PWA features like the Share Target API instead of shaving this yak?

  • I actually tried this with Siri while cooking yesterday. It's not there yet, but I asked "Hey, Siri ... read me the synopsis of the movie Adam's Rib" and Siri proceeded to read a short synopsis of that movie. It worked on another, but made me choose one of 7. It failed on the third try: I asked about another movie, it gave me selections, and when I picked one ("read me the first one") it just repeated the title instead of reading the synopsis.

  • I don't use voice assistants so I don't know if these are common, but some of the examples in the list of commands[0] are interesting.

    >Ask about a webpage - Display or open information to the current page or website.

    >Example - What are people saying about this page? (Opens Reddit comments for a specific webpage or article) - What did this page used to look like? (Shows page history in archive.org)

    ---

    >Giving commands nicknames (experimental) - Create names or shortcuts for actions.

    > Example - Say "open new york times", then "Give that the name news" - news (will open nytimes.com)

    [0] https://mozilla-extensions.github.io/firefox-voice/lexicon.h...

  • Recent HN thread, "Thoughts on Voice Interfaces" [1], about a blog post by one of the Firefox Voice engineers.

    [1] https://news.ycombinator.com/item?id=24040539

  • > Audio from your voice request is sent to Mozilla’s Voicefill server without any personally identifiable metadata.

    > Voicefill sends the audio to Google’s Speech-to-Text engine, which returns transcribed text. We’ve instructed the Google Speech-to-Text engine to NOT save any recordings. Note: In the future, we expect to enable Mozilla’s own technology for Speech-to-Text which enables us to stop using Google’s Speech-to-Text engine.

    I was kind of hoping their homegrown speech-to-text engine had become good enough for production use. Disappointing to see that they still have to rely on Google.

  • Is there a preference to have text to speech via another service provider?

  • Google charges 2.4 cents per minute for STT so there's no way Mozilla could afford to offer this service if it actually got popular. I mean, that obviously won't be an issue, but still.
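    For a rough sense of scale (the usage figures below are pure assumptions for illustration, not Mozilla numbers), the per-minute price compounds quickly:

    ```python
    # Back-of-envelope cost of cloud STT at Google's $0.024/minute rate.
    # All usage figures are illustrative assumptions, not real data.
    PRICE_PER_MINUTE = 0.024   # USD per audio-minute

    users = 1_000_000          # hypothetical daily active users
    queries_per_user = 5       # hypothetical voice queries per user per day
    seconds_per_query = 5      # hypothetical average utterance length

    minutes_per_day = users * queries_per_user * seconds_per_query / 60
    daily_cost = minutes_per_day * PRICE_PER_MINUTE

    print(f"{minutes_per_day:,.0f} audio-minutes/day "
          f"-> ${daily_cost:,.0f}/day, ${daily_cost * 365:,.0f}/year")
    ```

    Under those (made-up) numbers it comes out to about $10k a day, which is why popularity would be a problem.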

  • I'm surprised by how well the STT works for me as a non-native English speaker. I hope to see more products that use Common Voice.

  • > When you make a request using Firefox Voice, the browser captures the audio and uses cloud-based services to transcribe and then process the request.

    Is it that hard to do local processing, either due to computational power or storage requirements? Or is it just more convenient for them to do it this way?

    Edit: this comment in another subthread kind of answered the question: https://news.ycombinator.com/item?id=24098950

    If I'm drawing the right conclusion, it's a bit of both: hundreds of megabytes of storage is fine for most people but not everyone, and while I probably wouldn't listen to the latest and greatest artists (and binary diffs are a thing, small additions aren't that large), it is convenient for devs to just push it to a server and be done rather than pushing model updates to everyone all the time.

    Edit2: https://news.ycombinator.com/item?id=24096836 Wait, what?! The data is all sent to Google? I was thinking of using this for their sake (opting into using my data for common voice) but this is an instant deal breaker.

  • The default keyboard shortcut wasn't working; it was opening a different extension instead. I went to the voice extension's settings and thought it was bad UX that you have to type in the case-sensitive keyboard shortcut names instead of just pressing the keys to record them.

  • No one here has mentioned this, but I believe speech recognition will not take off until it understands whispered speech.

    Vocal cord strain makes current solutions unsustainable for continuous use.

  • The Google Recorder app on Pixel phones (and I'm pretty sure general Android release) does super accurate on-device transcription, for what it's worth.

  • Interesting concept but I don't think it is practical in an open office environment when you sit next to your colleagues and speak to your browser.

  • I am a Mozilla supporter so I am happy to support this.

  • Opera had this a decade ago. RIP

  • I see a lot of skeptical voices here (somewhat warranted, given that it's voice assistant technology), but the fact remains that if we want open, on-device voice recognition, we'll have to do the work and donate sample data.

    This extension is trying to provide some useful functionality in the hopes that Mozilla gets more data for https://commonvoice.mozilla.org

    I'd encourage you to at least consider recording your voice, especially if you're a non-native English speaker like myself, have an accent, etc.

    It took many years for free software to start to take on the smartphone segment, with previous efforts (including Mozilla's) failing, and only now the PinePhone & Librem 5 giving it another go; but unless you're a super hardcore enthusiast, you carry an iPhone/Android today.

    I see this as a way to push back on the likes of Amazon, Google and Apple. If regular Firefox users are able to use an on-device, privacy-respecting voice assistant, and other open-source projects can use Mozilla's tools and datasets to build compelling competitors to Alexa, I'd see that as proof that free software is able to address new, emerging markets too.

  • I've looked into open source voice assistants before. I found Mycroft, Jarvis and a few others, but either got bogged down in dependencies or in configuration. Many supported shipping your data to Google or Amazon, or could be configured to use an open-source voice recognition tool.

    I hate this idea that our voice has to be shipped somewhere to be processed. I remember a lot of the speech-to-text tools in the early 2000s weren't all that great (they needed a lot of training), but why haven't we been able to advance on-device processing? Why is everything done in "the cloud"?

    So the only way to do semi-accurate voice recognition is to use algorithms re-trained on data from millions of people? We have processors in our desktops and laptops with compute power to spare. We should be aiming for Star Trek TNG-level voice processing, on each individual device, without some central mainframe.

    But marketing, advertising revenue, data mining, free (as in beer) software that pumps your data like an oil rig, efficiency in data centre (cloud) design... all these factors have led these powerful little Intel/ARM/Ryzen chips to be nothing more than thin clients when they're not playing games.

    If Mozilla really wanted to make something amazing and in the spirit of Firefox, they'd give us an experiment where voice processing is done on our devices. Even if it meant I needed to download a 230GB data set, I'd gladly do it if it could remotely help in getting away from these data silos.

  • Just another useless gimmick that you try once, then never bother to use again. Imagine talking to your browser at work.

  • I still haven't understood how any of this is an improvement over a shell.

  • Who would use such a gimmick? I certainly wouldn't...

  • What's wrong with DeepSpeech?

  • So let me get this straight. They break global keyboard shortcuts, which people could use to play/pause media on different websites three years ago:

    https://bugzilla.mozilla.org/show_bug.cgi?id=1411795

    and instead of fixing that, they introduce this shit? Fuck Mozilla. I'm already using Waterfox, and it looks like that won't be changing any time soon.