
I'm not so sure Siri / Google Assistant don't already do this, if only to serve us ads.


I talked to an Amazon Echo engineer about how the sound recording works. They said there's just enough hardware on the device to understand "hello Alexa", and everything else is piped to the cloud.
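
In rough pseudocode, the gating they describe looks something like this (a minimal sketch; mic, cloud, and wake_detector are hypothetical stand-ins for the real firmware components):

    import collections

    def audio_loop(mic, cloud, wake_detector):
        # mic yields ~20 ms PCM frames; wake_detector is the tiny on-device
        # model. Both are hypothetical stand-ins, not Amazon's actual API.
        ring = collections.deque(maxlen=25)       # ~0.5 s of pre-roll audio
        for frame in mic:
            ring.append(frame)
            if wake_detector(frame):              # the only on-device "understanding"
                cloud.stream(list(ring))          # ship the pre-roll too
                for frame in mic:                 # everything after this is cloud-side
                    cloud.stream([frame])
                    if cloud.end_of_utterance():
                        break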

Currently, ML models are too resource-intensive ($$) for always-on recording.
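
As a rough illustration of the "$$" (all numbers below are assumptions for the sake of arithmetic, not actual vendor pricing):

    users         = 100e6    # always-on devices, assumed
    hours_per_day = 12       # mic-active hours per device, assumed
    price_per_min = 0.024    # $/min, ballpark cloud ASR list price, assumed

    daily_cost = users * hours_per_day * 60 * price_per_min
    print(f"${daily_cost:,.0f} per day")   # ~ $1.7 billion per day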


That would also be crazy expensive and hard to do well. They struggle with current speech recognition, which is relatively simple by comparison, and couldn't do this far more complex always-listening task at high accuracy, identifying relevant topics worth serving an ad on, even if they wanted to and it weren't illegal. People always said this about Alexa and Facebook too. The reality is that people see patterns where there aren't any, or forget they searched for something they also talked about, and that search is what actually drove the specific ad they saw.


A high-end phone is quite capable of doing automatic speech recognition continuously, as well as NLP topic analysis. In recent years, voice activity detection has moved down into the microphone itself to enable ultra-low-power always-listening functionality; it then triggers further processing of the potentially-speech-containing audio. Modern SoCs have dedicated microcontroller/microprocessor cores that can do further audio analysis without involving the main cores or the OS, typically deciding whether something is speech or not. Today this usually means keyword spotting ("hey Alexa" etc.). These cores are expected to get access to neural accelerators, which will further improve power efficiency and eventually provide sufficient memory and compute to run speech recognition. So the technological barriers are falling one by one.
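
For a sense of scale, the keyword-spotting stage is small enough to sketch directly: a tiny feed-forward net scoring ~0.5 s windows of log-mel features. The weights below are random placeholders; a trained model of this shape is on the order of 100 KB:

    import numpy as np

    N_MELS, N_FRAMES, HIDDEN = 40, 49, 64    # ~0.5 s of log-mel features
    W1 = np.random.randn(N_MELS * N_FRAMES, HIDDEN) * 0.01   # placeholder weights
    W2 = np.random.randn(HIDDEN, 2) * 0.01                   # [background, keyword]

    def kws_score(window: np.ndarray) -> float:
        """Return P(keyword) for one 40x49 log-mel window."""
        h = np.maximum(window.reshape(-1) @ W1, 0.0)   # ReLU
        logits = h @ W2
        p = np.exp(logits - logits.max())
        return float(p[1] / p.sum())

    # A DSP core runs this every ~20 ms and only wakes the main CPU when a
    # smoothed score crosses a threshold.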


I worked on Alexa (and Cortana before that). I’m aware of the current tech. The tech barriers for doing this at high accuracy cheaply are still very much there.


Which barriers do you consider the most problematic, and why? I think that using today's CPU/GPU on a modern phone, one could run a model with a word error rate under 10%. It would take up maybe 1 GB of RAM, and battery life would be impacted, but the phone would still be quite usable. That seems like it would give workable-quality data for some topic / user modelling. I'm assuming one would limit triggers to cases where the speaker is within, say, 2 meters of the phone, which seems like an acceptable limitation for a phone (unlike for a home device like Alexa).
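
For one concrete data point (my example, not something named in the thread): OpenAI's open-source Whisper "tiny" model has ~39M parameters and reports single-digit WER on clean English benchmarks, though far-field, noisy audio would fare considerably worse:

    import whisper   # pip install openai-whisper

    model = whisper.load_model("tiny")                 # ~150 MB of weights in fp32
    result = model.transcribe("overheard_clip.wav")    # hypothetical audio file
    print(result["text"])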


If Siri or Google were doing this, it would have been whistleblown by someone by now.

As far as I understand, Siri works with a very simple "hey siri" detector that then fires up a more advanced system that verifies "is this the phone owner asking the question?" before even trying to answer.
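
That cascade is cheap precisely because the expensive stage almost never runs. A minimal sketch, where speaker_embedding is a hypothetical stand-in for whatever voiceprint model the phone actually uses:

    import numpy as np

    def verify_owner(audio, enrolled_emb, speaker_embedding, threshold=0.7):
        # Stage 2: only runs on the few seconds stage 1 ("hey siri") flags.
        emb = speaker_embedding(audio)   # hypothetical voiceprint model
        cos = emb @ enrolled_emb / (np.linalg.norm(emb) * np.linalg.norm(enrolled_emb))
        return cos > threshold           # below threshold: not the owner, ignore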

I'm confident privacy-sensitive engineers would notice and flag any misuse.


> I'm not so sure Siri / Google Assistant don't already do this, if only to serve us ads.

If it did, traffic analysis would probably have revealed it.
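
A crude version of that experiment is easy to run yourself (assuming the device's traffic is routed through a machine you control, e.g. one acting as the Wi-Fi access point):

    import time
    import psutil

    def uplink_bytes(seconds):
        start = psutil.net_io_counters().bytes_sent
        time.sleep(seconds)
        return psutil.net_io_counters().bytes_sent - start

    quiet   = uplink_bytes(60)   # stay silent for a minute
    talking = uplink_bytes(60)   # then talk continuously for a minute
    print(f"silent: {quiet} B, talking: {talking} B")
    # A persistent, speech-correlated uplink would show up as a large gap.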


They're not. A breach of trust at that level would kill the product instantly.


Call me jaded, but I don't believe that anymore. They might lose 20% of users. Maybe that's enough to kill the product, but I honestly believe most people would just roll with it.



