April 30, 2018
Sounds are fundamentally different from voice and music
Speech recognition and wake words are limited by the type of sounds that the human mouth can produce, as well as conditioned by the communicative structure of human language, which can both be exhaustively mapped.
Similarly, music mostly results from physical resonance, and is conditioned by the rules of various musical genres.
So whilst the human ear and brain are very good at interpreting sounds in spite of acoustic variations, computers were originally designed to process repeatable tasks. Thus, teaching a machine how to recognise speech and music greatly benefits from such pre-defined rules and prior knowledge.
Sounds, on the other hand, can be much more diverse, unbounded and unstructured than speech and music.
Think about a window being smashed, and all the different ways glass shards can hit the floor randomly, without any particular intent or style. Or think about the difference between a long baby cry and a short dog bark, or the relative loudness of a naturally spoken conversation versus an explosive glass crash.
Now you understand why sound recognition required us to develop a special kind of expertise: collecting sound data ourselves and tackling real-world sound recognition problems made us pioneering experts in understanding the full extent of sound variability.