January 18, 2022
Our Alexandria dataset now contains over half-a-billion metadata points
At the heart of our Alexandria™ data platform is the Alexandria dataset, the world’s largest, commercially-exploitable dataset for audio ML tasks. It now features 40 million labelled recordings across 1,200 sound classes and includes over half-a-billion metadata points.
As you’ll frequently read on our research blog, the quality, diversity and range of our data are integral to our ability to give consumer products a greater sense of hearing. It represents a wide range of hardware, environments, sounds (both target and non-target), and acoustic scenes that can be used to train and evaluate models to hit accuracy expectations and work robustly in all real-world environments when deployed on consumer products.
Importantly, we know everything about the data that we use to develop our models. This full data provenance means that we have all of the permissions required to use the data commercially. We also capture essential information about the recorded subject, distances, microphone sensitivity, hardware, location, etc. That forensic approach to data gives us the most incredibly rich metadata, which plays a key role as models progress through our ML pipeline. It also protects our customers from the technical, legal and reputational risks associated with scraping data from unreliable sources like YouTube, etc.
My colleague Cosmin Frăteanu recently discussed the importance of augmentation, which relies heavily on data quality and metadata. In his blog, he said: “This is yet another reason why you can’t just use any audio dataset found on the internet. Without the metadata, it is simply impossible to simulate in a scientific manner an accurate acoustic scenario with specific hardware and software effects.”
In addition, machine learning is an iterative process. When we evaluate model performance, the metadata provides information on where the model performs well and needs further training. So, for example, if our engineers conclude that a ‘car horn’ model is underperforming at a certain distance in a particular location, they know where to focus their efforts.
In ancient Egypt, the Great Library of Alexandria was established between 283-246 BCE to house every text ever written. Later, the Greek philosopher Galen preserved vital metadata about each text, including the title, author, origin, length, etc. The Greek scholar Callimachus built on this, cataloguing the texts by genre.
These pioneers understood the importance of breadth but also the depth of information. When we gave our dataset its name, it was important to capture the same lofty ambitions for audio.