November 21, 2019
The Cortex-M0+ challenge: Overcoming technical barriers
There is a drive to embed AI on devices at the edge of the network – and sound recognition is no exception to this. Qualcomm, Google, Arm and others refer to this movement as ‘tinyML’.
It offers product developers maximum freedom, can reduce cloud infrastructure costs and, by making AI cloudless, lets tech companies meet consumer privacy expectations.
While delivering such ‘compactness’ is a significant engineering challenge, it can also open up a wide range of opportunities, particularly at a time when the hardware and software space is becoming more and more competitive.
As you can read in my other blog post, we’ve managed to embed our ai3™ software on an Arm Cortex-M0+ chip and it is a really exciting leap forwards for microML (as we call it). It also positions sound recognition as an exciting and realistic prospect for smartphones, hearables and many other compact or constrained products.
However, pushing the boundaries of embedded ML was not without its challenges.
Arm Cortex-A Series and Arm Cortex-M Series
We work with many products that use powerful application processors based on the Cortex-A series from Arm. To deliver more compact solutions for our customers who want to roll sound recognition out across a wide range of products, we had to explore possibilities at the microprocessor level.
Chips in the Arm Cortex-M Series have much lower silicon area, much higher energy efficiency and are often used as microcontrollers in closed embedded products.
They are more focused on the lower end of the performance scale, where there isn’t much need for customization and configuration. Despite this, they’re still quite a powerful family of processors.
In particular, the M4 is designed for signal processing tasks. As ai3™ is effectively a signal processing application, our software performs as expected on the M4 chip with minimal customization. In fact, last year we demonstrated a battery-powered M4 solution using a chip from Ambiq and a piezoelectric microphone from Vesper.
But we wanted to push the boundaries further for embedded ML – and deploy ai3™ on a Cortex-M0+ based chip. This is the smallest 32-bit Arm microprocessor available for embedded applications. On a smartphone, you may find an M4 chip running a wake word detection module, while meatier tasks run on the main applications processor. Sound recognition running on the smaller M0+ means similar tasks can potentially be performed on smartphones with a wider variety of sounds.
The M4 and M0+ features are compared in more detail in Table 1 below. Essentially, the key differences lie in the capabilities of the instruction set, overall processing power, floating-point support and cost.
M0+ vs M4 – key differences
To fit ai3™ on an M0+, there were elements related to the hardware that we had to work around. The figure below shows the support available in the M0+ and M4 microprocessors.
M0+ instruction set architecture vs M4 instruction set architecture
The M0+ uses the Armv6-M architecture, which has a smaller instruction set and offers less hardware support. This means mathematical calculations are more labour intensive on this chip.
The richer Cortex-M4 instruction set means there’s hardware support for a number of the key operations needed to make the machine learning algorithms run efficiently.
These instructions are not available on the M0+, so the compiler injects replacement software routines, which take more time to compute.
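To make the cost concrete, integer division is one commonly cited case: Armv6-M has no hardware divide instruction, whereas the Cortex-M4 does, so on the M0+ a division is typically compiled into a call to a runtime helper (for example `__aeabi_uidiv` in the Arm run-time ABI). The C sketch below only illustrates the kind of work such a helper has to do, using textbook shift-and-subtract division. `soft_udiv32` is a hypothetical name; it is not the routine any real toolchain supplies, and it is not taken from ai3™.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative shift-and-subtract (restoring) unsigned division.
 * Hypothetical helper: real runtime libraries use tuned assembly,
 * but the loop shows why a software divide costs many cycles. */
uint32_t soft_udiv32(uint32_t num, uint32_t den, uint32_t *rem_out)
{
    uint32_t quot = 0;
    uint32_t rem = 0;
    int bit;

    assert(den != 0); /* division by zero is the caller's bug */

    /* Walk the numerator from its most significant bit down. */
    for (bit = 31; bit >= 0; --bit) {
        /* Bring the next numerator bit into the running remainder. */
        rem = (rem << 1) | ((num >> bit) & 1u);
        if (rem >= den) {
            rem -= den;              /* the divisor fits: subtract it */
            quot |= (1u << bit);     /* and set this quotient bit */
        }
    }

    if (rem_out != NULL) {
        *rem_out = rem;
    }
    return quot;
}
```

A single 32-bit division here takes 32 loop iterations, each with a shift, a compare and sometimes a subtract, where a core with a hardware divider issues one instruction.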
M0+ floating-point vs M4 floating-point
The M0+ has no support for floating-point calculations in hardware, whereas this is at least an option in the M4.
Where floating-point isn’t available, we can use our fixed-point implementation. With fixed-point calculations, tasks that could’ve been carried out by the hardware are actually programmed into the software. Issues like scaling, rounding, underflow and overflow need to be taken into consideration.
All of this tends to lead to fixed-point versions needing more instructions, and so more MIPS, to complete the calculations. Essentially, the software has to do more work for the same task as there’s no hardware support.
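As a minimal sketch of what handling scaling, rounding, underflow and overflow looks like in practice, here is a generic fixed-point example in C. It assumes the common Q15 format (a 16-bit integer representing a value in the range -1 to just under 1); the format and the names `q15_t`, `q15_saturate`, `q15_add` and `q15_mul` are illustrative assumptions, not a description of how ai3™ is implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Q15: a 16-bit signed integer whose real value is raw / 32768. */
typedef int16_t q15_t;

/* Right-shifting a negative value is implementation-defined in C.
 * This sketch assumes an arithmetic shift, as common Arm compilers do. */
static_assert((-1 >> 1) == -1, "arithmetic right shift assumed");

/* Overflow handling: clamp to the representable range instead of wrapping. */
q15_t q15_saturate(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (q15_t)x;
}

/* Addition: widen to 32 bits so the sum cannot wrap, then saturate. */
q15_t q15_add(q15_t a, q15_t b)
{
    return q15_saturate((int32_t)a + (int32_t)b);
}

/* Multiplication: Q15 x Q15 gives a Q30 product, which must be
 * rounded and scaled back down to Q15 by hand. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b; /* Q30 */
    product += 1 << 14;                        /* round to nearest, ties upward */
    return q15_saturate(product >> 15);        /* rescale Q30 to Q15 */
}
```

With these definitions, `q15_mul(16384, 16384)` (0.5 × 0.5) gives 8192 (0.25), `q15_add(30000, 10000)` saturates at 32767 rather than wrapping negative, and `q15_mul(1, 1)` underflows to 0. Every one of those checks is an extra instruction that a hardware floating-point unit would have made unnecessary.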
M0+ RAM vs M4 RAM
As M0+ chips are intended for products with very limited processing needs, they have very limited RAM and their clock speeds are typically lower than those of Cortex-M4 microcontrollers.
To be able to develop and debug effectively, we actually needed an M0+ part with larger amounts of RAM.
The architecture of ai3™ is designed to be flexible, but it’s helpful to have more space while developing. So we selected a Cortex-M0+ based NXP evaluation board, as it’s capable of running the software, with a view to getting it to work on much smaller devices.
Selecting the NXP evaluation board to work on
The NXP evaluation board offers the specifications we needed, including:
- I2S to collect PCM samples with limited code;
- The Cortex-M0+ CPU running at ‘non-turbo’ 72 MHz;
- Onboard integrated LED and buttons; and
- 128 kB on-board flash, 96 kB on-board RAM.
The flash and RAM are still too small for your ‘typical’ ai3™ build, so we made changes to the embedded software to reduce the memory requirements. For example, the clip recording and our debugging tool were disabled as these wouldn’t work on such low-powered devices.
Our design and architecture philosophy around ai3™ allows for flexibility and scalability, and this is why we’ve been able to deploy the software on the M0+. It effectively gives us the ability to meet very specific hardware targets without compromising on performance.
This laser-focus on compactness can also be found throughout our ML pipeline. This is because we’ve always understood and focused on the end-user application for our technology and optimised the model accordingly. And because we only focus on sound recognition, our models are extremely efficient, which makes our job of embedding them onto constrained customer hardware less of a headache.
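A back-of-the-envelope budget shows how quickly raw audio eats into 96 kB of RAM. The C sketch below assumes 16 kHz, 16-bit mono PCM and a 256-sample capture frame; those figures, and the names `pcm_bytes_for_ms` and `ram_percent`, are illustrative assumptions rather than actual ai3™ settings. Only the RAM size comes from the board specification above.

```c
#include <assert.h>
#include <stdint.h>

#define RAM_BYTES        (96u * 1024u) /* from the board specification */
#define SAMPLE_RATE_HZ   16000u        /* assumption: 16 kHz audio */
#define BYTES_PER_SAMPLE 2u            /* assumption: 16-bit mono PCM */
#define FRAME_SAMPLES    256u          /* assumption: samples per capture frame */

/* Two capture frames (one filling, one being processed) should stay a
 * small slice of RAM; checked at compile time. */
static_assert(2u * FRAME_SAMPLES * BYTES_PER_SAMPLE <= RAM_BYTES / 16u,
              "capture buffers should use at most 1/16 of RAM");

/* Bytes needed to store `ms` milliseconds of raw PCM audio. */
uint32_t pcm_bytes_for_ms(uint32_t ms)
{
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * ms / 1000u;
}

/* Share of total RAM, as a percentage rounded to the nearest whole number. */
uint32_t ram_percent(uint32_t bytes)
{
    return (bytes * 100u + RAM_BYTES / 2u) / RAM_BYTES;
}
```

Under these assumed figures, `pcm_bytes_for_ms(1000)` is 32,000 bytes, so a one-second recorded clip alone would take roughly a third of the RAM (`ram_percent` gives 33), and a three-second clip would take about 98%. That is the kind of arithmetic that makes optional features like clip recording hard to justify on a part this small.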
Now that sound recognition has entered its second generation – where you can expect to see it on a wide range of products recognising many more sounds and acoustic scenes – this helps give product developers ultimate flexibility.
Dominic Binks is the VP of Technology at Audio Analytic, based in Cambridge, UK.