The “MSGEQ7-killer” I promised was in-progress before I took my hiatus. I’ve now worked it into something barely functional (40 updates per sec with nil niceities), but the ATTiny85 proved slow and unstable. It didn’t even have hardware multiplication. Working toward publishing wasn’t worth it, but I’ll leave the code here in case it helps someone.

I’ve dedicated a lot of time to music visualization in Arduino (see my best work, ESP32-oled-spectrum). However, I’ve never elaborated on my thoughts and findings before, and it just ended up hidden in that project. Now, I figured, some people could probably benefit from it. Yes, an FFT can be run on a microcontroller, and you won’t need the MSGEQ7 or a PC if you do this right. It’s a bit informal (apologies), but here are three tips to get that running fast and at high resolution.

1. Carefully pick your FFT library

The most popular FFT library in the Arduino Library manager right now, ArduinoFFT, is most likely not the library you want to use. It has a very nice API, but it has a glaring problem: it uses the double data type. Almost all microcontrollers have no acceleration for the double type (the Teensy 4.0 is one exception), and this means double arithmetic is handled in software, slowly. As a result, picking a different library is generally multiple times faster, regardless of the microcontroller.

So which libraries should you use? Here are two libraries that I’ve tried with success and would recommend:

  • fix_fft
    • There seem to be two versions of this on the internet, both for integers (fixed-point):
      • a 16-bit version (see second entry) published in 2006, and
      • an 8-bit version on Github that is available through the Arduino library manager.
    • The real FFT seems to be broken on both.
    • fix_fft implements the in-place form of the FFT. In-place algorithms can build their output inside the input’s memory.
    • Both versions have an ambiguous license. It’s impossible to legally integrate these into an open-source project without permission.
  • kissfft
    • It is configurable for 32-bit floats and 16-bit integers (32-bit integers seemed broken for me).
    • There is a working real FFT. A real FFT only works on real data (audio is), but it can run twice as fast!
    • kissfft implements the out-of-place form of the FFT. This means that you need to allocate extra memory for the FFT output.
    • Permissively MIT-licensed.

Considering kissfft, 32-bit floats offer the fastest and most accurate approach if you have the memory and an FPU. However, it is generally acceptable to work on 16-bit integers instead. Considering fix_fft, it’s one of the only 8-bit FFT libraries out there. You could use the 8-bit fix_fft to trade away even more accuracy for memory, but I myself would rather not be mired by licensing problems.

All said, definitely feel free to consider other libraries. For example, there are some incredible assembly routines out there that squeeze the most FFTs out of the atmega (NOT the attiny) architecture. The one by elm-chan is even permissively CC-BY-3.0 licensed, though I think that’s not GPL compatible.

Addendum: Using kissfft

kissfft was originally written with PCs in mind. However, it’s simple enough to use in embedded projects. Navigate to the kissfft repository, and pick out the following files: kiss_fft.c, kiss_fft.h, and _kiss_fft_guts.h. If you need the real FFT, also pick out the following files: kiss_fftr.c and kiss_fftr.h. Once you move these files into the source directory, you can include kiss_fft.h (or kiss_fftr.h) and call kissfft’s functions like any other library. If you need the 16-bit integer version, you will need to compile with the option -D FIXED_POINT=16. Most embedded developers should be familiar with this.

However, Arduino doesn’t nicely allow this, and it’s not enough to just #define this before the include; you must create a file called platform.local.txt where platform.txt is located. This varies by the device you’re compiling for, but good places to start looking are:

  • C:\Program Files (x86)\Arduino\hardware
  • C:\Users\%USERPROFILE%\AppData\Local\Arduino15\packages

For example, I’ve found platform.txt for my specific ESP32 compiler at C:\Users\%USERPROFILE%\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4. The contents of platform.local.txt should be:

compiler.c.extra_flags= -D FIXED_POINT=16
compiler.cpp.extra_flags= -D FIXED_POINT=16

2. Reuse old samples

One of my iterations ended up spending a majority of its time doing nothing. Why? It was waiting to collect enough samples to do an FFT on. Because it was spending so much time waiting, I was able to double its performance by just letting it reuse old samples. Were there consequences? Of course: old data now took time to disappear from the visualization. However, this was actually desirable because information tended to appear then disappear too quickly to be appreciated anyway. That is why other projects tend to apply exponential time-smoothing to the output, and reusing old data just happens to get part of the way there.

In ESP32-oled-spectrum (MIT-licensed, by the way), I wrote a custom circular buffer that exposed it's write index. I'll admit that it's not a complete implementation.
template <typename TYPE> class circularBuffer{
    int buffer_size;
    volatile TYPE *buffer;
    volatile int read_index = 0;
    volatile int write_index = 0;
    volatile bool available = true;
    void insert(TYPE val){
      buffer[write_index] = val;
      write_index %= buffer_size;
    TYPE pop(){
      TYPE val = buffer[read_index];
      read_index %= buffer_size;
      return val;
    bool empty(){
      return read_index == write_index;
    circularBuffer(int size){
      buffer_size = size;
      buffer = (TYPE*)calloc(size, sizeof(TYPE));
circularBuffer<int> analogBuffer(SAMPLES)
But, anyway, this allowed me to dump *all past samples* in chronological order.
int32_t sum = 0;
int16_t in[SAMPLES];
// Reads ENTIRE circular analog buffer starting from analogBuffer.write_index, this
//  means a lot of overlapping data, but it decouples FFT calculations from sampling
analogBuffer.available = false;   // Closes off buffer to prevent corruption
// Reads entire circular buffer, starting from analogBuffer_index
for(int i = 0; i < SAMPLES; i++){
    int val = analogBuffer.buffer[(i+analogBuffer.write_index)%SAMPLES];
    in[i] = 16*(val - 2048); // TODO: possible clipping when subtracting the average?
    sum += in[i];
analogBuffer.available = true;    // Restores access to buffer
And to fill it, I sampled then pushed the sample to the buffer on a timer interrupt. On rare occasions where interrupt goes off during the dumping stage, I had a contingency buffer big enough to hold onto the samples. The bigger the buffer, the more unlikely a sample gets lost!
/* Sampling interrupt */
hw_timer_t *timer = NULL;
void IRAM_ATTR onTimer(){
    // Basic function is to insert new values for circular analogBuffer. However, 
    // if the interrupt is triggered as analogBuffer is being read, the new value 
    // is stashed in contigBuffer and inserted later.

    int val = analogRead(INPUT_PIN);

    if(val > 4000) digitalWrite(CLIP_PIN, HIGH);
    else digitalWrite(CLIP_PIN, LOW);

    if(!analogBuffer.available) contigBuffer.insert(val);
        while(!contigBuffer.empty()) analogBuffer.insert(contigBuffer.pop());

3. Scale both frequency and magnitude logarithmically

This is how our ears perceive sound, so visualizations should scale the output of an FFT the same way. The magnitude is simple enough: we just need to convert all of the magnitudes to decibels. The reference level doesn’t really matter because it’s just the scaling we’re after. Frequency, on the other hand, is tougher.

The output of an FFT is basically a histogram, revealing how much of the signal was in each “bin” of frequencies. However, all of these bins are of equal width–that’s not logarithmic. Generally, the solution is to combine these bins into the columns of your spectrum visualizer. However, an unfortunate truth is that every new column with needs exponentially more bins than the last. For an exponential growth factor of 2, you would need 1 bin in the first range, 2 more bins in the second, 4 more in the third, going up to 64 more bins in the seventh. In total, you would need 256 bins for just eight columns! However, if you let the logarithmic scaling get a little ugly, then you have the freedom to lower the growth factor to something more reasonable. Empirically, I have had decent results with a growth factor that combined 1024 bins into 32 columns, though I forget what it was.

That said, I have since switched to a more complex way to combine bins in ESP32-oled-spectrum: convolution kernels that turn an FFT into a Constant Q transform. Generally, it’s more bin-inefficient, but it produces smoother motion. I leave the internal details to my other repo cq_kernel.


So these are the insights that made ESP32-oled-spectrum, one of the fastest and highest-resolution music visualizers out there. I hope it helps anyone else who is interested in these kinds of things but refuses to settle for less like I did. I might also consider demonstrating all these tips by making an MSGEQ7-killer, so click here if I ever end up doing that second part.