STM32 & OpenCM3 5: Debugging & Fault Handlers

Companion code for this post available on Github

This is the sixth post in a series on the STM32 series of MCUs. The previous post, on memory sections and ITCM, can be found here.

When building embedded software, it can sometimes be challenging to determine the root cause of a failure. You have no operating system to fall back on, and if you are coming from a non-embedded background, you may not know about some of the low-level behaviours that can be leveraged to make finding certain bugs a little easier. In this post, I’ll go over in detail the steps to connect an interactive debugger to your embedded system, as well as how you can make interrupt problems easier to diagnose.

Connecting using GDB

If you have worked on applications software in either C or C++ for some time, there’s a good chance you will at least have heard of GDB. As one of the more venerable debuggers, it supports an incredible number of targets, including (to my knowledge) the entire ARM Cortex ecosystem.

There are a plethora of ARM programming and debug adapters on the market, such as:

For the purposes of this post, we will be using the cheap and powerful Black Magic Probe (BMP). This probe comes with a standard 10-pin SWD header, and a 4-pin UART connector. Both of these have level-shifted output between 1.7 and 5V, and the probe can also supply 3.3V power to targets if necessary. Even better, the probe runs an implementation of the GDB server protocol, meaning that it can be connected to GDB with no additional software dependencies, unlike some other tools.

To get started with the Black Magic Probe, you’ll first need to connect it to your device under test. For this you can use the included 1.27mm pitch SWD connector, or another standard cable such as the popular TagConnect TC-2050 cable, which eliminates the need for costly headers.

Two example connections of the probe, one with the included SWD cable, and one with a TagConnect cable, are shown here:

Example connections of Black Magic Probe

Once your cable is connected, and the device is powered up, it’s time to fire up GDB. If you don’t already have it, you can install it and the rest of the GNU ARM toolchain on Debian based systems with the following command:

sudo apt install binutils-arm-none-eabi gcc-arm-none-eabi gdb-arm-none-eabi

For other systems, you can install GDB by following the instructions here.

Now that we have GDB ready, we need to load up the program we’re debugging and connect to our probe. To do that, we first start GDB with the elf as our first argument:

$ arm-none-eabi-gdb my_program.elf
GNU gdb (7.12-6+9+b2)
Copyright (C) 2016 Free Software Foundation, Inc.
# Connect to the black magic probe as a target. There are two serial ports
# exposed by the device - the first is the GDB server, the second is a
# passthrough to the UART pins on the device. The exact path to the serial
# devices will vary by OS.
(gdb) target extended-remote /dev/ttyACM0
# Now that we have told GDB to use the BMP as a target, we can invoke
# some extra commands. The first will be to scan for serial wire debug targets,
# which should return the MCU that we wish to connect to.
(gdb) monitor swdp scan
# Now that we've scanned and identified the device we want to connect to, we
# can start debugging it by attaching. In this case we want the first (and only)
# target. Invoking the attach command will automatically halt the device and
# show the current stack frame.
(gdb) attach 1

If you find yourself typing in these commands a lot, I’d recommend putting them all in a file called .gdbinit in your working directory. This will cause GDB to automatically run them on startup, saving you some time if you are repeatedly closing and reopening GDB to connect to the same system. Note that you may need to set the configuration value

set auto-load safe-path /

in the .gdbinit in your home directory to allow loading of arbitrary .gdbinit scripts. To be more secure, use a more specific path than /, otherwise malicious source trees could cause you to run arbitrary GDB commands.

Now that we are connected to the device, we can interact with it almost as though it were a program running on our local machine, but with a few extra commands. I’ll list a couple things I use frequently here, but do consult the GDB manual both for more details about these commands and for other useful invocations. All commands shown below can be shortened to the character in [].

# Print an expression. Expression can be a variable, function, macro, etc.
> [p]rint myvar / MY_DEBUG_MACRO / *((uint32_t*)0xDEADBEEF)
# Examine one or more memory addresses ('x' is short for examine).
> x my_array
> x/10ub 0xDEADBEEF # Print 10 unsigned bytes starting at 0xDEADBEEF
> x/s 0xDEADBEEF # Print the C-string starting at 0xDEADBEEF
# Breakpoints
> [b]reak main # Create breakpoint on entry of method main()
> b my_code.cpp:123 # Create breakpoint at line 123 of my_code.cpp
> [d]elete 2 # Delete breakpoint number 2
# Memory watchpoints
> watch my_var # Trigger breakpoint if my_var changes
# Trigger breakpoint if the integer value located at 0xDEADBEEF changes
> watch *(int *) 0xDEADBEEF
# Control flow
> [n]ext # Advance execution to next line of code
> nexti (or ni) # Advance execution to next assembly instruction
> [c]ontinue # Run until breakpoint or user interrupt (ctrl-c)
> [r]un # Start program over from beginning
# Information
> info locals # Print all local variables and their values
> info registers # Print the contents of all CPU registers
> info breakpoints # Print all the currently active breakpoints
# Flashing
> file my_program.elf # Select the active binary
> load # Flash the microcontroller with the active file
> compare-sections # Verify the microcontroller code matches the active file
# Dump memory range 0x0 to 0xFFFF to file out.bin as raw binary data
> dump binary memory out.bin 0x0 0xFFFF

As an added tip, if your program is compiled using a makefile, any make commands entered inside of GDB will be run normally, allowing you to rebuild and re-flash your microcontroller without ever leaving your GDB session.

Fault Handlers

While the GDB commands above should handle a lot of debugging needs, there will still be some cases (generally due to interrupts) where the control flow of the CPU is hard to follow. In cases like these, implementation of the ARM hard fault interrupt, as well as providing default implementations for all user interrupts, can be very useful.

If your code enables an interrupt but doesn’t implement the associated handler, or causes a processor fault by attempting to execute an invalid instruction or access invalid memory, the ARM core will jump to the Hard Fault Handler, which is an interrupt common to the entire Cortex-M family. Implementing this interrupt handler, and using it to provide error feedback, can save many hours of second guessing your code.

Generally, your HAL library (such as CMSIS, or in this example series, libopencm3) will provide a weakly linked implementation of these system interrupt handlers. In order to provide your own, all you need to do is implement it, which will override the weakly linked one in the HAL.

When you do implement the hardfault handler, the first thing you will want to do is determine which stack pointer was in use when the program crashed (this is mainly relevant for applications making use of an RTOS or other system that takes advantage of multiple hardware stacks). When an exception or interrupt handler is entered, the processor updates the link register with a special value, EXC_RETURN. The full details of this behaviour can be found in the ARMv7-M reference manual section B1.5.8, but for our purposes the salient bit is bit 2, which determines whether the return stack is the main stack (bit clear) or a process stack (bit set). By testing the link register against the pattern (1 << 2), we can determine which stack pointer was in use when the exception occurred, and pass the appropriate one through to our generalized exception handler.
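To make the bit test concrete before diving into assembly, here is a small host-side sketch of the same check. The helper name frame_on_process_stack is my own invention for illustration, and the three EXC_RETURN values are the common ones for cores without floating-point state on the stack.

```cpp
#include <cassert>
#include <cstdint>

// Bit 2 of EXC_RETURN selects which stack holds the saved exception frame.
constexpr uint32_t kExcReturnStackBit = 1u << 2;

// Hypothetical helper: true if the frame was pushed onto the process stack
// (PSP), false if it was pushed onto the main stack (MSP).
constexpr bool frame_on_process_stack(uint32_t exc_return) {
  return (exc_return & kExcReturnStackBit) != 0;
}

// Common EXC_RETURN values when no floating-point state is stacked.
static_assert(!frame_on_process_stack(0xFFFFFFF1u), "handler mode, MSP");
static_assert(!frame_on_process_stack(0xFFFFFFF9u), "thread mode, MSP");
static_assert(frame_on_process_stack(0xFFFFFFFDu), "thread mode, PSP");
```

The assembly bridge performs exactly this test, just without the luxury of a C++ compiler being allowed to touch the stack first.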

I’ve seen a couple hardfault handler implementations online that require a CPU supporting conditional execution, which isn’t present on the Cortex-M0 series processors. Here’s a fault handler that should be generic enough to work across the Cortex-M family, including ARMv6-M parts such as the Cortex-M0, at the cost of a few extra instructions:

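// Naked: no compiler-generated prologue may touch the stack before we read it.
// If this file is compiled as C++, base_fault_handler needs extern "C" linkage
// so that the symbol name used below resolves.
__attribute__((naked)) void hard_fault_handler(void) {
  __asm volatile(
      "MRS r0, MSP\n" // Default to the Main Stack Pointer
      "MOV r1, lr\n"  // Load the current link register value (EXC_RETURN)
      "MOVS r2, #4\n" // Load constant 4, i.e. (1 << 2)
      "TST r1, r2\n"  // Test whether the main or process stack was in use
      "BEQ base_fault_handler\n" // Bit 2 clear: MSP is correct.
      "MRS r0, PSP\n" // Bit 2 set: the process stack was in use, load PSP
      "B base_fault_handler"); // Jump to the fault handler.
}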

With this bridge method in place, we can write the meat of our fault handler code. We take as input the stack pointer address that was determined by our assembly bridge, and gather some pertinent information about the crash into local variables for inspection.

// Core ARM interrupt names. These interrupts are the same across the family.
static const char *system_interrupt_names[16] = {
    "SP_Main",      "Reset",    "NMI",        "Hard Fault",
    "MemManage",    "BusFault", "UsageFault", "Reserved",
    "Reserved",     "Reserved", "Reserved",   "SVCall",
    "DebugMonitor", "Reserved", "PendSV",     "SysTick"};

void base_fault_handler(uint32_t stack[]) {
  // The implementation of these fault handler printf methods will depend on
  // how you have set your microcontroller up for debugging - they can either
  // be semihosting instructions, write data to ITM stimulus ports if you
  // are using a CPU that supports TRACESWO, or maybe write to a dedicated
  // debug UART
  fault_handler_printf("Fault encountered!\n");
  static char buf[64];
  // Get the fault cause. Volatile to prevent compiler elision.
  // The active exception number (VECTACTIVE) is the low 9 bits of ICSR.
  const volatile uint16_t active_interrupt = arm::scb::ICSR & 0x1FF;
  // Interrupt numbers below 16 are core system interrupts, we know their names
  if (active_interrupt < 16) {
    sprintf_(buf, "Cause: %s (%u)\n", system_interrupt_names[active_interrupt],
             active_interrupt);
  } else {
    // External (user) interrupt. Must be looked up in the datasheet specific
    // to this processor / microcontroller.
    sprintf_(buf, "Unimplemented user interrupt %u\n", active_interrupt - 16);
  }
  fault_handler_printf(buf);

  fault_handler_printf("Saved register state:\n");
  dump_registers(stack);
  __asm volatile("BKPT #01");
  while (1) {
  }
}

If you have GDB attached to your microcontroller when this handler is hit, execution will stop at the breakpoint triggered by __asm volatile("BKPT #01"), and you can get a summary of what went wrong by asking GDB for info locals, as well as investigate the additional information printed out over our serial console:

GDB info locals command output

In addition to the variables in the above method, we call a dump_registers method to interpret and print the values of the calling stack frame that were saved by the CPU before it jumped to the exception handler. The list of registers, and the order in which they appear, is listed in section B1.5.6 of the ARM reference manual. We can use this info to generate some more debug output, like so:

enum { r0, r1, r2, r3, r12, lr, pc, psr };
void dump_registers(uint32_t stack[]) {
  static char msg[32];
  sprintf_(msg, "r0  = 0x%08x\n", stack[r0]);
  fault_handler_printf(msg);
  sprintf_(msg, "r1  = 0x%08x\n", stack[r1]);
  fault_handler_printf(msg);
  sprintf_(msg, "r2  = 0x%08x\n", stack[r2]);
  fault_handler_printf(msg);
  sprintf_(msg, "r3  = 0x%08x\n", stack[r3]);
  fault_handler_printf(msg);
  sprintf_(msg, "r12 = 0x%08x\n", stack[r12]);
  fault_handler_printf(msg);
  sprintf_(msg, "lr  = 0x%08x\n", stack[lr]);
  fault_handler_printf(msg);
  sprintf_(msg, "pc  = 0x%08x\n", stack[pc]);
  fault_handler_printf(msg);
  sprintf_(msg, "psr = 0x%08x\n", stack[psr]);
  fault_handler_printf(msg);
}
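
If you would rather work with named fields than enum indices, the same eight words can be copied into a struct. This is only a sketch: StackedFrame and read_frame are names I made up for illustration, and it assumes the basic eight-word frame with no floating-point state pushed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical struct mirroring the eight words the CPU pushes on exception
// entry, in the same order as the enum used by dump_registers.
struct StackedFrame {
  uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
};
static_assert(sizeof(StackedFrame) == 8 * sizeof(uint32_t),
              "frame must be exactly eight words with no padding");
static_assert(offsetof(StackedFrame, pc) == 6 * sizeof(uint32_t),
              "pc is the seventh stacked word");

// Copy the raw stack words into a frame. memcpy avoids aliasing problems that
// a pointer cast could introduce.
inline StackedFrame read_frame(const uint32_t stack[]) {
  StackedFrame frame;
  std::memcpy(&frame, stack, sizeof frame);
  return frame;
}
```

The stacked pc is usually the most interesting field, since it points at (or just after) the instruction that caused the fault and can be fed straight to GDB.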

This works well as a generic fault handler, but there are some cases where we may want to also include some additional information. For example, if a memory fault occurs, there are several potential causes that we can flag up, as well as the address at which the fault occurred. So for handling memory faults, we could add a handler function such as the following:

void mem_manage_handler(void) {
  // Pull the MMFSR data out into variables for easy inspection
  // Variables are volatile to prevent compiler elision
  const volatile bool mmfar_valid =
      arm::scb::CFSR & arm::scb::CFSR_MMFSR_MMARVALID;
  const volatile bool fp_lazy_error =
      arm::scb::CFSR & arm::scb::CFSR_MMFSR_MLSPERR;
  const volatile bool exception_entry_error =
      arm::scb::CFSR & arm::scb::CFSR_MMFSR_MSTKERR;
  const volatile bool exception_exit_error =
      arm::scb::CFSR & arm::scb::CFSR_MMFSR_MUNSTKERR;
  const volatile bool data_access_error =
      arm::scb::CFSR & arm::scb::CFSR_MMFSR_DACCVIOL;
  const volatile bool instruction_access_error =
      arm::scb::CFSR & arm::scb::CFSR_MMFSR_IACCVIOL;

  // Pull the MMFAR address
  const volatile uint32_t mmfar_address = arm::scb::MMFAR;

  // Trigger a breakpoint
  __asm volatile("BKPT #01");
}

If you have an MPU and trigger an access violation, such as a write to a read-only region or an instruction fetch from an execute-never region, you will trigger this method and, like before, can get the problem at a glance with info locals:

Memfault handler locals after null dereference

In the output above, we can see that the MemManage Fault Address Register (MMFAR) has been loaded with the address of the error, that the MMFAR address is at 0x0, and that the access type was a data access. In other words, a null pointer dereference!
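If all you have is a raw CFSR value from a log rather than a live GDB session, the same interpretation can be done on your development machine. The following is a minimal sketch, with describe_mmfsr being a hypothetical helper of my own; the bit positions are the same ones used by the arm::scb constants in this post.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: list the MemManage status bits set in a CFSR value.
std::vector<std::string> describe_mmfsr(uint32_t cfsr) {
  struct Bit {
    uint32_t mask;
    const char *text;
  };
  static const Bit kBits[] = {
      {1u << 0, "IACCVIOL: instruction access violation"},
      {1u << 1, "DACCVIOL: data access violation"},
      {1u << 3, "MUNSTKERR: fault on exception return"},
      {1u << 4, "MSTKERR: fault on exception entry"},
      {1u << 5, "MLSPERR: fault during FP lazy state preservation"},
      {1u << 7, "MMARVALID: MMFAR holds the faulting address"},
  };
  std::vector<std::string> out;
  for (const Bit &bit : kBits) {
    if (cfsr & bit.mask) {
      out.push_back(bit.text);
    }
  }
  return out;
}
```

A CFSR value of 0x82 has MMARVALID and DACCVIOL set, matching the flags described above, so describe_mmfsr(0x82) returns those two entries.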

If you want to have several exception vectors map to the same handler, a useful trick is to alias those methods using an __attribute__ directive like so:

void bus_fault_handler(void) __attribute__((alias("hard_fault_handler")));
void usage_fault_handler(void) __attribute__((alias("hard_fault_handler")));

Keeping things clean

In order to keep my registers organized, I like to nest them in C++ namespaces instead of having a huge list of preprocessor macros (as can be seen in the methods above). In order to make defining registers simple, and still allow the compiler to optimize nicely, I use a template like the one below to generate references to each ARM register:

namespace arm {

// Convenience template for taking an integer register address and converting
// to a reference to that address.
template <typename T> constexpr T &Register(uint32_t addr) {
  return *reinterpret_cast<T *>(addr);
}

// Typedefs for register references of 32, 16 and 8 bits.
using Reg32 = volatile uint32_t;
using Reg16 = volatile uint16_t;
using Reg8 = volatile uint8_t;

} // namespace arm

Using that template, we can then go through and quickly define each of the registers in the ARM System Control Block (SCB).

namespace arm {
namespace scb {
// Interrupt control and state register (RW)
static Reg32 &ICSR = Register<uint32_t>(0xE000ED04);
// Configurable Fault Status Register
static Reg32 &CFSR = Register<uint32_t>(0xE000ED28);
// MemManage Fault Address Register
static Reg32 &MMFAR = Register<uint32_t>(0xE000ED34);

//// Register subfields
// 1 if MMFAR has valid contents
const uint32_t CFSR_MMFSR_MMARVALID = (1 << 7);
// 1 if fault occurred during FP lazy state preservation
const uint32_t CFSR_MMFSR_MLSPERR = (1 << 5);
// 1 if fault occurred on exception entry
const uint32_t CFSR_MMFSR_MSTKERR = (1 << 4);
// 1 if fault occurred on exception return
const uint32_t CFSR_MMFSR_MUNSTKERR = (1 << 3);
// 1 if a data access violation occurred
const uint32_t CFSR_MMFSR_DACCVIOL = (1 << 1);
// 1 if an eXecute Never violation has occurred
const uint32_t CFSR_MMFSR_IACCVIOL = (1 << 0);
} // namespace scb
} // namespace arm

With any modern compiler, these constants will be nicely inlined, resulting in zero runtime overhead compared to the old-school #define method. Depending on how you like to write your code, either method will work just as well; this is more a personal preference point than anything else.
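One side effect of treating registers as references is that code taking a Reg32 & can be exercised on a development machine by handing it an ordinary variable. The sketch below is an assumption-laden illustration rather than part of the companion code: it widens the address parameter to uintptr_t so it builds on a 64-bit host, and data_access_fault and fake_cfsr are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

namespace arm {
// Same idea as the Register template above, but taking uintptr_t so that the
// sketch also builds on a 64-bit host.
template <typename T> T &Register(std::uintptr_t addr) {
  return *reinterpret_cast<T *>(addr);
}
using Reg32 = volatile uint32_t;
const uint32_t CFSR_MMFSR_DACCVIOL = (1u << 1);
} // namespace arm

// Hypothetical code under test: takes the register by reference, so a test
// can pass in a stand-in instead of the real hardware address.
bool data_access_fault(arm::Reg32 &cfsr) {
  return (cfsr & arm::CFSR_MMFSR_DACCVIOL) != 0;
}

// Stand-in for the real CFSR: an ordinary variable in host memory.
static arm::Reg32 fake_cfsr = 0;
```

On the target you would keep passing arm::scb::CFSR; on the host, a reference built from the address of fake_cfsr behaves the same way as far as the calling code is concerned.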

Hopefully some of this information comes in useful when debugging your own embedded projects. As ever, a Github repo containing some example code is available here.

DEFCON 27: Badge Writeup

Companion code for this post available on Github

If you just want to cut to the chase and flash your own badge with the Chameleon firmware, grab this build and jump to the “Flashing the badge” section.

This year at DEFCON, we were lucky enough to be provided with another electronic badge, this time courtesy of Joe Grand. The badge is a very sleek design featuring a quartz face, lanyard mounting straps and a Kinetis KL27 series microcontroller (specifically, a KL27P64M48SF2 ). The badge also has an unusual communication mechanism, an NXH2261UK Near-Field Magnetic Induction chipset and antenna.

This year the core badge hardware was the same across badge types, the only differences being a ‘badge type’ byte in the firmware for each badge, and various colours of quartz on the non-human badges.

Human Badge, Front and Back

Some more information on the badge itself, including pictures of all the badge types, can be found on Joe Grand’s Website.

If one wanted to complete the badge without any trickery, they would need to go around the conference interacting with all of the other badge types, including a select few ‘magic’ badges, in order to complete their badge. If we peek at the source code of the badge, we can see exactly what’s needed:

// Bit masks for badge quest flags
#define FLAG_0_MASK 0x01 // Any Valid Communication
#define FLAG_1_MASK 0x02 // Talk/Speaker
#define FLAG_2_MASK 0x04 // Village
#define FLAG_3_MASK 0x08 // Contest & Events
#define FLAG_4_MASK 0x10 // Arts & Entertainment
#define FLAG_5_MASK 0x20 // Parties
#define FLAG_6_MASK 0x40 /* Group Chat (all 6 gemstone colors:
                          Human/Contest/Artist/CFP/Uber +
                          Goon + Speaker + Vendor + Press + Village) */
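
Put together, the seven masks cover the low seven bits of the flags byte. As a rough sketch, assuming that "complete" simply means every one of these bits is set (I have not traced the firmware's own check, and all_quests_done is a name invented here), the test looks like this:

```cpp
#include <cassert>
#include <cstdint>

// OR of FLAG_0_MASK through FLAG_6_MASK as quoted above: 0x7F.
constexpr uint8_t kAllQuestFlags = 0x01 | 0x02 | 0x04 | 0x08 | 0x10 | 0x20 | 0x40;

// Hypothetical helper: true once every quest flag bit is set. The MSB is
// ignored, since no mask uses it.
constexpr bool all_quests_done(uint8_t flags) {
  return (flags & kAllQuestFlags) == kAllQuestFlags;
}

static_assert(kAllQuestFlags == 0x7F, "seven flags occupy the low seven bits");
```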

To save yourself some walking and learn a bit more about the badge firmware, read on and we’ll cover two ways to complete the badge the hardware hacking way.


In order to debug or flash your device, you’ll need one of the many ARM programmers available. Joe Grand recommended NXP’s own LPC-Link 2 but you can likely use any debug probe like the Segger J-Link ($$), or the Black Magic Probe (much more affordable), which is what I’ll be using.

You will also need a particular TagConnect cable, the TC-2050-IDC-NL-050-ALL, or some fine gauge wire and a steady hand. If you plan on developing many of your own ARM based designs, I would strongly recommend you pick up the cable. The convenience and cost savings of not having to place 1.27mm pitch pin headers quickly make up for the price of the cable. You may also want to pick up some cable retaining clips, which make extended debugging require one fewer arm.

If you intend to compile for or flash your badge, you will also need the GCC ARM toolchain, which you can install using your package manager of choice:

sudo apt install binutils-arm-none-eabi gcc-arm-none-eabi gdb-arm-none-eabi

First Approach: No firmware rewriting

Our initial goal was to solve the badge “legit” (for some definition of the word), by not rewriting the firmware in any way. For this method, you will need to populate the 1.8V serial headers on the opposite side to the tag-connect pads. This method will require two badges, and the workflow goes like this:

  • On badge A, we connect our favourite debugger (GDB) over SWD
  • We then overwrite the game state in memory, tricking that badge into thinking it is solved. This is not persistent across reboots (since it’s only a change to SRAM), but will be good enough for now.
  • We then un-halt the CPU on badge A, and connect to it over UART.
  • On the UART, now that the badge is ‘complete’ we have three extra options - one of which is ‘craft packet’. We can use this to spoof packets from other badge types.
  • On badge A, we iterate through broadcasting all badge types (with magic bit set), and after two rounds of this badge B will be complete, as though it had actually interacted with the real badges.
  • We can now reboot badge A, which reverts to being a normal, zero progress badge.

In order to trick badge A into thinking it’s complete, we first need to figure out what memory location holds the game state variable. I’m not skilled at RE, so instead I’ll cheat a little, and use the linker map, which is available on the DEFCON media server. As a quick recap for those that haven’t seen linker maps before, the map contains the load and virtual address of all variables and functions in the finished binary. It can be extremely helpful when debugging embedded systems, as we’ll see here. If we search through the map file for the badge_state variable, we can see that it’s located at SRAM address 0x1ffffcdc:

                0x1ffffcdb        0x1 ./source/dc27_badge.o
                0x1ffffcdc        0x1 ./source/dc27_badge.o
                0x1ffffcdd        0x1 ./source/dc27_badge.o

This means that to trick our badge, we just need to overwrite this one memory address. To do that, we’ll connect to the badge over SWD (check the “Flashing the badge” section below for a more thorough explanation of this process) in order to debug it using GDB. Once you’ve attached to the badge, there’s only one necessary command to set the flags:

set {char}0x1ffffcdc = 7

Once you’ve done that, hit c for continue to un-halt the badge CPU. You can now connect the four serial lines of the black magic probe to the UART pinout on the opposite side of the battery. With the badge face down and the SWD connector to the south, the staggered pinout for serial is GND (black), TX (green), RX(purple), VCC (red). Unlike with some of the other programmers, you must connect the power line of the black magic probe in order to power the on-board level shifter. With other programmers, be aware that they may attempt to power the badge themselves, and applying voltages over the expected 1.8V badge voltage may cook your badge.

Now that you have your game state updated, when you connect to the serial console using screen /dev/ttyACM1 115200, you should be greeted with three additional options:

Extended serial commandset on complete badge

The last of these, ‘Update Transmit Packet’, is what we’ll be using to get our B badge to complete. If we take a look at the firmware, we can see how the packets are constructed:

struct packet_of_infamy // data packet for NFMI transfer
{
  uint32_t uid;   // unique ID
  uint8_t type;   // badge type
  uint8_t magic;  // magic token (1 = enabled)
  uint8_t flags;  // game flags (packed, MSB unused)
  uint8_t unused; // unused
};

With this info, we can craft our own packets, masquerading as any badge we want. Here’s a complete list of badge types, and the command you need to send in order to become them:

Human   U 772502840001ff00
Goon    U 772502840101ff00
Speaker U 772502840201ff00
Vendor  U 772502840301ff00
Press   U 772502840401ff00
Village U 772502840501ff00
Contest U 772502840601ff00
Artist  U 772502840701ff00
CFP     U 772502840800ff00
Uber    U 772502840901ff00
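
If you would rather generate these strings than copy them, a few lines on your host machine will do. This is a sketch only: craft_update_command is a hypothetical helper, and it assumes the hex string is simply uid, type, magic, flags and the unused byte in that order, which is what the table above suggests.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format the serial command for a spoofed packet.
// Field order follows packet_of_infamy; the unused byte is always sent as 00.
std::string craft_update_command(uint32_t uid, uint8_t type, uint8_t magic,
                                 uint8_t flags) {
  char buf[32];
  std::snprintf(buf, sizeof buf, "U %08lx%02x%02x%02x00",
                static_cast<unsigned long>(uid), static_cast<unsigned>(type),
                static_cast<unsigned>(magic), static_cast<unsigned>(flags));
  return std::string(buf);
}
```

Calling craft_update_command(0x77250284, 1, 1, 0xff) reproduces the Goon row, and changing only the type argument walks through the rest of the list.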

After manually cycling through these on badge A to unlock badge B, you can then reset badge A and do the same for it, resulting in two ‘legitimately’ unlocked badges. Of course, if that’s not enough for you, we can take it one step further: automating the process by building a Chameleon badge.

Second Approach: Building a Chameleon

After spending several hours manually rotating packets to advance other badges, we decided it was time to automate the process. Since the firmware is freely available on the DEFCON Media Server we can grab it, modify it to our heart’s content and then flash it to our badge. The first stumbling block I hit is that the software is written to rely on NXP’s own libc implementation, Redlib, and downloading the official NXP toolchain on the DEFCON wifi was going to take 8 hours. Instead of that, I rewrote the software slightly to use Newlib, which is packaged along with the GCC ARM toolchain. The full modified firmware building against newlib-nano (and including the chameleon patches) is available in this Github repo.

Once we have the original firmware building, we can go about editing it to broadcast as every other badge type. In order to hook this in, there are two main changes to be made. The first is that we need some way of keeping track of time - the systick implementation Joe used here only acts as a blocking countdown timer, since he’d just been using it for delays. Since we want to keep an idea of how long we’ve been broadcasting as one badge, we need to add a second counter we can use as monotonic time. This is a quick two line change:

--- a/source/dc27_badge.c
+++ b/source/dc27_badge.c
@@ -423,6 +423,7 @@ static uint32_t pflashSectorSize = 0;

 // Timer
 volatile uint32_t g_systickCounter;
+// Monotonic count-up timer
+volatile uint32_t g_monotonicTime = 0;
 volatile bool g_lptmrFlag;

 // UART2 (to/from host)
@@ -2935,6 +2952,7 @@ void SysTick_Handler(void)
+    g_monotonicTime++;

Now, every time the SysTick interrupt is generated, as well as decrementing the delay timer we will also increment our own monotonic timer. The SysTick timer is configured to interrupt every 1ms, however it is also paused when the badge enters sleep mode, which it does while not actively transmitting / receiving packets. Since our timer will only advance during transmit, we can use a relatively short interval in our code, since it will get stretched out as the chipset sleeps. With the timer in place, the patch for a ‘chameleon’ badge is relatively straightforward, and is added at the top of the while (1) block in main():

// Outside our loop, declare our state variables:
static uint32_t state_change_timer = 0;
// Inside our while(1) loop, handle the chameleon code
if (g_monotonicTime > state_change_timer + 1000) {
  // If it's been 1000 systicks since we last changed our badge state, it's
  // time to update. First, reset our timer to the current monotonicTime.
  state_change_timer = g_monotonicTime;
  // Now, increment our badge type by 1, changing our identity.
  nxhTxPacket.type = nxhTxPacket.type + 1;
  // If we've cycled through all the way to UBER (which is currently read only
  // as Human, and so not particularly useful to broadcast) then go back to the
  // beginning, skipping human and starting at Goon.
  if (nxhTxPacket.type >= UBER) {
    nxhTxPacket.type = GOON;
  }
  // Having changed our packet struct, we now need to load it into the
  // NXH2261 to be broadcast.
  if (KL_UpdatePacket_NXH2261(nxhTxPacket)) {
    // If we fail once, try again. Joe seems to do this elsewhere in the code.
    if (KL_UpdatePacket_NXH2261(nxhTxPacket)) {
      // If we fail a second time, give up and log a message.
    }
  }
}
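
One subtlety in the timer comparison: g_monotonicTime is a 32-bit counter, so state_change_timer + 1000 can wrap around before the counter itself does, which would trigger one premature state change. At one tick per millisecond that takes roughly 49.7 days of accumulated ticks, so it is not a real concern for a conference badge, but the wrap-safe form costs nothing. A sketch, with interval_elapsed being a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// True once more than `interval` ticks have passed since `last`. Unsigned
// subtraction wraps modulo 2^32, so the difference stays correct even when
// the counter has rolled over between `last` and `now`.
constexpr bool interval_elapsed(uint32_t now, uint32_t last,
                                uint32_t interval) {
  return static_cast<uint32_t>(now - last) > interval;
}

// 16 ticks after a timestamp near the top of the range: not yet elapsed.
// The additive form (now > last + 1000) would wrongly report true here.
static_assert(!interval_elapsed(0xFFFFFF10u, 0xFFFFFF00u, 1000u),
              "no premature trigger near wraparound");
```

In the badge loop this would replace the condition with interval_elapsed(g_monotonicTime, state_change_timer, 1000).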

There are some other fun things you can do in the firmware, such as enabling a longer version of everyone’s favourite song, or editing your LED pattern, but for the purposes of this post those are left as an exercise to the reader. Once you’ve made your mods, from the Firmware/Debug folder you can run make dc27_badge.axf to rebuild the firmware. If all goes well, you should get a nice printout of your memory utilization and a success message:

Memory region         Used Size  Region Size  %age Used
   PROGRAM_FLASH:       58224 B        64 KB     88.84%
            SRAM:        5228 B        16 KB     31.91%
         USB_RAM:          0 GB        512 B      0.00%
Finished building target: dc27_badge.axf

Flashing the badge

Now that we have our updated badge firmware, it’s time to flash. I’ll assume a black magic probe here; for other probes please consult their manuals. The first thing we need to do is update the firmware on our black magic probe. There seems to be a bug in the latest official firmware (at time of writing, 1.6.1) where the KL27x64 series is not recognized properly, and will show up as a generic Cortex-M part. This causes the flashing to fail, since the KL27x64 requires a specific flash unlock code before programming. To update your black magic probe, clone the firmware repo from Github, build it with make, and then perform a DFU update on your probe with the following command:

sudo dfu-util -d 1d50:6018,:6017 -s 0x08002000:leave -D src/blackmagic.bin

Now that the firmware is updated, we can connect to the badge. You will first need to do one of the following:

  • Remove the quartz face of your badge (gently) with a shim
  • Cut down the length of your tagconnect locating pins so that they are ~0.5mm shorter than the pogo pins

Removing the quartz will keep your cable intact, and allow you to clip the cable in place for longer development sessions. The adhesive is strong enough that it can survive being carefully removed and reattached a few times.

Once you have connected the tag-connect cable to the badge one way or the other, it is time to fire up arm-none-eabi-gdb and do our dirty work. Once you have GDB open, we first need to point it to the black magic probe as our remote debugging tool. To do this, we use the command target extended-remote /dev/ttyACM0, where /dev/ttyACM0 should be the first of the two serial endpoints exposed by the black magic probe. Now that we have GDB connected to the probe, we can scan for devices, using monitor swdp scan. This should return a list of two devices for the DEFCON badge: the chipset, and a recovery mode that I have not explored. N.B: If your scan doesn’t return KL27x64 M0+, and instead returns ‘Generic Cortex-M’, close GDB and retry. This seems to be a race condition of some sort. Since we want to debug the main chip, we attach to it using attach 1, which halts the core and prints our current stack frame. A successful attach session should look somewhat like the following:

Successful attach over GDB

Now that we are hooked up and ready to go, flashing the badge is relatively straightforward: we need to select the file to load using file /path/to/dc27_badge.axf, and then to program the badge we just need to run load. If you don’t want to compile your own firmware, you can use this build of a chameleon badge that I’ve created. Once you run load, you should see output like the following:

Firmware flashing over GDB

Once the load is complete, your badge will still be in a halt state. To get it running, either hit c for continue in GDB, or detach the probe and power cycle the badge. You should now have your own chameleon or otherwise custom firmware loaded up!


If this all seemed interesting, don’t be afraid to try it! For some more reading on developing for embedded ARM systems you can check out this tutorial series, and to keep abreast of the progress hacking next year’s badge, join the Hack the Badge slack for discussion.

You can also check out some other writeups of the badge, by some of the great people I met at the HHV this year:

STM32 & OpenCM3 4: Memories and Latency

This is the fifth post in a series on the STM32 series of MCUs. The previous post, on CANbus, can be found here.

As core frequencies increase, the performance penalty of loading instructions and data from slow flash memories increases. For a modern core such as the STM32F750 running at 216MHz, a single flash read can stall the CPU for 8 cycles. Luckily, these faster cores provide mechanisms to ameliorate or eliminate these stalls. Here I will go over two methods: loading functions into SRAM for zero-wait-state execution, and enabling the built-in I-cache on certain Cortex-M processors.

Executing from SRAM

Above a certain core frequency, it is no longer possible for the attached flash memory to keep up. This results in the need for ‘wait states’ inside the CPU - in order to progress to the next instruction, the CPU must stall and wait for the fetch from flash to complete. Even worse, the faster your CPU frequency gets, the more pronounced this problem becomes.

A first solution to this, which works on all embedded microcontrollers that allow executing from memory, is to simply copy the code to be executed from flash into SRAM once at startup, and then run it from there afterwards. Since most microcontrollers have single-cycle access latency for SRAM, this can reduce execution time significantly. Even on microcontrollers that support some amount of prefetching, copying functions to memory can be useful for code that must execute in deterministic time, or that is frequently jumped to from unpredictable locations (for example, interrupt service routines).

Luckily, there is a way to achieve this with GNU utilities. By default, all code will be placed into the .text section by your compiler. But it doesn’t have to! We can create our own sections, and do as we please with them. So for now, let’s create a section called sram_func and designate a function as being part of this section. First, let’s edit our linker script to tell it where the new section should go, and what it should contain.

/* Presumably, you already have a section like this defining the physical layout
   of your particular microcontroller */
MEMORY {
    rom (rx) : ORIGIN = 0x08000000, LENGTH = 64K
    ram (rwx) : ORIGIN = 0x20000000, LENGTH = 320K
}

/* Other existing directives should likely be kept above this new section,
   unless you already have something fancy going on and know better */

/* Now, we can create a new section definition */
.sram_func : { /* Our new section will be called .sram_func */
    /* This creates a new variable, accessible from our C code, which points
       to the start address of our newly created section, in memory.
       The reason for this will become apparent later. */
    __sram_func_start = .;
    /* We now tell the linker that into the .sram_func section it should place
       all of the section data we will later tag as '.sram_func'. */
    *(.sram_func)
    /* Pad the end of our section if necessary to ensure that it is aligned on a
       32-bit word boundary */
    . = ALIGN(4);
    /* We now take the end address of this new section, and make it also
       available to the program. */
    __sram_func_end = .;
} >ram AT>rom /* These two directives control the LMA and VMA for this section:
                 we state that it should be stored in rom (so that it can
                 actually be programmed onto your microcontroller), but
                 referenced at a location inside our ram segment. */
/* For our final directive, we need to know the location in ROM to load
   the data _from_ at the start of execution. */
__sram_func_loadaddr = LOADADDR(.sram_func);

Now that we have a place to put these functions, we can tell GCC to place them there using a small __attribute__ directive:

__attribute__((section(".sram_func")))
void exti1_isr(void) {
    // Body of a time-sensitive interrupt goes here
    gpio_set(GPIOA, GPIO1);
    // [...]
    gpio_clear(GPIOA, GPIO1);
}

With the addition of our __attribute__ field, gcc now knows to keep our function in a new section, and we can verify this with nm:

$ arm-none-eabi-nm -f sysv -C my_elf_file
Name                  Value   Class        Type         Size     Line  Section
exti0_isr           |080017fc|   W  |              FUNC|00000014|     |.text
exti1_isr           |20000330|   T  |              FUNC|00000028|     |.sram_func
exti2_isr           |080017fc|   W  |              FUNC|00000014|     |.text

As you can see, the ISR we just tagged as being destined for sram_func is no longer in the .text section, and the location of the symbol is not in the 0x0800 0000 ROM section, but the 0x2000 0000 SRAM area. So far so good, but if we were to deploy this code to the device now, as soon as we actually triggered the ISR we would almost certainly encounter a hard fault.

This is because we’ve told the linker that this code is located in RAM, but haven’t set up anything to actually move that code into RAM - when the microcontroller resets, that memory space will be initialized to junk. In order to fix this, we need to add some code at the very start of our application to read the data for the sram_func section out of ROM and copy it into RAM, where it can then be called. To do so, we use the three variables we defined earlier as part of the linker script:

// The variables defined in our linker script are available to us as externed
// unsigned words, the locations of which denote our sections.
extern unsigned __sram_func_start, __sram_func_end, __sram_func_loadaddr;

// Our sram_func_start and sram_func_end variables are located at the start and
// end of the memory space that we want to copy data into.
// The loadaddr variable is located in ROM, at the start of where the data to
// be copied is stored.
// Using these three variables, we can quickly copy the code across.
volatile unsigned *src = &__sram_func_loadaddr;
volatile unsigned *dest = &__sram_func_start;
while (dest < &__sram_func_end) {
  *dest = *src;
  src++;
  dest++;
}
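If you would rather keep the copy in a reusable helper, the loop can be factored out roughly as below. This is only a sketch: copy_section and its parameter names are made up for illustration, and it assumes the section starts and ends on a 32-bit word boundary.

```c
#include <stdint.h>

/* Copy a word-aligned section image from its load address (in ROM) to its
   run address (in RAM). The function and parameter names are illustrative
   and not part of libopencm3. */
void copy_section(const uint32_t *load, uint32_t *start, const uint32_t *end) {
    while (start < end) {
        *start++ = *load++;
    }
}
```

On target, this would be called early in startup with the addresses of the three linker symbols (load address, start and end), before any interrupt that lives in the section is allowed to fire.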

Now that we’ve done that, let’s verify that this is indeed faster. As a testbed, I have an STM32F7 series MCU running at 216MHz. I have configured an EXTI interrupt that is triggered on rising edges, and when fired sets and clears a GPIO. The pin driving the EXTI interrupt is then connected to a function generator running at 1Hz, and the input function and GPIO pin are connected to a scope. Here is a representative trigger of the ISR running from flash memory, where the blue trace is the input signal, and the purple trace is the pin toggled by the ISR:

Interrupt latency running from flash

We can see that after the trigger pin goes high, we have a delay of almost exactly 400ns before the ISR triggers and pulls the GPIO pin high. At 216MHz, that’s roughly 86 clock cycles! This is rather poor overhead, and for applications that make extensive use of interrupts, the delays will add up. Now let’s see what happens when we instead run our ISR out of SRAM:

Interrupt latency running from SRAM

Not bad! The latency between the trigger going high and the GPIO going high has been cut in half, and the total execution time of the ISR has also dropped by a little under 200ns itself.
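To put scope measurements like these into perspective, it helps to convert them into core clock cycles. The helper below is a small sketch (the name ns_to_cycles is my own) that rounds to the nearest cycle. At 216MHz, a 400ns delay works out to about 86 cycles, and halving the latency saves about 43.

```c
#include <stdint.h>

/* Convert a duration in nanoseconds to core clock cycles, rounded to the
   nearest whole cycle. Uses 64-bit intermediates to avoid overflow. */
uint32_t ns_to_cycles(uint32_t ns, uint32_t core_hz) {
    return (uint32_t)(((uint64_t)ns * core_hz + 500000000u) / 1000000000u);
}
```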


While loading code into memory can be useful for performance, it comes at the obvious tradeoff of taking up additional space. Luckily, some ARM platforms provide a section of memory called ITCM, or the Instruction Tightly Coupled Memory. Unlike its sister memory, the DTCM, the ITCM block can only be accessed by the CPU, and not at all by peripherals such as DMA controllers. It is also not located in a contiguous memory space with the rest of the system memories: at least on the STM32F750, it is located at address 0x0000 0000, unlike the rest of the volatile memories located at 0x2000 0000. Since it’s so isolated, it is an excellent location (if you have it) for any functions you may want to load into RAM.

Block diagram of STM32F750 with ITCM highlighted

To use the ITCM as a space for functions that need to be able to execute quickly, we can edit our linker script from above with two small changes:

MEMORY {
    rom (rx) : ORIGIN = 0x08000000, LENGTH = 64K
    ram (rwx) : ORIGIN = 0x20000000, LENGTH = 320K
    /* Here we add a new memory region: the ITCM RAM */
    itcm (rwx) : ORIGIN = 0x00000000, LENGTH = 16K
}

.sram_func : {
    __sram_func_start = .;
    *(.sram_func)
    . = ALIGN(4);
    __sram_func_end = .;
} >itcm AT>rom /* Instead of ram, we load to itcm */
__sram_func_loadaddr = LOADADDR(.sram_func);

No changes need to be made to our code that loads the function data from ROM, since after a recompilation the variables we used will automatically point at the new data location. Any functions loaded into memory in this block will not count against the memory space available to the program for heap, stack and globals.


Eagle-eyed readers may also notice another memory hidden away in the block diagram above:

Block diagram of STM32F750 with I-Cache highlighted

Regardless of whether you use the manual memory loading above (which I would recommend for methods you need to guarantee run without wait states), you can gain an often significant general performance enhancement simply by enabling the L1 I-cache built directly into the ARM core (on models that have it, that is - consult your data sheet!).

Unlike the ITCM RAM, the cache does not require manual management other than being turned on. To do so, we can follow the instructions from the ARM V7-M Architecture Reference Manual in section B2.2, “Caches and branch predictors”. As the document states, all caches are disabled at startup. To enable the I-cache, we need to first invalidate it, and then set a bit in the Cache Control Register to enable it. We only need two registers for this, so the code is relatively straightforward:

// Writing to the ICIALLU completely invalidates the entire instruction cache
#define ICIALLU (*(volatile uint32_t *)(0xE000EF50))
// The Configuration and Control Register contains the control bits for
// enabling/disabling caches
#define ARM_CCR (*(volatile uint32_t *)(0xE000ED14))

// We will also define two macros for data and instruction barriers
#define __dsb asm __volatile__("dsb" ::: "memory")
#define __isb asm __volatile__("isb" ::: "memory")

// Synchronize
__dsb; __isb;
// Invalidate the instruction cache (any write to ICIALLU does this)
ICIALLU = 0;
// Re-synchronize
__dsb; __isb;
// Enable the I-cache (IC, bit 17 of the CCR)
ARM_CCR |= (1 << 17);
// Force a final resync and clear of the instruction pipeline
__dsb; __isb;

You’ll note the inclusion of several dsb and isb blocks - since we’re messing with instruction caches, it’s a good idea to add some explicit synchronization barriers to the code. We will use DSB to prevent execution from continuing before all in-flight memory accesses are complete. We will also use ISB to flush the processor pipeline, forcing all following instructions to be re-fetched. Note that our inline assembly also specifies the ‘memory’ clobber flag since it may change the global state, and so should not be reordered away by the compiler.

When this cache is enabled, common codepaths will start to benefit passively. Even our already-optimized ISR will gain a little more performance from I-caching of GPIO manipulation code called from the ISR that wasn’t itself loaded into ITCM ram:

Interrupt latency running from RAM with I-Cache enabled

With these two tricks, you can unlock a significant amount of extra performance from your embedded system, and can even go further by enabling the D-cache also present in higher-spec ARM cores. However, be advised that the D-cache comes with pitfalls around non-TCM memories and DMA transfers.

STM32 & OpenCM3 Part 3: CANBus

Companion code for this post available on Github

This is the fourth post in a series on the STM32 series of MCUs and libopencm3. The previous post, on SPI and DMA, can be found here.

What is CANBus?

The CAN bus is a multi-master, low data rate bus for communicating between controllers in environments with potentially high EMI. Initially designed for automotive applications, it is becoming increasingly used in general automation environments as well as by hobbyists. Electrically, CAN uses a differential pair of signals, CANH and CANL, to send data on the bus. In order to transmit a logic ‘1’ (also known as ‘recessive’ in CAN parlance), the differential voltage of the lines is left at 0. To transmit a logic ‘0’ (dominant), the voltage between the lines is driven high. This means that any node transmitting a 0 will override the transmission of a node that is simultaneously trying to transmit a 1. It is this mechanism that allows for the priority system in a CAN network - since each CAN message begins with the message ID, starting from the MSB, any controller asserting a logic ‘0’ on the bus will clobber a controller attempting to transmit a logic ‘1’. Since all transmitters read the bus as they transmit, this clobbering can be detected by the controller with the lower priority transmission, which will back off until the bus is clear again. This protocol is therefore categorized as ‘CSMA/CD+AMP’, or Carrier Sense Multiple Access / Collision Detection + Arbitration on Message Priority.

CAN Signalling, courtesy of Wikipedia
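The arbitration mechanism is easier to see in a small simulation. The sketch below is host-side C rather than anything that runs on the CAN peripheral, and the function name can_arbitrate is my own. It assumes every node starts transmitting an 11-bit standard ID at the same instant and that all IDs are unique.

```c
#include <stddef.h>
#include <stdint.h>

/* Simulate arbitration between nodes that all begin transmitting an 11-bit
   standard ID at the same moment. Returns the index of the node that wins.
   Assumes 1 <= count <= 31 and that every ID is unique. */
size_t can_arbitrate(const uint16_t *ids, size_t count) {
    uint32_t contending = (1u << count) - 1u; /* one bit per node still sending */

    for (int bit = 10; bit >= 0; bit--) { /* MSB first, as on the wire */
        /* The bus reads dominant (0) if any contending node drives a 0 */
        unsigned bus = 1;
        for (size_t n = 0; n < count; n++) {
            if (((contending >> n) & 1u) && ((ids[n] >> bit) & 1u) == 0) {
                bus = 0;
            }
        }
        /* A node that sent recessive (1) but reads dominant backs off */
        for (size_t n = 0; n < count; n++) {
            if (bus == 0 && ((ids[n] >> bit) & 1u) == 1) {
                contending &= ~(1u << n);
            }
        }
    }

    for (size_t n = 0; n < count; n++) {
        if ((contending >> n) & 1u) {
            return n;
        }
    }
    return 0; /* not reached when the assumptions above hold */
}
```

Feeding it the IDs 0x155, 0x0F0 and 0x156 picks the node sending 0x0F0: the numerically lowest ID always wins, which is why low IDs are the high priority ones.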

Why would I use CAN?

When transferring data between two microcontroller systems, people are probably already familiar with I2C and SPI, which are commonly used for low (I2C, 100-400kHz) or high (SPI, 100MHz+) speed data transfer between ICs. However, both of these protocols are really intended for operation over a short distance, ideally on the same board. Running I2C or SPI off-board, even for relatively short distances, can start to result in bit errors at higher speeds or in the presence of interference. The electrical integrity problems with I2C and SPI can be alleviated by using differential signals, as is the case with RS422/485. This allows RS485 to transmit data at high (multiple megabit) speeds over distances of 300-900 feet. This might satisfy our reliability or distance requirements, but none of these protocols bake in support for multi-master communication - SPI is very strongly based around a single-master design, and while I2C does allow for multiple devices to control the bus, there is no built-in arbitration support. Similarly for RS485, the application developer must roll their own packet structure and arbitration to handle bus contention.

CANBus performs quite well on some of these points, being:

  • Differential for signal integrity
  • Inherently multi-master
  • Low component count (single transceiver IC + termination)
  • Available in MCUs costing as little as a dollar
  • Checksummed for data integrity

However, CANbus does have some drawbacks that make it a poor fit for other applications. These include:

  • Very limited packet size of 8 bytes
  • Maximum bus frequency much lower than SPI or RS485
  • Maximum bus size of ~64 nodes
  • Termination may need to be adjusted as nodes are added/removed

When deciding whether or not to use CAN, be sure to think carefully about the requirements of your application and whether or not CAN is the best fit.

Electrical specifications

For ‘High speed’ CAN (~512 Kbps), all controllers (nodes) in the system must be connected to a linear bus, with appropriate termination. This is to mitigate signal reflections, which can cause bit errors at receiving nodes. This does however mean that CAN buses can be slightly more work to add or remove nodes from, compared to systems that allow a ‘star’ topology (e.g. an ethernet switch). Instead each node must be connected directly to a previous node and to a subsequent node, or, in the case of the last node on either end, a terminating resistor of 120 ohms.

If one is willing to sacrifice some speed, ‘fault tolerant’ CAN (~128 Kbps) can be operated in a star topology, with the termination divided up and placed at each node. For more information, the Wikipedia page on CAN has some diagrams.

As an example implementation, I have created a small demo board in KiCad with switchable termination to be used for high-speed CAN communication. The design files are available here if you are interested in producing some yourself, or you can directly order them from PCBway here.

CAN Demo Board using STM32F091

Message format

CAN frames follow a defined format: all standard frames have an 11-bit identifier and up to 8 bytes of data. Extended frames allow 29 bit identifiers, but only the same 8 bytes of data. CAN frames also include checksums, and most CAN implementations in microcontrollers will automatically insert / verify checksums in hardware. The appearance on the wire of CAN frames is as follows:

CAN frame formats

  • SOF: Start of frame bit (dominant). Used for synchronization.
  • Identifier: The 11-bit (standard) or 29-bit (extended) message ID
  • RTR: Remote Transmission Request. Can be used by the application to indicate it wants another device to transmit.
  • IDE: Whether or not this is an extended CAN frame. The IDE bit is 0 (dominant) for standard frames and 1 (recessive) for extended frames, thus making all standard frames higher priority than extended frames.
  • DLC: Data length code. A 4 bit integer indicating the number of data bytes.
  • Data. Data may be between 0 and 8 bytes for both standard and extended frames.
  • CRC: 15-bit checksum for the frame data, followed by a 1-bit delimiter (16 bits on the wire).
  • ACK: When transmitting, the controller leaves the bus in a recessive state during the ACK bit. If any other device on the bus has received the just-transmitted frame and considers it valid, it will assert the bus during this bit, and the transmitter can know that the message was successfully transmitted.
  • EOF / IFS: End of frame / interframe separator.

As may be clear from the 8 byte max payload size, CAN is not a good choice for applications that need to transfer large quantities of data. Instead it is much more suited for controls and small sensor data.

N.B: The ‘RTR’ bit in a CAN message is mutually exclusive with the data segment.

If you set the RTR bit, you may still specify a data length code (DLC) but the peripheral will not transmit any data bytes. Be careful when receiving frames that you ignore any data bytes ‘received’ in RTR frames, as they will simply be junk memory, which can lead to pernicious bugs.
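One way to guard against this is to never read the DLC directly, and instead go through a helper that knows about RTR frames. The sketch below uses a hypothetical can_frame struct whose field names simply mirror the ones used later in this post. It also clamps the length to 8, since classic CAN permits DLC values above 8 that still mean 8 data bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical frame layout, for illustration only */
struct can_frame {
    uint32_t id;
    bool extended_id;
    bool rtr;
    uint8_t len;     /* DLC exactly as received */
    uint8_t data[8];
};

/* Number of bytes in frame->data that are actually meaningful */
uint8_t can_frame_payload_len(const struct can_frame *frame) {
    if (frame->rtr) {
        return 0; /* RTR frames carry a DLC but no data bytes */
    }
    return frame->len > 8 ? 8 : frame->len;
}
```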

Using CAN with libopencm3

Now that we have an understanding of the CAN bus architecture, let’s actually build a small application that will send and receive data on the bus. Setting up the basics is relatively straightforward with a call to can_init():

// Enable clock to the CAN peripheral
rcc_periph_clock_enable(RCC_CAN1);

// Reset the can peripheral
can_reset(CAN1);

// Initialize the can peripheral
auto success = can_init(
    CAN1, // The can ID

    // Time Triggered Communication Mode?
    //  pdf
    false, // No TTCM

    // Automatic bus-off management?
    // When the bus error counter hits 255, the CAN will automatically
    // remove itself from the bus. if ABOM is disabled, it won't
    // reconnect unless told to. If ABOM is enabled, it will recover the
    // bus after the recovery sequence.
    true, // Yes ABOM

    // Automatic wakeup mode?
    // 0: The Sleep mode is left on software request by clearing the SLEEP
    // bit of the CAN_MCR register.
    // 1: The Sleep mode is left automatically by hardware on CAN
    // message detection.
    true, // Wake up on message rx

    // No automatic retransmit?
    // If true, will not automatically attempt to re-transmit messages on
    // error
    false, // Do auto-retry

    // Receive FIFO locked mode?
    // If the FIFO is in locked mode,
    //  once the FIFO is full NEW messages are discarded
    // If the FIFO is NOT in locked mode,
    //  once the FIFO is full OLD messages are discarded
    false, // Discard older messages over newer

    // Transmit FIFO priority?
    // This bit controls the transmission order when several mailboxes are
    // pending at the same time.
    // 0: Priority driven by the identifier of the message
    // 1: Priority driven by the request order (chronologically)
    false, // TX priority based on identifier

    //// Bit timing settings
    //// Assuming 48MHz base clock, 75% sample point, 500 kBit/s data rate
    // Resync time quanta jump width
    CAN_BTR_SJW_1TQ,
    // Time segment 1 time quanta width
    CAN_BTR_TS1_11TQ,
    // Time segment 2 time quanta width
    CAN_BTR_TS2_4TQ,
    // Baudrate prescaler
    // 48MHz / (16 time quanta per bit * 500 kBit/s) = 6
    6,

    // Loopback mode
    // If set, transmitted frames are looped back to the receiver and
    // incoming bus traffic is ignored
    false,

    // Silent mode
    // If set, CAN can receive but not transmit
    false);

// Enable CAN interrupts for FIFO message pending (FMPIE)
can_enable_irq(CAN1, CAN_IER_FMPIE0 | CAN_IER_FMPIE1);
nvic_enable_irq(NVIC_CEC_CAN_IRQ);

// Route the CAN signal to our selected GPIOs
const uint16_t pins = GPIO11 | GPIO12;
gpio_mode_setup(GPIOA, GPIO_MODE_AF, GPIO_PUPD_NONE, pins);
gpio_set_af(GPIOA, GPIO_AF4, pins);
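The bit timing arguments are the least obvious part of that call, so here is a sketch of the arithmetic behind them. The function names are my own. The only assumption is the standard CAN bit structure: one synchronization quantum, followed by time segment 1, followed by time segment 2, with the sample taken at the end of segment 1.

```c
#include <stdint.h>

/* Time quanta in one bit: 1 sync quantum + TS1 + TS2 */
uint32_t can_tq_per_bit(uint32_t ts1, uint32_t ts2) {
    return 1u + ts1 + ts2;
}

/* Prescaler such that clock_hz / (prescaler * tq_per_bit) == bitrate.
   Returns 0 if the clock does not divide down exactly. */
uint32_t can_prescaler(uint32_t clock_hz, uint32_t bitrate,
                       uint32_t ts1, uint32_t ts2) {
    uint32_t denom = bitrate * can_tq_per_bit(ts1, ts2);
    if (denom == 0 || clock_hz % denom != 0) {
        return 0;
    }
    return clock_hz / denom;
}

/* Sample point in tenths of a percent (750 == 75.0%) */
uint32_t can_sample_point_permille(uint32_t ts1, uint32_t ts2) {
    return (1000u * (1u + ts1)) / can_tq_per_bit(ts1, ts2);
}
```

With a 48MHz clock, 500 kBit/s, TS1 = 11 and TS2 = 4, this gives a prescaler of 6 and a sample point of 75%. Choosing TS1 = 13 and TS2 = 2 keeps the same 16 quanta per bit (and the same prescaler) but moves the sample point to 87.5%.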

In order to receive messages, in our CAN ISR we need to check to see which FIFO has pending data, and can then read off the message. For this demo, we’ll just put all of the messages in the same queue to be processed later.

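void cec_can_isr(void) {
    // Message pending on FIFO 0?
    if (CAN_RF0R(CAN1) & CAN_RF0R_FMP0_MASK) {
        receive(0);
    }

    // Message pending on FIFO 1?
    if (CAN_RF1R(CAN1) & CAN_RF1R_FMP1_MASK) {
        receive(1);
    }
}

void receive(uint8_t fifo) {
    // Copy CAN message data into main ram
    Frame frame;
    can_receive(CAN1,
                fifo, // FIFO id
                true, // Automatically release FIFO after rx
                &frame.id, &frame.extended_id, &frame.rtr, &frame.filter_id,
                &frame.len, frame.data, &frame.ts);

    // Push the received frame onto a queue to be handled later
    // (rx_queue stands in for whatever queue your application uses)
    rx_queue.push(frame);
}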

So far, our application will receive and try to store all messages that appear on the bus. But for many applications, we may be able to ignore a lot of messages, and save ourselves some CPU time. To this end, the CAN peripheral on the STM32F091 has a series of filter banks that can be used to selectively accept different message types. The general structure of the filters is that you have an ID register used to input the data you want to match against, and then a mask register that defines which bits of ID register are to be matched. This can be a bit complex at first glance - let’s take a look at the relevant figure in the ST reference manual:

Filter Bank Configuration

As an example, let’s say that we have a device that only wants to receive two types of message:

  • Messages with an ID less than 256, all of which are system broadcast messages
  • Messages with an ID of 342 and the RTR bit set

Since these are both standard frames, we can use 16 bit filters, to save space. From figure 315 we can see that the first 11 bits of the register match against the ID, and bit 4 in the lower byte matches the RTR flag in the CAN message. So for our first filter, we want to assert that the message ID is <= 255. Since 255 is 0xFF, or 8 bits set, we know that any ID numbers above 255 will have one of bits 9-11 set. So to match only lower IDs, we can assert that the top three bits of the ID are zero. So for our first filter, we can create it like so:

const uint16_t id1 = 0;               // We want to assert the high bits are zero
const uint16_t mask1 = (0b111 << 13); // The only bits we want to compare are STDID[10:8] (register bits 15:13)

For our second filter, we want to match the ID exactly, so we will load our ID register with our actual desired message value (342) and in our mask we will select all bits of the STDID field. Since we want to assert that the RTR field is also set, we will likewise place a 1 in both the ID and MASK registers at bit 4, like so:

const uint16_t id2 = (
    (342 << 5) | // STDID
    (1 << 4)     // RTR
);
const uint16_t mask2 = (
    (0b11111111111 << 5) | // Match all 11 bits of STDID
    (1 << 4)               // Match the RTR bit
);

Once we have our filters, we can configure the CAN peripheral with them like so. All messages that match either of these filters will be placed into FIFO 0.

// Create a filter mask that passes all critical broadcast & command
// CAN messages
can_filter_id_mask_16bit_init(
    0,          // Filter number
    id1, mask1, // Our first filter
    id2, mask2, // Our second filter
    0,          // FIFO 0
    true);      // Enable
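Because filter mistakes fail silently (the message simply never arrives), it can be worth modelling the comparison on a host machine before flashing. The sketch below is a software model, not the hardware implementation. The function names are my own, and it assumes the 16-bit register layout described above: STDID in bits 15:5 and RTR in bit 4, with the remaining bits left at zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pack a standard frame header the way a 16-bit filter register sees it:
   STDID in bits 15:5, RTR in bit 4 */
uint16_t filter_bits(uint16_t std_id, bool rtr) {
    return (uint16_t)(((std_id & 0x7FFu) << 5) | (rtr ? (1u << 4) : 0u));
}

/* Mask mode: only the bits set in mask_reg must equal the ID register */
bool filter_matches(uint16_t id_reg, uint16_t mask_reg, uint16_t frame_bits) {
    return ((frame_bits ^ id_reg) & mask_reg) == 0;
}
```

With the broadcast filter (ID register 0, mask 0x7 << 13), IDs 0 through 255 pass and 256 is rejected. With the exact-match filter, only ID 342 with RTR set gets through.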

Putting it together

Now that we have our CAN peripheral initialized, let’s write a simple demo application. We’ll use the demo board mentioned above (which you can order directly from PCBWay here) to create a simple program that forwards bytes from the UART over the CAN bus. In our main application loop, we’ll first take any characters that have been received over the UART and transmit them over CAN. (Implementation details of the Frame class can be seen here for those curious.)

  // Loop over any characters pending in the UART Rx buffer,
  // and send each one over the CAN bus as a single message.
  char c;
  while (Uart::get(&c)) {
    // Turn on our activity LED
    gpio_set(GPIOB, GPIO12);
    // Echo this character back to the serial console so we can see what
    // we've typed
    Uart::put(c); // (method name assumed; see the full firmware listing)
    // Create a new CAN Frame holder
    CAN::Frame frame;
    frame.id = 1; // Our message ID
    frame.extended_id = false; // This is not an extended ID
    frame.rtr = false; // This is not a request to transmit
    frame.len = 1; // We intend to send one data byte
    frame.data[0] = c; // Our uart character is the first datum
    CAN::transmit(frame); // Send the frame to the CAN output mailbox
    gpio_clear(GPIOB, GPIO12); // Clear our activity LED
  }

We also need to receive frames off the bus and display the data. The receive interrupt we wrote earlier will queue the frames, so we can pop them off in order and print out the details.

  // Loop over any CAN frames pending in the CAN buffer, and print out
  // the ID of the message and all the data bytes.
  CAN::Frame frame;
  while (CAN::pop(frame)) {
    // Turn on our CAN activity LED
    gpio_set(GPIOB, GPIO13);
    // Print the frame ID and all data bytes as hex and plain characters
    printf("Rx ID: %u Data: ", frame.id);
    for (int i = 0; i < frame.len; i++) {
      printf("%02x(%c) ", frame.data[i], frame.data[i]);
    }
    printf("\n");
    // Turn off the activity LED
    gpio_clear(GPIOB, GPIO13);
  }

The full firmware listing can be found here.

In order to test this, we can assemble two test boards and flash the same firmware to each. We can then connect the CANH and CANL pins of each board using jumpers, and configure the termination using jumpers. Since each board is connected to only one other board, we will set the jumper position for the connected header to pins 2-3, which connects the jumper pins directly to the transceiver. For the other set of jumpers, we select pins 1-2 to connect the terminating resistors (in this case a split termination of two 59 Ohm resistors and a 4.7nF capacitor) to the bus.

Two demo boards connected up

Once the boards are connected, we can connect a USB to UART adapter to each one and try sending some data back and forth. If everything is working properly, typing into the console of one board will cause it to send characters over CAN to the other, and vice versa.

Communicating over CANBus

This concludes our overview of CANBus, and the implementation details of the CAN peripheral on the STM32 series of microcontrollers. Using the basics in this post you should be able to create far more interesting applications.

As per usual, the code for this post is available on Github.

Thumb vs AVR Performance

The other day, I was working on an embedded application for driving stepper motors. In order to control my motor drivers, I was using the popular Atmega 328p MCU to speak with the host over serial, calculate paths, and pulse the stepper driver ICs at varying rates to control the speed of the system. Everything was working well, but I was beginning to run up against the limit of what the AVR could do in between the pulses of the motor clock, which at full tilt would request a motor update at 20KHz per motor. Since I’d been experimenting with the STM32 series of ARM based microcontrollers for other applications, I decided to spin a similar motor controller board based on the value line STM32F070 MCU, running at a respectable 48MHz (compared to the 16MHz of the AVR). My naïve assumption was that after porting the lower level parts of the code, I’d be able to deploy to the new MCU and immediately see a dramatic improvement in performance from the higher clock speed.

I was wrong.

When I deployed the code with only a single axis enabled, the MCU wasn’t able to keep up at all - the motor ready interrupt was firing at 20KHz, but the main event loop of the application, even heavily pared down to just the critical motor speed control code, was only able to execute at around 7KHz. What gives! In order to try and figure out where the performance gap is, I started off by writing a very simple benchmark loop for both devices - set a pin high, loop an 8 bit value from 0 to 255, and then clear the pin. Let’s start with the AVR version of the code:

void __attribute__ ((noinline)) test_loop() {
    PORTD |= (1 << DDD1); // Set indicator pin
    // Loop from 0..255. Include a NOP so that the compiler doesn't
    // (rightly) eliminate our useless loop.
    for (uint8_t i = 0; i < 0xFF; i++)
        asm __volatile__("nop");
    PORTD &= ~(1 << DDD1); // Clear indicator pin
}

AVR Cycle-counting

After compiling with avr-gcc -O2, we can disassemble with objdump to take a look at what’s actually running on our AVR core. As we’d expect, it’s fairly straightforward: the AVR instruction set has single-opcode bit set/clear ops, so the first and last lines become single instructions. The loop then consists of one initialization and three repeated instructions. Comments have been added below to explain each instruction:

00000080 <_Z4loopv>:
  80:   59 9a           sbi     0x0b, 1 ; 11    ; Set indicator pin
  82:   8f ef           ldi     r24, 0xFF       ; Load loop variable (255)
  84:   00 00           nop                     ; Loop body (nop)
  86:   81 50           subi    r24, 0x01       ; Subtract one from loop counter
  88:   e9 f7           brne    .-6             ; Jump back to loop label
  8a:   59 98           cbi     0x0b, 1 ; 11    ; Clear indicator pin
  8c:   08 95           ret                     ; Done

Now, let’s get a read on how fast this can run. I loaded it onto another of the same Atmega 328p chips, let it run and measured the output signal on the logic analyzer. Each high section of the output waveform was a very consistent 63.88µs. This is all very well and good, but let’s prove to ourselves that this is a correct measurement. Since our cores aren’t doing any fancy prefetching/pipelining, we should be able to reason fairly well about how many cycles we expect assembly code to take. So let’s turn to the AVR instruction set datasheet: according to the tables in here, it seems we can expect the loop instructions subi and nop to execute in a single clock cycle, and the brne instruction to execute in 2 cycles if the condition is true (i != 0). Armed with this, we can check our measured number by calculating the expected clock cycle count:

                   1 // sbi
                   1 // ldi
   255 * (1 + 1 + 2) // nop + subi + brne (taken)
+                  1 // cbi
                1023 // Total cycles

At a clock speed of 16MHz, this should take 63.94µs. This lines up very nicely with our measured number! Now let’s take a look at our surprisingly underperforming ARM core.
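The AVR tally above is simple enough to check mechanically. The sketch below (function names are my own) uses the same per-instruction costs as the table, including its simplification of charging every brne as taken, and converts the total into a duration.

```c
#include <stdint.h>

/* Cycle estimate for the benchmark: set pin, load counter, `iterations`
   passes of nop + subi + brne, then clear pin. Every brne is charged as
   taken (2 cycles), matching the tally in the post. */
uint32_t avr_loop_cycles(uint32_t iterations) {
    const uint32_t sbi = 1, ldi = 1, nop = 1, subi = 1, brne_taken = 2, cbi = 1;
    return sbi + ldi + iterations * (nop + subi + brne_taken) + cbi;
}

/* Duration of `cycles` clock cycles in whole nanoseconds (truncated) */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz) {
    return (uint32_t)(((uint64_t)cycles * 1000000000u) / clock_hz);
}
```

For 255 iterations at 16MHz this gives 1023 cycles, or 63937ns, within about 0.1% of the 63.88µs measured on the logic analyzer.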

ARM Performance test

Like before, we’re going to use a very simple loop. The only difference here is that we’ve replaced our AVR port manipulation logic with the STM32 equivalent. As before, we will compile with the -O2 flag.

void __attribute__ ((noinline)) test_loop() {
    GPIO_BSRR(GPIOB) = GPIO8; // Set indicator pin
    for (uint8_t i = 0; i < 0xFF; i++)
        asm __volatile__("nop");
    GPIO_BSRR(GPIOB) = GPIO8 << 16; // Clear indicator pin
}

Now, let’s flash that to our ARM core. Since it’s operating at a whopping 48MHz instead of the puny 16MHz of our AVR core, surely we’d expect it to execute our test loop in around (64 / (48/16)) = 21.3µs?

Spoiler alert: it does not

45.19µs! That’s less than half the speed we’d expect with a similarly implemented loop. Let’s take a look at the disassembly of our ARM loop and see what we can see (comments added on right):

080000c0 <_Z4loopv>:
 80000c0:       2280            movs    r2, #128        ; Load 0x80
 80000c2:       4b07            ldr     r3, [pc, #28]   ; Load GPIOB memory address
 80000c4:       0052            lsls    r2, r2, #1      ; Logical shift 0x80 left once to get 0x100
 80000c6:       601a            str     r2, [r3, #0]    ; Store 0x100 into GPIOB memory addr (set indicator)
 80000c8:       23ff            movs    r3, #255        ; Load loop variable
 80000ca:       46c0            nop                     ; Loop body
 80000cc:       3b01            subs    r3, #1          ; Subtract one from loop counter
 80000ce:       b2db            uxtb    r3, r3          ; Extend 8 bit value to 32 bit
 80000d0:       2b00            cmp     r3, #0          ; Compare to 0
 80000d2:       d1fa            bne.n   80000ca <_Z4loopv+0xa> ; Jump back into loop if != 0
 80000d4:       2280            movs    r2, #128        ; Load 0x80
 80000d6:       4b02            ldr     r3, [pc, #8]    ; Load GPIOB memory address
 80000d8:       0452            lsls    r2, r2, #17     ; Shift left by 17 to get bit 24 set (the reset bit for pin 8)
 80000da:       601a            str     r2, [r3, #0]    ; Clear GPIOB pin 8
 80000dc:       4770            bx      lr              ; Return
 80000de:       46c0            nop                     ; (Dead code for alignment)
 80000e0:       48000418        stmdami r0, {r3, r4, sl}; GPIOB memory mapped IO location

The core of our loop seems pretty similar to the AVR: we execute a nop, a subtract, a compare and a branch. We also have a zero-extend instruction (uxtb) on our loop counter in r3, which we can eliminate by converting our uint8_t loop counter to a uint32_t. But there are some goings-on in the lead-in and lead-out that are a bit confusing - such as why are we loading 0x80 and left shifting it instead of loading 0x100? The answer is that our ARM core isn’t actually running the ARM instruction set - it’s running in Thumb mode.

Thumb mode

One of the downsides of a RISC ISA is that you tend to need more instructions in your binary than you would on a CISC ISA with more operations baked into silicon as a single opcode. For embedded and other space-sensitive applications, having lots of 32-bit ARM instructions can waste precious program ROM. To get around this, ARM created the Thumb instruction set, with (mostly) 16-bit instructions. So long as you’re mostly using common instructions, this can come close to doubling your instruction density! But like all things, there are tradeoffs. One visible here is that the movs Thumb instruction only supports immediate values in the range [0..255], so loading our bit constant (1 << 8) cannot be done as a single movs - we have to load (1 << 7) and then shift it left by one once it’s in a register. The alternative for loading constants is what’s been done by GCC for the memory-mapped IO region for port B, which is to embed it as part of the function below the bx lr return call - 0x48000418 is our MMIO location, which is loaded using ldr r3, [pc, #28] to load a full word constant from memory. Presumably our 0x100 constant is not big enough for GCC to think it worth using the same trick here.
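To make the encoding limits a bit more concrete, here is a small host-side sketch that picks apart a few of the 16-bit opcodes from the disassembly above. The function names are made up for illustration; the field layouts are the standard Thumb encodings for MOVS (immediate), LSLS (immediate) and the PC-relative LDR.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of a 16-bit Thumb "MOVS Rd, #imm8" instruction. */
typedef struct {
    uint8_t rd;   /* destination register, r0-r7 */
    uint8_t imm8; /* immediate, 0-255 */
} thumb_movs_t;

/* MOVS (immediate) is encoded as 0b00100 : Rd (3 bits) : imm8 (8 bits). */
thumb_movs_t decode_thumb_movs_imm(uint16_t op) {
    assert((op >> 11) == 0x04);
    thumb_movs_t d = { (uint8_t)((op >> 8) & 0x7), (uint8_t)(op & 0xFF) };
    return d;
}

/* LSLS (immediate) is 0b00000 : imm5 : Rm : Rd. Returns the shift amount. */
unsigned thumb_lsls_shift(uint16_t op) {
    assert((op >> 11) == 0x00);
    return (op >> 6) & 0x1F;
}

/*
 * Address read by "LDR Rt, [pc, #off]". The PC reads as the address of the
 * instruction plus 4, rounded down to a word boundary.
 */
uint32_t thumb_literal_address(uint32_t insn_addr, uint32_t off) {
    return ((insn_addr + 4u) & ~3u) + off;
}
```

Feeding in 0x2280 gives r2 and an immediate of 128, and both ldr instructions resolve to the same literal word at 0x080000e0, which is where the 0x48000418 constant sits.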

So let’s calculate our cycle counts for the ARM code as we did for the AVR, using this list of instructions. We have a couple cycles overhead for our lead in and lead out, but the core of our loop has the following cycle characteristics:

80000c0:       2280            movs    r2, #128        ; 1 cycle
80000c2:       4b07            ldr     r3, [pc, #28]   ; 2 cycles
80000c4:       0052            lsls    r2, r2, #1      ; 1 cycle
80000c6:       601a            str     r2, [r3, #0]    ; 2 cycles
80000c8:       23ff            movs    r3, #255        ; 1 cycle
80000ca:       46c0            nop                     ; 1 cycle
80000cc:       3b01            subs    r3, #1          ; 1 cycle
80000ce:       b2db            uxtb    r3, r3          ; 1 cycle
80000d0:       2b00            cmp     r3, #0          ; 1 cycle
80000d2:       d1fa            bne.n   80000ca         ; 3 if taken, 1 if not taken
80000d4:       2280            movs    r2, #128        ; 1 cycle
80000d6:       4b02            ldr     r3, [pc, #8]    ; 2 cycles
80000d8:       0452            lsls    r2, r2, #17     ; 1 cycle
80000da:       601a            str     r2, [r3, #0]    ; 2 cycles
80000dc:       4770            bx      lr              ; 3 cycles

Calculating it out, that’s approximately 7 + (7 * 255) + 6 = 1,798 cycles! Close to double the count of our AVR code. At 48MHz, that’s a theoretical execution time of 37.46µs. Our actual measured time (45.19µs) is somewhat slower - my assumption is that this is due to flash wait states, as the STM32F070 this test was run on has 1 wait state for clock speeds > 24MHz. These delays should only be incurred for non-linear flash accesses, and so our repeated loop branching is a worst-case scenario here. Inserting an extra cycle into our loop for a prefetch miss results in a total cycle count of 7 + (8 * 255) + 6 = 2,053, and an expected runtime of 42.77µs, which is much closer to our measured time.
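As a quick sanity check on this kind of estimate, the arithmetic fits in a few lines of C. The per-section figures (7 cycles of lead-in, 7 per iteration, 6 of lead-out) are the ones used above; the function names are only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Total cycles for a loop with fixed lead-in and lead-out overhead. */
uint32_t loop_cycles(uint32_t lead_in, uint32_t per_iter,
                     uint32_t iters, uint32_t lead_out) {
    return lead_in + per_iter * iters + lead_out;
}

/* Convert a cycle count to microseconds at a given core clock. */
double cycles_to_us(uint32_t cycles, uint32_t clock_hz) {
    assert(clock_hz > 0);
    return (double)cycles * 1e6 / (double)clock_hz;
}
```

Calling loop_cycles(7, 7, 255, 6) models the loop with no wait states, and loop_cycles(7, 8, 255, 6) the version with one extra cycle per iteration; passing either result to cycles_to_us with a 48000000 Hz clock gives the expected runtime.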

What should I take away from all this

At the end of digging into all this, I have a newfound respect for the AVR architecture - single cycle IO access makes it much more competitive for applications with lots of standalone IO calls (but perhaps not for standard protocols like SPI, which tend to be offloaded to baked-in peripherals on the ARM cores, and with DMA support can be extremely powerful). It’s also useful to be aware of the limitations and quirks of the Thumb instruction set - keeping constants small, avoiding spurious resize instructions and minimizing branches are ways to eke more performance out of tightly looped controllers. If I were to summarize some key takeaways, they might be:

  • A higher clock speed doesn’t necessarily mean more performance
  • Cycle count estimations can be useful for ballpark runtime calculation
  • If you need to know for sure, measure it! It’s easy to make mistaken assumptions, but luckily it can also be simple to verify them.

STM32 & OpenCM3 Part 2: SPI and DMA

Companion code for this post available on Github

In the previous section, we covered alternate functions, and configured a log console over UART. This time, we’ll take a look at the SPI peripherals available on the STM32F0, use them to quickly shift out data to some shift registers, and then demonstrate how to offload that transfer from the main CPU using DMA. Since we have some other ICs involved here, instead of the simple breakout from before I will be using this MIDI relay board as a demonstration piece:

MIDI solenoid control board (a work in progress).

The ICs of interest here are the row of shift registers down the middle, each of which is responsible for driving the eight FETs by each of the solenoid connection points. For this example I have only populated the first row of 8, but this will be enough to demonstrate. Our STM32F070 IC, SWD header and UART breakout are visible in the bottom left corner of the board. Shift registers, for those that aren’t familiar, allow one to take serial data and convert it to a parallel output. These ICs are 74HC595 models, which are 8-bit shift registers with separate shift and storage registers. But how do we get the data in there? From the datasheet, we can find this table defining the behaviour as we manipulate the control inputs:

Functional description of the 595 shift register

So in order to shift data, we cycle SRCLK and on each rising edge the data on SER will be shifted in to the shift register, and all data presently in the shift register will be shifted over one. Once we have repeated this to load as much data as we might want to, we can then clock the RCLK line from low to high to shift the data in the shift register to the storage register, making it visible on the outputs.

Now, to drive this we could write a method that carefully takes each byte we want to send and iterates along each bit inside it, manually toggling the SER and SRCLK lines to shift data in. But this would be tedious, slow, and duplicating a built-in peripheral that does exactly the same thing: SPI!
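For reference, here is roughly what that manual approach looks like, run against a small software model of the 595 rather than real GPIO calls. The hc595_* names and the model itself are made up for this sketch, and it ignores details such as output enable and the serial output pin.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal model of one 74HC595: a shift stage and a storage (output) stage. */
typedef struct {
    uint8_t shift;   /* internal shift register */
    uint8_t storage; /* latched parallel outputs */
} hc595_t;

/* Rising edge on SRCLK: shift everything over by one, SER becomes bit 0. */
void hc595_srclk_rising(hc595_t *r, int ser) {
    assert(r != 0);
    r->shift = (uint8_t)((r->shift << 1) | (ser ? 1u : 0u));
}

/* Rising edge on RCLK: copy the shift stage to the outputs. */
void hc595_rclk_rising(hc595_t *r) {
    assert(r != 0);
    r->storage = r->shift;
}

/* Bit-bang one byte, most significant bit first, then latch it. */
void hc595_write_byte(hc595_t *r, uint8_t value) {
    for (int bit = 7; bit >= 0; bit--) {
        hc595_srclk_rising(r, (value >> bit) & 1);
    }
    hc595_rclk_rising(r);
}
```

Note that the outputs only change on the RCLK edge: eight SRCLK edges on their own leave the storage stage untouched.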

Serial Peripheral Interface

The SPI protocol is a simple communication interface usually consisting of 4 signals:

  • MOSI (Master Out Slave In)
  • MISO (Master In Slave Out)
  • SCK (Serial ClocK)
  • SS (Slave Select)

Unlike UART, this protocol has a clock signal - as a result, SPI buses can be operated at far higher speeds since both sides can know precisely when to latch each bit. Like UART, it is full duplex - the MOSI and MISO lines are each unidirectional, and can both transmit data during the same clock pulse. However, SPI also allows for multiple slaves (and, in more complex setups, multiple masters) on the same MISO and MOSI lines. In order to prevent slaves from reading / writing data not intended for them, the Slave Select signal is used to identify which chip is being addressed. Notably, the SS signal is active low - this means that we can use our SPI MOSI, SCK and SS lines to map perfectly to the SER, SRCLK and RCLK lines of our shift registers. Using this information, we can codify it in our schematic like so:

SPI connections for driving shift register. Additional 595s are fed the same SCK and NSS signals, but chained QH* -> SER.

So now that we have our SPI pins mapped to our shift register (in this case, we are using PB12-PB15 and the SPI2 peripheral), we can start work on initializing our SPI peripheral in preparation for sending data through it.

void spi_setup() {
    // Enable clock for SPI2 peripheral
    rcc_periph_clock_enable(RCC_SPI2);

    // Configure GPIOB, AF0: SCK = PB13, MISO = PB14, MOSI = PB15
    gpio_mode_setup(GPIOB, GPIO_MODE_AF, GPIO_PUPD_NONE, GPIO13 | GPIO14 | GPIO15);
    gpio_set_af(GPIOB, GPIO_AF0, GPIO13 | GPIO14 | GPIO15);

    // We will be manually controlling the SS pin here, so set it as a normal output
    gpio_mode_setup(GPIOB, GPIO_MODE_OUTPUT, GPIO_PUPD_NONE, GPIO12);

    // SS is active low, so pull it high for now
    gpio_set(GPIOB, GPIO12);

    // Reset our peripheral
    rcc_periph_reset_pulse(RST_SPI2);

    // Set main SPI settings:
    // - The datasheet for the 74HC595 specifies a max frequency at 4.5V of
    //   25MHz, but since we're running at 3.3V we'll instead use a 12MHz
    //   clock, or 1/4 of our main clock speed.
    // - Set the clock polarity to be zero at idle
    // - Set the clock phase to trigger on the rising edge, as per datasheet
    // - Send the most significant bit (MSB) first
    spi_init_master(
        SPI2,
        SPI_CR1_BAUDRATE_FPCLK_DIV_4,
        SPI_CR1_CPOL_CLK_TO_0_WHEN_IDLE,
        SPI_CR1_CPHA_CLK_TRANSITION_1,
        SPI_CR1_MSBFIRST
    );

    // Since we are manually managing the SS line, we need to move it to
    // software control here.
    spi_enable_software_slave_management(SPI2);

    // We also need to set the value of NSS high, so that our SPI peripheral
    // doesn't think it is itself in slave mode.
    spi_set_nss_high(SPI2);

    // The terminology around directionality can be a little confusing here -
    // unidirectional mode means that this is the only chip initiating
    // transfers, not that it will ignore any incoming data on the MISO pin.
    // Enabling duplex is required to read data back however.
    spi_set_unidirectional_mode(SPI2);

    // We're using 8 bit, not 16 bit, transfers
    spi_set_data_size(SPI2, SPI_CR2_DS_8BIT);

    // Enable the peripheral
    spi_enable(SPI2);
}
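The choice of a /4 prescaler can be sanity-checked with a little arithmetic. The STM32 SPI baud rate field only offers power-of-two dividers from 2 to 256, so picking one amounts to finding the smallest divider that keeps the clock at or below the target. The helper below is hypothetical (it is not part of libopencm3) and runs on the host.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the smallest power-of-two divider (2, 4, ... 256) such that
 * pclk_hz / divider <= max_sck_hz, or 0 if even /256 is too fast.
 */
uint32_t spi_pick_divider(uint32_t pclk_hz, uint32_t max_sck_hz) {
    assert(max_sck_hz > 0);
    for (uint32_t div = 2; div <= 256; div *= 2) {
        /* Compare without dividing, so rounding cannot hide an overshoot. */
        if ((uint64_t)pclk_hz <= (uint64_t)max_sck_hz * div) {
            return div;
        }
    }
    return 0;
}
```

With a 48MHz peripheral clock and a 12MHz ceiling this returns 4, which corresponds to the DIV_4 setting described in the comments above.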

Our SPI peripheral should now be ready to transmit data. In order to make things easier for us, let’s create a simple helper method that will transmit a given amount of data over the SPI bus:

void spi_transfer(uint8_t tx_count, uint8_t *tx_data) {
    // Pull CS low to select target. In our case, this just pulls the register
    // clock low so that we can lock in the new data at the end of the
    // transfer.
    gpio_clear(GPIOB, GPIO12);

    // For each byte of data we want to transmit
    for (uint8_t i = 0; i < tx_count; i++) {
        // Wait for the peripheral to become ready to transmit (transmit buffer
        // empty flag set)
        while (!(SPI_SR(SPI2) & SPI_SR_TXE));

        // Place the next data in the data register for transmission
        SPI_DR8(SPI2) = tx_data[i];
    }

    // Putting data into the SPI_DR register doesn't block - it will start
    // sending the data asynchronously with the main CPU. To make sure that the
    // data is finished sending before we pull the register clock high again,
    // we wait here until the busy flag is cleared on the SPI peripheral.
    while (SPI_SR(SPI2) & SPI_SR_BSY);

    // Bring the SS pin high again to latch the new data
    gpio_set(GPIOB, GPIO12);
}

So now we should be able to easily clock out data to our shift registers over SPI. To test this, let’s update our main loop from last time:

int main() {
    // Clock, UART, etc setup
    // [...]

    // Initialize our SPI peripheral
    spi_setup();

    // Make a very simple count up display using our 8 LEDs
    uint8_t i = 0;
    while (1) {
        spi_transfer(1, &i);
        i++;
        // Pause here between updates so that the count is slow enough to see
    }
}

Shift register output

Perfect, we can see that we are slowly counting up. Now, this is obviously a fairly small application of SPI - we only have 8 bits to transfer here (24 for a fully populated board); it will take a truly infinitesimal time to push this data. But if you have a lot of data to move, for example bitmap data you need to push to a screen, the amount of time it takes to move that data from memory to the SPI bus might start to become a problem - while you’re looping over all the data to send and moving it piece by piece to the SPI data register, you’re losing time to process other events or start drawing the next frame. Wouldn’t it be great if something so simple as moving data from memory to a peripheral could be offloaded somehow?
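To put rough numbers on that, here is a back-of-the-envelope sketch of how long a blocking transfer ties up the CPU. It assumes the 12MHz SPI clock and 48MHz core clock used so far and ignores any idle gaps between bytes, so treat the results as a lower bound; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Time in microseconds to clock out `bytes` bytes at `sck_hz`. */
double spi_transfer_time_us(uint32_t bytes, uint32_t sck_hz) {
    assert(sck_hz > 0);
    return (double)bytes * 8.0 * 1e6 / (double)sck_hz;
}

/* Core clock cycles that elapse while that transfer is on the wire. */
uint64_t spi_transfer_cpu_cycles(uint32_t bytes, uint32_t sck_hz,
                                 uint32_t core_hz) {
    assert(sck_hz > 0);
    return (uint64_t)bytes * 8u * core_hz / sck_hz;
}
```

A single byte costs only 32 core cycles, but a 1024 byte buffer is over 32,000 cycles spent doing nothing but feeding the data register.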

Direct Memory Access

DMA controllers allow us to offload certain types of data shuffling from the main processor, freeing it to get on with business. In the STM32F0 series, the controller can be used to move data between two peripherals, from a peripheral into memory, or from memory to a peripheral. For this example, we’re going to use it to copy data from memory to our SPI peripheral, so that it can be sent out to our shift registers. Each DMA controller has multiple channels, and those channels are all bound to specific peripheral functions. If we take a look at the STM32F0 series datasheet, we can find a table showing us which channels map to which peripherals.

DMA channel mapping for the STM32F070 MCU

Based on this, we can see that in order to transmit data on SPI2, we need to use DMA channel 5. So let’s start configuring our DMA controller:

void dma_init() {
    // Enable DMA clock
    rcc_periph_clock_enable(RCC_DMA);
    // In order to use SPI2_TX, we need DMA 1 Channel 5
    dma_channel_reset(DMA1, DMA_CHANNEL5);
    // SPI2 data register as output
    dma_set_peripheral_address(DMA1, DMA_CHANNEL5, (uint32_t)&SPI2_DR);
    // We will be using system memory as the source data
    dma_set_read_from_memory(DMA1, DMA_CHANNEL5);
    // Memory increment mode needs to be turned on, so that if we're sending
    // multiple bytes the DMA controller actually sends a series of bytes,
    // instead of the same byte multiple times.
    dma_enable_memory_increment_mode(DMA1, DMA_CHANNEL5);
    // Contrarily, the peripheral does not need to be incremented - the SPI
    // data register doesn't move around as we write to it.
    dma_disable_peripheral_increment_mode(DMA1, DMA_CHANNEL5);
    // We want to use 8 bit transfers
    dma_set_peripheral_size(DMA1, DMA_CHANNEL5, DMA_CCR_PSIZE_8BIT);
    dma_set_memory_size(DMA1, DMA_CHANNEL5, DMA_CCR_MSIZE_8BIT);
    // We don't have any other DMA transfers going, but if we did we can use
    // priorities to try to ensure time-critical transfers are not interrupted
    // by others. In this case, it is alone.
    dma_set_priority(DMA1, DMA_CHANNEL5, DMA_CCR_PL_LOW);
    // Since we need to pull the register clock high after the transfer is
    // complete, enable transfer complete interrupts.
    dma_enable_transfer_complete_interrupt(DMA1, DMA_CHANNEL5);
    // We also need to enable the relevant interrupt in the interrupt
    // controller, and assign it a priority.
    nvic_set_priority(NVIC_DMA1_CHANNEL4_5_IRQ, 0);
    nvic_enable_irq(NVIC_DMA1_CHANNEL4_5_IRQ);
}
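If the two increment settings seem abstract, a toy model may help. The sketch below is plain host-side C and has nothing to do with the real register layout; it just mimics a memory-to-peripheral channel writing every item to one fixed "register", with and without memory increment. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a memory-to-peripheral DMA channel. */
typedef struct {
    const uint8_t *memory; /* source address */
    int memory_increment;  /* non-zero: advance the source after each item */
    size_t count;          /* number of items left to transfer */
} dma_model_t;

/*
 * Run the channel to completion. Every item goes to the same peripheral
 * register; `sink` records what that register saw, in order.
 * Returns the number of items transferred.
 */
size_t dma_model_run(dma_model_t *ch, uint8_t *sink, size_t sink_len) {
    assert(ch != NULL && sink != NULL);
    size_t sent = 0;
    const uint8_t *src = ch->memory;
    while (ch->count > 0 && sent < sink_len) {
        sink[sent++] = *src;       /* the peripheral address never moves */
        if (ch->memory_increment) {
            src++;                 /* step to the next source byte */
        }
        ch->count--;
    }
    return sent;
}
```

With increment enabled the register sees each byte of the buffer in turn; with it disabled it sees the first byte over and over, which is the failure mode the comment in dma_init warns about.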

So now, our DMA controller is all set up to push data from memory to SPI2’s transmit buffer. But note that in our setup we didn’t specify our source memory location or how much data we’re sending - let’s add a method for that now:

void dma_start(void *data, size_t data_size) {
    // Note - manipulating the memory address/size of the DMA controller cannot
    // be done while the channel is enabled. Ensure any previous transfer has
    // completed and the channel is disabled before you start another transfer.
    // Tell the DMA controller to start reading memory data from this address
    dma_set_memory_address(DMA1, DMA_CHANNEL5, (uint32_t)data);
    // Configure the number of bytes to transfer
    dma_set_number_of_data(DMA1, DMA_CHANNEL5, data_size);
    // Enable the DMA channel.
    dma_enable_channel(DMA1, DMA_CHANNEL5);

    // Since we're manually controlling our register clock, move it low now
    gpio_clear(GPIOB, GPIO12);

    // Finally, enable SPI DMA transmit. This call is what actually starts the
    // DMA transfer.
    spi_enable_tx_dma(SPI2);
}

But this is only half the process - we also need to handle the termination condition of the DMA transfer, so that we can move our register clock high again to latch the data. So for this, we need to implement an interrupt handler for our DMA channel. DMA channels 4 and 5 use the same ISR - dma1_channel4_5_isr - so let’s implement that now.

void dma1_channel4_5_isr() {
    // Check that we got triggered because the transfer is complete, by
    // checking the Transfer Complete Interrupt Flag
    if (dma_get_interrupt_flag(DMA1, DMA_CHANNEL5, DMA_TCIF)) {
        // If that is why we're here, clear the flag for next time
        dma_clear_interrupt_flags(DMA1, DMA_CHANNEL5, DMA_TCIF);

        // Like the non-dma version, we don't want to latch the register clock
        // until the transfer is actually complete, so wait til the busy flag
        // is clear
        while (SPI_SR(SPI2) & SPI_SR_BSY);

        // Turn our DMA channel back off, in preparation of the next transfer
        dma_disable_channel(DMA1, DMA_CHANNEL5);

        // Bring the register clock high to latch the transferred data
        gpio_set(GPIOB, GPIO12);
    }
}
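Because channels 4 and 5 share one handler, the flag check matters: the ISR may have been entered for either channel. As a sketch of what the flag test boils down to, the DMA interrupt status register on this family packs four flags per channel (global, transfer complete, half transfer, transfer error) into consecutive bits. The helpers below are illustrative host-side code, not libopencm3 functions.

```c
#include <assert.h>
#include <stdint.h>

/* Flag offsets within each channel's 4-bit group in the status register. */
enum { FLAG_GIF = 0, FLAG_TCIF = 1, FLAG_HTIF = 2, FLAG_TEIF = 3 };

/* Bit mask for one flag of one channel (channels are numbered from 1). */
uint32_t dma_flag_mask(unsigned channel, unsigned flag) {
    assert(channel >= 1 && channel <= 7 && flag <= 3);
    return 1u << (4u * (channel - 1u) + flag);
}

/*
 * Which of channels 4 and 5 report transfer complete?
 * Returns a bitmask: bit 0 set for channel 4, bit 1 set for channel 5.
 */
unsigned shared_isr_sources(uint32_t isr) {
    unsigned sources = 0;
    if (isr & dma_flag_mask(4, FLAG_TCIF)) sources |= 1u;
    if (isr & dma_flag_mask(5, FLAG_TCIF)) sources |= 2u;
    return sources;
}
```

A handler that serviced both channels would test each flag in turn and clear only the ones it handled.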

To tie it all together and demonstrate that the DMA transfer is separate from normal CPU operations, let’s start a DMA transfer and then immediately write some text over the USART.

int main() {
    // Setup clock, serial, spi, etc
    // [...]

    // Initialize the DMA controller
    dma_init();

    // Allocate a nice big slab of data
    uint8_t data[1024];
    for (int i = 0; i < 1024; i++) {
        data[i] = i;
    }

    // Begin a DMA transfer using that data
    dma_start(data, 1024);

    // Immediately start printing some text to our console
    printf("Concurrent DMA and USART!\n");

    while (true) {
        // Nothing
    }

    return 0;
}

If we now tap the UART and SPI lines on the board with a logic analyzer, we can observe that we are indeed sending both SPI and UART data concurrently:

Trace of our SPI bus and UART TX

Success! We can see that while the main thread of execution has moved on to sending data over the USART, the DMA controller has begun sending out our kilobyte of data in the background. While DMA is still limited by sharing the same memory and peripheral bus as the processor, and so both must still negotiate if there are bus conflicts, it is a powerful tool for offloading simpler peripheral operations in this way. You can even do more complex DMA operations, such as pushing double-buffered video data by taking advantage of circular DMA and the “transfer half complete” interrupt.

As per usual, the code for this post is available on Github.

The next post in this series, on CANBus, can be found here

STM32 & OpenCM3 Part 1: Alternate Functions and USART

Companion code for this post available on Github

In the previous section, we covered the basics of compiling for, and uploading to, an STM32F0 series MCU using libopencm3 to make an LED blink. This time, we’ll take a look at alternate pin functions, and use one of the four USARTs on our chip to send information back to our host machine. As before, this is all based on a small breakout board for the STM32F070CBT6, but can be applied to other boards and MCUs.

Alternate functions

In addition to acting as General Purpose I/Os, many of the pins on the STM32 microcontrollers have one or more alternate functions. These alternate functions are tied to subsystems inside the MCU, such as one or more SPI, I2C, USART, Timer, DMA or other peripherals. If we take a look at the datasheet for the STM32F070, we see that there are up to 8 possible alternate functions per pin on ports A and B.

Alternate function tables for STM32F070

Note that some peripherals are accessible on multiple different sets of pins as alternate functions - this can help with routing your designs later, since you can to some degree shuffle your pins around to move them closer to the other components to which they connect. An example would be SPI1, which can be accessed either as alternate function 0 on port A pins 5-7, or as alternate function 0 on port B, pins 3-5. But for this example, we will be looking at USART1, which from the tables above we can see is AF1 on PA9 (TX) and PA10 (RX).
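Under the hood, selecting an alternate function comes down to two register fields per pin: a 2-bit mode field and a 4-bit function number, with the function numbers for pins 0-7 in one register (AFRL) and those for pins 8-15 in another (AFRH). The sketch below shows just the bit arithmetic on plain integers; the helper names are made up, and on real hardware the libopencm3 GPIO calls do this work for you.

```c
#include <assert.h>
#include <stdint.h>

/* The mode register holds 2 bits per pin; 0b10 selects alternate function. */
uint32_t moder_set_af(uint32_t moder, unsigned pin) {
    assert(pin < 16);
    moder &= ~(0x3u << (2u * pin));
    return moder | (0x2u << (2u * pin));
}

/*
 * AFRL covers pins 0-7 and AFRH pins 8-15, 4 bits per pin.
 * Takes the current value of whichever register holds `pin`
 * and returns the updated value.
 */
uint32_t afr_set(uint32_t afr, unsigned pin, unsigned af) {
    assert(pin < 16 && af < 16);
    unsigned shift = 4u * (pin % 8u);
    afr &= ~(0xFu << shift);
    return afr | ((uint32_t)af << shift);
}
```

For PA9 as AF1, that means setting bits 19:18 of the mode register to 0b10 and bits 7:4 of AFRH to 1.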

Universal Synchronous/Asynchronous Receiver/Transmitter

To quickly recap - USARTs allow for sending relatively low-speed data using a simple one-wire-per-direction protocol. Each line idles high until a transmission begins, at which point the sender pulls the line low for one bit period to signal a start condition (start bit). The sender then, using its own clock, drives the line high for logic 1 and low for logic 0 to transmit a configurable number of bits to the receiver, followed by an optional parity bit and a stop bit. The receiver calculates the time elapsed after the start condition using its own clock, and recovers the data. While simple to implement, they have a drawback in that they lack a separate clock line, and must rely on both sides keeping close enough time to understand each other. For our case, they work great for sending back debug information to our host computer. So let’s update our example from last time, to include a section that initializes USART1 with a baudrate of 115200, and the transmit pin connected to Port A Pin 9.

static void usart_setup() {
    // For the peripheral to work, we need to enable its clock
    rcc_periph_clock_enable(RCC_USART1);
    // From the datasheet for the STM32F0 series of chips (Page 30, Table 11)
    // we know that the USART1 peripheral has its TX line connected as
    // alternate function 1 on port A pin 9.
    // In order to use this pin for the alternate function, we need to set the
    // mode to GPIO_MODE_AF (alternate function). We also do not need a pullup
    // or pulldown resistor on this pin, since the peripheral will handle
    // keeping the line high when nothing is being transmitted.
    gpio_mode_setup(GPIOA, GPIO_MODE_AF, GPIO_PUPD_NONE, GPIO9);
    // Now that we have put the pin into alternate function mode, we need to
    // select which alternate function to use. PA9 can be used for several
    // alternate functions - Timer 15, USART1 TX, Timer 1, and on some devices
    // I2C. Here, we want alternate function 1 (USART1_TX)
    gpio_set_af(GPIOA, GPIO_AF1, GPIO9);
    // Now that the pins are configured, we can configure the USART itself.
    // First, let's set the baud rate at 115200
    usart_set_baudrate(USART1, 115200);
    // Each datum is 8 bits
    usart_set_databits(USART1, 8);
    // No parity bit
    usart_set_parity(USART1, USART_PARITY_NONE);
    // One stop bit
    usart_set_stopbits(USART1, USART_CR2_STOPBITS_1);
    // For a debug console, we only need unidirectional transmit
    usart_set_mode(USART1, USART_MODE_TX);
    // No flow control
    usart_set_flow_control(USART1, USART_FLOWCONTROL_NONE);
    // Enable the peripheral
    usart_enable(USART1);

    // Optional extra - disable buffering on stdout.
    // Buffering doesn't save us any syscall overhead on embedded, and
    // can be the source of what seem like bugs.
    setbuf(stdout, NULL);
}
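For a feel of what these settings mean on the wire, here is a host-side sketch of an 8N1 frame and of the baud rate arithmetic. It assumes the USART is clocked at 48MHz and uses 16x oversampling, where the divider is simply clock / baud; the function names are invented for this example and are not libopencm3 calls.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Line levels for one 8N1 frame: start bit (0), eight data bits sent
 * least-significant first, stop bit (1). levels[] must hold 10 entries.
 */
void uart_frame_8n1(uint8_t byte, uint8_t levels[10]) {
    assert(levels != 0);
    levels[0] = 0;                        /* start: pull the idle-high line low */
    for (int i = 0; i < 8; i++) {
        levels[1 + i] = (byte >> i) & 1u; /* high = 1, low = 0 */
    }
    levels[9] = 1;                        /* stop: return the line high */
}

/* Divider for 16x oversampling: clock / baud, rounded to nearest. */
uint32_t usart_divider(uint32_t clock_hz, uint32_t baud) {
    assert(baud > 0);
    return (clock_hz + baud / 2u) / baud;
}

/* Baud rate error in parts per million after rounding the divider. */
int32_t usart_baud_error_ppm(uint32_t clock_hz, uint32_t baud) {
    uint32_t div = usart_divider(clock_hz, baud);
    assert(div > 0);
    double actual = (double)clock_hz / (double)div;
    return (int32_t)((actual - (double)baud) * 1e6 / (double)baud);
}
```

At 48MHz and 115200 baud the divider rounds to 417, leaving an error of under 0.1%, comfortably within what a receiver with its own clock can tolerate over a ten bit frame.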

Now that we have this, we can write some helper functions for logging strings to the serial console:

void uart_puts(const char *string) {
    while (*string) {
        usart_send_blocking(USART1, *string++);
    }
}

void uart_putln(const char *string) {
    uart_puts(string);
    uart_puts("\r\n");
}

With this, let’s update our main loop from last time to also log every time we turn the LED either on or off:

int main() {
    // Previously defined clock, GPIO and SysTick setup elided
    // [...]

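    // Initialize our UART
    usart_setup();

    while (true) {
        uart_putln("LED on");
        gpio_set(GPIOA, GPIO11);
        // Wait here, using the delay from last time
        uart_putln("LED off");
        gpio_clear(GPIOA, GPIO11);
        // Wait here, using the delay from last time
    }
}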

Once again, run make flash to compile and upload to your target. Now, take the ground and VCC lines of the serial interface on the bottom of your Black Magic probe, and connect them to the ground / positive rails of your test board. Then connect the RX line on the probe (purple wire) to the TX pin on your board. You can then start displaying the serial output by running

$ screen /dev/ttyACM1 115200

After a couple seconds, you should have a similarly riveting console output:

Basic serial console logs, using screen

This is ok, but what if we want to actually format data into our console logs? If we have size constraints we may roll our own integer/floating point serialization logic, but printf already exists and provides a nice interface - so why not take printf and allow it to write to the serial console?

Replacing POSIX calls

When targeting embedded systems, one tends to compile without linking against a full standard library - since there is no operating system, syscalls such as open, read and exit don’t really make sense. This is part of what is done by linking with -lnosys - we replace these syscalls with stub functions, that do nothing. For example, the POSIX write call eventually calls through to a function with the prototype:

ssize_t _write      (int file, const char *ptr, ssize_t len);

(I believe that this list of prototypes covers the syscalls that can be implemented in this manner). So, since printf will eventually call write, if we re-implement the backing _write method to instead push that data to the serial console, we can effectively redirect stdout and stderr somewhere we can see them - we could even redirect stdout to one USART and stderr to another! But for simplicity, let’s just pipe both to USART1 which we set up earlier:

// Don't forget to allow external linkage if this is C++ code
extern "C" {
    ssize_t _write(int file, const char *ptr, ssize_t len);
}

ssize_t _write(int file, const char *ptr, ssize_t len) {
    // If the target file isn't stdout/stderr, then return an error
    // since we don't _actually_ support file handles
    if (file != STDOUT_FILENO && file != STDERR_FILENO) {
        // Set the errno code (requires errno.h)
        errno = EIO;
        return -1;
    }

    // Keep i defined outside the loop so we can return it
    int i;
    for (i = 0; i < len; i++) {
        // If we get a newline character, also be sure to send the carriage
        // return character first, otherwise the serial console may not
        // actually return to the left.
        if (ptr[i] == '\n') {
            usart_send_blocking(USART1, '\r');
        }

        // Write the character to send to the USART1 transmit buffer, and block
        // until it has been sent.
        usart_send_blocking(USART1, ptr[i]);
    }

    // Return the number of bytes we sent
    return i;
}
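The newline handling is the one piece of logic in there that is easy to get subtly wrong, and it can be tried out on a host machine without any hardware. The sketch below pulls the same translation into a standalone function that writes into a buffer instead of a USART; expand_newlines is a made-up name for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy src to dst, inserting '\r' before every '\n'. Returns the number of
 * bytes written; stops early rather than overflowing dst.
 */
size_t expand_newlines(const char *src, size_t len, char *dst, size_t dst_len) {
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        size_t need = (src[i] == '\n') ? 2 : 1;
        if (out + need > dst_len) {
            break;
        }
        if (src[i] == '\n') {
            dst[out++] = '\r';
        }
        dst[out++] = src[i];
    }
    return out;
}
```

Note that the count returned here is the number of bytes put on the wire, whereas _write reports the number of input bytes consumed, which is what its caller expects.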

Now, we could jazz up that print output from before by prefixing all of our log messages with the current monotonic time:

int main() {
    // Previously defined clock, USART, etc... setup elided
    // [...]

    while (true) {
        printf("[%lld] LED on\n", millis());
        gpio_set(GPIOA, GPIO11);
        // Wait here, as before
        printf("[%lld] LED off\n", millis());
        gpio_clear(GPIOA, GPIO11);
        // Wait here, as before
    }
}

Serial console logs, with formatting

As before, the final source code for this post is available on Github.

In the next post, we will go over SPI and memory-to-peripheral DMA.

Embedded ARM Part 0: STM32 programming with libopencm3

Companion code for this post available on Github

For many years now, I have found myself building (admittedly small) electronics projects, and for almost all of that time I have found myself reaching for the same microcontroller: the humble ATmega328P that powers so many Arduinos (and Arduino clones!). This choice was driven by the fact that being an Arduino-compatible MCU meant I could take advantage of the huge library of code built for the Arduino platform. However, the choice is somewhat limiting - with only 32KiB of flash and 2KiB of sram, some applications were simply infeasible within that range. In addition, the relatively slow 16MHz top speed can make high-data-rate applications, like display driving, somewhat of a challenge.

Luckily, there are a number of increasingly cheap ARM based MCUs on the market. In particular, ST has an excellent lineup, including a ‘value line’ of F0 series MCUs that offer a lot of features (128K flash, 16K sram, multiple SPI / I2C / Timer / DMA peripherals) for relatively cheap. However, a lot of the proprietary development environments (such as the graphical STM32Cube software) seem to me overly complex, and make it difficult to use preferred editors / development toolchains. Happily, there is an open source low-level hardware abstraction library with coverage for the STM32 series of MCUs, called libopencm3. Using this, it is easy to write portable C(++) code that can be deployed using only GNU tools. In this tutorial, I will attempt to cover the basics of starting a new project with libopencm3, compiling, uploading code, and the use of the SysTick peripheral.

The full source code for this tutorial, along with the EAGLE and Gerber files for the breakout board I used, are available in this Github repo.


Since my goal was to be able to use these MCUs in my own designs, I am not using an off-the-shelf development board from ST, but an extremely basic breakout of my own design with nothing but an STM32F070CBT6, a Single-Wire Debug (SWD) header and all IO pins on the MCU broken out to .1” headers. These instructions should however work for any board with an STM32FX part, so long as it has a SWD header.

Breakout board and schematic. Source materials available in Github.

The only other important piece of kit is a Black Magic probe. The BMP is a fantastic little device - it provides a GDB server that makes it simple to upload code to the MCU, debug it, and also has a UART interface so that you can wire it up to get a debug console. It can be found online for ~$60USD at time of writing.


The very first thing we are going to do is acquire a copy of libopencm3 to compile against. In order to make sure that we don’t end up with any dependency drift, I am going to add it as a submodule to our new git project. So let’s start by creating a new project, and cloning it in:

$ mkdir hello_opencm3
$ cd hello_opencm3/
$ git init .
$ git submodule add
$ git commit -m "Import opencm3"

Now, we need to build opencm3 so that we can link against it later. If you do not have GCC for ARM installed, you may need to install it now. For Debian and kin, the following will install everything you might need for this tutorial:

$ sudo apt install {gcc,gdb,libnewlib,libstdc++,binutils}-arm-none-eabi

Now, we can drop into the libopencm3 dir and build. Note that if you have custom CFLAGS or CXXFLAGS set for your host machine, you may need to unset them if they are unsupported by gcc-arm.

$ pushd libopencm3
$ unset CFLAGS && unset CXXFLAGS # You may not need to do this
$ make
$ popd

Now that we have our library ready, it’s time to start writing some code. To begin, let’s just start by blinking the LED I have attached to pin PA11. If you are using a different test board, you may need to sub out GPIOA and GPIO11 to match the port and pin you are using instead.

#include <libopencm3/stm32/rcc.h>
#include <libopencm3/stm32/gpio.h>

int main() {
    // First, let's ensure that our clock is running off the high-speed
    // internal oscillator (HSI) at 48MHz
    rcc_clock_setup_in_hsi_out_48mhz();

    // Since our LED is on GPIO bank A, we need to enable
    // the peripheral clock to this GPIO bank in order to use it.
    rcc_periph_clock_enable(RCC_GPIOA);

    // Our test LED is connected to Port A pin 11, so let's set it as output
    gpio_mode_setup(GPIOA, GPIO_MODE_OUTPUT, GPIO_PUPD_NONE, GPIO11);

    // Now, let's forever toggle this LED back and forth
    while (true) {
        gpio_toggle(GPIOA, GPIO11);
    }

    return 0;
}

This is very basic, but it’s something we can compile to ensure we’re able to load code at all. Later on we will refine this example. But first, we need to build a linker script.

Linker scripts

Once we have compiled our code, we need to specify where in our MCU memory to place both the code and our data. In order to do this, we first need to look up the address space map for our CPU. From the STM32F070 datasheet, we see that the ‘Flash Memory’ block resides at 0x08000000 and is 128KiB in size, while the main ram (SRAM) lives at 0x20000000 and is 16KiB in size. Armed with this information, we can create our linker script with these two memory regions:

MEMORY
{
    rom (rx) : ORIGIN = 0x08000000, LENGTH = 128K
    ram (rwx) : ORIGIN = 0x20000000, LENGTH = 16K
}

INCLUDE cortex-m-generic.ld
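When reading disassembly or a map file later on, it can be handy to check which region an address falls in. The snippet below is a small host-side sketch that mirrors the two regions from the linker script above; the type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t origin;
    uint32_t length;
} region_t;

/* The rom and ram regions, with the same origins and lengths as above. */
static const region_t REGIONS[] = {
    { "rom", 0x08000000u, 128u * 1024u },
    { "ram", 0x20000000u,  16u * 1024u },
};

/* Name of the region containing addr, or a null pointer if it is in neither. */
const char *region_of(uint32_t addr) {
    for (unsigned i = 0; i < sizeof REGIONS / sizeof REGIONS[0]; i++) {
        /* Unsigned wrap-around makes this false for addr below the origin. */
        if (addr - REGIONS[i].origin < REGIONS[i].length) {
            return REGIONS[i].name;
        }
    }
    return 0;
}

/* First address past the end of a region. */
uint32_t region_end(const region_t *r) {
    assert(r != 0);
    return r->origin + r->length;
}
```

The end of the ram region works out to 0x20004000, which is a common place for startup code to put the initial stack pointer, since the stack grows downwards from there.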

OpenCM3 provides the cortex-m-generic.ld script, which takes our rom/ram definitions to locate the stack, as well as specify the locations of the .text, .data, .bss and other sections. It’s fairly short, so it’s worth taking a look if you’re curious about the details of how sections are laid out in memory.


Now that we have our device-specific linker data, we can get to compiling. I’ve posted this example project on Github including the full Makefile that defines the build rules necessary to compile our project. The core variables in that makefile are

  • OBJS: The list of .o files to compile. Since we only have the one source file, main.cpp, we have only one object file, main.o.
  • OPENCM3_DIR: The location of opencm3 relative to our project. In this case, it is in the same directory.
  • LDSCRIPT: The linker script to use. Set this to the one we made earlier.
  • LIBNAME and DEPS: The precise chip variant that we are compiling for. This controls which includes are matched from opencm3, and which support library is linked against.

To try and keep this short, I won’t go over the entire makefile here. However, here are the two commands, in full, used to compile and link our simple binary. I have gone through and annotated each flag, which will hopefully explain why they are present.

$ arm-none-eabi-g++ \
    -Os \ # Optimize for code size
    -ggdb3 \ # Include additional debug info
    -mthumb \ # Use thumb mode (smaller instructions)
    -mcpu=cortex-m0 \ # Target CPU is a Cortex M0 series
    -msoft-float \ # Software floating-point calculations
    -Wall \ # Enable most compiler warnings
    -Wextra \ # Enable extra compiler warnings
    -Wundef \ # Warn about undefined identifiers used in #if directives
    -Wshadow \ # Enable warnings when variables are shadowed
    -Wredundant-decls \ # Warn if anything is declared more than once
    -Weffc++ \ # Warn about structures that violate effective C++ guidelines
    -fno-common \ # Do not pool globals, generates error if global redefined
    -ffunction-sections \ # Generate separate ELF section for each function
    -fdata-sections \ # Enable elf section per variable
    -std=c++11 \ # Use C++11 standard
    -MD \ # Write .d files with dependency data
    -DSTM32F0 \ # Define the series of MCU we are using. This controls the
              \ # include pathways inside libopencm3.
    -I./libopencm3/include \ # Include the opencm3 header files
    -o main.o \ # Output filename
    -c main.cpp # Input file to compile

$ arm-none-eabi-gcc \
    --static \ # Don't try and use shared libraries
    -nostartfiles \ # No standard lib startup files
    -Tstm32f0.ld \ # Use the linker script we defined earlier
    -mthumb \ # Thumb instruction set for smaller code
    -mcpu=cortex-m0 \ # Cortex M0 series MCU
    -msoft-float \ # Software floating point
    -ggdb3 \ # Include detailed debug info
    -Wl, \ # Generate a memory map in ''
    -Wl,--cref \ # Output a cross-reference table to the mem map
    -Wl,--gc-sections \ # Perform dead-code elimination
    -L./libopencm3/lib \ # Link against opencm3 libraries
    main.o \ # Our input object(s)
    -lopencm3_stm32f0 \ # The opencm3 library for our chip
    \ # Begin and end group here allow resolving of circular dependencies
    \ # while linking. This links against some common libs:
    \ # -lc: the standard C library
    \ # -lgcc: GCC-provided subroutines
    \ # -lnosys: Stub library with empty definitions for POSIX functions
    -Wl,--start-group -lc -lgcc -lnosys -Wl,--end-group \
    -o main.elf # Our output binary

Once I’ve built my code, I also like to print a quick summary of the size so that I can track how much of my flash and memory is currently consumed. For this, we can use the arm-none-eabi version of size:

$ arm-none-eabi-size main.elf
  text    data     bss     dec     hex filename
  7832    2120      60   10020    2724 main.elf

So far we’re already using about 7.6KiB of flash for code, plus roughly 2KiB of initialized data, which takes up space in both flash (for the initial values) and RAM. Only 60 bytes of zero-initialized globals, however! Now that we have our binary, it’s time to flash it to our microcontroller.
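To make the size output a little more concrete, here is a minimal sketch of how those three columns map onto the two memory regions. The helper names are hypothetical; the only assumption is the usual layout where .data is stored in flash and copied into RAM at startup, while .bss lives only in RAM.

```c
#include <stdint.h>

// Flash holds the code (.text) plus the initial values of .data.
uint32_t flash_used(uint32_t text, uint32_t data) {
    return text + data;
}

// RAM holds .data (copied from flash at startup) plus zero-filled .bss.
// The stack is not included in this figure.
uint32_t ram_used(uint32_t data, uint32_t bss) {
    return data + bss;
}

// Remaining bytes in a region, or 0 if it is already over-full.
uint32_t bytes_free(uint32_t capacity, uint32_t used) {
    return used > capacity ? 0 : capacity - used;
}
```

With the numbers above, flash_used(7832, 2120) gives 9952 bytes out of 128KiB, and ram_used(2120, 60) gives 2180 bytes out of 16KiB, before accounting for the stack.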


In order to load this code, we’re going to turn to GDB. Since our Black Magic Probe runs a GDB server, we can use it as a remote target. First, ensure that the 10-pin SWD cable is connected to your board in the correct orientation. You must also be sure to supply power to your board - the BMP will not supply power to the board by default. Once you’ve done this, we can open up arm-none-eabi-gdb:

$ arm-none-eabi-gdb main.elf
# First, we need to tell GDB to use our BMP as a remote target.
# /dev/ttyACM0 is where it shows up on my machine, but YMMV
(gdb) target extended-remote /dev/ttyACM0
# Now that we are connected, we need to scan for devices to attach to over
# Serial Wire Debug
(gdb) monitor swdp_scan
Target voltage: 3.3V
Available Targets:
No. Att Driver
 1      STM32F07
# Perfect - our chip is powered and recognizable to GDB.
# Now we can attach to it
(gdb) attach 1
# Now that we're connected, we can upload our code
(gdb) load
Loading section .text, size 0x1d5c lma 0x8000000
Loading section .ARM.extab, size 0x30 lma 0x8001d5c
Loading section .ARM.exidx, size 0xd0 lma 0x8001d8c
Loading section .data, size 0x848 lma 0x8001e5c
Start address 0x800033c, load size 9892
Transfer rate: 13 KB/sec, 760 bytes/write.
# We can now run the program as though it were a local binary
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: main.elf

This series of steps has been formalized in the gdb script bmp_flash.scr in the Github repo for this tutorial, and can be invoked using the flash make target.

You should see that the LED has appeared to turn on solidly. Unfortunately, at 48MHz, it’s a little hard for the human eye to notice it pulsing. Let’s make it a little more obvious by pulsing it on and off each second, and to do so we’ll use our first proper peripheral: the SysTick timer.


The SysTick timer on the STM32 series of MCUs is a simple countdown timer, which triggers an interrupt every time the counter reaches zero, then reloads the counter value from a specified reload register. Since we know that our CPU frequency is 48MHz, if we want to get an approximately real time clock pulse every millisecond, we can configure our systick timer to reload with a value of 48000000 / 1000 = 48000. Then, every time the interrupt fires, we can increment a ‘milliseconds’ counter that we can then check against to implement a delay operation.
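The reload arithmetic can be sketched as a small standalone helper. This is only an illustration: the function name and the sentinel value are assumptions of mine, not part of OpenCM3. It subtracts one because the counter runs from the reload value down to zero inclusive, and it rejects values that do not fit in the 24-bit SysTick reload register.

```c
#include <stdint.h>

// The SysTick reload register is 24 bits wide.
#define SYSTICK_MAX_RELOAD 0x00FFFFFFu
// Hypothetical sentinel returned when no valid reload value exists.
#define SYSTICK_RELOAD_INVALID 0xFFFFFFFFu

// Computes the reload value needed to get tick_hz interrupts per second
// from a counter clocked at clock_hz.
// The counter counts reload, reload - 1, ..., 0, so one full period lasts
// reload + 1 clock cycles; hence the "- 1".
uint32_t systick_reload_for(uint32_t clock_hz, uint32_t tick_hz) {
    if (tick_hz == 0 || clock_hz < tick_hz) {
        return SYSTICK_RELOAD_INVALID;
    }
    uint32_t cycles_per_tick = clock_hz / tick_hz;
    if (cycles_per_tick - 1 > SYSTICK_MAX_RELOAD) {
        return SYSTICK_RELOAD_INVALID;
    }
    return cycles_per_tick - 1;
}
```

For our 48MHz clock and a 1kHz tick, systick_reload_for(48000000, 1000) gives 47999. Asking for a 1Hz tick directly fails, since 47999999 does not fit in 24 bits, which is one reason to count milliseconds in software instead.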

// I'll be using fixed-size types
#include <stdint.h>

// Include the systick header
#include <libopencm3/cm3/systick.h>

// Important note for C++: in order for interrupt routines to work, they MUST
// have the exact name expected by the interrupt framework. However, C++ will
// by default mangle names (to allow for function overloading, etc) and so if
// left to its own devices will cause your ISR to never be called. To force it
// to use the correct name, we must declare the function inside an 'extern C'
// block, as below.
// For more details, see
extern "C" {
    void sys_tick_handler(void);
}

// Storage for our monotonic system clock.
// Note that it needs to be volatile since we're modifying it from an interrupt.
static volatile uint64_t _millis = 0;

static void systick_setup() {
    // Set the systick clock source to our main clock
    systick_set_clocksource(STK_CSR_CLKSOURCE_AHB);
    // Clear the Current Value Register so that we start at 0
    STK_CVR = 0;
    // In order to trigger an interrupt every millisecond, we can set the reload
    // value to be the speed of the processor / 1000 - 1
    systick_set_reload(rcc_ahb_frequency / 1000 - 1);
    // Enable interrupts from the system tick clock
    systick_interrupt_enable();
    // Enable the system tick counter
    systick_counter_enable();
}

// Get the current value of the millis counter
uint64_t millis() {
    return _millis;
}

// This is our interrupt handler for the systick reload interrupt.
// The full list of interrupt service routines that can be implemented is
// listed in libopencm3/include/libopencm3/stm32/f0/nvic.h
void sys_tick_handler(void) {
    // Increment our monotonic clock
    _millis++;
}

// Delay a given number of milliseconds in a blocking manner
void delay(uint64_t duration) {
    const uint64_t until = millis() + duration;
    while (millis() < until);
}
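One caveat worth sketching: on a 32-bit core, the compiler may split a read of a 64-bit counter into two 32-bit loads, and an interrupt landing between them could in principle produce a torn value. A common way to guard against that, shown here as a hedged sketch with a stand-in variable rather than the real counter, is to read twice and retry until both reads agree. The names are hypothetical.

```c
#include <stdint.h>

// Stand-in for the _millis counter that the SysTick handler increments.
volatile uint64_t tick_count = 0;

// Reads the counter twice and retries until both reads agree, so that an
// interrupt landing between the two halves of a 64-bit read is detected.
uint64_t millis_consistent(void) {
    uint64_t first;
    uint64_t second;
    do {
        first = tick_count;
        second = tick_count;
    } while (first != second);
    return first;
}
```

Since the interrupt only fires once per millisecond, the loop will almost always finish on its first pass. The other common option is to briefly disable interrupts around the read.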

Once we have this framework, we can update our main from before to make our blinking more obvious.

int main() {
    // Previously defined clock and GPIO setup elided
    // [...]

    // Initialize our systick timer
    systick_setup();

    while (true) {
        gpio_set(GPIOA, GPIO11);
        delay(1000);
        gpio_clear(GPIOA, GPIO11);
        delay(1000);
    }
}

Once we’ve made these changes, we can run make flash to compile and upload our program. We should now see our LED blink at a nice visible rate.

Expected result: a 0.5Hz blink.

In the next post, we take a look at alternate functions, USART, and redirecting printf to serial console.

Erlang Liveness Checks in Kubernetes

I have an ever-increasing number of small projects and deployments that I use either internally or with some availability to the public, and have been relying on Kubernetes to make managing them easy. Not too long ago, I started adding a liveness probe to each pod definition as a contingency against a hung runtime. My pod definition at the time looked like this example from my Aflame project:

- name: aflame
  image: docker-registry:5000/erlang-aflame
  livenessProbe:
    exec:
      command:
        - /deploy/bin/aflame
        - ping
    initialDelaySeconds: 5

This worked fine, and I was able to verify that nodes that failed the ping test would be taken down. However, some weeks later I was poking around and realized that CPU utilization on my server was significantly higher than I would expect.

Note the recent spike in CPU utilization

Looking at htop, it was difficult to see a precise culprit, except for a large number of processes called erl_child_setup coming into existence, pegging a CPU core and disappearing again. After googling around, I landed on the source code for this task and found this section of main:

/* We close all fds except the uds from beam.
   All other fds from now on will have the
   CLOEXEC flags set on them. This means that we
   only have to close a very limited number of fds
   after we fork before the exec. */
#if defined(HAVE_CLOSEFROM)
    closefrom(4);
#else
    for (i = 4; i < max_files; i++)
#if defined(__ANDROID__)
        if (i != system_properties_fd())
#endif
        (void) close(i);
#endif

According to the ps output on my system, max_files was getting set to 1,048,576 - so every time this program ran, it was hot looping over a million possible file descriptors and calling close on each! No wonder it was resulting in so much system time. But what was actually causing all of these erl_child_setup calls? I had initially suspected misbehaving deployment code, but the spike in load lined up with when I added the liveness checks. It turns out that the relx ping command is surprisingly heavyweight; or at least ends up that way when run in an environment with a very high max file count, which was the case under docker.
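To put a number on that loop, here is a minimal C sketch that counts the close calls for a given max_files, and reads the current process's own descriptor limit for comparison. The function names are hypothetical, and I am assuming max_files tracks the process's open file limit; the snippet above does not show where it comes from.

```c
#include <sys/resource.h>

// Number of close() calls made by a loop of the form
//     for (i = 4; i < max_files; i++) close(i);
long close_calls_for(long max_files) {
    return max_files > 4 ? max_files - 4 : 0;
}

// Soft limit on open file descriptors for the current process,
// or -1 if it cannot be read or is unlimited.
long current_fd_soft_limit(void) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0 || lim.rlim_cur == RLIM_INFINITY) {
        return -1;
    }
    return (long) lim.rlim_cur;
}
```

With max_files at 1,048,576, close_calls_for reports 1,048,572 system calls per invocation, nearly all of them on descriptors that were never open. With a limit of 128 it is just 124.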

The fix

In order to fix this, I altered my liveness probe to run the ping check with a low ulimit on the number of open files. We still need to loop, but we are at least looping over a considerably more restrained number of descriptors. I also took the opportunity to increase the interval between checks, since the default check period is rather quick.

 - name: aflame
   image: docker-registry:5000/erlang-aflame
   livenessProbe:
     exec:
       command:
+        - softlimit
+        - -o
+        - "128"
         - /deploy/bin/aflame
         - ping
     initialDelaySeconds: 5
+    periodSeconds: 60

Note that in order to get softlimit in your container, you may need to rebuild with the daemontools package installed, or another source that contains a limit utility.
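If no such utility is available, the core of what it does here is small. The following is a rough sketch, not the daemontools implementation: it lowers the soft limit on open files for the current process using the POSIX getrlimit and setrlimit calls, after which a wrapper would exec the real command. The function name is hypothetical.

```c
#include <sys/resource.h>

// Lowers the soft limit on open file descriptors to at most max_fds,
// leaving the hard limit untouched. Never raises an existing lower limit.
// Returns the soft limit now in effect, or -1 on failure.
long limit_open_files(rlim_t max_fds) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        return -1;
    }
    if (lim.rlim_cur == RLIM_INFINITY || lim.rlim_cur > max_fds) {
        lim.rlim_cur = max_fds;
        if (setrlimit(RLIMIT_NOFILE, &lim) != 0) {
            return -1;
        }
    }
    return (long) lim.rlim_cur;
}
```

A wrapper program would call limit_open_files(128) and then hand over to the ping command with one of the exec functions, so that the child inherits the reduced limit.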

Noticeably reduced CPU usage

With these changes, system load rapidly dropped from nearly 20 to a more reasonable 3. The default performance of this wrapper would likely be helped by an adoption of the closefrom syscall into the Linux kernel, but unfortunately the only references I can find to this are a pessimistic ticket from 2009 and an unmerged patch from Zheng Liu in 2014.

Respecting the scheduler in Erlang NIFs

Companion code for this post available on Github

Recently, I did a small writeup on creating natively implemented functions (NIFs) in Erlang. But as was brought up by several people on reddit and, that example did not account for any sort of coöperative scheduling with the VM.

The BEAM has one scheduler per CPU core, and these (and, in later versions of Erlang, the dirty schedulers) are the executors on which all code is run. In order for the BEAM to try and guarantee a soft-real-time execution environment, it needs to be able to track how much work each process has done, so that it can keep CPU hogs in check and not starve other processes. To do this, it uses the concept of reductions. Each time a statement is evaluated (reduced), the number of reductions performed by that process is incremented, until it reaches a set limit and the process is context switched out.

If we take a look at the erl_nif man page, we can see that there is one method that deals with accounting for NIF time: enif_consume_timeslice. This method expects to be called relatively regularly from your NIF, with an argument that is the percentage of a time slice that you believe you have used since the last time you called that method. The idea of a time slice is somewhat lackluster in its specificity:

The time is specified as a percent of the timeslice that a process is allowed to execute Erlang code until it can be suspended to give time for other runnable processes. The scheduling timeslice is not an exact entity, but can usually be approximated to about 1 millisecond.

If we take a look at the actual implementation we can see what’s actually happening here:

int enif_consume_timeslice(ErlNifEnv* env, int percent) {
    Process *proc;
    Sint reds;

    execution_state(env, &proc, NULL);

    ASSERT(is_proc_bound(env) && percent >= 1 && percent <= 100);
    if (percent < 1) percent = 1;
    else if (percent > 100) percent = 100;

    reds = ((CONTEXT_REDS+99) / 100) * percent;
    ASSERT(reds > 0 && reds <= CONTEXT_REDS);
    BUMP_REDS(proc, reds);
    return ERTS_BIF_REDS_LEFT(proc) == 0;
}

When we call this method, we increase the number of reductions this process has executed by the given percentage of CONTEXT_REDS (the number of reductions a process may perform before it should be context-switched out). The return value is nonzero once the process has no reductions left, which is our cue to yield.
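The arithmetic is easy to try out in isolation. Here is a small sketch that mirrors the calculation, with the reduction budget passed in as a parameter rather than taken from the CONTEXT_REDS macro. The function name is mine, and the budget of 4000 used in the note below is only an example value; the real one depends on your Erlang/OTP version.

```c
// Mirrors the arithmetic in enif_consume_timeslice: clamp the percentage
// to 1..100, then charge that many hundredths of the reduction budget,
// with the per-percent amount rounded up.
long reds_for_percent(long context_reds, int percent) {
    if (percent < 1) {
        percent = 1;
    } else if (percent > 100) {
        percent = 100;
    }
    return ((context_reds + 99) / 100) * percent;
}
```

With a budget of 4000, reporting 50 percent charges 2000 reductions, and any value of 100 or more charges the full 4000, which exhausts the slice in one call.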

So far so good. But now, we need to see how we can actually restructure our code to allow it to be context switched out. The best way to do this is to use the primitive enif_schedule_nif, which allows you to specify a function pointer and arguments that should be called to continue your calculations. Note that the NIF scheduled in this way does not need to be exported - meaning that it is fairly safe to create helper NIFs under the assumption that end users will not try and call them. So let’s take a look at how we might re-write our Levenshtein example to allow it to perform work in chunks:

// The first thing you'll want to do is create a struct to maintain any
// internal state you want to pass between calls to your NIF.
// In this case, I need to keep track of my matrix, the input strings,
// and which x, y position I had iterated up to.
struct LevenshteinState {
    // The matrix being used to calculate the distance
    unsigned int *matrix;

    // The input strings + their sizes
    unsigned char *s1;
    unsigned s1len;
    unsigned char *s2;
    unsigned s2len;

    // The index of the last processed row of the matrix,
    // so that the next iteration can pick up where we left off
    unsigned int lastX;
    unsigned int lastY;
};

// Now, let's rewrite our entry point so that all it does is read in the
// command arguments, validate them and then yield a call to our internal NIF
// that will do all the actual work.
static ERL_NIF_TERM erl_levenshtein(ErlNifEnv* env, int argc, const
                                    ERL_NIF_TERM argv[]) {
    // Not pictured: verifying argc, the type of the arguments,
    // and casting the binaries to ErlNifBinary structs.

    // Retrieve the state resource descriptor from our priv data,
    // and allocate a new structure
    // PrivData here is a custom struct, initialized during our module load
    // callback. See the code on github for the full implementation with
    // regards to this.
    struct PrivData *priv_data = enif_priv_data(env);
    struct LevenshteinState* state = enif_alloc_resource(
        priv_data->levenshtein_state_resource, // Resource type (field name assumed here)
        sizeof(struct LevenshteinState)
    );

    //// Initialize the calculation state
    // Allocate our matrix
    size_t matrix_size = (
        sizeof(unsigned int) * (binary1.size + 1) * (binary2.size + 1)
    );
    state->matrix = malloc(matrix_size);
    // Copy the binary term info
    state->s1 = binary1.data;
    state->s1len = binary1.size;
    state->s2 = binary2.data;
    state->s2len = binary2.size;
    // Set our initial X and Y values
    state->lastX = 1;
    state->lastY = 1;

    // In the full version, here is where I also initialize the first row and
    // column of the matrix, in order to simplify the code in the helper NIF.

    // Now that the setup is complete, we can call enif_schedule_nif to
    // tell the BEAM our continuation function.
    // First, we need to wrap our data resource so that it can be passed
    // through the BEAM. enif_make_resource takes our state pointer and returns
    // an ERL_NIF_TERM.
    ERL_NIF_TERM state_term = enif_make_resource(env, state);
    // The NIF name here does not seem to be used for determining what code to
    // call, and is likely only used when debugging what code is running.
    return enif_schedule_nif(
        env, // NIF env
        "levenshtein_yielding", // NIF to call
        0, // Flags
        erl_levenshtein_yielding, // Function pointer
        1, // Argc
        &state_term // Argv
    );
}
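Stripped of the NIF machinery, the underlying pattern is a loop that saves its position when its budget runs out and picks up from there on the next call. The following is a minimal, self-contained sketch of that idea; the struct and function names are hypothetical and the per-cell work is left as a placeholder counter.

```c
#include <stdbool.h>
#include <stddef.h>

// Hypothetical work state: a grid of rows x cols cells to visit, plus
// the position to resume from on the next call.
struct GridCursor {
    size_t rows;
    size_t cols;
    size_t next_row;
    size_t next_col;
    unsigned long visited; // Total cells processed so far
};

// Processes at most `budget` cells, then returns.
// Returns true once every cell has been visited.
bool grid_step(struct GridCursor *c, unsigned long budget) {
    if (c->cols == 0) {
        return true; // Nothing to do for an empty grid
    }
    while (budget > 0 && c->next_row < c->rows) {
        c->visited++; // Placeholder for the real per-cell work
        budget--;
        // Advance the cursor, wrapping to the next row at the end of a row
        if (++c->next_col == c->cols) {
            c->next_col = 0;
            c->next_row++;
        }
    }
    return c->next_row >= c->rows;
}

// Drives grid_step to completion and reports how many calls it took.
unsigned long grid_calls_needed(size_t rows, size_t cols, unsigned long budget) {
    struct GridCursor c = { rows, cols, 0, 0, 0 };
    unsigned long calls = 1;
    if (budget == 0) {
        return 0; // A zero budget would never make progress
    }
    while (!grid_step(&c, budget)) {
        calls++;
    }
    return calls;
}
```

A 3 by 4 grid with a budget of 5 cells per call takes three calls to finish. In the NIF, the role of the driver loop is played by the BEAM, which calls the continuation again each time one is scheduled.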

Now that the glue code is out of the way, we can create our erl_levenshtein_yielding method, which for workloads greater than a millisecond we can expect will be called multiple times for the given input. It will take a single argument, our wrapped state from before, unwrap it, and continue wherever the previous call left off.

static ERL_NIF_TERM erl_levenshtein_yielding(ErlNifEnv* env, int argc,
                                             const ERL_NIF_TERM argv[]) {
    // Not pictured: argc check

    // Extract the state term. In the same way we wrapped it before, we need
    // to now unwrap the resource we used to pass our struct through.
    struct PrivData *priv_data = enif_priv_data(env);
    struct LevenshteinState* state;
    if (!enif_get_resource(env, argv[0],
                           priv_data->levenshtein_state_resource, // Resource type (field name assumed here)
                           ((void*) (&state)))) {
        return mk_error(env, "bad_internal_state");
    }

    // Start processing wherever the previous slice left off
    const unsigned int xsize = state->s1len + 1;
    unsigned int x = state->lastX;
    unsigned int y = state->lastY;

    // Specs for tracking function runtime
    struct timespec start_time;
    struct timespec current_time;

    // Grab the function start time
    clock_gettime(CLOCK_MONOTONIC, &start_time);

    // Create a tracker for the number of loop iterations we've done.
    // This operation count will act as a punctuator for us to check
    // whether it's time for us to yield again.
    unsigned long operations = 0;

    // This is a bit slimy, but is the simplest way to preload
    // the x and y loop vars for the first inner loop iteration
    goto loop_inner;

    // Loop over the matrix
    for (x = state->lastX; x <= state->s2len; x++) {
        for (y = 1; y <= state->s1len; y++) {
loop_inner:
            // Ordinary Levenshtein implementation
            MATRIX_ELEMENT(state->matrix, xsize, x, y) = MIN3(
                MATRIX_ELEMENT(state->matrix, xsize, x-1, y) + 1,
                MATRIX_ELEMENT(state->matrix, xsize, x, y-1) + 1,
                MATRIX_ELEMENT(state->matrix, xsize, x-1, y-1) +
                    (state->s1[y-1] == state->s2[x-1] ? 0 : 1)
            );

            // For each cell, increment the op count until we hit our
            // check threshold.
            if (unlikely(operations++ > OPERATIONS_BETWEN_TIMECHEKS)) {
                // When we do, get the current time
                clock_gettime(CLOCK_MONOTONIC, &current_time);

                // Figure out how many nanoseconds have passed since we started
                // calculating
                unsigned long nanoseconds_diff = (
                    (current_time.tv_nsec - start_time.tv_nsec) +
                    (current_time.tv_sec - start_time.tv_sec) * 1000000000
                );

                // Convert that to a percentage of a timeslice, assuming that
                // a time slice is 1 millisecond.
                int slice_percent = (nanoseconds_diff * 100) / TIMESLICE_NANOSECONDS;

                // enif_consume_timeslice requires a percentage in the range
                // 1 <= timeslice <= 100
                if (slice_percent < 1) {
                    slice_percent = 1;
                } else if (slice_percent > 100) {
                    slice_percent = 100;
                }

                // Consume that amount of a timeslice.
                // If the result is 1, then we have consumed the entire slice and
                // should now yield.
                if (enif_consume_timeslice(env, slice_percent)) {
                    // Break out of both loops
                    goto loop_exit;
                }

                // If we're not done, shift the times over and keep looping
                start_time.tv_sec = current_time.tv_sec;
                start_time.tv_nsec = current_time.tv_nsec;
                operations = 0;
            }
        }
    }

loop_exit:
    // If we exited the loop via jump, we must have run out of time
    // in this slice. Update our state and yield the next cycle.
    if (likely(x <= state->s2len || y <= state->s1len)) {
        // Update the state with the next row value to process
        state->lastX = x;
        state->lastY = y;

        // Yield another call to ourselves.
        // We can re-use our argv, since we're reusing the same state struct.
        return enif_schedule_nif(
            env, // NIF env
            "levenshtein_yielding", // NIF to call
            0, // Flags
            erl_levenshtein_yielding, // Function pointer
            1, // Argc
            argv // Argv
        );
    }

    // If we are done, grab the result
    unsigned int result = MATRIX_ELEMENT(
        state->matrix, xsize, state->s2len, state->s1len);

    // We've finished, so it's time to free the work state.
    free(state->matrix);
    enif_release_resource(state);

    // Return the calculated value
    return enif_make_int(env, result);
}
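The timing arithmetic in the middle of that loop is the easiest part to get subtly wrong, so here it is pulled out as a standalone sketch. The function names and the TIMESLICE_NS constant are assumptions for illustration (the 1 millisecond figure is the approximation quoted from the documentation earlier); the full source defines its own TIMESLICE_NANOSECONDS.

```c
#include <time.h>

// Assumed timeslice length of 1 millisecond, in nanoseconds.
#define TIMESLICE_NS 1000000LL

// Nanoseconds elapsed between two CLOCK_MONOTONIC readings, end >= start.
// The tv_nsec difference may be negative when a second boundary is
// crossed; adding the whole-second term corrects for that.
long long elapsed_ns(struct timespec start, struct timespec end) {
    return (long long) (end.tv_sec - start.tv_sec) * 1000000000LL
        + (long long) (end.tv_nsec - start.tv_nsec);
}

// Elapsed time as a percentage of one timeslice, clamped to the 1..100
// range that enif_consume_timeslice expects.
int slice_percent(struct timespec start, struct timespec end) {
    long long percent = (elapsed_ns(start, end) * 100) / TIMESLICE_NS;
    if (percent < 1) {
        return 1;
    }
    if (percent > 100) {
        return 100;
    }
    return (int) percent;
}
```

Half a millisecond of work comes out as 50 percent, anything under 10 microseconds is rounded up to 1 percent, and anything over a millisecond is capped at 100.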

Now that we’ve added all that complexity, let’s see whether it was worth it. The hypothesis is that without the timeslice accounting, we would hog the scheduler and not allow other processes to run in time. So to test that, let’s create a process that tries to sleep for exactly one second, and prints how much over/under one second it actually slept for. While it’s running, we’ll also saturate all of our cores with processes that do nothing but run Levenshtein on large inputs. Here’s how we’ll do it:

realtime_test() ->
    % Allocate two large binaries
    A = << <<0>> || _ <- lists:seq(1, 10000) >>,
    B = << <<1>> || _ <- lists:seq(1, 10000) >>,

    % Create a printer process that tries to print regularly
    _Printer = spawn_link(fun() -> realtime_printer(os:system_time()) end),

    % Create enough adversarial worker processes to saturate all cores
    _Workers = [
        spawn_link(fun() -> realtime_worker(A, B) end)
        || _ <- lists:seq(1, erlang:system_info(logical_processors_available))
    ].

% Spins forever, running our NIF on the input strings
realtime_worker(A, B) ->
    levenshtein:levenshtein(A, B),
    realtime_worker(A, B).

% Attempt to run exactly every second, and print how much we were off by.
realtime_printer(LastRan) ->
    Delta = os:system_time() - LastRan,
    DeltaMs = Delta / 1000000,
    Jitter = 1000000000 - Delta,
    JitterMs = Jitter / 1000000,
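    io:format("Time since last schedule: ~p ms, Jitter: ~p ms~n", [
        DeltaMs, abs(JitterMs)
    ]),
    % Note the time, sleep for one second, then go around again
    Now = os:system_time(),
    timer:sleep(1000),
    realtime_printer(Now).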

First, as a baseline, let’s actually run this with an entirely Erlang version of Levenshtein to see what amount of jitter we should expect:

2> perftest:realtime_test().
Time since last schedule: 1004.905694 ms, Jitter: -4.905694 ms
Time since last schedule: 1003.176042 ms, Jitter: -3.176042 ms
Time since last schedule: 1003.292757 ms, Jitter: -3.292757 ms
Time since last schedule: 1003.264791 ms, Jitter: -3.264791 ms

As expected, we have a fairly low amount of deviation from our expected one second print loop. Now let’s see how it looks with our scheduler-friendly NIF implementation:

1> % With the yielding NIF implementation
1> perftest:realtime_test().
Time since last schedule: 1002.378093 ms, Jitter: -2.378093 ms
Time since last schedule: 1002.232311 ms, Jitter: -2.232311 ms
Time since last schedule: 1003.469838 ms, Jitter: -3.469838 ms
Time since last schedule: 1002.724563 ms, Jitter: -2.724563 ms
Time since last schedule: 1002.120373 ms, Jitter: -2.120373 ms
Time since last schedule: 1003.110727 ms, Jitter: -3.110727 ms
Time since last schedule: 1002.924888 ms, Jitter: -2.924888 ms
Time since last schedule: 1002.408802 ms, Jitter: -2.408802 ms
Time since last schedule: 1002.575524 ms, Jitter: -2.575524 ms

Hardly any difference! Looks like our time slice management is enough to keep call latencies in check. Let’s see what happens when we run this test case on our previous non-yielding NIF implementation:

1> perftest:realtime_test().
[shell becomes unresponsive]

In the end I wasn’t able to compare the jitter between this scheduler-friendly implementation and the older implementation, because the older version completely hogs the scheduler, rendering the shell entirely inoperable. So I think we can consider this a good reason to make NIFs that perform heavy lifting report their status!

While much better at respecting the soft-realtime nature of the BEAM, the naïve implementation here adds a lot of calls to clock_gettime and the extra overhead of having to yield. If we compare the performance against the old version, we do have a performance decrease:

1> perftest:perftest(100000, fun levenshtein:unfair_levenshtein/2).
2> perftest:perftest(100000, fun levenshtein:yielding_levenshtein/2).

Our unfair code is approximately 1.4x the speed of the fair code. A noticeable amount, but worth it if you are running this calculation in an environment that also has time-sensitive processes you do not want to disrupt.

As before, a full working copy of the source code can be seen on Github.
