Recently, profiling one of my programs revealed that printing integers took a majority of the execution time. If I wanted it to run faster, I needed to print integers faster! While planning for this, I was heavily inspired by a 2012 guide to optimization in C/C++ written by Facebook engineering that happened to cover the same problem, but I wanted to develop my own approach from scratch.

To lay down the facts, printing an integer comprises two separate tasks. One, we need to know how many digits the integer has in decimal. Two, we need to calculate each digit.

### Calculating the number of digits

Looking at a decimal 10, you can just count the two digits, but again counting two for a binary 0x0A is not so obvious. Instead, I used a numerical test.

By continuously comparing the number against, say, 10 then 100 then 1000 then so on, one can eventually find the number of digits. That said, starting from a large power of ten then working down also works so long as the number being tested is not larger. Given this flexibility, I chose a binary search: assuming no more than ten digits–the range of uint32_t–I started from 100000 and continuously split the possibilities from there.

int count_digits(uint32_t num){
if(num < 100000)
if(num < 1000)
if(num < 100)
if(num < 10) return 1;
else return 2;
else return 3;
else
if(num < 10000) return 4;
else return 5;
else
if(num < 10000000)
if(num < 1000000) return 6;
else return 7;
else
if(num < 100000000) return 8;
else
if(num < 1000000000) return 9;
else return 10;
}


Though it looks convoluted, it guarantees that we will never need more than four comparisons. However, note that a number of 3, 4, 5, 6, 7, or 8 digits will complete in only three. Because ten isn’t a power of two, we had to decide which numbers to privilege over others.

### Converting from binary to decimal

If, say, we knew that we only needed to convert one digit, nothing beats

arr[0] = num + '0';


because it’s just a single addition. However, this doesn’t expand to any more than one. If we wanted to use this approach to convert three digits, we would need to isolate the hundreds, tens, and ones place.

arr[0] = num/100%10 + '0';
arr[1] = num/10%10 + '0';
arr[2] = num%10 + '0';


And this would require a modulo and division per digit. In fact, these three lines represent way more work than they seem to because modulo and division are very slow operations. In the end, one of the fastest ways to calculate digits is to use a lookup table instead, just as if we wanted to bypass some other kind of heavy logic.

Because a lookup table is just a large array in memory, we would be at the mercy of memory speed, and accessing the RAM is possibly even slower than just doing the math! So, the lookup table absolutely must fit in the CPU cache.

Let’s say that we have only 16KB of CPU cache. With so little cache, we can build a lookup table that converts three digits–that’s three chars per number for 1000 numbers (000 to 999), or 3KB.

void lookup_3_digits(char *arr, int num){
static const char table[3*1000] =
"000" "001" "002" "003" "004" "005" "006" "007" "008" "009"
...
; // Table abbreviated for readabilty
memcpy(arr, table+3*num, 3);
}


### Putting it together

Unfortunately, any more digits and the lookup table would no longer fit in the cache, and that’s no good if we want speed. Where does that leave printing larger numbers, then? We need to use modulo and division to isolate three digits at a time. (Three digits happens to be the fastest option for our hypothetical system because the lookup table still fits in the cache and we minimize here the number of times we hit it.) Specifically, we pop three digits at a time, starting from the least significant, until we reach nothing. This sometimes tells us if we can stop converting early.

int fast_print_int(char *buf, uint32_t num){
char temp_buf[12];
char *temp_buf_ptr = temp_buf+12;
int n_digits = 0;

if(num == 0){ // Edge case, never enters for loop if num == 0
temp_buf_ptr -= 3;
lookup_3_digits(temp_buf_ptr, 0);
n_digits = 1;
}
else{
for(long working_num = num; working_num != 0; working_num /= 1000){
temp_buf_ptr -= 3;
lookup_3_digits(temp_buf_ptr, working_num % 1000);
}
n_digits = count_digits(num);
}

memcpy(buf, temp_buf+12-n_digits, n_digits);
buf[n_digits] = '\0';
return n_digits;
}


However, a side effect of converting many digits at a time is that some numbers get printed with leading zeroes. There are a couple ways to deal with this, but here is the simplest: we build the ASCII number only in a temporary buffer first, then we use the number of digits we calculated to find the exact position where copying to the original buffer should start.

### The results

So far we have provided three functions:

• count_digits, which counts the number of digits,
• lookup_3_digits, which converts an integer under 1000 to a three-digit decimal number in ASCII, and
• fast_print_int, which puts the above two together.

And this is everything we need to print integers fast. How does this compare against sprintf, the normal way to do it? To test it, I printed every number from 0 to 100000000 ten times on an i5-2500K.

[email protected]:~/PROJECTS/lookup_benchmark$make test gcc test.c lookup_c_str.c -o test -Wall -O3 [email protected]:~/PROJECTS/lookup_benchmark$ ./test
Degrees of Magnitude Test
Testing... testing complete, passed for expected 1 of length 1
Testing... testing complete, passed for expected 11 of length 2
Testing... testing complete, passed for expected 111 of length 3
Testing... testing complete, passed for expected 1111 of length 4
Testing... testing complete, passed for expected 11111 of length 5
Testing... testing complete, passed for expected 111111 of length 6
Testing... testing complete, passed for expected 1111111 of length 7
Testing... testing complete, passed for expected 11111111 of length 8
Testing... testing complete, passed for expected 111111111 of length 9
Testing... testing complete, passed for expected 1111111111 of length 10

Iterated Test, passes will be omitted
Iterated Test complete!
[email protected]:~/PROJECTS/lookup_benchmark$make benchmark gcc benchmark.c lookup_c_str.c facebook.c -o benchmark -Wall -O3 -lm [email protected]:~/PROJECTS/lookup_benchmark$ ./benchmark
Benchmarking sprintf... Done, loop time is 9526.1 +/- 3.3 ms
Benchmarking fast_print_int... Done, loop time is 1194.1 +/- 1.0 ms
Benchmarking (facebook) u64ToAsciiTable... Done, loop time is 1428.2 +/- 4.9 ms


It’s roughly 7.5 times faster. That’s a pretty good speedup, and it kills most of the overhead I suffered from before. At the same time, it even beats the example provided by Facebook engineering. I’m not suprised about that:

• I used a three-digit lookup table while they used a two-digit lookup table, and
• a count from 0 to 100000000 is biased toward bigger-digit numbers, but my digit counter was essentially constant-time.

This isn’t to say that my method is better in all ways. In the case of printing many, smaller-digit numbers, the Facebook engineering method would probably be on par. In particular, the work saved by encompassing a third digit in the lookup table doesn’t matter if the inputs are rarely larger than 100. My use case for my method, though, was very similar to a count from 0 to 100000000, so I saw gains from doing so after all.

### Takeaways

In general, anyway, the key here is to replace heavy work with lookup tables to the extent that our cache lets us.

For example, if our hypothetical CPU in fact had 64KB of cache, I could’ve gone even faster by converting four digits at a time, using up four chars per number for 10000 numbers or 40KB. To do that yourself, the code I provided just has to be adapted. Simply, replace every mention of 3 with 4 and 1000 with 10000 in the code I’ve provided, then rebuild the lookup table using this python script.

N_DIGITS = 4
with open('lookup.txt', 'w') as outfile:
n_lines = (10**N_DIGITS)//10 # At ten entries per line
for i in range(n_lines):
for j in range(10):
value = '%0*d' % (N_DIGITS, i*10+j)
outfile.write('\"'+value+'\" ')
outfile.write('\n')


Finally, the benefits of going beyond the cache after all depends on the application. Real systems have many tiers of memory, from L1 to RAM to hard drives, and the slowness of one level below might be offset by the savings of encompassing even more work in a lookup table.