Arguably, today’s media-rich internet runs on JPEG. the file name of any image online usually ends in .jpeg, and “JPEG” (like “GIF”) has long become internet lingo. It’s a file format that integrates image compression—an algorithm that makes images take less space. So more fundamentally, today’s media-rich internet runs on image compression. It’s a critical technology in a world with limited computer storage and internet speeds, so that is not surprising.

Rather, technologically superior formats like HEIC [2] and JPEG 2000 [3] exist, and the strange thing is that the world still runs on old JPEG. So, what led to this consensus? At the very least, no newcomer has been effective enough to compel everyone to switch. On the other hand, some clever thing about JPEG was so effective that it did. Unfortunately, that cleverness is tough to appreciate when the JPEG algorithm uses dense, specialized machinery. Let this essay try to cut through that machinery and reach that key insight without getting too tangled in domain knowledge.

## Level 0: Raster images

We all know that digital images aren’t like photographic film. They’re bits that can be copied, sent, and shared by computers and phones. But what is the structure of that binary data? Most digital images are “raster” images, meaning they are a two-dimensional grid of pixels [4]. That nature becomes obvious when zooming too far into one and finding it looks too blocky. From there, many ways exist to represent that grid as binary data—a process called encoding. That encoding is the structure.

In this essay, we’ll work on a 16x16 grayscale example. Here, we can start for now with a very conventional and simple encoding: we will read pixels on the grid in left-to-right then top-to-bottom order, like in Figure 1. For each pixel we encounter, we will measure its brightness with a number between 0 and 255. If we recall binary numbering, we find that the reason for this choice is that any integer in this range can be represented with a byte or eight bits.

Figure 1. Raster scan pattern of a CRT screen [https://commons.wikimedia.org/wiki/File:Raster-scan.svg]

However, to begin building on this encoding, we need a more tangible way to print these bits. Since we have eight bits per pixel, we can represent that with two hexadecimal digits. For example, the number 120 has the binary representation 01111000, then 0111 becomes 7 in hexadecimal and 1000 becomes 8. Altogether, 120 in hexadecimal would be 78. On top of that, we can highlight and annotate the raw data to make a point or two. The raw data shown in Figure 2 will be our running example, and we can use highlighting in this case to show the picture it encodes.

Figure 2. A series of bytes. Each byte is highlighted by the color it represents, forming a 16x16 raster image.

Printing aside, this encoding scheme would mean a 12-megapixel photo would take 12 megabytes. However, JPEG applies two kinds of methods toward breaking this one-to-one rule. From information theory, “lossless” techniques seek to eliminate redundant bits—finding smaller ways to store the exact same thing. Using signal processing and psychology, “lossy” techniques seek to eliminate parts of the image in ways that would be hard to see—finding smaller ways to store something close to the same.

## Level 1: Run-length Encoding

The first technique, run-length encoding, handles the low-hanging fruit. It’s not effective on all images, but on others it’s essential. At one byte per pixel, a half-grey, half-white 16x16 image would be encoded by the previously mentioned method as something akin to grey, grey, ..., grey, white, white, ..., white. Instead, a shorter way to say the same is 128 white, 128 grey. A computer would use one byte to store the number and another byte to store the pixel being repeated [5]. It’s simple, but (given this best-case scenario) we can already turn 256 bytes into just four.

But now let us use our running example in Figure 2. Using left-to-right then top-to-bottom order, we first encounter the byte FF 14 times in a row. 14 is represented as the byte 0E, and FF immediately follows. The complete process looks like Figure 3, where the bytes that store numbers are in green. Using run-length encoding, we would have already reduced the size from 256 bytes to 130 bytes!

Figure 3. Demonstration of run-length encoding.

## Level 2: Huffman coding

The second technique, Huffman coding, functions like Morse code. In Morse code, the most common letter (E) was represented by a single dot, the second most common (T) was represented by a single dash, then less frequent letters were represented by longer combinations of dots and dashes [6]. After a combination was tapped out, a rest period signaled that a letter had been completed. Using this approach, words were tapped out quickly.

Previously, we represented the brightness 255 with the byte FF. However, this also was only just a convention! Binary numbering is just a conventional scheme we both recognize, but nothing stops us from choosing a new one. When we look again at the result of run-length encoding in Figure 3, the byte FF looks to be common. Like in Morse code, we should want to represent it with a short bit combination instead. After counting how often each byte appears, however, what bit combinations do we assign them? We know that they need to be short, but we cannot simply reuse Morse code because binary has no way to “rest”—there is no third number after one and zero.

One approach to this issue is this rule: no valid bit combination can be the beginning of another bit combination. For example, if we’re given the code ((a, 0), (b, 10), (c, 11)), we’re done decoding a byte if we see a 0 first but not if we see a 1 first. Huffman’s innovation was an algorithm that always presents the most powerful code that followed this rule, mathematically guaranteed [7]. Applying that algorithm to our running example, we get the most powerful code for reducing the run-length encoding result even further.

Figure 4. The code produced by Huffman coding (partial), showing more common bytes getting shorter binary combinations.

Using this code, we arrive at Figure 5, where the counts of repeated bytes are again in green.

Figure 5. Result of Huffman coding on past example in binary, code absent.

We can represent this in hexadecimal again for comparison, getting Figure 5 with only a size of 59 bytes! One technicality is that the code must be included with this compressed data, and this is because the common binary numbering scheme can no longer be assumed. Still, the code and compressed data together are usually smaller.

Figure 6. Result of Huffman coding on past example in hexadecimal, code absent.

### Intermission: From “lossless” to “lossy” techniques

Run-length encoding and Huffman coding together are the “lossless” techniques involved in the JPEG process. These techniques alone can only effectively compress images with large, consistent regions. However, that limits us to just basic digital graphics like corporate logos. Sometimes we do have that kind of image, so a variety of just “lossless” techniques get applied in PNG and GIF [8]. But sometimes we have a photograph instead, for example. The consequences of shoving those kinds of images through anyway can be simulated with a variation of the running example.

Figure 7. Lossless 'compressions' of running example and variation

So, why would these techniques still be useful in the JPEG format? JPEG’s huge insight was (and the missing step is) a “lossy” technique, involving a change in perspective.

## Level 3: Discrete Cosine Transform Quantization

That change in perspective is a transform: a mathematical method to represent data in a new way. The one used in JPEG is the Discrete Cosine Transform, or DCT. The DCT has long been a powerful tool in signal processing, and here is one of its biggest uses. The first step is splitting a photograph into many 8x8 images [1]. After that, a mathematically proven principle of the DCT is that any 8x8 image is a weighted average of some patterns called the DCT basis functions [1].

Figure 8. DCT basis functions. [https://commons.wikimedia.org/wiki/File:DCT-8x8.png]

Given a certain 8x8 image, the DCT finds these weights as an 8x8 table. A value in some location on this table would correspond to the weight of the basis function in the same location on Figure 6. Notably, the weights of the lower-right, upper-right, and lower-left “high-frequency” patterns often happen to be close to zero [1].

Figure 9. Demonstration of Discrete Cosine Transform

At the same time, this observation is helped by a psychological concept called the just-noticeable difference [9]. In short, it is the smallest difference that humans can detect, and it depends heavily on the context. In this context, studies show that our just-noticeable difference of “high-frequency” patterns is large enough that we often cannot distinguish between close-to-zero and truly zero [10]!

Figure 10. Spectrum of DCT quantization [https://commons.wikimedia.org/wiki/File:Felis_silvestris_silvestris_small_gradual_decrease_of_quality.png]

From here, many steps in the JPEG process are just specialized machinery, but it is correct to say that high-frequency close-to-zero weights become truly zero [1]. This rounding effect, a part of what is called quantization, makes this technique “lossy.” At the same time, these zeroed-out weights often form a large, mostly consistent space in the 8x8 table. As a result, more machinery compresses that table in a way analogous to the “lossless” techniques, just as if it was an 8x8 image!

Although the “lossless” techniques were originally only useful for simple digital graphics with large consistent regions, DCT quantization makes those regions. It takes any image, chops it up, transforms the pieces, then levels any almost-flat region into an utterly flat one. This “lossy” technique is the key insight of JPEG, and as a result it can usually compress any color image to a tenth of its original size [1]. If we put our running example and the previously disastrous variation through the same process, we can see that it treats both well*.

Figure 10. JPEG-like compressions of running example and variation

Such techniques violate a very fundamental convention that we assume about encodings: they don’t destroy information. Any image that goes through the process is irreversibly damaged by it, though in a usually indistinguishable way. And yet this is precisely what unlocks the kind of image compression that drives today’s media-rich internet.

### Level 4(?): Chroma Subsampling

As an aside, another application of the just-noticeable difference is chroma subsampling. It comes from observations that we have a weaker sensitivity to “high-frequency” patterns of color versus patterns of brightness. It’s first worth mentioning that the three RGB channels get transformed into one brightness channel and two color channels [1]. The rough idea, then, is to merge pixels of the color channels into—essentially—superpixels. The rest of the JPEG algorithm follows from that.

It’s another “lossy” technique that achieves somewhat better compression. I don’t have enough research to give proper citations, but I ought to mention it because it’s an optional but popular component of the JPEG algorithm. If it sounds interesting, I’d give it a google.

## Conclusion

Now, after cutting through the specialized machinery that drives JPEG, the founding insight is easier to see. But notably, the inventors of JPEG credit run-length encoding and Huffman coding to someone else. On top of that, “lossless” image compression predated JPEG through the GIF format [8]. So, they must have simply asked: how could they extend those solutions beyond simple digital graphics? Their answer was the DCT quantization: a solution layered on a solution.

But if that’s all it was, what caused JPEG to knock out its competitors, and what stops newcomers from doing the same to JPEG? Rather, the key is in the DCT quantization. The JPEG algorithm violated a convention that the competitors followed: encoders stay faithful to what they encode. So, this lossy technique wasn’t just an additional layer—it was a new foundation. Consequently came its dramatic reductions in file size, and from that came its domination of today’s media-rich internet. Therefore, even though new formats now outclass old JPEG, what we might still be looking for is a breakthrough—a new bedrock idea–with a dramatic performance lead to match.

## References

[1] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii-xxxiv, Feb. 1992.

[2] “HEIC vs. JPEG,” Adobe Inc, [Online]. Available: https://www.adobe.com/creativecloud/file-types/image/comparison/heic-vs-jpeg.html. [Accessed 13 March 2022].

[3] “JPEG vs. JPEG 2000,” Adobe Inc, [Online]. Available: https://www.adobe.com/creativecloud/file-types/image/comparison/jpeg-vs-jpeg-2000.html. [Accessed 13 March 2022].

[4] “Raster Images,” University of Michigan Library, 11 February 2021. [Online]. Available: https://guides.lib.umich.edu/c.php?g=282942&p=1885352. [Accessed 13 March 2022].

[5] “Run-Length Encoding (RLE),” FileFormat, [Online]. Available: https://www.fileformat.info/mirror/egff/ch09_03.htm. [Accessed 13 March 2022].

[6] J. D. Cook, “How efficient is Morse code?,” 8 February 2017. [Online]. Available: https://www.johndcook.com/blog/2017/02/08/how-efficient-is-morse-code/. [Accessed 14 March 2022].

[7] D. A. Huffman, “A Method for the Construction of Minimum-Redundancy Codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098-1101, Sept. 1952.

[8] “LZW and Three Graphics File Formats,” Orpalis Imaging Technologies, 17 October 2019. [Online]. Available: https://www.orpalis.com/blog/lzw-three-graphics-file-formats/. [Accessed 14 March 2022].

[9] R. M. Spielman, W. J. Jenkins and M. D. Lovett, “Sensation versus Perception,” in Psychology 2e, Houston, OpenStax, 2020.

[10] H. Lohscheller, “A Subjectively Adapted Image Communication System,” IEEE Transactions on Communications, vol. 32, no. 12, pp. 1316-1322, 1984.