Let's develop a QR Code Generator, part I: basic concepts

Recently I was developing a little puzzle web game, and I thought that it would be cool if people could share the puzzles using something commonly shareable like a QR Code. After all, Chromium-based browsers support the Barcode Detection API since v83 came out in May 2020, so if it's there why not using it?

Unfortunately, it's an API to just read a QR Code (which, admittedly, it's the hardest part), not to generate it, so we have to do it ourselves. How hard could it be?!

Oh boy.

Let's dive into a series of fairly advanced mathematical concepts and a long sequence of rules ranging from "almost reasonable" to seemingly "totally ludicrous", only to create a bunch of monochromatic pixels.

For the sake of the articles in this series, we'll concentrate in just QR code generation, leaving detection aside. We'll also start studying the simpler case of smaller codes with 8-bit characters.

Data types

QR codes contain data, that's fair to say. The type of data can be decided, but of course it determines the maximum amount of information that can be stored:

  • numbers (up to 7089);
  • alphanumeric (numbers, uppercase letters, a bunch of symbols: ~65% more expensive than numbers);
  • bytes (just 8 bit Latin-1 encoded characters, ~140% more expensive);
  • kanji characters (~290% more expensive).

It shouldn't surprise that kanji is one of the main symbol sets, since QR codes have been developed by Denso Wave, a Japanese automation company.

There actually are other encoding modes in more recent versions, but as mentioned before we'll focus on 8-bit bytes for now. And in the end, a QR code is a series of bits - so if you want to encode your information as you wish, you can.

Moreover, QR codes can also switch to a different encoding mode in the middle of its data, but we won't consider this case for now.

Sizes

QR codes are always square, but their sizes vary. The size is determined by the unusual term "version", so that version 1 is 21×21 pixels large, while version 40 (the largest) is 177×177 pixels. A QR code 1 version larger is 4 pixels wider and taller, so the size is (17 + version * 4) pixels.

Also, we shouldn't call them pixels, but rather "modules" (again, unusual but maybe something has been lost in translation from Japanese).

Since larger QR codes are more difficult to decode (and computationally more expensive), the goal is to use the smallest possible "version" for the amount of data we want to store.

Larger QR codes split their data in multiple blocks (up to 81).

Error correction

Every QR code contain error correction "modules" - and no, we can't remove them to maximize the available space. But we can choose among 4 levels of error correction:

Level Letter Data recovery
Low L ~7%
Medium M ~15%
Quartile Q ~25%
High H ~30%

A higher error correction can be abused to create QR codes partially covered by logos and pictures, but that can be still read thanks to error correction.

Fixed patterns

Many of us can recognize what's possibly a QR code with a glance, and it's because of some common characteristics:

  • they're square pictures;
  • they're black and white - or at least of two colors very distant in the luminance spectrum (so we'll call them "dark" and "light" instead);
  • they're composed of a grid of square dots;
  • they have some easily recognizable patterns all around the pictures.

About the last one, the point of being easy to recognize is exactly why they have been designed like that: let's remember that QR (= Quick Response) codes have been developed to be used by industrial automated machines.

These patterns are:

  • finder patterns: 7×7 squares placed on top left, top right and bottom left corners, separated by a line of empty modules; QR code with the finder patterns highlighted in green
  • alignment patterns: 5×5 squares placed on the corners and intersections of an n×n grid (unless occupied by the finder patterns); n ranges between 2 and 6, so there are n2 - 3 of these, except for version 1 which has no alignment pattern; QR code with the alignment patterns highlighted in red
  • timing patterns: a horizontal and a vertical line of alternating dark and light modules, connecting the finder patterns (you've noticed it only if you deeply examined some QR codes); QR code with the timing patterns highlighted in green
  • a dark module: just a module that's always dark, placed on the 9th column and (4 * version) + 10)-th row (I bet you never noticed it!). QR code with the dark module circled in red

Moreover, in larger QR codes (from version 7 and up) a couple of areas are reserved for format data.

Capacity

Given the version, the encoding mode and the error correction level, the capacity of a QR code is determined. The available space that's not occupied by fixed patterns or reserved areas is divided in groups of 8 modules called "codewords": think of them as classic 8-bit bytes.

Therefore, the total available codewords is fixed for every version: 26 for version 1, 44 for version 2 and so on, up to 3706 for version 40.

For each version, the codewords reserved for error correction, is also determined, and can be found in tables like this one.

Without much further ado, let's begin to build some small QR code, with ISO-8859-1 byte encoding!

Wait, ISO-8859-1?

Yes, QR codes use ISO-8859-1 (also known as Latin-1) to encode their byte strings. Today UTF-8 is more common, but a while ago it wasn't.

The basic issue here is that, while UTF-8 can encompass millions of characters (or "code points"), Latin-1 has only 255 symbols. That's it. No emojis, no other alphabets. If you want to check if a string is valid for Latin-1, the check is simple:

const LATIN1_RE = /^[\x00-\xff]*$/;
function isLatin1(string) {
  return LATIN1_RE.test(string);
}

If some characters are outside ISO-8859-1, well... you either discard them, or use ECI mode. Also, some readers automatically recognize if UTF-8 is used instead, but it might not be a reliable choice for public QR codes.

Stay in touch for the next part: encoding the data!

15