Let's develop a QR Code Generator, part II: sequencing data

In the first part we've seen the basic concepts of QR codes. Now let's actively do something to generate one. Let's suppose we want to create a QR code for the string https://www.qrcode.com/ (the official QR code site).

Encoding mode

First of all, we need to find the correct encoding mode. Each mode has a corresponding value, according to the following table:

Encoding Mode Value bits
Numeric 0001 (1)
Alphanumeric 0010 (2)
Byte 0100 (4)
Kanji 1000 (8)
ECI 0111 (7)

Determining the optimal encoding mode is easily done by just checking what characters are included in the string. The only issue is detecting Kanji characters correctly. I'm no expert in Kanji by any means, so I'll just rely on the new ES2018 Unicode support of RegExp in JavaScript:

const KANJI_RE = /^[\p{Script_Extensions=Han}\p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}]*$/u;

I don't actually know if it's a perfect fit for Kanji mode, so if anyone knows just hit the comments! (I'll come up with a better solution later in the series, probably.)

In the end, we'll have something like this:

const NUMERIC_RE = /^\d*$/;
const ALPHANUMERIC_RE = /^[\dA-Z $%*+\-./:]*$/;
const LATIN1_RE = /^[\x00-\xff]*$/;
const KANJI_RE = /^[\p{Script_Extensions=Han}\p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}]*$/u;
function getEncodingMode(string) {
  if (NUMERIC_RE.test(string)) {
    return 0b0001;
  }
  if (ALPHANUMERIC_RE.test(string)) {
    return 0b0010;
  }
  if (LATIN1_RE.test(string)) {
    return 0b0100;
  }
  if (KANJI_RE.test(string)) {
    return 0b1000;
  }
  return 0b0111;
}

In the end, we have getEncodingMode('https://www.qrcode.com/') === 4.

Version

Let's aim for the smallest version possible: since it's 23 characters long, we can check in various tables around (here, for example) that we're going to need at least a version 2 code. Also, since we're there, we can get the highest possibile correction level - medium, in our case.

Also this other table tells us that a version 2 can contain 28 data codewords for medium correction: those 2 spare codewords are going to be used for data information.

If we wanted a higher error correction level, we should have chosen a larger version.

Data bits

The first 4 bits of our data sequence are 0100, our encoding mode.

Then, we'll tell how long our string is going to be. We need a table again for that, as the amount of bits reserved for this value is variable:

Encoding Mode Version 1-9 Version 10-26 Version 27-40
Numeric 10 12 14
Alphanumeric 9 11 13
Byte 8 16 16
Kanji 8 10 12

Turning this into a handy function:

const LENGTH_BITS = [
  [10, 12, 14],
  [9, 11, 13],
  [8, 16, 16],
  [8, 10, 12]
];
function getLengthBits(mode, version) {
  // ECI mode folds into byte mode
  // Basically it's `Math.floor(Math.log2(mode))` but much faster
  // See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/clz32
  const modeIndex = 31 - Math.clz32(mode);
  const bitsIndex = version > 26 ? 2 : version > 9 ? 1 : 0;
  return LENGTH_BITS[modeIndex][bitsIndex];
}

Since getLengthBits(4, 2) === 8 we'll need 8 bits, and 23 (the length of our string) in binary is 10111, our first bits are:

01000001 0111....

Next, the real data. All we have to do is getting the code of the characters of the string in ISO-8859-1:

h   t   t   p   s   :  /  /  w   w   w   .  q   r   c  o   d   e   .  c  o   m   /
104 116 116 112 115 58 47 47 119 119 119 46 113 114 99 111 100 101 46 99 111 109 47

Now convert everything to binary and concatenate it to the previous sequence:

01000001 01110110 10000111 01000111 01000111
00000111 00110011 10100010 11110010 11110111
01110111 01110111 01110010 11100111 00010111
00100110 00110110 11110110 01000110 01010010
11100110 00110110 11110110 11010010 1111....

Now we have to put a termination block, which is exactly 4 zeroes, so the last codewords will be 11110000. We still have 3 of the availabe 28 codewords to be filled.

Remaining space

We have filled all the 8 bits of the last codeword, otherwise we'd have to fill the remaining bits with zeroes (it's always like that in byte mode).

With the remaining codewords we can do two things:

  • instead of the termination 4-bit block, we can put another encoding mode block (maybe a different one) and start another sequence - but with just 3 codewords we can't do much;
  • fill the remaining codewords with the sequences 11101100 00010001 (which translates into 236 and 17 in decimal) until the limit is reached.

Why 236 and 17? I have no idea, but my guess is that they (Denso Wave?) made a lot of tries and verified that those two are the sequence that produces the most easily recognizable codes.

In the end we have:

65 118 135 71 71 7 51 162 242 247 119 119 114 231 23 38 54 246 70 82 230 54 246 210 240 236 17 236

Or, in binary:

01000001 01110110 10000111 01000111 01000111
00000111 00110011 10100010 11110010 11110111
01110111 01110111 01110010 11100111 00010111
00100110 00110110 11110110 01000110 01010010
11100110 00110110 11110110 11010010 11110000
11101100 00010001 11101100

Translating into code

Our function getByteData will need three things:

  • the content to be sequenced into codewords, of course;
  • how many bits are needed to state the length of the content: as we've seen, this depends on the encoding mode (byte, in this case) and the version (for our case it's 8);
  • the amount of codewords to be filled: this again depends on the QR code version and the error correction level (for our case it's 28).

For QR codes from version 10 and up we'll need 16 bits to express the length of our content, so the actual data will begin on the third codeword.

function getByteData(content, lengthBits, dataCodewords) {
  const data = new Uint8Array(dataCodewords);
  const rightShift = (4 + lengthBits) & 7;
  const leftShift = 8 - rightShift;
  const andMask = (1 << rightShift) - 1;
  const dataIndexStart = lengthBits > 12 ? 2 : 1;

  data[0] = 64 /* byte mode */ + (content.length >> (lengthBits - 4));
  if (lengthBits > 12) {
    data[1] = (content.length >> rightShift) & 255;
  }
  data[dataIndexStart] = (content.length & andMask) << leftShift;

  for (let index = 0; index < content.length; index++) {
    const byte = content.charCodeAt(index);
    data[index + dataIndexStart] |= byte >> rightShift;
    data[index + dataIndexStart + 1] = (byte & andMask) << leftShift;
  }
  const remaining = dataCodewords - content.length - dataIndexStart - 1;
  for (let index = 0; index < remaining; index++) {
    const byte = index & 1 ? 17 : 236;
    data[index + content.length + 2] = byte;
  }
  return data;
}

At the end of this first step, we should have this result:

getByteData('https://www.qrcode.com/', 8, 28)
// Uint8Array(26) [65, 166, 135, 71, 71, 7, 51, 162, 242, 247, 119, 119, 114, 231, 23, 38, 54, 246, 70, 82, 230, 54, 246, 210, 240, 236, 17, 236]

Remember: the above function will work for byte mode only!

Other encoding modes

We won't go into details for now, but for numeric mode we have to split the number in groups of 3 digits and encode each group with 10 bits (210 = 1024, so the space wasted is minimal).

Alphanumeric mode contains 45 symbols instead, so the string must be splitted into groups of 2 characters. Each symbol has a value (first the digits, then the uppercase Latin letters, then the space and the symbols $, %, *, +, \, -, ., /, :), so every pair of characters can be translated into a number ranging from 0 to 2024 (= 452 - 1). Therefore, we need 11 bits for every two alphanumeric characters (211 = 2048).

For Kanji mode... oh dear. First, we'd have to get the Shift JIS code of the pictogram, and the best way to do it is using a library like iconv-lite or, if you want to to it on your own, use its symbol table. Also, not every symbol can be used, but only those in the ranges 0x8140 to 0x9FFC and 0xE040 to 0xEBBF. In the end, a Kanji character will take 13 bits.

Next steps

Keep in touch, because it's all been quite easy so far. Now we'll have to deal with error data correction (EDC), and Math will be rolling!

33