Handling Binary Data — Building a HTTP Server from scratch

In the last post of the BTS: HTTP Server series, I wrote a barebones HTTP server that can handle requests and respond
appropriately.
I think I covered the basics, but that server is limited in what it can do.
It can only handle text-based Requests and Responses... That means no images or other media.
And if the Request or the Response is larger than a KB, I'm out of luck. Again, not great for media...

This article is a transcript of a YouTube video I made.

Oh, hey there...

That's my challenge for today: refactor my server to handle arbitrarily sized Requests and stop treating everything as
text...

If I want to handle large requests, the first thing I can do is read the stream in chunks, 1KB at a time,
until there's nothing left to read.
Once I have all of my chunks, I can concatenate them into one Typed Array. And voilà, arbitrarily sized Requests!

const concat = (...chunks) => {
  const zs = new Uint8Array(chunks.reduce((z, ys) => z + ys.byteLength, 0));
  chunks.reduce((i, xs) => zs.set(xs, i) || i + xs.byteLength, 0);
  return zs;
};

const chunks = [];
let n;
do {
    const xs = new Uint8Array(1024);
    n = await r.read(xs);
    chunks.push(xs.subarray(0, n));
} while (n === 1024);

const request = concat(...chunks);
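
To try the loop above without a real connection, here is a minimal sketch that runs it against an in-memory reader.
The names readerFrom and readAll are hypothetical, introduced only for this illustration; the fake reader just mimics
the read(xs) contract (it resolves to the number of bytes read, or null when nothing is left).

```javascript
// Same `concat` as above: allocate once, then copy each chunk at its offset.
const concat = (...chunks) => {
  const zs = new Uint8Array(chunks.reduce((z, ys) => z + ys.byteLength, 0));
  chunks.reduce((i, xs) => zs.set(xs, i) || i + xs.byteLength, 0);
  return zs;
};

// Hypothetical in-memory reader that hands out at most 1024 bytes per read.
const readerFrom = (bytes) => {
  let offset = 0;
  return {
    async read(xs) {
      if (offset >= bytes.byteLength) return null;
      const n = Math.min(xs.byteLength, 1024, bytes.byteLength - offset);
      xs.set(bytes.subarray(offset, offset + n));
      offset += n;
      return n;
    },
  };
};

// The read loop from above, wrapped in a function (hypothetical name).
const readAll = async (r) => {
  const chunks = [];
  let n;
  do {
    const xs = new Uint8Array(1024);
    n = await r.read(xs);
    chunks.push(xs.subarray(0, n ?? 0)); // a `null` read contributes nothing
  } while (n === 1024);
  return concat(...chunks);
};

// 2500 bytes arrive as reads of 1024 + 1024 + 452 bytes.
readAll(readerFrom(new Uint8Array(2500).fill(7)))
  .then((request) => console.log(request.byteLength)); // 2500
```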

The second challenge is to figure out how much of the data stream is the Request line and the Headers versus the body...
I want to avoid reading too far into the body, since it might be binary data.
I know that the body starts after the first empty line of the Request.
So I could, technically, search for the first empty line, parse only what comes before it, and treat the rest as the body.

So I wrote this function that tries to find a sequence within the array. It first finds the first occurrence of
a byte, and then tests the following bytes until it has a match.
In our case, I want to find two consecutive CRLF sequences. So I try to find the first CR, then check if it is followed by
LF, CR and LF... And I repeat this until I find the empty line.

export const findIndexOfSequence = (xs, ys) => {
  let i = xs.indexOf(ys[0]);
  let z = false;

  while (i >= 0 && i < xs.byteLength) {
    let j = 0;
    while (j < ys.byteLength) {
      if (xs[j + i] !== ys[j]) break;
      j++;
    }
    if (j === ys.byteLength) {
      z = true;
      break;
    }
    i++;
  }

  return z ? i : null;
};
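
As a quick illustration of how such a search can split a request, here is a small sketch. The findIndexOfSequence below
is a compact restatement of the function above so that the example runs on its own, and encode and decode are assumed
to be thin wrappers around TextEncoder and TextDecoder. The request itself is made up.

```javascript
// Assumed helpers: text <-> bytes.
const encode = (text) => new TextEncoder().encode(text);
const decode = (xs) => new TextDecoder().decode(xs);

// Compact restatement of the search above (returns null when nothing matches).
const findIndexOfSequence = (xs, ys) => {
  for (let i = xs.indexOf(ys[0]); i >= 0; i = xs.indexOf(ys[0], i + 1)) {
    if (ys.every((y, j) => xs[i + j] === y)) return i;
  }
  return null;
};

const request = encode(
  "POST /hello.txt HTTP/1.1\r\nContent-Length: 5\r\n\r\nHello",
);

// The empty line is two CRLF sequences back to back.
const i = findIndexOfSequence(request, encode("\r\n\r\n"));

const head = request.subarray(0, i); // Request line and headers
const body = request.subarray(i + 4); // everything after the empty line

console.log(decode(head).split("\r\n")); // [ "POST /hello.txt HTTP/1.1", "Content-Length: 5" ]
console.log(decode(body)); // Hello
```

Because a match at index 0 is falsy, it is safer to compare the result against null than to test it with `!`.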

🐙 You will find the code for this post here: https://github.com/i-y-land/HTTP/tree/episode/03

The problem with this approach is that I have to traverse the whole request, and it might turn out that the request
doesn't have a body, in which case I wasted my time.

Instead, I will read the bytes one line at a time, finding the nearest CRLF, and parse the lines in order.
On the first line, I will extract the method and the path.
Whenever I find an empty line, I will assume the body is next and stop.
I will parse the remaining lines as headers.

// https://github.com/i-y-land/HTTP/blob/episode/03/library/utilities.js#L208
export const readLine = (xs) => xs.subarray(0, xs.indexOf(LF) + 1);

export const decodeRequest = (xs) => {
  const headers = {};
  let body, method, path;
  const n = xs.byteLength;
  let i = 0;
  let seekedPassedHeader = false;
  while (i < n) {
    if (seekedPassedHeader) {
      body = xs.subarray(i, n);
      i = n;
      continue;
    }

    const ys = readLine(xs.subarray(i, n));

    if (i === 0) {
      if (!findIndexOfSequence(ys, encode(" HTTP/"))) break;
      [method, path] = decode(ys).split(" ");
    } else if (
      ys.byteLength === 2 &&
      ys[0] === CR &&
      ys[1] === LF &&
      xs[i] === CR &&
      xs[i + 1] === LF
    ) {
      seekedPassedHeader = true;
    } else if (ys.byteLength === 0) break;
    else {
      // cut the line at CR, or at LF when the line has no CR
      const end = ys.indexOf(CR);
      const [key, value] = decode(
        ys.subarray(0, end >= 0 ? end : ys.indexOf(LF)),
      ).split(/(?<=^[A-Za-z-]+)\s*:\s*/);
      headers[key.toLowerCase()] = value;
    }

    i += ys.byteLength;
  }

  return { body, headers, method, path };
};
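
The snippet above uses a few names that this post doesn't show: CR, LF, encode and decode. The sketch below assumes
they are the two byte values and thin wrappers around TextEncoder and TextDecoder (the repository may define them
differently), and walks a made-up request line by line the same way decodeRequest does.

```javascript
// Assumed definitions; the post uses these names without showing them.
const CR = 13; // "\r"
const LF = 10; // "\n"
const encode = (text) => new TextEncoder().encode(text);
const decode = (xs) => new TextDecoder().decode(xs);

// Same one-liner as above: everything up to and including the next LF.
const readLine = (xs) => xs.subarray(0, xs.indexOf(LF) + 1);

const xs = encode("GET /image.png HTTP/1.1\r\nHost: localhost:8080\r\n\r\n");

const lines = [];
let i = 0;
while (i < xs.byteLength) {
  const ys = readLine(xs.subarray(i));
  if (ys.byteLength === 0) break; // no LF left
  lines.push(decode(ys));
  i += ys.byteLength;
}

console.log(lines);
// [ "GET /image.png HTTP/1.1\r\n", "Host: localhost:8080\r\n", "\r\n" ]

// The lookbehind only allows a split right after the header name,
// so a colon inside the value is left alone.
const [key, value] = "Host: localhost:8080".split(/(?<=^[A-Za-z-]+)\s*:\s*/);
console.log(key, value); // Host localhost:8080
```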

On the other hand, the function to encode the Response is absurdly simple: I can pretty much use the function I already
made and just encode the result. The biggest difference is that I have to be aware that the body might not
be text and should be kept as a Typed Array. I can encode the headers and then concat the result with the body.

// https://github.com/i-y-land/HTTP/blob/episode/03/library/utilities.js#L248
export const stringifyHeaders = (headers = {}) =>
  Object.entries(headers)
    .reduce(
      (hs, [key, value]) => `${hs}\r\n${normalizeHeaderKey(key)}: ${value}`,
      "",
    );

export const encodeResponse = (response) =>
  concat(
    encode(
      `HTTP/1.1 ${statusCodes[response.statusCode]}${
        stringifyHeaders(response.headers)
      }\r\n\r\n`,
    ),
    response.body || new Uint8Array(0),
  );
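
To make the result concrete, here is a self-contained sketch of the same two functions producing a response with a
binary body. The statusCodes table and normalizeHeaderKey are not shown in this post, so the versions below are
assumptions about their shape (a code mapped to "code reason", and "content-type" turned into "Content-Type").

```javascript
const encode = (text) => new TextEncoder().encode(text);
const decode = (xs) => new TextDecoder().decode(xs);
const concat = (...chunks) => {
  const zs = new Uint8Array(chunks.reduce((z, ys) => z + ys.byteLength, 0));
  chunks.reduce((i, xs) => zs.set(xs, i) || i + xs.byteLength, 0);
  return zs;
};

// Assumption: maps a status code to "<code> <reason phrase>".
const statusCodes = { 200: "200 OK", 204: "204 No Content", 404: "404 Not Found" };

// Assumption: capitalizes each dash-separated word of the header name.
const normalizeHeaderKey = (key) =>
  key.replace(/(^|-)([a-z])/g, (_, dash, letter) => dash + letter.toUpperCase());

const stringifyHeaders = (headers = {}) =>
  Object.entries(headers)
    .reduce(
      (hs, [key, value]) => `${hs}\r\n${normalizeHeaderKey(key)}: ${value}`,
      "",
    );

const encodeResponse = (response) =>
  concat(
    encode(
      `HTTP/1.1 ${statusCodes[response.statusCode]}${
        stringifyHeaders(response.headers)
      }\r\n\r\n`,
    ),
    response.body || new Uint8Array(0),
  );

const response = encodeResponse({
  body: new Uint8Array([0x89, 0x50, 0x4e, 0x47]), // first bytes of a PNG file
  headers: { "content-length": 4, "content-type": "image/png" },
  statusCode: 200,
});

// Only the head is decoded as text; the body bytes are left untouched.
const head = decode(response.subarray(0, response.byteLength - 4));
console.log(JSON.stringify(head));
// "HTTP/1.1 200 OK\r\nContent-Length: 4\r\nContent-Type: image/png\r\n\r\n"
```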

From there, I have enough to write a simple server using the serve function I've implemented previously.
I can decode the request... then encode the response.

...
serve(
  Deno.listen({ port }),
  (xs) => {
    const request = decodeRequest(xs);

    if (request.method === "GET" && request.path === "/") {
      return encodeResponse({ statusCode: 204 })
    }
  }
).catch((e) => console.error(e));

I could respond to every request with a file. That is a good start for a static file server.

...
    if (request.method === "GET" && request.path === "/") {
      const file = await Deno.readFile(`${Deno.cwd()}/image.png`); // read the file
      return encodeResponse({
        body: file,
        headers: {
          "content-length": file.byteLength,
          "content-type": "image/png"
        },
        statusCode: 200
      });
    }

I can start my server and open a browser to visualize the image.

With a bit more effort, I can serve any file within a given directory.
I attempt to access the file and look up the MIME type in a curated list using the extension.
If the system can't find the file, I return 404 Not Found.

const sourcePath =
    (await Deno.permissions.query({ name: "env", variable: "SOURCE_PATH" }))
            .state === "granted" && Deno.env.get("SOURCE_PATH") ||
    `${Deno.cwd()}/library/assets_test`;
...
    if (request.method === "GET") {
      try {
        const file = await Deno.readFile(sourcePath + request.path); // read the file
        return encodeResponse({
          body: file,
          headers: {
            "content-length": file.byteLength,
            ["content-type"]: mimeTypes[
              request.path.match(/(?<extension>\.[a-z0-9]+$)/)?.groups?.extension
                ?.toLowerCase()
              ]?.join(",") || "text/plain", // fall back when the extension is unknown
          },
          statusCode: 200
        });
      } catch (e) {
        if (e instanceof Deno.errors.NotFound) { // if the file is not found
          return encodeResponse({
            body: new Uint8Array(0),
            headers: {
              ["Content-Length"]: 0,
            },
            statusCode: 404,
          });
        }

        throw e;
      }
    }
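
The curated list itself isn't shown here, so the following is only a guess at its shape: keys that include the leading
dot and arrays as values, which is what the `.join(",")` above suggests. lookupContentType is a hypothetical helper
that isolates the lookup and falls back to text/plain when the extension is missing or unknown.

```javascript
// Hypothetical curated list (a real one would be much longer).
const mimeTypes = {
  ".html": ["text/html"],
  ".mp3": ["audio/mpeg"],
  ".png": ["image/png"],
  ".txt": ["text/plain"],
};

// Hypothetical helper: extension -> Content-Type, with a plain-text fallback.
const lookupContentType = (path) => {
  // The `i` flag lets upper-case extensions match before they are lower-cased.
  const extension = path.match(/(?<extension>\.[a-z0-9]+)$/i)?.groups?.extension
    ?.toLowerCase();
  return mimeTypes[extension]?.join(",") || "text/plain";
};

console.log(lookupContentType("/image.png")); // image/png
console.log(lookupContentType("/music/track.MP3")); // audio/mpeg
console.log(lookupContentType("/README")); // text/plain
```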

With a broadly similar approach, I can receive any file.

const targetPath =
    (await Deno.permissions.query({ name: "env", variable: "TARGET_PATH" }))
            .state === "granted" && Deno.env.get("TARGET_PATH") ||
    `${Deno.cwd()}/`;
...
    if (request.method === "GET") { ... }
    else if (request.method === "POST") {
      await Deno.writeFile(targetPath + request.path, request.body); // write the file
      return encodeResponse({ statusCode: 204 });
    }

Now, you can guess if you look at the position of your scrollbar that things can't be that simple...

I see two problems with my current approach.
I have to load whole files into memory before I can offload them to the File System, which can become a bottleneck at
scale.
Another, more surprising, issue is with file uploads...
When uploading a file, some clients, curl for example, will make the request in two steps... The first step
tests the terrain: the client states that it wants to upload a file of a certain type and length, and requires that the
server replies with 100 Continue before it sends the file.
Because of this behaviour, I need to retain access to the connection, the writable resource.
So I think I will have to refactor the serve function from accepting a function that takes a Typed Array as an
argument, to a function that takes the connection.
This could also be a positive change that would facilitate implementing powerful middleware later on...

export const serve = async (listener, f) => {
  for await (const connection of listener) {
    await f(connection);
  }
};
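
The two-step upload described above can be pictured as follows. This is only a sketch of the message flow with made-up
values (file name, length), and expectsContinue is a hypothetical helper that checks the head of a request for the
Expect header.

```javascript
const encode = (text) => new TextEncoder().encode(text);
const decode = (xs) => new TextDecoder().decode(xs);

// Step 1: the client sends only the head of the request, then waits.
const head = encode(
  "POST /image.png HTTP/1.1\r\n" +
    "Content-Type: image/png\r\n" +
    "Content-Length: 48213\r\n" +
    "Expect: 100-continue\r\n" +
    "\r\n",
);

// Hypothetical helper: does the head ask for an interim response?
const expectsContinue = (xs) =>
  /^expect:\s*100-continue\s*$/im.test(decode(xs));

// Step 2: the server answers with an interim response.
const interim = encode("HTTP/1.1 100 Continue\r\n\r\n");

// Step 3: the client sends the body on the same connection,
// and the server finishes with the final response (204 in this post).

console.log(expectsContinue(head)); // true
console.log(JSON.stringify(decode(interim))); // "HTTP/1.1 100 Continue\r\n\r\n"
```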

There are two ways that my server can handle file uploads.
One possibility is that the client tries to post the file directly;
I have the option to read the headers and refuse the request if it's too large. The other possibility is that the
client expects me to reply first.
In both cases, I will read the first chunk and then start creating the file with the data processed so far. Then I want
to read one chunk at a time from the connection and systematically write each one to the file. This way, I never hold
more than 1KB in memory at a time... I do this until I can't read a whole 1KB, which tells me that the file has been
completely copied over.

export const copy = async (r, w) => {
  const xs = new Uint8Array(1024);
  let n;
  let i = 0;
  do {
    n = await r.read(xs);
    await w.write(xs.subarray(0, n));
    i += n;
  } while (n === 1024);

  return i;
};
...
    let xs = new Uint8Array(1024);
    const n = await Deno.read(connection.rid, xs);
    xs = xs.subarray(0, n); // keep only the bytes that were actually read
    const request = decodeRequest(xs);
    const { fileName } = request.path.match(
      /.*?\/(?<fileName>(?:[^%]|%[0-9A-Fa-f]{2})+\.[A-Za-z0-9]+?)$/,
    )?.groups || {};

    ...

    const file = await Deno.open(`${targetPath}/${fileName}`, {
      create: true,
      write: true,
    });

    if (request.headers.expect === "100-continue") {
      // write the `100 Continue` response
      await Deno.write(connection.rid, encodeResponse({ statusCode: 100 }));

      const ys = new Uint8Array(1024);
      const n = await Deno.read(connection.rid, ys); // read the follow-up
      xs = ys.subarray(0, n);
    }

    const i = findIndexOfSequence(xs, CRLF); // find the beginning of the body

    if (i > 0) {
      await Deno.write(file.rid, xs.subarray(i + 4)); // write possible file chunk
      if (xs.byteLength === 1024) {
        await copy(connection, file); // copy subsequent chunks
      }
    }

    await connection.write(
      encodeResponse({ statusCode: 204 }), // terminate the exchange
    );
...
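
One caveat with stopping on the first short read: a connection can hand over fewer than 1024 bytes before the body is
complete. A possible variation, sketched here under the assumption that the request carries a Content-Length header,
is to keep reading until that many bytes have been copied. copyExactly, readerFrom and writerTo are hypothetical names
for this illustration and are not part of the repository.

```javascript
// Hypothetical variation of `copy`: stop after `length` bytes instead of on a short read.
const copyExactly = async (r, w, length) => {
  const xs = new Uint8Array(1024);
  let i = 0;
  while (i < length) {
    // Never ask for more than what is still expected.
    const n = await r.read(xs.subarray(0, Math.min(1024, length - i)));
    if (n === null) break; // the other side closed the connection early
    await w.write(xs.subarray(0, n));
    i += n;
  }
  return i;
};

// In-memory stand-in for a connection that delivers `chunkSize` bytes per read.
const readerFrom = (bytes, chunkSize) => {
  let offset = 0;
  return {
    async read(xs) {
      if (offset >= bytes.byteLength) return null;
      const n = Math.min(xs.byteLength, chunkSize, bytes.byteLength - offset);
      xs.set(bytes.subarray(offset, offset + n));
      offset += n;
      return n;
    },
  };
};

// In-memory stand-in for a file: collects a copy of every chunk written.
const writerTo = (chunks) => ({
  async write(xs) {
    chunks.push(xs.slice()); // copy, since the caller reuses its buffer
    return xs.byteLength;
  },
});

// 2500 bytes delivered 300 at a time: eight reads of 300 and one of 100.
const chunks = [];
copyExactly(readerFrom(new Uint8Array(2500).fill(1), 300), writerTo(chunks), 2500)
  .then((i) => console.log(i, chunks.length)); // 2500 9
```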

From there, I can rework the part that responds with a file.
Similarly to the two-step request for receiving a file, a client may opt to request the headers for a given file
with the HEAD method.
Because I want to support this feature, I can first get information from the requested file, then I can start writing
the headers and only if the request's method is GET -- not HEAD -- I will copy the file to the connection.

...
    try {
      const { size } = await Deno.stat(`${sourcePath}/${fileName}`);

      await connection.write(
        encodeResponse({
          headers: {
            ["Content-Type"]: mimeTypes[
              fileName.match(/(?<extension>\.[a-z0-9]+$)/)?.groups?.extension
                ?.toLowerCase()
              ]?.join(",") || "text/plain",
            ["Content-Length"]: size,
          },
          statusCode: 200,
        }),
      );

      if (request.method === "GET") {
        const file = await Deno.open(`${sourcePath}/${fileName}`);
        await copy(file, connection);
      }
    } catch (e) {
      if (e instanceof Deno.errors.NotFound) {
        return Deno.write( // respond with 404 and stop here instead of rethrowing
          connection.rid,
          encodeResponse({
            headers: {
              ["Content-Length"]: 0,
            },
            statusCode: 404,
          }),
        );
      }

      throw e;
    }
...

Wow. At this point I have to be either very confident in my programming skills or sadistic...
I need to implement a slew of integration tests before going any further.
I created four static files for this purpose: a short text file (less than a KB), a longer text file, an image and
music...
Then I wrote a higher-order function that initializes the server before calling the test function.

// https://github.com/i-y-land/HTTP/blob/episode/03/library/integration_test.js#L6
const withServer = (port, f) =>
  async () => {
    const p = await Deno.run({ // initialize the server
      cmd: [
        "deno",
        "run",
        "--allow-all",
        `${Deno.cwd()}/cli.js`,
        String(port),
      ],
      env: { LOG_LEVEL: "ERROR", "NO_COLOR": "1" },
      stdout: "null",
    });

    await new Promise((resolve) => setTimeout(resolve, 1000)); // wait to be sure

    try {
      await f(p); // call the test function passing the process
    } finally {
      Deno.close(p.rid);
    }
  };
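
A fixed one-second pause works, but it can be longer than needed or, on a slow machine, too short. One alternative is
to poll until the server answers. The sketch below is a generic retry helper; waitUntil and its options are
hypothetical, and the probe passed to it is assumed to be something that rejects while the server is not ready yet,
such as a fetch against it.

```javascript
// Hypothetical helper: call `probe` until it resolves, pausing `delay` ms between attempts.
const waitUntil = async (probe, { attempts = 20, delay = 50 } = {}) => {
  let lastError;
  for (let k = 0; k < attempts; k++) {
    try {
      return await probe();
    } catch (e) {
      lastError = e;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError; // still failing after the last attempt
};

// Stand-in probe that fails twice before succeeding.
let calls = 0;
const probe = async () => {
  calls++;
  if (calls < 3) throw new Error("connection refused");
  return "ready";
};

waitUntil(probe, { delay: 10 })
  .then((state) => console.log(state, calls)); // ready 3

// In `withServer`, the fixed timeout could then become something like:
//   await waitUntil(async () => {
//     const response = await fetch(`http://localhost:${port}/`);
//     await response.body?.cancel();
//   });
```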

With that, I generate a bunch of tests to download and upload files; this ensures that my code is working as expected.

// https://github.com/i-y-land/HTTP/blob/episode/03/library/integration_test.js#L58
[...]
  .forEach(
    ({ headers = {}, method = "GET", path, title, f }) => {
      Deno.test(
        `Integration: ${title}`,
        withServer(
          8080,
          async () => {
            const response = await fetch(`http://localhost:8080${path}`, {
              headers,
              method,
            });
            await f(response);
          },
        ),
      );
    },
  );

When I got to that point, I realized that my serve function was getting very... long.
I knew I needed to split it into two functions, receiveStaticFile and sendStaticFile.
But I need to check the Request line to route to the right function, and I can only read the request
once...
I knew that I was in trouble.

I need something that can keep part of the data in memory while retaining access to the raw connection...

...
    if (method === "POST") {
      return receiveStaticFile(?, { targetPath });
    } else if (method === "GET" || method === "HEAD") {
      return sendStaticFile(?, { sourcePath });
    }
...

I could have decoded the request, shoved the connection in there and called it a day...
But it didn't feel right aaaand I guess I love making my life harder.

const request = decodeRequest(connection);
request.connection = connection;

...
    if (method === "POST") {
      return receiveStaticFile(request, { targetPath });
    } else if (method === "GET" || method === "HEAD") {
      return sendStaticFile(request, { sourcePath });
    }
...

The solution I came up with was to write a buffer. It holds only a KB in memory at a time, shifting the bytes
each time I read a new chunk. The advantage is that I can move the cursor back to the beginning of the buffer
and read back the parts that I need.
Best of all, the buffer has the same methods as the connection, so the two can be used interchangeably.
I won't go into the details because it's a bit dry, but if you want to check out the code, it's currently on GitHub.

// https://github.com/i-y-land/HTTP/blob/episode/03/library/utilities.js#L11
export const factorizeBuffer = (r, mk = 1024, ml = 1024) => { ... }

With this new toy I can read a chunk from the connection, route the request, move the cursor back to the beginning and
pass the buffer to the handler function like nothing happened.

The peek function, specifically, has a signature similar to read; the difference is that it moves the cursor
back, reads a chunk from the buffer in memory, and then moves the cursor back again.

serve(
  Deno.listen({ port }),
  async (connection) => {
    const r = factorizeBuffer(connection);

    const xs = new Uint8Array(1024);
    const reader = r.getReader();
    await reader.peek(xs);
    const [method] = decode(readLine(xs)).split(" ");

    if (method !== "GET" && method !== "POST" && method !== "HEAD") {
      return connection.write(
        encodeResponse({ statusCode: 400 }),
      );
    }

    if (method === "POST") {
      return receiveStaticFile(r, { targetPath });
    } else {
      return sendStaticFile(r, { sourcePath });
    }
  }
)
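
To give a feel for the idea without the details, here is a much simplified stand-in. It is not the factorizeBuffer
from the repository: makePeekable is a hypothetical wrapper that only remembers the chunk it peeked and replays it on
the next read, which is enough to route on the Request line and still hand the full request to a handler. The
in-memory readerFrom is likewise made up for the demonstration.

```javascript
// Hypothetical, simplified stand-in for the buffer idea (not the author's implementation).
const makePeekable = (r) => {
  let pending = new Uint8Array(0); // bytes that were peeked but not consumed yet

  // Copy as many pending bytes as fit into `xs`; returns how many were copied.
  const drain = (xs) => {
    const n = Math.min(xs.byteLength, pending.byteLength);
    xs.set(pending.subarray(0, n));
    return n;
  };

  return {
    // Copy the upcoming bytes into `xs` without consuming them.
    async peek(xs) {
      if (pending.byteLength === 0) {
        const ys = new Uint8Array(xs.byteLength);
        const n = await r.read(ys);
        if (n === null) return null;
        pending = ys.subarray(0, n);
      }
      return drain(xs);
    },
    // Same contract as the wrapped reader, serving peeked bytes first.
    async read(xs) {
      if (pending.byteLength > 0) {
        const n = drain(xs);
        pending = pending.subarray(n);
        return n;
      }
      return r.read(xs);
    },
  };
};

// In-memory stand-in for a connection.
const readerFrom = (bytes) => {
  let offset = 0;
  return {
    async read(xs) {
      if (offset >= bytes.byteLength) return null;
      const n = Math.min(xs.byteLength, bytes.byteLength - offset);
      xs.set(bytes.subarray(offset, offset + n));
      offset += n;
      return n;
    },
  };
};

const demo = async () => {
  const r = makePeekable(
    readerFrom(new TextEncoder().encode("HEAD /image.png HTTP/1.1\r\n\r\n")),
  );

  const xs = new Uint8Array(1024);
  const n = await r.peek(xs); // look at the request...
  const [method] = new TextDecoder().decode(xs.subarray(0, n)).split(" ");

  const ys = new Uint8Array(1024);
  const m = await r.read(ys); // ...and a handler can still read all of it
  return { method, n, m };
};

demo().then(({ method, n, m }) => console.log(method, n === m)); // HEAD true
```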

To finish this, like a boss, I finalize the receiveStaticFile (https://github.com/i-y-land/HTTP/blob/episode/03/library/server.js#L15) and sendStaticFile (https://github.com/i-y-land/HTTP/blob/episode/03/library/server.js#L71) functions, taking care of all
the edge cases.
Finally, I run all the integration tests to confirm that I did a good job. And uuugh. Sleeeep.

This one turned out to be a lot more full of surprises than I was prepared for.
When I realized that some clients send files in two steps, it really threw a wrench in my plans...
But it turned out to be an amazing learning opportunity.
I really hope that you are learning as much as I am.
On the bright side, this forced me to put together all the tools that I know I will need for the next post.
Next, I want to look into streaming in more detail and build some middleware, starting with a logger.
From there, I am sure that I can tackle building a nice little router, which will wrap this up pretty nicely.

All of the code is available on GitHub; if you have a question, do not hesitate to ask...
Oh speaking of that, I launched a Discord server, if you want to join.

🐙 You will find the code for this episode here: https://github.com/i-y-land/HTTP/tree/episode/03

💬 You can join the I-Y community on Discord: https://discord.gg/eQfhqybmSc

At any rate, if this article was useful to you, hit the like button, leave a comment to let me know or best of all,
follow if you haven't already!

Ok bye now...
