How to compare two files using .NET, really really fast!

Optimization Problem

Last week, I finished my tasks at work early so I had some time at hand, which I used to optimize some tools at work!

Without any delay let us move onto what I did. SO, there is this in-company tool which is used to serialize files from "SOURCE" folder, to "TARGET" folder. If source file and target file are different, it also copies the source file to target directory. So far so good! However, when I looked at the source code, I noticed that the comparison is done by checking MD5 hashes of both files.

This is an optimization chance for the following reasons:
a) In order to calculate MD5hash, all the bytes of the file has to be read and then put through message-digest algorithm.
b) A matching MD5 hash does not 100% mean, files are the same

*Since MD5 is a 128bit hashing algorithm, Birthday problem predicts on average hashing 264 files will end up with a collision ← well, not much of a possibility anyways

Benchmark

※I used BenchmarkDotNet which is also supported by the .NET Foundation, to benchmark the code I wrote. I am comparing 128mB of same files!

Since I want to have a standardized interface, I started out by designing my FileComparer class. I think I commented good enough so it does not need any further explaining :)

MD5 Comparison

Now to implement our baseline, which is MD5 comparison.

After running BenchmarkDotNet, here we can see the result.

Method Mean Error StdDev Ratio
Md5Comparer 1.302 s 0.0090 s 0.0080 s 1.00

Compare Individual Bytes

Can we do better? Of course! We do not need to calculate the MD5 hash at all. We can do a byte-to-byte check and do an early return as soon as we find a not matching byte!

Wow! ReadWholeFileAtOnce is almost 7 times faster, but it is not memory efficient. It reads whole file into memory even before doing any comparison! But Can we do better? Of course!

Method Mean Error StdDev Ratio
Md5Comparer 1,296.4 ms 11.02 ms 10.31 ms 1.00
ReadWholeFileAtOnce 190.2 ms 2.66 ms 2.49 ms 0.15

LINQ is not that fast

But before doing better I would like to show "efficiency" of LINQ queries.

Method Mean Error StdDev Ratio
Md5Comparer 1,299.5 ms 6.09 ms 5.40 ms 1.00
CompareUsingLinq 830.0 ms 1.86 ms 1.45 ms 0.64
UseSequenceEquals 805.5 ms 7.62 ms 7.13 ms 0.62
ReadWholeFileAtOnce 190.4 ms 2.90 ms 2.71 ms 0.15

As can be seen from the results, simple for loop is 4 times faster than a LINQ query. Of course, it does not mean that this is always the case but for high performance code, it is wise to stay away from LINQ and stick to simple choices.

Read File in Chunks

Continuing on! ...But Can we do better? Of course!
Now is the time to up the optimization game! We will read the files in chunks of certain size and then compare the bytes. Here is the new base class which inherits from FileComparer.cs.

Next thing is to choose a suitable ChunkSize, which should yield a faster speed than reading the whole file at once and also must be better with memory efficiency as well. Here is a benchmark I run on the suitable BufferSize. I chose my candidates as
4096 * 1, 4096 * 2, 4096 * 4, 4096 * 8, 4096 * 16, 4096 * 32, 4096 * 64, 4096 * 128 bytes and turns out 4096 * 32 is a good choice.

So currently, reading files in chunks of 4096 * 32 and comparing them one byte at a time seems to be the fastest method, almost 8 times faster than MD5 comparison

Method BufferSize Mean Ratio
Md5Comparer 1,298.1 ms 1.00
AllFile 190.6 ms 0.15
Chunks_One 131072 168.7 ms 0.13

Reading in Chunks comparing 8 bytes at a time

But Can we do better? Of course! There is a pretty neat trick which you can find in some implementations in glibc, about strlen or memcmp as well. It makes use of comparing word length of bytes at a time! Here is the source code!

So, it looks like comparing 8 bytes at a time is a reasonable trick and we end up with 10 times the speed of MD5 comparison!

Method BufferSize Mean Ratio
Md5Comparer 1,298.1 ms 1.00
AllFile 190.6 ms 0.15
Chunks_One 131072 168.7 ms 0.13
Chunks_Eight 131072 126.9 ms 0.10

Using SIMD

But Can we do better? Of course! By vectorizing our problem at hand, we can use SIMD instructions as long as our hardware supports it. Here is a class using general purpose Vecotr class.

After Vectorizing, we are more than 15 times faster the MD5 comparison speed!

Method BufferSize Mean Error StdDev Ratio
Md5 131072 1,293.05 ms 7.763 ms 6.882 ms 1.00
Vector 131072 81.92 ms 0.240 ms 0.188 ms 0.06

But can we do better? Of course!
I am currently using a CPU which supports AVX2 SIMD instructions. Instead general purpose Vector class, let us use hardware specific instrinsics.

※I guess almost any CPU released after the year 2013 should support it, but I am not sure.

After AVX2, we are more than 16 times faster the MD5 comparison speed!

Method BufferSize Mean Error StdDev Ratio
Md5 131072 1,293.05 ms 7.763 ms 6.882 ms 1.00
AVX2 131072 81.92 ms 0.240 ms 0.188 ms 0.06

20