How to compare two files using .NET, really really fast!
Last week I finished my tasks early, so I had some time on my hands, which I used to optimize some of our in-house tools!
Without any delay, let us move on to what I did. So, there is this in-company tool that synchronizes files from a "SOURCE" folder to a "TARGET" folder: if a source file and its target file are different, it copies the source file over to the target directory. So far so good! However, when I looked at the source code, I noticed that the comparison is done by checking the MD5 hashes of both files.
This is an optimization opportunity, for the following reasons:
a) To calculate an MD5 hash, every byte of the file has to be read and then run through the message-digest algorithm.
b) A matching MD5 hash does not guarantee with 100% certainty that the files are the same.
*Since MD5 is a 128-bit hashing algorithm, the birthday problem predicts that, on average, hashing 2^64 files will end up in a collision ← well, not much of a risk in practice anyway
※I used BenchmarkDotNet, which is also supported by the .NET Foundation, to benchmark the code I wrote. I am comparing two identical 128 MB files!
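As an aside of mine (not from the original tool's docs), the 2^64 figure follows from the standard birthday-bound approximation: for a hash with N possible digests, a collision becomes likely (probability ≈ 50%) after roughly √N inputs:

```latex
k_{0.5} \approx \sqrt{2 \ln 2 \cdot N} \approx 1.18\,\sqrt{N},
\qquad N = 2^{128} \;\Rightarrow\; k_{0.5} \approx 2^{64}
```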
Since I want to have a standardized interface, I started out by designing my FileComparer class. I think the comments are clear enough that it needs no further explanation :)
```csharp
using System;
using System.IO;

public abstract class FileComparer {
    /// <summary>
    /// FileInfo for source file
    /// </summary>
    protected readonly FileInfo FileInfo1;
    /// <summary>
    /// FileInfo for target file
    /// </summary>
    protected readonly FileInfo FileInfo2;
    /// <summary>
    /// Base class for creating a file comparer
    /// </summary>
    /// <param name="filePath01">Absolute path to source file</param>
    /// <param name="filePath02">Absolute path to target file</param>
    protected FileComparer(string filePath01, string filePath02) {
        FileInfo1 = new FileInfo(filePath01);
        FileInfo2 = new FileInfo(filePath02);
        EnsureFilesExist();
    }
    /// <summary>
    /// Compares the two given files and returns true if the files are the same
    /// </summary>
    /// <returns>true if the files are the same, false otherwise</returns>
    public bool Compare() {
        if (IsDifferentLength()) {
            return false;
        }
        if (IsSameFile()) {
            return true;
        }
        return OnCompare();
    }
    /// <summary>
    /// Compares the two given files and returns true if the files are the same
    /// </summary>
    /// <returns>true if the files are the same, false otherwise</returns>
    protected abstract bool OnCompare();
    private bool IsSameFile() {
        return string.Equals(FileInfo1.FullName, FileInfo2.FullName, StringComparison.OrdinalIgnoreCase);
    }
    /// <summary>
    /// Does an early comparison by checking file lengths; if the lengths are not the same, the files are definitely different
    /// </summary>
    /// <returns>true if different length</returns>
    private bool IsDifferentLength() {
        return FileInfo1.Length != FileInfo2.Length;
    }
    /// <summary>
    /// Makes sure files exist
    /// </summary>
    private void EnsureFilesExist() {
        if (FileInfo1.Exists == false) {
            throw new FileNotFoundException("Source file does not exist", FileInfo1.FullName);
        }
        if (FileInfo2.Exists == false) {
            throw new FileNotFoundException("Target file does not exist", FileInfo2.FullName);
        }
    }
}
```
Now to implement our baseline, which is MD5 comparison.
```csharp
using System.Security.Cryptography;

public class Md5Comparer : FileComparer {
    public Md5Comparer(string filePath01, string filePath02) : base(filePath01, filePath02) {
    }
    protected override bool OnCompare() {
        using var fileStream01 = FileInfo1.OpenRead();
        using var fileStream02 = FileInfo2.OpenRead();
        using var md5Creator = MD5.Create();
        var fileStream01Hash = md5Creator.ComputeHash(fileStream01);
        var fileStream02Hash = md5Creator.ComputeHash(fileStream02);
        for (var i = 0; i < fileStream01Hash.Length; i++) {
            if (fileStream01Hash[i] != fileStream02Hash[i]) {
                return false;
            }
        }
        return true;
    }
}
```
After running BenchmarkDotNet, here are the results.
Method | Mean | Error | StdDev | Ratio |
---|---|---|---|---|
Md5Comparer | 1.302 s | 0.0090 s | 0.0080 s | 1.00 |
Can we do better? Of course! We do not need to calculate an MD5 hash at all. We can do a byte-by-byte check and return early as soon as we find a mismatching byte!
```csharp
using System.IO;

public class ReadWholeFileAtOnce : FileComparer {
    public ReadWholeFileAtOnce(string filePath01, string filePath02) : base(filePath01, filePath02) {
    }
    protected override bool OnCompare() {
        var fileContents01 = File.ReadAllBytes(FileInfo1.FullName);
        var fileContents02 = File.ReadAllBytes(FileInfo2.FullName);
        for (var i = 0; i < fileContents01.Length; i++) {
            if (fileContents01[i] != fileContents02[i]) {
                return false;
            }
        }
        return true;
    }
}
```
Wow! `ReadWholeFileAtOnce` is almost 7 times faster, but it is not memory efficient: it reads the whole file into memory before doing any comparison! But can we do better? Of course!
Method | Mean | Error | StdDev | Ratio |
---|---|---|---|---|
Md5Comparer | 1,296.4 ms | 11.02 ms | 10.31 ms | 1.00 |
ReadWholeFileAtOnce | 190.2 ms | 2.66 ms | 2.49 ms | 0.15 |
But before doing better, I would like to show the "efficiency" of LINQ queries.
```csharp
using System.IO;
using System.Linq;

public class ReadWholeFileAtOnceAndCompareUsingLinq : FileComparer {
    public ReadWholeFileAtOnceAndCompareUsingLinq(string filePath01, string filePath02) : base(filePath01, filePath02) {
    }
    protected override bool OnCompare() {
        var fileContents01 = File.ReadAllBytes(FileInfo1.FullName);
        var fileContents02 = File.ReadAllBytes(FileInfo2.FullName);
        return !fileContents01.Where((t, i) => t != fileContents02[i]).Any();
    }
}

public class ReadWholeFileAtOnceAndUseSequenceEquals : FileComparer {
    public ReadWholeFileAtOnceAndUseSequenceEquals(string filePath01, string filePath02) : base(filePath01, filePath02) {
    }
    protected override bool OnCompare() {
        var fileContents01 = File.ReadAllBytes(FileInfo1.FullName);
        var fileContents02 = File.ReadAllBytes(FileInfo2.FullName);
        return fileContents01.SequenceEqual(fileContents02);
    }
}
```
Method | Mean | Error | StdDev | Ratio |
---|---|---|---|---|
Md5Comparer | 1,299.5 ms | 6.09 ms | 5.40 ms | 1.00 |
CompareUsingLinq | 830.0 ms | 1.86 ms | 1.45 ms | 0.64 |
UseSequenceEquals | 805.5 ms | 7.62 ms | 7.13 ms | 0.62 |
ReadWholeFileAtOnce | 190.4 ms | 2.90 ms | 2.71 ms | 0.15 |
As can be seen from the results, the simple for loop is more than 4 times faster than the LINQ query. Of course, this is not always the case, but for high-performance code it is wise to stay away from LINQ and stick to the simple choices.
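A side note of mine, not benchmarked in the original post: the slow `SequenceEqual` above is LINQ's `Enumerable.SequenceEqual`. The span-based `MemoryExtensions.SequenceEqual` is a different method that modern .NET runtimes implement with vectorized code paths, so a sketch like the following would likely land much closer to the hand-rolled loop:

```csharp
using System;

class SpanCompareDemo {
    // Span-based SequenceEqual: same one-liner convenience as LINQ,
    // but the runtime compares the bytes with optimized, vectorized code.
    public static bool AreEqual(byte[] a, byte[] b) {
        return a.AsSpan().SequenceEqual(b);
    }

    static void Main() {
        var x = new byte[] { 1, 2, 3, 4 };
        var y = new byte[] { 1, 2, 3, 4 };
        var z = new byte[] { 1, 2, 3, 5 };
        Console.WriteLine(AreEqual(x, y)); // True
        Console.WriteLine(AreEqual(x, z)); // False
    }
}
```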
Continuing on! ...But can we do better? Of course!
Now is the time to up the optimization game! We will read the files in chunks of a certain size and then compare the bytes. Here is the new base class, which inherits from `FileComparer`.
```csharp
using System.IO;

public abstract class ReadIntoByteBufferInChunks : FileComparer {
    protected readonly int ChunkSize;
    protected ReadIntoByteBufferInChunks(string filePath01, string filePath02, int chunkSize) : base(filePath01, filePath02) {
        ChunkSize = chunkSize;
    }
    protected int ReadIntoBuffer(in Stream stream, in byte[] buffer) {
        var bytesRead = 0;
        while (bytesRead < buffer.Length) {
            var read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
            // Reached end of stream.
            if (read == 0) {
                return bytesRead;
            }
            bytesRead += read;
        }
        return bytesRead;
    }
}
```
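The loop in `ReadIntoBuffer` exists because `Stream.Read` is allowed to return fewer bytes than requested even before the end of the stream, so we must keep reading until the buffer is full or `Read` returns 0. A minimal standalone sketch of the same loop, exercised against a `MemoryStream`:

```csharp
using System;
using System.IO;

class ReadLoopDemo {
    // Same looping pattern as ReadIntoBuffer above: keep calling Read
    // until the buffer is full or the stream ends (Read returns 0).
    public static int ReadIntoBuffer(Stream stream, byte[] buffer) {
        var bytesRead = 0;
        while (bytesRead < buffer.Length) {
            var read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
            if (read == 0) {
                return bytesRead; // end of stream
            }
            bytesRead += read;
        }
        return bytesRead;
    }

    static void Main() {
        var data = new byte[10000];
        new Random(42).NextBytes(data);
        using var stream = new MemoryStream(data);
        var buffer = new byte[4096];
        Console.WriteLine(ReadIntoBuffer(stream, buffer)); // 4096 (full buffer)
        ReadIntoBuffer(stream, buffer);
        // Final, partial chunk: 10000 - 2 * 4096 = 1808 bytes.
        Console.WriteLine(ReadIntoBuffer(stream, buffer)); // 1808
    }
}
```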
The next thing is to choose a suitable `ChunkSize`, which should be faster than reading the whole file at once while also being more memory efficient. Here is a benchmark I ran to find a suitable `BufferSize`. I chose my candidates as 4096 * 1, 4096 * 2, 4096 * 4, 4096 * 8, 4096 * 16, 4096 * 32, 4096 * 64 and 4096 * 128 bytes, and it turns out 4096 * 32 is a good choice.

Method | BufferSize | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Md5Comparer | 4096 | 1,300.1 ms | 9.20 ms | 8.60 ms | 1.00 |
ReadWholeFileAtOnce | - | 190.2 ms | 2.66 ms | 2.49 ms | 0.15 |
OneByOne | 4096 | 964.6 ms | 4.46 ms | 4.17 ms | 0.74 |
Md5Comparer | 8192 | 1,295.1 ms | 4.54 ms | 4.25 ms | 1.00 |
OneByOne | 8192 | 552.5 ms | 1.60 ms | 1.50 ms | 0.43 |
Md5Comparer | 16384 | 1,305.8 ms | 7.28 ms | 6.81 ms | 1.00 |
OneByOne | 16384 | 362.5 ms | 7.11 ms | 10.19 ms | 0.28 |
Md5Comparer | 32768 | 1,296.4 ms | 4.16 ms | 3.25 ms | 1.00 |
OneByOne | 32768 | 247.4 ms | 1.06 ms | 0.94 ms | 0.19 |
Md5Comparer | 65536 | 1,300.9 ms | 9.75 ms | 9.12 ms | 1.00 |
OneByOne | 65536 | 195.0 ms | 1.01 ms | 0.95 ms | 0.15 |
Md5Comparer | 131072 | 1,303.8 ms | 6.85 ms | 6.07 ms | 1.00 |
OneByOne | 131072 | 169.6 ms | 0.63 ms | 0.59 ms | 0.13 |
Md5Comparer | 262144 | 1,296.3 ms | 3.57 ms | 3.34 ms | 1.00 |
OneByOne | 262144 | 163.2 ms | 0.56 ms | 0.47 ms | 0.13 |
Md5Comparer | 524288 | 1,288.7 ms | 6.41 ms | 5.99 ms | 1.00 |
OneByOne | 524288 | 161.6 ms | 1.45 ms | 1.29 ms | 0.13 |

So currently, reading files in chunks of 4096 * 32 bytes and comparing them one byte at a time seems to be the fastest method, almost 8 times faster than the MD5 comparison.
Method | BufferSize | Mean | Ratio |
---|---|---|---|
Md5Comparer | - | 1,298.1 ms | 1.00 |
AllFile | - | 190.6 ms | 0.15 |
Chunks_One | 131072 | 168.7 ms | 0.13 |
But can we do better? Of course! There is a pretty neat trick which you can find in some glibc implementations of `strlen` or `memcmp` as well: comparing a whole machine word's worth of bytes at a time! Here is the source code!
```csharp
using System;
using System.IO;

public class ReadWholeFileAtOnceCompareEightByteAtOnce : FileComparer {
    public ReadWholeFileAtOnceCompareEightByteAtOnce(string filePath01, string filePath02) : base(filePath01, filePath02) {
    }
    protected override bool OnCompare() {
        var fileContents01 = File.ReadAllBytes(FileInfo1.FullName);
        var fileContents02 = File.ReadAllBytes(FileInfo2.FullName);
        var lastBlockIndex = fileContents01.Length - (fileContents01.Length % sizeof(ulong));
        var totalProcessed = 0;
        while (totalProcessed < lastBlockIndex) {
            if (BitConverter.ToUInt64(fileContents01, totalProcessed) != BitConverter.ToUInt64(fileContents02, totalProcessed)) {
                return false;
            }
            totalProcessed += sizeof(ulong);
        }
        // Compare the remaining tail bytes that do not fill a whole 8-byte block.
        for (var i = lastBlockIndex; i < fileContents01.Length; i++) {
            if (fileContents01[i] != fileContents02[i]) {
                return false;
            }
        }
        return true;
    }
}
```
So it looks like comparing 8 bytes at a time is a reasonable trick, and we end up with 10 times the speed of the MD5 comparison!
Method | BufferSize | Mean | Ratio |
---|---|---|---|
Md5Comparer | - | 1,298.1 ms | 1.00 |
AllFile | - | 190.6 ms | 0.15 |
Chunks_One | 131072 | 168.7 ms | 0.13 |
Chunks_Eight | 131072 | 126.9 ms | 0.10 |
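As another aside of mine (a variant not in the original tool): the same 8-bytes-at-a-time idea can be written without `BitConverter` by reinterpreting the byte arrays as `ulong` spans via `MemoryMarshal.Cast`, which avoids per-iteration bounds arithmetic. A hedged sketch, with `WordCompareDemo` and `AreEqual` being my own names:

```csharp
using System;
using System.Runtime.InteropServices;

class WordCompareDemo {
    // Compare 8 bytes (one ulong) per iteration by reinterpreting the
    // arrays as ulong spans, then handle the leftover tail bytes.
    public static bool AreEqual(byte[] a, byte[] b) {
        if (a.Length != b.Length) return false;
        var wa = MemoryMarshal.Cast<byte, ulong>(a);
        var wb = MemoryMarshal.Cast<byte, ulong>(b);
        for (var i = 0; i < wa.Length; i++) {
            if (wa[i] != wb[i]) return false;
        }
        // Bytes that did not fill a whole ulong.
        for (var i = wa.Length * sizeof(ulong); i < a.Length; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    static void Main() {
        var x = new byte[20];
        var y = new byte[20];
        y[19] = 1; // difference sits in the 4-byte tail
        Console.WriteLine(AreEqual(x, x)); // True
        Console.WriteLine(AreEqual(x, y)); // False
    }
}
```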
But can we do better? Of course! By vectorizing the problem at hand, we can use SIMD instructions, as long as our hardware supports them. Here is a class using the general-purpose `Vector<T>` class.
```csharp
using System.IO;
using System.Numerics;

public class ReadFileInChunksAndCompareVector : ReadIntoByteBufferInChunks {
    public ReadFileInChunksAndCompareVector(string filePath01, string filePath02, int chunkSize)
        : base(filePath01, filePath02, chunkSize) {
    }
    protected override bool OnCompare() {
        using var stream1 = FileInfo1.OpenRead();
        using var stream2 = FileInfo2.OpenRead();
        return StreamAreEqual(stream1, stream2);
    }
    private bool StreamAreEqual(in Stream stream1, in Stream stream2) {
        var buffer1 = new byte[ChunkSize];
        var buffer2 = new byte[ChunkSize];
        while (true) {
            var count1 = ReadIntoBuffer(stream1, buffer1);
            var count2 = ReadIntoBuffer(stream2, buffer2);
            if (count1 != count2) {
                return false;
            }
            if (count1 == 0) {
                return true;
            }
            var totalProcessed = 0;
            while (totalProcessed < buffer1.Length) {
                if (Vector.EqualsAll(new Vector<byte>(buffer1, totalProcessed), new Vector<byte>(buffer2, totalProcessed)) == false) {
                    return false;
                }
                totalProcessed += Vector<byte>.Count;
            }
        }
    }
}
```
After vectorizing, we are more than 15 times faster than the MD5 comparison!
Method | BufferSize | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Md5 | 131072 | 1,293.05 ms | 7.763 ms | 6.882 ms | 1.00 |
Vector | 131072 | 81.92 ms | 0.240 ms | 0.188 ms | 0.06 |
But can we do better? Of course!
I am currently using a CPU which supports AVX2 SIMD instructions. Instead of the general-purpose `Vector<T>` class, let us use hardware-specific intrinsics.
※I guess almost any x86 CPU released after the year 2013 should support AVX2, but I am not sure.
```csharp
using System.IO;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public class ReadFileInChunksAndCompareAvx2 : ReadIntoByteBufferInChunks {
    public ReadFileInChunksAndCompareAvx2(string filePath01, string filePath02, int chunkSize)
        : base(filePath01, filePath02, chunkSize) {
    }
    protected override bool OnCompare() {
        using var stream1 = FileInfo1.OpenRead();
        using var stream2 = FileInfo2.OpenRead();
        return StreamAreEqual(stream1, stream2);
    }
    private unsafe bool StreamAreEqual(in Stream stream1, in Stream stream2) {
        var buffer1 = new byte[ChunkSize];
        var buffer2 = new byte[ChunkSize];
        fixed (byte* oh1 = buffer1) {
            fixed (byte* oh2 = buffer2) {
                while (true) {
                    var count1 = ReadIntoBuffer(stream1, buffer1);
                    var count2 = ReadIntoBuffer(stream2, buffer2);
                    if (count1 != count2) {
                        return false;
                    }
                    if (count1 == 0) {
                        return true;
                    }
                    var totalProcessed = 0;
                    while (totalProcessed < buffer1.Length) {
                        // Advance the pointers by the bytes already processed; MoveMask
                        // returns -1 only when all 32 byte pairs compared equal.
                        var result = Avx2.CompareEqual(Avx.LoadVector256(oh1 + totalProcessed), Avx.LoadVector256(oh2 + totalProcessed));
                        if (Avx2.MoveMask(result) != -1) {
                            return false;
                        }
                        totalProcessed += Vector256<byte>.Count;
                    }
                }
            }
        }
    }
}
```
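One caveat worth adding: the intrinsics path should only run when the CPU actually reports AVX2 support, which can be checked at runtime via `Avx2.IsSupported`. A hedged sketch of that guard with a scalar fallback, using `MemoryMarshal.Cast` to avoid `unsafe` code; `SimdFallbackDemo` and its helper names are mine, not from the original tool:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class SimdFallbackDemo {
    // Pick the AVX2 path only when the CPU supports it; otherwise
    // fall back to a plain byte-by-byte loop.
    public static bool BuffersAreEqual(byte[] a, byte[] b) {
        if (a.Length != b.Length) return false;
        return Avx2.IsSupported ? CompareAvx2(a, b) : CompareScalar(a, b, 0);
    }

    static bool CompareAvx2(byte[] a, byte[] b) {
        // Reinterpret the arrays as 32-byte vectors without copying.
        var va = MemoryMarshal.Cast<byte, Vector256<byte>>(a);
        var vb = MemoryMarshal.Cast<byte, Vector256<byte>>(b);
        for (var i = 0; i < va.Length; i++) {
            // MoveMask is -1 (all 32 bits set) only if every byte pair matched.
            if (Avx2.MoveMask(Avx2.CompareEqual(va[i], vb[i])) != -1) return false;
        }
        // Scalar tail for bytes that did not fill a whole vector.
        return CompareScalar(a, b, va.Length * Vector256<byte>.Count);
    }

    static bool CompareScalar(byte[] a, byte[] b, int start) {
        for (var i = start; i < a.Length; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    static void Main() {
        var x = new byte[100];
        var y = new byte[100];
        x[99] = 1; // difference in the scalar tail
        Console.WriteLine(BuffersAreEqual(x, y)); // False
        y[99] = 1;
        Console.WriteLine(BuffersAreEqual(x, y)); // True
    }
}
```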
After AVX2, we are nearly 16 times faster than the MD5 comparison!
Method | BufferSize | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Md5 | 131072 | 1,293.05 ms | 7.763 ms | 6.882 ms | 1.00 |
AVX2 | 131072 | 81.92 ms | 0.240 ms | 0.188 ms | 0.06 |