How to compare two files using .NET, really really fast!
Last week I finished my tasks early, so I had some time on my hands, which I used to optimize some of our in-house tools!
Without any delay, let us move on to what I did. So, there is this in-company tool that synchronizes files from a "SOURCE" folder to a "TARGET" folder: if a source file and its target file are different, it copies the source file over to the target directory. So far so good! However, when I looked at the source code, I noticed that the comparison is done by checking the MD5 hashes of both files.
This is an optimization opportunity, for the following reasons:
a) To calculate an MD5 hash, every byte of the file has to be read and then run through the message-digest algorithm.
b) A matching MD5 hash does not guarantee with 100% certainty that the files are the same.
*Since MD5 is a 128-bit hashing algorithm, the birthday problem predicts that, on average, hashing 2^64 files will end up in a collision ← well, not much of a risk in practice anyway
※I used BenchmarkDotNet, which is also supported by the .NET Foundation, to benchmark the code I wrote. I am comparing two identical 128 MB files!
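As an aside of mine (not from the original tool's docs), the 2^64 figure follows from the standard birthday-bound approximation: for a hash with N possible digests, a collision becomes likely (probability ≈ 50%) after roughly √N inputs:

```latex
k_{0.5} \approx \sqrt{2 \ln 2 \cdot N} \approx 1.18\,\sqrt{N},
\qquad N = 2^{128} \;\Rightarrow\; k_{0.5} \approx 2^{64}
```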
Since I want to have a standardized interface, I started out by designing my FileComparer class. I think the comments are clear enough that it needs no further explanation :)
```csharp
using System;
using System.IO;

public abstract class FileComparer {
    /// <summary>
    /// FileInfo for source file
    /// </summary>
    protected readonly FileInfo FileInfo1;
    /// <summary>
    /// FileInfo for target file
    /// </summary>
    protected readonly FileInfo FileInfo2;
    /// <summary>
    /// Base class for creating a file comparer
    /// </summary>
    /// <param name="filePath01">Absolute path to source file</param>
    /// <param name="filePath02">Absolute path to target file</param>
    protected FileComparer(string filePath01, string filePath02) {
        FileInfo1 = new FileInfo(filePath01);
        FileInfo2 = new FileInfo(filePath02);
        EnsureFilesExist();
    }
    /// <summary>
    /// Compares the two given files and returns true if the files are the same
    /// </summary>
    /// <returns>true if the files are the same, false otherwise</returns>
    public bool Compare() {
        if (IsDifferentLength()) {
            return false;
        }
        if (IsSameFile()) {
            return true;
        }
        return OnCompare();
    }
    /// <summary>
    /// Compares the two given files and returns true if the files are the same
    /// </summary>
    /// <returns>true if the files are the same, false otherwise</returns>
    protected abstract bool OnCompare();
    private bool IsSameFile() {
        return string.Equals(FileInfo1.FullName, FileInfo2.FullName, StringComparison.OrdinalIgnoreCase);
    }
    /// <summary>
    /// Does an early comparison by checking file lengths; if the lengths are not the same, the files are definitely different
    /// </summary>
    /// <returns>true if different length</returns>
    private bool IsDifferentLength() {
        return FileInfo1.Length != FileInfo2.Length;
    }
    /// <summary>
    /// Makes sure files exist
    /// </summary>
    private void EnsureFilesExist() {
        if (FileInfo1.Exists == false) {
            throw new FileNotFoundException("Source file does not exist", FileInfo1.FullName);
        }
        if (FileInfo2.Exists == false) {
            throw new FileNotFoundException("Target file does not exist", FileInfo2.FullName);
        }
    }
}
```
Now to implement our baseline, which is MD5 comparison.
```csharp
using System.Security.Cryptography;

public class Md5Comparer : FileComparer {
    public Md5Comparer(string filePath01, string filePath02) : base(filePath01, filePath02) {
    }
    protected override bool OnCompare() {
        using var fileStream01 = FileInfo1.OpenRead();
        using var fileStream02 = FileInfo2.OpenRead();
        using var md5Creator = MD5.Create();
        var fileStream01Hash = md5Creator.ComputeHash(fileStream01);
        var fileStream02Hash = md5Creator.ComputeHash(fileStream02);
        for (var i = 0; i < fileStream01Hash.Length; i++) {
            if (fileStream01Hash[i] != fileStream02Hash[i]) {
                return false;
            }
        }
        return true;
    }
}
```
After running BenchmarkDotNet, here are the results.
Method | Mean | Error | StdDev | Ratio |
---|---|---|---|---|
Md5Comparer | 1.302 s | 0.0090 s | 0.0080 s | 1.00 |
Can we do better? Of course! We do not need to calculate an MD5 hash at all. We can do a byte-by-byte check and return early as soon as we find a mismatching byte!
```csharp
using System.IO;

public class ReadWholeFileAtOnce : FileComparer {
    public ReadWholeFileAtOnce(string filePath01, string filePath02) : base(filePath01, filePath02) {
    }
    protected override bool OnCompare() {
        var fileContents01 = File.ReadAllBytes(FileInfo1.FullName);
        var fileContents02 = File.ReadAllBytes(FileInfo2.FullName);
        for (var i = 0; i < fileContents01.Length; i++) {
            if (fileContents01[i] != fileContents02[i]) {
                return false;
            }
        }
        return true;
    }
}
```
Wow! `ReadWholeFileAtOnce` is almost 7 times faster, but it is not memory efficient: it reads the whole file into memory before doing any comparison! But can we do better? Of course!
Method | Mean | Error | StdDev | Ratio |
---|---|---|---|---|
Md5Comparer | 1,296.4 ms | 11.02 ms | 10.31 ms | 1.00 |
ReadWholeFileAtOnce | 190.2 ms | 2.66 ms | 2.49 ms | 0.15 |
But before doing better, I would like to show the "efficiency" of LINQ queries.
```csharp
using System.IO;
using System.Linq;

public class ReadWholeFileAtOnceAndCompareUsingLinq : FileComparer {
    public ReadWholeFileAtOnceAndCompareUsingLinq(string filePath01, string filePath02) : base(filePath01, filePath02) {
    }
    protected override bool OnCompare() {
        var fileContents01 = File.ReadAllBytes(FileInfo1.FullName);
        var fileContents02 = File.ReadAllBytes(FileInfo2.FullName);
        return !fileContents01.Where((t, i) => t != fileContents02[i]).Any();
    }
}

public class ReadWholeFileAtOnceAndUseSequenceEquals : FileComparer {
    public ReadWholeFileAtOnceAndUseSequenceEquals(string filePath01, string filePath02) : base(filePath01, filePath02) {
    }
    protected override bool OnCompare() {
        var fileContents01 = File.ReadAllBytes(FileInfo1.FullName);
        var fileContents02 = File.ReadAllBytes(FileInfo2.FullName);
        return fileContents01.SequenceEqual(fileContents02);
    }
}
```
Method | Mean | Error | StdDev | Ratio |
---|---|---|---|---|
Md5Comparer | 1,299.5 ms | 6.09 ms | 5.40 ms | 1.00 |
CompareUsingLinq | 830.0 ms | 1.86 ms | 1.45 ms | 0.64 |
UseSequenceEquals | 805.5 ms | 7.62 ms | 7.13 ms | 0.62 |
ReadWholeFileAtOnce | 190.4 ms | 2.90 ms | 2.71 ms | 0.15 |
As can be seen from the results, the simple for loop is more than 4 times faster than the LINQ query. Of course, this is not always the case, but for high-performance code it is wise to stay away from LINQ and stick to the simple choices.
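A side note of mine, not benchmarked in the original post: the slow `SequenceEqual` above is LINQ's `Enumerable.SequenceEqual`. The span-based `MemoryExtensions.SequenceEqual` is a different method that modern .NET runtimes implement with vectorized code paths, so a sketch like the following would likely land much closer to the hand-rolled loop:

```csharp
using System;

class SpanCompareDemo {
    // Span-based SequenceEqual: same one-liner convenience as LINQ,
    // but the runtime compares the bytes with optimized, vectorized code.
    public static bool AreEqual(byte[] a, byte[] b) {
        return a.AsSpan().SequenceEqual(b);
    }

    static void Main() {
        var x = new byte[] { 1, 2, 3, 4 };
        var y = new byte[] { 1, 2, 3, 4 };
        var z = new byte[] { 1, 2, 3, 5 };
        Console.WriteLine(AreEqual(x, y)); // True
        Console.WriteLine(AreEqual(x, z)); // False
    }
}
```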
Continuing on! ...But can we do better? Of course!
Now is the time to up the optimization game! We will read the files in chunks of a certain size and then compare the bytes. Here is the new base class, which inherits from `FileComparer`.
```csharp
using System.IO;

public abstract class ReadIntoByteBufferInChunks : FileComparer {
    protected readonly int ChunkSize;
    protected ReadIntoByteBufferInChunks(string filePath01, string filePath02, int chunkSize) : base(filePath01, filePath02) {
        ChunkSize = chunkSize;
    }
    protected int ReadIntoBuffer(in Stream stream, in byte[] buffer) {
        var bytesRead = 0;
        while (bytesRead < buffer.Length) {
            var read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
            // Reached end of stream.
            if (read == 0) {
                return bytesRead;
            }
            bytesRead += read;
        }
        return bytesRead;
    }
}
```
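The loop in `ReadIntoBuffer` exists because `Stream.Read` is allowed to return fewer bytes than requested even before the end of the stream, so we must keep reading until the buffer is full or `Read` returns 0. A minimal standalone sketch of the same loop, exercised against a `MemoryStream`:

```csharp
using System;
using System.IO;

class ReadLoopDemo {
    // Same looping pattern as ReadIntoBuffer above: keep calling Read
    // until the buffer is full or the stream ends (Read returns 0).
    public static int ReadIntoBuffer(Stream stream, byte[] buffer) {
        var bytesRead = 0;
        while (bytesRead < buffer.Length) {
            var read = stream.Read(buffer, bytesRead, buffer.Length - bytesRead);
            if (read == 0) {
                return bytesRead; // end of stream
            }
            bytesRead += read;
        }
        return bytesRead;
    }

    static void Main() {
        var data = new byte[10000];
        new Random(42).NextBytes(data);
        using var stream = new MemoryStream(data);
        var buffer = new byte[4096];
        Console.WriteLine(ReadIntoBuffer(stream, buffer)); // 4096 (full buffer)
        ReadIntoBuffer(stream, buffer);
        // Final, partial chunk: 10000 - 2 * 4096 = 1808 bytes.
        Console.WriteLine(ReadIntoBuffer(stream, buffer)); // 1808
    }
}
```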
The next thing is to choose a suitable `ChunkSize`, which should be faster than reading the whole file at once while also being more memory efficient. Here is a benchmark I ran to find a suitable `BufferSize`. I chose my candidates as 4096 * 1, 4096 * 2, 4096 * 4, 4096 * 8, 4096 * 16, 4096 * 32, 4096 * 64 and 4096 * 128 bytes, and it turns out 4096 * 32 is a good choice.

Method | BufferSize | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Md5Comparer | 4096 | 1,300.1 ms | 9.20 ms | 8.60 ms | 1.00 |
ReadWholeFileAtOnce | - | 190.2 ms | 2.66 ms | 2.49 ms | 0.15 |
OneByOne | 4096 | 964.6 ms | 4.46 ms | 4.17 ms | 0.74 |
Md5Comparer | 8192 | 1,295.1 ms | 4.54 ms | 4.25 ms | 1.00 |
OneByOne | 8192 | 552.5 ms | 1.60 ms | 1.50 ms | 0.43 |
Md5Comparer | 16384 | 1,305.8 ms | 7.28 ms | 6.81 ms | 1.00 |
OneByOne | 16384 | 362.5 ms | 7.11 ms | 10.19 ms | 0.28 |
Md5Comparer | 32768 | 1,296.4 ms | 4.16 ms | 3.25 ms | 1.00 |
OneByOne | 32768 | 247.4 ms | 1.06 ms | 0.94 ms | 0.19 |
Md5Comparer | 65536 | 1,300.9 ms | 9.75 ms | 9.12 ms | 1.00 |
OneByOne | 65536 | 195.0 ms | 1.01 ms | 0.95 ms | 0.15 |
Md5Comparer | 131072 | 1,303.8 ms | 6.85 ms | 6.07 ms | 1.00 |
OneByOne | 131072 | 169.6 ms | 0.63 ms | 0.59 ms | 0.13 |
Md5Comparer | 262144 | 1,296.3 ms | 3.57 ms | 3.34 ms | 1.00 |
OneByOne | 262144 | 163.2 ms | 0.56 ms | 0.47 ms | 0.13 |
Md5Comparer | 524288 | 1,288.7 ms | 6.41 ms | 5.99 ms | 1.00 |
OneByOne | 524288 | 161.6 ms | 1.45 ms | 1.29 ms | 0.13 |

So currently, reading files in chunks of 4096 * 32 bytes and comparing them one byte at a time seems to be the fastest method, almost 8 times faster than the MD5 comparison.
Method | BufferSize | Mean | Ratio |
---|---|---|---|
Md5Comparer | - | 1,298.1 ms | 1.00 |
AllFile | - | 190.6 ms | 0.15 |
Chunks_One | 131072 | 168.7 ms | 0.13 |
But can we do better? Of course! There is a pretty neat trick which you can find in some glibc implementations of `strlen` or `memcmp` as well: comparing a whole machine word's worth of bytes at a time! Here is the source code!
```csharp
using System;
using System.IO;

public class ReadWholeFileAtOnceCompareEightByteAtOnce : FileComparer {
    public ReadWholeFileAtOnceCompareEightByteAtOnce(string filePath01, string filePath02) : base(filePath01, filePath02) {
    }
    protected override bool OnCompare() {
        var fileContents01 = File.ReadAllBytes(FileInfo1.FullName);
        var fileContents02 = File.ReadAllBytes(FileInfo2.FullName);
        var lastBlockIndex = fileContents01.Length - (fileContents01.Length % sizeof(ulong));
        var totalProcessed = 0;
        while (totalProcessed < lastBlockIndex) {
            if (BitConverter.ToUInt64(fileContents01, totalProcessed) != BitConverter.ToUInt64(fileContents02, totalProcessed)) {
                return false;
            }
            totalProcessed += sizeof(ulong);
        }
        // Compare the remaining tail bytes that do not fill a whole 8-byte block.
        for (var i = lastBlockIndex; i < fileContents01.Length; i++) {
            if (fileContents01[i] != fileContents02[i]) {
                return false;
            }
        }
        return true;
    }
}
```
So it looks like comparing 8 bytes at a time is a reasonable trick, and we end up with 10 times the speed of the MD5 comparison!
Method | BufferSize | Mean | Ratio |
---|---|---|---|
Md5Comparer | - | 1,298.1 ms | 1.00 |
AllFile | - | 190.6 ms | 0.15 |
Chunks_One | 131072 | 168.7 ms | 0.13 |
Chunks_Eight | 131072 | 126.9 ms | 0.10 |
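As another aside of mine (a variant not in the original tool): the same 8-bytes-at-a-time idea can be written without `BitConverter` by reinterpreting the byte arrays as `ulong` spans via `MemoryMarshal.Cast`, which avoids per-iteration bounds arithmetic. A hedged sketch, with `WordCompareDemo` and `AreEqual` being my own names:

```csharp
using System;
using System.Runtime.InteropServices;

class WordCompareDemo {
    // Compare 8 bytes (one ulong) per iteration by reinterpreting the
    // arrays as ulong spans, then handle the leftover tail bytes.
    public static bool AreEqual(byte[] a, byte[] b) {
        if (a.Length != b.Length) return false;
        var wa = MemoryMarshal.Cast<byte, ulong>(a);
        var wb = MemoryMarshal.Cast<byte, ulong>(b);
        for (var i = 0; i < wa.Length; i++) {
            if (wa[i] != wb[i]) return false;
        }
        // Bytes that did not fill a whole ulong.
        for (var i = wa.Length * sizeof(ulong); i < a.Length; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    static void Main() {
        var x = new byte[20];
        var y = new byte[20];
        y[19] = 1; // difference sits in the 4-byte tail
        Console.WriteLine(AreEqual(x, x)); // True
        Console.WriteLine(AreEqual(x, y)); // False
    }
}
```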
But can we do better? Of course! By vectorizing the problem at hand, we can use SIMD instructions, as long as our hardware supports them. Here is a class using the general-purpose `Vector<T>` class.
```csharp
using System.IO;
using System.Numerics;

public class ReadFileInChunksAndCompareVector : ReadIntoByteBufferInChunks {
    public ReadFileInChunksAndCompareVector(string filePath01, string filePath02, int chunkSize)
        : base(filePath01, filePath02, chunkSize) {
    }
    protected override bool OnCompare() {
        using var stream1 = FileInfo1.OpenRead();
        using var stream2 = FileInfo2.OpenRead();
        return StreamAreEqual(stream1, stream2);
    }
    private bool StreamAreEqual(in Stream stream1, in Stream stream2) {
        var buffer1 = new byte[ChunkSize];
        var buffer2 = new byte[ChunkSize];
        while (true) {
            var count1 = ReadIntoBuffer(stream1, buffer1);
            var count2 = ReadIntoBuffer(stream2, buffer2);
            if (count1 != count2) {
                return false;
            }
            if (count1 == 0) {
                return true;
            }
            var totalProcessed = 0;
            while (totalProcessed < buffer1.Length) {
                if (Vector.EqualsAll(new Vector<byte>(buffer1, totalProcessed), new Vector<byte>(buffer2, totalProcessed)) == false) {
                    return false;
                }
                totalProcessed += Vector<byte>.Count;
            }
        }
    }
}
```
After vectorizing, we are more than 15 times faster than the MD5 comparison!
Method | BufferSize | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Md5 | 131072 | 1,293.05 ms | 7.763 ms | 6.882 ms | 1.00 |
Vector | 131072 | 81.92 ms | 0.240 ms | 0.188 ms | 0.06 |
But can we do better? Of course!
I am currently using a CPU which supports AVX2 SIMD instructions. Instead of the general-purpose `Vector<T>` class, let us use hardware-specific intrinsics.
※I guess almost any x86 CPU released after the year 2013 should support AVX2, but I am not sure.
```csharp
using System.IO;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public class ReadFileInChunksAndCompareAvx2 : ReadIntoByteBufferInChunks {
    public ReadFileInChunksAndCompareAvx2(string filePath01, string filePath02, int chunkSize)
        : base(filePath01, filePath02, chunkSize) {
    }
    protected override bool OnCompare() {
        using var stream1 = FileInfo1.OpenRead();
        using var stream2 = FileInfo2.OpenRead();
        return StreamAreEqual(stream1, stream2);
    }
    private unsafe bool StreamAreEqual(in Stream stream1, in Stream stream2) {
        var buffer1 = new byte[ChunkSize];
        var buffer2 = new byte[ChunkSize];
        fixed (byte* oh1 = buffer1) {
            fixed (byte* oh2 = buffer2) {
                while (true) {
                    var count1 = ReadIntoBuffer(stream1, buffer1);
                    var count2 = ReadIntoBuffer(stream2, buffer2);
                    if (count1 != count2) {
                        return false;
                    }
                    if (count1 == 0) {
                        return true;
                    }
                    var totalProcessed = 0;
                    while (totalProcessed < buffer1.Length) {
                        // Advance the pointers by the bytes already processed; MoveMask
                        // returns -1 only when all 32 byte pairs compared equal.
                        var result = Avx2.CompareEqual(Avx.LoadVector256(oh1 + totalProcessed), Avx.LoadVector256(oh2 + totalProcessed));
                        if (Avx2.MoveMask(result) != -1) {
                            return false;
                        }
                        totalProcessed += Vector256<byte>.Count;
                    }
                }
            }
        }
    }
}
```
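One caveat worth adding: the intrinsics path should only run when the CPU actually reports AVX2 support, which can be checked at runtime via `Avx2.IsSupported`. A hedged sketch of that guard with a scalar fallback, using `MemoryMarshal.Cast` to avoid `unsafe` code; `SimdFallbackDemo` and its helper names are mine, not from the original tool:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class SimdFallbackDemo {
    // Pick the AVX2 path only when the CPU supports it; otherwise
    // fall back to a plain byte-by-byte loop.
    public static bool BuffersAreEqual(byte[] a, byte[] b) {
        if (a.Length != b.Length) return false;
        return Avx2.IsSupported ? CompareAvx2(a, b) : CompareScalar(a, b, 0);
    }

    static bool CompareAvx2(byte[] a, byte[] b) {
        // Reinterpret the arrays as 32-byte vectors without copying.
        var va = MemoryMarshal.Cast<byte, Vector256<byte>>(a);
        var vb = MemoryMarshal.Cast<byte, Vector256<byte>>(b);
        for (var i = 0; i < va.Length; i++) {
            // MoveMask is -1 (all 32 bits set) only if every byte pair matched.
            if (Avx2.MoveMask(Avx2.CompareEqual(va[i], vb[i])) != -1) return false;
        }
        // Scalar tail for bytes that did not fill a whole vector.
        return CompareScalar(a, b, va.Length * Vector256<byte>.Count);
    }

    static bool CompareScalar(byte[] a, byte[] b, int start) {
        for (var i = start; i < a.Length; i++) {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    static void Main() {
        var x = new byte[100];
        var y = new byte[100];
        x[99] = 1; // difference in the scalar tail
        Console.WriteLine(BuffersAreEqual(x, y)); // False
        y[99] = 1;
        Console.WriteLine(BuffersAreEqual(x, y)); // True
    }
}
```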
After AVX2, we are nearly 16 times faster than the MD5 comparison!
Method | BufferSize | Mean | Error | StdDev | Ratio |
---|---|---|---|---|---|
Md5 | 131072 | 1,293.05 ms | 7.763 ms | 6.882 ms | 1.00 |
AVX2 | 131072 | 81.92 ms | 0.240 ms | 0.188 ms | 0.06 |