All The Ways to Compress and Archive Files in Python
Python's standard library provides great modules and tools for pretty much any task you can think of, and modules for working with compressed files are no exception. Whether it's basics like tar and zip, specific tools or formats such as gzip and bz2, or even more exotic formats like lzma, Python has it all. With all these options, though, deciding on the right tool for the task at hand might not be so obvious. So, to help you navigate the available options, in this article we will explore all of these modules and learn how to compress, decompress, verify, test and secure archives in all kinds of formats with the help of Python's standard library.
As mentioned above, Python has a library for (almost) every tool and format imaginable. So, let's first take a look at each of them and see why you might want to use it:
zlib is a library and Python module that provides code for working with the Deflate compression and decompression format, which is used by zip, gzip and many others. So, by using this Python module, you're essentially using a gzip-compatible compression algorithm without the convenient wrapper. More about this library can be found on Wikipedia.
bz2 is a module that provides support for bzip2 compression. This algorithm generally achieves better compression ratios than the Deflate method, but might be slower. It also works only on individual files and therefore can't create archives.
lzma is both the name of the algorithm and of the Python module. It can produce a higher compression ratio than some older methods and is the algorithm behind the xz utility (more specifically, LZMA2).
gzip is a utility most of us are familiar with. It's also the name of a Python module. This module uses the already mentioned zlib module for compression and serves as an interface similar to the gzip and gunzip utilities.
shutil is a module we generally don't associate with compression and decompression, but it provides utility functions for working with archives and can be a convenient way of producing tar, gztar, zip, bztar or xztar archives.
zipfile, as the name suggests, allows us to work with zip archives in Python. This module provides all the expected methods for creating, reading, writing or appending to ZIP files, as well as classes and objects for easier manipulation of such files.
tarfile, as with zipfile above, is used for working with tar archives, as you can probably guess. It can read and write gzip, bz2 and lzma compressed archives. It also supports other features we know from the tar utility; a list of those is available at the top of the module's docs page.
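The list above mentions that shutil can produce archives, so here is a minimal sketch of what that might look like. The scratch directory, the "project" folder and the "backup" archive name are made up for this illustration; only make_archive and unpack_archive come from the standard library.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical scratch layout, created only for this illustration.
workdir = Path(tempfile.mkdtemp())
source_dir = workdir / "project"
source_dir.mkdir()
(source_dir / "notes.txt").write_text("hello archive\n")

# make_archive appends the extension itself and returns the archive path.
archive_path = shutil.make_archive(
    base_name=str(workdir / "backup"),  # archive name without extension
    format="gztar",                     # or "zip", "tar", "bztar", "xztar"
    root_dir=str(source_dir),           # directory whose contents are archived
)
print(archive_path)

# unpack_archive infers the format from the file extension.
restore_dir = workdir / "restored"
shutil.unpack_archive(archive_path, str(restore_dir))
restored_text = (restore_dir / "notes.txt").read_text()
print(restored_text)
```

This is handy when you just want "a directory as one file" and don't need the finer control that zipfile or tarfile offer.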
We've got plenty of libraries to choose from. Some of them are more basic, some come with a lot of extra features, but what they all have in common is that they (obviously) include functions for compression. So, let's see how we can perform these basic operations with each of them:
First up, zlib. This is a fairly low-level library and therefore might not be so commonly used, so let's just look at the basic compression and decompression of a whole file at once:
import zlib

filename_in = "data"
filename_out = "compressed_data"

with open(filename_in, mode="rb") as fin, open(filename_out, mode="wb") as fout:
    data = fin.read()
    compressed_data = zlib.compress(data, zlib.Z_BEST_COMPRESSION)
    # len() gives the payload size; sys.getsizeof() would also count the
    # 33 bytes of overhead of the bytes object itself
    print(f"Original size: {len(data)}")
    # Original size: 1000000
    print(f"Compressed size: {len(compressed_data)}")
    # Compressed size: 991
    fout.write(compressed_data)

with open(filename_out, mode="rb") as fin:
    data = fin.read()
    decompressed_data = zlib.decompress(data)
    print(f"Compressed size: {len(data)}")
    # Compressed size: 991
    print(f"Decompressed size: {len(decompressed_data)}")
    # Decompressed size: 1000000
In the above code we use an input file that was generated with head -c 1MB </dev/zero > data, which gives us 1 MB (1,000,000 bytes) of zeroes. We open and read this file into memory and then use the compress function to create the compressed data. This data is then written into the output file. To demonstrate that we are able to recover the data, we then open the compressed file again and use the decompress function on it. From the print statements we can see that the decompressed data has the same size as the original input.
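Reading the whole file into memory is fine for 1 MB, but zlib also offers streaming objects for larger inputs. The following is a rough sketch of that idea; compress_stream and the generated chunks are hypothetical helpers for this example, while compressobj and decompressobj are the actual zlib functions.

```python
import zlib

def compress_stream(chunks, level=zlib.Z_BEST_COMPRESSION):
    """Yield compressed pieces for an iterable of bytes chunks."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        piece = compressor.compress(chunk)
        if piece:  # the compressor may buffer input and return b""
            yield piece
    yield compressor.flush()  # emit whatever is still buffered

# Hypothetical input: 1,000 chunks of 1,000 zero bytes (1 MB in total).
chunks = (bytes(1000) for _ in range(1000))
compressed = b"".join(compress_stream(chunks))

# decompressobj is the streaming counterpart of zlib.decompress.
decompressor = zlib.decompressobj()
restored = decompressor.decompress(compressed) + decompressor.flush()

print(len(compressed), len(restored))
```

The result is an ordinary zlib stream, so the one-shot zlib.decompress can read it as well.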
The next format and library you can use is bz2. It can be used in a very similar fashion to zlib above:
import bz2, os

filename_in = "data"
filename_out = "compressed_data.bz2"

with open(filename_in, mode="rb") as fin, bz2.open(filename_out, "wb") as fout:
    fout.write(fin.read())

print(f"Uncompressed size: {os.stat(filename_in).st_size}")
# Uncompressed size: 1000000
print(f"Compressed size: {os.stat(filename_out).st_size}")
# Compressed size: 48

with bz2.open(filename_out, "rb") as fin:
    data = fin.read()
    # len() counts the decompressed bytes without the object overhead
    print(f"Decompressed size: {len(data)}")
    # Decompressed size: 1000000
Unsurprisingly, the interface of these modules is pretty much identical, so to show something different, in the above example we reduced the compression step to pretty much a single line and used os.stat to inspect the size of the files.
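Since the three low-level modules expose such similar one-shot functions, it is easy to compare them side by side. Here is a small sketch that does so on an in-memory sample; the sample text is made up, and the resulting sizes will differ for your own data, so treat it as a way to experiment rather than a benchmark.

```python
import bz2
import lzma
import zlib

# Hypothetical sample: repetitive text, so every codec has something to find.
sample = b"the quick brown fox jumps over the lazy dog\n" * 20_000

codecs = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(sample)
    assert decompress(packed) == sample  # round trip must be lossless
    sizes[name] = len(packed)
    print(f"{name:>4}: {len(sample)} -> {len(packed)} bytes")
```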
The last of these low-level modules is lzma, and to avoid showing the same code over and over again, let's do an incremental compression this time:
import lzma, os

lzc = lzma.LZMACompressor()

# cat /usr/share/dict/words | sort -R | head -c 1MB > data
filename_in = "data"
filename_out = "compressed_data.xz"

with open(filename_in, mode="rb") as fin, open(filename_out, "wb") as fout:
    # Read 1024 bytes at a time until read() returns b"" at the end of the file
    while chunk := fin.read(1024):
        fout.write(lzc.compress(chunk))
    # Flush whatever the compressor is still holding back
    fout.write(lzc.flush())

print(f"Uncompressed size: {os.stat(filename_in).st_size}")
# Uncompressed size: 972398
print(f"Compressed size: {os.stat(filename_out).st_size}")
# The exact compressed size depends on the contents of the word list

with lzma.open(filename_out, "r") as fin:
    words = fin.read().decode("utf-8").split()
    print(words[:5])
    # ['dabbing', 'hauled', "seediness's", 'Iroquoian', 'vibe']
We start by creating an input file consisting of a bunch of words extracted from the dictionary provided in /usr/share/dict/words. This is so that we can actually confirm that the decompressed data is identical to the original.
We then open the input and output files as in the previous examples. This time around, however, we iterate over the input in 1024-byte chunks and compress them using LZMACompressor.compress. These chunks are then written into the output file. After the whole file has been read and compressed, we need to call flush to finish the compression process and flush out any remaining data from the compressor.
To confirm that this worked, we open and decompress the file the usual way and print the first couple of words from the file.
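Decompression can be done incrementally too, using LZMADecompressor. The sketch below feeds a compressed stream to the decompressor 1024 bytes at a time; the generated payload is a hypothetical stand-in for the word list, kept in memory so the example is self-contained.

```python
import lzma

# Hypothetical payload standing in for the word list used above.
original = b"\n".join(b"word%d" % i for i in range(50_000))
packed = lzma.compress(original)

decompressor = lzma.LZMADecompressor()
pieces = []
# Feed the compressed stream 1024 bytes at a time.
for start in range(0, len(packed), 1024):
    pieces.append(decompressor.decompress(packed[start:start + 1024]))
restored = b"".join(pieces)

# eof becomes True once the end-of-stream marker has been seen.
print(decompressor.eof, restored == original)
```

Unlike the compressor, the decompressor has no flush method; the eof attribute tells you whether the stream was complete.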
Moving on to higher-level modules, let's now use gzip for the same tasks:
import os, shutil, gzip

filename_in = "data"
filename_out = "compressed_data.tar.gz"

with open(filename_in, "rb") as fin, gzip.open(filename_out, "wb") as fout:
    # Reads the file by chunks to avoid exhausting memory
    shutil.copyfileobj(fin, fout)

print(f"Uncompressed size: {os.stat(filename_in).st_size}")
# Uncompressed size: 1000000
print(f"Compressed size: {os.stat(filename_out).st_size}")
# Compressed size: 1023

with gzip.open(filename_out, "rb") as fin:
    data = fin.read()
    # len() counts the decompressed bytes without the object overhead
    print(f"Decompressed size: {len(data)}")
    # Decompressed size: 1000000
In this example we combined gzip and shutil. It might seem like we did the same bulk compression as with zlib or bz2 earlier, but thanks to shutil.copyfileobj we get chunked, incremental compression without having to loop over the data like we did with lzma.
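The same copyfileobj trick works in the other direction. As a rough sketch, the following compresses a file and then decompresses it back to disk without ever holding the whole content in memory; the scratch directory and file names are invented for this example.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())  # hypothetical scratch directory
plain = workdir / "data"
packed = workdir / "data.gz"
restored = workdir / "data.restored"

plain.write_bytes(bytes(1_000_000))  # 1 MB of zeroes as sample input

# Compress: copyfileobj moves the data through in fixed-size chunks.
with open(plain, "rb") as fin, gzip.open(packed, "wb") as fout:
    shutil.copyfileobj(fin, fout)

# Decompress the same way, with the roles of the two files swapped.
with gzip.open(packed, "rb") as fin, open(restored, "wb") as fout:
    shutil.copyfileobj(fin, fout)

same = restored.read_bytes() == plain.read_bytes()
print(same, packed.stat().st_size)
```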
One advantage of the gzip module is that it also provides a command-line interface, and I'm not talking about the Linux gzip and gunzip utilities but about the Python integration:
python3 -m gzip -h
usage: gzip.py [-h] [--fast | --best | -d] [file [file ...]]
...
ls -l data*
-rw-rw-r-- 1 martin martin 1000000 aug 22 18:48 data
# Use fast compression on file "data"
python3 -m gzip --fast data
# File named "data.gz" was generated:
ls -l data*
-rw-rw-r-- 1 martin martin 1000000 aug 22 18:48 data
-rw-rw-r-- 1 martin martin 1008 aug 22 20:50 data.gz
If you're more comfortable with either zip or tar, or you need archives in the formats provided by one of these, then this section will show you how to use them. Apart from the basic compression and decompression operations, these two modules (zipfile and tarfile) also include some other utility methods, such as testing checksums, using passwords or listing files in archives. So, let's dive in and see all of these in action.
import zipfile

# shuf -n5 /usr/share/dict/words > words1.txt  (repeat for words2.txt ... words5.txt)
files = ["words1.txt", "words2.txt", "words3.txt", "words4.txt", "words5.txt"]
archive = "archive.zip"
password = b"verysecret"

with zipfile.ZipFile(archive, "w") as zf:
    for file in files:
        zf.write(file)
    # Note: this only sets the default password for reading encrypted members;
    # zipfile cannot create encrypted archives, so nothing is encrypted here
    zf.setpassword(password)

with zipfile.ZipFile(archive, "r") as zf:
    crc_test = zf.testzip()
    if crc_test is not None:
        print(f"Bad CRC or file headers: {crc_test}")

    info = zf.infolist()  # also zf.namelist()
    print(info)  # See all attributes at https://docs.python.org/3/library/zipfile.html#zipinfo-objects
    # [ <ZipInfo filename='words1.txt' filemode='-rw-r--r--' file_size=37>,
    #   <ZipInfo filename='words2.txt' filemode='-rw-r--r--' file_size=47>,
    #   ... ]

    file = info[0]
    with zf.open(file) as f:
        print(f.read().decode())
        # Olav
        # teakettles
        # ...

    zf.extract(file, "/tmp", pwd=password)  # also zf.extractall()
This is a fairly long piece of code, but it covers all the important features of the zipfile module. In this snippet we start by creating a ZIP archive using the ZipFile context manager in "write" (w) mode and then add the files to this archive. You will notice that we didn't actually need to open the files that we're adding: all we needed to do was call write, passing in the file name. After adding all the files, we also call the setpassword method. Note that this only sets the default password used when reading encrypted members; the zipfile module can decrypt existing archives but cannot create encrypted ones, so the archive written here is not password protected.
Next, to demonstrate that it worked, we open the archive. Before reading any files we check the CRCs and file headers, and afterwards we retrieve information about all the files present in the archive. In this example we just print the list of ZipInfo objects, but you could also inspect their attributes to get the CRC, size, compression type, etc.
After checking all the files we open and read one of them. We see that it has the expected content, so we can go ahead and extract it to the directory given by the path (/tmp/).
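One detail worth knowing is that ZipFile stores members uncompressed (ZIP_STORED) unless you pass a compression argument. The sketch below builds a small in-memory archive with ZIP_DEFLATED and reads back the ZipInfo attributes mentioned above; the member name and its content are made up for the illustration.

```python
import io
import zipfile

buffer = io.BytesIO()  # in-memory archive, so nothing touches the disk

# ZipFile stores members uncompressed (ZIP_STORED) unless told otherwise.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("words.txt", "teakettles\n" * 1000)  # hypothetical member

with zipfile.ZipFile(buffer, "r") as zf:
    info = zf.getinfo("words.txt")
    print(f"name: {info.filename}")
    print(f"CRC-32: {info.CRC:#010x}")
    print(f"size: {info.file_size} -> {info.compress_size} bytes")
    print(f"deflated: {info.compress_type == zipfile.ZIP_DEFLATED}")
    bad_member = zf.testzip()  # None means every CRC checked out
```

Other compression constants such as ZIP_BZIP2 and ZIP_LZMA exist as well, though not every unzip tool can read them.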
In addition to creating and reading archives, ZIP also allows us to append files to existing archives. To do this, all we need to change is the access mode to "append" ("a"):
with zipfile.ZipFile(archive, "a") as zf:
    zf.write("words6.txt")
    print(zf.namelist())
    # ['words1.txt', 'words2.txt', 'words3.txt', 'words4.txt', 'words5.txt', 'words6.txt']
Same as the gzip module, Python's zipfile and tarfile also provide a CLI. To perform basic archiving and extracting, use the following:
python3 -m zipfile -c arch.zip words1.txt words2.txt # Create
python3 -m zipfile -t arch.zip # Test
Done testing
python3 -m zipfile -e arch.zip /tmp # Extract
ls /tmp/words*
/tmp/words1.txt /tmp/words2.txt
Last but not least, the tarfile module. This module is similar to zipfile, but also implements some extra features:
import tarfile

files = ["words1.txt", "words2.txt", "words3.txt", "words4.txt"]
archive = "archive.tar.gz"

def change_permissions(tarinfo):
    tarinfo.mode = 0o100600  # -rw-------.
    return tarinfo

with tarfile.open(archive, "w:gz") as tar:
    for file in files:
        tar.add(file)  # can also be dir (added recursively), symlink, etc
    print(f"archive contains: {tar.getmembers()}")
    # [<TarInfo 'words1.txt' at 0x7f71ed74f8e0>,
    #  <TarInfo 'words2.txt' at 0x7f71ed74f9a8>
    #  ... ]

    info = tar.gettarinfo("words1.txt")  # Other Linux attributes - https://docs.python.org/3/library/tarfile.html#tarinfo-objects
    print(f"{tar.name} contains {info.name} with permissions {oct(info.mode)[-3:]}, size: {info.size} and owner: {info.uid}:{info.gid}")
    # .../archive.tar.gz contains words1.txt with permissions 644, size: 37 and owner: 500:500

    # The filter function is called with each TarInfo before it is written
    tar.add("words5.txt", filter=change_permissions)

    tar.list()
    # -rw-r--r-- martin/martin 37 2021-08-23 09:01:56 words1.txt
    # -rw-r--r-- martin/martin 47 2021-08-23 09:02:06 words2.txt
    # ...
    # -rw------- martin/martin 42 2021-08-23 09:02:22 words5.txt
We start with the basic creation of an archive, but here we use the access mode "w:gz", which specifies that we want to use gzip compression. After that we add all our files to the archive. With the tarfile module we can also pass in, for example, symlinks or whole directories, which are added recursively.
Next, to confirm that all the files are really there, we use the getmembers method. To get insight into individual files we can use gettarinfo, which provides all the Linux file attributes.
tarfile provides one cool feature that we haven't seen with the other modules: the ability to modify attributes of files as they're being added to the archive. In the above snippet we change the permissions of a file by supplying the filter argument, a function that modifies TarInfo.mode. The mode is an integer that is conventionally written in octal; here 0o100600 sets the permissions to 0600, or -rw-------.
To get a complete overview of the files after making this change we can run the list method, which gives us output similar to ls -l.
The final thing to do with a tar archive is to open it and extract it. To do this, we open it in "r:gz" mode, retrieve an info object (member) using the file name, check whether it really is a file, and extract it to the desired location:
with tarfile.open(archive, "r:gz") as tar:
    member = tar.getmember("words3.txt")
    if member.isfile():
        tar.extract(member, "/tmp/")
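If you only need the content of a member, you don't have to extract it to disk at all. Here is a minimal sketch using extractfile on an in-memory archive; the member name and payload are hypothetical, and the archive is built with addfile just to keep the example self-contained.

```python
import io
import tarfile

buffer = io.BytesIO()  # in-memory archive for the sake of the example

payload = b"Olav\nteakettles\n"  # hypothetical file content
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    member = tarfile.TarInfo(name="words.txt")
    member.size = len(payload)  # size must be set before addfile
    tar.addfile(member, io.BytesIO(payload))

buffer.seek(0)  # rewind before reading the archive back
with tarfile.open(fileobj=buffer, mode="r:gz") as tar:
    member = tar.getmember("words.txt")
    # extractfile returns a file-like object (None for non-file members).
    with tar.extractfile(member) as f:
        content = f.read()

print(content.decode())
```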
As you can see, Python's modules provide a lot of options: low-level and high-level, specific and generic, with simple and more complicated interfaces. What you choose depends on your use case and requirements, but in general I would recommend going with the general-purpose modules such as zipfile or tarfile, and resorting to ones like lzma only if you really have to.
I tried to cover all the common use cases of these modules to give you a complete overview, but there are obviously more functions, objects and attributes in each of them, so be sure to check out the docs for the modules listed in the first section to find other useful bits and pieces.