Reproducible zip archives
I wanted to write down how to create a reproducible zip archive in Python and ended up reading the zip specification. While all code snippets are still in Python, the information gathered can be easily reused with other programming languages.
Note 📝: I am using the terms reproducible and deterministic as described in reproducible-builds.
How to create an archive in Python
Python offers the high-level function shutil.make_archive for creating archives of folders:
#!/usr/bin/env python3
import shutil
import sys
if __name__ == '__main__':
    shutil.make_archive(sys.argv[1], 'zip', sys.argv[1])
Verify if the archive is reproducible
Since I wanted my archive to be reproducible, I created a minimal test environment.
strip-nondeterminism is a program for removing non-deterministic information, like timestamps, from different file formats.
diffoscope is an advanced recursive diff tool that can parse a lot of different formats.
Those two tools put together made it trivial to find which non-deterministic information was embedded in the zip archives.
Another useful tool is zipinfo, also used internally by diffoscope, which shows information about zip archives.
mkdir -p test/{a/a,b,c};
echo a > test/a/a/a;
echo a > test/a/a/a.txt;
echo b > test/b/b.txt;
python3 ./script.py test && cp test.zip test.d.zip && strip-nondeterminism test.d.zip && diffoscope test.zip test.d.zip;
The output of diffoscope is:
--- test.zip
+++ test.d.zip
├── zipinfo {}
│ @@ -1,9 +1,9 @@
│ Zip file size: 628 bytes, number of entries: 7
│ -drwxr-xr-x 2.0 unx 0 b- stor 23-May-14 17:14 a/
│ -drwxr-xr-x 2.0 unx 0 b- stor 23-May-14 17:14 b/
│ -drwxr-xr-x 2.0 unx 0 b- stor 23-May-14 17:14 c/
│ -drwxr-xr-x 2.0 unx 0 b- stor 23-May-14 17:14 a/a/
│ --rw-r--r-- 2.0 unx 2 b- defN 23-May-14 17:14 a/a/a.txt
│ --rw-r--r-- 2.0 unx 2 b- defN 23-May-14 17:14 a/a/a
│ --rw-r--r-- 2.0 unx 2 b- defN 23-May-14 17:14 b/b.txt
│ +drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 a/
│ +drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 a/a/
│ +-rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 a/a/a
│ +-rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 a/a/a.txt
│ +drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 b/
│ +-rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 b/b.txt
│ +drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 c/
│ 7 files, 6 bytes uncompressed, 12 bytes compressed: -100.0%
├── zipnote {}
│ @@ -1,22 +1,22 @@
│ Filename: a/
│ Comment:
│
│ -Filename: b/
│ -Comment:
│ -
│ -Filename: c/
│ +Filename: a/a/
│ Comment:
│
│ -Filename: a/a/
│ +Filename: a/a/a
│ Comment:
│
│ Filename: a/a/a.txt
│ Comment:
│
│ -Filename: a/a/a
│ +Filename: b/
│ Comment:
│
│ Filename: b/b.txt
│ Comment:
│
│ +Filename: c/
│ +Comment:
│ +
│ Zip file comment:
As expected, the Python program generates a non-reproducible archive. The main differences are:
- file ordering
- timestamp of files
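Both differences can also be observed from Python itself. The following is a minimal sketch, assuming only the standard zipfile module; the helper name entry_summary and the in-memory sample archive are hypothetical and exist only for illustration.

```python
import io
from zipfile import ZipFile


def entry_summary(archive):
    """Return (filename, date_time) for each entry, in the order stored in the archive."""
    with ZipFile(archive) as zipf:
        return [(info.filename, info.date_time) for info in zipf.infolist()]


# Hypothetical sample: entries are stored in the order they are written,
# and writestr() with a plain name stamps them with the current local time.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zipf:
    zipf.writestr("b.txt", "b\n")
    zipf.writestr("a.txt", "a\n")

for name, date_time in entry_summary(buffer):
    print(name, date_time)
```

Running it twice at different times should print the same names in the same (write) order, but with different timestamps.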
ZipFile
Python offers a middle-level interface for creating archives: ZipFile. A straightforward way to create an archive of a directory would be
#!/usr/bin/env python3
import os
import sys
from zipfile import ZipFile, ZIP_DEFLATED
def zipdir(path, zipf):
    for root, dirs, files in os.walk(path):
        for d in dirs:
            zipf.write(os.path.join(root, d), os.path.relpath(os.path.join(root, d), path))
        for f in files:
            zipf.write(os.path.join(root, f), os.path.relpath(os.path.join(root, f), path))
if __name__ == '__main__':
    with ZipFile(sys.argv[1] + ".zip", 'w', ZIP_DEFLATED) as zipf:
        zipdir(sys.argv[1], zipf)
Again, the archive is not reproducible:
--- test.zip
+++ test.d.zip
├── zipinfo {}
│ @@ -1,9 +1,9 @@
│ Zip file size: 628 bytes, number of entries: 7
│ -drwxr-xr-x 2.0 unx 0 b- stor 23-May-14 17:14 a/
│ -drwxr-xr-x 2.0 unx 0 b- stor 23-May-14 17:14 b/
│ -drwxr-xr-x 2.0 unx 0 b- stor 23-May-14 17:14 c/
│ -drwxr-xr-x 2.0 unx 0 b- stor 23-May-14 17:14 a/a/
│ --rw-r--r-- 2.0 unx 2 b- defN 23-May-14 17:14 a/a/a.txt
│ --rw-r--r-- 2.0 unx 2 b- defN 23-May-14 17:14 a/a/a
│ --rw-r--r-- 2.0 unx 2 b- defN 23-May-14 17:14 b/b.txt
│ +drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 a/
│ +drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 a/a/
│ +-rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 a/a/a
│ +-rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 a/a/a.txt
│ +drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 b/
│ +-rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 b/b.txt
│ +drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 c/
│ 7 files, 6 bytes uncompressed, 12 bytes compressed: -100.0%
├── zipnote {}
│ @@ -1,22 +1,22 @@
│ Filename: a/
│ Comment:
│
│ -Filename: b/
│ -Comment:
│ -
│ -Filename: c/
│ +Filename: a/a/
│ Comment:
│
│ -Filename: a/a/
│ +Filename: a/a/a
│ Comment:
│
│ Filename: a/a/a.txt
│ Comment:
│
│ -Filename: a/a/a
│ +Filename: b/
│ Comment:
│
│ Filename: b/b.txt
│ Comment:
│
│ +Filename: c/
│ +Comment:
│ +
│ Zip file comment:
It is easy to see how we can change the ordering of the files, but not how to change the timestamps.
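As an illustration of the ordering part, here is a minimal sketch that makes the traversal order of os.walk stable by sorting the directory and file lists in place. The helper name sorted_entries is hypothetical. Note that this gives a stable order, not necessarily the same order that strip-nondeterminism produces.

```python
import os


def sorted_entries(path):
    """Yield archive names below path in a stable order; directories end with '/'."""
    for root, dirs, files in os.walk(path):
        dirs.sort()   # sorting in place also fixes the order in which os.walk descends
        files.sort()
        for d in dirs:
            rel = os.path.relpath(os.path.join(root, d), path)
            yield rel.replace(os.sep, "/") + "/"
        for f in files:
            rel = os.path.relpath(os.path.join(root, f), path)
            yield rel.replace(os.sep, "/")
```

Each yielded name could then be passed as the archive name to ZipFile.write, instead of relying on the order in which the file system happens to list the entries.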
ZipFile and ZipInfo
Unfortunately, there is no way with the given functions to change the timestamp (without changing the metadata of the files, for example with touch), thus we need to get our hands dirty and use the lower-level ZipInfo API.
#!/usr/bin/env python3
import os
import sys
import stat
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED, ZIP_STORED
def zipdir(path, zipf):
    for root, dirs, files in os.walk(path):
        for d in dirs:
            info = ZipInfo(
                filename=os.path.relpath(os.path.join(root, d),path) + "/",
                date_time=(1980, 1, 1, 12, 1, 0)
            )
            # Unix mode in the high 16 bits, MS-DOS directory flag in the low bits
            info.external_attr = 0o40755 << 16 | 0x010
            info.create_system = 3
            info.compress_type = ZIP_STORED
            info.CRC = 0 # unclear why necessary for directories, maybe a bug?
            zipf.mkdir(info) # ZipFile.mkdir requires Python 3.11 or newer
        for f in files:
            with open(os.path.join(root, f), 'rb') as data:
                info = ZipInfo(
                    filename=os.path.relpath(os.path.join(root, f),path),
                    date_time=(1980, 1, 1, 12, 1, 0)
                )
                info.external_attr = 0o100644 << 16
                info.create_system = 3 # unx=3 vs fat=0
                info.compress_type = ZIP_DEFLATED
                zipf.writestr(info, data.read())
if __name__ == '__main__':
    with ZipFile(sys.argv[1] + ".zip", 'w', ZIP_DEFLATED) as zipf:
        zipdir(sys.argv[1], zipf)
The relevant piece of code is date_time=(1980, 1, 1, 12, 1, 0) inside of ZipInfo. By doing so, it was necessary to define external attributes (file permissions) and system type (in this case unix).
For directories, one also needs to set the CRC value, which is strange since it is generated automatically for directories. I assume it is an unnecessary restriction of the API since the CRC value is always 0.
If external attributes are not set accordingly, Python still creates a valid archive, but the permissions are all messed up.
Granted, it would be a reproducible archive, but maybe not as useful.
With the given snippet of code, the output of diffoscope is the following:
--- test.zip
+++ test.d.zip
├── zipinfo {}
│ @@ -1,9 +1,9 @@
│ Zip file size: 628 bytes, number of entries: 7
│ drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 a/
│ -drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 b/
│ -drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 c/
│ drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 a/a/
│ --rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 a/a/a.txt
│ -rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 a/a/a
│ +-rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 a/a/a.txt
│ +drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 b/
│ -rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 b/b.txt
│ +drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 c/
│ 7 files, 6 bytes uncompressed, 12 bytes compressed: -100.0%
├── zipnote {}
│ @@ -1,22 +1,22 @@
│ Filename: a/
│ Comment:
│
│ -Filename: b/
│ -Comment:
│ -
│ -Filename: c/
│ +Filename: a/a/
│ Comment:
│
│ -Filename: a/a/
│ +Filename: a/a/a
│ Comment:
│
│ Filename: a/a/a.txt
│ Comment:
│
│ -Filename: a/a/a
│ +Filename: b/
│ Comment:
│
│ Filename: b/b.txt
│ Comment:
│
│ +Filename: c/
│ +Comment:
│ +
│ Zip file comment:
I’ve used zipinfo to extract the expected values for external_attr from another archive; the meaning of the leading 1 and 4 in the permissions is also explained here.
The appropriate permission for directories is the octal value 40755 and not 755, as 40000, defined as S_IFDIR in POSIX, denotes a directory.
Similarly, for regular files, the correct permission is the octal value 100644 and not 644, as 100000 (S_IFREG) denotes a regular file.
// ...
#define S_IFMT 00170000
#define S_IFSOCK 0140000
#define S_IFLNK 0120000
#define S_IFREG 0100000 // (1)
#define S_IFBLK 0060000
#define S_IFDIR 0040000 // (2)
#define S_IFCHR 0020000
#define S_IFIFO 0010000
#define S_ISUID 0004000
#define S_ISGID 0002000
#define S_ISVTX 0001000
// ...
(1) Regular file
(2) Directory
For the directories, I also set the MS-DOS file attributes (0x10). I’m not sure if it is necessary, but zipinfo -v shows if the value is set or not, and all other archives I’ve checked have this flag set.
Also, strip-nondeterminism adds the flag, so there is no value in not setting it.
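To make the layout of external_attr more concrete, here is a minimal sketch that builds the value from the constants of Python's stat module and decodes it again. The helper names unix_external_attr and unix_mode_string are hypothetical; the bit layout (Unix mode in the high 16 bits, MS-DOS attributes in the low bits) is the one used in the snippets above.

```python
import stat


def unix_external_attr(permissions, is_dir=False):
    """Build external_attr: Unix mode in the high 16 bits, MS-DOS flags in the low bits."""
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    attr = (file_type | permissions) << 16
    if is_dir:
        attr |= 0x10  # MS-DOS directory attribute
    return attr


def unix_mode_string(external_attr):
    """Render the Unix part of external_attr the way ls and zipinfo display it."""
    return stat.filemode(external_attr >> 16)


print(unix_mode_string(unix_external_attr(0o755, is_dir=True)))  # drwxr-xr-x
print(unix_mode_string(unix_external_attr(0o644)))               # -rw-r--r--
```

Using stat.S_IFDIR and stat.S_IFREG instead of the literals 0o40000 and 0o100000 is only a readability choice; the resulting values are the same as in the snippets above.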
The Python library could provide a better API and add those values to the structure ZipInfo unless the user overrides them, but I can imagine there is little interest, as it would make the API less transparent for those who know what to do™.
At this point, the missing piece is sorting files and directories:
#!/usr/bin/env python3
import os
import sys
import stat
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED, ZIP_STORED
def zipdir(path, zipf):
    entries = []
    for root, dirs, files in os.walk(path):
        for d in dirs:
            entries.append( os.path.relpath(os.path.join(root, d),path) + "/" )
        for f in files:
            entries.append( os.path.relpath(os.path.join(root, f),path) )
    entries.sort()
    for e in entries:
        info = ZipInfo(
            filename=e,
            date_time=(1980, 1, 1, 12, 1, 0)
        )
        info.create_system = 3
        if e.endswith("/"):
            info.external_attr = 0o40755 << 16 | 0x010
            info.compress_type = ZIP_STORED
            info.CRC = 0 # unclear why necessary, maybe a bug?
            zipf.mkdir(info)
        else:
            info.external_attr = 0o100644 << 16
            info.compress_type = ZIP_DEFLATED
            with open(os.path.join(path, e), 'rb') as data:
                zipf.writestr(info, data.read())
if __name__ == '__main__':
    with ZipFile(sys.argv[1] + ".zip", 'w', ZIP_DEFLATED) as zipf:
        zipdir(sys.argv[1], zipf)
And now there is no difference after executing strip-nondeterminism!
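Another quick sanity check, independent of external tools, is to build the archive twice and compare checksums. The following is a minimal sketch under simplifying assumptions: it works on in-memory data instead of a directory, handles only regular files, and the helper name build_archive and the sample data are hypothetical.

```python
import hashlib
import io
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED


def build_archive(files):
    """Build an in-memory zip from {name: bytes} with fixed metadata; return its bytes."""
    buffer = io.BytesIO()
    with ZipFile(buffer, "w", ZIP_DEFLATED) as zipf:
        for name in sorted(files):  # fixed ordering
            info = ZipInfo(filename=name, date_time=(1980, 1, 1, 12, 1, 0))  # fixed timestamp
            info.create_system = 3
            info.external_attr = 0o100644 << 16
            info.compress_type = ZIP_DEFLATED
            zipf.writestr(info, files[name])
    return buffer.getvalue()


sample = {"b/b.txt": b"b\n", "a/a/a": b"a\n", "a/a/a.txt": b"a\n"}
shuffled = dict(reversed(list(sample.items())))  # same content, different insertion order

first = hashlib.sha256(build_archive(sample)).hexdigest()
second = hashlib.sha256(build_archive(shuffled)).hexdigest()
print(first == second)
```

If the two digests differ, some input (ordering, timestamps, attributes) is still leaking into the archive.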
Alternate approach
Instead of creating a deterministic archive directly, one could just use strip-nondeterminism to remove non-deterministic information.
It works, and for one-time jobs, it is surely faster than researching how to create such an archive properly.
As a long-term solution, creating a deterministic archive from the beginning will generally be faster.
strip-nondeterminism can still be used for verifying that the archive is as deterministic as expected.
Should I prefer unx or fat as system?
On Windows, Python creates different zip archives.
This is the output of zipinfo on an archive created with shutil.make_archive:
Archive: test.zip
Zip file size: 628 bytes, number of entries: 7
drwxrwxrwx 2.0 fat 0 b- stor 23-May-15 07:50 a/
drwxrwxrwx 2.0 fat 0 b- stor 23-May-15 07:50 b/
drwxrwxrwx 2.0 fat 0 b- stor 23-May-15 07:50 c/
drwxrwxrwx 2.0 fat 0 b- stor 23-May-15 07:50 a/a/
-rw-rw-rw- 2.0 fat 2 b- defN 23-May-15 07:50 a/a/a
-rw-rw-rw- 2.0 fat 2 b- defN 23-May-15 07:50 a/a/a.txt
-rw-rw-rw- 2.0 fat 2 b- defN 23-May-15 07:50 b/b.txt
7 files, 6 bytes uncompressed, 12 bytes compressed: -100.0%
Note that the permissions are different, and the system type is fat and not unx.
When using ZipInfo directly, the archive is 100% identical: the md5sum is 9aff7f1c532e9fea60ec3f33ce67f148 both on Windows and GNU/Linux machines.
Note that strip-nondeterminism does not change unx to fat, but it "fixed" the permissions when I generated those incorrectly (like missing the MS-DOS file attributes for directories, or using the octal value 644 instead of 100644).
I chose unx because this is what shutil.make_archive and ZipFile.write do on GNU/Linux systems, and I did not want to add an unnecessary difference while validating my results.
It is certainly interesting to see that even a zip archive can leak some information about the system used.
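The system type of an existing archive can also be read back from Python through the create_system field. A minimal sketch follows; the helper name creator_systems, the SYSTEM_NAMES mapping (limited to the two values discussed here) and the sample archive are hypothetical.

```python
import io
from zipfile import ZipFile, ZipInfo

# Only the two values discussed in this section: fat=0, unx=3
SYSTEM_NAMES = {0: "fat", 3: "unx"}


def creator_systems(archive):
    """Map each entry name to the system type recorded for it in the archive."""
    with ZipFile(archive) as zipf:
        return {
            info.filename: SYSTEM_NAMES.get(info.create_system, str(info.create_system))
            for info in zipf.infolist()
        }


# Hypothetical sample: the value is set explicitly, so it does not depend
# on the operating system that runs this script.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zipf:
    for name, system in (("unix.txt", 3), ("fat.txt", 0)):
        info = ZipInfo(filename=name, date_time=(1980, 1, 1, 12, 1, 0))
        info.create_system = system
        zipf.writestr(info, b"x\n")

print(creator_systems(buffer))
```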
Other alternatives
The zip command, with the --no-extra parameter (short form -X), nearly creates a reproducible archive, but the ordering of the files is not ensured.
This can be "fixed" with find, sort and xargs:
cd <dir>; find . -name '*' -print0 | sort -z | xargs -0 zip -X ../test.zip
Note that this command still creates a different archive from the Python one:
--- test.d.zip
+++ test.zip
├── zipinfo {}
│ @@ -1,9 +1,9 @@
│ -Zip file size: 628 bytes, number of entries: 7
│ -drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 a/
│ -drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 a/a/
│ --rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 a/a/a
│ --rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 a/a/a.txt
│ -drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 b/
│ --rw-r--r-- 2.0 unx 2 b- defN 80-Jan-01 12:01 b/b.txt
│ -drwxr-xr-x 2.0 unx 0 b- stor 80-Jan-01 12:01 c/
│ -7 files, 6 bytes uncompressed, 12 bytes compressed: -100.0%
│ +Zip file size: 622 bytes, number of entries: 7
│ +drwxr-xr-x 3.0 unx 0 b- stor 23-May-14 17:14 a/
│ +drwxr-xr-x 3.0 unx 0 b- stor 23-May-14 17:14 a/a/
│ +-rw-r--r-- 3.0 unx 2 t- stor 23-May-14 17:14 a/a/a
│ +-rw-r--r-- 3.0 unx 2 t- stor 23-May-14 17:14 a/a/a.txt
│ +drwxr-xr-x 3.0 unx 0 b- stor 23-May-14 17:14 b/
│ +-rw-r--r-- 3.0 unx 2 t- stor 23-May-14 17:14 b/b.txt
│ +drwxr-xr-x 3.0 unx 0 b- stor 23-May-14 17:14 c/
│ +7 files, 6 bytes uncompressed, 6 bytes compressed: 0.0%
├── filetype from file(1)
│ @@ -1 +1 @@
│ -Zip archive data, at least v2.0 to extract, compression method=store
│ +Zip archive data, at least v1.0 to extract, compression method=store
The main differences are:
- if a file is binary or text
- compression algorithm/if a file is compressed (the command zip has a heuristic for deciding if it makes sense trying to store a given file)
- minimum version for extracting the data
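These three fields can be read back with the standard zipfile module. The following is a minimal sketch; the helper name entry_details and the sample archive are hypothetical, and the interpretation of the lowest bit of internal_attr as the "apparently text" flag follows the zip specification.

```python
import io
from zipfile import ZipFile, ZIP_DEFLATED


def entry_details(archive):
    """Return, per entry, the fields behind the differences listed above."""
    details = {}
    with ZipFile(archive) as zipf:
        for info in zipf.infolist():
            details[info.filename] = {
                "text": bool(info.internal_attr & 1),      # lowest bit: apparently a text file
                "compress_type": info.compress_type,       # 0 = stored, 8 = deflated
                "extract_version": info.extract_version,   # for example, 20 stands for v2.0
            }
    return details


# Hypothetical sample archive written by Python itself
buffer = io.BytesIO()
with ZipFile(buffer, "w", ZIP_DEFLATED) as zipf:
    zipf.writestr("a.txt", "a\n")

for name, fields in entry_details(buffer).items():
    print(name, fields)
```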
However, reproducible does not mean that different tools should create the same output.
It means that a given tool with a given set of data always creates the same output.
Having different tools produce a binary identical output is beyond most use cases.
It can be achieved, but unless you are delivering something (and recreating the delivery every time), and do not want to let your dependencies know that some internal tooling changed, it is probably not worth achieving something like that.
Incidentally, GitHub had exactly this issue this year.