Reproducible zip archives

Notes published the
Categories: reproducible
Keywords: archive python reproducible zip

I wanted to write down how to create a reproducible zip archive in Python and ended up reading the zip specification. While all code snippets are still in Python, the information gathered can be easily reused with other programming languages.

Note 📝
I am using the terms reproducible and deterministic as described in reproducible-builds.

How to create an archive in Python

Python offers the high-level function shutil.make_archive for creating archives of folders:

#!/usr/bin/env python3

import shutil
import sys

if __name__ == '__main__':
    shutil.make_archive(sys.argv[1], 'zip', sys.argv[1])

Verify if the archive is reproducible

Since I wanted my archive to be reproducible, I created a minimal test environment.

strip-nondeterminism is a program for removing non-deterministic information from different file formats, like timestamps.

diffoscope is an advanced recursive diff tool that can parse a lot of different formats.

Those two tools put together made it trivial to find which non-deterministic information was embedded in the zip archives.

Another useful tool is zipinfo, which shows information about zip archives and is also used internally by diffoscope.

mkdir -p test/{a/a,b,c};
echo a > test/a/a/a;
echo a > test/a/a/a.txt;
echo b > test/b/b.txt;
python3 ./script.py test && cp test.zip test.d.zip && strip-nondeterminism test.d.zip && diffoscope test.zip test.d.zip;

The output of diffoscope is:

--- test.zip
+++ test.d.zip
├── zipinfo {}
│ @@ -1,9 +1,9 @@
│  Zip file size: 628 bytes, number of entries: 7
│ -drwxr-xr-x  2.0 unx        0 b- stor 23-May-14 17:14 a/
│ -drwxr-xr-x  2.0 unx        0 b- stor 23-May-14 17:14 b/
│ -drwxr-xr-x  2.0 unx        0 b- stor 23-May-14 17:14 c/
│ -drwxr-xr-x  2.0 unx        0 b- stor 23-May-14 17:14 a/a/
│ --rw-r--r--  2.0 unx        2 b- defN 23-May-14 17:14 a/a/a.txt
│ --rw-r--r--  2.0 unx        2 b- defN 23-May-14 17:14 a/a/a
│ --rw-r--r--  2.0 unx        2 b- defN 23-May-14 17:14 b/b.txt
│ +drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 a/
│ +drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 a/a/
│ +-rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 a/a/a
│ +-rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 a/a/a.txt
│ +drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 b/
│ +-rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 b/b.txt
│ +drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 c/
│  7 files, 6 bytes uncompressed, 12 bytes compressed:  -100.0%
├── zipnote {}
│ @@ -1,22 +1,22 @@
│  Filename: a/
│  Comment:
│
│ -Filename: b/
│ -Comment:
│ -
│ -Filename: c/
│ +Filename: a/a/
│  Comment:
│
│ -Filename: a/a/
│ +Filename: a/a/a
│  Comment:
│
│  Filename: a/a/a.txt
│  Comment:
│
│ -Filename: a/a/a
│ +Filename: b/
│  Comment:
│
│  Filename: b/b.txt
│  Comment:
│
│ +Filename: c/
│ +Comment:
│ +
│  Zip file comment:

As expected, the Python program generates a non-reproducible archive. The main differences are:

  • file ordering

  • timestamp of files
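
Both differences can also be inspected from Python itself. The following is only a minimal sketch: the helper name describe is hypothetical, and the archive is built in memory purely for illustration, with entries deliberately added out of order.

```python
import io
from zipfile import ZipFile, ZipInfo

def describe(archive):
    """Return (name, timestamp) pairs in the order they are stored."""
    with ZipFile(archive) as zipf:
        return [(info.filename, info.date_time) for info in zipf.infolist()]

# Build a tiny archive in memory; entries are deliberately added unsorted.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zipf:
    zipf.writestr(ZipInfo("b/b.txt", date_time=(2023, 5, 14, 17, 14, 0)), "b\n")
    zipf.writestr(ZipInfo("a/a/a", date_time=(2023, 5, 14, 17, 14, 0)), "a\n")

entries = describe(buffer)
for name, date_time in entries:
    print(name, date_time)
```

The entries come back in the order they were written, together with the timestamp stored for each of them, which are exactly the two things a reproducible archive has to pin down.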

ZipFile

Python offers a middle-level interface for creating archives: ZipFile. A straightforward way to create an archive of a directory would be

#!/usr/bin/env python3

import os
import sys
from zipfile import ZipFile, ZIP_DEFLATED

def zipdir(path, zipf):
    for root, dirs, files in os.walk(path):
        for d in dirs:
            zipf.write(os.path.join(root, d), os.path.relpath(os.path.join(root, d), path))
        for f in files:
            zipf.write(os.path.join(root, f), os.path.relpath(os.path.join(root, f), path))

if __name__ == '__main__':
    with ZipFile(sys.argv[1] + ".zip", 'w', ZIP_DEFLATED) as zipf:
       zipdir(sys.argv[1], zipf)

Again, the archive is not reproducible:

--- test.zip
+++ test.d.zip
├── zipinfo {}
│ @@ -1,9 +1,9 @@
│  Zip file size: 628 bytes, number of entries: 7
│ -drwxr-xr-x  2.0 unx        0 b- stor 23-May-14 17:14 a/
│ -drwxr-xr-x  2.0 unx        0 b- stor 23-May-14 17:14 b/
│ -drwxr-xr-x  2.0 unx        0 b- stor 23-May-14 17:14 c/
│ -drwxr-xr-x  2.0 unx        0 b- stor 23-May-14 17:14 a/a/
│ --rw-r--r--  2.0 unx        2 b- defN 23-May-14 17:14 a/a/a.txt
│ --rw-r--r--  2.0 unx        2 b- defN 23-May-14 17:14 a/a/a
│ --rw-r--r--  2.0 unx        2 b- defN 23-May-14 17:14 b/b.txt
│ +drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 a/
│ +drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 a/a/
│ +-rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 a/a/a
│ +-rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 a/a/a.txt
│ +drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 b/
│ +-rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 b/b.txt
│ +drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 c/
│  7 files, 6 bytes uncompressed, 12 bytes compressed:  -100.0%
├── zipnote {}
│ @@ -1,22 +1,22 @@
│  Filename: a/
│  Comment:
│
│ -Filename: b/
│ -Comment:
│ -
│ -Filename: c/
│ +Filename: a/a/
│  Comment:
│
│ -Filename: a/a/
│ +Filename: a/a/a
│  Comment:
│
│  Filename: a/a/a.txt
│  Comment:
│
│ -Filename: a/a/a
│ +Filename: b/
│  Comment:
│
│  Filename: b/b.txt
│  Comment:
│
│ +Filename: c/
│ +Comment:
│ +
│  Zip file comment:

It is easy to see how we can change the ordering of the files, but not how to change the timestamps.
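
For the ordering, one possible approach is to sort the names while walking the tree. This is a sketch under the assumption that sorting within each directory is enough; sorted_walk is a hypothetical helper, and the temporary tree only mirrors the test layout.

```python
import os
import tempfile

def sorted_walk(path):
    """Yield names relative to path, in a stable order."""
    for root, dirs, files in os.walk(path):
        dirs.sort()  # in-place sort also fixes the order os.walk descends in
        for name in dirs + sorted(files):
            yield os.path.relpath(os.path.join(root, name), path)

# Small demonstration on a temporary tree mirroring the test layout.
with tempfile.TemporaryDirectory() as tmp:
    for d in ("c", "b", "a/a"):
        os.makedirs(os.path.join(tmp, d))
    for f in ("b/b.txt", "a/a/a.txt", "a/a/a"):
        with open(os.path.join(tmp, f), "w") as fh:
            fh.write("x\n")
    names = [n.replace(os.sep, "/") for n in sorted_walk(tmp)]
    print(names)
```

The resulting order is stable, but it lists each directory's children before descending, so it is not the order strip-nondeterminism produces in the output above, which sorts the full paths.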

ZipFile and ZipInfo

Unfortunately, there is no way with the given functions to change the timestamp (without changing the metadata of the files, for example with touch), thus we need to get our hands dirty and use the lower-level ZipInfo API.

#!/usr/bin/env python3

import os
import sys
import stat

from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED, ZIP_STORED

def zipdir(path, zipf):
    for root, dirs, files in os.walk(path):
        for d in dirs:
            info = ZipInfo(
                filename=os.path.relpath(os.path.join(root, d),path) + "/",
                date_time=(1980, 1, 1, 12, 1, 0)
                )
            info.external_attr = 0o40755  << 16 | 0x010
            info.create_system = 3
            info.compress_type = ZIP_STORED
            info.CRC = 0 # unclear why necessary for directories, maybe a bug?
            zipf.mkdir(info)  # ZipFile.mkdir requires Python 3.11 or newer
        for f in files:
            with open(os.path.join(root, f), 'rb') as data:
                info = ZipInfo(
                        filename=os.path.relpath(os.path.join(root, f),path),
                        date_time=(1980, 1, 1, 12, 1, 0)
                       )
                info.external_attr = 0o100644 << 16
                info.create_system = 3 # unx=3 vs fat=0
                info.compress_type = ZIP_DEFLATED
                zipf.writestr(info, data.read())

if __name__ == '__main__':
    with ZipFile(sys.argv[1] + ".zip", 'w', ZIP_DEFLATED) as zipf:
       zipdir(sys.argv[1], zipf)

The relevant piece of code is date_time=(1980, 1, 1, 12, 1, 0), passed to ZipInfo. Because the ZipInfo is now built by hand, it is also necessary to define the external attributes (file permissions) and the system type (in this case unix).
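
The value (1980, 1, 1, 12, 1, 0) matches what strip-nondeterminism writes in the output shown earlier. The year is not arbitrary: zip timestamps cannot go back further than 1980. A minimal sketch of how ZipInfo behaves in this regard, as far as I can tell from CPython's zipfile module:

```python
from zipfile import ZipInfo

# Without an explicit date_time, ZipInfo falls back to 1980-01-01 00:00:00.
default = ZipInfo("a.txt")
print(default.date_time)

# Years before 1980 are rejected as soon as the ZipInfo is created.
try:
    ZipInfo("a.txt", date_time=(1979, 12, 31, 23, 59, 58))
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
```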

For directories, one also needs to set the CRC value, which is strange since it is generated automatically for directories. I assume it is an unnecessary restriction of the API since the CRC value is always 0.

If external attributes are not set accordingly, Python still creates a valid archive, but the permissions are all messed up.

Granted, it would be a reproducible archive, but maybe not as useful.

With the given snippet of code, the output of diffoscope is the following:

--- test.zip
+++ test.d.zip
├── zipinfo {}
│ @@ -1,9 +1,9 @@
│  Zip file size: 628 bytes, number of entries: 7
│  drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 a/
│ -drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 b/
│ -drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 c/
│  drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 a/a/
│ --rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 a/a/a.txt
│  -rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 a/a/a
│ +-rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 a/a/a.txt
│ +drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 b/
│  -rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 b/b.txt
│ +drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 c/
│  7 files, 6 bytes uncompressed, 12 bytes compressed:  -100.0%
├── zipnote {}
│ @@ -1,22 +1,22 @@
│  Filename: a/
│  Comment:
│
│ -Filename: b/
│ -Comment:
│ -
│ -Filename: c/
│ +Filename: a/a/
│  Comment:
│
│ -Filename: a/a/
│ +Filename: a/a/a
│  Comment:
│
│  Filename: a/a/a.txt
│  Comment:
│
│ -Filename: a/a/a
│ +Filename: b/
│  Comment:
│
│  Filename: b/b.txt
│  Comment:
│
│ +Filename: c/
│ +Comment:
│ +
│  Zip file comment:

I used zipinfo to extract the expected values for external_attr from another archive. The leading 1 and 4 in the permission values are explained below.

The appropriate permission for directories is the octal value 40755 and not 755, as 40000, defined as S_IFDIR in POSIX, denotes a directory.

Similarly, for regular files, the correct permission is the octal value 100644 and not 644, as 100000 denotes a regular file.

/usr/include/linux/stat.h
// ...

#define S_IFMT  00170000
#define S_IFSOCK 0140000
#define S_IFLNK  0120000
#define S_IFREG  0100000 // (1)
#define S_IFBLK  0060000
#define S_IFDIR  0040000 // (2)
#define S_IFCHR  0020000
#define S_IFIFO  0010000
#define S_ISUID  0004000
#define S_ISGID  0002000
#define S_ISVTX  0001000

// ...
  1. Regular file

  2. Directory
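
The same constants are available in Python's stat module, so the magic numbers can be spelled out instead of hard-coded. A minimal sketch; DIR_ATTR and FILE_ATTR are names I made up for illustration:

```python
import stat

# File type bits combined with permission bits, shifted into the upper
# 16 bits of external_attr; 0x10 is the MS-DOS directory attribute.
DIR_ATTR = (stat.S_IFDIR | 0o755) << 16 | 0x10
FILE_ATTR = (stat.S_IFREG | 0o644) << 16

# stat.filemode renders a mode the same way ls and zipinfo do.
print(oct(stat.S_IFDIR | 0o755), stat.filemode(stat.S_IFDIR | 0o755))
print(oct(stat.S_IFREG | 0o644), stat.filemode(stat.S_IFREG | 0o644))
```

These are the same values as the 0o40755 << 16 | 0x010 and 0o100644 << 16 used in the snippet above.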

For the directories, I also set the MS-DOS file attributes (0x10). I’m not sure if it is necessary, but zipinfo -v shows if the value is set or not, and all other archives I’ve checked have this flag set.

Also strip-nondeterminism adds the flag, so there is no value in not setting it.
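
Going the other way, external_attr can be split back into its two halves to check what an archive really contains. This is only a sketch: split_external_attr is a hypothetical helper, and the directory entry is written with writestr and empty data so that the example does not depend on ZipFile.mkdir.

```python
import io
import stat
from zipfile import ZipFile, ZipInfo

def split_external_attr(info):
    """Return (unix_mode_string, msdos_attributes) for one entry."""
    unix_mode = info.external_attr >> 16      # upper 16 bits: Unix mode
    msdos_attr = info.external_attr & 0xFF    # lowest byte: MS-DOS attributes
    return stat.filemode(unix_mode), msdos_attr

buffer = io.BytesIO()
with ZipFile(buffer, "w") as zipf:
    directory = ZipInfo("a/", date_time=(1980, 1, 1, 12, 1, 0))
    directory.create_system = 3
    directory.external_attr = 0o40755 << 16 | 0x10
    zipf.writestr(directory, b"")

with ZipFile(buffer) as zipf:
    info = zipf.getinfo("a/")
    mode, msdos = split_external_attr(info)
    print(mode, hex(msdos), info.is_dir())
```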

The Python library could provide a better API and add those values to the ZipInfo structure unless the user overrides them, but I can imagine there is little interest, as it would make the API less transparent for those who know what to do™.

At this point, the missing piece is sorting files and directories:

#!/usr/bin/env python3

import os
import sys
import stat

from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED, ZIP_STORED

def zipdir(path, zipf):
    entries = []
    for root, dirs, files in os.walk(path):
        for d in dirs:
            entries.append( os.path.relpath(os.path.join(root, d),path) + "/" )
        for f in files:
            entries.append( os.path.relpath(os.path.join(root, f),path) )
    entries.sort()
    for e in entries:
        info = ZipInfo(
            filename=e,
            date_time=(1980, 1, 1, 12, 1, 0)
        )
        info.create_system = 3
        if e.endswith("/"):
            info.external_attr = 0o40755  << 16 | 0x010
            info.compress_type = ZIP_STORED
            info.CRC = 0 # unclear why necessary, maybe a bug?
            zipf.mkdir(info)  # ZipFile.mkdir requires Python 3.11 or newer
        else:
            info.external_attr = 0o100644 << 16
            info.compress_type = ZIP_DEFLATED
            with open(os.path.join(path, e), 'rb') as data:
                zipf.writestr(info, data.read())

if __name__ == '__main__':
    with ZipFile(sys.argv[1] + ".zip", 'w', ZIP_DEFLATED) as zipf:
       zipdir(sys.argv[1], zipf)

And now there is no difference after executing strip-nondeterminism!
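
A quick sanity check that does not need any external tool is to build the archive twice and compare checksums. The sketch below is not the script above: build is a hypothetical helper that takes an in-memory {name: bytes} mapping, handles regular files only, and only compares two runs on the same machine.

```python
import hashlib
import io
import time
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED

def build(entries):
    """Build an archive in memory from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with ZipFile(buffer, "w", ZIP_DEFLATED) as zipf:
        for name in sorted(entries):
            info = ZipInfo(filename=name, date_time=(1980, 1, 1, 12, 1, 0))
            info.create_system = 3
            info.external_attr = 0o100644 << 16
            info.compress_type = ZIP_DEFLATED
            zipf.writestr(info, entries[name])
    return buffer.getvalue()

entries = {"b/b.txt": b"b\n", "a/a/a": b"a\n", "a/a/a.txt": b"a\n"}
first = hashlib.sha256(build(entries)).hexdigest()
time.sleep(2)  # long enough for a leaked wall-clock timestamp to differ
# Second run: same content, but handed over in reverse insertion order.
second = hashlib.sha256(build(dict(reversed(list(entries.items()))))).hexdigest()
print(first == second)
```

If a timestamp or the insertion order leaked into the archive, the two checksums would differ.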

Alternate approach

Instead of creating a deterministic archive directly, one could just use strip-nondeterminism to remove non-deterministic information.

It works, and for one-time jobs, it is surely faster than researching how to create such an archive properly.

As a long-term solution, creating a deterministic archive from the beginning will generally be faster.

strip-nondeterminism can still be used for verifying that the archive is as deterministic as expected.

Should I prefer unx or fat as system?

On Windows, Python creates different zip archives.

This is the output of zipinfo on an archive created with shutil.make_archive:

Archive:  test.zip
Zip file size: 628 bytes, number of entries: 7
drwxrwxrwx  2.0 fat        0 b- stor 23-May-15 07:50 a/
drwxrwxrwx  2.0 fat        0 b- stor 23-May-15 07:50 b/
drwxrwxrwx  2.0 fat        0 b- stor 23-May-15 07:50 c/
drwxrwxrwx  2.0 fat        0 b- stor 23-May-15 07:50 a/a/
-rw-rw-rw-  2.0 fat        2 b- defN 23-May-15 07:50 a/a/a
-rw-rw-rw-  2.0 fat        2 b- defN 23-May-15 07:50 a/a/a.txt
-rw-rw-rw-  2.0 fat        2 b- defN 23-May-15 07:50 b/b.txt
7 files, 6 bytes uncompressed, 12 bytes compressed:  -100.0%

Note that the permissions are different, and the system type is fat and not unx.

When using ZipInfo directly, the archive is 100% identical: the md5sum is 9aff7f1c532e9fea60ec3f33ce67f148 on both Windows and GNU/Linux machines.

Note that strip-nondeterminism does not change unx to fat, but it "fixed" the permissions when I generated them incorrectly (for example by leaving out the MS-DOS file attributes for directories, or by using the octal value 644 instead of 100644).

I chose unx because this is what shutil.make_archive and ZipFile.write do on GNU/Linux systems, and I did not want to add an unnecessary difference while validating my results.

It is interesting to see that even a zip archive can leak some information about the system used to create it.
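
The leaked value is easy to read back with ZipInfo.create_system. A minimal sketch; creator_systems and the SYSTEMS table are hypothetical, and only the two systems discussed here are mapped to the labels zipinfo prints.

```python
import io
from zipfile import ZipFile, ZipInfo

# Only the two values discussed here; anything else is shown as a number.
SYSTEMS = {0: "fat", 3: "unx"}

def creator_systems(archive):
    """Return the set of system labels recorded in an archive."""
    with ZipFile(archive) as zipf:
        return {SYSTEMS.get(info.create_system, str(info.create_system))
                for info in zipf.infolist()}

buffer = io.BytesIO()
with ZipFile(buffer, "w") as zipf:
    info = ZipInfo("a.txt", date_time=(1980, 1, 1, 12, 1, 0))
    info.create_system = 3  # pinned explicitly instead of depending on the platform
    zipf.writestr(info, b"a\n")

systems = creator_systems(buffer)
print(systems)
```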

Other alternatives

The zip command, with the --no-extra (-X) parameter, nearly creates a reproducible archive, but the ordering of the files is not guaranteed.

This can be "fixed" with find, sort and xargs:

cd <dir>; find . -name '*' -print0 | sort -z | xargs -0 zip -X ../test.zip

Note that this command still creates a different archive from the Python one:

--- test.d.zip
+++ test.zip
├── zipinfo {}
│ @@ -1,9 +1,9 @@
│ -Zip file size: 628 bytes, number of entries: 7
│ -drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 a/
│ -drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 a/a/
│ --rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 a/a/a
│ --rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 a/a/a.txt
│ -drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 b/
│ --rw-r--r--  2.0 unx        2 b- defN 80-Jan-01 12:01 b/b.txt
│ -drwxr-xr-x  2.0 unx        0 b- stor 80-Jan-01 12:01 c/
│ -7 files, 6 bytes uncompressed, 12 bytes compressed:  -100.0%
│ +Zip file size: 622 bytes, number of entries: 7
│ +drwxr-xr-x  3.0 unx        0 b- stor 23-May-14 17:14 a/
│ +drwxr-xr-x  3.0 unx        0 b- stor 23-May-14 17:14 a/a/
│ +-rw-r--r--  3.0 unx        2 t- stor 23-May-14 17:14 a/a/a
│ +-rw-r--r--  3.0 unx        2 t- stor 23-May-14 17:14 a/a/a.txt
│ +drwxr-xr-x  3.0 unx        0 b- stor 23-May-14 17:14 b/
│ +-rw-r--r--  3.0 unx        2 t- stor 23-May-14 17:14 b/b.txt
│ +drwxr-xr-x  3.0 unx        0 b- stor 23-May-14 17:14 c/
│ +7 files, 6 bytes uncompressed, 6 bytes compressed:  0.0%
├── filetype from file(1)
│ @@ -1 +1 @@
│ -Zip archive data, at least v2.0 to extract, compression method=store
│ +Zip archive data, at least v1.0 to extract, compression method=store

The main differences are:

  • if a file is binary or text

  • compression algorithm, or whether a file is compressed at all (the zip command has a heuristic for deciding whether compressing a given file is worthwhile or whether it should just be stored)

  • minimum version for extracting the data
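
These three properties can also be read from Python. The sketch below rests on my understanding that bit 0 of internal_attr marks an entry as text and that extract_version stores 2.0 as 20; summarize is a hypothetical helper.

```python
import io
from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED, ZIP_STORED

def summarize(info):
    """Collect the fields behind the three differences listed above."""
    return {
        "text": bool(info.internal_attr & 1),           # bit 0: flagged as text
        "compressed": info.compress_type != ZIP_STORED,
        "extract_version": info.extract_version / 10,   # stored as 20 for 2.0
    }

buffer = io.BytesIO()
with ZipFile(buffer, "w") as zipf:
    info = ZipInfo("a/a/a", date_time=(1980, 1, 1, 12, 1, 0))
    info.compress_type = ZIP_DEFLATED
    zipf.writestr(info, b"a\n")

with ZipFile(buffer) as zipf:
    summary = summarize(zipf.getinfo("a/a/a"))
    print(summary)
```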

However, reproducible does not mean that different tools should create the same output.

It means that a given tool with a given set of data always creates the same output.

Having different tools produce a binary identical output is beyond most use cases.

It can be achieved, but it is probably not worth the effort unless you are delivering something (and recreating the delivery every time) and do not want whoever depends on it to notice that some internal tooling changed.

