
git and binary files


What is the issue?

Binary files and git repositories can be a heated topic.

The main issue is that big binary files tend to occupy a lot of space in a git repository, mainly for two reasons:

  • they are big

  • they cannot be compressed as efficiently as textual files

The first possible solution is to ignore the issue. In most cases, it’s the right thing to do: git is capable of working with non-textual files, and no special handling is required.

But if your product depends on some big files and there is no way around it (high-resolution images, videos, binary dumps, …​), then it’s hard to simply ignore the issue, as working with the repository will be problematic, on some platforms more than others.

A big repository is mainly problematic when cloning: instead of downloading some megabytes (or kilobytes) of data, you might be downloading gigabytes! Another side effect is that some operations take more time to execute; a command that would normally finish in a few milliseconds might take minutes.

In such conditions, working with git is painful, as you might spend more time waiting for your PC than actually getting something done.

The first solution is to use an alternate technology like git-annex or git-lfs. I’m putting those options together as they have more or less the same set of advantages and disadvantages.
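
For illustration, adopting git-lfs in an existing repository usually takes only a couple of commands (the *.png pattern below is just an example of what one might track):

    # install the git-lfs filters once per user/machine
    git lfs install

    # let git-lfs manage all PNG files in this repository
    git lfs track "*.png"

    # the tracking rules live in .gitattributes, which must be committed
    git add .gitattributes
    git commit -m "track PNG files with git-lfs"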

The second solution is to move the binary files somewhere else and fetch them with something else.

Somewhere else might be a maven/nexus repository (for example for java artifacts), a docker image (for example the compiler and other tools necessary for building the software), the package manager of your operating system (for example for basic tools like bash, or the compiler and other specialized tools for building the software), a separate repository, and so on.

Something else might be your build system, a shell script, or another tool to execute on your machine.
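
As a hedged sketch of this approach, the "something else" could be a small shell script that downloads a file and verifies its checksum before trusting it; the URL, path, and checksum below are placeholders, not a real setup:

    #!/bin/sh
    # fetch-assets.sh - hypothetical helper; URL, path and checksum are placeholders
    set -eu

    URL="https://artifacts.example.com/big-image-v1.png"
    FILE="assets/big-image.png"
    SHA256="0000000000000000000000000000000000000000000000000000000000000000"

    # download only if the file is missing or does not match the expected checksum
    if ! echo "$SHA256  $FILE" | sha256sum -c - >/dev/null 2>&1; then
        curl --fail --location --output "$FILE" "$URL"
        echo "$SHA256  $FILE" | sha256sum -c -
    fi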

As having those binary files outside of the repository has both advantages and disadvantages, there is no inherently better method for handling those files.

Note 📝
Similar arguments can be made for mono repo vs multi repo, but that is out of scope for these notes.

Is this really an issue?

In most projects, it won’t be an issue, which is why it is fine to ignore it in most cases.

If you use GitHub and get a warning because some files exceed its size limits, then it is a different story.

Why doesn’t svn have the same issue?

Because subversion does not download the whole history of the repository; a checkout contains only the selected revision, not every past version of every file.

Advantages

The main advantage of offloading your files somewhere else is that the git repository is "smaller": the .git folder stays small because the binary files do not end up there.

In particular, on Windows, this can make a tremendous difference and speed up many operations.

Clone and branch switching operations are faster, as those files are not tracked, and operations that show file contents benefit from it as well.
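
If you want to measure the effect, git can report the size of its object database directly:

    # size of the local object database in human-readable units
    git count-objects -vH

    # or simply measure the .git folder itself
    du -sh .git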

Disadvantages

Multi-stage setup

If the files are part of the repository, a git clone is all you need to have everything at your disposal. You might need to set up something (ideally not), but if everything is checked in, you are good to go.

If files are offloaded somewhere else, then you need to set something up. In particular, it means that you cannot simply clone the repository and disconnect your device, as the setup process needs to download the other files from somewhere else.

The most annoying part is that you will need to repeat this process when switching branches or updating the current branch. This is error-prone, as the scripts/documentation need to be maintained separately, and those scripts/instructions need to be executed; otherwise, you might end up in an inconsistent state.

No reproducibility by default

A commit in git is a fixed state.

No matter what, checking out that state will recreate the same environment.

Executing custom code that downloads things might create a slightly different state, as the source might have changed in the meantime.

For example, most docker containers install packages from a repository.

Thus recreating the docker container from scratch might create a different environment.

It is obviously possible to create a reproducible environment, but it requires more effort.
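
As a small illustration of that extra effort, one option is to pin exact package versions in the Dockerfile; the package names and version numbers below are placeholders, and they will stop being available in the package repository at some point, which is exactly the problem:

    # Dockerfile fragment - names and versions are placeholders
    FROM debian:12

    # pinning exact versions makes rebuilds more reproducible,
    # but only as long as those versions stay in the package repository
    RUN apt-get update && \
        apt-get install -y --no-install-recommends \
            gcc-12=12.2.0-14 \
            make=4.3-4.1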

Caching

When a file is downloaded once, there should be no need to download it again.

If dependencies are downloaded (for example) during the build process, you should ensure there is a working caching mechanism too.

It should not depend on the build folder (why should building in a different folder download the artifacts again?), and should also work reliably when switching branches.

Yes, build systems like Maven and Gradle can do this, but Maven had issues for a long time (and according to some comments still does), and Gradle has issues too.

Also, the cache in Gradle is "only" a "best effort", as it might decide to delete some elements.

In fact, in my limited experience I have had stability issues with both build systems.

Other systems, like Conan, should not have those issues.

Also, a good caching mechanism should be able to download all required dependencies when instructed to do so. As far as I know, Gradle does not have such a mechanism (and all custom-made plugins I’ve found did not work reliably); the dependencies are reliably downloaded only during the build process, which makes it nearly impossible to prepare an environment for developing and working completely offline.
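
Maven at least has a dedicated goal for prefetching dependencies, even if, as mentioned above, it has not always worked reliably for everyone:

    # resolve and cache everything needed for the build
    mvn dependency:go-offline

    # later, build without touching the network
    mvn --offline package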

I do not know other dependency systems of other build systems well enough to express an informed opinion.

git already provides this caching mechanism, and so do technologies built on top of it, like git-lfs and git-annex.

No history overview

If everything is checked in, you can use git log (or tig, or any other tool you like) to view the history of any file.

You can verify when and why a file (even a binary file) was added, together with its commit message.
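
For example, the complete history of a single (binary) file is one command away; the path is of course a placeholder:

    # full history of one file, following renames and showing the changed sizes
    git log --follow --stat -- assets/logo.png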

If those files are changed outside of git, you need to look at the history in different programs (if that is possible at all) and merge those histories. This gets especially difficult when trying to recreate an older state, and not only because of the reproducibility issue.

Diff support

git has built-in support for using different diff tools.

It also makes sense to diff binary files; diffoscope is the first tool that comes to mind, but there are specialized diff tools for images, pdf files, and other documents and formats.
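
For instance, a textconv-based diff driver can turn image metadata into something reviewable; exiftool is just one possible converter, and the *.png pattern is only an example:

    # .gitattributes: use a custom diff driver for PNG files
    *.png diff=exif

    # tell git how that driver converts the file into text for diffing
    git config diff.exif.textconv exiftool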

This is incredibly useful when reviewing changes, as it simplifies the workflow.

Services like GitHub and GitLab even support some binary formats out of the box, also when working with git-lfs.

As it is possible to see the differences with git, it is also possible to see them outside of git. What changes is how easy it is to see those differences.

Instead of seeing the change itself, you might only see that a URL to a resource changed. It might be necessary to save copies of the different versions one wants to compare and keep track of which copy belongs to which commit.

All things git already does out of the box.

More complex workflow

If everything is checked into git, the workflow will look like this:

  • change $BINARY_FILE

  • git add $BINARY_FILE

  • git commit

If some files are tracked outside of git:

  • change $BINARY_FILE

  • add $BINARY_FILE to external repository

  • change file in git that tracks $BINARY_FILE

  • git add "file in git that tracks $BINARY_FILE"

  • git commit
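
A concrete, entirely hypothetical sketch of that second workflow could look like this (server, paths, and pointer-file format are made up):

    # upload the new binary to the external storage
    scp big-image.png artifacts.example.com:/srv/artifacts/

    # record its new checksum (or URL/version) in a small text file tracked by git
    sha256sum big-image.png > assets/big-image.png.ref
    git add assets/big-image.png.ref
    git commit -m "update big-image.png"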

Also, when changing branches the workflow is more complicated; as already mentioned, it is

  • git switch $branchname

versus

  • git switch $branchname

  • do whatever is necessary to get the desired version of $BINARY_FILE

While it might not be much work, it makes working more error-prone; not only when switching between branches, but also when updating the current branch.

Developers need to know more tools

Saying that the workflow is more complex does not stress enough that knowing how to use git is not sufficient anymore.

You might need to learn how to use gradle, nexus, docker, …​ or whatever else is introduced for storing the data. Note that you might need to know how to use those tools for other reasons too, but you should pay attention not to "lock in" your development environment.

For example, a docker environment with some dependencies might sound like a good alternative to checking files into git. But if all you need are some big images, why force everyone to work in docker, especially since it has a maintenance burden?

I’m not saying that having a preconfigured environment for working is a bad idea; on the contrary. But wouldn’t it be better to be able to set up such an environment inside and outside docker?

Granted, this also holds for git-lfs and git-annex; in case of issues, you need to diagnose them, but at least in the "good case" the workflow can be as if the files were checked in directly in the repository.

Conclusion

git-annex and git-lfs nearly provide a silver bullet for the current situation.

With both programs, files are not checked in the repository directly. This provides the same advantages as not checking big binary files in git.

Both programs integrate with git (git-annex a little less, at least in the ways I used it) in such a way that none of the listed disadvantages apply.

There are still gotchas; for example if git-lfs is not configured correctly before the first git clone, the repository will not be in its expected state (the binary files will be missing).
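
If that happens, it can usually be repaired without recloning:

    # install the git-lfs filters for your user ...
    git lfs install

    # ... then download and check out the missing binary files
    git lfs pull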

For those providers supporting git-lfs (at least GitHub and GitLab), the data stored in git-lfs counts as part of the repository (it has to be saved somewhere after all). Pruning old unused files from git-lfs that no one is interested in means rewriting the history (at least I could not find another way), which is problematic for repositories where multiple people are working. This means I know of no good way to reduce the size of a repository without disrupting the work of other people. (Again, this holds only for those providers.)

If the files are stored "somewhere else" and retrieved with "something else", then dropping the files from "somewhere else" has no effect on the git history and the size of the repository (which should be small).

This is of little importance for the local copy of the repository, as only the necessary files are downloaded on demand when switching branches.

This is both an advantage and a disadvantage. Downloading all files could mean downloading a lot of unused data. On the other hand, not downloading everything means that you cannot work "offline".

There is a way out: you can download all the data you want explicitly (git is a good cache manager after all). The most straightforward way to do it is to switch to all the branches (or previous states of branches) you intend to work with.
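
Both tools also offer a more direct way to prefetch data; for example:

    # git-lfs: download the objects for all refs, not just the current checkout
    git lfs fetch --all

    # git-annex: request the content of all known files
    git annex get --all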

There is one last point to be said. If you are using a central server for synchronizing your work, tools like git-lfs or git-annex need to be configured appropriately.

git-lfs is supported out of the box on GitHub and can be configured on GitLab, Gitea, and other providers.

git-annex seems to be less supported (GitLab used to support it), but on a "simple setup" (no website for managing your repository, just ssh access) it is sufficient to install the git-annex package.

Having said that, putting everything you can in a repository probably makes little sense. Somewhere you have to draw the line.

Committing a working environment (a virtual machine? a Debian ISO for installing the OS on your physical machine? a git executable/installer?) makes little sense.

Committing the tools where the exact version for building your program might be relevant (for example: the compiler) can make sense.

Personally, I would prefer not to have such hard dependencies on specific versions or tools (there are, for example, multiple C and C++ compilers, multiple JVMs, and multiple different versions of interpreters for most dynamic languages, …​). This can create a culture where people work with slightly different tools, and (ignoring bugs in the toolchain) it leads to more robust software, for example because different compiler vendors provide different diagnostics and different checks at runtime. It also makes it easier to work from different environments, which means that people can work where they are more comfortable: at the command line, inside an IDE, in a Unix environment, on a Windows machine, …​

A project might depend on third-party libraries, possibly precompiled (remember, these notes are about binary files!). Checking those in is the easiest solution; but similarly to the toolchain, being able to use different versions (assuming a stable API and ABI) can have its advantages too.

And last but not least, in-house binary blobs. Sometimes it is possible to avoid those (instead of committing the compiler binary, compile it from scratch), but it can lead to more error-prone workflows (you have to compile the binary before you can use it).

Other times, it is not possible to avoid them (compiling the binary takes a long time, it requires a special program that is not available on all machines, or the source code has long been lost).

In such cases, it can make sense to see if it is possible to optimize the size of the file (specialized tools like optipng, jpegoptim, mp3packer, …​ come to mind), but it does not change the fact that in most cases I would put those files in git directly.
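
For example (the file names are placeholders):

    # lossless PNG recompression (higher -o levels are slower)
    optipng -o5 image.png

    # strip metadata and losslessly optimize a JPEG
    jpegoptim --strip-all photo.jpg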


Do you want to share your opinion? Or is there an error, some parts that are not clear enough?

You can contact me anytime.