git and binary files
How to handle binary files in git repositories can be a heated topic.
The main issue is that big binary files tend to occupy a lot of space in a git repository, mainly for two reasons:
- they are big
- they cannot be compressed as efficiently as textual files
The first possible solution is to ignore the issue. In most cases, it’s the right thing to do.
git is capable of working with non-textual files; no special handling is required.
But if your product depends on some big files, and there is no way around it (high-resolution images, videos, binary dumps, …), then it's hard to simply ignore the issue, as working with the repository will be problematic, on some platforms more than on others.
The big repository size is generally problematic when cloning the repository; instead of downloading some megabytes (or kilobytes) of data, you might be downloading gigabytes! Another side effect is that some operations might take more time to execute. A command that would normally take only some milliseconds might take minutes.
In such conditions, working with git is painful, as you might spend more time waiting for your PC than actually doing something.
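A throwaway sketch (file name and sizes made up) that shows both effects at once: each revision of a random, incompressible file is stored almost whole, so the object store grows with every update:

```shell
set -e
repo=$(mktemp -d)                        # throwaway repository
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you

# two revisions of an incompressible "binary" file
head -c 1048576 /dev/urandom > big.bin   # ~1 MiB of random data
git add big.bin
git commit -qm "add big.bin"
head -c 1048576 /dev/urandom > big.bin
git commit -qam "update big.bin"

# both revisions are stored nearly whole: expect roughly 2 MiB here
git count-objects -vH
```

Every clone downloads both revisions; with textual files, compression and deltas would usually keep this growth much smaller.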
The second solution is to move the binary files somewhere else and fetch them with something else.
Somewhere else might be a maven/nexus repository (for example for java artifacts), a docker image (for example the compiler and other tools necessary for building the software), the package manager of your operating system (for example basic tools like bash, or the compiler and other specialized tools for building the software), a separate repository, and so on.
Something else might be your build system, a shell script, or another tool to execute on your machine.
As having those binary files outside of the repository has both advantages and disadvantages, there is no inherently better method for handling those files.
(Similar arguments can be made for mono repo vs multi repo, but that is out of scope for these notes.)
In most projects, it won’t be an issue, which is why it is fine to ignore it in most cases.
If you use GitHub and get a warning about file sizes, then it is a different story.
The main advantage of offloading your files somewhere else is that the git repository is "smaller": the .git folder stays small because the binary files do not end up there.
In particular, on Windows, this can make a tremendous difference and speed up many operations.
Clone and branch-switching operations are faster, as those files are not tracked, but operations that show the content benefit from it too.
If the files are part of the repository, a git clone is everything you need to do to have everything at your disposal. You might need to set up something (ideally not), but if everything is checked in, you are good to go.
If files are offloaded somewhere else, then you need to set up something. In particular, it means that you cannot disconnect your device after cloning a repository, as your setup process needs to download some other files from somewhere else.
The most annoying part is that you will need to repeat this process when switching branches or updating the current branch. This is error-prone, as scripts/documentation need to be maintained separately, and those scripts/instructions need to be executed, otherwise, you might get an inconsistent state.
A commit in git is a fixed state.
No matter what, checking out that commit will recreate the same environment.
Executing custom code that downloads things might create a slightly different environment, as the source might change state.
For example, most docker containers install packages from a repository.
Thus recreating the docker container from scratch might create a different environment.
It is obviously possible to create a reproducible environment, but it requires more effort.
When a file is downloaded once, there should be no need to download it again.
If dependencies are downloaded (for example) during the build process, you should ensure there is a working caching mechanism too.
It should not depend on the build folder (why should building in a different folder download the artifacts again?), and should also work reliably when switching branches.
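A minimal sketch of such a cache, assuming a POSIX shell with sha256sum available; fetch_cached is a made-up helper, and cp stands in for the real download:

```shell
#!/bin/sh
# Sketch of a build-folder-independent artifact cache (all names are made up
# for illustration). Artifacts are keyed by their checksum, so a second build,
# or a build from another folder or branch, never downloads them again.
CACHE_DIR="${XDG_CACHE_HOME:-$HOME/.cache}/artifact-cache"

fetch_cached() {
    url=$1 sha=$2
    dest="$CACHE_DIR/$sha"
    if [ ! -f "$dest" ]; then
        mkdir -p "$CACHE_DIR"
        cp "$url" "$dest.tmp"   # a real script would use curl/wget here
        actual=$(sha256sum "$dest.tmp" | cut -d' ' -f1)
        [ "$actual" = "$sha" ] || { rm -f "$dest.tmp"; return 1; }
        mv "$dest.tmp" "$dest"
    fi
    printf '%s\n' "$dest"       # callers use the cached path directly
}
```

Builds reference the returned path; deleting the build folder, or switching branches, leaves the cache intact.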
Yes, build systems like gradle can do this (although gradle's cache does not handle multiple concurrent builds and is "only" a best effort, so it is not a good example; other systems, like conan, should not have those issues).
Also, a good caching mechanism should be able to download all required dependencies when instructed to do so. As far as I know, Gradle does not have such a mechanism (and all custom-made plugins I’ve found did not work reliably), the dependencies are reliably downloaded only during the build process.
I do not know other dependency systems of other build systems well enough to express an informed opinion.
git already provides this caching mechanism, and technologies like git-annex and git-lfs take advantage of it.
If everything is checked in, you can use git log (or tig, or any other tool you like) to view the history of any file.
You can verify when and why a file (even a binary file) has been added, thanks to the commit message.
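For example, with everything checked in, plain git commands answer "when and why was this binary added?" (the repository, file name, and message below are made up for the sketch):

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you

mkdir assets
printf '\211PNG not a real image' > assets/logo.png   # stand-in binary file
git add assets/logo.png
git commit -qm "add logo for the new landing page"

# who added the file, when, and why:
git log --follow --format='%h %ad %s' --date=short -- assets/logo.png
```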
If those files are changed outside of git, you need to look at the history of different programs (if possible at all) and merge those histories. This gets especially difficult when trying to recreate an older state, and not only because of the reproducibility issue.
git has built-in support for using different diff tools depending on the file type.
This is incredibly useful when reviewing changes, as it simplifies the workflow.
As it is possible to see the differences with git, it is also possible to see them outside git. What changes is how easy it is to see those differences.
Instead of seeing the change, you might see that a URL to a resource changed. It might be necessary to save copies of the different states one wants to compare and track which copy belongs to which commit, something git already does out of the box.
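One way to get readable diffs even for binary files is a textconv filter declared in .gitattributes. This sketch uses od as a portable converter just for illustration; real setups would rather use a format-aware tool like exiftool or pdftotext:

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you

# declare a custom diff driver for *.bin files...
echo '*.bin diff=hex' > .gitattributes
# ...and tell git how to turn such files into text before diffing
git config diff.hex.textconv 'od -An -tx1'

printf 'AAAA' > blob.bin
git add .gitattributes blob.bin
git commit -qm "add blob"
printf 'AAAB' > blob.bin

# instead of "Binary files ... differ", git now shows a readable hex diff
git diff -- blob.bin
```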
If everything is checked into git, the workflow will look like:
  git add $BINARY_FILE
If some files are tracked outside of git:
  upload $BINARY_FILE to the external repository
  change the file in git that tracks $BINARY_FILE
  git add "file in git that tracks $BINARY_FILE"
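What the "file in git that tracks $BINARY_FILE" looks like depends on the tooling; a minimal hypothetical pointer file (every name and value below is made up) could be:

```ini
# artifact.lock - hypothetical pointer file checked into git
[artifact "skybox-textures"]
url = https://artifacts.example.com/skybox-textures-v3.tar
sha256 = <checksum of the archive>
```

The setup script or build system reads this file, downloads the artifact, and verifies the checksum.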
Also when changing branches, the workflows differ, as already mentioned. With everything checked in:
  git switch $branchname
Otherwise:
  git switch $branchname
  do whatever is necessary to get the desired version of $BINARY_FILE
The fact that the workflow is more complex does not stress loudly enough that knowing how to use git is not enough anymore.
You might need to learn how to use
docker, … or whatever else is introduced for storing the data. Note that you might need to know how to use those tools for other reasons too, but you should pay attention not to "lock in" your development environment.
For example, a docker environment with some dependencies might sound like a good alternative to checking files in
git. If all you need are some big images, why force everyone to work in docker? Especially since it has a maintenance burden?
I’m not saying that having a preconfigured environment for working is a bad idea; on the contrary. But wouldn’t it be better to be able to set up such an environment inside and outside docker?
Granted, this also holds for
git-annex; in case of issues, you need to diagnose those, but for the "good case", the workflow can be as if the files are checked directly in the repository.
Tools like git-annex and git-lfs nearly provide a silver bullet for the current situation.
With both programs, files are not checked in the repository directly. This provides the same advantages as not checking big binary files in git.
Both programs integrate with git (git-annex a little less, at least in the ways I used it) in such a way that none of the listed disadvantages apply.
There are still gotchas; for example, if git-lfs is not configured correctly before the first git clone, the repository will not be in its expected state (the binary files will be missing).
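A sketch of the one-time setup that avoids this gotcha, assuming the git-lfs client is installed (the URL and tracked pattern are made up):

```shell
git lfs install          # once per machine: registers the lfs filter in ~/.gitconfig
git clone https://example.com/big-repo.git   # binary files now arrive as real files

# inside a repository, choosing which files lfs manages:
git lfs track "*.psd"    # writes the rule into .gitattributes
git add .gitattributes
```

Without the initial git lfs install, the clone succeeds but checks out small pointer files instead of the actual content.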
For those providers supporting git-lfs (at least GitHub and GitLab), the data stored in git-lfs is counted as part of the repository (it has to be saved somewhere, after all). Pruning old unused files from git-lfs that no one is interested in means rewriting the history (at least I could not find another way), which is problematic for repositories where multiple people are working. This means I know of no good way to reduce the size of a repository without disrupting the work of other people. (Again, this holds only for those providers.)
If the files are stored "somewhere else" and retrieved with "something else", then dropping the files from "somewhere else" has no effect on the
git history and the size of the repository (which should be small).
This is of little importance for the local copy of the repository, as only the necessary files are downloaded on demand when switching branches.
This is both an advantage and a disadvantage. Downloading all files could mean downloading a lot of unused data. On the other hand, not downloading everything means that you cannot work "offline".
There is a way out: you can download all the data you want explicitly (
git is a good cache manager after all). The most straightforward way to do it is to switch to all branches (or the previous status of branches) you intend to work with.
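A sketch of that workaround (repository layout and branch names made up): with plain git the clone already contains every branch, but with git-annex or git-lfs the same loop is what pulls each branch's on-demand content into the local cache before going offline:

```shell
set -e
work=$(mktemp -d); cd "$work"

# a stand-in "central server" with two branches
git init -q server.git && cd server.git
git config user.email you@example.com
git config user.name you
echo v1 > data.bin
git add data.bin && git commit -qm "main data"
git branch feature
git switch -q feature
echo v2 > data.bin
git commit -qam "feature data"
cd ..

git clone -q server.git laptop && cd laptop
# visit every remote branch once; with git-annex or git-lfs this is what
# downloads each branch's on-demand content into the local cache
for b in $(git for-each-ref --format='%(refname:short)' refs/remotes/origin | grep -vx origin); do
    git switch -q --detach "$b"
done
# from now on, switching between these branches works offline
```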
There is one last point to be said. If you are using a central server for synchronizing your work, tools like git-annex and git-lfs need to be configured appropriately.
git-annex seems to be less supported (GitLab used to support it), but on a "simple setup" (no website for managing your repository, just ssh access) it is sufficient to install the git-annex program on the server.
Having said that, putting everything you can in a repository probably makes little sense. Somewhere you have to draw the line.
Committing a working environment (a virtual machine? a Debian ISO for installing the OS on your physical machine? a git executable/installer?) makes little sense.
Committing the tools where the exact version for building your program might be relevant (for example: the compiler) can make sense.
Personally, I would prefer to not have such hard dependencies on specific versions or tools (there are, for example, multiple C and C++ compilers, multiple JVM, multiple and different versions of interpreters for most dynamic languages, …). This can create a culture where people work with slightly different tools, and (ignoring bugs in the toolchain) leads to more robust software, because different compiler vendors (for example), provide different diagnostics. It also makes it easier to work from different environments, which means that people can work where they are more comfortable: at the command line, inside an IDE, in a Unix environment, a Windows machine, …
A project might depend on third-party libraries, possibly precompiled (remember, the scope of these notes is binary files!). Checking those in is the easiest solution; but similarly to the toolchain, being able to use different versions (assuming a stable API and ABI) can have its advantages too.
And last but not least, in-house binary blobs. Sometimes it is possible to avoid those (instead of committing the compiler binary, compile it from scratch), but it can lead to more error-prone workflows (you have to compile the binary before you can use it).
Other times, it is not possible to avoid those (compiling the binary requires a long time, or it requires a special program that is not available on all machines or the source code has long been lost).
In such cases, it can make sense to see if it is possible to optimize the size of the file (specialized tools like mp3packer, … come to mind), but it does not change the fact that in most cases I would put those files in git directly.
Do you want to share your opinion? Or is there an error, some parts that are not clear enough?
You can contact me anytime.