File optimizations

Just recently I noticed that there is a PNG optimizer I was not aware of: zopflipng

After testing it on different images, it turned out I could reduce the size of some of them by a further 0.1-5%.

I already use optipng to reduce the space my files require on disk. For png files in particular there are multiple optimizers, but often executing more than one is just a waste of time.
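
For the record, a typical invocation of the two tools looks roughly like this (the optimization level and the -m flag are just one reasonable choice, not a recommendation):

optipng -o5 image.png                     # lossless recompression; higher -o levels try more strategies
zopflipng -m image.png image-small.png    # -m spends more iterations to shave off a few extra bytes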

Since I did not document anywhere which tools I’m using for optimizing the size of different formats (except in scripts and my shell history), I’ve decided to better document what tools are available.

Why should I want to reduce the quality of my files?

The tools listed here do not reduce the quality of the picture, video, or audio.

There are two categories of methods for compressing data: lossy and lossless.

Lossless algorithms do not throw away any data; the process is completely reversible. This is like archiving a file in a zip archive: if you extract it, the extracted file is identical to the file you archived at the beginning, even if the zip archive is smaller.

Lossy algorithms throw away some data, for example, encoding a high-definition video to low definition. No matter what you do, you will not be able to restore the high-definition video from the low-definition one.

Having said that, there are certain types of lossy optimization that are acceptable and do not degrade quality.

For example, removing metadata (location, camera model, license, …) from an image, or removing comments and non-relevant whitespace from an HTML file. PNG internally uses a lossless algorithm; by changing the parameters of that algorithm, the image can be stored in less space without any change in quality.
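
To give an idea, metadata can be stripped with tools like exiftool or jpegoptim (one possible way among many; the file names are placeholders):

exiftool -all= -overwrite_original screenshot.png    # drop all metadata, rewriting the file in place
jpegoptim --strip-all photo.jpg                      # strip EXIF/comment markers while keeping the image data intact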

In both cases, in the absence of errors, the original and "optimized" file will be rendered exactly the same⁠[1].

Granted, if you want to keep the metadata of your picture, or if you need to manually edit the HTML file, or if you want to restore the original algorithm parameters used in the PNG file, then those "optimizers" are doing a lossy operation and should be avoided.

Why should I want to reduce the required size for my files?

There are multiple advantages.

  • you can store more things, as you effectively gain some free space

  • faster transfer times (especially download and upload between different systems)

  • processing files through an optimizer helped me to find (and repair) some broken files (as most optimizers should validate the input data)

  • some documents might have sensitive metadata you generally do not want to be public

There are obviously some disadvantages or at least risks.

  • the optimizing program might be buggy and create an invalid file from a valid one

  • the optimizing program might be buggy and create a valid file with different content

  • something might depend on the exact binary, any change will break this dependency. An example would be a digital signature.

  • risk of losing metadata you were not aware of, only to realize later that you were interested in it

  • opening an optimized file might require more time to process

  • the optimizing process might not be reproducible, leading every time to a different file (I suggest reporting the issue upstream)

  • it can take a long time to optimize some files, for little (or no) gain

What metadata I’m interested in depends on the situation. For my personal photo gallery, the EXIF metadata should not be touched. For a screenshot, I’m not interested in the metadata.

For a "published" HTML file, at the end of a build process, it is fine to remove all comments and otherwise unused content. In a file that I’m writing by hand, nothing (except maybe trailing whitespace) should be discarded.

I’m not picking an HTML file by accident: minifying documents on the web is, for better or worse, a common optimization technique.
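
With the minify tool listed in the table below, this boils down to something like the following (the output file name is arbitrary):

minify -o index.min.html index.html    # drops comments and collapses whitespace; the rendered page stays the same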

Another example of an "optimization process" would be removing garbage from zip archives. It is definitely not a lossless transformation, but since it does not remove things I’m interested in, it is lossless for my workflows.

While "optimizing" can take a lot of time, accessing the optimized data should not incur a significant cost. At least it never happened to me, even on less powerful devices, that opening an optimized .jpg or .png file took measurably more time than its "unoptimized" variant. What did make a difference was putting big files in an archive and trying to access them directly, without extracting them beforehand.

Another thing to consider is that usually I do not open all the files I have at the same time, and some files might be left untouched for months (or years), so being able to process them as fast as possible is not that important. On the other hand, the files use the space on the drives all the time.

Thus in general I prefer to have smaller files, even if it takes some time to optimize them, at least up to a certain point.

Use a "better" format

Sometimes, instead of optimizing an existing format, it is possible to convert the file to another format.

For example, the jxl format is more efficient at storing the data than .jpg (and apparently .png too), and it is possible to make a lossless conversion between the two file formats. In this case, lossless really means that converting back yields a bit-for-bit identical file.

cjxl -q 100 image.jpg image.jpg.jxl    # recompress the jpg losslessly into a jxl file
djxl image.jpg.jxl image.jpg.jxl.jpg   # reconstruct the original jpg from the jxl file
md5sum image.jpg image.jpg.jxl.jpg     # same hash values
Note 📝
in my test with a png file, I had to execute optipng with -strip all to get a binary-identical file back. With the jpg file this was not necessary.

While using a different file format might be much faster than optimizing an existing one, the main disadvantage is that not all programs necessarily support the new format, so you might need to convert back and forth.

Another example would be the rvz archive format used by the Dolphin emulator. .iso files are generally 4GB big, but for many games the actual relevant content is much smaller. Dolphin can make a lossless conversion from iso to rvz files (and use them directly), which can lead to a reduction of multiple orders of magnitude. Similarly to rvz, there is also the chd file format, often more efficient than .iso files.
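
From the command line this looks roughly as follows (sketched from memory: the block size, codec, compression level and the createdvd subcommand are assumptions that depend on the tool versions and the source media, so check dolphin-tool convert --help and chdman --help):

dolphin-tool convert -i game.iso -o game.rvz -f rvz -b 131072 -c zstd -l 5    # iso -> rvz, zstd-compressed
chdman createdvd -i game.iso -o game.chd                                      # iso -> chd (createcd is for CD images)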

Some programs I’ve been using so far

I mostly focus on tools that are already packaged: it reduces the risk of executing some malware and avoids having to maintain those programs separately from others.

Command-line tools

Graphical tools (relying on command-line tools)

To reduce the chance of breaking some files by introducing artifacts, it is a good idea to use independent tools to verify that:

  • file type is valid

  • no artifacts were introduced

For text files, there are more than enough diffing tools, while for other file types, it is more difficult to find something appropriate. Often the accepted solution is to open the file with a graphical application: if the tool can open it, then it is probably valid. This might work if you only have a dozen files, but it is problematic if you have many more.
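
For images, the verifiers listed in the table below can at least check a whole batch of files at once, for example:

jpeginfo -c *.jpg    # -c decodes every file and reports broken ones
pngcheck -q *.png    # -q only prints the files that have errors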

When converting to a different file format, the easiest way to test that the transformation is lossless is to convert back and forth and do a binary comparison.
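
The jxl example above does exactly that. The same idea works for compressors too, for instance with zopfli (which keeps the original file next to the compressed one):

zopfli file.txt                                  # writes file.txt.gz, the original is kept
zcat file.txt.gz | cmp - file.txt && echo OK     # cmp succeeds (and stays silent) only if both are byte-identical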

I’m currently also using Beyond Compare (which is not open source and thus not part of the official Debian repositories) for comparing images, archives, folder structures, and other file types.

For finding differences between multiple file types, diffoscope also played an important role.

format            optimizers                    diff tools                     verifiers

images
  jpg             jpegoptim                     findimagedupes, magic compare  jpeginfo
  jxl             djxl, cjxl
  png             optipng, zopflipng, pngcrush  findimagedupes, magic compare  pngcheck
  gif             optipng, gifsicle             findimagedupes
  svg             optipng, minify
  bmp             optipng
  tiff            optipng

audio
  mp3             mp3packer                                                    mp3val
  opus            optivorbis

archives
  zip
  gzip            zopfli
  rvz             dolphin-tool
  chd             chdman

documents
  pdf             qpdf                          diffpdf
  html            htmlmin, minify                                              tidy
  xml             minify
  epub
  docx, pptx, xlsx

other
  css             cssmin, csstidy, minify
  json            minify
  js              terser, uglifyjs
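
To give an idea of how some of these are typically invoked (the flags are merely the ones I would reach for first, not the only sensible choices):

gifsicle -O3 --batch animation.gif                                  # optimize the gif in place
qpdf --object-streams=generate --recompress-flate in.pdf out.pdf    # rewrite the pdf with recompressed streams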


1. Exceptions apply, for example, Internet Explorer used conditional comments

Do you want to share your opinion? Or is there an error, or some parts that are not clear enough?

You can contact me anytime.