File optimizations
Just recently I noticed that there is a PNG optimizer I was not aware of: zopflipng. After testing it on different images, it turns out I could reduce the size of some images by 0.1-5%.
I already use optipng to optimize the required space on disk. For png files in particular there are multiple optimizers, but often executing more than one is just a waste of time.
Since I did not document anywhere what tools I’m using for optimizing the size of different formats (except in scripts and my shell history), I’ve decided to document better what tools are available.
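To give an idea, the two can be chained roughly like this (the file names are placeholders, and the flags are only reasonable defaults; check the respective --help output of your versions):

optipng -o5 image.png                    # lossless recompression, done in place
zopflipng -m image.png image_small.png   # -m: spend more time to shave off a few more bytes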
Why should I want to reduce the quality of my files?
The tools listed here do not reduce the quality of the picture, video, or audio.
There are two categories of methods for compressing data: lossy and lossless.
Lossless algorithms do not throw away any data: the process is completely reversible. This is like archiving a file in a zip archive: if you extract it, the extracted file is identical to the file you archived at the beginning, even if the zip archive is smaller.
Lossy algorithms throw away some data, for example when encoding a high-definition video to low definition. No matter what you do, you will not be able to restore the high-definition video from the low-definition one.
Having said that, there are certain types of lossy optimization that are acceptable and do not degrade quality.
For example, removing metadata (location, camera model, license, …) from an image, or removing comments and irrelevant whitespace from an HTML file. PNG internally uses a lossless algorithm; by changing the parameters of that algorithm, the image can be stored in less space without any change in quality.
In both cases, in the absence of errors, the original and "optimized" file will be rendered exactly the same[1].
Granted, if you want to keep the metadata of your picture, or if you need to edit the HTML file manually, or if you want to restore the original algorithm parameters used in the PNG file, then those "optimizers" are doing a lossy operation and should be avoided.
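To make the first example concrete: with jpegoptim and optipng (both listed further below), stripping metadata is a single flag (file names are placeholders):

jpegoptim --strip-all photo.jpg      # drop EXIF and other markers losslessly (the picture itself is untouched)
optipng -o5 -strip all screenshot.png   # recompress and drop ancillary chunks such as text metadata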
Why should I want to reduce the required size for my files?
There are multiple advantages.
- you can store more things, as if you had gained some free space
- faster transfer times (especially when downloading and uploading between different systems)
- processing files through an optimizer helped me find (and repair) some broken files (as most optimizers should validate the input data)
- some documents might contain sensitive metadata you generally do not want to be public
There are obviously some disadvantages or at least risks.
- the optimizing program might be buggy and create an invalid file from a valid one
- the optimizing program might be buggy and create a valid file with different content
- something might depend on the exact binary content, and any change will break this dependency; an example would be a digital signature
- risk of losing metadata you were not aware of, only to realize later that you were interested in it
- opening an optimized file might require more time to process
- the optimizing process might not be reproducible, leading to a different file every time (I suggest reporting the issue upstream)
- it can take a long time to optimize some files, for little (or no) gain
What metadata I’m interested in depends on the situation. For my personal photo gallery, the EXIF metadata should not be touched. For a screenshot, I’m not interested in the metadata.
A "published" HTML file, at the end of a build process, it is fine to remove all comments and otherwise unused content. In a file that I’m writing by hand, nothing (except maybe trailing whitespace) should be discarded.
I’m not picking an HTML file by accident, minifying document on the web is, for better or worse, a common optimization technique.
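As an illustration, with htmlmin (listed further below) such a build step can be a one-liner (file names are placeholders; which options to enable depends on how aggressive you want to be):

htmlmin input.html output.html   # options such as comment removal are opt-in, see htmlmin --help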
Another example of an "optimization process" would be removing garbage from zip archives. It is definitely not a lossless transformation, but since it does not remove things I’m interested in, it is lossless for my workflows.
While "optimizing" can take a lot of time, accessing the optimized data should not incur a significant cost. At least it never happened to me, even on less powerful devices, that opening an optimized .jpg
or .png
file took measurably more time than its "unoptimized" variant. What did make a difference, was putting big files in an archive and trying to access them directly, without extracting them beforehand.
Another thing to consider is that I usually do not open all the files I have at the same time, and some files might be left untouched for months (or years), so being able to process them as fast as possible is not that important. On the other hand, the files take up space on the drives all the time.
Thus, in general, I prefer to have smaller files, even if it takes some time to optimize them, at least up to a certain point.
Use a "better" format
Sometimes, instead of optimizing an existing format, it is possible to convert the file to another format.
For example, the jxl format is more efficient at storing data than .jpg (and apparently .png too), and it is possible to make a lossless conversion between the two formats. In this case, lossless really means that the files are bit-for-bit identical.
cjxl -q 100 image.jpg image.jpg.jxl    # recompress the jpg data losslessly into a jxl file
djxl image.jpg.jxl image.jpg.jxl.jpg   # reconstruct the original jpg from the jxl file
md5sum image.jpg image.jpg.jxl.jpg     # same hash values
Note 📝: in my test with a png file, I had to execute optipng with -strip all to get a binary-identical file. With the jpg file this was not necessary.
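For reference, the png round trip would look something like this (file names are placeholders):

optipng -strip all image.png           # normalize the png and drop ancillary chunks first
cjxl -q 100 image.png image.png.jxl    # lossless conversion to jxl
djxl image.png.jxl image.png.jxl.png   # and back to png
md5sum image.png image.png.jxl.png     # hashes should match, per the note above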
While using a different file format might be much faster than optimizing the existing one, the main disadvantage is that not all programs necessarily support the new format, so you might need to convert back and forth.
Another example would be the rvz archive format used by the Dolphin emulator. .iso files are generally 4GB big, but for many games the actual relevant content is much smaller. Dolphin can make a lossless conversion from iso to rvz files (and use them directly), which can lead to a size reduction of multiple orders of magnitude. Similarly to rvz, there is also the chd file format, often more efficient than .iso files.
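The conversions themselves are one-liners; something along these lines should work (paths are placeholders, and the exact flags depend on the installed versions of dolphin-tool and chdman):

dolphin-tool convert -f rvz -c zstd -l 5 -b 131072 -i game.iso -o game.rvz   # lossless rvz with zstd compression
chdman createcd -i game.cue -o game.chd                                      # pack a cue/bin cd image into chd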
Some programs I’ve been using so far
I mostly focus on tools that are already packaged; this reduces the risk of executing some malware and avoids having to maintain those programs separately from the others.
Command-line tools
- advpng, part of the advancecomp package
- jpegoptim
- cjxl and djxl, from the libjxl-tools package
- optipng
- zopfli / zopflipng
- optivorbis
- dolphin-tools
- chdman, from the mame-tools package
- qpdf
- htmlmin
- terser
- uglifyjs
Graphical tools (relying on command-line tools)
To reduce the chance of breaking some files by introducing artifacts, it is a good idea to use independent tools to verify that:
- the file type is valid
- no artifacts were introduced
For text files, there are more than enough diffing tools, while for other file types it is more difficult to find something appropriate. Often the accepted solution is to open the file with a graphical application: if the tool can open it, then it is probably valid. This might work if you only have a dozen files, but it is problematic if you have many more (see the sketch after the following list).
- findimagedupes
- diffpdf
- jpeginfo
- pngcheck
- mp3val
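For larger collections, these checkers can be scripted into a quick batch pass; a minimal sketch (directory names are placeholders):

jpeginfo -c photos/*.jpg | grep -v '\[OK\]'   # -c: check integrity, then hide the files that are fine
pngcheck -q screenshots/*.png                 # -q: only files with errors are reported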
When converting to a different file format, the easiest way to test that the transformation is lossless is to convert back and forth and do a binary comparison.
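For instance, for gzip (e.g. via zopfli) the check is straightforward, since decompressing must restore the original bytes exactly (the file name is a placeholder):

zopfli -c document.txt > document.txt.gz    # -c: write the compressed data to stdout, keep the original untouched
zcat document.txt.gz | cmp - document.txt   # no output means the round trip is bit-for-bit identical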
I’m currently also using Beyond Compare (which is not open source and thus not part of the official Debian repositories) for comparing images, archives, folder structures, and other file types.
For finding differences between multiple file types, diffoscope also played an important role.
format | optimizers | diff tools | verifiers
---|---|---|---
images | | findimagedupes |
jpg | jpegoptim | | jpeginfo
jxl | | |
png | optipng, zopflipng, advpng | | pngcheck
gif | | |
svg | | |
bmp | | |
tiff | | |
audio | | |
mp3 | | | mp3val
opus | | |
archives | | |
zip | | |
gzip | zopfli | |
rvz | dolphin-tools | |
chd | chdman | |
documents | | |
pdf | qpdf | diffpdf |
html | htmlmin | |
xml | | |
epub | | |
docx, pptx, xlsx | | |
other | | |
css | | |
json | | |
js | terser, uglifyjs | |
Do you want to share your opinion? Or is there an error, or some parts that are not clear enough?
You can contact me anytime.