Parallelize bulk operations
Introducing parallelism in a program is no simple task. In particular, it makes the software more complex and, because of the coordination overhead, sometimes even slower. But some operations are trivial to parallelize without noticeable overhead: those where all tasks are independent. This is the case of an embarrassingly parallel problem.
Use case: jpegoptim
jpegoptim is a program that takes a JPEG file and performs lossless optimizations.
It is currently a single-threaded program, but that is not a problem, as in practice it does not take long to process a single image.
Adding support for parallel processing would be a great feature when invoking jpegoptim on a photo library, where hundreds if not thousands of images are processed. Even if optimizing a single image is a fast operation, processing so many files takes a considerable amount of time, and most of the resources of the device are unused.
Handling a folder with subfolders can be accomplished easily with find:
find <source> -type f -iname '*jpg' -exec jpegoptim {} \;
In this case, a new process is started for every image. Starting processes is a relatively fast operation (at least on Linux-based systems), but it still takes some time. Without resorting to other tools, this cost can be mitigated by passing more images per invocation to jpegoptim:
find <source> -type f -iname '*jpg' -exec jpegoptim {} +
This creates fewer processes and also gives jpegoptim the possibility to reuse some resources internally (for example, allocated memory).
Still, images are processed sequentially. Considering that any modern machine (probably any machine built in the last 15 years, phones included) has at least two cores, a trivial way to parallelize the task is to split the set of images in two and run the mentioned command twice.
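For instance, with a library already split into two subfolders (dir1 and dir2 are hypothetical names), a crude version could be:
# process two subfolders concurrently, then wait for both jobs
find dir1 -type f -iname '*jpg' -exec jpegoptim {} + &
find dir2 -type f -iname '*jpg' -exec jpegoptim {} + &
wait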
There are already many possibilities for automating this task; for example, with xargs it is possible to execute commands in parallel 🗄️
find <source> -type f -iname '*jpg' -print0 | xargs -0 -P2 jpegoptim;
While it works, using xargs as shown has at least two drawbacks.
First, we have to hardcode the number of processes. We could query something like /proc/cpuinfo on Linux systems to determine the number of processors and cores, but that still would not take into account how many resources the system currently has available, nor a workload that changes during execution.
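For reference, on Linux the number of logical cores can be queried as follows (note that this says nothing about the current load):
grep -c '^processor' /proc/cpuinfo
or, with GNU coreutils:
nproc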
Second, jpegoptim supports --totals, which prints a summary of how many files it has processed and, most importantly, how much space it was able to save. Such reports would of course be incomplete, because we are executing jpegoptim at least twice (notice that this might already be true when processing all images sequentially). While collecting such statistics means the problem is no longer perfectly parallel, it is still trivial to parallelize most of the workload.
A minor drawback (minor with jpegoptim; it might not be with other programs) is that the output of concurrent processes might be garbled.
Execute jpegoptim with GNU parallel
Solving the first issue is easy, thanks to GNU parallel 🗄️:
find . -name '*jpg' -print0 | parallel --null 'jpegoptim {}'
This command will start a new process for every image, but:
- we do not need to query how much capacity we have; GNU parallel handles it for us.
- it will dynamically adapt the number of instances executed concurrently, based on how many resources are available.
At least with jpegoptim, running more processes in parallel is much more time-efficient than minimizing the number of processes. GNU parallel also buffers the output, reducing the chances of garbled output.
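Should finer control still be needed, GNU parallel accepts an explicit job limit too; as a sketch, -j 50% uses half of the detected cores, while --keep-order prints the outputs in the same order as the inputs:
find . -iname '*jpg' -print0 | parallel --null -j 50% --keep-order 'jpegoptim {}'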
Replicate --totals
Now that more than one instance of jpegoptim is executed at a time, it's time to look at how to replicate --totals.
jpegoptim also supports the option --csv, which creates output that is generally easier to parse. Note that files containing commas or newlines in their names won't be handled correctly, as they break the CSV format. This is not a showstopper: only the output of --totals will be inaccurate; jpegoptim is still executed on all files, no matter how convoluted their names are.
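If desired, files with a comma in the name could be excluded from the run that collects the statistics and processed in a second, plain run (a workaround sketch, not something jpegoptim itself provides):
find . -iname '*jpg' ! -name '*,*' -print0 | parallel --null 'jpegoptim --csv {}'
find . -iname '*jpg' -name '*,*' -print0 | parallel --null 'jpegoptim {}'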
With awk (or other text-processing utilities) it is possible to parse the CSV output and recreate the output that jpegoptim would normally produce:
awk -F, '{
    if($8=="error"){
        # field 1 is the file name, field 2 the image info (may be empty)
        print $1 (length($2) == 0 ? " " : $2) "[ERROR]"
    }else{
        # accumulate the statistics needed for the final summary
        sum_percent+=$7; num_files+=1; sum_size+=$5-$6;
        print $1 " " $2 " " $3 " " $4 " [OK] " $5 " --> " $6 " (" $7 "%), " $8
    }
}
END{
    if(num_files == 0){
        percent="nan"
    }else{
        percent=sum_percent/num_files
    }
    print "Average compression (" num_files " files): " percent "% (" sum_size ")"
}'
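Putting the pieces together (assuming the awk program above has been saved in a hypothetical file totals.awk):
find . -iname '*jpg' -print0 | parallel --null 'jpegoptim --csv {}' | awk -F, -f totals.awk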
Minimal implementation
A minimal implementation would be something like:
#!/bin/sh

# Print the help of the underlying jpegoptim, adapted for this wrapper.
help(){
    "$1" --help 2>&1 | \
    sed -e 's/jpegoptim/pjpegoptim/g' \
        -e '1s/^pjpegoptim/pjpegoptim, based on jpegoptim/g' \
        -e 's/Usage.*/Usage: find . -name [pattern] -print0 | pjpegoptim [options]/' \
        -e '/--stdout/a\\n --jpegoptim invoke another jpegoptim executable instead of the one in PATH' \
        -e '/--stdout/d' -e '/--stdin/d' \
        -e '$aKnown bugs (pjpegoptim only):\n * If a path contains a comma or newline, the average compression size (reported with --totals) is not calculated correctly';
}

version(){
    "$1" --version 2>&1 | sed -e '1s/^jpegoptim/pjpegoptim, based on jpegoptim/g';
}

# Recreate the output of jpegoptim --totals from the --csv output.
process_out_with_awk(){
    awk -F, '{
        if($8=="error"){
            print $1 (length($2) == 0 ? " " : $2) "[ERROR]"
        }else{
            sum_percent+=$7; num_files+=1; sum_size+=$5-$6;
            print $1 " " $2 " " $3 " " $4 " [OK] " $5 " --> " $6 " (" $7 "%), " $8
        }
    }
    END{
        if(num_files == 0){
            percent="nan"
        }else{
            percent=sum_percent/num_files
        }
        print "Average compression (" num_files " files): " percent "% (" sum_size ")"
    }'
}

# Abort if an option that requires a value is missing one.
test_val(){
    if test $# -lt 2; then :;
        printf 'Missing value for %s\n' "$1" >&2;
        exit 1;
    fi
}

main(){
    # When stdin is a terminal (nothing is piped in), show the help.
    [ -t 0 ] && set -- "$@" '?';
    TOTALS=false;
    JPEGOPTIM=jpegoptim;
    SKIPNEXT=false;
    # Consume the options handled by this script, re-append the others.
    for key in "$@"; do :;
        shift;
        case "$key" in
            --help|help|-h|h|\?)
                help "$JPEGOPTIM"; exit 0;;
            --version|-v)
                version "$JPEGOPTIM"; exit 0;;
            --totals)
                TOTALS=true;;
            --jpegoptim)
                test_val "$key" "$@";
                JPEGOPTIM="$1"; SKIPNEXT=true;;
            *)
                if ! $SKIPNEXT ; then :;
                    set -- "$@" "$key";
                fi
                SKIPNEXT=false;
        esac
    done;
    if $TOTALS ; then :;
        parallel --null "$JPEGOPTIM" --csv "$@" '{}' | process_out_with_awk;
    else :;
        parallel --null "$JPEGOPTIM" "$@" '{}';
    fi
}

main "$@";
Its usage would be
find . -name [pattern] -print0 | pjpegoptim [options of jpegoptim]
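For example (~/Pictures is just a placeholder path):
find ~/Pictures -iname '*jpg' -print0 | pjpegoptim --totals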
Note that, contrary to jpegoptim, the presented script can only be used in a pipe, i.e. it cannot be called with a list of files as input. Supporting that would require parsing the arguments of jpegoptim, to separate the options from the files to be processed. It can be done, but it's left as an exercise to the reader.
Other programs
While these notes focus on jpegoptim, most of the considerations apply without any changes to other programs. Some examples that come to mind are other optimization processes on a big collection of files, like optipng, test suites, validation processes like mp3val on a music library, or reducing test cases found while fuzzing, …
As long as the bottleneck is the processing of the data on a single core, and not, for example, reading and writing the data to disk, using parallel or even other utilities like xargs or make can greatly reduce the runtime.
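For example, the very same pipeline should work unchanged with optipng (assuming it is installed):
find . -iname '*png' -print0 | parallel --null 'optipng {}'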
Do you want to share your opinion? Or is there an error, some parts that are not clear enough?
You can contact me anytime.