Parallelize bulk operations



Introducing parallelism in a program is no simple task.

In particular, it makes the software more complex, and the overhead of coordinating the parallel parts can even make it slower.

But some operations are trivial to parallelize without noticeable overhead: when all tasks are independent of each other, we have an embarrassingly parallel problem.

Use case: jpegoptim

jpegoptim is a program that takes a JPEG file, and optimizes it losslessly.

It is currently a single-threaded program, but that is not a problem, as in practice it does not take long to process a single image.

Adding support for parallel processing would be a great feature when invoking jpegoptim on a photo library, where hundreds if not thousands of images are processed. Even if optimizing a single image is a fast operation, the execution on so many files takes a considerable amount of time, and most of the resources of the device are unused.

Handling a folder with subfolders can be accomplished easily with find:

find <source> -type f -iname '*.jpg' -exec jpegoptim {} \;

In this case, a new process is started for every image. Starting processes is a relatively fast operation (at least on Linux-based systems), but it still takes some time. Without resorting to other tools, this cost can be mitigated by passing more images at once to jpegoptim:

find <source> -type f -iname '*.jpg' -exec jpegoptim {} +

This creates fewer processes and also gives jpegoptim the possibility to reuse some resources internally (like memory).

Still, images are processed sequentially. Considering that any modern machine (probably any machine built in the last 10 years) has at least two cores (phones too), a trivial way to parallelize the task is to split the set of images in two and run the mentioned command twice.
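For example, if the library happens to be organized in subfolders (the year-based names below are just an illustration), the two halves can run as background jobs:

find <source>/2019 -type f -iname '*.jpg' -exec jpegoptim {} + &
find <source>/2020 -type f -iname '*.jpg' -exec jpegoptim {} + &
wait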

There are already many possibilities for automating this task; for example, with xargs it is possible to execute an executable in parallel:

find <source> -type f -iname '*.jpg' -print0 | xargs -0 -P2 -n32 jpegoptim

Here -P2 runs at most two processes at a time, while -n32 caps the number of files per invocation; without -n, xargs might pack all files into a single invocation, leaving nothing to run in parallel.

While it works, using xargs as shown has at least two drawbacks.

First, we have to hardcode the number of parallel processes. We could query something like /proc/cpuinfo on Linux systems to determine the number of processors and cores, but it would still not take into account how many resources the system has available at the moment.
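A first improvement, assuming the GNU coreutils are installed, is to let nproc supply the number of processing units instead of hardcoding it (the batch size of 32 remains arbitrary):

find <source> -type f -iname '*.jpg' -print0 | xargs -0 -P "$(nproc)" -n32 jpegoptim

This removes the hardcoded count, but it still does not react to how busy the system currently is.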

Second, jpegoptim supports --totals, which prints a summary of how many files it has processed and, most importantly, how much space it was able to save. Such a report would be incomplete, because we are executing jpegoptim at least twice (notice that this might already happen when processing all images sequentially, since find with -exec … + starts a new process whenever the argument list grows too long). While collecting such statistics means the problem is no longer perfectly parallel, adding parallelism still requires little effort.

Execute jpegoptim with GNU parallel

Solving the first issue is easy, thanks to GNU parallel:

find . -type f -iname '*.jpg' -print0 | parallel --null 'jpegoptim {}'

We do not need to query how much capacity we have: by default, GNU parallel runs one job per CPU core. The number of concurrent instances can also be tuned with options such as --jobs (which accepts a percentage of the cores) and --load (which holds back new jobs while the system load is above a threshold).
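For example, to use three quarters of the cores and avoid starting new jobs while the load average exceeds the number of cores (both values are just an illustration):

find . -type f -iname '*.jpg' -print0 | parallel --null --jobs 75% --load 100% 'jpegoptim {}'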

Replicate --totals

Now that more than one instance of jpegoptim is executed, it's time to look at how to replicate --totals.

jpegoptim also supports the option --csv, which creates output that is generally easier to parse. Note that files whose names contain commas or newlines won't be handled correctly, as they break the CSV format. This is not a showstopper: only the replicated --totals report will be inaccurate, while jpegoptim is still executed on all files, no matter how convoluted their names are.
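The script below assumes a record layout roughly like the following (this line is made up for illustration; the exact fields depend on the jpegoptim version): the file name in the first field, the original and optimized sizes in fields 5 and 6, the saved percentage in field 7, and the status in field 8.

photo.jpg,1600x1200,24bit,N,524288,491520,6.25,optimized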

With awk (or other text processing utilities) it is possible to parse the CSV output and recreate the output that jpegoptim would normally create:

awk -F, '{
  if($8=="error"){
    # failed file: print its name and an error marker
    print $1 (length($2) == 0 ? " " : $2) "[ERROR]"
  }else{
    # accumulate the statistics for the final summary
    sum_percent+=$7; num_files+=1; sum_size+=$5-$6;
    print $1 " " $2 " " $3 " " $4 " [OK] " $5 " --> " $6 " (" $7 "%), " $8
  }
}
END{
  if(num_files == 0){
    percent="nan"
  }else{
    percent=sum_percent/num_files
  }
  print "Average compression (" num_files " files): " percent "% (" sum_size ")"
}'
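To use it, pipe the CSV output of the parallel jpegoptim instances into the awk program just shown:

find . -type f -iname '*.jpg' -print0 | parallel --null 'jpegoptim --csv {}' | awk -F, '…'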

Minimal implementation

A minimal implementation would be something like

#!/bin/sh

help(){
  "$1" --help 2>&1 | \
    sed -e 's/jpegoptim/pjpegoptim/g' \
    -e '1s/^pjpegoptim/pjpegoptim, based on jpegoptim/g' \
    -e 's/Usage.*/Usage: find . -name [pattern] -print0 | pjpegoptim [options]/' \
    -e '/--stdout/a\\n  --jpegoptim invoke another jpegoptim executable instead of the one in PATH' \
    -e '/--stdout/d' -e '/--stdin/d' \
    -e '$aKnown bugs (pjpegoptim only):\n * If a path contains a comma or newline, the average compression size (reported with --totals) is not calculated correctly';
}

version(){
  "$1" --version 2>&1 | sed -e '1s/^jpegoptim/pjpegoptim, based on jpegoptim/g';
}

process_out_with_awk()
{
  awk -F, '{
    if($8=="error"){
      print $1 (length($2) == 0 ? " " : $2) "[ERROR]"
    }else{
      sum_percent+=$7; num_files+=1; sum_size+=$5-$6;
      print $1 " " $2 " " $3 " " $4 " [OK] " $5 " --> " $6 " (" $7 "%), " $8
    }
  }
  END{
    if(num_files == 0){
      percent="nan"
    }else{
      percent=sum_percent/num_files
    }
    print "Average compression (" num_files " files): " percent "% (" sum_size ")"
  }'
}

# Check that an option requiring a value actually received one: the option
# name is expected in $1 and its value (if present) in $2.
test_val(){
  if test $# -lt 2; then :;
    printf 'Missing value for %s\n' "$1" >&2;
    exit 1;
  fi
}

main(){
  TOTALS=false;
  JPEGOPTIM=jpegoptim;
  # Rotate through the arguments exactly once: options handled by this
  # script are consumed, everything else is appended back at the end of
  # the list and later forwarded to jpegoptim.
  i=$#;
  while test "$i" -gt 0; do :;
    key=$1; shift; i=$((i - 1));
    case "$key" in
      --help|help|-h|h|\?)
        help "$JPEGOPTIM"; exit 0;;
      --version|-v)
        version "$JPEGOPTIM"; exit 0;;
      --totals)
        TOTALS=true;;
      --jpegoptim)
        # the next argument is the jpegoptim executable to invoke
        test_val "$key" "$@";
        JPEGOPTIM=$1; shift; i=$((i - 1));;
      *)
        set -- "$@" "$key";;
    esac
  done;

  if [ "$TOTALS" = true ]; then :;
    parallel --null "$JPEGOPTIM" --csv "$@" '{}' | process_out_with_awk;
  else :;
    parallel --null "$JPEGOPTIM" "$@" '{}';
  fi
}

main "$@";

Its usage would be

find . -name [pattern] -print0 | pjpegoptim [options of jpegoptim]
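For example, to optimize a photo library in parallel, limiting the maximum quality to 85 (-m is a regular jpegoptim option; the path is just an example) and printing the aggregated summary at the end:

find ~/Photos -type f -iname '*.jpg' -print0 | pjpegoptim -m85 --totals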

Note that, contrary to jpegoptim, the presented script can only be used in a pipe, i.e. it cannot be called with a list of files as input. Supporting that would require parsing all the options of jpegoptim, in order to separate them from the files to be processed. It can be done, but it is left as an exercise to the reader.

Other programs

While these notes focus on jpegoptim, most of the considerations apply without any changes to other programs.

Some examples that come to mind are other optimization processes on a big collection of files, like optipng, test suites, or validation processes like mp3val on a music library, …
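For instance, the same find/parallel pattern works unchanged for the tools just mentioned (options omitted; both programs process every file independently):

find . -type f -iname '*.png' -print0 | parallel --null 'optipng {}'
find . -type f -iname '*.mp3' -print0 | parallel --null 'mp3val {}'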

As long as the bottleneck is the processing of the data on a single core, and not, for example, reading and writing the data to disk, using parallel or other utilities like xargs or make can greatly reduce the runtime.


Do you want to share your opinion? Or is there an error, or some parts that are not clear enough?

You can contact me here.