Parallelize bulk operations


Introducing parallelism in a program is no simple task.

In particular, it makes the software more complex and generally slower.

But some operations are trivial to parallelize without noticeable overhead: when all tasks are independent, we are dealing with an embarrassingly parallel problem.

Use case: jpegoptim

jpegoptim is a program that takes a JPEG file and makes lossless optimizations.

It is currently a single-threaded program, but that is not a problem, as in practice it does not take long to process a single image.

Adding support for parallel processing would be a great feature when invoking jpegoptim on a photo library, where hundreds if not thousands of images are processed. Even if optimizing a single image is a fast operation, processing so many files takes a considerable amount of time, and most of the resources of the device are unused.

Handling a folder with subfolders can be accomplished easily with find:

find <source> -type f -iname '*jpg' -exec jpegoptim {} \;

In this case, a new process is started for every image. Starting a process is a relatively fast operation (at least on Linux-based systems), but it still takes some time. Without resorting to other tools, this cost can be mitigated by passing more images to each jpegoptim invocation:

find <source> -type f -iname '*jpg' -exec jpegoptim {} +

This creates fewer processes and also gives jpegoptim the possibility to reuse some resources internally (for example, allocated memory).

Still, images are processed sequentially. Considering that any modern machine (probably any machine built in the last 15 years) has at least two cores (phones included), a trivial way to parallelize the task is to split the set of images in two and run the mentioned command twice.
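A minimal sketch of this idea, assuming the photo library happens to be split across two subfolders (dirA and dirB are placeholder names), runs both halves as background jobs:

# process the two halves concurrently as background jobs
find <source>/dirA -type f -iname '*jpg' -exec jpegoptim {} + &
find <source>/dirB -type f -iname '*jpg' -exec jpegoptim {} + &
# block until both background jobs have finished
wait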

There are already many possibilities for automating this task; for example, xargs can execute commands in parallel 🗄️:

find <source> -type f -iname '*jpg' -print0 | xargs -0 -P2 jpegoptim

While it works, using xargs as shown has at least two drawbacks.

First, we have to hardcode the number of processes. We could query something like /proc/cpuinfo on Linux systems to determine the number of processors and cores, but that would still not take into account how many resources are actually available, nor what happens if the workload changes during execution.
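As a sketch, on systems with GNU coreutils the core count can be queried with nproc (or derived from /proc/cpuinfo) and fed to xargs; this removes the hardcoded value but still ignores the actual load:

# one process per online core; still blind to load changes at runtime
find <source> -type f -iname '*jpg' -print0 | xargs -0 -P"$(nproc)" jpegoptim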

Second, jpegoptim supports --totals, which creates a summary of how many files it has processed, but most importantly how much space it was able to save. Of course, such reports would be incomplete because we are executing jpegoptim at least twice (notice that this might already be true when processing all images sequentially, since find with -exec {} + may invoke jpegoptim more than once for a large set of files). While collecting such statistics means our problem is no longer embarrassingly parallel, it is still trivial to parallelize most of the workload.

A minor drawback (minor with jpegoptim, possibly not with other programs) is that the output of the concurrent processes might be garbled.

Execute jpegoptim with GNU parallel

Solving the first issue is easy, thanks to GNU parallel 🗄️:

find . -type f -iname '*jpg' -print0 | parallel --null 'jpegoptim {}'

This command will start a new process for every image, but:

  • we do not need to query how much capacity we have; GNU parallel will handle it for us.

  • it will also dynamically adapt the number of instances executed concurrently, based on how many resources are available (see the sketch after this list).
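If more control is desired, GNU parallel also accepts explicit limits; a sketch (the 90% load threshold is an arbitrary example):

# default is one job slot per CPU core; --load additionally holds back
# new jobs while the system load is above the given threshold
find . -type f -iname '*jpg' -print0 | parallel --null --load 90% 'jpegoptim {}'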

At least with jpegoptim, running more processes in parallel is much more time-efficient than minimizing the number of processes.
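This is easy to verify on a concrete library; a rough comparison with time (note that jpegoptim modifies files in place, so each command should run on its own copy of the images):

# sequential baseline
time find . -type f -iname '*jpg' -exec jpegoptim {} +
# parallel run, on an identical copy of the library
time sh -c "find . -type f -iname '*jpg' -print0 | parallel --null 'jpegoptim {}'"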

GNU parallel also buffers the output of every job, reducing the chances of garbled output.

Replicate --totals

Now that more than one instance of jpegoptim is executed at the same time, it’s time to look at how to replicate --totals.

jpegoptim also supports the option --csv, which creates an output that is generally easier to parse. Note that files containing commas or newlines in the name won’t be handled correctly, as they break the CSV format. This is not a showstopper: only the output of --totals will be inaccurate; jpegoptim is still executed on all files, no matter how convoluted their names are.

With awk (or other text processing utilities) it is possible to parse the CSV output and recreate the output that jpegoptim would normally create:

awk -F, '{
  # field 8 is the status reported by jpegoptim --csv
  if($8=="error"){
    print $1 (length($2) == 0 ? "" : " " $2) " [ERROR]"
  }else{
    # fields 5, 6 and 7 are size before, size after and saved percentage
    sum_percent+=$7; num_files+=1; sum_size+=$5-$6;
    print $1 " " $2 " " $3 " " $4 " [OK] " $5 " --> " $6 " (" $7 "%), " $8
  }
}
END{
  # mimic the summary line normally printed by jpegoptim --totals
  if(num_files == 0){
    percent="nan"
  }else{
    percent=sum_percent/num_files
  }
  print "Average compression (" num_files " files): " percent "% (" sum_size ")"
}'
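Glued together, the whole pipeline might look like this (totals.awk is a hypothetical file containing the awk program above):

find . -type f -iname '*jpg' -print0 \
  | parallel --null 'jpegoptim --csv {}' \
  | awk -F, -f totals.awk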

Minimal implementation

A minimal implementation would be something like:

#!/bin/sh

help(){
  "$1" --help 2>&1 | \
    sed -e 's/jpegoptim/pjpegoptim/g' \
    -e '1s/^pjpegoptim/pjpegoptim, based on jpegoptim/g' \
    -e 's/Usage.*/Usage: find . -name [pattern] -print0 | pjpegoptim [options]/' \
    -e '/--stdout/a\\n  --jpegoptim invoke another jpegoptim executable instead of the one in PATH' \
    -e '/--stdout/d' -e '/--stdin/d' \
    -e '$aKnown bugs (pjpegoptim only):\n * If a path contains a comma or newline, the average compression size (reported with --totals) is not calculated correctly';
}

version(){
  "$1" --version 2>&1 | sed -e '1s/^jpegoptim/pjpegoptim, based on jpegoptim/g';
}

process_out_with_awk(){
  awk -F, '{
    # field 8 is the status reported by jpegoptim --csv
    if($8=="error"){
      print $1 (length($2) == 0 ? "" : " " $2) " [ERROR]"
    }else{
      # fields 5, 6 and 7 are size before, size after and saved percentage
      sum_percent+=$7; num_files+=1; sum_size+=$5-$6;
      print $1 " " $2 " " $3 " " $4 " [OK] " $5 " --> " $6 " (" $7 "%), " $8
    }
  }
  END{
    # mimic the summary line normally printed by jpegoptim --totals
    if(num_files == 0){
      percent="nan"
    }else{
      percent=sum_percent/num_files
    }
    print "Average compression (" num_files " files): " percent "% (" sum_size ")"
  }'
}

# check that an option expecting a value actually received one
test_val(){
  if test $# -lt 2; then :;
    printf 'Missing value for %s\n' "$1" >&2;
    exit 1;
  fi
}

main(){
  # when stdin is a terminal (no pipe), ask for the help text
  [ -t 0 ] && set -- "$@" '?';
  TOTALS=false;
  JPEGOPTIM=jpegoptim;
  SKIPNEXT=false;
  for key in "$@"; do :;
    shift;
    case "$key" in
      --help|help|-h|h|\?)
        help "$JPEGOPTIM"; exit 0;;
      --version|-v)
        version "$JPEGOPTIM"; exit 0;;
      --totals)
       TOTALS=true;;
      --jpegoptim)
       # after the shift above, $1 holds the value following --jpegoptim
       test_val "$key" "$@"; JPEGOPTIM="$1"; SKIPNEXT=true;;
      *)
        if ! $SKIPNEXT ; then :;
          set -- "$@" "$key";
        fi
        SKIPNEXT=false;
    esac
  done;


  if $TOTALS ; then :;
    parallel --null "$JPEGOPTIM" --csv "$@" '{}' | process_out_with_awk;
  else :;
    parallel --null "$JPEGOPTIM"       "$@" '{}';
  fi
}

main "$@";

Its usage would be

find . -name [pattern] -print0 | pjpegoptim [options of jpegoptim]

Note that unlike jpegoptim, the presented script can only be used in a pipe, i.e. it cannot be called with a list of files as input. Supporting that would require parsing the arguments accepted by jpegoptim, to separate them from the files to be processed. It can be done, but it’s left as an exercise for the reader.
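For example, a complete invocation (the path and the quality setting are made up for illustration) could look like:

find ~/Pictures -type f -iname '*jpg' -print0 | pjpegoptim --totals --max=85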

Other programs

While these notes focus on jpegoptim, most of the considerations can be applied without any changes to other programs.

Some examples that come to mind are other optimization passes on a big collection of files (like optipng), test suites, validation processes like mp3val on a music library, reducing test cases found while fuzzing, …

As long as the bottleneck is the processing of the data on a single core, and not, for example, reading and writing the data to disk, using parallel or even other utilities like xargs or make can greatly reduce the runtime.
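As a sketch, the same pattern transfers directly to such tools (assuming they are installed):

# lossless PNG optimization, same structure as the jpegoptim pipeline
find . -type f -iname '*png' -print0 | parallel --null 'optipng {}'

# make parallelizes independent targets natively
make -j"$(nproc)"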

