Execute programs in parallel

Executing programs in parallel is an easy way to speed up operations that handle a lot of data independently.

In that situation I have used GNU Parallel, but there are at least two alternatives that are available on most systems (Windows included, thanks to WSL, Cygwin, …): xargs and parallel from the moreutils suite.

Different tools have different interfaces, although they are similar. Since I do not use those tools that often, I always end up searching for examples that cover my basic needs.

So I decided to list them here.

In most cases, I want to parallelize an operation over multiple files, so my command begins more or less with

find "$directory" -name "$pattern" -exec "$program_to_execute" {} +

and ends in the form

find "$directory" -name "$pattern" -print0 | "$exec_program_in_parallel"

For testing purposes, a simple printf 'a\0b\0c\0d' can replace find with the parameter -print0 in the examples.
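As a quick sanity check of that replacement, the following sketch feeds four NUL-separated items to xargs and runs echo once per item. The function name list_items is made up for this illustration, and -0 is the short form of --null.

```shell
# Hypothetical helper: emit four NUL-separated items, as find -print0 would,
# and let xargs call echo once per item.
list_items() {
  printf 'a\0b\0c\0d' | xargs -0 -n1 echo
}

list_items
```

Each item ends up on its own line, which makes it easy to see how the input was split.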

Note for xargs

I’m using /proc/cpuinfo for determining the number of jobs for xargs. This might not be optimal.
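One possible alternative, shown here only as a sketch, is to ask nproc (GNU coreutils) or getconf first and fall back to /proc/cpuinfo. The helper name cpu_count is an assumption of this example, not something the tools provide, and whether "CPUs plus one" is a good number of jobs still depends on the workload.

```shell
# Hypothetical helper: print the number of CPUs, trying several sources.
cpu_count() {
  if command -v nproc >/dev/null 2>&1; then
    # nproc reports the processing units available to the current process
    nproc
  elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
    getconf _NPROCESSORS_ONLN
  elif [ -r /proc/cpuinfo ]; then
    # anchoring the pattern avoids counting unrelated lines that merely
    # contain the word "processor"
    grep -c '^processor' /proc/cpuinfo
  else
    echo 1
  fi
}

# Same shape as the MAXPROCS variable used in the examples
MAXPROCS="--max-procs=$(( $(cpu_count) + 1 ))"
printf '%s\n' "$MAXPROCS"
```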

The parameter --no-run-if-empty is a GNU extension for xargs, and might not always be available.
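To illustrate what the parameter does, this sketch pipes an empty input into xargs. Without the flag, GNU xargs runs the command once even though there is nothing to process; with it, the command is skipped. The names run_on_empty and result are made up, and -r is the short form of --no-run-if-empty.

```shell
# Hypothetical helper: empty input, so with -r the echo must never run.
run_on_empty() {
  printf '' | xargs -0 -r echo "this line should not appear"
}

result=$(run_on_empty)
printf 'output with -r: [%s]\n' "$result"
```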

Execute program with one argument per instance at a specific position

# triggers warning if -n supplied, even with -n1
# xargs --no-run-if-empty is a gnu extension
MAXPROCS="--max-procs=$(($(grep -c processor /proc/cpuinfo)+1))"
find -print0 | xargs        --null --no-run-if-empty "$MAXPROCS"                           -i{} -- echo myprogram arg1 "{}" arg2 "{}"

find -print0 | xargs        --null --no-run-if-empty            -- parallel.moreutils -n1 -i      echo myprogram arg1 "{}" arg2 "{}" --

find -print0 | parallel.gnu --null --no-run-if-empty                                  -n1 -i{} -- echo myprogram arg1 "{}" arg2 "{}"
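The xargs variant can be tried without touching any files, as in the sketch below. It uses the short options (-0 for --null, -r for --no-run-if-empty, -P for --max-procs) and -I{}, the POSIX spelling of the placeholder option, instead of -i{}. The function name one_arg_per_call and the fixed job count of 2 are assumptions made for the demonstration.

```shell
# Hypothetical helper: three items, each inserted twice into the command line.
one_arg_per_call() {
  printf 'a\0b\0c' | xargs -0 -r -P 2 -I{} echo myprogram arg1 "{}" arg2 "{}"
}

# Jobs run in parallel, so the order of the lines is not guaranteed;
# sorting makes the output reproducible.
one_arg_per_call | sort
```

Since the placeholder option already hands one item to each invocation, -n is not needed here.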

Execute programs in parallel with multiple arguments at the end per instance

MAXPROCS="--max-procs=$(($(grep -c processor /proc/cpuinfo)+1))"
find -print0 | xargs        --null --no-run-if-empty "$MAXPROCS"                       -n2 -- echo myprogram arg1 arg2

find -print0 | xargs        --null --no-run-if-empty            -- parallel.moreutils -n2    echo myprogram arg1 arg2 --

find -print0 | parallel.gnu --null --no-run-if-empty                                  -n2 -- echo myprogram arg1 arg2

The parameter -n is not necessary for xargs and parallel.gnu, unless you want to cap the number of arguments per invocation. parallel.moreutils defaults to -n1 if -n is not provided, so you should always set the -n parameter for parallel.moreutils.
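The effect of -n2 with xargs can be checked with a small sketch like this one. The function name two_args_per_call and the job count of 2 are assumptions for the demonstration; the options are the short forms of the ones used in the examples.

```shell
# Hypothetical helper: four items, appended two at a time to the command line.
two_args_per_call() {
  printf 'a\0b\0c\0d' | xargs -0 -r -P 2 -n2 echo myprogram arg1 arg2
}

# Sort because parallel jobs may finish in any order.
two_args_per_call | sort
```

The first invocation receives a and b, the second c and d.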

Also note that in this case it is possible to avoid xargs:

find . -exec parallel.moreutils -n2 echo myprogram arg1 arg2 -- {} +

You should never use \; instead of +:

find . -exec parallel.moreutils -n2 echo myprogram arg1 arg2 -- {} \;

In this case, find invokes parallel.moreutils with only one argument and waits until it finishes, thus providing no chance to invoke myprogram in parallel.
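The difference between the two terminators can be made visible by letting the executed command print how many arguments it received. This is a sketch on a throwaway directory; the names demo_dir, with_plus and with_semicolon are made up for the example.

```shell
# Create a scratch directory with three files.
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b" "$demo_dir/c"

# With +, find collects the file names and starts the command once.
with_plus=$(find "$demo_dir" -type f -exec sh -c 'echo "$#"' sh {} +)

# With \;, find starts the command once per file, one after the other.
with_semicolon=$(find "$demo_dir" -type f -exec sh -c 'echo "$#"' sh {} \;)

printf 'with +:\n%s\n' "$with_plus"
printf 'with \\;:\n%s\n' "$with_semicolon"

rm -rf "$demo_dir"
```

With + the command reports three arguments in a single call; with \; it reports one argument, three times.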

Also note that

find . -exec parallel.moreutils -n1 -i echo myprogram arg1 arg2 "{}" arg3 "{}" -- {} +

does not work. When using find with + at the end, there must be only one pair of braces, and since neither find nor parallel.moreutils offers a way to use something other than braces as the placeholder for the arguments, there is no "easy way" to make it work. As a general guideline, I prefer pipelines over -exec.

Execute program with multiple arguments per instance at a specific position

xargs and parallel.moreutils do not support that feature directly. With sh -c it is possible to implement the desired functionality:

find -print0 | xargs        --null --no-run-if-empty "$MAXPROCS"                       -n2      -- sh -c 'echo myprogram arg1 arg2 "$@" arg3 "$@"' sh

find -print0 | xargs        --null --no-run-if-empty            -- parallel.moreutils -n2         sh -c 'echo myprogram arg1 arg2 "$@" arg3 "$@"' sh --

find -print0 | parallel.gnu --null --no-run-if-empty                                  -n2 -i{} --        echo myprogram arg1 arg2 "{}" arg3 "{}"
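The sh -c variant for xargs can be tried with the sketch below. The trailing sh becomes $0 of the inner shell, so that "$@" expands only to the items appended by xargs. The function name args_in_the_middle is an assumption of this example, and --max-procs is left out to keep the output order stable.

```shell
# Hypothetical helper: each pair of items is inserted twice, in the middle
# of the command line, through the inner shell's "$@".
args_in_the_middle() {
  printf 'a\0b\0c\0d' |
    xargs -0 -r -n2 sh -c 'echo myprogram arg1 arg2 "$@" arg3 "$@"' sh
}

args_in_the_middle
```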

My system has parallel, is it parallel.gnu or parallel.moreutils?

Both GNU parallel and parallel from moreutils might be available as parallel. The binary parallel.moreutils, at least on Debian-based systems, is available only if both the packages parallel (which provides GNU parallel) and moreutils are installed. I believe it would be better if the binary parallel.moreutils were unconditionally available when moreutils is installed, but that is not the case.

My current workaround is to use these two scripts from an interactive shell:

~/bin/parallel.gnu
#!/bin/sh

set -o errexit;
set -o nounset;

if dpkg -s parallel 2>/dev/null >/dev/null && [ -f /usr/bin/parallel ]; then :;
  exec /usr/bin/parallel "$@";
else
  printf 'GNU parallel does not seem to be installed\n' >&2
  exit 1
fi

~/bin/parallel.moreutils
#!/bin/sh

set -o errexit;
set -o nounset;

if [ -f /usr/bin/parallel.moreutils ]; then :;
  exec /usr/bin/parallel.moreutils "$@";
elif dpkg -s moreutils 2>/dev/null >/dev/null && [ -f /usr/bin/parallel ]; then :;
  exec /usr/bin/parallel "$@";
else
  printf 'parallel (from moreutils) does not seem to be installed\n' >&2
  exit 1
fi

The workaround is not that good, because I cannot simply look in the PATH (for example with which parallel.moreutils) to see if the program is available, unless I deploy those scripts only when the corresponding package is installed.
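A different angle, sketched here under assumptions, is to ask the binary itself instead of the package manager. The helper parallel_flavor is hypothetical, and it relies on the assumption that only GNU parallel answers --version with a banner containing "GNU parallel"; anything else found under that name is reported as "other" rather than positively identified as the moreutils version.

```shell
# Hypothetical helper: guess what the name "parallel" in PATH resolves to.
# Prints one of: none, gnu, other.
parallel_flavor() {
  if ! command -v parallel >/dev/null 2>&1; then
    echo none
  elif parallel --version 2>/dev/null </dev/null | grep -q 'GNU parallel'; then
    # assumption: this banner is specific to GNU parallel
    echo gnu
  else
    echo other
  fi
}

parallel_flavor
```

This does not solve the naming problem, but it works the same way on systems without dpkg.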

Other relevant differences

parallel.gnu provides tons of features, but for the "simple" use-cases most of them are not needed.

In particular, parallel.gnu has a --bar option for estimating how much of the work has already been done; neither xargs nor parallel.moreutils has such an option. The option --verbose/-t of xargs can help to keep track of how far along one is, but it is not really comparable.
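What -t offers can be seen with this small sketch: xargs writes each command line to standard error before running it. The variable name trace is made up for the example, and the exact formatting of the traced lines may differ between xargs versions.

```shell
# Capture only standard error (the trace); the output of echo is discarded.
trace=$(printf 'a\0b' | xargs -0 -n1 -t echo 2>&1 >/dev/null)

printf 'commands started by xargs:\n%s\n' "$trace"
```

This shows which commands were started, not how much work is left, which is why it is not a real substitute for a progress bar.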

For most tasks, xargs and parallel.moreutils might be good enough, and on a minimal system you might want to avoid installing another program (xargs is probably on most systems, and parallel from moreutils might already be available).

