Execute programs in parallel
- Note for xargs
- Execute program with one argument per instance at a specific position
- Execute programs in parallel with multiple arguments at the end per instance
- multiple args per instance at a specific position
- My system has parallel, is it parallel.gnu or parallel.moreutils?
- Other relevant differences
Executing programs in parallel is an easy way to speed up operations that handle a lot of data independently.
In that situation I have used GNU Parallel, but there are at least two alternatives that are available on most systems (Windows included, thanks to WSL, Cygwin, …): xargs, and parallel from the moreutils suite.
These tools have different, although similar, interfaces. Since I do not use them that often, I always end up searching for examples that cover my basic needs.
So I decided to list them here.
In most cases, I want to parallelize an operation over multiple files, so my command begins more or less with
find "$directory" -name "$pattern" -exec "$program_to_execute" {} +
and ends in the form
find "$directory" -name "$pattern" -print0 | "$exec_program_in_parallel"
For testing purposes, a simple printf 'a\0b\0c\0d' can be used to replace find with the parameter -print0 in the examples.
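A minimal sketch of that transformation, assuming GNU find and GNU xargs. The scratch directory, the file names, and the use of printf as a stand-in for the real program are made up for illustration.

```shell
# Scratch directory with three hypothetical input files.
directory=$(mktemp -d)
touch "$directory/a.txt" "$directory/b.txt" "$directory/c.txt"
pattern='*.txt'

# Starting point: find itself runs the program once, with all matches appended.
sequential=$(find "$directory" -name "$pattern" -exec printf '%s\n' {} + | sort)

# Pipeline form: find only lists the matches (NUL-separated) and
# xargs starts the program, here with up to two instances at a time.
piped=$(find "$directory" -name "$pattern" -print0 |
  xargs --null --no-run-if-empty --max-procs=2 -n1 -- printf '%s\n' | sort)

rm -r "$directory"
```

Both variables end up holding the same three paths; only the way the program is started differs.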
Note for xargs
I’m using /proc/cpuinfo for determining the number of jobs for xargs. This might not be optimal.
The parameter --no-run-if-empty is a GNU extension for xargs, and might not always be available.
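As a possible alternative to parsing /proc/cpuinfo directly, the job count can be derived from nproc or getconf when they are available. This is only a sketch: detect_cpus is a made-up helper name, and the fallback order is an assumption, not a recommendation from any of the tools.

```shell
# Hypothetical helper: print the number of online CPUs, falling back to 1.
detect_cpus() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif getconf _NPROCESSORS_ONLN >/dev/null 2>&1; then
    getconf _NPROCESSORS_ONLN
  elif [ -r /proc/cpuinfo ]; then
    # Same idea as in the examples, anchored to the start of the line.
    grep -c '^processor' /proc/cpuinfo
  else
    echo 1
  fi
}

# Same "+1" convention as the MAXPROCS variable used in the examples.
MAXPROCS="--max-procs=$(( $(detect_cpus) + 1 ))"
```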
Execute program with one argument per instance at a specific position
# xargs: combining -n with -i triggers a warning, even with -n1, so -n is omitted for xargs
# xargs: --no-run-if-empty is a GNU extension
MAXPROCS="--max-procs=$(($(grep -c processor /proc/cpuinfo)+1))"
find -print0 | xargs --null --no-run-if-empty "$MAXPROCS" -i{} -- echo myprogram arg1 "{}" arg2 "{}"
find -print0 | xargs --null --no-run-if-empty -- parallel.moreutils -n1 -i echo myprogram arg1 "{}" arg2 "{}" --
find -print0 | parallel.gnu --null --no-run-if-empty -n1 -i{} -- echo myprogram arg1 "{}" arg2 "{}"
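To see what the xargs variant does without touching any files, the printf test input can be used. The sketch assumes GNU xargs and uses -I{}, the non-deprecated spelling of -i{}. The output is sorted because parallel instances may finish in any order.

```shell
# Four NUL-separated test items instead of find output.
result=$(printf 'a\0b\0c\0d' |
  xargs --null --no-run-if-empty --max-procs=2 -I{} -- echo myprogram arg1 "{}" arg2 "{}" |
  sort)
printf '%s\n' "$result"
# Prints:
# myprogram arg1 a arg2 a
# myprogram arg1 b arg2 b
# myprogram arg1 c arg2 c
# myprogram arg1 d arg2 d
```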
Execute programs in parallel with multiple arguments at the end per instance
MAXPROCS="--max-procs=$(($(grep -c processor /proc/cpuinfo)+1))"
find -print0 | xargs --null --no-run-if-empty "$MAXPROCS" -n2 -- echo myprogram arg1 arg2
find -print0 | xargs --null --no-run-if-empty -- parallel.moreutils -n2 echo myprogram arg1 arg2 --
find -print0 | parallel.gnu --null --no-run-if-empty -n2 -- echo myprogram arg1 arg2
The parameter -n is not necessary for xargs and parallel.gnu, unless you want to limit the number of arguments passed to each instance. parallel.moreutils defaults to -n1 if -n is not provided, thus you should always set the -n parameter for parallel.moreutils.
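The effect of -n on xargs can be checked with the printf test input. This sketch assumes GNU xargs and leaves out --max-procs so that the output order is deterministic.

```shell
# With -n2 every instance receives at most two of the four items.
grouped=$(printf 'a\0b\0c\0d' |
  xargs --null --no-run-if-empty -n2 -- echo myprogram arg1 arg2)
printf '%s\n' "$grouped"
# Prints:
# myprogram arg1 arg2 a b
# myprogram arg1 arg2 c d

# Without -n xargs passes as many items as fit on one command line.
unlimited=$(printf 'a\0b\0c\0d' |
  xargs --null --no-run-if-empty -- echo myprogram arg1 arg2)
printf '%s\n' "$unlimited"
# Prints:
# myprogram arg1 arg2 a b c d
```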
Also note that in this case it is possible to avoid xargs:
find . -exec parallel.moreutils -n2 echo myprogram arg1 arg2 -- {} +
You should never use \; instead of +:
find . -exec parallel.moreutils -n2 echo myprogram arg1 arg2 -- {} \;
In this case, find invokes parallel.moreutils with only one argument and waits until it finishes, thus providing no chance to invoke myprogram in parallel.
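The difference between + and \; can be made visible by counting how many arguments each invocation receives. In this sketch a small sh -c script stands in for parallel.moreutils, and the scratch directory and file names are made up.

```shell
directory=$(mktemp -d)
touch "$directory/a" "$directory/b" "$directory/c"

# With +, find collects the matches and starts sh once; "$#" is the argument count.
with_plus=$(find "$directory" -type f -exec sh -c 'echo "$#"' sh {} +)

# With \;, find starts sh once per match and waits for it each time.
with_semicolon=$(find "$directory" -type f -exec sh -c 'echo "$#"' sh {} \;)

rm -r "$directory"

printf 'with +:  %s\n' "$with_plus"                      # a single line: 3
printf 'with \\;: %s\n' "$(echo $with_semicolon)"         # three invocations: 1 1 1
```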
Also note that
find . -exec parallel.moreutils -n1 -i echo myprogram arg1 arg2 "{}" arg3 "{}" -- {} +
does not work. When find is used with + at the end, there must be exactly one pair of braces, and since neither find nor parallel.moreutils offers a placeholder other than braces, there is no "easy way" to make it work. As a general guideline, I prefer pipelines to -exec.
multiple args per instance at a specific position
xargs and parallel.moreutils do not support that feature directly. With sh -c it is possible to implement the desired functionality:
find -print0 | xargs --null --no-run-if-empty "$MAXPROCS" -n2 -- sh -c 'echo myprogram arg1 arg2 "$@" arg3 "$@"' sh
find -print0 | xargs --null --no-run-if-empty -- parallel.moreutils -n2 sh -c 'echo myprogram arg1 arg2 "$@" arg3 "$@"' sh --
find -print0 | parallel.gnu --null --no-run-if-empty -n2 -i{} -- echo myprogram arg1 arg2 "{}" arg3 "{}"
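The sh -c trick can be tried with the printf test input. This sketch assumes GNU xargs; the trailing sh only fills $0, so that "$@" expands to exactly the two items handed over by xargs.

```shell
result=$(printf 'a\0b\0c\0d' |
  xargs --null --no-run-if-empty -n2 -- sh -c 'echo myprogram arg1 arg2 "$@" arg3 "$@"' sh)
printf '%s\n' "$result"
# Prints:
# myprogram arg1 arg2 a b arg3 a b
# myprogram arg1 arg2 c d arg3 c d
```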
My system has parallel, is it parallel.gnu or parallel.moreutils?
Both GNU parallel and parallel from moreutils might be available as parallel. The binary parallel.moreutils, at least on Debian-based systems, is available only if both packages, parallel (which provides GNU parallel) and moreutils, are installed. I believe it would be better if the binary parallel.moreutils were unconditionally available when moreutils is installed, but that is not the case.
My current workaround is to use these two scripts from an interactive shell:
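#!/bin/sh
# Wrapper that runs GNU parallel, or fails if it is not installed.
set -o errexit
set -o nounset
# The Debian package "parallel" provides GNU parallel as /usr/bin/parallel.
if dpkg -s parallel >/dev/null 2>&1 && [ -f /usr/bin/parallel ]; then
  exec /usr/bin/parallel "$@"
else
  printf 'GNU parallel does not seem to be installed\n' >&2
  exit 1
fi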
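#!/bin/sh
# Wrapper that runs parallel from moreutils, or fails if it is not installed.
set -o errexit
set -o nounset
if [ -f /usr/bin/parallel.moreutils ]; then
  # Both packages are installed: the moreutils binary carries the suffix.
  exec /usr/bin/parallel.moreutils "$@"
elif dpkg -s moreutils >/dev/null 2>&1 && [ -f /usr/bin/parallel ]; then
  # Only moreutils is installed: its binary is the plain /usr/bin/parallel.
  exec /usr/bin/parallel "$@"
else
  printf 'parallel (from moreutils) does not seem to be installed\n' >&2
  exit 1
fi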
The workaround is not that good, because I cannot simply look in the PATH (for example with which parallel.moreutils) to see if the program is available, unless I deploy those scripts only when the corresponding package is installed.
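Another possibility, independent of the package manager, is to ask the binary itself. The sketch below assumes that GNU parallel prints a line containing "GNU parallel" when called with --version and that other implementations do not. parallel_flavor is a made-up helper name, and the answer "other" only means "not recognized as GNU parallel".

```shell
# Hypothetical helper: classify a "parallel" command by its --version output.
# Prints "gnu", "other", or "none" (command not found).
parallel_flavor() {
  cmd=${1:-parallel}
  if ! command -v "$cmd" >/dev/null 2>&1; then
    echo none
  elif "$cmd" --version 2>/dev/null </dev/null | grep -q 'GNU parallel'; then
    echo gnu
  else
    echo other
  fi
}

printf 'parallel on this system: %s\n' "$(parallel_flavor parallel)"
```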
Other relevant differences
parallel.gnu provides tons of features, but for the "simple" use-cases most of them are not needed.
In particular, parallel.gnu has a --bar option for estimating how much of the work has already been done; neither xargs nor parallel.moreutils has such an option. The option --verbose/-t of xargs can help to keep track of the progress, but it is not really comparable.
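A small sketch of what --verbose provides, assuming GNU xargs: every command line is written to standard error before it is executed. Here the regular output of the commands is discarded so that only this trace is captured.

```shell
# 2>&1 sends the trace into the command substitution,
# >/dev/null then drops the output of the echo commands themselves.
trace=$(printf 'a\0b\0c\0d' |
  xargs --null --no-run-if-empty --verbose -n1 -- echo myprogram 2>&1 >/dev/null)
printf '%s\n' "$trace"
```

The trace contains one line per started instance, so it shows which item is currently being processed, but not how many are left.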
For most tasks, xargs and parallel.moreutils might be good enough, and on a minimal system you might want to avoid installing another program (xargs is probably on most systems, and parallel from moreutils might already be available).