This document describes how to convert tasks from GNU parallel to xapply.
I'll assume you know sh(1), and
have an understanding of the UNIX process model and have used GNU
parallel for a while. Or if you've used
dsh see the section below.
If you've been using find as a parallel
process tool, there is help for that too.
parallel and xapply
xapply
doesn't try to allow multiple sources of tasks to be mixed ad hoc.
parallel offers to mix iteration elements
from the command-line with stdin and
fixed strings, while xapply acts more like a
filter: items come from stdin
or from fixed words, not both.
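For example, a minimal sketch (process and items.list are made-up names): build the
mixed stream in the shell, then hand it to xapply as plain stdin:
$ { echo fixed-word; cat items.list; } | xapply -f 'process %1' -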
Every replacement markup in
Here is an example of
Or you are looking for
In my tool chain this is best done with
The list of hosts contains more than just the names of the hosts.
In fact any attribute related to the host might
be listed in the file. See the
Given that file we can process the three listed hosts. I'll put the
script I want to run in a file named report/cmd.
The cache recipe file contains the parts of the process that are marked-up
for every execution. This reduces keyboard errors and makes the process
more repeatable. The recipe is run through m4 for each target host.
By referencing the name of the cache directory on
the
The file
Note that every file may have a revision control comment in it;
that is a very good idea. Also note that
We encapsulate each operation in a directory so we may reuse them in
different combinations (and orders) to provide derived services. It is
possible to have any directory recursively call another, as well.
To run that for the same hosts:
The little detail is that the
The real loss here is in the reuse we got from the combination for
cache directories. In the
I usually use
Using the cache directory above would fetch the report from the
target host, then send it back as the file
By limiting the changes to the command-line we allow rapid development of
common tasks, and quick integration into existing automation.
By using a configuration file format with arbitrary attribute macros
that these 5 tools all read natively (mmsrc, msrc, hxmd, efmd, and distrib),
we can share the host data between many tasks.
And everything should always be revision controlled.
Normally
For example we might
It is much more portable to use
In addition to the permutations done by the output order from the
parallel tasks,
Examples from the
For all of these you would actually embed the command in a recipe file:
either a
The other option would be to use
The minimal required recipe (to send no files) would be
To save even more typing add an
Back to parallel. Xapply takes the filter approach: if you want
a stream of words, then build that stream with a program that does
just that. Xapply reads stdin
in two formats: NUL (aka.
\000) terminated and NL
(aka. \n) terminated. We don't need a way to
mix files, fixed arguments, and stdin built-in
to the processor (we have perl for that).
Actually I've almost never needed more than sed
and oue to
clean up any iteration list.
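A sketch of that style (convert here is only a stand-in command): do the suffix
surgery in the filter, not in the markup:
$ ls *.jpg | sed -e 's/\.jpg$//' | xapply -f 'convert %1.jpg %1.png' -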
Markup differences
The secondary difference is that parallel's markup
tries to read your intent more than xapply's.
For example the markup
{.} under parallel treats
removing the dot-suffix from "sub.dir/foo.jpg" (.jpg is removed)
differently than "sub.dir/bar" (nothing is removed), while
xapply allows either
%[1/$.-$] to remove the suffix on the basename,
or %[1.-$] to remove a suffix that might
include a slash (/).
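As a sketch (echo is only there to show the expansion), the two spellings should
expand like this:
$ echo sub.dir/foo.jpg | xapply -f 'echo %[1/$.-$] %[1.-$]' -
foo sub.dir/foo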
I believe parallel's markup is more likely to
bleed complexity into other programs, and is less likely to solve any
real-world problem. (When you are not sure there is a dot-extender on
the filename, then fix the invariant so you are sure.)
Every replacement markup in parallel is a unique case;
each has a unique command-line option to change the spelling, and unique rules.
Xapply uses a more complex markup, but every
expansion supported by the dicer and mixer is general. That is to say
that the same dicer markup is used in mk,
oue, and other tools.
Simple command-line conversions
-0 or --null
-z. This is always a good idea if you
had to build an iterator script to output a mix of fixed strings,
files, and stdin.
-a input-file
-f and list all
input-files as positional parameters.
There is no interaction between stdin and
-f unless you specify a dash in the
positional parameters, in which case the default for -i
is changed to /dev/null.
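For example, a sketch of the conversion (list.txt is a made-up file name):
$ parallel -a list.txt gzip -9 {}
$ xapply -f 'gzip -9 %1' list.txt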
--basefile file
Use msrc to send files to a remote host (and
recover any output files).
--bg
That's what nohup and daemon
are for.
--cleanup
Use msrc's -m option,
which may trigger any recipe cleanup required.
--colsep regexp
--delimiter delim
--E eof-str
Build an iterator filter that outputs a NUL terminated list, then use
-count to
set the number of columns. If you need a delim
do it in that filter.
--dry-run
-n.
--eta
--group
--ungroup
Use -m, and wrap your xapply
in an explicit xclate, if you want even more
features (like exit code processing). Or use the hxmd
program built on top of xapply.
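A sketch of the common case (fetch-logs and the hosts file are made-up names),
keeping each task's output contiguous:
$ xapply -m -P4 -f 'fetch-logs %1' hosts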
--halt-on-error
Use the USR1 signal and %p
markup to find the correct pid to signal xapply to
stop adding more tasks. Use hxmd's retry
logic to process any recovery logic (or code you own with
xclate's -N option).
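A sketch of that idea (do-step is a made-up command): a failing task signals the
parent xapply so no new tasks are queued:
$ xapply -P4 -f 'do-step %1 || kill -USR1 %p' tasks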
--joblog
There is no equivalent in xapply, but hxmd's
REDO logic could provide some of the same data.
The date and run-time information could be provided by
changing -c, but this would really need to
be done in a script that called hxmd because
the command-line would be quite long and complex.
-k
The xclate manager doesn't have a way to
force the order. If you want to collate the output, then write each
output stream to a file named as a function of
%u. For example, one may build
a temporary directory (/tmp/ksb) to
stash the output from several tasks:
$ xapply -P8 -f 'long-shot %1 >/tmp/ksb/%u' moons
Now a second pass with the same list will match keys to output files:
$ xapply -f 'summary %W1 /tmp/ksb/%u' moons
-L max-lines
Use -f and -count and -p pad
to get the same effect.
--load max-load
There is no option to make xapply do this. A
task manager like hxmd could sample the load
before injecting tasks into an xapply queue.
In my experience the system load average alone is
not enough information to provide a task manager with sufficient feed-back.
It might have to sample any combination of swap space, available physical
memory, disk input/output utilization, and network throughput.
--xargs
Use fmt or adjust
to group arguments.
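For example (a sketch), pack several names into each task with fmt; the unquoted
%1 expands to the whole packed line:
$ ls *.log | fmt | xapply -f 'gzip -9 %1' -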
--onall
hxmd.
--files
See -k above. Another
way to force stdout to a file is to prefix
the command with "exec >/tmp/ksb/%u;" which
doesn't limit the number of shell commands which might be listed in
cmd.
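A sketch combining that prefix with the /tmp/ksb example above:
$ xapply -P8 -f 'exec >/tmp/ksb/%u; long-shot %1' moons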
--pipe
--block-size size
Build NUL terminated input blocks, and
use -z. Any processing you need to do to
the input stream is better encoded in your own filter.
--progress
There is no way to have xapply show you a
progress status, since it doesn't know the total number of tasks
it might run. I have executed xapply processes
that have run for weeks, reading their input from clients connecting to a FIFO.
--number-of-cpus
ptbw.
--interactive
Wrap your command in a prompt of your own. For example, prefix the cmd with:
echo Run %W*? 1>&2;read a</dev/tty;expr "_$a" : "_[yY].*" >/dev/null || exit 0
--quote
Xapply supports
3 levels of quoting:
These prefixes allow some parameters to be quoted, while others
are not. For example:
%q
Quotes the characters that are special inside (sh
or ksh) double quotes. That is any of these
four characters: \, ",
$, or `.
%Q
Quotes every character special to the shell (but not IFS):
`, $, \, ", ~, *, ?, [, |, &, ;, (, ), #, =, ', <, or >.
%W
Quotes everything %Q would, plus the
standard IFS list (space, tab, and newline).
xapply -2 -fp red '%1 %Q2' brush colors
--no-run-if-empty
Filter empty input lines out with grep or
sed.
--retries
Xapply doesn't know a computer name from
any other parameter; you are looking for hxmd.
--return filename
There is no equivalent in xapply. If you want to structure a
task to process data on a remote host and send return files back, I
would use either msrc -m
or hxmd's cache directory logic (described above).
--semaphore
-t. The
ptbw token broker acts as a semaphore
handler in most cases.
--sshloginfile
Use hxmd's
-C option, which lets you specify a
whole lot more than a few fixed parameters.
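A sketch (hosts.cf is a made-up configuration in hxmd's %HOST format):
$ hxmd -Chosts.cf -P4 'ssh -n HOST uptime'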
--tty
-i/dev/tty.
--timeout sec
There is no equivalent in xapply, in any case. In a pinch you could use
mk's resource limits, but that's a little over-kill. The cmd would need
mk markup to do that:
#!/bin/sh
# Use "mk -mLimit" to run with a 20 second wall-clock time limit:
# $Limit,clock=20: %f -t -P2
# ...
--transfer
--trc
msrc.
--trim
sed.
--xapply
--shebang
See mk and hxmd.
Since hxmd takes comments in the list of
targets we embed a marked line (see mk's manual page).
I do not think the pun of a configuration file as a script is a
great idea, but local policy allows other things I don't like.
#!/usr/bin/env -S mk -mRun
# $Run: ${hxmd:-hxmd} -C%f some-processor
list-of-hosts and attributes
Remember to chmod it
+x.
Or use xapply's -F. You can use
xapply as an interpreter with something like:
#!/usr/bin/env -S xapply -P2 -F
gzip -9 %1
The push, remote execute, pull model
The most useful meme encoded in parallel is
the idea that one might visit a task on a list of hosts with some
data file, then return the results back to the driving host.
While that's not hard to explicitly spell under xapply,
it is surely easier to cast with parallel.
For that, use msrc
(or plain hxmd if you'd rather walk).
We break the task down into the orthonormal parts: the list of
target hosts (site.cf), the recipe
(report/Cache.m4) and the remote
script to run (report/cmd). The last two,
taken together, form an hxmd cache directory.
site.cf:
# $Id:...
%HOST COLOR CPUs
sulaco black 2
ripley grey 4
lv426 cream 1
report/cmd:
# $Id:...
date
uptime
exit 0
hxmd runs the recipe through m4
for each host in site.cf so that the attributes
of each host can tune the actions of the recipe. Then the recipe is
used as a make recipe file to build the required
data for each target host, which is also marked-up in
m4 so that it can be processed for each
target host to tweak the recipe for attribute values (like
CPUs).
Here is a very simple report/Cache.m4:
`# $Id:...
report: FRC
	ssh -x 'ifdef(`REMOTE_LOGIN',`REMOTE_LOGIN@')HOST` /bin/sh <cmd
# Shell completion might put a trailing slash on our directory name -- ksb
'HOST`: report
FRC:
'dnl
The file cmd in the report
recipe allows us to push commands to the target host without
quoting them from m4, make, and
the shell. We'll use the cmd script from
above.
By referencing the name of the cache directory on the
hxmd command-line, we force
the m4 processing of
the Cache.m4 recipe in that directory and
the make update of the name of
the directory (as the target).
The update rule for the HOST macro is only triggered
when the directory name is suffixed with a slash, due to the rules
hxmd uses to create the update target.
$ hxmd -P10 -Csite.cf cat report
Which outputs:
sulaco:
Tue May 1 16:21:06 MDT 2012
2:34PM up 55 days, 23:54, ...
ripley:
Tue May 1 16:21:06 MDT 2012
2:34PM up 133 days, 10:00, ...
lv426:
Tue May 1 16:21:07 MDT 2012
2:34PM up 144 days, 20:32, ...
Change "-P10" to the options
"-dCX -P1" to see how it works.
The cmd script in the report directory
could take any actions required on the remote host (as long as it doesn't
need to read stdin).
This model scales out to thousands of hosts with attribute tunes for as many
cases as needed to meet your needs.
The REMOTE_LOGIN attribute
may be defined to map the local login to any remote login, even on a
per-host basis.
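For example (a sketch; the operator login is made up), the attribute could be just
another column in site.cf:
%HOST COLOR CPUs REMOTE_LOGIN
sulaco black 2 operator
ripley grey 4 operator
lv426 cream 1 operator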
Using msrc to repeat that task
To do that same task with msrc using a punned control
recipe we need a make recipe to offer the required
macros to msrc, with the report script encoded
as an update rule, and nothing else:
# $Id:...
INTO=/tmp/ksb.1
IGNORE=+++
report: FRC
	date
	uptime
FRC:
$ msrc -P10 -Csite.cf make report
msrc data recovery
only goes to stdout: with
hxmd the data is actually cached in
a local file, which makes it easier to use for
additional processing. Under hxmd we use
cat to display the "report", while under
msrc we use make to
run the display on the remote host. That is an important detail (the display
runs on the remote host, not on the local host).
Under the msrc tactic we
code the cmd in the recipe, and must
use make markup to quote dollar sign
($) and avoid command failures that
would stop the process.
I usually use msrc for software builds,
hxmd for process control scripts, and
xapply for ad hoc status polling.
It is also possible to MAP the whole cache directory
report.
This is most useful when the process includes an update to the
content as it is processed (in at least one direction). This would
be triggered by including the name of the directory in the
MAP macro list.
See the msrc documentation; MAPed files are used
much more than MAPed cache directories.
The common wins with hxmd and msrc
With these tools you can specify a subset of a whole population with
some host selection options (which work for both tools exactly the
same way). For example you might target a single test host:
$ msrc -G prometheus -Csite.cf make report
(I replaced "-P10" with an explicit host
selection via -G.)
By using a configuration file format that these tools read natively (mmsrc,
msrc, hxmd,
efmd, and distrib)
and others can parse by proxy (via efmd), we
can share the host data between interactive tasks, across political groups,
and use them in diverse automation applications.
Conversions from find's execution options to xapply
Find is a great utility for producing a source-stream
for a parallel task. Some non-standard additions have been made to
find to reduce the number of check processes the
-exec primitive forks to
search the filesystem. I think there are better ways to improve
the overall through-put of a find pipeline.
find's -exec
should be parallelized with:
$ find pathname ... expression -print0 | xapply -mfzP 'template' -
This pipeline allows find to traverse the
filesystem without any logic to manage forked
processes. We let find focus on the filesystem,
while xapply manages the processes. Tuning
xapply's parallel factor (under
-P) adds more parallel processes; adding
an xclate wrapper or ptbw
governor, or a status code stream, is now possible, where it is not
with find `managing' the execution.
Find's -execdir
This is a very powerful meme: by running a process in the context of
a different directory we may leverage another invariant to increase
our parallelism. Find imposes a limit that
we'll refactor here: the name of the file we locate must be
the program we want to run.
By using xapply we remove that restriction.
For example, we might find a make
recipe file (-name '[mM]akefile') or a file
with a locally meaningful extender (viz. ".lme"), neither of which
need be the program we want to execute. Using the dicer we can
select the directory, then run the processor of our choice:
$ find pathname ... -name '[mM]akefile' -print0 |
xapply -mfzP8 'cd %[1/-$] && make -f %[1/$] found' -
Find's + hack is really a binpack
The OpenBSD hack to find (see the manual page) allows
multiple arguments to be joined into a single execution of the
target utility, but it is really not
portable across versions of find.
$ find pathname ... -name '*.lme' -exec bundle-process {} +
Use -print0 to
build a path list that is NUL
(\000) terminated.
Then use xapply -z to process
the list.
$ find pathname ... -name '*.lme' -print0 |
xapply -mfzP13 -8 'bundle-process %W1 %W2 %W3 %W4 %W5 %W6 %W7 %W8' \
	- - - - - - - -
If you want to group the maximum number of elements for
each command (like the OpenBSD + feature does),
use the binpack filter under
the -zN options to group the files, then
feed the list to xapply:
$ find pathname ... -print0 |
binpack -zN bundle-process |
xapply -mfP10 '' -
If you have a lot of filenames with special characters in them this
may exceed kern.argmax; tune the limit down
with -b (divide by 2 always works). Since most
filenames do not have shell meta-characters in them, this almost
never happens. (Or tune -w down to make less optimal
bins.)
binpack permutes the order of
the files as it packs them into bins. If you require a (more) stable order,
just use a simple perl filter to
limit the command length. Here is an example:
#!/usr/bin/env perl
use Getopt::Std;
use strict;
# Example linear packer takes -b bytes and -z only, add others as needed --ksb
my(%opts,$q,$l);
my($overhead) = 8;	# 8 >= sizeof(char *)
getopts("b:z", \%opts);
$/ = "\000" if ($opts{'z'});
my($bsize) = $opts{'b'};
if (!defined($bsize)) {
	$bsize = `sysctl -n kern.argmax 2>/dev/null` || 128*1024;
	$bsize =~ s/[^0-9]*([0-9]+)\s*$/$1/;
	# bias bsize for environment space, ptr+"name=value\000" * envs
	map { $bsize -= 2+length($_)+length($ENV{$_})+$overhead } keys(%ENV);
}
my($cur) = 0;
while ($q = <>) {
	chomp($q);
	# escape shell meta-characters in each name
	$q =~ s/([\"\'\\\#\`\$\&;|*()?><\{~=[])/\\$1/g;
	$l = length($q)+$overhead;
	if (0 == $cur) {
		print "$q";
		$cur = $l;
	} elsif ($cur+$l+1 < $bsize) {
		print " $q";
		$cur += $l+1;
	} else {
		print "\n$q";
		$cur = $l;
	}
}
if ($cur > 0) {
	print "\n";
}
exit 0;
The difference between dsh and hxmd
The dsh application resembles
hxmd, but worries more about
the source host than the clients.
Emphasis is on local resource utilization, over client configuration, and
less on automation of client-side processes. Most trivial cases might
be implemented as straight xapply commands against
a file which only contains a list of hostnames.
Dsh's configuration structure breaks hosts into
groups (posses in hxmd speak) by listing the
members of a group in a file named for the group.
Hxmd allows arbitrary posse relationships,
via attribute macros and guards.
The attribute macros also provide configuration options to scripts, recipes,
and other files marked-up with m4.
Conversion of dsh options to hxmd
The dsh options are largely geared towards
interactive use to drive an interactive process, while the
hxmd options are more geared for
completely automated tasks.
-v show execution process
Under hxmd you may use -v,
-dC, and -dX to
show different aspects of the execution process.
--quiet
hxmd is very quiet.
--machine machinenames
-G followed by
the exact spelling of the hostname as it appears in the configuration file.
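For example (a sketch reusing site.cf from above):
$ hxmd -Csite.cf -G ripley -P1 'ssh -n HOST uptime'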
--all
hxmd.
--group groupname
Use an attribute macro like SERVICE to form a posse,
see the hxmd documentation.
--file machinefile
-C, -X, or
-Z depending on what you really want.
--remoteshell shellname
--remoteshellopt rshoption
Put the remote shell you want in the control
specification, or use the HX_CMD attribute
macro to set the default action.
-h
--wait-shell
-P1 for sequential commands. Set a
higher value for parallel access.
Always set $PARALLEL to a default that
makes sense in any script or recipe file.
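A sketch of that default in a script:
PARALLEL=${PARALLEL:-10}
hxmd -Csite.cf -P$PARALLEL 'ssh -n HOST uptime'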
--concurrent-shell
There is no equivalent in hxmd alone.
Usually we start a screen or
tmux instance, then drive that with
hxmd or xapply.
--show-machine-names
Use xclate options, like:
$ xclate -ms hxmd -Csite.cf -F2 -e XCLATE=-H%2 "%0ssh -n HOST uptime" "HOST"
But that only outputs the hostname as the first line of hosts that
output something, which is actually more useful.
--hide-machine-names
--duplicate-input
--bufsize buffer-size
This is what tmux is used for. But sending a shell
script or make recipe to the host is a much better
idea. Fingers on keyboards cause mistakes. Sending mistakes to many
hosts in parallel is a recipe for trouble.
-V
--num-topology N
(see msrc).
--forklimit fork-limit
There is nothing smarter in hxmd
than a hard limit (which is set with -P).
Examples from the dsh web site:
I'll assume $PARALLEL is set to the parallel
factor you want for these examples.
(the classic example: uname)
$ dsh -a -c -- uname -a
$ hxmd -P$PARALLEL 'ssh -n HOST uname -a'
$ dsh -g children -c -- uname -a
$ hxmd -C children.cf -P$PARALLEL 'ssh -n HOST uname -a'
For a netgroup, use
ypmatch to get the list of hosts;
we can feed them in as a configuration file:
$ dsh -g @nisgroup -- uname -a
$ ypmatch ... |hxmd -C - -P$PARALLEL 'ssh -n HOST uname -a'
For all of these you would actually embed the command in a recipe file:
either a mk, make,
op or other recipe processor, or in
a shell script, function or alias.
The other option would be to use msrc with a
simple Msrc.mk, which makes the commands look
more like
$ dsh -g children -c -- uname -a
$ msrc -Cchildren.cf -P$PARALLEL uname -a
Msrc.mk:
# $Id....
INTO=.
SEND=.
MAP=.
IGNORE=+++
The nifty thing about that command is that the
directory context supplies the default
-C configuration and other parameters
(via Msrc.hxmd). This saves a lot of typing
for interactive use, and allows scripts to use the same spells over and
over without recoding each service every time it is needed.
To save even more typing add an Msrc.hxmd
with the default -C and -P
options:
# $Id....
-Cchildren.cf
-P10
Then the command becomes just:
$ msrc uname -a
(Use the -z command-line option to
defeat the inclusion of options from that file.)
Summary
Any of these tools are better than typing lots of commands by hand.
Pick the ones you like the best and use them; it might save your
hands and wrists.
-- ksb (KS Braunsdorf) Sep 2013
Return to the xapply HTML document,
or use your browser's back button.
$Id: parallel.html,v 3.21 2013/09/04 13:49:27 ksb Exp $ by ksb.