To understand this document

This document assumes you are quite familiar with the standard UNIX™ shell, sh(1), and have an understanding of the UNIX filter model (pipes, input/output redirection, and some specific filters).

It also assumes that you can read the manual page for any other example command. It would help a little if you've used xapply(1) or some other percent-markup filter, but it's OK if you've not used any before.

What is oue?

The manual page describes oue as a filter that removes duplicate elements from a stream much as "sort -u" would. The main difference between a sort-unique operation and the unique operation this program provides is the lack of a delay before the next filter gets the first element.

The sort operation must wait to read every line of input before it may output the first line of unique data (because the last line might be the first in the sorted order). The unique filter provided by oue is different in that the first element is output as soon as it is discovered to be unique. That is quite a turbo-charger when the later processing is limited by factors other than the CPU; for example, a network command managed by xapply.

Perl programmers have used the "touch a hash element" code to find unique items for a long time. In fact, oue was a perl script that looked a lot like:

perl -e 'while (<>) { next if $s{$_}; $s{$_} = 1; print $_;}'
before I needed more dicer action.
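The same "print it the first time we see it" idiom is commonly written in awk as well. As a rough stand-in for the basic filter (with none of oue's dicer, state, or reporting features, of course):

```shell
# print each line only the first time it appears, in input order
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# a
# b
# c
```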

In the sections below, I'll try to explain why oue is nifty enough to put in your spell book and how to use it to solve real problems for the real world.

The simplest cases

Oue without any options is a powerful filter. Say you have a way to generate a list of hosts that you want to visit to run a "clean-up" command. This generator does generate all the hosts you need to clean up, but sometimes it generates the same hostname more than once. This may not be a big problem. If it doesn't happen often, you might just be able to ignore the duplicate clean-up jobs.

On the other hand, the clean-up code might take a lot of resources, be destructive when run multiple times, or the generator might produce many repetitions of a few hosts. In any of those cases, building a pipeline using oue reduces the list to the unique hosts:

generator | oue | xapply -f "ssh %1 -n clean-up..." -

This doesn't work if the hostnames are spelled differently; for example, as "foo.example.com" and "foo" and "FOO.example.com", since oue doesn't know about hostnames or folding case. When the case of an element is an issue, use tr to fold the case, as in:

generator | tr '[A-Z]' '[a-z]' | oue | ...

The command-line interface

Oue reads a stream of data from files or stdin and writes results to stdout; so, it stands to reason that the command line options would set parameters on how the filter processes the streams.

In the sections below, I'll introduce each option as I describe the use-case for each, but we'll look at the whole here:

oue [-cdlNsvz] [-span] [-a c] [-b length] [-D state] [-e every] [-f first] [-k key] [-p pad] [-r memory] [-R report] [files]

oue -I prev [-i] [-cdlNsvz] [-span] [-a c] [-b length] [-B replay] [-D state] [-e every] [-f first] [-k key] [-p pad] [-r memory] [-R report] [-x extract] [files]

oue -H

oue -h

oue -V

The first set is the filter set when oue is processing data. The -h usage is the common "ksb help" output for quick on-line reference. The -V usage outputs a version string that is used to track out-dated copies of the filter. The -H offers on-line help for the built-in percent markup.

In the implementation, -I is not actually required in order to specify -B, -x, or -i -- but they don't do anything without a prev to operate against. I removed the restriction because older versions of oue used a bundled "-iI" in the examples in documents like this one.

Remember the past (-D state)

The memory features make the other use-cases for oue easier to explain. Oue builds a GDBM file from the input keys. Each unique element is reduced to a key (which we'll talk about later) and a memory. That pair records the existence of the element in the GDBM file, which is called state on the command line. When no state is specified on the command line, oue builds one in a temporary file, then removes it after the input is exhausted.

If you need to keep the state to process more data, you'll have to pick a file for oue to remember the keys it has already seen. This file is usually pretty compact (only growing linearly with the size of the unique keys). After you're done with it, you should remove it.

We need some input files to use in the first set of examples, so let's build (or imagine) 2 files (1.el and 2.el):

## build 2 files to draw keys from
$ cat >1.el <<\!
a
b
!
$ cat >2.el <<\!
b
c
c
!

To find the unique elements (where each element is a single line), we would run:

## find the unique keys from both files (the union)
$ oue 1.el 2.el
a
b
c
Or, in reverse order:
## reverse the files order
$ oue 2.el 1.el
b
c
a

If we had to process the 1.el file on Monday and the 2.el file later in the week, then we'd have to keep the state file around between the executions:

## adding files later
$ oue -D /tmp/keep.db 1.el
a
b

## time passes
$ oue -D /tmp/keep.db 2.el
b
c

This option lets you remember the keys you've seen over a longer period of time. It also provides more features in combination with the other options -- mostly with -I.

Don't update with what we saw (-I prev)

In the example above, oue updates the state GDBM file for each new unique element. We saw "b" on the first run, but didn't exclude processing of it the second time. Let's fix that:
## start over
$ rm /tmp/keep.db
$ oue -D /tmp/keep.db -I /tmp/keep.db 1.el
a
b

## time passes
$ oue -D /tmp/keep.db -I /tmp/keep.db 2.el
c
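If oue isn't installed, the same exclude-and-remember behavior can be approximated with a flat state file, grep, and awk. This is only a sketch (the seen function and the keep.txt name are made up for the example):

```shell
# emulate "oue -D state -I state": drop lines already recorded in the
# state file, then print and remember the new unique ones
seen() {
	state=$1
	if [ -s "$state" ]; then
		awk '!dup[$0]++' | grep -Fxv -f "$state" | tee -a "$state"
	else
		awk '!dup[$0]++' | tee -a "$state"
	fi
}

printf 'a\nb\n' | seen keep.txt		# first run prints a and b
printf 'b\nc\n' | seen keep.txt		# second run prints only c
rm keep.txt
```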
There are use-cases where we'd like to exclude elements, but not remember them between runs. For example, we may build a file of the English words that use 'w' as a vowel ("cwm", and "crwth") for use in a spell check filter.
## build a state file for words that use w as a vowel
$ (echo cwm ; echo crwth) | oue -D vwls.none >/dev/null

We don't want to add words to the file with any other instance of oue, but we do want to exclude them from a list. So we consult the file with the -I option, which includes the GDBM in the processing and, by default, excludes those elements from the list of unique elements. Any updates go to state; no updates are ever directed to prev.

To use this in the spell checking pipe-line:

 ... list-words  | oue -I vwls.none | check-unique-words
This has 3 nice features:
- We only check the spelling of each word once, since oue removed any duplicate words.
- We don't have to spell check words without a vowel, since oue removed those as previously seen keys.
- We don't have to sort the input stream, since oue doesn't depend on key order for the uniqueness checking (unlike sort's -u option).
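The ordering point is easy to demonstrate with standard tools (here awk's first-seen filter stands in for oue, since the principle is the same):

```shell
printf 'cherry\napple\ncherry\nbanana\n' > fruit.list
sort -u fruit.list		# alphabetical order: apple, banana, cherry
awk '!seen[$0]++' fruit.list	# first-seen order: cherry, apple, banana
rm fruit.list
```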

This is a silly example, but it shows the general case well. More often I use oue to filter the names of target hosts, so I only visit each host once, and might filter out hosts that are known to be unavailable.

Invert the sense of prev (-v)

The prev GDBM usually removes keys from the output, but under the -v switch keys must exist in that GDBM to be included (once) in the output list. That is to say, oue intersects the keys in prev with the input keys to form the output key list.

This option is named for grep's inversion option.

As an example, we can intersect our first two key files -- first building a state from 1.el named state.db, then using that to intersect 2.el:

## find the keys in common (the intersection)
$ oue -D state.db 1.el >/dev/null
$ oue -vI state.db 2.el
b

Disjunction

Since we can intersect and union key sets, it would complete our set operations if we could form a disjunction (the unique elements that are not common to the 2 key lists). We can, but we must form the intersection set first:
## find the keys that are non-intersection (the disjunction)
$ oue -D state.db 1.el >/dev/null
$ oue -vI state.db 2.el | oue -D intersect.db >/dev/null
$ oue -I intersect.db 1.el 2.el
a
c
Be sure to clean up any left-overs from such an operation:
## clean up our temp space
$ rm intersect.db state.db
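When the inputs are small enough to sort, comm computes the same set operations; here is a cross-check of the results above. Note that comm demands sorted, duplicate-free inputs -- exactly the restriction oue removes:

```shell
sort -u 1.el >1.sorted ; sort -u 2.el >2.sorted
comm -12 1.sorted 2.sorted		# the intersection: b
comm -3 1.sorted 2.sorted | tr -d '\t'	# the disjunction: a and c
rm 1.sorted 2.sorted
```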

Reporting what we saw (-i)

Given that we can remember keys over time, it would be cool if we could export those keys back into a pipeline. The -i option specifies that the previous GDBM given as prev should be included in the output. If we use the vwls.none file we built in the last example and an empty input file (viz. /dev/null), we can recover the words in it with:
## look at the current keys in vwls.none
$ oue -iI vwls.none /dev/null
crwth
cwm

At the end of some reporting cycle, we may have to report on more than just the keys we saw over the previous executions of oue. We might have to keep some reference to where we saw the key (or other such data). We need to learn how to put that data into a GDBM before we can talk about getting it back out. But the option you want is -B.

This shows you that the default value is a lonely period ("."):

## look at the current keys and values
$ oue -iI vwls.none -B "%k -> %m" /dev/null
crwth -> .
cwm -> .

A refined key (-k key)

In many cases, oue is required to pull a part of each input element to form the key for unique comparison. If oue only worked on whole lines, it would be far less useful to an application programmer.

To select part of the input element as the key, the command-line option -k specifies a dicer expression. In dicer-speak, we call the first line in the element %1, the second input line would be %2, and so on. Later, we'll see how to make elements span multiple input lines, but for now assume that there is always 1 input line in an element.

If you already understand the dicer from xapply or mk, then you can skip down a bit.

The dicer breaks lines into fields based on a character, then may break that result again until it extracts the required datum. The usual split would be on white-space or a single character. In English, I would say "from the first line of the element extract the second word;" in the dicer-markup, I'd spell that "%[1 2]".

In awk I would say awk -F: '{print $3;}'; in the dicer, I would spell that "%[1:3]". The big difference is that the -F specification for awk applies to every record in the file, the dicer markup may change the separator character as often as required.
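For a concrete comparison on a made-up passwd-style line, both of the commands below pull the third colon-separated field, just as the dicer's %[1:3] would; the difference is that awk's -F (like cut's -d) fixes the separator for the whole run, while the dicer may change it at every extraction:

```shell
line='root:x:0:0:root:/root:/bin/sh'
echo "$line" | awk -F: '{print $3}'	# the third field: 0
echo "$line" | cut -d: -f3		# the same field via cut
```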

The /etc/passwd file holds the login information for each user on a UNIX system. The seventh field is the login shell. So, to get the list of unique shells from the password file, we could specify a key of %[1:7] with a shell command like:

## Find the unique shells on this host
$ oue -k '%[1:7]' /etc/passwd
/bin/ksh
/usr/sbin/nologin
/sbin/nologin

In English, I might ask for "the first digit after the open-paren on line 3". In the dicer, I would say %[3(2 1] to get the first word after the first open-paren, then use the mixer to pull out the first character: %([3(2 1],1).

$ (echo line1; echo line2; echo "call (000)265-1212") |oue -3 -R "%([3(2 1],1)"

Using the dicer and the mixer, you should be able to get almost any string you can describe in English out of well formatted text. If you can't, just use another filter to make it easier to reach. This is usually done with sed or perl, because they can duplicate lines and strip off the parts that made it hard -- then oue can scan parts of (or the whole of) the new line to get the required value(s).

Memories (-r memory)

In the last example, we selected the unique shells, but we don't know which login uses each shell. For example, I may want to know which login uses the out-dated reference to /sbin/nologin.

In other words, we want to remember a different part of the element along with the key that inserted it. The -r option specifies a dicer expression to select the part of the element to remember (called the memory).

If we run the oue command with the -r specification, we see exactly the same output:

## Find the unique shells on this host
$ oue -k '%[1:7]' -r '%[1:1]' /etc/passwd
/bin/ksh
/usr/sbin/nologin
/sbin/nologin

That's because the result of the %[1:1] dicer specification went into the GDBM, which was removed when oue finished. We should ask oue to output the data as it inserts it. Keep reading to see how to fix that.

Report more than the key (-R report)

To report more than the key for each element, specify a report dicer expression under -R. Oue outputs the results of this dicer expression to stdout for each insert into state, or under -i for each record in prev.

By adding -R '%[1:1] -> %[1:7]' to the example command, we get a nifty table:

## Find the unique shells on this host
$ oue -k '%[1:7]' -r '%[1:1]' -R '%[1:1] -> %[1:7]' /etc/passwd
root -> /bin/ksh
daemon -> /usr/sbin/nologin
bin -> /sbin/nologin

It looks a little redundant to specify possibly long dicer expressions twice on that command-line; so, in the context of the report option, you should use %k for the current key and %m for the current memory. These also work under -i, where the current input lines are not available. The better command would be:

## Find the unique shells on this host
$ oue -k '%[1:7]' -r '%[1:1]' -R '%m -> %k' /etc/passwd
root -> /bin/ksh
daemon -> /usr/sbin/nologin
bin -> /sbin/nologin

We could use more of the environment oue provides to record the line number (%n) that gave us the fact -- but in this case we don't need it. The login name uniquely selects a line from the password file, we hope; if it doesn't, you need the next section.

Duplicate keys (like uniq -d)

If the logins in the password file are not unique, we can use oue to find out!
$ oue -d -k '%[1:1]' /etc/passwd
I hope that command never shows you any output. If it did it would be telling you that more than 1 login had the same name, which would be a bad thing. With shadow password files, on the other hand, almost every account has x as a password, so
$ oue -d -k '%[1:2]' /etc/passwd
x
That should only show you "x" on a line by itself; if it shows more unique passwords in the world-accessible file, you might think about an upgrade.

For a more advanced example: to check the HTML source of this document for duplicate id tags, I used:

$ sed -n -e ': again
	/.*[Ii][Dd]=\([^<>]*\)\(.*\)/{
	h
	s//\1/p
	x
	s//\2/
	b again
}' oue.html |tr -d '"' |oue -d

Count of key instances (like uniq -c)

By itself, the count option (-c) outputs the unique keys prefixed by the count of the number of times the element occurred in the filtered input. To do this, the filter must remember each unique key and a counter for each one. Because of the hash used to store this state, the output is not in any stable order by default.
$ oue -c 1.el 2.el
2 b
2 c
1 a

If the order of the input keys must be reflected in the output order, ask oue for "stable output":

$ oue -cs 1.el 2.el
1 a
2 b
2 c
$ echo "x" | oue -cs 2.el - 1.el
2 b
2 c
1 x
1 a
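Where oue isn't available, the stable count can be approximated in awk by remembering the order in which each key first appeared (a sketch of the idea, not of oue's implementation):

```shell
# count keys, reporting in order of first appearance (like oue -cs)
printf 'a\nb\nb\nc\nc\n' |
awk '{ if (!count[$0]++) order[++n] = $0 }
     END { for (i = 1; i <= n; i++) print count[order[i]], order[i] }'
# 1 a
# 2 b
# 2 c
```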

The stable option does require another temporary file, which is the size of all the accumulated keys. For comparison, here are the resource usages for some similar commands (run from /usr/share/dict), which make it clear that uniq is much faster if you know the input is already sorted and don't need the dicer to find the key:

Command             real    user    system
cat words           0.01s   0.00s   0.00s
uniq words          0.28s   0.27s   0.00s
oue words           4.30s   1.11s   2.89s
sort -u words.rnd   0.75s   0.53s   0.00s
uniq -c words       0.80s   0.62s   0.01s
oue -c words        8.22s   2.25s   5.15s
oue -cs words       12.32s  2.75s   7.49s

But when you need to process each element to find the key, and the elements are not already sorted -- you can pay the 10 seconds to get the right answer. Also take care when testing uniq: unlike most filters, it writes to any second file specified, so it is not really a filter.

Duplicates counted, processed, and more

I selected options for oue that work in almost any combination. It is fine to specify both -c and -d:
$ oue -cd 1.el 2.el
2 b
2 c

More to the point -v inverts the selection logic of -d:

$ oue -cdv 1.el 2.el
1 a

To get even more out of oue you can visit wonderland with me. Each key has a unique accumulator bound to it. This string variable allows you to record a fact (list, string, or dicer expression) about the elements selected for the key. When you report or remember the key the markup %e expands to the accumulated value. The option -e sets the every dicer expression that updates the accumulator.

Any accumulator needs an initial value, which is set under -f (called first). For example let's collect the place where we first found the key:

$ oue -dc -f "%f:%n" -R "%k from %e count %c" 1.el 2.el
b from 1.el:2 count 2
c from 2.el:2 count 2

$ echo "c" | oue -dc -f "%f:%n" -R "%k from %e count %c" 1.el - 2.el
b from 1.el:2 count 2
c from -:1 count 3

With every we can add each line number to the accumulator. For that we need the previous value of the accumulator, which the %p markup provides:

$ oue -dc -f "%f" -e "%p,%n" -R "%k from %e count %c" 1.el 2.el
b from 1.el,2,1 count 2
c from 2.el,2,3 count 2

That last output is misleading: the line number for the second "b" is from 2.el -- but this is a simple example. (More accurate output requires a sed filter to clean up the list of line numbers.)

The uppercase version of %p will not expand an empty previous accumulator. Rather, %P consumes characters from the template up to the next percent character, to remove any punctuation that would come between the empty value and the next expander. Thus -e "%P,%m" makes a list of key memory values without a leading empty field (when there was no previous value).
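The %P behavior (append with a separator, but drop the separator when the accumulator is empty) is the familiar list-building pattern. In awk it looks like this, on made-up two-column input of a key and a value to collect:

```shell
# collect a comma-separated list of values per key, no leading comma
printf 'b 1\nb 2\nc 3\n' |
awk '{ if ($1 in acc) acc[$1] = acc[$1] "," $2
       else { acc[$1] = $2; order[++n] = $1 } }
     END { for (i = 1; i <= n; i++) print order[i], acc[order[i]] }'
# b 1,2
# c 3
```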

Let's get crazy and expand the duplicate tags checker to output the lines with each duplicate anchor id:

$ sed -n -e ': again
	/.*[Ii][Dd]=\([^<>]*\)\(.*\)/{
	h
	=
	s//\1/p
	x
	s//\2/
	b again
}' oue.html |tr -d '"' |oue -d -2 -l -k %2 -e "%p,%1" -R "%k on lines %[m,-1]"
Which might output something like "tag on lines 18,504". Note that we used sed's = command to output the line number before each tag, then collected them as pairs of lines in oue (via -2).

Setup to count more

If you really want a running count across more than a single instance of oue, you'll have to add the previous count to the present value. Given a prev that holds the last count in the value (someplace you can get to with the dicer), you should specify an extract expression to find it, then a memory to build the new record.

That makes it sound too easy. Since oue really wants unique elements there is no way to make it update the previous GDBM. It is going to record the new count in the state GDBM, and it is up to you to get that merged with the previous; even there oue can help:

$ oue -cD count1.db -r "%c" -R '' 1.el
$ oue -D cur.db -R '' /dev/null
$ xapply -a@ '(
		oue -c -Icur.db -x %v -R %k.%t @1
		oue -B %k.%v -iIcur.db /dev/null
	) |oue -k %[1.-\$] -r %[1.\$] -D tmp.db -
	mv tmp.db cur.db
  ' 1.el 2.el
$ oue -iI cur.db -B '%k count %v' /dev/null
b count 2
c count 2
a count 1

This allows as many runs of oue as needed to process various input files, formats, and functions. I picked an xapply example, but in some scripts the loop is actually done with hxmd.

Expected accumulation versus quicker output (-l)

Under -d (or with no other options) the accumulator doesn't look like it works. That's because oue reports each key as soon as it knows it must (and it only has the complete count under -c). If you want to delay the output of every key until each instance has been collected, you need to specify -l.

Under -l, oue waits for the last instance of a key to output the report or update the state GDBM. For example, we only get a count of 2 when we provide 3 identical keys under -d:

$ (echo foo; echo foo; echo foo) |oue -d -R "%c %k"
2 foo

That's because 2 was enough to trigger the output, we didn't specify -c to force a complete total, so oue updated the state as soon as the count reached 2. To prevent this we specify the last option:

$ (echo foo; echo foo; echo foo) |oue -dl -R "%c %k"
3 foo

If you added an every specification to the example (I'll let you do it) we'd see the last value of the accumulator in the output, rather than the second.

Of course the stable option (-s) fixes the order of the output to match the input order, if that's important to you.

Summary after all the filters are done (-B)

I think one of the best features of oue is the summary report you might generate after all the processing. If you've been putting aside useful information in the state GDBM you can extract it under -B to build whatever report you need. I like to keep dates in the GDBM for some tasks (like updating software on a host). In the report I might output the name of the host (the key) and the date we updated it (the value), then sort those and put them up on a status page on my admin website.

The difference between -R and -B is that when there are current and previous records with different value markup, you need a way to separate the report logic from the replay logic. For summary reports, specify -B explicitly and use /dev/null as files.

As an example, let's parse the group file for login names that are not in the password file. First we'll make a list of the logins in the password file, with each login's gid as the memory.

$ tfile=`mktemp /tmp/test$$XXXXXX`
$ oue -D $tfile -k %[1:1] -r %[1:4] /etc/passwd >/dev/null
$ oue -B '%k login gid %m' -I $tfile /dev/null  # check our work
Next we should pull all the logins from /etc/group, using oue to filter out the ones we saw in the password file:
$ sed -e 's/^.*://' /etc/group | tr -s ',\n' '\n' | oue -I $tfile
That should output nothing on a clean host.

We can use a similar tactic to check the other direction. We build a collection of all the primary login groups mapped to the logins that require them.

$ rm $tfile
$ oue -l -k %[1:4] -e '%P,%[1:1]' -r '%e' -D $tfile /etc/passwd >/dev/null
Then look for non-duplicates from the group file, and map that back into the list of logins that require the gid mapping:
$ oue -k %[1:3] -r %[1:1] -D $tfile.gids /etc/group >/dev/null
$ oue -B %k -I $tfile /dev/null | oue -I $tfile.gids -R '%k' |\
  oue -dI $tfile -R 'gid %k missing from /etc/group for %v'
gid 602 missing from /etc/group for charon
$ rm $tfile $tfile.gids		# cleanup

Change the escape character from % (-a c)

Like xapply, oue might need to use an alternate escape character. If that's the case, use -a to change it. I've never needed to do that. If I need to nest an oue inside another dicer-based filter I change the escape character in the enclosing process.

Multi-line elements (-span and -p pad)

If oue could only work on single line elements, we'd have to fit programs on either side of it to compress records into single lines, then uncompress them. That proves to be error-prone and too much work.

To read a 5 line record from a file, use a span of five as -5. Under that specification, the dicer provides %1, %2, %3, %4, %5, and %$. The %$ is the last line in the element, which, in this case, is an alias for %5.

So, the dicer expression %[2 5] is the fifth word on the second line of the record. What if the input doesn't have five lines (or a multiple thereof)? In that case, oue pads the last element with repetitions of the pad string specified under -p.
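The grouping (though not the padding) is similar to what paste does with multiple - arguments; this sketch joins pairs of input lines into single records, assuming the input is an even number of lines:

```shell
# join every 2 input lines into one output record
printf 'alice\n101\nbob\n102\n' | paste -d' ' - -
# alice 101
# bob 102
```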

Process ASCII NUL terminated input (-z)

Sometimes, you can't trust the output from a program to be safe; for example, the output of find over a publicly write-able directory. Someone could create files with newlines or spaces in their names.

For just that reason, find provides the -print0 option -- and oue provides -z to read it. Oue also outputs elements in the same style as the input; so, under -z the output report is sent on with NUL termination.

If you want to read under -z and write newline terminated records, put a tr on oue's output (viz. tr '\000' '\n'). If you want to block until all the unique elements are gathered, then use -D to keep them in a GDBM, and replay it under -iI later. I'd use the tr command.
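Here is a quick demonstration of why the NUL termination matters; the file name with a space survives the round trip intact (the demo directory name is made up):

```shell
mkdir -p demo
: > demo/plain ; : > 'demo/with space'
find demo -type f -print0 | tr '\0' '\n'	# one name per line, space intact
rm -r demo
```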

A presumptuous use of this would be to find the unique back-up files for the group file in /etc/OLD:

$ glob '/etc/OLD/group??????' |
	xapply -f 'printf "%1\\000" ;cat %1 ;printf "\\000" ' - |
	oue -z -2 -k %2 -R '%1 '
That does assume that the group file doesn't contain any NUL characters. The command outputs all the unique backup files: using that list with an inverse option creates a script to remove the duplicates. How clever. If you don't have printf(1) installed, the xapply might be better written with a tr to create the NUL delimiters as:
	xapply -f 'echo %1 |tr "\\n" "\\000" ; cat %1 ; echo "" |tr "\\n" "\\000" ' - |

Advanced dicer expression (the mixer)

Yes, you can use the mixer too, just like xapply. See the xapply HTML document. The mixer is available as %(dicer).

A good example of this is the typical reformatting of a number, like a phone number (or an account number), to put in dashes or other bling. See the dicer/mixer HTML document for some examples.

Using the dicer in your own code

Read the dicer HTML document which goes into more detail about the interface to the dicer code.

Unsafe GDBM updates (-N)

If you want to leave a state GDBM file unlocked while you update it, specify -N. It's your party: I hope you know what you're doing. I'd only use that on the default temporary file I know we're going to discard.

In fact if you don't specify -D then oue opens the temporary GDBM unlocked, because the file will be unlinked as soon as the filter is done.

Out of space errors (-b)

In rare cases the processing of an element might take more resources than the default 10KiB per input line. In that case oue outputs an error message like this:
oue: expansion is too large for buffer
Specify a larger value for length under -b. Be aware that oue doesn't allow values smaller than 1k per line.
$ oue -b 400k ... huge.rec

Some real world examples

Oue is oft used in scripts to suppress duplicates, be they error output, hosts, filesystems, files, process-ids, or any other text item.

Remember to remember

My operations team wants to be told about RAID controller errors. If a battery, disk, or controller needs replacement, we want to be told as soon as possible. But we don't want a flood of errors to hide other issues in the stream.

We record all the unique errors for each day-of-year in a file. We filter errors through yesterday's file and today's. We'll use date to fetch the Julian day for today and yesterday:

#!/usr/bin/env ksh
# $Id: ...
# Only show the same error every 2 days, if we can't find a place to
# write the squelch files, just become a cat.
# $Purge: /usr/local/sbin/kruft -A 10w /var/local/squelch
cd /var/local/squelch || exec cat
CUR=`date +%j`.db  PREV=`TZ=GMT+24 date +%j`.db
[ -f $PREV ] || oue -D$PREV /dev/null			# assure prev exists
exec oue -I$PREV -D$CUR
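The TZ=GMT+24 trick works because it shifts the wall clock back a full 24 hours, so the day-of-year it reports is always yesterday's:

```shell
date +%j		# today's Julian day
TZ=GMT+24 date +%j	# yesterday's, always one day behind
```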

That does require a cleanup job to remove the older files. I use a crontab entry (or kicker task) to run kruft once a day:

8 21 * * * mk -smPurge .../squelch
This leaves a trail of all the errors for the last 10 weeks in the squelch directory for admins to review (with oue). We put all the tuning parameters in the script itself for easy deployment, except the time the cleanup runs, which could be in there too.

Evolutions of this script may allow longer suppression of errors, or might send recurring issues to another escalation chain.

Process log files to retire services

I need to get a list of all the IP clients connecting to a service we are planning to retire. From that list I'll contact the owners of the client applications to help them upgrade to the replacement. So over time the list of clients should drop, until I can shutdown the existing service and retire the supporting instances.

I have a number of log files on each instance that include the client information. These roll every day, so I'm going to look at the last 2 days of logs and find the unique clients. The decompress command outputs a line that looks like:

 service foo request from alien.npcguild.org (10.9.4.6) for ...
So I need a filter on each host to convert the verbose logs into a list of unique clients (we'll allow IPv6 as well):
#!/usr/bin/env ksh
# $Id:...revision control markup... $
find /var/puma/logs -name cli\*[0-9] -ctime -2 |
xapply -f 'local-decompress-command -f %1' - |
tr 'A-Z' 'a-z' | tr -s '\t ' ' ' |
sed -ne 's/^.*request from \([^ ]*\) (\([0-9.:a-f]*\)).*/\1 \2/p' |
oue
(We drop to lowercase because some spellings of the hostnames may be upper or mixed case.)

If I run that on all the hosts (with hxmd), I can direct the output to a local oue which will unique the sum of the streams:

$ hxmd -P20 -Cpuma.cf 'ssh -n HOST script-above' | oue -cl
Which outputs the number of target hosts that see at least one connection from each client. This makes it easy to measure the on-going progress of the retirement effort. As demand declines we can consolidate load onto fewer instances.

If we want to know which clients connect to the most instances we could change the summary command slightly:

$ hxmd -P20 -Cpuma.cf 'ssh -n HOST script-above | sed -e "s/^/HOST:/"' |
	oue -l -e '%P,%[1:1]' -k '%[1:-1]' -R '%k: %e' 
which outputs each "host IP" followed by the list of hosts with which it connected.

Summary

Oue is a handy filter for compressing lists into unique elements and finding common (unique) elements from any number of input lists. Oue also supports records with any fixed number of lines in them.

It largely replaces the common uses of comm, sort -u, and uniq in most pipelines. And removes the restriction that their input streams must be sorted. It works very well with find and xapply, even with NUL (\000) terminated streams.

Multiple passes of oue may use a history file (prev), which allows processing to continue efficiently. Additionally, oue may be used to build summary reports from saved GDBM files or directly from raw input (via accumulator logic), including occurrence counts and details.


$Id: oue.html,v 1.23 2012/09/25 15:08:47 ksb Exp $