To understand this document

This document is all about how to avoid recurring errors in your site's configuration management: in policy, in procedure, and in the structure of the system you deploy.

If you've ever made a mistake, you're good to go for this document. I also like the descriptions in the Field Guide to Human Error. Reading that first might help you understand this document.

What I've tried to provide is enough feedback and cross-checks in the operational processes and development work-flows to enable workers to compare their intent to the actual status of the automation they are driving. This is a balance that requires a large measure of diligence and a trust-but-verify attitude at all times.

No complex system is inherently safe, and the master source structure is just as dangerous as any other. The intent is to balance the power to make broad changes with the assurance that your actions are changing what you meant to change.

The goal of any configuration management structure is to "build what you promised from what you have, without errors, and in a timely manner"; see the master source top-level document. Mistakes in execution are more likely to break ongoing production than errors in content. Since our goal is always to make the structure better, we should take steps to avoid either type of failure.

To some extent every configuration management structure is clumsy and complex. How people balance and mitigate these issues differentiates success from failure. This document explains the reasoning and tactics I use to train people, maintain site policy, and justify my salary.

Types of errors

I'm going to break errors into 5 common groups. I do this not to build a clever taxonomy, but to focus my mitigation efforts on specific cases. You might find other types and find better mitigations, but this is my document:
Missing a change in status, metrics, paths, names, or values
Errors caused by this include sending an uncommitted change to production hosts, sending the wrong change to the target hosts, or sending the correct source to the wrong population.

The proximal cause of this type of error is almost always a missing close-the-loop step.

Mode error.
The safety was off and you pulled the wrong trigger. This includes activation of the wrong target by being in the wrong directory, or running as the wrong login, or running a step out-of-order.

The proximal cause of this type of error is usually a loss of context: either an interruption in the process or the lack of a secondary checking your work. Sometimes it is also caused by normalization of deviance.

Data overload or losing state.
Getting lost in a graphical display architecture, or getting no (meaningful) feedback from the process.
Lack of coordination when changing common configuration inputs.
Priority mistakes.

Proximal versus distal causes

Proximal causes are those closest to the trigger event. Those include keystroke errors, perspective and context errors, and timing errors. Distal causes are those set up before the trigger event. Those include partial updates to input data (code, tables, databases, really any bits in-play), ineffective site policy or procedure, and lack of cross-group coordination.

In the new version of the master source I've tried to make all of the data proximal and available to the driver of each structure. The local recipe file (Makefile or Msrc.mk) and the platform recipe file (Makefile.host or Makefile) are both kept in the current working directory. No data is stored in a non-text format (we strongly prefer text files to database tables). There are command-line options to display the derived values of each step in the process, and options to dry-run most every operation.
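
For example, make's -n option lists the commands a target would run without running any of them (a generic make feature, not anything specific to msrc):

$ make -n install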

These give the driver feedback, and that feedback must be taken seriously by every driver. A push from the wrong directory, or with the wrong version of a key file, is just about the worst thing you can do to any production operation. I also include the current working directory in my shell prompt, PS1.
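
One way to do that in bash (a personal habit, not a requirement of the structure):

$ PS1='\u@\h:\w\$ '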

Proximal pitfalls remain

For example, specifying a production configuration file when a test file was required. This has actually happened to a valued team-member of mine: a request to send a configuration update to test systems was mistakenly sent to live production hosts, which resulted in a serious service failure. The two commands differed by exactly 4 characters:
$ msrc -Ctest.cf -P10 -E ....
vs:
$ msrc -Cprod.cf -P10 -E ....

To mitigate that we added a step to the procedure to stress that running a "do nothing update" before any push that might damage production is mandatory:

$ msrc -C.... : test
The : test command is selected because forgetting the colon runs an empty test command (which fails silently), and missing the test word doesn't hurt anything either. (Omitting the space fails to find the command :test, which is also harmless.)

The output from that command includes the list of instances that would be updated. That gives the driver two items that might trigger an abort reaction: an unexpected set of hostnames, or a host list that is too long or too short for the expected change targets. The attempt here is to offer feedback before the actual commit, and with history editing, replacing the : test with make install is trivial.
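
For example (a sketch built from the commands above, with the same options elided):

$ msrc -Ctest.cf -P10 -E .... : test
$ msrc -Ctest.cf -P10 -E .... make install

The first line shows the host list and changes nothing; the second is the same line recalled from history with the real target in place of : test.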

The fix for the aborted update is also clear: if you got the wrong set of hosts, then use efmd to produce the correct list. Once efmd shows the right list, you have the corrected options for msrc, since the two commands take identical options.

Keystroke errors

Each command is a step towards success or failure: choose with care.

Use the recipe file to record all the update commands you intend to run. Testing a recipe file's install target which updates 20 files on a test host is great. Keying in 20 install commands in a change window is insane. I don't know how anyone can justify the odds of a mistake in the latter case.
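
For example, a sketch of a recipe target that records the update steps (the file names here are hypothetical):

install:
	install -c -m 0644 ntp.conf /etc/ntp.conf
	install -c -m 0755 rc.ntpd /etc/init.d/ntpd

Once that target passes on a test host, the change window only requires a single make install.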

For similar reasons I avoid punned recipe files. When a single make recipe named Makefile serves as both the master and platform recipe file, one might activate an update in the wrong context.

If you don't put a command in a make recipe, embed it in a comment using mk markup. Never type a utility command of more than a few words. I try to avoid quoting shell meta-characters for the remote command as well: if you need a pipe, put it in the recipe. There may be an occasional need for a remote shell meta-character (usually &&), which is why msrc passes them quoted to each target host.
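
For example, a hypothetical target that keeps the pipeline in the recipe rather than on the command-line:

running:
	ps -ax | grep '[n]tpd' || echo "`hostname`: ntpd is not running"

The ps options vary by platform, so treat this as a sketch; the point is that the quoting lives in the recipe, not in the driver's head.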

On the other end of the typing errors: recipe names should be long enough to avoid single letter mistakes. Steps named n and m are easy to mistype and mistake for each other.

Always have at least 3 eyes on every change

Two people should both check any keyboard-entered commands run for a change. Nobody I've ever seen has 3 eyes, so the rule is there should be at least 3 eyes on every step of a change that is entered on a keyboard.

People rely on their experience to recognize key patterns that indicate whether things are going according to plan or not. The idea is that the two people making a change share a common mental model of what is supposed to happen, and are constantly checking their mental model against what is actually happening and against each other. This situational awareness is core to preventing mistakes. (This is also a core concept in pair programming, for the same reasons.)

I am assuming that all scripts used to make production change were reviewed and tested on a non-production set of hosts, well before the change window. If that is not the case, fix that first. There is little-to-no excuse to run any change without prior testing.

Checkouts that verify success

It is a great idea to have a checkout target in any update recipe. This should produce no output if all is well, and a statement of what is missing or wrong, which includes the instance name and an absolute path to at least one out-of-phase file.

This gives the person running the change a close-the-loop metric which enables them to close the change ticket with a positive assertion ("checkout complete") rather than an observation that they didn't see any obvious errors.

Note that the checkout recipe should never be a step in the update recipe. It might be run before the update to verify that the update has not been done, and it may well be run multiple times (by the third eye requirement) as a post-update step.
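
A minimal sketch of such a target, using the same hypothetical file name as above:

checkout:
	@cmp -s ntp.conf /etc/ntp.conf || echo "`hostname`: /etc/ntp.conf is missing or out-of-phase"

It prints nothing when the installed copy matches the master copy, and names the instance and an absolute path when it does not; note that install never depends on it.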

Processes which can't stop, are moded, or lack progress

Recovery from common operations should never be difficult. It always comes as a terrible surprise to an operator when something like a keyboard interrupt puts the structure in a nearly unrecoverable state. Clearly mark any step that must run to completion.

Similarly a structure that changes modes without a clear request from the operator is really bad. The old complaint from emacs users that vi's modes were bad is ironic, in that emacs has even more modes, and it can change modes without keyboard input.

Just as bad are processes that offer no feedback. This is why file transfer programs (like scp) offer an updated progress metric as they copy data over the network. Show status for long processes. Status is more important than behavior: don't tell the operator about internal details they didn't already know about. People do not deal well with extra information they do not understand.

Half-time

A fair summary of the last two sections would be:
Knowledge of the current situation prevents mistakes.
This is true for errors in execution. It is not true for errors caused by some distant events.

The biological term for "the part of an organism that is distant from the point of inspection (or connection)" is distal. Failures that result from external forces, or actions taken by parties outside of the driver's work-group (or span-of-control), are therefore distal sources of error.

Preventing impacts from distal sources of error

This is always harder. Policy makers that are disconnected from the situation in production operations come up with some fabulously painful and degrading blame-bombs. It is far easier to blame the person at the keyboard for the four character error as the "weak link" in the process. It is also patently unfair to do so. Several distal sources combined to give that spelling error far greater impact than it should have had. For example, the test hardware was different in many ways from the production hardware. This lack of alignment saved a little up-front money, and created an on-going operational tax on every change to the system.

So I don't subscribe to that doctrine: the weak link is almost always a process that provided little feedback or visibility (e.g. a GUI) or a procedure that had no useful cross-checks before the commit action was taken. The missing cross-check in this case was weighing the cost of on-going changes, and the added risk to those changes, against the small savings in capital costs for the slightly larger disks.

Distant sources of data need to be observable: just as the list of hosts we are about to update needs to be visible to the driver (as above), the reasons for each step in the process need to be just as clear to the driver. Steps which add no certainty to the process are of little value to the driver. What gives each step in the process value? Here is a list I would start with:

Clear results

The output from the process is organized and fairly easy to read (possibly with some training).

Actionable messages

The basic UNIX™ shell commands have a common pattern for error messages:

command: noun: error-message
For example, I'll spell the null device wrong:
$ cat /dev/mull
cat: /dev/mull: No such file or directory

This error message doesn't tell the driver which component of the path is wrong, but it gives her a finite number of places to inspect.

Cut points

If a key step fails, then any automation should stop as soon as possible. Never depend on the driver to interrupt the process from the keyboard. The failure should be as close to the last line of output as you can make it, and include a key phrase like "noun failed to verb". The best thing about the standard error messages is that they are common across many tools, and they are available in most locales. They are also clearly spelled out for each system call in the manual pages for sections 2 and 3. That is not to say they are clear to a novice, but they are consistent and can be learned.

And nearly every base tool exits with a non-zero exit code when it fails. So check the status of any command that matters, and don't run commands that don't matter.
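
For example, a minimal sketch in sh (the file name is hypothetical); make does much the same for free, since it stops a recipe when any command exits non-zero:

cp ntp.conf /etc/ntp.conf || {
	echo "`hostname`: update of /etc/ntp.conf failed" 1>&2
	exit 1
}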

Investigation of failures should include cross-checks from the point-of-view of any distal inputs. Any distal part that has a way to cross-check our work should have an interface to test it ad hoc. Use those to recover from failures.

Restart points

If there is a possible termination-point in the process, then there should be a clear on-ramp to resume the process after the issue is resolved. This may require a return to a previous step. This may even require a wholesale backout of the work already done. Live with that, and learn to accept temporary failure as long-term success.

Verification of results

In the physical world we are bombarded by our senses with input data, so much so that we have to ignore most of it. In the digital world one must request data to see it.

Actions to prevent mistakes require not assuming that others have a similar understanding of the situation. Verification steps assure that the driver and their secondary agree on the status of the change. Never let a chance to run a verification pass you by.

Avoid any "normalization of deviance". If any output in the process looks funny then stop to confirm that output was (in some way) expected. Situational awareness is key in configuration management, and viewing all available data before taking actions (unplanned or planned) is the key to stable operations.

Change of plans is usually a sign of trouble

Avoid dealing with newly emerging requirements in an event-driven or uncoordinated way. Discovering "new knowledge" as part of a planned change takes you out of the "envelope of pathways to a safe outcome".

Anticipate the available resources before your change. If the resources are not what you expected, find out why.

Assure that distant errors make it home

We have that base of best practice to build on, so we should add value and carry as much information to the driver as possible. That means we might prefix a standard error message with the name of the instance that produced it:
instance: command: noun: error-message

We also should carry exit codes back from remote commands.

We should build a structure to examine the exit codes from each update, and take action for unexpected results.
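
A rough sketch of that idea in sh, not the msrc implementation (the host list, remote directory, and log file names are all hypothetical):

for h in `cat hosts.list` ; do
	ssh $h 'cd /var/tmp/push && make install' >$h.out 2>&1
	code=$?
	sed "s/^/$h: /" $h.out
	[ $code -eq 0 ] || echo "$h: make install exited $code" >>failed.out
done

Each line of remote output comes back prefixed with the instance name, and any non-zero exit code is recorded for follow-up.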

Preventing systemic errors

To prevent systemic errors, look at our configuration management structure from the operator's point of view. Each touch-point needs a close-the-loop operation: look from the operator's position to locate the data that would be most useful for each of their decision points.

People running operations, development, and change management are working under rules that make sense given the context, indicators, operational status, and their organizational norms. Always look for the outcomes and messages that will cause them to take the best action after each step. Make them aware of something that is not "normal" and they may take action to avoid making it worse. Hiding failures, cross-checks, or other related data from them gives them no context to take compensatory actions.

Offered, process, and requested information

Information is available in different measures. Some information is offered in the course of standard operating procedure, some is produced as part of the process, and other data is only available by explicit request.

The common GNU build is a great example of this. A README file in each product source directory is visible in the output of ls. This is being offered to the builder in a culturally normal way, because the most common action of an operator after unpacking a source directory is to cd to it and then run ls. In fact a source directory without one of these is quite rare.

Along the same lines, a file named configure in the directory is usually a script built by autoconf. If the README instructs the operator to run that script, then they will usually do that, the expectation being that the operation of that script does no harm. Moreover, that configure script shows you what it finds as part of the process of execution.

After the product is built (and maybe installed) the operator may request the version of the application under a common command-line option, usually -V or --version. This is compared to the last known version to assure that the update did the right thing.

This canonical chain (README to configure to -V) has changed very little in the last 20 years.
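
A sketch of that chain for a hypothetical product (the program name and version are made up):

$ ls
README  configure  Makefile.in  src
$ ./configure && make && make install
$ frob --version
frob 2.4.1

The README is offered, the configure output is produced by the process, and the version string is only available on request.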

Your local site policy should call out which style of information each process requires. If all information must be requested, you should expect more failures.

Request for changes

Changes to production systems are always triggered by some need for the change. I would never update a production system just because the clock changed, or the up-stream sources incremented a version, release, or distribution number.

Changes happen because we need them. I have run production machines with uptimes of more than 2,400 days. There was no compelling reason to update the operating system, so no need to reboot them.

Someone must request each change. That's not to say that the same group issues every change request: some changes are triggered by different stakeholders than others. Local site policy should state as clearly as possible who requests which changes.

Autonomous changes by automation

Some managers think that auto-updates are the key to speedy changes. Automatic application of patches can be a great idea. But that assumes that these patches are nearly perfectly tested, and carry virtually no risk of failure. The first time a global update removes 100% of your production services will be the last automatic update ever performed.

How changes are requested

The only fixed part of the process is a clear audit trail. That could be in e-mail, a request queue, minutes of a meeting, or paper records. To cover any audit requests each site needs to keep a secure backup copy of the history of requests and their outcomes.

This might be as easy as setting up a pair of nntp news servers and publishing a log of each request and the log of each change supporting that request. I've done that for more than 25 years, with absolutely no regrets. This solution also allows operators to Cc: an e-mail gateway to the news service in customer correspondence.

rcsdiff, cvs diff, and the like
tickle + email
msync + email?
TODO files
level2s msync  (all checks it makes)
level3s build
level3s diff
Focus attention on the basis of earlier assessments (of errors).
limit scope of new learning (sh, make, m4, rdist, maybe xapply markup)
keep similar circumstances truly similar
History editing is great, but view the entire command, not just the first
	80 characters!
Adapt this to the way you do your work, or at least meet in the middle.
	local site policy
	don't get complacent
Stability is more important than a short-term plan.  If a change is
so important that is cannot be aborted, then you've already failed.
Address organizational pressures to choose schedule over system safety
before the change starts.  Lack of a testing environment is not
acceptable for a risky change.
Error types:
	Mode error.
	Getting lost in display architectures.
	Lack of coordination when changing common configuration inputs.
	Wrong task priority.
	Data overload.
	Not noticing changes (in status, metrics, paths, names, values) [~graphic].
Cognitive consequences of computerization
	Computers increase demands on people's memory;
	Computers ask people to add to their package of skills and knowledge;
	computers increase the risk of people falling behind in high tempo
	operations;
	Computers can undermine people's formation of accurate mental models of
	how the system and underlying process works;
	knowledge calibration problem: thinking you know how the system
	works when you know very little of the actual model;
	compartmentalization limits the reach of relevant information
	making a list of hosts with a kludge is a bad idea -- be sure that
	the reason a host is selected is the right one.
automation traces
	op rules log
		describe that logging
	install's log, if you use ksb's install
		describe how to enable that!
	local.defs might record each command
	;

See also

http://www.chiark.greenend.org.uk/~sgtatham/bugs.html
$Id: error.html,v 1.5 2012/11/10 23:14:57 ksb Exp $