System and Job Crashes: What To Do

The first thing to do is to distinguish between system crashes, system hangs, and job aborts.

A system crash is when the entire TSX system halts operation and produces an error message at the debug terminal (which is normally the console). The error message always begins with "Fatal system error:". At this point the system drops into the kernel mode debugger and you get a "DB>" prompt.

If your system is crashing in this manner you probably, but not necessarily, need to get us a crash dump. Follow the crash dump link to learn how and when to take a crash dump and what to do once you have one.

A system hang also stops operation of the entire system, but it is not accompanied by a "Fatal system error:" message. The best hope here is to try to force the system into the debugger so that you can take a crash dump. If you need to pursue this line of attack, follow the system hang link to learn more.

Some folks use the term "crash" to describe a job abort. A job is either interactive or detached, and every job is always running some program. Examples of interactive jobs are the copies of TPR that run when people use TSX-Online and the copies of TSKMON that run when people are at a command prompt. Detached jobs include the NAMED program, SMTPS, and so on. If TSX detects an invalid condition in one of these jobs it will abort that job. This is not a crash. Your system is not dead, although the death of a critical job, such as the NAMESERV program on a TSX-Online system, can make it seem dead because nobody can log on. Often you will see a message associated with a job abort, such as "Memory protection error". Follow the job abort link to learn more about coping with job aborts.
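
As a rough summary of the three cases, the triage can be sketched in Python. This is only an illustration and not part of TSX: the function name, the Incident type, and the two yes/no inputs are made up for the example. The only detail taken from the description above is that a crash message starts with "Fatal system error:".

```python
from enum import Enum


class Incident(Enum):
    """Hypothetical labels for the three situations described above."""
    SYSTEM_CRASH = "system crash"
    SYSTEM_HANG = "system hang"
    JOB_ABORT = "job abort"
    UNKNOWN = "unknown"


# Prefix of the crash message, as stated in the text above.
FATAL_PREFIX = "Fatal system error:"


def classify_incident(console_output: str,
                      whole_system_stopped: bool,
                      job_stopped: bool) -> Incident:
    """Classify an incident from what the operator observed.

    console_output       -- text copied from the debug terminal (assumed input)
    whole_system_stopped -- True if nothing on the system responds (assumed input)
    job_stopped          -- True if one job died but the system still runs (assumed input)
    """
    # A crash is identified by a console line starting with the fatal prefix.
    has_fatal = any(line.strip().startswith(FATAL_PREFIX)
                    for line in console_output.splitlines())
    if has_fatal:
        return Incident.SYSTEM_CRASH
    # Everything stopped, but no fatal message: treat it as a hang.
    if whole_system_stopped:
        return Incident.SYSTEM_HANG
    # The system is alive and only a job went away: a job abort.
    if job_stopped:
        return Incident.JOB_ABORT
    return Incident.UNKNOWN
```

For example, calling classify_incident with console text that contains only "Memory protection error", a running system, and a dead job returns Incident.JOB_ABORT, which points you to the job abort instructions rather than to a crash dump.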