
Error Analysis in AIX: Interpreting ERRPT Sense Data

When an error occurs on an AIX system, details about it are recorded in the error log. Listing the log with errpt shows the basics: when the error occurred and which resource it involved.

For example, consider the following error:

aix73lab:/# errpt -a

--------------------------------------------------------------------------
LABEL:           SC_DISK_ERR10
Date/Time:       Tue Feb  4 19:00:05 CEST 2024
Type:            PERM
Resource Name:   hdisk1
Description
REQUESTED OPERATION CANNOT BE PERFORMED
Detail Data
PATH ID
           0
SENSE DATA
0600 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0118 0005 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 1283 0000
041A 003D 001A FFFF FFFF 0100 0000 0118 0000 0005 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000
---------------------------------------------------------------------------

We can see when it occurred, that it involved hdisk1, and that the reported problem was “REQUESTED OPERATION CANNOT BE PERFORMED.”

But what happened that prevented the operation from being performed?

This can be discovered by analyzing the Sense Data field. This field contains detailed information that helps determine the cause of the error and what it pertains to.

For the error above, the decoded sense data reads:

Feb 04 19:00:05 hdisk1     P SC_DISK_ERR10       path  0 TEST UNIT READY  RESERVATION CONFLICT (0 Sec rtry 01)

As you can see, the answer is simple and clear: a SCSI reservation conflict occurred while the TEST UNIT READY command was being issued to the disk.

The Sense Data field is a hexadecimal dump, which makes it hard to interpret at a glance. It can be decoded by hand against a SCSI sense data reference, but that takes time, and during an outage time is always in short supply.
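To give an idea of what manual decoding involves, here is a minimal shell sketch that reads only the first line of the sense data. The byte offsets are an assumption based on the commonly documented SC_DISK_ERR layout (CDB starting at byte 2, status validity at byte 24, SCSI status at byte 25) and may differ for other drivers; the function names are made up for this example. The opcode and status values themselves are standard SCSI codes.

```shell
#!/bin/sh
# Sketch only: decode the SCSI opcode and status from the FIRST line of
# SENSE DATA in an SC_DISK_ERR entry. Offsets are an assumed layout.

# hex_byte WORD N: print byte N (1 or 2) of a 4-digit hex word.
hex_byte() {
    if [ "$2" -eq 1 ]; then
        printf '%s\n' "$1" | cut -c1-2
    else
        printf '%s\n' "$1" | cut -c3-4
    fi
}

# decode_sense LINE: print "<command>  <status>" for one sense-data line.
decode_sense() {
    # Split the line into 16-bit words: $1 is word 0, ${13} is word 12.
    set -- $1
    opcode=$(hex_byte "$2" 1)        # first CDB byte (assumed byte 2)
    validity=$(hex_byte "${13}" 1)   # assumed: 01 means SCSI status is valid
    status=$(hex_byte "${13}" 2)     # SCSI status byte (assumed byte 25)

    case $opcode in
        00) op_name="TEST UNIT READY" ;;
        03) op_name="REQUEST SENSE" ;;
        12) op_name="INQUIRY" ;;
        28) op_name="READ(10)" ;;
        2A) op_name="WRITE(10)" ;;
        *)  op_name="opcode 0x$opcode" ;;
    esac

    if [ "$validity" != "01" ]; then
        st_name="no valid SCSI status"
    else
        case $status in
            00) st_name="GOOD" ;;
            02) st_name="CHECK CONDITION" ;;
            08) st_name="BUSY" ;;
            18) st_name="RESERVATION CONFLICT" ;;
            28) st_name="TASK SET FULL" ;;
            *)  st_name="status 0x$status" ;;
        esac
    fi

    printf '%s  %s\n' "$op_name" "$st_name"
}

# First sense-data line of the hdisk1 error shown above.
decode_sense "0600 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0118 0005 0000 0000"
```

Even this covers only two fields of one line, which shows why decoding by hand is slow.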

Fortunately, there is a diagnostic tool, SUMM, that translates Sense Data into a “human readable” format. It is not installed by default, but it is worth adding: it helps with diagnostics, takes up little space, and has no dependencies.

The tool can be downloaded from the website:

https://www.ibm.com/support/pages/ibm-aix-diagnostic-tool-summ-summarized-system-error-log-and-report-generator-io-devices

It is easy to use: either pipe the detailed errpt output into it or pass it a saved error log file.

errpt -a | summ

or

summ error_log.txt

With it, you can turn this:

aix73lab:/# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
F31FFAC3   0205012525 I H hdisk0         PATH HAS RECOVERED
F31FFAC3   0205012525 I H hdisk1         PATH HAS RECOVERED
6382B81C   0205012525 T S vscsi1         Temporary VSCSI software error
DE3B8540   0205012425 P H hdisk0         PATH HAS FAILED
DE3B8540   0205012425 P H hdisk1         PATH HAS FAILED
DE3B8540   0205012325 P H hdisk0         PATH HAS FAILED
DE3B8540   0205012325 P H hdisk1         PATH HAS FAILED
81453EE1   0205012225 T S vscsi1         Underlying transport error

to this:

aix73lab:/# errpt -a | ./summ
Feb  5 01:25:47 hdisk0     I SC_DISK_PCM_ERR9    path  1 path recovered
Feb  5 01:25:01 hdisk1     I SC_DISK_PCM_ERR9    path  1 path recovered
Feb  5 01:25:00 vscsi1     T VIOS_VSCSI_ERR1     Temporary VSCSI software error
Feb  5 01:24:47 hdisk0     P SC_DISK_ERR7        path  1 path failure; TEST UNIT READY  transport fault
Feb  5 01:24:01 hdisk1     P SC_DISK_ERR7        path  1 path failure; TEST UNIT READY  transport fault
Feb  5 01:23:48 hdisk0     P SC_DISK_ERR7        path  1 path failure; WRITE(10)        (3E7D4160,0008) transport fault (15.9 Sec rtry 01)
Feb  5 01:23:32 hdisk1     P SC_DISK_ERR7        path  1 path failure; WRITE(10)        (24114778,0008) transport fault (15.9 Sec rtry 01)
Feb  5 01:22:55 vscsi1     T VIOS_VSCSI_ERR2     Underlying transport error
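When matching the two listings line by line, it helps to know that the TIMESTAMP column of the short errpt listing is in MMDDhhmmyy form. The small helper below (the name errpt_ts is made up for this example) turns it into a readable date; assuming the century is 20xx is a simplification.

```shell
#!/bin/sh
# errpt_ts TIMESTAMP: convert an errpt MMDDhhmmyy timestamp to
# "YYYY-MM-DD hh:mm". Assumes the two-digit year belongs to 20xx.
errpt_ts() {
    mm=$(printf '%s' "$1" | cut -c1-2)
    dd=$(printf '%s' "$1" | cut -c3-4)
    hh=$(printf '%s' "$1" | cut -c5-6)
    mi=$(printf '%s' "$1" | cut -c7-8)
    yy=$(printf '%s' "$1" | cut -c9-10)
    printf '20%s-%s-%s %s:%s\n' "$yy" "$mm" "$dd" "$hh" "$mi"
}

errpt_ts 0205012525   # prints 2025-02-05 01:25
```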

Some other examples of decoded data:

Feb  1 17:58:00 rmt12      P TAPE_ERR1           WRITE(6)         (040000) CHECK MEDIUM ERROR; WRITE ERROR
Feb  1 15:54:27 rmt15      P TAPE_ERR1           REWIND           (00000000) CHECK MEDIUM ERROR; MEDIA LOAD OR EJECT FAILED

Feb  1 12:45:41 hdisk311   T SC_DISK_ERR4        path  0 READ(10)         (102C4200,0300) transport fault
Feb 25 12:45:40 fscsi8     T FCP_ERR4            Error detected receiving from port 0x04BC30 (possible marginal link)

Feb  1 22:31:41 hdisk6     P SC_DISK_ERR7        path  3 path failure; WRITE(10)        (037E6C58,0008) no response from device
Feb  1 22:31:27 fscsi2     T FCP_ERR14           Nameserver GID_PN reject for 5....F; Port Name not registered
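Because each decoded entry is a single line, the output is easy to post-process with standard tools. As a rough sketch, the function below counts permanent (type P) errors per resource. It assumes the column positions seen in the samples above (resource name in field 4, error type in field 5), and the name count_perm is made up for this example.

```shell
#!/bin/sh
# count_perm: read summ-style lines on stdin and print "resource count"
# for every resource that logged permanent (type P) errors.
# Assumes whitespace-separated fields: month, day, time, resource, type.
count_perm() {
    awk '$5 == "P" { n[$4]++ } END { for (r in n) print r, n[r] }' | sort
}

# Example input: a few lines copied from the decoded listing above.
count_perm <<'EOF'
Feb  5 01:25:47 hdisk0     I SC_DISK_PCM_ERR9    path  1 path recovered
Feb  5 01:24:47 hdisk0     P SC_DISK_ERR7        path  1 path failure; TEST UNIT READY  transport fault
Feb  5 01:24:01 hdisk1     P SC_DISK_ERR7        path  1 path failure; TEST UNIT READY  transport fault
Feb  5 01:23:48 hdisk0     P SC_DISK_ERR7        path  1 path failure; WRITE(10)        (3E7D4160,0008) transport fault (15.9 Sec rtry 01)
EOF
```

On a live system the same function could be fed from the pipeline shown earlier, for example errpt -a | ./summ | count_perm.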

SUMMARY

The Sense Data in AIX errpt entries shows what actually failed: which command was being executed and how the device or transport responded. Decoding it by hand is slow. SUMM converts it into a human-readable format, making diagnosis quicker and easier.

One thought on “Error Analysis in AIX: Interpreting ERRPT Sense Data”

  1. Wow, very nice.
    We used to have a tool called psdb to check this kind of data, but it is no longer supported by IBM.
    Thanks
