When an error occurs on an AIX system, information about it is written to the error log. Listing the logged errors with errpt shows the basics: when the error occurred and what it was related to.
For example, analyzing the error:
aix73lab:/# errpt -a -------------------------------------------------------------------------- LABEL: SC_DISK_ERR10 Date/Time: Tue Feb 4 19:00:05 CEST 2024 Type: PERM Resource Name: hdisk1 Description REQUESTED OPERATION CANNOT BE PERFORMED Detail Data PATH ID 0 SENSE DATA 0600 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0118 0005 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 1283 0000 z041A 003D 001A FFFF FFFF 0100 0000 0118 0000 0005 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ---------------------------------------------------------------------------
We can see when it occurred, that it was related to hdisk1, and the problem that occurred was “REQUESTED OPERATION CANNOT BE PERFORMED.”
But what happened that prevented the operation from being performed?
This can be discovered by analyzing the Sense Data field. This field contains detailed information that helps determine the cause of the error and what it pertains to.
For the above error, additional information is:
Feb 04 19:00:05 hdisk1 P SC_DISK_ERR10 path 0 TEST UNIT READY RESERVATION CONFLICT (0 Sec rtry 01)
As you can see, the answer is simple and clear – a reservation conflict (SCSI) occurred.
The Sense Data field is a hexadecimal dump, which makes it difficult to interpret quickly. It can of course be decoded by hand, for example with a SCSI sense data reference, but this takes time, and during a failure time is always in short supply.
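To give an idea of what decoding by hand involves, here is a minimal Python sketch that pulls the SCSI command and the SCSI status out of the first bytes of the Sense Data shown above. The byte offsets it uses (command length in byte 0, command bytes starting at byte 2, status validity in byte 24, SCSI status in byte 25) are an assumption based on the layout commonly described for AIX SCSI disk errors; they fit this example, but other error labels may use a different layout. The function name decode_disk_sense and the two small lookup tables are hypothetical and deliberately incomplete.

```python
# A few opcodes and status values from the SCSI standard; the tables are not complete.
SCSI_OPCODES = {
    0x00: "TEST UNIT READY",
    0x01: "REWIND",
    0x08: "READ(6)",
    0x0A: "WRITE(6)",
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
}
SCSI_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
}


def decode_disk_sense(sense_text):
    """Decode the leading bytes of an SC_DISK_ERR* SENSE DATA dump.

    Assumed layout (an assumption, not verified for every error label):
      byte 0    length of the SCSI command
      byte 2    first byte of the SCSI command (the opcode)
      byte 24   status validity (0x01 = the SCSI status byte is valid)
      byte 25   SCSI status
    """
    # Join the 4-digit groups and convert the hex string to raw bytes.
    raw = bytes.fromhex("".join(sense_text.split()))
    if len(raw) < 26:
        raise ValueError("need at least 26 bytes of sense data")
    opcode = raw[2]
    status = raw[25]
    return {
        "command_length": raw[0],
        "command": SCSI_OPCODES.get(opcode, f"opcode 0x{opcode:02X}"),
        "scsi_status_valid": raw[24] == 0x01,
        "scsi_status": SCSI_STATUS.get(status, f"status 0x{status:02X}"),
    }


# First 14 groups of the Sense Data from the hdisk1 error above.
sample = ("0600 0000 0000 0000 0000 0000 0000 0000 "
          "0000 0000 0000 0000 0118 0005")
result = decode_disk_sense(sample)
print(result["command"], "/", result["scsi_status"])
```

Under these assumptions the sketch reports a TEST UNIT READY command that ended with a RESERVATION CONFLICT status. Even this small case needs a layout table and two code tables, which is why doing it by hand under pressure is unattractive.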
Fortunately, a diagnostic tool – SUMM – is available, allowing the translation of Sense Data into a “human readable” format. It is not available by default in the system, but it is worth installing as it helps with diagnostics, takes up little space, and has no dependencies.
The tool can be downloaded from the website:
It is very easy to use and allows for quickly obtaining additional information about the problem.
errpt -a | summ
or
summ error_log.txt
With SUMM, you can convert this:
aix73lab:/# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
F31FFAC3   0205012525 I H hdisk0         PATH HAS RECOVERED
F31FFAC3   0205012525 I H hdisk1         PATH HAS RECOVERED
6382B81C   0205012525 T S vscsi1         Temporary VSCSI software error
DE3B8540   0205012425 P H hdisk0         PATH HAS FAILED
DE3B8540   0205012425 P H hdisk1         PATH HAS FAILED
DE3B8540   0205012325 P H hdisk0         PATH HAS FAILED
DE3B8540   0205012325 P H hdisk1         PATH HAS FAILED
81453EE1   0205012225 T S vscsi1         Underlying transport error
to this:
aix73lab:/# errpt -a | ./summ
Feb 5 01:25:47 hdisk0 I SC_DISK_PCM_ERR9 path 1 path recovered
Feb 5 01:25:01 hdisk1 I SC_DISK_PCM_ERR9 path 1 path recovered
Feb 5 01:25:00 vscsi1 T VIOS_VSCSI_ERR1 Temporary VSCSI software error
Feb 5 01:24:47 hdisk0 P SC_DISK_ERR7 path 1 path failure; TEST UNIT READY transport fault
Feb 5 01:24:01 hdisk1 P SC_DISK_ERR7 path 1 path failure; TEST UNIT READY transport fault
Feb 5 01:23:48 hdisk0 P SC_DISK_ERR7 path 1 path failure; WRITE(10) (3E7D4160,0008) transport fault (15.9 Sec rtry 01)
Feb 5 01:23:32 hdisk1 P SC_DISK_ERR7 path 1 path failure; WRITE(10) (24114778,0008) transport fault (15.9 Sec rtry 01)
Feb 5 01:22:55 vscsi1 T VIOS_VSCSI_ERR2 Underlying transport error
Some other examples of decoded data:
Feb 1 17:58:00 rmt12 P TAPE_ERR1 WRITE(6) (040000) CHECK MEDIUM ERROR; WRITE ERROR
Feb 1 15:54:27 rmt15 P TAPE_ERR1 REWIND (00000000) CHECK MEDIUM ERROR; MEDIA LOAD OR EJECT FAILED
Feb 1 12:45:41 hdisk311 T SC_DISK_ERR4 path 0 READ(10) (102C4200,0300) transport fault
Feb 25 12:45:40 fscsi8 T FCP_ERR4 Error detected receiving from port 0x04BC30 (possible marginal link)
Feb 1 22:31:41 hdisk6 P SC_DISK_ERR7 path 3 path failure; WRITE(10) (037E6C58,0008) no response from device
Feb 1 22:31:27 fscsi2 T FCP_ERR14 Nameserver GID_PN reject for 5....F; Port Name not registered
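Because each decoded entry fits on one line, the output is also easy to post-process. The following Python sketch counts permanent (type P) errors per resource. The line pattern is inferred only from the sample lines in this article, not from any SUMM specification, and the names parse_summ_line and permanent_errors_by_resource are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines in this article (an assumption):
# month, day, time, resource name, one-letter error type, error label, free text.
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<resource>\S+)\s+"
    r"(?P<type>[A-Z])\s+(?P<label>\S+)\s*(?P<text>.*)$"
)


def parse_summ_line(line):
    """Split one decoded line into its fields; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def permanent_errors_by_resource(lines):
    """Count entries of type P (permanent) for each resource name."""
    counts = Counter()
    for line in lines:
        entry = parse_summ_line(line)
        if entry and entry["type"] == "P":
            counts[entry["resource"]] += 1
    return counts


# A few lines copied from the decoded output shown earlier.
sample = """\
Feb 5 01:25:47 hdisk0 I SC_DISK_PCM_ERR9 path 1 path recovered
Feb 5 01:25:00 vscsi1 T VIOS_VSCSI_ERR1 Temporary VSCSI software error
Feb 5 01:24:47 hdisk0 P SC_DISK_ERR7 path 1 path failure; TEST UNIT READY transport fault
Feb 5 01:24:01 hdisk1 P SC_DISK_ERR7 path 1 path failure; TEST UNIT READY transport fault
Feb 5 01:23:48 hdisk0 P SC_DISK_ERR7 path 1 path failure; WRITE(10) (3E7D4160,0008) transport fault (15.9 Sec rtry 01)
""".splitlines()

counts = permanent_errors_by_resource(sample)
for resource, number in sorted(counts.items()):
    print(resource, number)
```

To run it against a live system, the sample list could be replaced with lines read from standard input, with the decoded output piped into the script.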
SUMMARY
Interpreting the Sense Data in AIX errpt entries reveals a lot about a problem and makes it easier to understand what actually happened. SUMM simplifies the process by converting this data into a human-readable format, making diagnosis quicker and easier.
Wow, very nice.
We used to have a tool called psdb to check this kind of data, but it is no longer supported by IBM.
Thanks