Bad Hardware Day: More on Hardware Bluescreens
Posted: July 1, 2010
Sure I wasn’t.
I have had bluescreens and other odd behavior for the two days since I last posted.
That’s my MSI K9N Neo F AMD/nVidia based motherboard. Four years ago to the month, it replaced another MSI board that died prematurely due to bad capacitors.
Guess what we see in the image above? Note the swollen tops of two capacitors just above the PCIE connector.
Note this too:
The CMOS battery—which is not the exact one in the photo, I changed it out—was reported bad.
Its voltage was completely flat when I put it in my battery tester. It also was corroded. You may be able to see some ugly brown residue from who knows what on the battery holder, just above the capacitor. Whatever it is, it has gotten to the board, as seen at the lower left side of the battery holder.
When I discovered all this, I was trying to decode the Machine Check Status code I posted last time. I wasn’t really happy with my non-answer and wanted to find the definite source.
Since I have an AMD processor, I found the AMD manuals. I’ll give a link to the Intel equivalents, but I mention AMD because I had a hard time tracking down their reference material, whereas Intel is mentioned everywhere in searches.
These are the AMD manuals I refer to:
- BIOS and Kernel Developer’s Guide for the AMD Athlon™ 64 and AMD Opteron™ Processors
- AMD64 Architecture Programmer’s Manual Volume 2: System Programming
I’ll use a crash dump I got today (one of 4!!!). Unlike the last time I saw a defective motherboard bluescreen, these bluescreens are remarkably consistent: all have a bug check code of 0x124 (WHEA_UNCORRECTABLE_ERROR), with nearly the same status codes from what I’ve been able to tell. It’s a testament to the much-improved error handling in Windows Vista and Seven. I’m going to skip most of the debugger output, since that’s in my last post, and go to the specific processor machine check:
Section 2 : x86/x64 MCA
Descriptor @ fffffa800528a148
Section @ fffffa800528a2d0
Offset : 664
Length : 264
Flags : 0x00000000
Severity : Fatal
Error : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
Status : 0xb200001000010c0f
Binary: 10110010 00000000 00000000 00010000 00000000 00000001 00001100 00001111
Bits Mnemonic Description
63 VAL Valid
62 OVER Status Register Overflow
61 UC Uncorrected Error
60 EN Error Condition Enabled
59 MISCV Miscellaneous-Error Register Valid
58 ADDRV Error-Address Register Valid
57 PCC Processor-Context Corrupt
56–32 Other Information
31–16 Model-Specific Error Code
15–0 MCA Error Code
In our status code, bit 63 is set so it is valid. Bit 62 is unset so there’s no overflow. Bit 61 indicates an uncorrected error and bit 60 indicates an error condition enabled. Bit 57 is set and indicates processor context corrupt.
OK, it’s a fatal error and couldn’t continue.
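For anyone who would rather not count bits by hand, here is a minimal Python sketch of the same check. The flag positions come straight from the table above; the names `FLAGS` and `set_flags` are my own and are not part of any debugger or AMD tool.

```python
# Status value copied from the crash dump above.
STATUS = 0xB200001000010C0F

# Flag bits of the machine-check status register, per the table above.
FLAGS = {
    63: "VAL",    # Valid
    62: "OVER",   # Status Register Overflow
    61: "UC",     # Uncorrected Error
    60: "EN",     # Error Condition Enabled
    59: "MISCV",  # Miscellaneous-Error Register Valid
    58: "ADDRV",  # Error-Address Register Valid
    57: "PCC",    # Processor-Context Corrupt
}

def set_flags(status):
    """Return the mnemonics of the flag bits that are set, highest bit first."""
    return [name
            for bit, name in sorted(FLAGS.items(), reverse=True)
            if (status >> bit) & 1]

print(set_flags(STATUS))  # ['VAL', 'UC', 'EN', 'PCC']
```

The result matches the hand decode: valid, uncorrected, enabled and processor context corrupt, with no overflow.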
Other information in bits 56-32 is almost entirely clear; only bit 36 is set (the 00010000 byte in the binary above). The documentation says that the field is used for, amongst other things, ECC information, which I just do not have in that (or any) client PC. (SATV may have a server with ECC memory. Someday.) Bits 31-16 (the third word) are for the model-specific error code.
The model-specific error code is 0x0001 (only the lowest bit of the 16-bit field is set). I have an Athlon X2 4800+ Brisbane CPU that is at least three years old; the AMD documentation says I should look up the error code in a manual specific to that CPU but I couldn’t find one on their website. I would expect to see that field used on Opterons, their server CPU.
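The multi-bit fields can be pulled out the same way, by shifting and masking. This is only a sketch: the helper name `field` is made up for illustration, and the bit ranges are the ones from the table above.

```python
# Status value copied from the crash dump above.
STATUS = 0xB200001000010C0F

def field(value, high, low):
    """Extract bits high..low (inclusive) of value as an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)

other_info = field(STATUS, 56, 32)      # "Other Information"
model_specific = field(STATUS, 31, 16)  # Model-Specific Error Code
mca_code = field(STATUS, 15, 0)         # MCA Error Code

print(f"bits 56-32: {other_info:#x}")
print(f"bits 31-16: {model_specific:#06x}")
print(f"bits 15-0 : {mca_code:#06x}")
```

Printing the fields in hex makes it easy to compare them against the binary dump word by word.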
Moving on to the MCA error code, the last word, bits 15-0, is:
Binary: [omitting the first three words] 00001100 00001111
The high byte, 00001100, matches the bus error pattern, so I’ll use this format to decode it: 0000 1PPT RRRR IILL, where PP is Participation Processor, T is Timeout, RRRR is Memory Transaction Type, II means Memory or I/O and LL is Cache Level.
PP is 10b, “Local Node Observed Error as Third Party” (OBS). OK, whatever.
The timeout bit is not set so I presume it wasn’t something timing out.
RRRR is 0000b and that is a Generic error, which I assume to be “error not otherwise categorized”.
I is 11b, and that is also a generic error (“Something bad happened but I don’t know where?!”)
L (cache) is also 11b and also generic.
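The field-by-field decode above can be sketched in a few lines of Python. Treat the lookup tables as assumptions: only the values that show up in this dump (OBS, and the generic II and LL encodings) are confirmed by my reading; the other mnemonics are filled in from memory of the AMD manual and should be checked against it. The function name `decode_bus_error` is hypothetical.

```python
# MCA error code (bits 15-0 of the status value) from the crash dump.
MCA_CODE = 0x0C0F

# Mnemonic tables. Entries not seen in this dump are from memory of the
# AMD manual and are assumptions to be verified.
PP = {0b00: "SRC", 0b01: "RES", 0b10: "OBS", 0b11: "GEN"}
II = {0b00: "MEM", 0b01: "reserved", 0b10: "IO", 0b11: "GEN"}
LL = {0b00: "L0", 0b01: "L1", 0b10: "L2", 0b11: "LG"}

def decode_bus_error(code):
    """Split a bus error code of the form 0000 1PPT RRRR IILL into fields."""
    # Bits 15-11 must read 00001 for the code to be a bus error.
    if (code >> 11) != 0b00001:
        raise ValueError("not a bus error code")
    return {
        "PP": PP[(code >> 9) & 0b11],   # participation processor
        "T": bool((code >> 8) & 1),     # True if the request timed out
        "RRRR": (code >> 4) & 0b1111,   # memory transaction type, 0 = generic
        "II": II[(code >> 2) & 0b11],   # memory or I/O
        "LL": LL[code & 0b11],          # cache level
    }

print(decode_bus_error(MCA_CODE))
# {'PP': 'OBS', 'T': False, 'RRRR': 0, 'II': 'GEN', 'LL': 'LG'}
```

The pattern check at the top matters: other error classes (TLB, memory) reuse the low bits with different meanings, so the field split only makes sense once the leading 00001 has been confirmed.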
All this work by hand to get this error message:
Error : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
But you almost have to read the manual anyway just to skim the keywords as this message is composited from several keywords that describe specific types of errors and where and how they were found by the CPU that incurred the machine check. (FYI, most of the relevant information was gotten from the AMD BIOS and Kernel Developer’s guide, pages 120-130 of Chapter 3, “Memory System Configuration” and all of Chapter 5, “Machine Check Architecture”.)
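To show how the message is stitched together, here is a tiny Python sketch. The template is inferred from the single message in this dump, so it is an assumption about how the debugger builds the string rather than a documented format, and `compose_mnemonic` is a made-up name.

```python
def compose_mnemonic(ll, pp, rrrr, ii, timeout):
    """Join the per-field keywords into one bus error mnemonic.

    The field order is inferred from the one message in the dump.
    """
    return f"BUS{ll}_{pp}_{rrrr}_{ii}_{timeout}_ERR"

# Keywords for this crash: generic cache level, observed as third party,
# generic transaction, generic memory-or-I/O (shown as "*"), no timeout.
print(compose_mnemonic("LG", "OBS", "ERR", "*", "NOTIMEOUT"))
# BUSLG_OBS_ERR_*_NOTIMEOUT_ERR
```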
For most people it’s just enough to know the board was bad, but I hated the way I closed out my last post (“oh, I don’t know what the MCI status is and I don’t care!”) and I wanted to know this stuff. I still don’t know what “Bank 4” refers to, if it even refers to memory (I had shuffled my DIMMs around in the system hoping the error would follow a specific DIMM. Didn’t happen.)
Besides, it’ll help some poor guy or gal searching through Bing.
I know I will be making Newegg happy again very soon.
P.S. The Intel manual that describes machine checks for Intel CPUs is in two parts:
UPDATE: The NT Debugging blog has posted on the WHEA bugcheck. Geoff Chappell’s web site has an entry for 0x124 WHEA_UNCORRECTABLE_ERROR and a page describing the bug check function that WHEA invokes when it can’t fix an error.