Bad Hardware Day: More on Hardware Bluescreens

I was hoping not to follow up to my last post.

Sure I wasn’t.

I have had bluescreens and other odd behavior for the two days since I last posted.

That’s my MSI K9N Neo F AMD/nVidia based motherboard.  Four years ago to the month, it replaced another MSI board that died prematurely due to bad capacitors.

Guess what we see in the image above?  Note the swollen tops of two capacitors just above the PCIE connector.

Note this too:

The CMOS battery—which is not the exact one in the photo, I changed it out—was reported bad.

Its voltage was completely flat when I put it in my battery tester.  It also was corroded.  You may be able to see some ugly brown residue from who knows what on the battery holder, just above the capacitor.  Whatever it is, it has gotten to the board, as seen at the lower left side of the battery holder.

When I discovered all this, I was trying to decode the Machine Check Status code I posted last time.  I wasn’t really happy with my non-answer and wanted to find the definite source.

Since I have an AMD processor, I found the AMD manuals.   I’ll give a link to the Intel equivalents, but I mention AMD because I had a hard time tracking down their reference material, whereas Intel is mentioned everywhere in searches.

These are the AMD manuals I refer to:

I’ll use a crash dump I got today (one of 4!!!).  Unlike the last time I saw a defective motherboard bluescreen, these bluescreens are remarkably consistent: all have a bug check code of 0x124 (WHEA_UNCORRECTABLE_ERROR), with nearly the same status codes from what I’ve been able to tell.   It’s a testament to the much-improved error handling in Windows Vista and Seven.  I’m going to skip most of the debugger output, since that’s in my last post, and go to the specific processor machine check:

!errrec fffffa800528a038
Section 2 : x86/x64 MCA
Descriptor @ fffffa800528a148
Section @ fffffa800528a2d0
Offset : 664
Length : 264
Flags : 0x00000000
Severity : Fatal

Error : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
Status : 0xb200001000010c0f
The status word, from what I can see, has many bitfields that describe the location and cause of the error.  I’ll try breaking it down.  In binary:
.formats 0xb200001000010c0f
Evaluate expression:
Hex: b2000010`00010c0f
Decimal: -5620492266238833649
Octal: 1310000001000000206017
Binary: 10110010 00000000 00000000 00010000 00000000 00000001 00001100 00001111
Mangled copy from the AMD manual:

Bits    Mnemonic  Description
63      VAL       Valid
62      OVER      Status Register Overflow
61      UC        Uncorrected Error
60      EN        Error Condition Enabled
59      MISCV     Miscellaneous-Error Register Valid
58      ADDRV     Error-Address Register Valid
57      PCC       Processor-Context Corrupt
56–32             Other Information
31–16             Model-Specific Error Code
15–0              MCA Error Code

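In our status code, bit 63 is set, so the status is valid.  Bit 62 is unset, so there’s no overflow.  Bits 61 and 60 are set, indicating an uncorrected error and an enabled error condition.  Bits 59 and 58 are unset, so the miscellaneous-error and error-address registers hold nothing valid.  Bit 57 is set, indicating that the processor context is corrupt.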

OK, it’s a fatal error and couldn’t continue.
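The flag check above can be done mechanically. This is a minimal Python sketch that tests bits 63-57 of the status word against the mnemonics in the table; the names `STATUS_FLAGS` and `decode_flags` are made up for illustration and are not part of any debugger or AMD tool.

```python
# Flag bits of the machine check status word, copied from the table above.
STATUS_FLAGS = {
    63: "VAL",
    62: "OVER",
    61: "UC",
    60: "EN",
    59: "MISCV",
    58: "ADDRV",
    57: "PCC",
}


def decode_flags(status):
    """Return {mnemonic: bool} for bits 63-57 of a 64-bit status value."""
    return {name: bool((status >> bit) & 1) for bit, name in STATUS_FLAGS.items()}


# The status word from the crash dump in this post.
status = 0xB200001000010C0F
for name, is_set in decode_flags(status).items():
    print(f"{name:5} {'set' if is_set else 'clear'}")
```

For this status word it reports VAL, UC, EN and PCC as set and OVER, MISCV and ADDRV as clear.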

Other information in bits 56-32 is mostly unset; only bit 36 is set (the 00010000 byte in the binary above).  The documentation says that the field is used for, amongst other things, ECC information, which I just do not have in that (or any) client PC.  (SATV may have a server with ECC memory.  Someday.)  Bits 31-16 (the third word) are for the model-specific error code.

The model-specific error code is 00000000 00000001b (0x0001).  I have an Athlon X2 4800+ Brisbane CPU that is at least three years old;  the AMD documentation says I should look up the error code in a manual specific to that CPU but I couldn’t find one on their website.  I would expect to see that field used on Opterons, their server CPU.
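The three fields below the flag bits can be pulled out with shifts and masks. This is a small sketch using only the bit ranges from the table; `split_status` is a hypothetical helper name.

```python
def split_status(status):
    """Split a 64-bit status value into the three fields below bit 57."""
    other_info = (status >> 32) & 0x1FFFFFF  # bits 56-32 (25 bits)
    model_code = (status >> 16) & 0xFFFF     # bits 31-16
    mca_code = status & 0xFFFF               # bits 15-0
    return other_info, model_code, mca_code


# The status word from the crash dump in this post.
other_info, model_code, mca_code = split_status(0xB200001000010C0F)
print("other information   :", hex(other_info))
print("model-specific code :", hex(model_code))
print("MCA error code      :", hex(mca_code))
```

On this status word it yields 0x10 for bits 56-32 (bit 36 alone), 0x1 for the model-specific code and 0xc0f for the MCA error code.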

Moving on to the MCA error code, the last word, bits 15-0, is:

Binary:  [omitting the first three words] 00001100 00001111
The bits in the most significant byte (MSB) determine the category of the error found, which can be from the bus (HyperTransport), the GART (graphics), or the cache.
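Picking the category comes down to matching the leading bits of the 16-bit error code against a few fixed patterns. The sketch below assumes the three layouts as I read them in the manual (bus: 0000 1PPT RRRR IILL, memory/cache: 0000 0001 RRRR TTLL, TLB: 0000 0000 0001 TTLL) and assumes GART errors are reported under the TLB pattern; `classify_error_code` is a made-up name.

```python
def classify_error_code(code):
    """Name the category of a 16-bit MCA error code from its leading bits."""
    if (code & 0xF800) == 0x0800:   # 0000 1PPT RRRR IILL
        return "bus"
    if (code & 0xFF00) == 0x0100:   # 0000 0001 RRRR TTLL
        return "memory/cache"
    if (code & 0xFFF0) == 0x0010:   # 0000 0000 0001 TTLL (assumed to cover GART)
        return "TLB"
    return "unknown"


# The MCA error code from the crash dump in this post.
print(classify_error_code(0x0C0F))
```

The patterns are checked from the longest variable part to the shortest, so a code can only match one of them.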

The MSB in binary, 00001100, indicates a bus error, so I’ll use this field to decode it:  0000 1PPT RRRR IILL, where PP is Participation Processor, T is Timeout, R is Memory Transaction Type, I means Memory or I/O and L is Cache Level.
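The same layout, 0000 1PPT RRRR IILL, translates directly into shifts and masks. This is a sketch only: `BusFields` and `split_bus_code` are illustrative names, and the function refuses anything that does not start with the 0000 1 bus prefix.

```python
from collections import namedtuple

BusFields = namedtuple("BusFields", "pp t rrrr ii ll")


def split_bus_code(code):
    """Split a bus error code laid out as 0000 1PPT RRRR IILL."""
    if (code & 0xF800) != 0x0800:
        raise ValueError("not a bus error code")
    return BusFields(
        pp=(code >> 9) & 0b11,      # bits 10-9: participation processor
        t=(code >> 8) & 0b1,        # bit 8: timeout
        rrrr=(code >> 4) & 0b1111,  # bits 7-4: memory transaction type
        ii=(code >> 2) & 0b11,      # bits 3-2: memory or I/O
        ll=code & 0b11,             # bits 1-0: cache level
    )


# The MCA error code from the crash dump in this post.
fields = split_bus_code(0x0C0F)
print(fields)
```

For 0x0c0f this gives PP = 10b, T = 0, RRRR = 0000b, II = 11b and LL = 11b.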

PP is 10b, “Local Node Observed Error as Third Party (OBS)”.  OK, whatever.

The timeout bit is not set so I presume it wasn’t something timing out.

R is 0000b and that is a Generic error, which I assume to be “error not otherwise categorized”.

I is 11b, and that is also a generic error (“Something bad happened but I don’t know where?!”)

L (cache) is also 11b and also generic.

All this work by hand to get this error message:

Error         : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
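The debugger's string can be rebuilt from the five fields with a few lookup tables. In the sketch below, only the entries exercised by this crash (LG, OBS, ERR, *, NOTIMEOUT) are confirmed against the output above; the other table entries and the BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR template are my assumptions from the keywords in the manuals, and the function name is hypothetical.

```python
# Mnemonic tables. Entries not seen in this crash are assumptions.
LL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}
PP = {0: "SRC", 1: "RES", 2: "OBS", 3: "GEN"}
RRRR = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
        5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
II = {0: "M", 2: "IO", 3: "*"}
T = {0: "NOTIMEOUT", 1: "TIMEOUT"}


def bus_mnemonic(code):
    """Compose a debugger-style name for a 0000 1PPT RRRR IILL error code."""
    if (code & 0xF800) != 0x0800:
        raise ValueError("not a bus error code")
    ll = LL[code & 0b11]
    pp = PP[(code >> 9) & 0b11]
    rrrr = RRRR.get((code >> 4) & 0b1111, "?")  # "?" marks an unlisted value
    ii = II.get((code >> 2) & 0b11, "?")
    t = T[(code >> 8) & 0b1]
    return f"BUS{ll}_{pp}_{rrrr}_{ii}_{t}_ERR"


# The MCA error code from the crash dump in this post.
print(bus_mnemonic(0x0C0F))
```

Run on 0x0c0f, it prints BUSLG_OBS_ERR_*_NOTIMEOUT_ERR, matching the debugger.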

But you almost have to read the manual anyway, if only to skim the keywords: this message is composed from several keywords that describe specific types of errors and where and how they were found by the CPU that incurred the machine check.  (FYI, most of the relevant information came from the AMD BIOS and Kernel Developer’s guide, pages 120-130 of Chapter 3, “Memory System Configuration” and all of Chapter 5, “Machine Check Architecture”.)

For most people it’s just enough to know the board was bad, but I hated the way I closed out my last post (“oh, I don’t know what the MCI status is and I don’t care!”) and I wanted to know this stuff.   I still don’t know what “Bank 4” refers to, if it even refers to memory (I had shuffled my DIMMs around in the system hoping the error would follow a specific DIMM.  Didn’t happen.)

Besides, it’ll help some poor guy or gal searching through Bing.

I know I will be making Newegg happy again very soon.

P.S.  The Intel manual that describes machine checks for Intel CPUs is in two parts:

UPDATE:  The NT Debugging blog has posted on the WHEA bugcheck.  Geoff Chappell’s web site has an entry for 0x124 WHEA_UNCORRECTABLE_ERROR and a page describing the bug check function that WHEA invokes when it can’t fix an error.

