Diagnosing Hardware Bluescreens

Screenshot from NirSoft’s BlueScreenView

This morning I woke my computer before breakfast to check on a FedEx shipment (much-needed cooling fans for my apartment!), and when the machine woke up, this is what I got.

I restarted it and the BIOS told me, “could not read disk, press Ctrl-Alt-Del to restart”.  I power-cycled the machine and got Windows to boot.

I checked my hard drive with Crystal Disk Info, but found nothing out of line in the SMART data—in fact, my terabyte HD, nearly a year old, has never had an error or a remapped sector or anything odd.  Had my partition table been truly corrupted, that would usually cause another bluescreen when I tried to boot.

OK, WinDbg:

0: kd> !analyze -v
[banner omitted]
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa800435c038, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000b2000010, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000010c0f, Low order 32-bits of the MCi_STATUS value.

I’d already guessed when the error happened, but to be sure, here’s the stack:

Child-SP          RetAddr           Call Site
fffff800`00ba8ac8 fffff800`02e2b917 nt!KeBugCheckEx
fffff800`00ba8ad0 fffff800`02fe84d3 hal!HalBugCheckSystem+0x1e3
fffff800`00ba8b10 fffff800`02e2b5dc nt!WheaReportHwError+0x263
fffff800`00ba8b70 fffff800`02e2af2e hal!HalpMcaReportError+0x4c
fffff800`00ba8cc0 fffff800`02e1ee8f hal!HalpMceHandler+0x9e
fffff800`00ba8d00 fffff800`02ed0eac hal!HalHandleMcheck+0x47
fffff800`00ba8d30 fffff800`02ed0d13 nt!KxMcheckAbort+0x6c
fffff800`00ba8e70 fffff880`03dd11f2 nt!KiMcheckAbort+0x153
fffff800`00b9cc98 fffff800`02ee013a amdk8!C1Halt+0x2
fffff800`00b9cca0 fffff800`02edadcc nt!PoIdle+0x53a
fffff800`00b9cd80 00000000`00000000 nt!KiIdleLoop+0x2c

The machine woke up to Windows, started running, and went through its normal CPU idle procedure; on all modern machines, the CPU halts when it has no user or kernel code to run.  It’s also possible the exception actually happened during the transition to sleep when I put the machine to bed the night before, which would fit this event log entry:

The previous system shutdown at 11:27:54 PM on 6/28/2010 was unexpected.

OK, so it’s hardware.  What is the WHEA_ERROR_RECORD?

WHEA stands for Windows Hardware Error Architecture.  It is the hardware error reporting mechanism in Vista, 2008, Seven and 2008R2, and it replaces the machine check architecture (MCA) support in earlier versions of Windows.

Parameter #2 of the bugcheck points to the hardware error record:

0: kd> dd fffffa800435c038
fffffa80`0435c038  52455043 ffff0210 0003ffff 00000001
fffffa80`0435c048  00000002 000003a0 000c1114 140a061d
fffffa80`0435c058  00000000 00000000 00000000 00000000
fffffa80`0435c068  00000000 00000000 00000000 00000000
fffffa80`0435c078  cf07c4bd 4e18b789 731fc4b3 3171b52c
fffffa80`0435c088  e8f56ffe 4cc5919c ab6588ba bb1349e1
fffffa80`0435c098  0ced40e1 01cb1314 00000000 00000000
fffffa80`0435c0a8  00000000 00000000 00000000 00000000

Right.  That’s clear.  Fortunately there are debugging extension commands for WHEA in the latest debugger.  I’ll try them.

0: kd> !whea
Error Source Table @ fffff80003062b38
0 Error Sources

 

OK, not much info there, I’ll try one of the others.

0: kd> !errrec fffffa800435c038
===============================================================================
Common Platform Error Record @ fffffa800435c038
-------------------------------------------------------------------------------
Record Id     : 01cb13140ced40e1
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 6/29/2010 12:17:20
Flags         : 0x00000000

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800435c0b8
Section       @ fffffa800435c190
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : BUS error
Operation     : Generic
Flags         : 0x00
Level         : 3
CPU Version   : 0x0000000000060fb1
Processor ID  : 0x0000000000000000

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa800435c100
Section       @ fffffa800435c250
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000000
CPU Id        : b1 0f 06 00 00 08 02 00 - 01 20 00 00 ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa800435c250

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa800435c148
Section       @ fffffa800435c2d0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
  Status      : 0xb200001000010c0f
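
With the !errrec output in hand, the raw dd dump from earlier turns out to be readable after all.  Here is a minimal Python sketch that unpacks the first 32 bytes of the record.  The field order and sizes are my assumption, based on the WHEA_ERROR_RECORD_HEADER structure in the Windows driver headers, and the one-byte-per-field timestamp layout is likewise assumed; the variable names are mine.

```python
import struct

# The first eight dwords of the dd dump, exactly as the debugger printed them.
dwords = [0x52455043, 0xffff0210, 0x0003ffff, 0x00000001,
          0x00000002, 0x000003a0, 0x000c1114, 0x140a061d]

# dd prints little-endian dwords, so repacking them the same way restores the raw bytes.
raw = struct.pack("<8I", *dwords)

# Assumed packed header layout (no padding):
# Signature[4], Revision (u16), SignatureEnd (u32), SectionCount (u16),
# Severity (u32), ValidBits (u32), Length (u32), Timestamp[8]
(signature, revision, signature_end, section_count,
 severity, valid_bits, length, timestamp) = struct.unpack("<4sHIHIII8s", raw)

# Assumed timestamp layout: one plain binary byte each for
# seconds, minutes, hours, flags, day, month, year, century.
seconds, minutes, hours, _flags, day, month, year, century = timestamp

header = {
    "signature": signature.decode("ascii"),
    "revision": hex(revision),
    "section_count": section_count,
    "severity": severity,
    "length": length,
    "timestamp": f"{month}/{day}/{century * 100 + year} {hours}:{minutes:02}:{seconds:02}",
}

for name, value in header.items():
    print(f"{name:14}: {value}")
```

Under those assumptions the signature reads CPER, and the section count (3), severity (1), length (928) and timestamp (6/29/2010 12:17:20) agree with what !errrec reported, which is some reassurance that the layout guess is right.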

 

We’re getting somewhere.  The Processor Generic section categorizes this as a bus error.  Section 2 gets a bit more detailed:

Error         : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
  Status      : 0xb200001000010c0f

I have no information at hand on how to decode the status word.  Searching for BUSLG turns up a few hits related to memory errors in FreeBSD.  The “Bank 4” wording suggested memory hardware to me, since this machine has four 1 GB sticks for a 4 GB system, though a machine-check bank is a set of error-reporting registers inside the processor, so the number does not necessarily point at a DIMM.

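For what it’s worth, the status word can be partly decoded from the machine-check chapters of the Intel and AMD processor manuals.  The sketch below is a rough Python decode, not a definitive one: the flag positions (bits 63 to 57) and the bus error code pattern 0000 1PPT RRRR IILL are the architectural layout as I understand it, the table names are my own, and mapping II = 3 to “*” is inferred from the debugger output rather than taken from any documentation.  The model-specific bits (31:16 and part of the upper half) are left alone.

```python
# Bugcheck Arg3 and Arg4 are the high and low halves of the status word.
status = (0xb2000010 << 32) | 0x00010c0f

def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1

# Architectural flag bits at the top of MCi_STATUS (assumed positions).
flags = {
    "VAL": bit(status, 63),    # record is valid
    "OVER": bit(status, 62),   # an earlier error was overwritten
    "UC": bit(status, 61),     # error was not corrected
    "EN": bit(status, 60),     # reporting was enabled
    "MISCV": bit(status, 59),  # MISC register is valid
    "ADDRV": bit(status, 58),  # ADDR register is valid
    "PCC": bit(status, 57),    # processor context corrupt
}

# Lookup tables for the bus/interconnect error code fields (names are illustrative).
LEVEL = ["L0", "L1", "L2", "LG"]
PARTICIPATION = ["SRC", "RES", "OBS", "GEN"]
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
MEM_IO = ["M", "reserved", "IO", "*"]  # "*" for 3 is inferred from the !errrec output

def decode_bus_error(code):
    """Decode a 16-bit error code of the form 0000 1PPT RRRR IILL, else return None."""
    if code & 0xf800 != 0x0800:
        return None
    pp = PARTICIPATION[(code >> 9) & 0x3]
    t = "TIMEOUT" if bit(code, 8) else "NOTIMEOUT"
    rrrr = REQUEST.get((code >> 4) & 0xf, "UNKNOWN")
    ii = MEM_IO[(code >> 2) & 0x3]
    ll = LEVEL[code & 0x3]
    return f"BUS{ll}_{pp}_{rrrr}_{ii}_{t}_ERR"

error_code = status & 0xffff
print(hex(status))
print([name for name, is_set in flags.items() if is_set])
print(decode_bus_error(error_code))
```

Run against this crash, the two bugcheck arguments combine into the same 0xb200001000010c0f that !errrec shows, the set flags come out as VAL, UC, EN and PCC (a valid, uncorrected error with processor context corrupt, which fits a fatal bugcheck), and the low 16 bits decode to the same BUSLG_OBS_ERR_*_NOTIMEOUT_ERR string the debugger printed.
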
MSDN has a description of WHEA error events, though none of them describe my error at all.

One likely scenario is that the system was put to sleep and entered sleep normally, but there was a power glitch during sleep, or on wakeup, that affected the standby power that keeps the RAM alive.  If you put the system to sleep and turn off the power, Windows will complain about it in the event logs.

I’ve seen a lot of quirks with this particular system but this is a new one.   I’ve often had BIOS messages that tell me,

A HyperTransport sync flood occurred on last boot
Hit F1 to Resume

Sure, I see that message and think, OMG the sync flooded, get a mop!  It’s not a very actionable error.

Unless this happens again, I’m not going to do anything about this.  The motherboard is elderly, 4 years old, and I plan to get a new board when it hits its fifth birthday this time next year.   If it happens every day for the next month though….