This Hot Server

Just weeks after having my workstation die, I had another problem:  Heat.  My building gets hot.  It’s well insulated and there were new windows installed a few years ago.  We have also had many 90 °F days with 70 °F dewpoints.  Not fun.  During renovations to my part of the building last year, a neighbor upstairs from me told me of all the window AC’s he cooked from the sun.

An alert from my system management program, at midnight local time, tells the whole story:

2010/07/19 04:00:11(UTC)    Email(First)        CPU Temp: Reading(86.000C), Status(>= upper critical threshold)        PINKY(192.168.10.5)

86° C.

Yikes!

This server has a troubled history:  It was SATV’s first server, and as I related before, it was cheaply built and defective from the start.  I got it in 2002 and it has been totally gutted.  Nothing is left from its past.  It’s currently running a Tyan S3950 AMD-based server board with a dual core Opteron 1224SE, 8G of RAM, ServerWorks chipset, Intel NICs, and a HighPoint 1742 SATA RAID controller with two 1TB Western Digital drives.

Except for the time the CPU heatsink frame broke, it has been trouble free, though there are BIOS bugs that affect virtualization (I’ve never been able to use that on this board) and thermal management (no CPU throttling).  The chipset drivers were orphaned since Broadcom sold off its ServerWorks line right after I bought the board, so 64-bit support has not been what it could be.

It’s that bit about thermal management that I’ve been having problems with.  I have not actually had my server lock up from the heat but it is a constant worry.  Besides, I am conditioned, from being long in IT, to obsess over every watt of power that goes in and every watt of heat going out.  I don’t really need another space heater under my workbench.

I want to get a new server board and would love a new Chenbro case but unexpected expenses…well, I have to live with this a bit longer.

All I could reasonably do with little money was to put more fans in the case.  My server’s case is large enough.  Unfortunately, it was designed in the mid-90’s when CPU’s didn’t get as hot as they do now.  The only cooling in the front of the machine is a small fan and an impossibly small air vent.

As in many machines, the front card slot holder is used to mount the fan (and also the PC speaker, which has been taken out–the Tyan has an onboard sounder.)  There are virtually no full-length cards in use, at least I have never seen any.

It’s possible I could put a bigger fan in and remove the slot holder but I have no larger fans in stock so I’m deferring this.  But the air intake is a bigger problem.

That’s it.  That’s the air intake.

I was also having trouble with dust bunnies.  They would set up nice dust bunny condos inside my server if I let them.  It didn’t help that my house vacuum cleaner was failing for months before I noticed and got a higher-powered Hoover.  I was on MCM Electronics, trying to fill up an order to get the free shipping when I found a filter frame on clearance.

I had some more fans lying around.

They’re three-wire fans.  Unfortunately, the third wire is not compatible with most motherboards:  it is used for the thermistor sensor rather than the tach as in most fans, so they will just be wired into the power supply without any management.

After too much work I cut this vent hole in the front bezel and mounted the filter.

And mounted two fans in back for exhaust.

I would really like to replace the front fan with one that has more airflow.  I do expect the dust bunnies will have to find another place to set up shop.

Temperatures on the system board seem to be around 55-62C since I did this project, acceptable for a machine with two hot Western Digital RAID drives in it, but there haven’t been any really hot days since the work was done.

If I get through the rest of the summer with this, it is a win.  Just so long as I can run the SBS 7 beta with it, I will be satisfied.


Bad Hardware Day: More on Hardware Bluescreens

I was hoping not to follow up to my last post.

Sure I wasn’t.

I have had bluescreens and other odd behavior for the two days since I last posted.

That’s my MSI K9N Neo F AMD/nVidia based motherboard.  Four years ago to the month, it replaced another MSI board that died prematurely due to bad capacitors.

Guess what we see in the image above?  Note the swollen tops of two capacitors just above the PCIE connector.

Note this too:

The CMOS battery—which is not the exact one in the photo, I changed it out—was reported bad.

Its voltage was completely flat when I put it in my battery tester.  It also was corroded.  You may be able to see some ugly brown residue from who knows what on the battery holder, just above the capacitor.  Whatever it is, it has gotten to the board, as seen at the lower left side of the battery holder.

When I discovered all this, I was trying to decode the Machine Check Status code I posted last time.  I wasn’t really happy with my non-answer and wanted to find the definite source.

Since I have an AMD processor, I found the AMD manuals.   I’ll give a link to the Intel equivalents, but I mention AMD because I had a hard time tracking down their reference material, whereas Intel is mentioned everywhere in searches.

These are the AMD manuals I refer to:

I’ll use a crash dump I got today (one of 4!!!)  Unlike the last time I saw a defective motherboard bluescreen, these bluescreens are remarkably consistent, all with a bug check code of 0x124 (WHEA_UNCORRECTABLE_ERROR) with nearly the same status codes from what I’ve been able to tell.   It’s a testament to the much-improved error handling in Windows Vista and Seven.  I’m going to skip most of the debugger output, since that’s in my last post, and go to the specific processor machine check:

!errrec fffffa800528a038
[…]
===============================================================================
Section 2 : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor @ fffffa800528a148
Section @ fffffa800528a2d0
Offset : 664
Length : 264
Flags : 0x00000000
Severity : Fatal

Error : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
Status : 0xb200001000010c0f

The status word, from what I can see, has many bitfields that describe the location and cause of the error.  I’ll try breaking it down.  In binary:
.formats 0xb200001000010c0f
Evaluate expression:
Hex: b2000010`00010c0f
Decimal: -5620492266238833649
Octal: 1310000001000000206017
Binary: 10110010 00000000 00000000 00010000 00000000 00000001 00001100 00001111

Mangled copy from the AMD manual:

Bits    Mnemonic  Description
63      VAL       Valid
62      OVER      Status Register Overflow
61      UC        Uncorrected Error
60      EN        Error Condition Enabled
59      MISCV     Miscellaneous-Error Register Valid
58      ADDRV     Error-Address Register Valid
57      PCC       Processor-Context Corrupt
56–32             Other Information
31–16             Model-Specific Error Code
15–0              MCA Error Code

In our status code, bit 63 is set so it is valid.  Bit 62 is unset so there’s no overflow.  Bit 61 indicates an uncorrected error and bit 60 indicates an error condition enabled.  Bit 57 is set and indicates processor context corrupt.

OK, it’s a fatal error and couldn’t continue.
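For anyone who would rather not count bits by hand, the flag check can be sketched in a few lines of Python.  This is only an illustration:  the bit positions come from the table above, and `FLAGS` and `decode_flags` are made-up names, not anything from the AMD manual or the debugger.

```python
# Flag bits of an MCi_STATUS value, taken from the table in the post.
STATUS = 0xB200001000010C0F  # the Status value reported by !errrec

FLAGS = {
    63: "VAL",
    62: "OVER",
    61: "UC",
    60: "EN",
    59: "MISCV",
    58: "ADDRV",
    57: "PCC",
}


def decode_flags(status):
    """Return the mnemonics of the flag bits that are set, highest bit first."""
    return [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


print(decode_flags(STATUS))
```

For this status value it prints `['VAL', 'UC', 'EN', 'PCC']`, which matches the reading above.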

Other information in bits 56-32 is mostly clear (only bit 36 is set, the 00010000 byte in the binary above).  The documentation says that the field is used for, amongst other things, ECC information, which I just do not have in that (or any) client PC.  (SATV may have a server with ECC memory.  Someday.)  Bits 31-16 (the third word) are for the model-specific error code.

The model-specific error code is 00000001b.  I have an Athlon X2 4800+ Brisbane CPU that is at least three years old;  the AMD documentation says I should look up the error code in a manual specific to that CPU but I couldn’t find one on their website.  I would expect to see that field used on Opterons, their server CPU.
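The three remaining fields can be pulled out with shifts and masks.  Here is a minimal sketch, assuming the field boundaries from the table earlier in the post;  `split_fields` is an illustrative name only.

```python
STATUS = 0xB200001000010C0F  # the Status value reported by !errrec


def split_fields(status):
    """Split the low 57 bits of an MCi_STATUS value into its three fields."""
    other = (status >> 32) & 0x1FFFFFF  # bits 56-32, 25 bits wide
    model = (status >> 16) & 0xFFFF     # bits 31-16, model-specific error code
    mca = status & 0xFFFF               # bits 15-0, MCA error code
    return other, model, mca


other, model, mca = split_fields(STATUS)
print(f"other=0x{other:07x} model=0x{model:04x} mca=0x{mca:04x}")
```

For this status value the sketch gives 0x10 for the other-information field (bit 36 of the full register), 0x0001 for the model-specific code and 0x0c0f for the MCA error code.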

Moving on to the MCA error code, the last word, bits 15-0, is:

Binary:  [omitting the first three words] 00001100 00001111
The bits in the most significant byte (MSB) determine the category of the error found, which can be from the bus (HyperTransport), the GART (graphics), or the cache.

The MSB in binary, 00001100, indicates a bus error, so I’ll use this field layout to decode it:  0000 1PPT RRRR IILL, where PP is Participation Processor, T is Timeout, RRRR is Memory Transaction Type, II means Memory or I/O and LL is Cache Level.

PP is 10b, “Local Node Observed Error as Third Party (OBS)”  OK, whatever.

The timeout bit is not set so I presume it wasn’t something timing out.

RRRR is 0000b and that is a Generic error, which I assume to be “error not otherwise categorized”.

I is 11b, and that is also a generic error (“Something bad happened but I don’t know where?!”)

L (cache) is also 11b and also generic.
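The same field-by-field decode can be sketched in Python.  Treat it as a rough illustration, not a replacement for the debugger:  the bit layout is the 0000 1PPT RRRR IILL pattern above, the function names are made up, and the mnemonic tables are deliberately partial.  They hold only the values that show up in this one error message, and anything else falls back to raw bits instead of a guessed name.

```python
# Partial mnemonic tables: only the values seen in this post's error message.
# These are assumptions read off the debugger string, not a full manual table.
PP_NAMES = {0b10: "OBS"}
T_NAMES = {0b0: "NOTIMEOUT"}
RRRR_NAMES = {0b0000: "ERR"}
II_NAMES = {0b11: "*"}
LL_NAMES = {0b11: "LG"}


def decode_bus_error(code):
    """Split a 16-bit MCA bus error code laid out as 0000 1PPT RRRR IILL."""
    if code >> 11 != 0b00001:
        raise ValueError(f"0x{code:04x} is not a bus error code")
    return {
        "PP": (code >> 9) & 0b11,
        "T": (code >> 8) & 0b1,
        "RRRR": (code >> 4) & 0b1111,
        "II": (code >> 2) & 0b11,
        "LL": code & 0b11,
    }


def name(table, value, width):
    """Look up a mnemonic, falling back to the raw bits if it is not listed."""
    return table.get(value, format(value, f"0{width}b"))


def bus_error_message(code):
    """Compose a message in the same order the debugger string uses."""
    f = decode_bus_error(code)
    return "BUS{}_{}_{}_{}_{}_ERR".format(
        name(LL_NAMES, f["LL"], 2),
        name(PP_NAMES, f["PP"], 2),
        name(RRRR_NAMES, f["RRRR"], 4),
        name(II_NAMES, f["II"], 2),
        name(T_NAMES, f["T"], 1),
    )


print(bus_error_message(0x0C0F))
```

Fed the MCA error code 0x0c0f, it prints `BUSLG_OBS_ERR_*_NOTIMEOUT_ERR`.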

All this work by hand to get this error message:

Error         : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)

But you almost have to read the manual anyway just to skim the keywords as this message is composited from several keywords that describe specific types of errors and where and how they were found by the CPU that incurred the machine check.  (FYI, most of the relevant information was gotten from the AMD BIOS and Kernel Developer’s guide, pages 120-130 of Chapter 3, “Memory System Configuration” and all of Chapter 5, “Machine Check Architecture”.)

For most people it’s just enough to know the board was bad, but I hated the way I closed out my last post (“oh, I don’t know what the MCI status is and I don’t care!”) and I wanted to know this stuff.   I still don’t know what “Bank 4” refers to, if it even refers to memory (I had shuffled my DIMMs around in the system hoping the error would follow a specific DIMM.  Didn’t happen.)

Besides, it’ll help some poor guy or gal searching through Bing.

I know I will be making Newegg happy again very soon.

P.S.  The Intel manual that describes machine checks for Intel CPUs is in two parts:

UPDATE:  The NT Debugging blog has posted on the WHEA bugcheck.  Geoff Chappell’s web site has an entry for 0x124 WHEA_UNCORRECTABLE_ERROR and a page describing the bug check function that WHEA invokes when it can’t fix an error.