When a blue screen is really hardware

The SBS Diva has a bluescreen problem.  Several commenters on her blog suggested that it might be hardware-related.  Let me join the chorus.
 
I’ve been fighting bluescreens on my 5-year-old MSI Athlon XP-based motherboard for almost two months.  The problem was hardware.  Here’s how I found it.
 
To reiterate, the diagnostic criteria for hardware-related problems are:
 
1) Symptoms are intermittent.
2) Symptoms appear after the machine has been powered on for a while.
3) Bluescreen diagnostics point to many different causes.
4) Memory tests may or may not indicate a problem.
 
In my case, the bluescreens happened seemingly at random.  Each and every time the machine crashed, I analyzed the crash with the debugging tools, as I described in an earlier post.  I found a wide variety of errors, many of them in win32k.sys.
 
I removed as many third-party drivers as I could (I own a Kensington trackball), and reverted to the "safe" settings in the BIOS.  I checked the ventilation on my system, cleaned it out, replaced one of the fans and the power supply, and even suspected the keyboard.  After all this work, it still bluescreened.
 
I put in a new stick of memory from Crucial and performed a memory test with Microsoft’s tester, running through two passes.  No change.  Bluescreens still.
 
One cause of defective hardware is not always recognized outside of service centers or the hobbyist PC community.  Look at these pictures of my motherboard.  The round objects are capacitors.  They help to regulate voltages in electronic equipment, including my motherboard.  They’re often in a hot, dusty environment, and they sometimes fail.  (This problem is all too common.  Google "bad caps" for a sampler.)
 
Normal capacitors have flat tops.  (The triangular score lines on the top are normal and do not indicate any defect.)
 
Bad capacitors generate heat and pressure inside, which cause the top of the cap to bulge.  Eventually, the cap may "pop" (the reason for the score lines on the top) or simply short out.  In either case, the device it’s wired into–hello, motherboard!–may fail.
 
My 5-year-old MSI board was all but certainly brought down by the two caps in the photo, which just happen to be near the memory sockets (obscured by wiring).  It’s little wonder I was having problems.
 
If you have a flashlight and a magnifier, and a few minutes, you can check your motherboard for bad caps yourself.  Just turn off your computer and unplug it, remove the cover (following directions in your hardware manual) and have a look.
 
Caps can be anywhere on your motherboard, but they are most commonly located around the DIMM (memory) sockets and around the CPU, where you’ll find a large cluster of them.
 
Normal caps will have flat tops–again, the triangular score lines at the top are normal.  Bad caps will have bulging tops, or they’ll "leak";  there may be junk leaking out of the cap and running over the motherboard.  Badcaps.net has more photos.
 
If you find a bad cap, there’s not much you can do other than replace the motherboard or call for service.  But at least you won’t be running in circles with crashdumps.
 
Take care,
 
Dave
 
 
P.S.  Happy ending for me:  With a new motherboard and processor (Athlon 64), I was back in business one day later.
 
 
 
 
 
 

The Heartbeat of Windows

Windows has a heartbeat!  Yes, it’s true!  How can this be?  Windows is not exciting for most admins when things are running normally (and the excitement over a new BSOD is much, much overrated).
 
Windows servers (and desktops) have a registry value:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Reliability\TimeStampInterval.
This is a DWORD and the default is 5 for servers and 0 for workstations (XP, Win2K Pro, etc.)
 
When set, Windows writes a timestamp to the registry every 5 seconds (or, presumably, at whatever interval the DWORD is set to).  A value of 0 disables it.
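
Here is a minimal sketch, in Python, of reading that value and interpreting it the way I described above.  The winreg module is part of the standard library but exists only on Windows, so the read is guarded.  The helper names (read_interval, describe_interval) are made up for this example, and the "nonzero means an interval in seconds" reading is my presumption from above, not something taken from documentation.

```python
KEY_PATH = r"SOFTWARE\Microsoft\Windows\CurrentVersion\Reliability"
VALUE_NAME = "TimeStampInterval"


def describe_interval(value):
    """Interpret the DWORD as the post describes it (hypothetical helper)."""
    if value is None:
        return "not set"
    if value == 0:
        return "heartbeat disabled"
    # Assumption: a nonzero value is the interval between timestamps, in seconds.
    return f"heartbeat presumably written every {value} seconds"


def read_interval():
    """Return the DWORD, or None when not on Windows or the value is missing."""
    try:
        import winreg  # Windows-only standard library module
    except ImportError:
        return None
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except OSError:
        # Key or value does not exist, or access was denied.
        return None


if __name__ == "__main__":
    print(describe_interval(read_interval()))
```

On a machine that is not running Windows, this simply reports "not set".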
 
With this entry set, if the computer bluescreens or you pull the plug, Windows uses the timestamp to determine that the last shutdown was unexpected, and its approximate time.  See the following event log entry from my very sick workstation after one of its many bluescreens:
 
Event Type: Error
Event Source: EventLog
Event Category: None
Event ID: 6008
Date:  6/3/2006
Time:  1:06:47 PM
User:  N/A
Computer: WAKKO
Description:
The previous system shutdown at 1:02:58 PM on 6/3/2006 was unexpected.
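
If you want to pull the crash time out of an entry like this one, a short Python sketch follows.  It assumes the U.S. date and time format shown in the sample and an English description string.  The function name last_heartbeat is made up for the example.

```python
import re
from datetime import datetime

# Values copied from the sample event above.
EVENT_DESCRIPTION = "The previous system shutdown at 1:02:58 PM on 6/3/2006 was unexpected."
EVENT_LOGGED = "6/3/2006 1:06:47 PM"

STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"
PATTERN = re.compile(
    r"shutdown at (\d{1,2}:\d{2}:\d{2} [AP]M) on (\d{1,2}/\d{1,2}/\d{4})"
)


def last_heartbeat(description):
    """Return the time quoted in a 6008 description, or None if it does not match."""
    match = PATTERN.search(description)
    if match is None:
        return None
    time_part, date_part = match.groups()
    return datetime.strptime(f"{date_part} {time_part}", STAMP_FORMAT)


crashed_at = last_heartbeat(EVENT_DESCRIPTION)
logged_at = datetime.strptime(EVENT_LOGGED, STAMP_FORMAT)

# Time between the last recorded heartbeat and the event being logged at the next boot.
gap = logged_at - crashed_at
print(crashed_at, gap)
```

For the entry above, the gap between the last heartbeat and the logged event works out to 3 minutes 49 seconds.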
 
The UPTIME.EXE utility I posted about previously will set the heartbeat.  Simply do uptime /heartbeat and follow the prompts to turn it on or off.
 
Take care,
 
Dave
 

Little-known Windows utility: UPTIME.EXE

When administering a server or a collection of servers, you need a way not only to see your systems in the present (is the server up or down or bluescreened; is the web service running or stopped or not even reachable), but also in the recent past (when did it last boot?  How long has it been up?)
 
Since NT, there’s been a utility to check system uptime, UPTIME.EXE.  It produces output like this:
uptime /s
 
Uptime Report for: \\DOT
Current OS: Microsoft Windows Server 2003, Service Pack 1, Uniprocessor Free.
Time Zone: Eastern Daylight Time
System Events as of 6/3/2006 9:44:34 PM:
Date:      Time:        Event:               Comment:
---------- -----------  -------------------  -----------------------------------
  5/5/2006  8:12:11 AM  Boot                
  5/5/2006 12:33:56 PM  Shutdown             Prior uptime:0d 4h:21m:45s
  5/5/2006 12:35:13 PM  Boot                 Prior downtime:0d 0h:1m:17s
[…]
  6/2/2006  8:04:36 AM  Shutdown             Prior uptime:0d 0h:14m:32s
  6/2/2006  8:08:12 AM  Boot                 Prior downtime:0d 0h:3m:36s
Current System Uptime: 1 day(s), 13 hour(s), 36 minute(s), 55 second(s)
--------------------------------------------------------------------------------
Since 5/5/2006:
           System Availability: 99.5831%
                  Total Uptime: 29d 10h:34m:54s
                Total Downtime: 0d 2h:57m:29s
                 Total Reboots: 20
     Mean Time Between Reboots: 1.48 days
             Total Bluescreens: 0

 

The word wrap may make this hard to read, but essentially UPTIME.EXE lists all the boot and shutdown events and calculates uptime (and downtime) from them.
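
The arithmetic is simple enough to sketch in Python.  This is not how UPTIME.EXE is implemented (I don’t know its internals); it only pairs up Boot and Shutdown events the way the report appears to, using the first three events and the totals from the output above.  The names tally and availability are made up for the example.

```python
from datetime import datetime, timedelta

STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# First three events from the report above.
EVENTS = [
    ("5/5/2006 8:12:11 AM", "Boot"),
    ("5/5/2006 12:33:56 PM", "Shutdown"),
    ("5/5/2006 12:35:13 PM", "Boot"),
]


def tally(events):
    """Sum Boot->Shutdown gaps as uptime and Shutdown->Boot gaps as downtime."""
    up = timedelta()
    down = timedelta()
    previous = None
    for stamp, kind in events:
        when = datetime.strptime(stamp, STAMP_FORMAT)
        if previous is not None:
            prev_when, prev_kind = previous
            if prev_kind == "Boot" and kind == "Shutdown":
                up += when - prev_when
            elif prev_kind == "Shutdown" and kind == "Boot":
                down += when - prev_when
        previous = (when, kind)
    return up, down


def availability(up, down):
    """Uptime as a percentage of uptime plus downtime."""
    total = up + down
    return 100.0 * (up / total) if total else 0.0


up, down = tally(EVENTS)
print(up, down)

# Totals copied from the summary at the bottom of the report.
total_up = timedelta(days=29, hours=10, minutes=34, seconds=54)
total_down = timedelta(hours=2, minutes=57, seconds=29)
print(round(availability(total_up, total_down), 4))
```

The first pair of events gives the 4h:21m:45s of uptime and 1m:17s of downtime shown in the Comment column, and the totals reproduce the 99.5831% availability figure.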

 
It uses the Event Log to determine uptime, as explained in this KB article "Why Windows NT reports 6005, 6006, 6008 and 6009 Event Log Entries".
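
As a quick reference, here is a small Python lookup for those four event IDs.  The descriptions are my own paraphrases of what each event records, not text from the KB article, and the classify helper is made up for the example.

```python
# Event Log IDs that mark boots and shutdowns (paraphrased meanings).
EVENT_MEANINGS = {
    6005: "Event Log service started (logged at boot)",
    6006: "Event Log service stopped (logged at a clean shutdown)",
    6008: "previous system shutdown was unexpected",
    6009: "operating system version recorded at boot",
}


def classify(event_id):
    """Return a short description for a boot/shutdown event ID."""
    return EVENT_MEANINGS.get(event_id, "not a boot/shutdown marker")


for event_id in (6005, 6006, 6008, 6009):
    print(event_id, classify(event_id))
```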
 
This practice has continued through Windows Server 2003, so UPTIME still works. 
 
UPTIME.EXE is available from Microsoft.
 
Take care,
 
Dave
 
 
 

WSUS SP1 available

If you run Windows Server Update Services (WSUS), a new service pack (SP1) has been released.  More information and the download are available from Microsoft.
 
I had difficulty installing it on my home SBS box;  apparently the install died after installing a new instance of WMSDE: the new instance wouldn’t start, and the remainder of the process failed.
 
I rebooted, retried the installation, and it worked.  The odd part was that the WSUS install asked to reboot again, which I did.  WSUS worked and is working fine still.  It was then installed on an SBS server at work with no incidents.
 
Lesson 1:  Be aware of what maintenance you performed on a system before you install an update;  I had stopped the WSUS administration site (mistakenly) in order to use wsusutil to clean up old, superseded updates in the update database.
 
That’s probably what caused the install to fail the first time.
 
Lesson 2:  There are extensive debugging logs in Program Files\Update Services\LogFiles.  Read them and send them to Microsoft if necessary.  I found out much about how the update is installed and the steps it goes through, and importantly, where it fails.
 
Take care,
 
Dave