Windows 8 Bluescreens

Windows 8 Blue Screen of Death

There are always silly articles when a new version of Windows is leaked or becomes available for preview.  Windows was once supposed to have a chartreuse screen of death when Vista was released.  This is the “new” BSOD, at least so far.

It’s perhaps too cute to make the release, but it’s functional, considering that in most instances I am troubleshooting from the crash dump or the event log so it’s not as important that I have the specific bugcheck code on the screen.

Unfortunately, some vendors’ drivers will make this screen harder to diagnose from.  Intel provides the storage drivers for virtually all of their desktop boards, including our Dells at SATV and my laptop.  When that driver crashes the machine, it does so with a bugcheck code of 0x8086, a “vendor defined” code that is nowhere to be found in a search.  It means you need to bug Intel for a driver fix.  I’m not sure if that code will present itself in this blue screen.

I have a suggestion for a new blue screen design:

Clippy's Blue Screen of Death


Bad Hardware Day: More on Hardware Bluescreens

I was hoping not to follow up to my last post.

Sure I wasn’t.

I have had bluescreens and other odd behavior for the two days since I last posted.

That’s my MSI K9N Neo F AMD/nVidia based motherboard.  Four years ago to the month, it replaced another MSI board that died prematurely due to bad capacitors.

Guess what we see in the image above?  Note the swollen tops of two capacitors just above the PCIE connector.

Note this too:

The CMOS battery—which is not the exact one in the photo, I changed it out—was reported bad.

Its voltage was completely flat when I put it in my battery tester.  It also was corroded.  You may be able to see some ugly brown residue from who knows what on the battery holder, just above the capacitor.  Whatever it is, it has gotten to the board, as seen at the lower left side of the battery holder.

When I discovered all this, I was trying to decode the Machine Check Status code I posted last time.  I wasn’t really happy with my non-answer and wanted to find the definite source.

Since I have an AMD processor, I found the AMD manuals.   I’ll give a link to the Intel equivalents, but I mention AMD because I had a hard time tracking down their reference material, whereas Intel is mentioned everywhere in searches.

These are the AMD manuals I refer to:

I’ll use a crash dump I got today (one of 4!!!).  Unlike the last time I saw a defective motherboard bluescreen, these bluescreens are remarkably consistent, all with a bug check code of 0x124 (WHEA_UNCORRECTABLE_ERROR) and, from what I’ve been able to tell, nearly the same status codes.  It’s a testament to the much-improved error handling in Windows Vista and Seven.  I’m going to skip most of the debugger output, since that’s in my last post, and go to the specific processor machine check:

!errrec fffffa800528a038
[…]
===============================================================================
Section 2 : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor @ fffffa800528a148
Section @ fffffa800528a2d0
Offset : 664
Length : 264
Flags : 0x00000000
Severity : Fatal

Error : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
Status : 0xb200001000010c0f
The status word, from what I can see, has many bitfields that describe the location and cause of the error.  I’ll try breaking it down.  In binary:
.formats 0xb200001000010c0f
Evaluate expression:
Hex: b2000010`00010c0f
Decimal: -5620492266238833649
Octal: 1310000001000000206017
Binary: 10110010 00000000 00000000 00010000 00000000 00000001 00001100 00001111
Mangled copy from the AMD manual:

Bits     Mnemonic   Description
63       VAL        Valid
62       OVER       Status Register Overflow
61       UC         Uncorrected Error
60       EN         Error Condition Enabled
59       MISCV      Miscellaneous-Error Register Valid
58       ADDRV      Error-Address Register Valid
57       PCC        Processor-Context Corrupt
56–32               Other Information
31–16               Model-Specific Error Code
15–0                MCA Error Code

In our status code, bit 63 is set so it is valid.  Bit 62 is unset so there’s no overflow.  Bit 61 indicates an uncorrected error and bit 60 indicates an error condition enabled.  Bit 57 is set and indicates processor context corrupt.

OK, it’s a fatal error and couldn’t continue.
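
The bit-by-bit reading above can be sketched in a few lines of Python.  This is only an illustration of the table I copied from the manual; the function and variable names are mine, not anything from AMD or the debugger.

```python
# Flag bits 63..57 of the status word, as listed in the table above.
STATUS_FLAGS = [
    (63, "VAL"),
    (62, "OVER"),
    (61, "UC"),
    (60, "EN"),
    (59, "MISCV"),
    (58, "ADDRV"),
    (57, "PCC"),
]


def decode_status_flags(status):
    """Return the mnemonics of the flag bits that are set in a 64-bit status."""
    return [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]


def split_status(status):
    """Split the remaining 57 bits into the three fields from the table."""
    return {
        "other_info": (status >> 32) & 0x1FFFFFF,   # bits 56-32 (25 bits)
        "model_specific": (status >> 16) & 0xFFFF,  # bits 31-16
        "mca_error_code": status & 0xFFFF,          # bits 15-0
    }


status = 0xb200001000010c0f
print(decode_status_flags(status))
print({name: hex(value) for name, value in split_status(status).items()})
```

For this status word the flag list comes out as VAL, UC, EN and PCC, which matches the reading by hand.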

Other information in bits 56-32 is mostly clear (only bit 36 is set in this status word), and the documentation says that the field is used for, amongst other things, ECC information, which I just do not have in that (or any) client PC.  (SATV may have a server with ECC memory.  Someday.)  Bits 31-16 (the third word) are for the model-specific error code.

The model-specific error code is 00000001b.  I have an Athlon X2 4800+ Brisbane CPU that is at least three years old;  the AMD documentation says I should look up the error code in a manual specific to that CPU but I couldn’t find one on their website.  I would expect to see that field used on Opterons, their server CPU.

Moving on to the MCA error code, the last word, bits 15-0, is:

Binary:  [omitting the first three words] 00001100 00001111
The bits in the MSB determine the category of the error found, which can be from the bus (HyperTransport), the GART (graphics), or the cache.
 

The MSB in binary, 00001100, indicates a bus error, so I’ll use this field to decode it:  0000 1PPT RRRR IILL, where PP is Participation Processor, T is Timeout, R is Memory Transaction Type, I means Memory or I/O and L is Cache Level.

PP is 10b, “Local Node Observed Error as Third Party (OBS)”  OK, whatever.

The timeout bit is not set so I presume it wasn’t something timing out.

R is 0000b and that is a Generic error, which I assume to be “error not otherwise categorized”.

I is 11b, and that is also a generic error (“Something bad happened but I don’t know where?!”)

L (cache) is also 11b and also generic.
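
The same field-splitting can be done mechanically.  A minimal sketch, assuming only the 0000 1PPT RRRR IILL layout quoted above; the function name is hypothetical and it returns the raw field values rather than the manual’s mnemonics, since I only looked up the ones that applied to my crash.

```python
def decode_bus_error(code):
    """Split a 16-bit MCA error code of the form 0000 1PPT RRRR IILL."""
    if (code >> 11) != 0b00001:
        raise ValueError("bits 15-11 are not 00001, so this is not a bus error code")
    return {
        "PP": (code >> 9) & 0b11,      # participation processor
        "T": (code >> 8) & 0b1,        # timeout
        "RRRR": (code >> 4) & 0b1111,  # memory transaction type
        "II": (code >> 2) & 0b11,      # memory or I/O
        "LL": code & 0b11,             # cache level
    }


fields = decode_bus_error(0x0c0f)
for name, value in fields.items():
    # Print each field in binary, padded to the width of its name in the layout.
    print(name, format(value, "0%db" % len(name)))
```

With 0x0c0f this gives PP=10, T=0, RRRR=0000, II=11 and LL=11, the same values I worked out by hand.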

All this work by hand to get this error message:

Error         : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)

But you almost have to read the manual anyway just to skim the keywords as this message is composited from several keywords that describe specific types of errors and where and how they were found by the CPU that incurred the machine check.  (FYI, most of the relevant information was gotten from the AMD BIOS and Kernel Developer’s guide, pages 120-130 of Chapter 3, “Memory System Configuration” and all of Chapter 5, “Machine Check Architecture”.)

For most people it’s just enough to know the board was bad, but I hated the way I closed out my last post (“oh, I don’t know what the MCI status is and I don’t care!”) and I wanted to know this stuff.   I still don’t know what “Bank 4” refers to, if it even refers to memory (I had shuffled my DIMMs around in the system hoping the error would follow a specific DIMM.  Didn’t happen.)

Besides, it’ll help some poor guy or gal searching through Bing.

I know I will be making Newegg happy again very soon.

P.S.  The Intel manual that describes machine checks for Intel CPUs is in two parts:

UPDATE:  The NT Debugging blog has posted on the WHEA bugcheck.  Geoff Chappell’s web site has an entry for 0x124 WHEA_UNCORRECTABLE_ERROR and a page describing the bug check function that WHEA invokes when it can’t fix an error.


Diagnosing Hardware Bluescreens

Screenshot from NirSoft’s BlueScreenView

This morning I was waking my computer before breakfast to check on a FedEx shipment (much needed cooling fans for my apartment!) and when my machine woke up this is what I got.

I restarted it and the BIOS told me, “could not read disk, press Ctrl-Alt-Del to restart”.  I power-cycled the machine and got Windows to boot.

I checked my hard drive with Crystal Disk Info, but found nothing out of line in the SMART data—in fact, my terabyte HD, nearly a year old, has never had an error or a remapped sector or anything odd.  Had my partition table been truly corrupted, that would usually cause another bluescreen when I tried to boot.

OK, Windbg:

0: kd> !analyze -v
[banner omitted]
WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa800435c038, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000b2000010, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000010c0f, Low order 32-bits of the MCi_STATUS value.
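
Arg3 and Arg4 are just the two halves of one 64-bit status value.  A small sketch of gluing them back together (the helper name is mine):

```python
def combine_mci_status(high32, low32):
    """Join the high and low 32-bit halves into one 64-bit MCi_STATUS value."""
    return ((high32 & 0xFFFFFFFF) << 32) | (low32 & 0xFFFFFFFF)


arg3 = 0x00000000b2000010  # high order 32 bits
arg4 = 0x0000000000010c0f  # low order 32 bits
status = combine_mci_status(arg3, arg4)
print(f"{status:#018x}")  # 0xb200001000010c0f
```

That is the same value !errrec reports as Status for this crash.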

I’d already guessed when the error happened, but to be sure, here’s the stack:

Child-SP          RetAddr           Call Site
fffff800`00ba8ac8 fffff800`02e2b917 nt!KeBugCheckEx
fffff800`00ba8ad0 fffff800`02fe84d3 hal!HalBugCheckSystem+0x1e3
fffff800`00ba8b10 fffff800`02e2b5dc nt!WheaReportHwError+0x263
fffff800`00ba8b70 fffff800`02e2af2e hal!HalpMcaReportError+0x4c
fffff800`00ba8cc0 fffff800`02e1ee8f hal!HalpMceHandler+0x9e
fffff800`00ba8d00 fffff800`02ed0eac hal!HalHandleMcheck+0x47
fffff800`00ba8d30 fffff800`02ed0d13 nt!KxMcheckAbort+0x6c
fffff800`00ba8e70 fffff880`03dd11f2 nt!KiMcheckAbort+0x153
fffff800`00b9cc98 fffff800`02ee013a amdk8!C1Halt+0x2
fffff800`00b9cca0 fffff800`02edadcc nt!PoIdle+0x53a
fffff800`00b9cd80 00000000`00000000 nt!KiIdleLoop+0x2c

The machine woke up to Windows, started running, and did its normal CPU idle procedure;  in all modern machines, the CPU halts when it is not otherwise running user or kernel code.  It’s possible the exception happened during the transition to sleep when I put the machine to bed the night before, in this event log entry:

The previous system shutdown at 11:27:54 PM on ‎6/‎28/‎2010 was unexpected.

OK, so it’s hardware.  What is the WHEA_ERROR_RECORD?

WHEA stands for Windows Hardware Error Architecture, found in Vista, 2008, Seven and 2008R2.  It replaces the Machine Check Architecture mechanism in earlier versions of Windows.

Parameter #2 of the bugcheck points to the hardware error record:

0: kd> dd fffffa800435c038
fffffa80`0435c038  52455043 ffff0210 0003ffff 00000001
fffffa80`0435c048  00000002 000003a0 000c1114 140a061d
fffffa80`0435c058  00000000 00000000 00000000 00000000
fffffa80`0435c068  00000000 00000000 00000000 00000000
fffffa80`0435c078  cf07c4bd 4e18b789 731fc4b3 3171b52c
fffffa80`0435c088  e8f56ffe 4cc5919c ab6588ba bb1349e1
fffffa80`0435c098  0ced40e1 01cb1314 00000000 00000000
fffffa80`0435c0a8  00000000 00000000 00000000 00000000

Right.  That’s clear.  Fortunately there are debugging extension commands for WHEA in the latest debugger.  I’ll try them.

0: kd> !whea
Error Source Table @ fffff80003062b38
0 Error Sources

 

OK, not much info there, I’ll try one of the others.

0: kd> !errrec fffffa800435c038
===============================================================================
Common Platform Error Record @ fffffa800435c038
-------------------------------------------------------------------------------
Record Id     : 01cb13140ced40e1
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 6/29/2010 12:17:20
Flags         : 0x00000000

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa800435c0b8
Section       @ fffffa800435c190
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : BUS error
Operation     : Generic
Flags         : 0x00
Level         : 3
CPU Version   : 0x0000000000060fb1
Processor ID  : 0x0000000000000000

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa800435c100
Section       @ fffffa800435c250
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000000
CPU Id        : b1 0f 06 00 00 08 02 00 - 01 20 00 00 ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa800435c250

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa800435c148
Section       @ fffffa800435c2d0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
  Status      : 0xb200001000010c0f
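
Going back to the raw dd dump, the first few dwords can be lined up against the header fields that !errrec printed.  A minimal sketch, assuming the record begins with a UEFI-style common platform error record header (signature, revision, signature end, section count, severity, validation bits, record length).  That layout and the field names are my assumption; the debugger output does not spell them out.

```python
import struct

# First eight dwords printed by `dd fffffa800435c038` in this post.
DWORDS = [
    0x52455043, 0xffff0210, 0x0003ffff, 0x00000001,
    0x00000002, 0x000003a0, 0x000c1114, 0x140a061d,
]

# Assumed field names for the start of the record header.
FIELD_NAMES = (
    "signature", "revision", "signature_end", "section_count",
    "severity", "validation_bits", "record_length",
)


def parse_record_header(dwords):
    """Repack little-endian dwords into bytes and unpack the leading fields."""
    raw = struct.pack("<%dI" % len(dwords), *dwords)
    # "<" means little-endian with no padding: 4s H I H I I I = 24 bytes.
    values = struct.unpack_from("<4sHIHIII", raw, 0)
    return dict(zip(FIELD_NAMES, values))


header = parse_record_header(DWORDS)
print(header["signature"], header["section_count"], header["record_length"])
```

Under that assumption the signature reads as CPER, the section count is 3 (Sections 0 through 2 above) and the record length is 928, which agrees with the Length line in the !errrec output.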

 

We’re getting somewhere.  The Processor Generic section categorizes this as a bus error.  Section 2 gets a bit more detailed:

Error         : BUSLG_OBS_ERR_*_NOTIMEOUT_ERR (Proc 0 Bank 4)
  Status      : 0xb200001000010c0f

I have no information on how to decode the status word.  Doing a search on BUSLG turns up a few hits related to memory errors in FreeBSD.  The “Bank 4” wording implies memory hardware—this machine has four sticks of 1G each for a 4G system.

MSDN has a description of WHEA error events, though none of them describe my error at all.

A likely scenario may be that the system was put to sleep and entered sleep normally, but there was a power glitch during sleep, or on wakeup, that affected the standby power that keeps the RAM alive.  If you put the system to sleep and turn off the power, Windows will complain about it in the event logs.

I’ve seen a lot of quirks with this particular system but this is a new one.   I’ve often had BIOS messages that tell me,

A HyperTransport sync flood occurred on last boot
Hit F1 to Resume

Sure, I see that message and think, OMG the sync flooded, get a mop!  It’s not a very actionable error.
Unless this happens again, I’m not going to do anything about this.  The motherboard is elderly, 4 years old, and I plan to get a new board when it hits its fifth birthday this time next year.   If it happens every day for the next month though….

Local News Site Crashes, Part 3: Resolution?

Previous parts 1 and 2

After nosing around for a while and not finding any clue on the local news site crash, it’s back to the beginning.

Does anything in the stack show up in search?  Here are the top 15 or so of over 80 entries in this thread’s stack:

0:005> kv
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
0301a500 695981c2 00000000 00000000 00010100 mshtml!CMarkup::DetachElemCtxStream+0x64
0301a520 69575a5e 00000000 00000000 09e34b40 mshtml!CMarkup::DetachElemCtxStream+0x30
0301a554 694b7f43 04fd6c30 10e49194 04fc3830 mshtml!CAPProcessor::Evaluate+0x21d
0301a59c 69598299 00000000 00000000 09e34b40 mshtml!CDoc::SubmitForAntiPhishProcessing+0x1c4
0301a5b4 694c4e81 0301a628 125d82b8 00000000 mshtml!CMarkup::CheckCtxInfoThreshold+0x4c
0301a5c8 694250c2 09e34b40 00000002 00000001 mshtml!CElement::AddCtxInfoHelper+0xa5
0301a5e8 69478a42 00000002 69478a4c 125d82b8 mshtml!CAnchorElement::AddCtxInfoToStream+0x1e
0301a5f0 69478a4c 125d82b8 0301a778 00000000 mshtml!CImgElement::ExitTree+0xa (FPO: [0,0,0])
0301a614 693565e0 0301a628 09e34b40 00000000 mshtml!CAnchorElement::Notify+0x142
0301a768 693559f2 0301a874 002a7ea0 00000001 mshtml!CSpliceTreeEngine::RemoveSplice+0x2eb
0301a848 69356ea9 0301a880 0301a88c 11f74090 mshtml!CMarkup::SpliceTreeInternal+0x83
0301a898 693561ea 0301a8d4 0301a910 00000001 mshtml!CDoc::CutCopyMove+0xca
0301a8b4 692fcfd6 0301a8d4 0301a910 00000001 mshtml!CDoc::Move+0x16

As it happened, I got a hit from CMarkup::DetachElemCtxStream.  It’s in this long thread on MSDN: “IE 8.0.6001.18702 Unmanaged exception on MSHTML.DLL (innerHTML)”.  It’s an ongoing and very interesting thread about browser crashes, input limits and the coding quality of a very popular social networking site.
 
Somewhere in that thread another function, CDoc::SubmitForAntiPhishProcessing, was mentioned. 
 
Internet Explorer’s SmartScreen Filter.
 
Several people in the thread suggested the problems went away in their scenarios when SmartScreen was turned off.
 
Took me three blog posts to find out the same thing, but there it is.  When SmartScreen is turned off, this news site loads successfully and does not crash.
 
I’m wondering whether that very, very, very long list of URLs I once found in a dump was input to SmartScreen.   The average commercial website loads so much content to the browser:  Pop-ups, pop-unders, pop-overs, pop-throughs, multiple Flash movies, and what seem like a million small frames and boxes that explode into millions of Twitter URLs upon the slightest accidental mouseover.
 
I can’t believe this would only happen to IE;  the other major browsers seem to be equally at risk of exceeding list and table limits with the average web site.
 
At least I can see this newspaper now.  Not that I like its editorial slant but at least IE is not in the way.  I still want it fixed so I can turn SmartScreen back on.
 
 


Local News Site Crashes, Part 2

As mentioned in my last post, a local news site was crashing on me and I wanted to learn more about what was causing it.  I had the HTTP request records from Fiddler, but I didn’t think its results were conclusive enough for me.  What could I find out in the debugger?  It’s the first tool I run for a kernel crash (bluescreen) but I had never tried to analyze an application crash with it.

First, I tried !analyze -v.  This command is the title of an internals blog I regularly read, but it is also the command that automatically analyzes a dump and determines the cause of a crash.  It is often the first command given in a kernel debugging session.  What is it here?

***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: jscript!FncInfo                               ***
***                                                                   ***
*************************************************************************
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for msidcrl40.DLL - 

OK.  I’m not in Microsoft so I won’t get those symbols.  I probably wouldn’t even have posted this if I were in MS.  However, WinDbg helpfully tells me that “an exception of interest can be accessed via .ecxr”.  Let’s see this exception record:

0:005> .ecxr
eax=00000000 ebx=00000000 ecx=04fef280 edx=0301a424 esi=04f934a0 edi=00000000
eip=695981f6 esp=0301a4f0 ebp=0301a500 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010246
mshtml!CMarkup::DetachElemCtxStream+0x64:
695981f6 8b07 mov eax,dword ptr [edi] ds:002b:00000000=????????

We’re getting someplace.  This is almost certainly where IE went boom.  The EDI register is supposed to point to somewhere in memory where the data is, but it is all zeroes so when it is dereferenced…it’s a null pointer.

 
What else was it doing?  My next step, a stack trace (kv)—this is just part of it:
0:005> kv
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
0301a500 695981c2 00000000 00000000 00010100 mshtml!CMarkup::DetachElemCtxStream+0x64
0301a520 69575a5e 00000000 00000000 09e34b40 mshtml!CMarkup::DetachElemCtxStream+0x30
0301a554 694b7f43 04fd6c30 10e49194 04fc3830 mshtml!CAPProcessor::Evaluate+0x21d
0301a59c 69598299 00000000 00000000 09e34b40 mshtml!CDoc::SubmitForAntiPhishProcessing+0x1c4
0301a5b4 694c4e81 0301a628 125d82b8 00000000 mshtml!CMarkup::CheckCtxInfoThreshold+0x4c
0301a5c8 694250c2 09e34b40 00000002 00000001 mshtml!CElement::AddCtxInfoHelper+0xa5
0301a5e8 69478a42 00000002 69478a4c 125d82b8 mshtml!CAnchorElement::AddCtxInfoToStream+0x1e
0301a5f0 69478a4c 125d82b8 0301a778 00000000 mshtml!CImgElement::ExitTree+0xa (FPO: [0,0,0])
0301a614 693565e0 0301a628 09e34b40 00000000 mshtml!CAnchorElement::Notify+0x142
0301a768 693559f2 0301a874 002a7ea0 00000001 mshtml!CSpliceTreeEngine::RemoveSplice+0x2eb

The full trace was over 80 entries deep!  The usual strategy is to look at the topmost 5 or 10 entries in the stack since they’re “near” the problem area.  The crash happened in CMarkup::DetachElemCtxStream.  On the left the arguments to the function (args to child) are listed.  Some are zero, suggesting that that function got the bad pointer from one of its parent callers. 

I disassembled the code of DetachElemCtxStream and traced through it:

0:005> u @eip
mshtml!CMarkup::DetachElemCtxStream+0x64:
695981f6 8b07 mov eax,dword ptr [edi]
695981f8 57 push edi
695981f9 ff5004 call dword ptr [eax+4]
695981fc 8b8680000000 mov eax,dword ptr [esi+80h]
69598202 8b08 mov ecx,dword ptr [eax]
69598204 50 push eax
69598205 ff5108 call dword ptr [ecx+8]
69598208 899e80000000 mov dword ptr [esi+80h],ebx

While it seems to involve jumping to a previously-constructed dispatch table, I don’t know what else to make of it.  I did trace through its callers for a bit but didn’t know what I was looking for. (I am familiar with x86 assembly code but do not code in it or look at it regularly.)  Instead, I wanted to look at some registers and some stack arguments to see if they pointed to interesting data.  Now you know why I wanted a full user dump.

We’ll see if some of the registers or stack arguments point to interesting text.

For the most part, the registers and the arguments to the first five entries off the top of the stack weren’t interesting.  In earlier debugging sessions with different dumps of IE, I once found a long list of URLs in Unicode.  A very long list.   I wasn’t able to find that in this dump without spending all week on it.  I found one interesting piece of text pointed to by the ECX register, about 672 bytes in:

0:005> db @ecx + 0n672
04fef520 1f 00 00 00 00 00 00 00-68 00 74 00 74 00 70 00 ........h.t.t.p.
04fef530 3a 00 2f 00 2f 00 77 00-77 00 77 00 2e 00 73 00 :././.w.w.w...s.
04fef540 61 00 6c 00 65 00 6d 00-6e 00 65 00 77 00 73 00 a.l.e.m.n.e.w.s.
04fef550 2e 00 63 00 6f 00 6d 00-2f 00 00 00 00 00 00 00 ..c.o.m./.......
04fef560 00 00 00 00 00 00 00 00-37 aa c0 36 00 00 00 8c ........7..6....
04fef570 2f 00 61 00 6a 00 61 00-78 00 2f 00 6c 00 69 00 /.a.j.a.x./.l.i.
04fef580 62 00 73 00 2f 00 73 00-77 00 66 00 6f 00 62 00 b.s./.s.w.f.o.b.
04fef590 6a 00 65 00 63 00 74 00-2f 00 32 00 2e 00 32 00 j.e.c.t./.2...2.
It appears to be a list of links in that page.  Somewhere in the dump is text of the current web page;  I’d seen it before, but not this time. 
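
The dots between the letters are just the zero high bytes of UTF-16 text.  A minimal sketch of turning the first four rows of that db output back into a readable string; the helper name is mine and the hex is copied from the dump above.

```python
def bytes_from_db_row(hex_columns):
    """Convert the hex columns of one db row (with its '-' separator) to bytes."""
    return bytes.fromhex(hex_columns.replace("-", " "))


ROWS = [
    "1f 00 00 00 00 00 00 00-68 00 74 00 74 00 70 00",
    "3a 00 2f 00 2f 00 77 00-77 00 77 00 2e 00 73 00",
    "61 00 6c 00 65 00 6d 00-6e 00 65 00 77 00 73 00",
    "2e 00 63 00 6f 00 6d 00-2f 00 00 00 00 00 00 00",
]

raw = b"".join(bytes_from_db_row(row) for row in ROWS)
# Skip the 8 bytes before the text, decode as little-endian UTF-16,
# and stop at the first NUL terminator.
text = raw[8:].decode("utf-16-le").split("\x00", 1)[0]
print(text)

# The address shown by db is ECX plus decimal 672 (0n672 in WinDbg syntax).
print(hex(0x04fef280 + 672))
```

This prints the site’s URL, followed by 0x4fef520, the address in the first row of the dump.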
 
At this point, it’s perfectly acceptable to just take the top stack entries, throw them in a search engine and see who else has seen this problem.  That’s what I’m doing next.


Local News Site Crashes, Part 1

Our local paper redesigned its website a month ago.  Ever since, this is what I and many others have seen when opening it for the morning.

Sometimes, a website will crash one time due to an isolated error.  A third-party web analytics site once made an error in its HTML that brought down every site that used their services.  This sort of error gets found and corrected very quickly.

But this went on over days.  Rarely, the site would stay open for reading only to crash when opening another story.  I wanted to find the problem, even though I have no stake or obligation to do so.  I didn’t think I could get the newspaper interested in my bug report so I tried to find out what I could with my own knowledge of Windows internals.

First of all, I needed a crash dump of the failed process, to wit, Internet Explorer.  Windows 7 (and Vista) do not save crash dumps for applications by default.  (Note that this has nothing to do with the settings for kernel dumps or bluescreens;  those are handled through the familiar sysdm.cpl control panel applet.)

MSDN has a page describing how to configure user-mode dumps. 

There’s only one setting we need to enable the dumps.  In Regedit, go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps.  Under that key, create a new DWORD value named DumpType.  Set its value to 2.  This will make Windows perform full dumps of the application, which we will need to make any headway in this diagnosis.  Restart the computer.

When an app crashes, Windows will now put its full crash dump in the CrashDumps folder under the user’s local application data (normally c:\users\<user>\AppData\Local\CrashDumps).  It will store up to 10 dumps before overwriting any.  These defaults can be changed per the MSDN page but these are fine here.
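
If you would rather not click through Regedit, the same setting can be written as a .reg file.  A minimal sketch that only builds the file’s text; the function name is hypothetical, and the key and value are the ones described above.

```python
LOCAL_DUMPS_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows"
    r"\Windows Error Reporting\LocalDumps"
)


def build_reg_file(dump_type=2):
    """Return .reg file text that sets DumpType under the LocalDumps key."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[%s]" % LOCAL_DUMPS_KEY,
        '"DumpType"=dword:%08x' % dump_type,  # 2 = full dump, per the post
        "",
    ]
    return "\r\n".join(lines)


print(build_reg_file())
```

Save the output under any name ending in .reg and merge it from an administrator account; the end result should be the same DWORD value as the manual steps.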

Next, I installed Fiddler.  This is a really ingenious HTTP proxy.  It uses the built-in proxy settings, that you may have seen in the Internet Options dialog, to redirect HTTP traffic to itself, capture it and display it, much like WireShark and Network Monitor, but with special emphasis on HTTP debugging.   It would tell me what was requested when IE crashed.  Fortunately, the crash was repeatable so I captured it with Fiddler:

The main window of Fiddler is very much like other network tracing tools.  A list of sessions opened is in the left pane.  The right pane has details on a particular session and the lower right pane has even more details.

There are a lot of requests made to open the typical web page.  In a crash like the one I experienced, the web page pops up and one can see headlines and content, but a few seconds later, the crash dialog comes up.

Note request #149 which I have circled.  It goes to watson.microsoft.com.  This is where Windows Error Reporting sends your crash data.  The crash had happened already here.  Any of the requests prior to this could have crunched IE, either immediately or a short time afterwards.  I have highlighted the prior request, #148, which is to ad.trafficmp.com, a very common ad-serving site.  The requests that came afterwards occurred when I dismissed the error dialog and IE tried to reload the page.

I’d hoped there was some Javascript code from that site that would pop out at me as being “bad” (recursive code with a bug, say.)  But nothing stood out.

Since I had full dumps of IE during the crash, it was time to run the Windows debugger.  That’s my next post.


16-Bit Installer Support in Windows 64

Followup to my TIE Fighter post.  I was not wrong about 16-bit installer support.  From MSDN: 

…For older applications that use a 16-bit stub to launch a 32-bit installation engine, 64-bit Windows recognizes specific 16-bit installer programs and substitutes a ported 32-bit version.

16-bit DOS, Windows, or OS/2 applications often use a 16-bit stub to check the machine type, then launch a 32-bit installation engine to actually perform the installation. To enable installation of applications that use this technique, 64-bit Windows substitutes 32-bit versions for the following 16-bit installer programs:

Microsoft Setup for Windows 1.2

Microsoft Setup for Windows 2.6

Microsoft Setup for Windows 3.0

Microsoft Setup for Windows 3.01

InstallShield 5.x

The registry key that defines these installer shims is at HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\NTVdm64.  The list, according to Microsoft, cannot be extended.

I was certain that TIE Fighter used InstallShield (as did the majority of apps in the era) but what version?

I used Sysinternals’ Strings utility on the setup file (on the install CD, \INSTALL\SETUP.EXE).  And got this (redacted a bit):

CompanyName
InstallShield Corporation, Inc.
FileDescription
Setup Launcher ( SETUP.EXE) 
FileVersion
3.00.111.0
LegalCopyright
Copyright InstallShield Corporation, Inc. 1990-1996 Phone : (847) 240-9111 
ProductName
InstallShield
ProductVersion
3.00.111.0

That is it.  The installer is version 3.0 and the shim only works with 5.x.   I noted elsewhere in the Strings output, clues as to the real age of the installer at the time—there are references to MIPS and Alpha architectures, which have not been in Windows for a very long time (MIPS was discontinued around the time of NT4 and Alpha never made it past Windows 2000 before being assimilated by Compaq.)
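
The comparison I made by eye amounts to this.  It is only a sketch of my reasoning, assuming the major version in the FileVersion string is what matters; it is not a description of how Windows itself matches installers, and the names are mine.

```python
SHIMMED_INSTALLSHIELD_MAJOR = 5  # "InstallShield 5.x" from the MSDN list


def installshield_major(file_version):
    """Return the major version from a dotted FileVersion string."""
    return int(file_version.split(".")[0])


def covered_by_shim(file_version):
    """True if the major version is the one the MSDN list names."""
    return installshield_major(file_version) == SHIMMED_INSTALLSHIELD_MAJOR


print(covered_by_shim("3.00.111.0"))  # False
```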

So much for that.