Fighting Power Outages at SATV

The month of May at SATV has been …

[Darkness]

[BEEEEEEEEEEEEEEEEEEEEP] [Sound of UPS alarms] [Sound of cell phone message notifications]  [Sound of cell phone ring]  [Repeated]

We’ve had several power interruptions in Downtown Salem in the past month, and several others earlier in the year, one of which happened in the middle of an update to the Exchange server.  We lost half a day’s email on that one.

This was one of our live shows:

Salem Now Intro Interrupted.

I was working for that show at the time and I was just about to hit a key on our Inscriber to start the show introduction when the power hit, or didn’t. 

A week later, I was at SATV again around 6 PM, processing the video for a meeting of the Salem Commission on Disabilities, on which I sit.  Darkness.  Alarms.  My cell phone message tone sounded over and over.  I was in a dark room, which was not fun, but at least I had the light of my laptop to see by.  This was a longer outage, lasting about 25 minutes.

SATV doesn’t have backup power building-wide.  We have a UPS in the server room, and two more in Cablecasting.  These worked as well as can be expected during a power failure.  We used to have a Comcast-managed demarc equipment room in the furthest corner of our facility, but this has been literally ripped out and replaced by 6U worth of fiber optic equipment in our Cablecast racks.

A notable gap is our phone system;  the PBX itself is protected but the individual phones are not, so they stop working during a power incident.  It’s a very sad reminder of the days when all phones and even some PBXs were provided by Ma Bell, powered by the central office, and stayed up no matter what.  I have investigated PoE (Power over Ethernet) switches, since SATV is planning a refresh of our network hardware, but few of the options are affordable for us.

We are too small to have generators—we are in leased space and the only place they could go is the roof, if our landlord would even allow it;  as well, it would not be cheap.  (I have never been approached about running a Home Depot generator in the building.  Fortunate, since I’d then have  to explain about enclosed spaces, fire regulations and CO.)

We thought about using a spare UPS we have to power the control room.  Our control room rack power is supplied by two power cords with NEMA 5-20 plugs.  Our spare UPS doesn’t have compatible sockets;  in fact, none of ours do.  Unless we want to completely rewire the rack, we can’t use it.

In talking with Sal, we’re reluctant to get higher-capacity UPS units, since the costs rise very, very quickly when you go beyond the usual 1500 VA units, the biggest one can buy for regular office power.  UPSs are not made to power everything for a long time, but just to carry over brief interruptions and keep the loads up until the generators start and power is switched in.
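To put rough numbers on the mismatch, here is a minimal sketch of the arithmetic. It assumes a nominal 120 V supply and the common 80% rule of thumb for continuous loads. Neither figure comes from measurements of our rack, and the real draw is certainly lower than this ceiling.

```python
# Back-of-the-envelope sizing; all figures are nominal assumptions,
# not measurements taken from the SATV control room rack.
NOMINAL_VOLTS = 120       # assumed nominal branch-circuit voltage
NEMA_5_20_AMPS = 20       # current rating of a NEMA 5-20 plug
CONTINUOUS_PCT = 80       # rule-of-thumb limit for continuous loads
UPS_VA = 1500             # the "usual" office UPS size mentioned above


def circuit_ceiling_va(amps, volts=NOMINAL_VOLTS, pct=CONTINUOUS_PCT):
    """Largest continuous load (in VA) a circuit of this rating should carry."""
    return amps * volts * pct // 100


per_cord_va = circuit_ceiling_va(NEMA_5_20_AMPS)
rack_ceiling_va = 2 * per_cord_va   # the rack is fed by two such cords

print(f"One 5-20 cord can carry up to about {per_cord_va} VA")
print(f"Two cords: up to about {rack_ceiling_va} VA")
print(f"A {UPS_VA} VA UPS covers a full cord: {UPS_VA >= per_cord_va}")
```

Even one fully loaded cord could ask for more than a 1500 VA unit can deliver, before runtime is considered at all.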

If our facility disappeared and I were asked to design a new one from scratch, I would definitely put the server space close to the cablecast space and close to the Comcast & Verizon demarcs.  And configure those areas for 208V three-phase.  And have a locked door.  And working HVAC.  But very few of us have been fortunate enough to get that blank slate.  Not me, either.

It’s just about impossible to predict what will happen next and how National Grid will deal with it.  Our power situation, and Downtown Salem’s in general, reminds me of Salem 35 years ago when we had a very old water and sewer system.  Then, water mains would rupture seemingly every month.  One May evening in the late ’70s, Salem’s water main downtown let go.

It was in the train tunnel under Riley Plaza.  There were, at one point, several feet of water in spots on street level.  Never mind the tunnel itself, which was completely submerged.  I note with cheer that virtually all of Salem’s telecom cabling ran through this point, connecting with the New England Telephone (now Verizon) CO that was and still is nearby.

DSL service—my DSL service—now runs through that tunnel.  We don’t have a map of Comcast’s cable junctions, but we do know their major cable vault downtown, near Summer & Norman Sts., was prone to flooding, too.

What surprises do our aging electrical systems have for us?  I already know that a car with a stuck accelerator near the Ward Two Social Club (on Bridge St. Neck, five blocks NE of my apartment) can drop power downtown, and has, along with the pole it splintered.

National Grid is frustrating in its lack of communications;  they have a cute Google-based outage map on their webpage that is useful for National Grid itself, but not for its customers or residents.  Why can’t they do what the MBTA does?  I get service notices to my email and cellphone for bus routes I regularly take.  Why couldn’t National Grid have a system where you enter your ZIP code and email address or cell number and get notifications of outages or planned maintenance?

For that matter, National Grid already generates internal reports on power incidents.  Why not let us see them in a daily or weekly summary?  (“There was a 2-second drop on one phase at the Canal St. Substation, affecting some customers in South Salem…”)

This is not a problem for SATV to solve as much as it is the city’s.  Downtown Salem loses power several times a year;  we just go off the air, but the many other businesses lose money.  Big money.

Unfortunately, no neighborhood association will ever solicit for improved electrical distribution or refurbished substations.

Old Game, New Controller: Descent

I’ve been playing a lot of games lately.  In fact, I seem to have more games in my collection than I have time to play, so I just rotate them in and out as my tastes dictate.  (I got Carmageddon working under DOSBox.  The less said about that game the better.  It doesn’t help that I have bought more games from GOG and at dollar stores recently but…)

An old game I’ve been playing lately is Descent.  I have the entire Descent series (three games) and the entire Freespace series from GOG.  I’ve had trouble playing the original Descent games (Descent and Descent 2, which used the same engine) because neither of them supports my controller very well.  DOSBox, the emulator used to run DOS games, only emulates the native PC joystick interface: four axes on two sticks, and four buttons.

This does not work well nor play well for this game, for reasons I’ll explain later on.

There is a solution.  Descent is one of several games whose source code has been released;  the Build engine of Duke Nukem 3D and Redneck Rampage (not that I own these) and, most famously, Doom have also been released as source.  I have played and loved Doomsday, the 3D engine for Doom and Doom 2.

A German Descent fan has developed D2X-XL, a Descent 1/2 game engine that runs on modern machines.  It, like Doomsday, implements the full lighting and shading effects that we expect from a modern video card at modern resolutions, including 16:9, which I now use.

Most importantly, it supports modern controls. 

D2X’s install is moderately complex for a typical gamer;  install the game in a directory outside “Program Files” (I have “\miscprograms\games\”) and copy the game data files from a copy of Descent 1 or Descent 2;  if you bought the GOG edition as I did, they are in C:\Program Files\GOG.com\Descent 1 and 2\.  If you run the D2X executable, it will give you a dialog box explaining which files are missing.

(If you are on an x64 system, the GOG files will be in \Program Files (x86).  D2X comes in a 64-bit version, so you may want to read the instructions on how to use that.)

I won’t go into graphics setup as that will vary;  my elderly nVidia card had to have most of the settings dialed way down, so my screenshots won’t look as cool as they would on a modern card and motherboard (next year maybe).

The controls are set in Options/Controls/Customize Joystick. 

Before I go on, I must say that Descent is one of the more complicated games to play and control.  Only flight simulators are more specialized than this game.  In the game, you pilot a spaceship down the corridors and passages of a mine, battling rogue mechs, trying to destroy the mechs and the mine (via a reactor you blow up) in order to go on to the next mine.  Your ship has controls on all three axes, and you not only pitch and yaw like an aircraft, but you also move laterally left and right, up and down, forward and back.  And, like a fighter jet, you have an array of weapon systems to manage.

How I ever liked this game when I was younger, I do not know.  (In those days, a joystick with a hat switch was the thing to play it with.)

Here is my setup for using the XBox 360 controller.  I’ve had this controller longer than any other and I love it.

In this game, as in most games by default, the left joystick control is joystick #1 and handles the pitch and yaw.

The D-Pad control on the bottom left is not used.

The left and right “bumper” buttons (or shoulder buttons if you will) fire the primary (laser) and secondary (missile) weapons.  Here’s where it gets complex.

The right joystick, unused by most games, is used to slide (translate) left and right and up and down.

The right trigger engages forward thrust.  The left trigger engages rear thrust.  In this controller, the triggers are treated as an extra axis and will register partial trigger movements (though not in Descent.)

The white Back button cycles through your primary weapons;  the white Start button does the same for secondary weapons.

Button A fires a flare, which is very necessary since most mines are dark;  button B will bank the ship with the joystick while it is held in.  Button X replicates the forward thrust control;  Button Y replicates reverse thrust.

The joystick controls themselves can be pressed for an extra button each.  They, like the D-Pad, are unassigned.  Of the remaining controls in Descent, most are either infrequently used (cruise mode) or too dangerous (bomb release) to warrant being available on the controller.  I’ve tried to make this logical for the way I play, and it has seemed to work.
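The whole layout can be summarized as a small lookup table, which also shows why DOSBox’s emulated joystick (four axes, four buttons) falls short.  This is only a sketch: the key names are informal labels made up for illustration, not D2X-XL setting names.

```python
# Informal summary of the layout described above.
# Key names are labels invented for this sketch, not D2X-XL settings.
DOSBOX_AXES, DOSBOX_BUTTONS = 4, 4   # limits of the emulated PC joystick

AXES = {
    # control: (number of axes it occupies, what it does)
    "left stick": (2, "pitch and yaw"),
    "right stick": (2, "slide left/right and up/down"),
    "triggers": (1, "right = forward thrust, left = reverse thrust"),
}

BUTTONS = {
    "left bumper": "fire primary (laser)",
    "right bumper": "fire secondary (missile)",
    "Back": "cycle primary weapons",
    "Start": "cycle secondary weapons",
    "A": "fire flare",
    "B": "bank with the joystick while held",
    "X": "forward thrust",
    "Y": "reverse thrust",
}

UNASSIGNED = ["D-pad", "left stick click", "right stick click"]

axes_used = sum(count for count, _ in AXES.values())
print(f"Axes used: {axes_used} (DOSBox joystick offers {DOSBOX_AXES})")
print(f"Buttons used: {len(BUTTONS)} (DOSBox joystick offers {DOSBOX_BUTTONS})")
```

Counting the triggers as a single extra axis, as the controller reports them, this layout needs five axes and eight buttons.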

A skilled player, of which I am not one, can make his or her ship do pirouettes all day with this controller.  I’ll just settle for surviving to the next level.


Local News Site Crashes, Part 3: Resolution?

Previous parts 1 and 2

After nosing around for a while and not finding any clue on the local news site crash, it’s back to the beginning.

Does anything in the stack show up in search?  Here are the top 15 or so of over 80 entries in this thread’s stack:

0:005> kv
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
0301a500 695981c2 00000000 00000000 00010100 mshtml!CMarkup::DetachElemCtxStream+0x64
0301a520 69575a5e 00000000 00000000 09e34b40 mshtml!CMarkup::DetachElemCtxStream+0x30
0301a554 694b7f43 04fd6c30 10e49194 04fc3830 mshtml!CAPProcessor::Evaluate+0x21d
0301a59c 69598299 00000000 00000000 09e34b40 mshtml!CDoc::SubmitForAntiPhishProcessing+0x1c4
0301a5b4 694c4e81 0301a628 125d82b8 00000000 mshtml!CMarkup::CheckCtxInfoThreshold+0x4c
0301a5c8 694250c2 09e34b40 00000002 00000001 mshtml!CElement::AddCtxInfoHelper+0xa5
0301a5e8 69478a42 00000002 69478a4c 125d82b8 mshtml!CAnchorElement::AddCtxInfoToStream+0x1e
0301a5f0 69478a4c 125d82b8 0301a778 00000000 mshtml!CImgElement::ExitTree+0xa (FPO: [0,0,0])
0301a614 693565e0 0301a628 09e34b40 00000000 mshtml!CAnchorElement::Notify+0x142
0301a768 693559f2 0301a874 002a7ea0 00000001 mshtml!CSpliceTreeEngine::RemoveSplice+0x2eb
0301a848 69356ea9 0301a880 0301a88c 11f74090 mshtml!CMarkup::SpliceTreeInternal+0x83
0301a898 693561ea 0301a8d4 0301a910 00000001 mshtml!CDoc::CutCopyMove+0xca
0301a8b4 692fcfd6 0301a8d4 0301a910 00000001 mshtml!CDoc::Move+0x16

As it happened, I got a hit from CMarkup::DetachElemCtxStream.  It’s in this long thread on MSDN: “IE 8.0.6001.18702 Unmanaged exception on MSHTML.DLL (innerHTML)”.  It’s an ongoing and very interesting thread about browser crashes, input limits and the coding quality of a very popular social networking site.
 
Somewhere in that thread another function, CDoc::SubmitForAntiPhishProcessing, was mentioned. 
 
That points to Internet Explorer’s SmartScreen Filter.
 
Several people in the thread suggested the problems went away in their scenarios when SmartScreen was turned off.
 
Took me three blog posts to find out the same thing, but there it is.  When SmartScreen is turned off, this news site loads successfully and does not crash.
 
I’m wondering if that very, very, very long list of URLs I once found in a dump was input to SmartScreen.  The average commercial website loads so much content to the browser:  pop-ups, pop-unders, pop-overs, pop-throughs, multiple Flash movies, and what seem like a million small frames and boxes that explode into millions of Twitter URLs upon the slightest accidental mouseover.
 
I can’t believe this would only happen to IE;  the other major browsers seem to be equally at risk of exceeding list and table limits with the average web site.
 
At least I can see this newspaper now.  Not that I like its editorial slant but at least IE is not in the way.  I still want it fixed so I can turn SmartScreen back on.
 
 


Local News Site Crashes, Part 2

As mentioned in my last post, a local news site was crashing on me and I wanted to learn more about what was causing it.  I had the HTTP request records from Fiddler, but I didn’t think its results were conclusive enough for me.  What could I find out in the debugger?  It’s the first tool I run for a kernel crash (bluescreen) but I had never tried to analyze an application crash with it.

First, I tried !analyze -v.  This command is the title of an internals blog I regularly read, but it is also the command that automatically analyzes a dump and tries to determine the cause of a crash.  It is often the first command given in a kernel debugging session.  What does it say here?

***    Your debugger is not using the correct symbols                 ***
***                                                                   ***
***    In order for this command to work properly, your symbol path   ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: jscript!FncInfo                               ***
***                                                                   ***
*************************************************************************
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for msidcrl40.DLL - 

OK.  I’m not at Microsoft so I won’t get those symbols.  I probably wouldn’t even have posted this if I were at MS.  However, WinDbg helpfully tells me that “an exception of interest can be accessed via .ecxr”.  Let’s see this exception record:

0:005> .ecxr
eax=00000000 ebx=00000000 ecx=04fef280 edx=0301a424 esi=04f934a0 edi=00000000
eip=695981f6 esp=0301a4f0 ebp=0301a500 iopl=0 nv up ei pl zr na pe nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010246
mshtml!CMarkup::DetachElemCtxStream+0x64:
695981f6 8b07 mov eax,dword ptr [edi] ds:002b:00000000=????????

We’re getting someplace.  This is almost certainly where IE went boom.  The faulting instruction reads memory through the EDI register, which is supposed to point to the data, but EDI is all zeroes.  Dereferencing it is a null pointer access, and that is the crash.
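The same check can be done mechanically.  The sketch below pastes in the register lines from the .ecxr output and the faulting instruction, then looks up the register used inside the brackets.  The helper names are made up for illustration, and it only handles the simple [reg] addressing form seen here.

```python
import re

# Register lines and faulting instruction copied from the .ecxr output above.
ECXR_TEXT = """\
eax=00000000 ebx=00000000 ecx=04fef280 edx=0301a424 esi=04f934a0 edi=00000000
eip=695981f6 esp=0301a4f0 ebp=0301a500 iopl=0 nv up ei pl zr na pe nc
"""
FAULT_INSN = "mov eax,dword ptr [edi]"


def parse_registers(text):
    """Return {register name: value} for the 32-bit registers in the dump."""
    pairs = re.findall(r"\b(e[a-z]{2})=([0-9a-f]{8})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def base_register(insn):
    """Return the register inside [...]; only the plain [reg] form is handled."""
    return re.search(r"\[(\w+)\]", insn).group(1)


regs = parse_registers(ECXR_TEXT)
base = base_register(FAULT_INSN)
print(f"{base} = {regs[base]:#010x}")
if regs[base] == 0:
    print("The instruction reads through a null pointer.")
```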

 
What else was it doing?  My next step was a stack trace (kv).  This is just part of it:

0:005> kv
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
0301a500 695981c2 00000000 00000000 00010100 mshtml!CMarkup::DetachElemCtxStream+0x64
0301a520 69575a5e 00000000 00000000 09e34b40 mshtml!CMarkup::DetachElemCtxStream+0x30
0301a554 694b7f43 04fd6c30 10e49194 04fc3830 mshtml!CAPProcessor::Evaluate+0x21d
0301a59c 69598299 00000000 00000000 09e34b40 mshtml!CDoc::SubmitForAntiPhishProcessing+0x1c4
0301a5b4 694c4e81 0301a628 125d82b8 00000000 mshtml!CMarkup::CheckCtxInfoThreshold+0x4c
0301a5c8 694250c2 09e34b40 00000002 00000001 mshtml!CElement::AddCtxInfoHelper+0xa5
0301a5e8 69478a42 00000002 69478a4c 125d82b8 mshtml!CAnchorElement::AddCtxInfoToStream+0x1e
0301a5f0 69478a4c 125d82b8 0301a778 00000000 mshtml!CImgElement::ExitTree+0xa (FPO: [0,0,0])
0301a614 693565e0 0301a628 09e34b40 00000000 mshtml!CAnchorElement::Notify+0x142
0301a768 693559f2 0301a874 002a7ea0 00000001 mshtml!CSpliceTreeEngine::RemoveSplice+0x2eb

The full trace was over 80 entries deep!  The usual strategy is to look at the topmost 5 or 10 entries in the stack since they’re “near” the problem area.  The crash happened in CMarkup::DetachElemCtxStream.  The three columns after the return address (Args to Child) show the first arguments passed to each function.  Some are zero, suggesting that the function got the bad pointer from one of its callers.
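Pulling the function names out of the top of a kv listing is easy to script, which helps when the next step is pasting them into a search engine.  This is a rough sketch that assumes the column layout shown above (ChildEBP, RetAddr, three argument columns, then the symbol); top_symbols is a made-up helper, not a debugger command.

```python
# Top frames copied from the kv output above.
KV_TEXT = """\
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
0301a500 695981c2 00000000 00000000 00010100 mshtml!CMarkup::DetachElemCtxStream+0x64
0301a520 69575a5e 00000000 00000000 09e34b40 mshtml!CMarkup::DetachElemCtxStream+0x30
0301a554 694b7f43 04fd6c30 10e49194 04fc3830 mshtml!CAPProcessor::Evaluate+0x21d
0301a59c 69598299 00000000 00000000 09e34b40 mshtml!CDoc::SubmitForAntiPhishProcessing+0x1c4
0301a5b4 694c4e81 0301a628 125d82b8 00000000 mshtml!CMarkup::CheckCtxInfoThreshold+0x4c
"""


def top_symbols(kv_text, limit=5):
    """Return up to `limit` distinct module!Function names, nearest frame first."""
    symbols = []
    for line in kv_text.splitlines():
        parts = line.split()
        # Frame lines have five hex columns followed by module!Function+offset.
        if len(parts) < 6 or "!" not in parts[5]:
            continue
        name = parts[5].split("+")[0]   # drop the +0xNN offset
        if name not in symbols:
            symbols.append(name)
        if len(symbols) == limit:
            break
    return symbols


for name in top_symbols(KV_TEXT, limit=3):
    print(name)
```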

I disassembled the code of DetachElemCtxStream and traced through it:

0:005> u @eip
mshtml!CMarkup::DetachElemCtxStream+0x64:
695981f6 8b07 mov eax,dword ptr [edi]
695981f8 57 push edi
695981f9 ff5004 call dword ptr [eax+4]
695981fc 8b8680000000 mov eax,dword ptr [esi+80h]
69598202 8b08 mov ecx,dword ptr [eax]
69598204 50 push eax
69598205 ff5108 call dword ptr [ecx+8]
69598208 899e80000000 mov dword ptr [esi+80h],ebx

While it seems to involve jumping to a previously-constructed dispatch table, I don’t know what else to make of it.  I did trace through its callers for a bit but didn’t know what I was looking for. (I am familiar with x86 assembly code but do not code in it or look at it regularly.)  Instead, I wanted to look at some registers and some stack arguments to see if they pointed to interesting data.  Now you know why I wanted a full user dump.

We’ll see if some of the registers or stack arguments point to interesting text.

For the most part, the registers and the arguments to the first five entries off the top of the stack weren’t interesting.  In earlier debugging sessions with different dumps of IE, I once found a long list of URLs in Unicode.  A very long list.  I wasn’t able to find that in this dump without spending all week on it.  I found one interesting piece of text pointed to by the ECX register, about 672 bytes in:

0:005> db @ecx + 0n672
04fef520 1f 00 00 00 00 00 00 00-68 00 74 00 74 00 70 00 ........h.t.t.p.
04fef530 3a 00 2f 00 2f 00 77 00-77 00 77 00 2e 00 73 00 :././.w.w.w...s.
04fef540 61 00 6c 00 65 00 6d 00-6e 00 65 00 77 00 73 00 a.l.e.m.n.e.w.s.
04fef550 2e 00 63 00 6f 00 6d 00-2f 00 00 00 00 00 00 00 ..c.o.m./.......
04fef560 00 00 00 00 00 00 00 00-37 aa c0 36 00 00 00 8c ........7..6....
04fef570 2f 00 61 00 6a 00 61 00-78 00 2f 00 6c 00 69 00 /.a.j.a.x./.l.i.
04fef580 62 00 73 00 2f 00 73 00-77 00 66 00 6f 00 62 00 b.s./.s.w.f.o.b.
04fef590 6a 00 65 00 63 00 74 00-2f 00 32 00 2e 00 32 00 j.e.c.t./.2...2.

It appears to be a list of links in that page.  Somewhere in the dump is text of the current web page;  I’d seen it before, but not this time.
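The “h.t.t.p.” pattern in the right-hand column is what UTF-16 text looks like in a byte dump: each character is followed by a zero byte.  As a minimal sketch, the first four lines of the db output can be turned back into bytes and decoded.  The db_bytes helper is made up for illustration and assumes the 16-bytes-per-line layout shown above.

```python
# First four lines copied from the db output above.
DB_OUTPUT = """\
04fef520 1f 00 00 00 00 00 00 00-68 00 74 00 74 00 70 00 ........h.t.t.p.
04fef530 3a 00 2f 00 2f 00 77 00-77 00 77 00 2e 00 73 00 :././.w.w.w...s.
04fef540 61 00 6c 00 65 00 6d 00-6e 00 65 00 77 00 73 00 a.l.e.m.n.e.w.s.
04fef550 2e 00 63 00 6f 00 6d 00-2f 00 00 00 00 00 00 00 ..c.o.m./.......
"""


def db_bytes(text):
    """Rebuild the raw bytes from WinDbg 'db' lines (16 bytes per line)."""
    data = bytearray()
    for line in text.splitlines():
        fields = line.split()
        # fields[0] is the address. The 8th and 9th bytes are joined by '-',
        # so the 16 bytes occupy 15 whitespace-separated fields.
        hex_part = " ".join(fields[1:16]).replace("-", " ")
        data += bytes.fromhex(hex_part)
    return bytes(data)


raw = db_bytes(DB_OUTPUT)
# The string starts 8 bytes into the dump, after the leading 1f 00 00 ... bytes.
text = raw[8:].decode("utf-16-le").split("\x00")[0]
print(text)

# Offset of the dumped address from ECX (04fef280 in the .ecxr output).
print(0x04fef520 - 0x04fef280)
```

The decoded string is the site’s address, and the subtraction confirms the 0n672 offset used in the db command.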
 
At this point, it’s perfectly acceptable to just take the top stack entries, throw them in a search engine and see who else has seen this problem.  That’s what I’m doing next.


Local News Site Crashes, Part 1

Our local paper redesigned its website a month ago.  Ever since, many others and I have seen Internet Explorer crash when opening the site in the morning.

Sometimes, a website will crash one time due to an isolated error.  A third-party web analytics site once made an error in its HTML that brought down every site that used their services.  This sort of error gets found and corrected very quickly.

But this went on over days.  Rarely, the site would stay open for reading, only to crash when opening another story.  I wanted to find the problem, even though I have no stake or obligation to do so.  I didn’t think I could get the newspaper interested in my bug report so I tried to find out what I could with my own knowledge of Windows internals.

First of all, I needed a crash dump of the failed process, to wit, Internet Explorer.  Windows 7 (and Vista) do not save crash dumps for applications by default.  (Note that this has nothing to do with the settings for kernel dumps or bluescreens;  those are handled through the familiar sysdm.cpl control panel applet.)

MSDN has a page describing how to configure user-mode dumps. 

There’s only one setting we need to enable the dumps.  In Regedit, go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps.  Under that key, create a new DWORD value named DumpType.  Set its value to 2.  This will make Windows perform full dumps of the application, which we will need to make any headway in this diagnosis.  Restart the computer.
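For anyone who would rather not click through Regedit, the same setting can be written as a .reg file.  The sketch below only builds the file’s text;  the key path and value are the ones given above, while the function name and the idea of printing the result are just for illustration.  Importing it still needs administrator rights.

```python
# Builds the text of a .reg file for the setting described above.
# localdumps_reg is a made-up helper name; it does not touch the registry.
LOCALDUMPS_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows"
    r"\Windows Error Reporting\LocalDumps"
)


def localdumps_reg(dump_type=2):
    """Return .reg file text that sets DumpType (2 = full dump)."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{LOCALDUMPS_KEY}]",
        f'"DumpType"=dword:{dump_type:08x}',
        "",
    ]
    return "\n".join(lines)


print(localdumps_reg())
```

Saving the printed text as a file ending in .reg and double-clicking it should have the same effect as the manual steps.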

When an app crashes, Windows will now put its full crash dump in the CrashDumps folder under the user’s local application data (normally c:\users\<user>\AppData\Local\CrashDumps).  It will store up to 10 dumps before overwriting any.  These defaults can be changed per the MSDN page but they are fine here.

Next, I installed Fiddler.  This is a really ingenious HTTP proxy.  It uses the built-in proxy settings, which you may have seen in the Internet Options dialog, to redirect HTTP traffic to itself, capture it and display it, much like Wireshark and Network Monitor, but with special emphasis on HTTP debugging.  It would tell me what was requested when IE crashed.  Fortunately, the crash was repeatable so I captured it with Fiddler:

The main window of Fiddler is very much like other network tracing tools.  A list of sessions opened is in the left pane.  The right pane has details on a particular session and the lower right pane has even more details.

There are a lot of requests made to open the typical web page.  In a crash like the one I experienced, the web page pops up and one can see headlines and content, but a few seconds later, the crash dialog comes up.

Note request #149 which I have circled.  It goes to watson.microsoft.com.  This is where Windows Error Reporting sends your crash data.  The crash had happened already here.  Any of the requests prior to this could have crunched IE, either immediately or a short time afterwards.  I have highlighted the prior request, #148, which is to ad.trafficmp.com, a very common ad-serving site.  The requests that came afterwards occurred when I dismissed the error dialog and IE tried to reload the page.
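The reasoning here is simple enough to sketch in code: find the first request to watson.microsoft.com and treat everything before it as a suspect, most recent first.  Only sessions #148 and #149 below come from the actual capture;  the other entries and the suspects_before_crash helper are hypothetical placeholders.

```python
# (session number, host) pairs. Only #148 and #149 are from the real capture;
# the example.test entries are hypothetical placeholders.
SESSIONS = [
    (146, "news.example.test"),
    (147, "images.example.test"),
    (148, "ad.trafficmp.com"),
    (149, "watson.microsoft.com"),
    (150, "news.example.test"),
]


def suspects_before_crash(sessions, marker="watson.microsoft.com"):
    """Return the sessions seen before the first crash report, newest first."""
    for index, (_, host) in enumerate(sessions):
        if host == marker:
            return list(reversed(sessions[:index]))
    return []   # no crash report in this capture


suspects = suspects_before_crash(SESSIONS)
print("Last request before the crash report:", suspects[0])
```

The request immediately before the report is only the likeliest suspect, since the crash can trail the offending request by a moment.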

I’d hoped there was some JavaScript code from that site that would pop out at me as being “bad” (recursive code with a bug, say).  But nothing stood out.

Since I had full dumps of IE during the crash, it was time to run the Windows debugger.  That’s my next post.