Postmortem: Vista Hangs after Three Days due to Taskeng spawning

I’ve finally found the resolution for my Vista hangs after five months.  Did I do the right things diagnosing it?  What could I have done better or, really, quicker?

From the start, it’s obvious in hindsight that the original symptoms—hanging on wakeup and on PDA docking—were very misleading and pointed me towards hardware problems.  Other than the bad memory I replaced (with still no idea why it was bad), my hardware was fine.

I did eventually turn away from hardware towards performance issues as the root cause.  I would have been better off going to Performance Monitor first;  it’s much improved in Vista and very useful.  I should have taken out the kernel tools only if perfmon didn’t turn anything up, but that’s what happens when you start with hardware as an assumption.

Even when I used kernel tools like WinDbg, I missed a big clue.  Here’s the output of !vm, from my most recent crash dump several weeks ago, listing memory in use, most importantly process memory in use:

*** Virtual Memory Usage ***
    Physical Memory:      851636 (   3406544 Kb)
    Page File: \??\C:\pagefile.sys
      Current:   3713744 Kb  Free Space:   1774576 Kb
      Minimum:   3713744 Kb  Maximum:     10219632 Kb
    Available Pages:      176293 (    705172 Kb)
    ResAvail Pages:       618754 (   2475016 Kb)
    Locked IO Pages:           0 (         0 Kb)
    Free System PTEs:      58252 (    233008 Kb)
    Modified Pages:          103 (       412 Kb)
    Modified PF Pages:        56 (       224 Kb)
    NonPagedPool Usage:    22548 (     90192 Kb)
    NonPagedPool Max:     523072 (   2092288 Kb)

Nothing wrong with nonpaged pool usage.

    PagedPool 0 Usage:     18814 (     75256 Kb)
    PagedPool 1 Usage:      7492 (     29968 Kb)
    PagedPool 2 Usage:      4200 (     16800 Kb)
    PagedPool 3 Usage:      3870 (     15480 Kb)
    PagedPool 4 Usage:      3702 (     14808 Kb)
    PagedPool Usage:       38078 (    152312 Kb)
    PagedPool Maximum:    523264 (   2093056 Kb)

Nor with paged pool. Whatever was happening, it wasn’t from my drivers.  Now on to the processes:

    Total Private:       1033896 (   4135584 Kb)
         11310 iexplore.exe     25548 (    102192 Kb)
         0484 svchost.exe      25456 (    101824 Kb)
         111a0 iexplore.exe     25304 (    101216 Kb)
         0f18 explorer.exe     21786 (     87144 Kb)
         03cc svchost.exe      20625 (     82500 Kb)
         11f24 OUTLOOK.EXE      19878 (     79512 Kb)
    

Lots of memory in use, but the processes themselves seem normal so far…

         4d6c taskeng.exe        409 (      1636 Kb)
         43bc taskeng.exe        392 (      1568 Kb)
         9178 taskeng.exe        388 (      1552 Kb)
         3f24 taskeng.exe        388 (      1552 Kb)
         1123c taskeng.exe        386 (      1544 Kb)
         2dd0 taskeng.exe        384 (      1536 Kb)
         2d80 taskeng.exe        384 (      1536 Kb)
         0314 taskeng.exe        383 (      1532 Kb)
         b9cc taskeng.exe        382 (      1528 Kb)
         7184 taskeng.exe        382 (      1528 Kb)
         58a8 taskeng.exe        382 (      1528 Kb)
         1988 taskeng.exe        382 (      1528 Kb)
         [and nearly 1000 more instances!!!…]

That should have tipped me off right there, but it didn’t.
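
In hindsight, tallying the process lines by image name would have made the pattern hard to miss.  Here is a minimal Python sketch of that idea, assuming the process lines from !vm are pasted in as text;  the function name and the regular expression are illustrative, not part of WinDbg.

```python
import re
from collections import defaultdict

# A few process lines copied from the !vm output above:
# <hex pid> <image name> <private pages> ( <private Kb> )
VM_LINES = """\
    11310 iexplore.exe     25548 (    102192 Kb)
    0484 svchost.exe      25456 (    101824 Kb)
    4d6c taskeng.exe        409 (      1636 Kb)
    43bc taskeng.exe        392 (      1568 Kb)
    9178 taskeng.exe        388 (      1552 Kb)
    3f24 taskeng.exe        388 (      1552 Kb)
"""

# Assumed line layout; other !vm lines (pool usage, totals) do not match it.
LINE_RE = re.compile(r"^\s*([0-9a-fA-F]+)\s+(\S+)\s+(\d+)\s+\(\s*(\d+)\s+Kb\)")


def tally_by_image(text):
    """Return {image name: (instance count, total private Kb)}."""
    totals = defaultdict(lambda: [0, 0])
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        image, kb = match.group(2), int(match.group(4))
        totals[image][0] += 1
        totals[image][1] += kb
    return {name: tuple(pair) for name, pair in totals.items()}


if __name__ == "__main__":
    by_count = sorted(tally_by_image(VM_LINES).items(), key=lambda kv: -kv[1][0])
    for name, (count, kb) in by_count:
        print(f"{name:<16} {count:>4} instances {kb:>8} Kb")
```

Each taskeng instance is tiny, which is why none of them stood out in a list sorted by size.  At roughly 1,500 Kb apiece, though, a thousand instances come to about 1.5 GB of private memory.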

For months, I had run Process Explorer and had seen multiple instances of taskeng.  I had always assumed Task Scheduler, rewritten for Vista, had a pool of processes to run tasks, just as IIS does.

I never thought to count the tasks!   Had I only realized that Task Scheduler didn’t work that way!  Remember that Windows Internals covers much more than the kernel and drivers themselves;  it is also about the many services and administration mechanisms inside Windows. 

I hadn’t been as familiar with the newer Vista administration tools, since many of them depend on Windows Server 2008, which I’ve only really gotten used to over the summer on my own SBS box.  Familiar tools like the Event Log console are a little different.  At SATV we’re still on Windows XP/Windows Server 2003 (and even Windows 2000) so part of me is still accustomed to the older platforms.

Before I congratulate myself on sticking with the problem for five months, I should note that most people would have reformatted and reinstalled by now and it would be just another story in the “bad Vista” narrative that’s been in the IT press for three years.  I shouldn’t ever expect patience like that from a computer problem in this fast-paced world.  I’m perhaps too patient.

The other story is that Windows, for all that we bitterly condemn its faults, is remarkably resilient.  I have seen workstations that I was convinced were trashed when I couldn’t reach them from the network, only to find that their users were working away not realizing anything was wrong with their machines (refreshing Group Policy fixed that.)

There are many, many, many Windows installations that are seriously messed up, yet their users have no idea anything’s amiss.  Workstations and laptops shipped with crapware almost certainly qualify as “messed up”, sadly.

We have Macs at SATV that are quirky too.  It’s the reality of using a very complex machine;  do we reformat and rebuild our cars when they don’t start in the morning?

(Obligatory Linux comment:  I can’t see spending the same time with an Ubuntu distribution and having the same success;  most Ubuntu users have to reformat and reinstall regularly when new versions come out, and there is no concept of running your 5-year old program on Linux.  It just isn’t done.)

But, as you can see from the image at the top of this post, everything’s fine.  I have lots of programs open (and lots of tabs in IE) but only about 2G memory in use.  Looks and runs fine now.

What’s the next problem?


Fixing multiple instances of Taskeng that respawn

In my last post, I explained how I found a problem with Task Scheduler that was causing it to respawn multiple instances of the process taskeng.

NickDownUnder in the TechNet forum thread “Help 90+ Taskeng!” had the answer:  This is being caused by IE’s feed reader.  IE has a mechanism and an API for storing and updating RSS feeds, and a task that does the updating, User_Feed_Synchronization.  If this task encounters an error, it will respawn itself every 5 to 10 minutes.  The task runs whether IE is open or not.

The fix, which I’ve adapted from Nick’s instructions, is to delete all the user feed synchronization tasks, disable IE feed updates, and then re-enable them.  Make sure you have a backup or have created a System Restore point first.

Instructions:

  1. From the Start Menu, type task scheduler.
  2. Right-click on task scheduler and select Run as Administrator.  Accept the UAC prompt.
  3. The Task Scheduler console will come up.  In its menu, select View/Show Hidden Tasks.
  4. In the center pane, you should see a list of tasks.  Amongst these, there will be a task (or tasks) named User_Feed_Synchronization-{xxxxxxxx…}, where x is a series of letters and numbers (it’s a GUID).
  5. Right-click on this task (and not on any others!) and select Delete.
  6. Repeat this for all other instances of User_Feed_Synchronization that you find.
  7. Next, go to IE.  Select Tools/Internet Options.  Select the Content tab and under Feeds click Settings.  Under Default Schedule, clear the checkbox marked Automatically Check Feeds for Updates.  Click OK twice.
  8. Reboot.
  9. After reboot and login, go to IE, Tools/Internet Options/Content and reenable feed updates.  You may need to do this for each user on the computer.

In Task Scheduler, you can go to the History tab of each instance of User Feed Synchronization to see if there are any errors;  error messages that recur every 5 minutes are an indication of this problem.  It’s easier to just delete all of those instances and let IE recreate them.
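
If you copy the error times out of the History tab, the “recurs every few minutes” check can be sketched in a few lines of Python.  This is only an illustration:  the function name, the 10-minute threshold and the sample timestamps are assumptions, and nothing here reads the event log itself.

```python
from datetime import datetime, timedelta


def looks_like_respawn_loop(error_times, max_gap=timedelta(minutes=10), min_events=3):
    """True if there are enough errors and every consecutive pair is max_gap or less apart."""
    times = sorted(error_times)
    if len(times) < min_events:
        return False
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return all(gap <= max_gap for gap in gaps)


# Hypothetical timestamps, five minutes apart, as one might copy them from the History tab.
sample = [datetime(2008, 10, 28, 10, minute) for minute in (0, 5, 10, 15, 20)]
print(looks_like_respawn_loop(sample))
```

A handful of errors spread over days would not trip this check;  a steady run of errors a few minutes apart would.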


Vista Hangs After Three Days Uptime: Too Many Processes!

At long last, I get to post a solution!  A few posts ago, I alluded to a bad problem I’ve been having with my Vista workstation involving a memory leak.  I have been working this problem for five months with much frustration along the way.  I want to describe how I found the problem—this will be a long post.  Next post will have the solution, and I’ll have a postmortem too.

Several months ago, my Vista workstation started hanging on me.  This would happen when I woke up the machine from sleep for the day, when I connected to a VPN, and when I put my Windows Mobile PDA in its cradle to sync.  These circumstances informed, or I should say, betrayed my troubleshooting efforts.

First I did the obvious:  pull hardware (my Firewire card that saw little use, my USB card reader) and software that I didn’t use.  Unsurprisingly, no change. 

I focused on USB, thinking that the USB bus was causing my problems.  I applied a patch from Microsoft and a registry fix (KB953367) that purported to improve USB reliability.  No changes.

Drivers:  Eventually, over the course of several months, I refreshed all the drivers in my machine.  Good for my machine, but not for my problem.

I tried every debugging tool I knew of (WinDbg, Kernrate, Xperf) and a few that I didn’t.  When my system hung on wakeup, I had to do the “Crash on Ctrl-Scroll” trick to bluescreen the machine on purpose so I could get a dump.  The dumps I got didn’t yield much information, or rather yielded too much.  Sometimes I wouldn’t get a dump at all.  Hardware? 

I had gotten several real bluescreens throughout.  Memory?  I got new memory for my SBS box and put the old memory in the workstation.  I had 3G in the machine (2 1G and 2 512M sticks) and swapped it out for 4 1G sticks making 4G (or 3.5G since I still run 32-bit Vista).  No crashes.  I put off that motherboard I was about to buy.  But still hangs.

I couldn’t get anything meaningful from the flood of data from the kernel tools that I tried.  I couldn’t use Process Explorer—it crashed too!

I tried a different tack:  Performance Monitor (perfmon.msc).  I monitored some counters, and found an interesting trend.  You can see it at the top of this post.  Over the course of 12 hours or so, my Process Total Virtual Bytes went from 17,990,000,000 bytes, an already very high value, up to 48,067,000,000 bytes.  (!!)
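
To put a number on that trend, here is the arithmetic as a small Python sketch.  The two byte counts are the ones quoted above;  treating “12 hours or so” as exactly 12 hours is an assumption, so the result is a rough rate at best.

```python
def growth_per_hour(start_bytes, end_bytes, hours):
    """Average growth in bytes per hour between two counter readings."""
    return (end_bytes - start_bytes) / hours


start = 17_990_000_000   # first Total Virtual Bytes reading
end = 48_067_000_000     # reading about 12 hours later
rate = growth_per_hour(start, end, 12)

print(f"grew by {(end - start) / 1e9:.1f} GB, about {rate / 1e9:.1f} GB per hour")
```

Keep in mind that this counter adds up virtual address space across all processes, so it is not a measure of physical memory in use.  It is the slope that matters here, not the absolute figure.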

I had other graphs (unfortunately not saved) that showed a very high ski-slope of virtual memory usage over three days.  It’s probably best I don’t remember the exact values, but when I could get Process Explorer to work, it reported 4.0G of virtual space in use during a “normal” session where I had a few tabs of IE and Outlook open.

I had a memory leak. 

That was one of my first breaks in the case.  The second came when I was looking at user profiles.

I live alone.  Despite having a full SBS 2008 server in the house, it only has one user, and my workstation has only one user.  I do, as is good practice, run as a regular user and have another admin account. 

I logged in as admin and let the system idle overnight while monitoring virtual memory. 

The memory graph looked reasonably flat.

That ruled out, for the most part, the kernel, the drivers, and nearly every service from guilt.  Windows was mostly not corrupted.  Was it my profile?

I logged back on as my regular user (with an elevated command prompt, as I usually do) and got my last and biggest break.  I had been using PowerShell to take snapshots of my process activity, since I couldn’t run Process Explorer.  I had suspected that Explorer or some other process was leaking and wanted before and after snapshots to compare in Excel.  This is the command I used:

get-process | sort VirtualMemorySize | export-csv vmsnap.csv

After my system had been up for a while, I took my second snapshot and noted something strange:

PS C:\temp> dir vmsnap*


    Directory: Microsoft.PowerShell.Core\FileSystem::C:\temp


Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---        10/27/2008   9:28 PM      94920 vmsnap.csv
-a---        10/28/2008  10:11 AM     236901 vmsnap2.csv

WTF?!  One snapshot’s that much bigger than the other? 

Only way that could happen is if there were way more rows in the second snapshot, meaning more processes.  I logged off and back on to my “good” admin account and counted processes, like this:

(get-process).Count
71

This is the normal number of running processes in a “good” system.  I restarted my machine and logged back on to my regular user and repeated that command.  Same result.  Now I waited. 

Two days later, I repeat the command:

(get-process).Count
880
 

(!!!!!!)  Mommy!

Was it a virus?  I know of “fork bombs”, and was about to try Rootkit Revealer to see if my machine was infected (I have never had a malware infection on any machine I owned), but I looked at my process list and saw something else:  a zillion instances of taskeng, the Microsoft Task Scheduler, rewritten for Vista.  This command line shows it nicely:

(get-process | where {$_.Name -eq "taskeng"}).Count
817

Eeeepp!
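
The same answer was sitting in the two CSV snapshots all along.  As a rough sketch of how to get it out, this Python snippet counts rows per process name in each snapshot and reports which names grew the most.  The ProcessName column and the #TYPE first line are what I would expect from Export-Csv, but treat both as assumptions, and note that the sample data is made up and heavily trimmed.

```python
import csv
import io
from collections import Counter


def count_by_name(csv_text, column="ProcessName"):
    """Count rows per process name in Export-Csv output."""
    # Older PowerShell versions write a '#TYPE ...' line before the header row.
    lines = [ln for ln in csv_text.splitlines() if not ln.startswith("#TYPE")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    return Counter(row[column] for row in reader)


def biggest_growth(before, after, top=5):
    """Names whose instance count rose the most between two snapshots."""
    # Counter subtraction keeps only the names whose count went up.
    return (after - before).most_common(top)


# Hypothetical snapshots; real ones have dozens of columns per process.
BEFORE = '''#TYPE System.Diagnostics.Process
"ProcessName","VirtualMemorySize"
"explorer","100"
"taskeng","200"
'''
AFTER = '''#TYPE System.Diagnostics.Process
"ProcessName","VirtualMemorySize"
"explorer","100"
"taskeng","200"
"taskeng","200"
"taskeng","200"
"taskeng","200"
'''

print(biggest_growth(count_by_name(BEFORE), count_by_name(AFTER)))
```

For real files, pass in the file contents, for example count_by_name(open("vmsnap.csv").read()).  Comparing counts rather than per-process sizes is the whole point:  a leak spread over hundreds of small processes never shows up in a list sorted by size.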

Task Scheduler was spawning tasks almost as fast as it could.  I didn’t see this when I looked at memory per-process since I’d been thinking of a leak within a process and not a spawning process, but it was a perfect explanation!  No wonder Process Explorer crashed—it was starved for memory to begin with, never mind when it had to reserve memory to display all those processes!

It explains the hang on wakeup from sleep, since Vista has to notify all the processes upon wake.  It doesn’t explain the hangs when I sync to my PDA or connect to a VPN, but I’m assuming these were caused by a low resource condition. 
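
For a sense of how fast the processes were piling up, here is a back-of-the-envelope calculation in Python using the two process counts above.  It assumes that “two days later” means exactly 48 hours and that the growth was steady, neither of which I measured.

```python
def minutes_per_new_process(count_start, count_end, elapsed_days):
    """Average minutes between new processes, assuming steady growth."""
    new_processes = count_end - count_start
    return elapsed_days * 24 * 60 / new_processes


# 71 processes after a fresh boot, 880 about two days later.
print(round(minutes_per_new_process(71, 880, 2), 1))
```

That works out to a new process every few minutes, around the clock, which fits a low-resource condition building up over a couple of days.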

I did some digging to find out exactly what Task Scheduler was doing.  That’s my next post.


Vista Hotfix Available for Memory Leak

If you have a memory leak in your Vista SP1 system, hotfix KB949700 might help you.  It seemed to work at first for a very frustrating memory problem that I have been fighting for four months.  Sadly, I still have the problem;  the hotfix just pushed it off a bit.  This is worth a blog post if I can ever find out what’s wrong.

Hat tip to the Virtualbox forums for this one.


I passed my Windows Internals exam!

A few months ago, I got a tip from one of the blogs I read about a new Windows Internals beta exam.

In my years as an IT professional specializing in Microsoft, I have never sought certification.  It’s expensive to study and sit for many certification exams, and I have a very limited training budget.  Our executive director at SATV has never pressed me about formal certification, although he once pursued an MCSE in the Windows 2000 era (but never sat for the exams.)

But Microsoft offers beta exams at no charge, and full credit if you pass them.  So this was a good opportunity.  I’ve always been good at digging into WinDbg and looking at crash dumps.  I’ve had too many real-life problems to solve with my own personal machines, never mind SATV’s, not to be familiar with the internal workings of Windows.

This exam got a blog post with an amusing title: "Microsoft is developing a new, super hard certification test":

This exam is not intended for the great networker masses, it is aimed at high level engineers who have extensive and in-depth knowledge of windows and the windows architecture. These are the folks who find the deep errors and faults that the rest of us can’t. From the Microsoft website, the preparation for this test involves a thorough knowledge of the PSTools developed by Mark Russinovich (you might remember him from Sony’s root-kit debacle a few years back). These tools enable you to delve deeper into the operating system than you are able using the built-in tools.

I think this is going to be a very interesting exam as it will definitely separate the geek from the uber-geek. I hope that this is the start of more tests in this same genre (maybe not just on networking) – pushing the level of knowledge (and having a certification to prove this knowledge) and perhaps encouraging others to push themselves as well.

I did have to study all of the tools I had heard of (including PsTools, with which I am quite familiar) and a few that I had not.  I slept with Windows Internals every night.  I ran Windows in a virtual machine and crashed it so I could connect a debugger to it from the host machine. 

I’m supposed to say I found it hard to sit in the Prometric center in Brookline and take the test, but it went by very fast.  Some questions tripped me up because I’m a system administrator by trade, not a programmer (though I have a BSCS and have basic familiarity with modern programming concepts.)

But the test paled in comparison to some of the really difficult Windows crash/bug problems I’ve had to work on.  Sadly, I am working on several very frustrating Windows problems at the moment that have eluded me;  they are the sort of problems that make you try everything, that lead you into corners of Windows with obscure log files, DLLs you never heard of, and tools you never knew existed.

If I could only solve these, Microsoft would have to give me the certification!

But no need.  This morning at 12 AM local, I got the email that Windows certification candidates hope for,  “Congratulations on passing your recent Microsoft Certification exam, inspiring confidence for your employer, your peers, and yourself with a widely-recognized validation of your skills on Microsoft technology…”

So, I passed:

I’m not done yet.  A few weeks after I took the Internals exam, I got an invitation to the SBS 2008 certification test beta and sat for that exam and am waiting to hear.

So everybody send me your crash dumps for debugging!  I will “!analyze -v” for food!  (In-joke;  !analyze is the command in WinDbg that attempts to analyze Blue Screens of Death, but not plaid ones!)


Fixing Windows Media Services behind a Firewall with SBS 2008

For as long as I’ve run a personal SBS server, I have run a Windows Media feed of my local police radio traffic, using a scanner plugged into my server.  For the better part of a year, I haven’t been able to do this;  everything bad that could happen to my feed has happened, from a broken radio to a broken server, and now Windows Media Services 9.5 is broken!

WMS 9.5 is part of Windows Server 2008 (and SBS 2008) and blogger Random explains the problem:

I’ve talked to a number of people that are having problems streaming from Windows Media Services 2008 on Windows Server 2008 when the server is behind a NAT firewall, some proxies, or load balancers. Clients on the internal network work just fine. In a network trace you’ll see at WMS is returning a 503 Service Unavailable as the first response to the RTSP DESCRIBE or HTTP GET.

WMS is doing a DNS query for the domain name in the RTSP or HTTP request when the request is not the local NetBIOS, local DNS, or local IP. If a WMP client is requesting content through a NAT or similar device (such as some proxies and load balancers that hides or translate an external URL to an internal URL), the requested address might be something like mms://streaming.contoso.com/live. However internally the server name might be WMS01 or WMS01.corp.contoso.com. This generally is only going to happen if you’re using Network Address Translation.

WMS 9.5 now contains a cache/proxy in the box. What’s happening is when WMS doesn’t recognize the requested URL as itself it is assuming that the request may be a proxy request. This happens even if the cache/proxy plug-in is disabled. Because internal clients would use the internal IP, NetBIOS, or DNS name, the server recognizes those request as intended for the local server itself.

The workaround is to edit the HOSTS file on the WMS server, in c:\windows\system32\drivers\etc\ and add the following line:

127.0.0.1   <public address you are streaming from>

In my case this is:

127.0.0.1   n1kgh.gotdns.org

This address is a dynamic IP I maintain with dyndns.org (and my ham radio call!)
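
If you would rather script the HOSTS edit than open the file in Notepad, something like the following Python sketch would do it.  The function name is made up for illustration, and the demo writes to a temporary file instead of the real HOSTS file;  pointing it at the real one requires an elevated prompt and a backup first.

```python
import tempfile
from pathlib import Path


def ensure_hosts_entry(hosts_path, hostname, address="127.0.0.1"):
    """Append 'address hostname' unless hostname already has an entry.

    Returns True if the file was changed. An existing entry is left
    alone even if it points at a different address.
    """
    path = Path(hosts_path)
    text = path.read_text() if path.exists() else ""
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments
        if len(fields) >= 2 and hostname.lower() in (f.lower() for f in fields[1:]):
            return False
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + f"{address}   {hostname}\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        demo = Path(folder) / "hosts"
        demo.write_text("127.0.0.1   localhost\n")
        print(ensure_hosts_entry(demo, "n1kgh.gotdns.org"))  # first call adds the line
        print(ensure_hosts_entry(demo, "n1kgh.gotdns.org"))  # second call finds it already there
```

Checking for an existing entry first keeps the file from collecting duplicate lines if the script is run more than once.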

It works fine.  In fact, I have had excellent uptime, as seen in the screenshot above;  my stream ran for 15 days, only interrupted by a patch.

You can hear the stream on Windows Media at mms://n1kgh.gotdns.org/salemscanner (it includes Salem police, fire and surrounding communities).