December 13, 2004

How to handle and interpret BSODs on Windows

Today I read a post by Susan in which she makes some good recommendations on how to get a dump when a system crashes. I found a few of her points could be expanded upon and slightly modified, and started to comment on her blog. Then I realized it might be useful to others in the community, and decided to post it here.

There are a couple of things to consider when dealing with a kernel crash dump. You should read Susan's article first in order to understand her recommendations on setting up kernel dumps on Windows.

  1. Turn OFF the "Automatically restart" checkbox. When the system BSODs you will then be able to SEE the stop screen. This is important because you can typically SEE the driver that is causing the fault. More importantly, it lets you see the TYPE of fault. If you don't do this, the system will automatically restart and you will miss this vital information. NOTE: If this is a headless server hosted somewhere, you may not wish to do this. The system will not reboot until someone resets it, which means it will sit at the stop screen indefinitely.
  2. Turn OFF the "Overwrite an existing file" checkbox. Otherwise, if you have multiple faults you will never know about them, since each new crash dump overwrites the previous one. In some cases where "Paged Pool" is corrupted by an offending driver, trying to rectify the problem before the dump can be pushed to OCA (Microsoft's Online Crash Analysis) may trigger a secondary fault that is harder to track down. If you overwrite the file, you will be out of luck in comparing the dumps.
  3. In a pinch a minidump is fine. 9 times out of 10 having the complete dump of memory is useless unless the debug team needs it to debug virtual memory issues (like BAD_POOL_HEADER faults). A little-known fact is that the data sent off to OCA is actually a bastardization of the minidump. Code on the OCA server runs it through a filtering algorithm to determine the true cause of the failure, and then places it into a bucket for each vendor. If they need more information (like the full dump) you get routed to a special web page and are asked to upload it.
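
For those who would rather check these settings from a script: the checkboxes above are stored as values under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl. Here is a minimal Python sketch of how you might sanity-check them. The helper names and the sample dictionary are hypothetical, and the code only inspects values you hand it rather than reading the live registry.

```python
# Meaning of the CrashDumpEnabled value under
# HKLM\SYSTEM\CurrentControlSet\Control\CrashControl.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (64 KB)",
}


def describe_dump_type(value):
    """Translate a CrashDumpEnabled value into a readable name."""
    return DUMP_TYPES.get(value, "unknown")


def review_crash_control(values):
    """Return warnings for a dict of CrashControl values.

    Assumption for this sketch: a missing AutoReboot or Overwrite value
    is treated as "on", and a missing CrashDumpEnabled as "none".
    """
    warnings = []
    if values.get("AutoReboot", 1) != 0:
        warnings.append("AutoReboot is on: the stop screen will not stay visible")
    if values.get("Overwrite", 1) != 0:
        warnings.append("Overwrite is on: an earlier dump file may be replaced")
    if values.get("CrashDumpEnabled", 0) == 0:
        warnings.append("CrashDumpEnabled is 0: no dump will be written")
    return warnings


# Hypothetical sample: settings that follow the three points above.
sample = {"AutoReboot": 0, "Overwrite": 0, "CrashDumpEnabled": 3}
print(describe_dump_type(sample["CrashDumpEnabled"]), review_crash_control(sample))
```

On a real box you would fill the dictionary from the registry (for example with Python's winreg module) before calling review_crash_control.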

If you want to get geeky and LOOK at the fault (in case you skipped point one and the system rebooted), it is possible. To do this you will need to download Microsoft's kernel debugger called WinDbg. Then take the following steps:

  1. Start WinDbg
  2. Create a directory in the root of C: and call it "symbols" (mkdir c:\symbols)
  3. Click on File, Symbol File Path. Here you will enter the symbol path, which is needed to effectively read debug information. Since Microsoft makes its public symbol server available to us, let's use it.

    The path will be: SRV*c:\symbols*http://msdl.microsoft.com/download/symbols

  4. Once you apply the symbol path, select the "Reload" checkbox and then hit Ok.
  5. Now go to File, Open Crash Dump and load the file you wish to view. The minidump will typically be located in %systemroot%\minidump (in my case C:\windows\minidump).
  6. When you hit Ok, the debugger will contact the symbol server, sync the data and load the debug screen with the data that it can read.
  7. Now at the debugger's command prompt type !analyze -v

You now have a full analysis of what has faulted, with all the information you need. At this point you can check information on the stack, look at the last instructions before the fault and do all the magic a kernel guy does. Chances are, most of the information will be foreign to you. Don't fret. That's what people like me are for. Just look for something that says:

Probably caused by : foodriver.sys  ( foodriver+4f20 )

Chances are, THAT is your offending driver (and the offset is the location, so if you let the vendor know they will know right down to the line number).
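
If you have a pile of dumps to go through, the same steps can be scripted with the command-line debugger kd, which takes the dump file with -z, the symbol path with -y and a startup command with -c. Below is a rough Python sketch under a few assumptions: kd is on your PATH, the helper names are my own invention for illustration, and the sample text is just the line shown above rather than real debugger output.

```python
import re

SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir):
    """Build the SRV*<local cache>*<server> symbol path used in step 3."""
    return "SRV*{}*{}".format(cache_dir, SYMBOL_SERVER)


def kd_command(dump_file, cache_dir=r"c:\symbols"):
    """Argument list that opens a dump, runs !analyze -v and quits.

    Assumes kd.exe from the Debugging Tools for Windows is on the PATH.
    """
    return ["kd", "-z", dump_file, "-y", symbol_path(cache_dir),
            "-c", "!analyze -v; q"]


# Matches e.g. "Probably caused by : foodriver.sys  ( foodriver+4f20 )"
_CAUSE = re.compile(
    r"^Probably caused by\s*:\s*(\S+)(?:\s*\(\s*(.*?)\s*\))?",
    re.MULTILINE,
)


def probable_cause(analyze_output):
    """Return (module, location) from !analyze -v text, or None if absent."""
    match = _CAUSE.search(analyze_output)
    if match is None:
        return None
    return match.group(1), match.group(2)


sample = "Probably caused by : foodriver.sys  ( foodriver+4f20 )"
print(probable_cause(sample))
```

To run it for real you would pass kd_command(...) to subprocess.run with capture_output=True and feed the captured text to probable_cause. Treat the result as a place to start looking, not as a verdict.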

What do you do now? Well, remove it. If the driver loads on demand, not a problem: you can simply remove the offending software. But what if the driver loads at boot time? You will never get the chance to do that, since the system will constantly crash on startup!

That is why the "Recovery Console" exists. Stick in your W2K3 or XP CD and boot from it. When asked, hit "R" for Recovery Console. When prompted, type in your Administrator password. Once logged into the console, type disable <drivername> (the listsvc command will show you the service and driver names if you are unsure). This will set the driver to NOT load on boot. You should then be able to boot up and remove the offending driver.
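
For the curious, whether a driver loads on demand or at boot comes down to the Start value under HKLM\SYSTEM\CurrentControlSet\Services\<drivername>, and a driver that is set to NOT load has a Start value of 4. A tiny Python sketch of that mapping follows. The function name is made up for illustration, and nothing here touches the registry.

```python
# Start values found under
# HKLM\SYSTEM\CurrentControlSet\Services\<drivername>.
START_TYPES = {
    0: "boot",        # loaded by the boot loader
    1: "system",      # loaded during kernel initialization
    2: "automatic",   # started automatically during system startup
    3: "on demand",   # started only when something asks for it
    4: "disabled",    # never started
}


def loads_at_startup(start_value):
    """True if a driver with this Start value comes up without being asked."""
    return start_value in (0, 1, 2)


for value, name in sorted(START_TYPES.items()):
    print(value, name, loads_at_startup(value))
```

A faulting driver with Start 0, 1 or 2 is the kind that keeps crashing the box on the way up, which is exactly the case the Recovery Console is for.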

Susan, tell your friend there is no need to reformat and reinstall. Just use the tools available to you to remove the offending driver!

Posted by SilverStr at December 13, 2004 07:37 AM | TrackBack
Comments

I have taken on-board your advice about turning off the Auto restart and Overwrite - sound advice now that you have explained it in a logical fashion.
But I have a query - you say a mini-dump is probably sufficient. Is that the same as Small memory dump (64 KB)? What is the use of the Kernel memory dump?
Anyway, with my server setup (SBS2003), despite what I thought was a big enough partition of 15GB, I am now faced with free space of only just over 4GB, and I have 4GB of RAM. So in theory I could have done only one full memory dump before I ran out of space (which, by the way, is not shown as an option on my server, as I had set the pagefile at 1GB on C: and the rest on another drive).
Presumably, if the system had tried to dump another full memory dump after a second crash (and overwrite is off), it would have failed in that process and possibly introduced more complications when trying to get the server up and running again, due to the lack of any free space.

Posted by: Richard Cass at December 13, 2004 08:55 AM

Hi Dana. Great post.
A few comments....

First, small typo: I would point out that your command "analyze -v" should really be "!analyze -v" unless there is !analyze that does not require the ! which I am not familiar with (entirely possible, but at least in kd that's what I'm seeing in the dump I have in front of me).

Secondly, on your comment "Chances are, THAT is your offending driver (and the offset is the location, so if you let the vendor know they will know right down to the line number)." It is not really right to say that the line # cited in the crash is root cause of the issue (even if it is the right driver, which it may or may not be). Reason is, perhaps there was previous corruption which a later line of code fell victim to. See issues that need pool tags for at least one example of what I mean.

Third, I'd add that there are concerns in creating dmps depending upon size of your paging file and amt of physical memory on the box. There is some documentation out there on this, so I won't recite it all here. But it is worth remembering.

Finally, I'd say windbg is not our kernel mode debugger. Rather, a DLL (or a few DLLs) really act as the debugger. Windbg is a GUI for it. I point that out as I do both user and kernel mode debugging, but don't use windbg (I use cdb/kd). :)

Posted by: Eric Fleischman at December 13, 2004 09:24 AM

Richard,

The small memory dump is the mini dump. That will be sufficient for most cases. 64K doesn't normally gobble up too many resources :)

In your example of 4GB of RAM, remember that the dump is of KERNEL memory, and not unallocated or usermode memory. The normal rule of thumb is that approximately one third of your physical memory will be available for the paging file, up to about a gig. This would typically range between 100MB and 800MB depending on the system. In your case, I would guess it would be up around the 800MB-1GB range when in full use.

Posted by: SilverStr at December 13, 2004 09:38 AM

Hey Eric,

Very good points. I have corrected the typo. You are right that the bang (!) is needed.

As to your point about the line # cited... although you cannot predict if it IS the offending line (especially with pool corruption issues), it WILL give you an area to look at. I have found that in combination with other information (such as !pool) you can walk the data and figure out if someone trampled the pool header, and where it's coming from. Typically though that is done on a live dump, not on a post-analysis from a kernel dump.

I just commented to Richard about the paging file issue with physical memory. Basically the issue exists that under normal situations the kernel memory dump will be about one third of the physical memory, depending on allocation needs and optimization techniques. And chances are really good that this info isn't needed to find the offending driver. Of course if life was full of absolutes, we wouldn't be having this discussion. There WILL be times when you want that memory dump. And you will need to make the room for the dump.

And finally, I will concede that WinDbg is the GUI front end to Microsoft's debug libraries. A really nice one if you ask me, especially with the integrated help which has FINALLY become very useful in the last year or so. As much as I am a cmd line guy, I prefer WinDbg over kd. Just preference I suppose.

Good comments. Thanks for posting!

Posted by: SilverStr at December 13, 2004 09:47 AM

Ugh all you crazy UI people. If it can't be done off of the command line debuggers, I am uninterested!!! :)
I am in the minority of a lot of 'folks around here I'll grant, but I do use the command line almost exclusively. (I say almost as when I teach new 'folk, I try to use the UI as I know that is what they will probably use eventually.....but I can rarely show them UI-only features :))

The live scenario is great for you and me to discuss, not as great for a customer. Most mission critical systems feel pain for a single reboot for the dump itself, never mind the prospect of live debugging it (read: hours of time in some cases instead of the min amt of time for writing the dump + reboot). So I try and stay away from live unless absolutely required.

Posted by: Eric Fleischman at December 13, 2004 03:55 PM