Fixing Virtual Machine Crashes

This article describes some steps you can follow and tools you can use for troubleshooting why a virtual machine crashes. As more and more organizations migrate their physical servers to virtual machines, the types of problems once associated with physical servers are becoming increasingly common on virtual machines in enterprise environments. For example, in the old days system administrators sometimes had to wrestle with physical servers crashing with the notorious Blue Screen Of Death (BSOD).

Also known as a stop screen, this blue screen provides only minimal information about what might have caused the problem, but in some circumstances it is enough to help you identify the cause of the crash. If your server (either physical or virtual) is running Windows Server 2008 or Windows Server 2008 R2, a good source of information on how to troubleshoot BSODs is Chapter 32, "Troubleshooting Stop Messages," of the Windows 7 Resource Kit from Microsoft Press. Most of the information in that chapter is still relevant for Windows Server 2012 and Windows Server 2012 R2 as well.

What if the information on a stop screen isn’t sufficient to help you troubleshoot why your virtual machine crashed? Or what if your server simply hangs and becomes unresponsive? In that case you can still see the desktop user interface of the Windows Server operating system, but pressing keys or moving the mouse produces no response (the desktop has frozen). In such cases you’ll need a memory dump, which is basically a binary file containing the contents of a portion of your server’s memory at the time it crashed. Windows Server 2008 and Windows Server 2008 R2 provide four options for configuring memory dumps:

None – No memory dump occurs when the server crashes.
Kernel – Only the kernel memory is dumped.
Small – Also known as a minidump, this memory dump contains the smallest set of information that might be useful for helping to identify the problem.
Complete – Everything in system memory is dumped.

Complete memory dumps can be huge and require that your paging file be at least 1 MB larger than the amount of memory in the server. Complete dumps are always named Memory.dmp and are found in the %SystemRoot% folder, and each new complete dump overwrites any previous complete dump. Kernel memory dumps are also named Memory.dmp and are likewise found in the %SystemRoot% folder.
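As a rough illustration of the paging file rule above, the following Python sketch computes the minimum paging file size for a complete dump as installed memory plus 1 MB. The function names are hypothetical, and the calculation only restates the rule given in this article; it does not query Windows.

```python
MB = 1024 * 1024  # bytes in one megabyte


def min_pagefile_bytes_for_complete_dump(ram_bytes: int) -> int:
    """Return installed memory plus 1 MB, the minimum stated in the text."""
    if ram_bytes <= 0:
        raise ValueError("ram_bytes must be positive")
    return ram_bytes + MB


def pagefile_is_large_enough(pagefile_bytes: int, ram_bytes: int) -> bool:
    """True if the paging file meets the complete-dump minimum."""
    return pagefile_bytes >= min_pagefile_bytes_for_complete_dump(ram_bytes)


if __name__ == "__main__":
    ram = 16 * 1024 * MB  # a server with 16 GB of memory (example value)
    needed = min_pagefile_bytes_for_complete_dump(ram)
    print(f"Paging file must be at least {needed // MB} MB")
```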

At the other end of the spectrum is the small memory dump, and each time a small dump occurs a new file with a time-coded name of the form MiniMMDDYY-NN.dmp is created on your system in the %SystemRoot%\Minidump folder. More information on these different types of memory dumps can be found at http://support.microsoft.com/kb/254649.
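If you need to sort or filter a folder full of small dumps, the MiniMMDDYY-NN.dmp naming pattern can be parsed with a few lines of Python. This is only a sketch: the function and type names are hypothetical, and the two-digit year is returned as-is rather than guessing a century.

```python
import re
from typing import NamedTuple


class MinidumpName(NamedTuple):
    month: int
    day: int
    year2: int      # two-digit year exactly as it appears in the file name
    sequence: int   # the NN counter that follows the date


# MiniMMDDYY-NN.dmp, matched case-insensitively as Windows file names are
_PATTERN = re.compile(r"^Mini(\d{2})(\d{2})(\d{2})-(\d{2})\.dmp$", re.IGNORECASE)


def parse_minidump_name(filename: str) -> MinidumpName:
    """Split a small-dump file name into its date and sequence parts."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a minidump file name: {filename!r}")
    month, day, year2, sequence = (int(g) for g in match.groups())
    if not (1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError(f"implausible date in file name: {filename!r}")
    return MinidumpName(month, day, year2, sequence)


if __name__ == "__main__":
    print(parse_minidump_name("Mini022514-01.dmp"))
```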

Note that Windows Server 2012 introduced a new type of memory dump called Automatic.

However, it turns out that this new option still produces a Kernel dump. The difference lies in the paging file: with Automatic selected, Windows can use a smaller system-managed paging file than previous Windows Server versions needed for a Kernel dump. For more information about this new option, see http://blogs.technet.com/b/askcore/archive/2012/09/12/windows-8-and-windows-server-2012-automatic-memory-dump.aspx.

Bugcheck the server

If the Windows Server operating system on your server freezes and becomes unresponsive to user input, you can try to forcibly bugcheck the operating system. This deliberately crashes the system, which generates a *.dmp file in your %SystemRoot% folder or its Minidump subfolder, depending on the type of memory dump you’ve configured on the server. Before you do this, however, make sure your server is configured to generate either a Complete or Kernel (Automatic) memory dump, because the amount of information in a Small dump is often insufficient to isolate the reason for a system crash (and be sure to reboot your server after making such a change). Also ensure that there is sufficient free disk space on your server for the memory dump file to be created, either by freeing up space on the existing drives or by adding an additional drive (or VHD/VHDX if the server is a virtual machine) to hold the memory dump file.
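The free-space check described above can be scripted. The sketch below uses Python’s standard shutil.disk_usage to compare free space on the target volume against the size you expect the dump to reach. The function name, the example path, and the choice of expected size are assumptions for illustration; for a Complete dump a reasonable estimate is the amount of memory in the server.

```python
import shutil


def has_room_for_dump(target_dir: str, expected_dump_bytes: int,
                      margin_bytes: int = 0) -> bool:
    """Return True if the volume holding target_dir has enough free space.

    expected_dump_bytes is the caller's estimate of the dump size;
    margin_bytes is optional extra headroom.
    """
    if expected_dump_bytes < 0 or margin_bytes < 0:
        raise ValueError("sizes must not be negative")
    free = shutil.disk_usage(target_dir).free
    return free >= expected_dump_bytes + margin_bytes


if __name__ == "__main__":
    GB = 1024 ** 3
    # Example: check the current drive for a 16 GB dump plus 1 GB of headroom
    print(has_room_for_dump(".", 16 * GB, margin_bytes=1 * GB))
```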

If the problem is reproducible (that is, if you can consistently place your server into an unresponsive state by performing certain actions on it), then you can edit the registry to allow you to manually crash the server from the keyboard. The relevant registry value to create is named CrashOnCtrlScroll, is of type REG_DWORD, and should be set to 0x01. For a PS/2 style keyboard, you create this registry value under the following key:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\i8042prt\Parameters

For a USB keyboard, however, you need to create it here:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\kbdhid\Parameters

Once the relevant registry value has been created and set to 0x01, you can manually crash the server by holding down the right CTRL key and pressing the ScrLk (Scroll Lock) key twice. A bugcheck (crash) occurs and a memory dump is written to disk.

If, however, your server is a Generation 2 virtual machine running on a Windows Server 2012 R2 Hyper-V host, you’ll need instead to create the CrashOnCtrlScroll registry value in the guest operating system in this location:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\hyperkbd\Parameters

Either way, be sure to reboot the operating system (the guest operating system, in the case of a virtual machine) once you have made any of the above registry changes so they can take effect.
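The three registry locations above differ only in the keyboard driver service name, so one way to avoid typing mistakes is to generate a .reg file from a small lookup table. The Python sketch below does this; the short keyboard labels ("ps2", "usb", "hyperv_gen2") and the function name are hypothetical, while the key paths and the CrashOnCtrlScroll value come from the text. Review the output before importing it on a server.

```python
# Keyboard type (hypothetical labels) -> driver service name from the article
SERVICE_BY_KEYBOARD = {
    "ps2": "i8042prt",
    "usb": "kbdhid",
    "hyperv_gen2": "hyperkbd",
}


def crash_on_ctrl_scroll_reg(keyboard: str) -> str:
    """Return .reg file text that sets CrashOnCtrlScroll to 1."""
    try:
        service = SERVICE_BY_KEYBOARD[keyboard]
    except KeyError:
        raise ValueError(f"unknown keyboard type: {keyboard!r}") from None
    key = ("HKEY_LOCAL_MACHINE\\System\\CurrentControlSet\\Services\\"
           + service + "\\Parameters")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        '"CrashOnCtrlScroll"=dword:00000001',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(crash_on_ctrl_scroll_reg("hyperv_gen2"))
```

Saving the returned text to a file and choosing its encoding are left to the caller. The reboot requirement described above still applies after the value is imported.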

NMI

If the CTRL+ScrLk+ScrLk method described above won’t work for some reason, you may be able to use the Non-Maskable Interrupt (NMI) switch that causes an NMI on the system processor if this functionality is available on your server system. For more information on this, see http://support.microsoft.com/kb/927069.

Debug-VM

If your virtual machine is running on a Windows Server 2012 R2 Hyper-V host, you can use the new Debug-VM cmdlet of Windows PowerShell on the host to bugcheck the virtual machine and generate a dump file. This cmdlet injects a non-maskable interrupt (NMI) into the guest operating system of the virtual machine, which results in a bugcheck of the operating system. For the syntax and examples of using this cmdlet, see http://technet.microsoft.com/en-us/library/dn464280.aspx.
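If you drive Hyper-V hosts from a script, the Debug-VM call can be assembled programmatically. The Python sketch below only builds the command line and does not run anything. The function name and the VM name "web01" are hypothetical, and the -InjectNonMaskableInterrupt and -Force parameters should be confirmed against the cmdlet reference linked above for your version of the Hyper-V module.

```python
from typing import List


def debug_vm_nmi_command(vm_name: str) -> List[str]:
    """Build a powershell.exe argument list that injects an NMI into a VM.

    Nothing is executed here; the caller decides whether and where to run it
    (it must run on the Hyper-V host with sufficient privileges).
    """
    if not vm_name.strip():
        raise ValueError("vm_name must not be empty")
    # PowerShell single-quoted strings escape a quote by doubling it
    quoted = "'" + vm_name.replace("'", "''") + "'"
    script = f"Debug-VM -Name {quoted} -InjectNonMaskableInterrupt -Force"
    return ["powershell.exe", "-NoProfile", "-Command", script]


if __name__ == "__main__":
    # Print the command for review instead of crashing a VM by accident
    print(debug_vm_nmi_command("web01"))
```

Keeping the build step separate from execution makes it possible to log or review the exact command before deliberately crashing a guest.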

VM2DMP

If your virtual machine is running on a Windows Server 2008 or Windows Server 2008 R2 Hyper-V host, you can use the Hyper-V VM State to Memory Dump Converter tool (VM2DMP) to perform a complete memory dump even if a different (or no) memory dump option is configured in the guest operating system. You simply save the state of your unresponsive virtual machine, obtain VM2DMP, and follow the instructions in http://blogs.technet.com/b/virtualworld/archive/2010/02/02/vm2dmp-hyper-v-tm-vm-state-to-memory-dump-converter.aspx.

There’s one problem however: the above blog post says you can obtain VM2DMP from the MSDN Archive Gallery at http://code.msdn.microsoft.com/vm2dmp but it’s no longer available there because it was never a supported tool in the first place. So if you need it you’ll either have to get it from a trusted colleague or download it from a trusted third-party source somewhere. Unfortunately I can’t recommend third-party sources so you’re on your own here.

Another problem is that VM2DMP is not compatible with the saved state format used by Windows Server 2012 or later Hyper-V, which leads us to the next topic.

LiveKd

If your virtual machine is running Windows Server 2012 or later, or if it’s running an earlier version of Windows Server and you don’t have access to the VM2DMP utility, you could use LiveKd, a Windows Sysinternals utility, to run the Windows debugging tools (Kd and Windbg) on your Hyper-V host and use it to generate a memory dump of your virtual machine. This approach has a few downsides however:

You’ll need to understand how to use the Windows debugging tools. This is not trivial, but you can find lots of helpful posts in various TechNet blogs, including Mark Russinovich’s blog, to get you started. And for tons more info on this topic, see Windows Bugcheck Analysis in the TechNet Wiki at http://social.technet.microsoft.com/wiki/contents/articles/6302.windows-bugcheck-analysis.aspx.
You should pause the virtual machine before using the debugging tools to write out the memory dump file. Pausing might produce some changes in the memory state of the unresponsive virtual machine, but it is still best practice, because if you don’t pause the virtual machine the resulting dump file may be inconsistent.
It can take a long time (possibly hours) to generate the memory dump file, depending on the size and memory configuration of the virtual machine.

What to do next

Let’s say you’ve got a memory dump file from your virtual machine that crashed. What can you do next to troubleshoot the problem? You basically have two choices:

Zip the dump file, upload it to OneDrive or some other sharable cloud location, and call Microsoft Support and ask them to have a look at it and tell you what might have gone wrong.
Become an expert in interpreting memory dumps.

I would personally go with the former if it’s a production virtual machine that’s important for your business operations, because becoming an expert in bugcheck analysis is something that most sysadmins have neither the time nor inclination to pursue. But it’s your choice of course.
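If you take the first route, the zipping step is easy to script. The following Python sketch compresses a dump file with the standard zipfile module; the function name and the default output location next to the dump are assumptions. ZIP64 support is enabled explicitly because dump files frequently exceed 4 GB.

```python
import zipfile
from pathlib import Path
from typing import Optional, Union


def zip_dump(dump_path: Union[str, Path],
             zip_path: Optional[Union[str, Path]] = None) -> Path:
    """Compress a memory dump into a .zip archive and return the archive path.

    If zip_path is omitted, the archive is written next to the dump with the
    same base name and a .zip extension.
    """
    dump = Path(dump_path)
    if not dump.is_file():
        raise FileNotFoundError(dump)
    target = Path(zip_path) if zip_path is not None else dump.with_suffix(".zip")
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as archive:
        # Store only the file name, not the full directory path
        archive.write(dump, arcname=dump.name)
    return target
```

Make sure the volume has room for the archive as well as the original dump before running this on a server that is already short of disk space.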

This may be one area where VMware still has the upper hand over Hyper-V, because VMware has released a tool you can use to convert a snapshot of a virtual machine running on an ESX host into a .dmp file that you can then analyze using the standard Windows debugging tools.

This tool is called the Checkpoint To Core Tool (vmss2core). You can find more info about it at http://www.vmware.com/pdf/snapshot2core_technote.pdf, and you can download the tool from VMware Labs at https://labs.vmware.com/flings/vmss2core.