Further fun with Citrix XenServer

I’ve previously written about some of our experiences moving most of our school’s servers onto Citrix’s XenServer platform, a process that we are still learning from every day. Last week we hit a known issue in the product that we were not aware of, because we had not taken the time to read all the errata, hardware compatibility lists and so on.

The problem is that XenServer 5.6 has some compatibility issues with the new power saving features introduced in Intel’s Nehalem and Westmere server chips. If you enable the C3 and C6 power saving states in the server BIOS, the processors will attempt to save as much power as possible by putting some of their cores into a sleep-like mode when load on the CPU is low. Unfortunately, there appear to be bugs in how the processors handle resuming from these states, as well as some other random issues that can be hard to pinpoint.
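
On a plain Linux host, the idle states the kernel knows about can be listed from the cpuidle sysfs tree. The sketch below is only an illustration, with some assumptions: it reads `/sys/devices/system/cpu/cpu*/cpuidle/state*/name`, which is the usual Linux layout, but on a XenServer host the hypervisor manages C-states itself, so this tree may be missing or incomplete in the control domain. The function name `list_idle_states` is made up for this example.

```python
from pathlib import Path


def list_idle_states(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: [idle state names]} from the Linux cpuidle sysfs tree."""
    states = {}
    for cpu_dir in Path(cpu_root).glob("cpu[0-9]*"):
        idle_dir = cpu_dir / "cpuidle"
        if not idle_dir.is_dir():
            continue  # this CPU exposes no cpuidle information
        # Sort numerically so state10 does not land before state2.
        state_dirs = sorted(
            idle_dir.glob("state[0-9]*"),
            key=lambda p: int(p.name[len("state"):]),
        )
        names = []
        for state_dir in state_dirs:
            name_file = state_dir / "name"
            if name_file.is_file():
                names.append(name_file.read_text().strip())
        if names:
            states[cpu_dir.name] = names
    return states


if __name__ == "__main__":
    found = list_idle_states()
    if not found:
        print("No cpuidle information exposed on this system")
    for cpu, names in sorted(found.items()):
        print(cpu, ", ".join(names))
```

If deeper states such as C3 or C6 still show up after changing the BIOS settings, that would suggest the change did not take effect, although the exact state names depend on the idle driver in use.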

We had no problems with this until about 2 weeks ago. After a massive power failure, we started noticing that the server wasn’t lasting beyond 3 days of uptime. At first we let it go as a minor once-off issue, but when it happened twice more, we could no longer ignore it. Microsoft Exchange has a pretty robust database engine, but too many unclean shutdowns would eventually cause corruption. While faulty hardware was always a possibility, it seemed more likely that this was some sort of software issue on the host causing the problem.

Cue searching on the internet, which led to this Citrix document. My colleague and I immediately felt that this was the root of the problem, as we had enabled the C states when we set up the server. My colleague took the chance on Friday afternoon, when it was quiet, to restart the server and change the settings in the BIOS. Although it’s a bit sad that we’ll lose some of the power saving features of the new CPUs, we’d take that any day over poor uptime and crashing hosts.

As of this writing, the server has been up for over 48 hours without issue, and I will continue to keep an eye on it. The week ahead at work should tell us if the C states were indeed causing the problem. If the server stays up past 3 days of uptime, we will know for sure.
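
One way to keep an eye on the three-day mark without checking by hand is a small script. This is a rough sketch, assuming a Linux host that exposes `/proc/uptime` (its first field is the uptime in seconds); the names `uptime_days` and `CRASH_WINDOW_DAYS` are invented for this example.

```python
CRASH_WINDOW_DAYS = 3  # the server used to die within about 3 days


def uptime_days(proc_uptime_text):
    """Convert the contents of /proc/uptime into days of uptime."""
    # The first whitespace-separated field is uptime in seconds.
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86400


if __name__ == "__main__":
    with open("/proc/uptime") as f:
        days = uptime_days(f.read())
    status = "past" if days > CRASH_WINDOW_DAYS else "still inside"
    print(f"Up {days:.1f} days, {status} the {CRASH_WINDOW_DAYS}-day window")
```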

In the above linked document, Citrix say their engineers are looking into the problem. Hopefully they will get it fixed by the time of the next release, so that we can make full use of our hardware.

  1. Chad T.
    March 26, 2012 at 06:51

    Wow, we have had the same problem, but on a Linux/CentOS server. We have spent the last month in a nightmare trying to figure out what the heck is going on. Our internal resources, our server management company and our server infrastructure host all pointed to a hardware problem (RAM). We turned every stone upside down, tested every inch of the system, but it seemed every 3rd day the server would shut down out of nowhere. Plus every log, every hardware test, and even the power supply swap all came back perfect. We migrated from a cluster that, although it had high server loads, has not seen unplanned offline status since 2008. We moved onto the next greatest platform, literally went from Atari to the Xbox 360 of 2050, and we have been down over 10 times in 30 days (all unplanned of course).

    The last thing that has been attempted is the C-State setting. As a matter of fact, this is what one of the 20+ technicians we worked with finally had to say:

    After inspecting the BIOS I found that C-STATE was enabled. C-STATE basically powers down idle cores on your processors. This can cause your OS to crash and make it very difficult to find any logs on why it did.

    I am hoping this is the solution as this was done today, 3/25/2012. If this post is accepted I will post again if we make it past the third day mark, fingers crossed.

    I guess if you cry enough or pay enough (and we’ve been paying enough, trust me), someone will figure the solution out. I hope that anyone searching online about why their new server with Westmere chips is crashing can find these posts. I wish I had, as I wouldn’t have had to fight to keep my job or go through the month-long headache to realize new technology has its flaws!!!

    • astraltraveller
      March 28, 2012 at 22:45

      Hey there. I’m glad my post helped you in the end. It’s part of the reason why I started this blog in the first place. I hope that your servers are behaving better now with the C States disabled.

      Our XenServer was upgraded to version 6, but I think the problem still exists in this version. Anyway, we never switched the states back on, and the server hasn’t died like that since.

      Funny thing is, I think this problem is more severe on systems that have a Linux kernel. Windows always seems to handle power management far better than Linux for some reason, at least in my limited experience.
