
When your servers go nuts

Last week was the first week I’ve worked alone since early 2010. Back then, my colleague was on a week’s leave after our massive network migration; this time, it was because I am now on my own. The first two days of the week went smoothly enough, with nothing too crazy happening.

Wednesday was the start of the chaos, though the day itself began smoothly. At lunch time, our principal took the school’s admin staff (including me) out to high tea. It was a nice way to spend two hours off campus doing nothing more than relaxing and chatting with colleagues. Eventually it ended and we headed back to work. I sat down at my computer and noticed that Outlook had frozen, which was bizarre, as it had never locked up like that before. A moment later, the head of IT came to tell me that staff had not been able to access their email for a short while. I tried to Remote Desktop into the Exchange server, only to find that it wouldn’t connect. I opened up XenCenter, only to find that it couldn’t connect either.

I proceeded into the server room to discover that the big virtual server we had bought earlier this year had frozen rock solid. Exchange, anti-virus and internet access/filtering were down for everyone. I had to force a reboot via the reset switch, and that’s when things got out of hand. After the power cycle, the server tried to boot from the network, which was strange: network boot was the last option in the boot order and should never have been reached. I tried resetting again, only to hit the same issue. After another restart I decided to check the BIOS. Lo and behold, the 16GB Kingston SSD that was the “brain” of the server was not being detected. Now I knew I had a real problem. I tried swapping SATA cables and changing the port on the server’s motherboard, to no avail. Our hardware supplier suggested I try the drive in a desktop PC to see if it would be picked up. No luck there either, and valuable time wasted.

Left with no other choice, I slotted a 160GB magnetic hard drive into the server, disconnected the RAID array to be safe and re-installed Citrix XenServer onto the new drive. After that, I searched the internet for a long time to find a guide on how to recover existing storage repositories. I was sweating bullets now, as I could not afford to lose the data on the RAID array. The VMs were backed up, but not the master XenServer installation. It was something I had always wanted to look at but never got around to fiddling with. I broke the 11th Commandment (Thou shalt back up) and it came back to bite me on the ass. I’ve since learnt how to back up XenServer, and it’s not difficult at all. Should the system ever get sick again, I can restore things much more quickly.
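For the curious, the XenServer backup I set up afterwards boils down to a single `xe` command run on the host; the file path here is my own illustration, not from my actual setup:

```shell
# Dump the pool database (XenServer's own configuration/metadata)
# to a file that can be kept off the host. The path is an example.
xe pool-dump-database file-name=/mnt/backup/xenserver-pool-$(date +%F).db

# After a fresh XenServer install, the dump can be restored with
# (the host reboots as part of the restore):
#   xe pool-restore-database file-name=/mnt/backup/xenserver-pool-2011-09-14.db
```

Had I had such a dump on Wednesday, the reinstall would have been a restore rather than a manual rebuild.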

Luckily, I found a guide online that let me reconnect the array to the new XenServer installation, and I began the procedure of recovering the servers. I got everything back up except the two internet filtering/access servers. By this time it was near 19:00 and I had to go home, too exhausted to continue. The important things were back and safe.
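The procedure from that guide was, as best I can reconstruct it, the standard XenServer way of re-introducing an existing storage repository to a fresh installation; a sketch, where the device path, SR type and UUIDs are placeholders rather than my actual values:

```shell
# Probe the array's device for an existing SR; if one is found, the
# probe reports the SR's UUID (in the error output for an LVM SR).
xe sr-probe type=lvm device-config:device=/dev/sdb

# Introduce the existing SR to the fresh XenServer installation,
# using the UUID reported by the probe.
xe sr-introduce uuid=<sr-uuid> type=lvm name-label="RAID storage" content-type=user

# Create and plug a physical block device (PBD) record linking
# this host to the re-introduced SR.
xe pbd-create host-uuid=$(xe host-list --minimal) sr-uuid=<sr-uuid> device-config:device=/dev/sdb
xe pbd-plug uuid=<pbd-uuid>
```

Once the SR is plugged, the VMs on it reappear in XenCenter and can be started again.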

Thursday morning was spent trying to get those two servers up and running, and, failing that, setting up new servers in their place. It was challenging work, but with the support and understanding of the staff, we were back to full speed by 11:00. I spent the rest of the day catching up on other work. Thursday evening, I remotely reset the main server so that it could reconnect to its iSCSI target for backup purposes. That went smoothly and all was working when I logged off.

Friday morning on my way into work, I got an SMS from the head of IT, telling me that nobody could log into their email, even though the server was up: Exchange was rejecting every user name and password. I theorised that the Exchange server had lost its connection to Active Directory, and that a restart should solve the problem. When I got into work, I discovered that people were still struggling to log in. It turned out that the Exchange server was fine, but that the main server wasn’t: one of the hard drives in its RAID array had decided to go a bit nuts. I had to force reset the server. The HP RAID controller told me that the drive needed to be rebuilt, so I accepted. The server booted up and proceeded to work as normal, if a bit slower thanks to the one drive being rebuilt. Thanks to the various HP monitoring tools, I was able to keep a close eye on it, and the server behaved itself for the rest of the day.
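The command-line side of those HP tools is the Array Configuration Utility; a couple of the status checks look roughly like this (the controller slot number is an example, and exact output varies by controller and firmware):

```shell
# Show the controller, its logical drives and the overall array layout.
hpacucli ctrl all show config

# Show the status of every physical drive on the controller in slot 0;
# a drive being rebuilt is reported as "Rebuilding" until it finishes.
hpacucli ctrl slot=0 pd all show status
```

Watching that status is a lot more reassuring than waiting for the next freeze.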

All in all, the events of the week left me exhausted. That being said, there was a feeling of pride: despite the downtime and frustration, nothing was lost, and I’d survived my first crisis on my own. Unfortunately, the good feeling did not last into the weekend, as I ended up developing a common cold. Such is my life…

  1. Abrasive Colleague
    September 16, 2011 at 08:30

    Well done, my ex-colleague. Job well done. Just be careful that it does not happen too often, as people lose their “understanding” streak pretty quickly…
