Sorry folks, we had an enormous amount of downtime in the past 24 hours. Traffic has been way up and I blamed it on that, but it turns out the load wasn't the problem (good old reliable Linux!).
Yesterday morning a 7 ton A/C unit in the hosting facility took a nosedive. By noon the server room was hot enough to crash the web server. I did a remote reboot, not knowing the A/C was down. By 4pm it crashed again. By 7pm, again. Then 4 more times from 7 to 10pm. Little did I know the server facility was getting hotter and hotter and causing the FTE web server to die. I arrived at the facility at 10:45pm and found, to my shock, A/C repairmen frantically working on the system and FTE's servers frying in a 105 degree room.
I did everything I could with what I had to keep them running cool (box fan, putting 1U of space between each unit, leaving the covers off), but by 4am it was apparent that none of that would do the trick. By this time the system wouldn't stay up for more than 5 minutes. I called it a night and went home for 2 hours of sleep.
So my day has been spent getting 2 monstrously high-speed system cooling fans (the average PC fan moves 10-20 CFM; these move 94 CFM each). The servers are now up and running again, and have been for several hours without a problem, but we're not out of the woods yet. The A/C system is a specialized unit and it will take several days to get the parts. They were installing a "portable" 7 ton unit the size of my Ranger in the place. The temp was down to 90 by the time I left, and they expect it to stay around 70 (65 is considered optimal) until the permanent unit is replaced.
Sorry folks... if you think you were frustrated with FTE the past 24 hours... put yourself in my shoes!
Well... off to bed and hoping the server monitor doesn't page me with a crashed system again!
The system that kept dying had 4 gigabytes of RAM, two CPUs, a very large enterprise-level SCSI hard drive controller (with its own CPU and RAM), 4 15K RPM hard drives and 4 10K RPM hard drives. The hard drives, because they spin faster than your typical hard drive, generate 2-3 times more heat, so it's like having about 16-20 hard drives in a case. There's a LOT of heat generated... the power supply for this sucker is 650 watts.
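For a rough sense of scale, here is a back-of-envelope sketch in Python using the figures from this thread (650 watt power supply, 94 CFM fans, 7 ton A/C). It is only an estimate under stated assumptions: it treats the full 650 W rating as heat (a worst case, since the server rarely draws its full rating), it uses the common sea-level rule of thumb BTU/hr = 1.08 x CFM x temperature rise in degrees F, and it assumes the fans actually deliver their rated airflow, which they usually don't inside a crowded case. The function names are made up for illustration.

```python
# Back-of-envelope heat figures for the server described in this thread.
# Assumption: the full power supply rating ends up as heat (worst case).

WATTS_TO_BTU_PER_HR = 3.412   # 1 watt is about 3.412 BTU/hr
BTU_PER_HR_PER_TON = 12_000   # 1 ton of cooling = 12,000 BTU/hr


def heat_btu_per_hr(watts):
    """Heat output in BTU/hr for a given electrical load in watts."""
    return watts * WATTS_TO_BTU_PER_HR


def tons_to_watts(tons):
    """Heat load in watts that an A/C unit of the given tonnage can remove."""
    return tons * BTU_PER_HR_PER_TON / WATTS_TO_BTU_PER_HR


def air_temp_rise_f(watts, cfm):
    """Approximate air temperature rise (deg F) from intake to exhaust.

    Rule of thumb for air near sea level: BTU/hr = 1.08 * CFM * delta_T.
    """
    return heat_btu_per_hr(watts) / (1.08 * cfm)


if __name__ == "__main__":
    psu_watts = 650
    print(f"Server heat, worst case: {heat_btu_per_hr(psu_watts):.0f} BTU/hr")
    print(f"7 ton A/C capacity: {tons_to_watts(7) / 1000:.1f} kW of heat")
    print(f"Rise with one 20 CFM fan: {air_temp_rise_f(psu_watts, 20):.0f} F")
    print(f"Rise with two 94 CFM fans: {air_temp_rise_f(psu_watts, 2 * 94):.0f} F")
```

The point of the comparison is that a single ordinary case fan would let the air heat up enormously on its way through a box like this, while the two big fans keep the rise to roughly ten degrees. Either way the exhaust is only ever that much above the room temperature, so in a 105 degree room no amount of airflow gets the hardware cool.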
It figures this would happen just before the Rally...
Amazing what heat will do to equipment. One of the sites I work on is filled with rack mounted microwave Rx/Tx pairs for TV classes around the state. It had an A/C problem a few years ago and the heat caused the power supply to fry in each unit. Lost about 3 a day. Brand new, state of the art PS at $1500 each.
This stuff happens. I've found cheap insurance by keeping 4 box fans with extension cords in storage. Take the covers off and keep the air flowing.
At work we have to keep our main server area at 55 degrees. Heat is the biggest killer of IT equipment for sure. They will let an office A/C go out and spend a week making sure you do the paperwork right to get a replacement, but let one of the A/C units in the IT room go out and you'd better have a new one in that day. There are three 5-ton units for this room. One will do the job, but the others are on standby, and they have us cycle through them every week to ensure all of them will work if and when needed.