Tuesday, March 6, 2007

This Is Why Our System i Was Failing

After nearly 18 hours of continuous downtime and troubleshooting, we and IBM found the culprit. Because the symptoms pointed to a perceived power failure, IBM decided to isolate the specific frame that was causing it. But we had to find it first. With two 570 system units and six 0595 expansion towers in the rack, that meant a lot of power-ups and power-downs. This morning IBM disconnected the SPCN cables from the 570 system units, and we were able to power up the box. After multiple power-up attempts focused on daisy-chaining the towers, we eventually isolated the problem to the second expansion tower.

We had already banked on simply having to replace the backplane on one of the towers, so we had a backplane en route. But when one of the IBM techs opened the tower and removed the power supplies, he saw this wedged underneath the power supply against the T14 9-pin connector (the connector a UPS can attach to in order to alert the system of a power loss):



This is one of the four small electrostatic insulators placed around the opening through which the power cable connects to the power supply. Apparently it made enough contact to short out the connector and signal a power loss to the system. The system has now been up for about 5 hours...

WOW!
