(Originally posted December 28 at 4:33 PM EST):
Aleph will be taken down temporarily between 1 and 3 AM EST (GMT-0500) on Saturday, December 29th to remove the TCP offload engine jumper from the server. Aleph will be inaccessible for 3-minute stints while the server is rebooted. During this window the network driver (bnx2) will also be upgraded, in an ongoing attempt to resolve a rare situation that results in dropped packets. Aleph’s kernel will be temporarily upgraded to the 2.6.24 release candidate, which includes an updated bnx2 driver. These changes may resolve the packet rot; at this time the exact cause is unknown.
I have isolated the problem to the BCM5708 chipset present in all of the Dell PowerEdge 1950 servers. Whether the cause is a defect in the kernel drivers, the proprietary TOE support (which should be inactive on Linux…), Intel’s I/OAT TCP offload feature, message signaled interrupts (MSI) from the network card, or something else is still unknown.
1/3/2008: Still no resolution on the issue. I’ve escalated it to the mailing lists to see if anyone else has similar issues.
1/5/2008 at 8:49 PM EST: New NIC is in Aleph. Provided there are no further packet drops between now and tomorrow, the remaining servers will have new NICs installed. A reboot will be required.
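
For anyone wanting to watch for drops over a similar observation window, here is a minimal sketch in Python. It is not the procedure used on Aleph. It assumes the drops show up in the per-interface counters that Linux exposes in /proc/net/dev, which may not hold for every kind of packet loss. The function names are made up for illustration.

```python
def parse_drop_counters(proc_net_dev_text):
    """Return {interface: (rx_dropped, tx_dropped)} from /proc/net/dev text."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        fields = stats.split()
        if len(fields) < 12:
            continue
        # Receive columns come first (bytes, packets, errs, drop, ...),
        # so rx drop is field 3; transmit columns start at field 8,
        # so tx drop is field 11.
        counters[name.strip()] = (int(fields[3]), int(fields[11]))
    return counters


def new_drops(before, after):
    """Return {interface: (rx_delta, tx_delta)} for counters that changed."""
    changed = {}
    for iface, (rx, tx) in before.items():
        if iface in after and after[iface] != (rx, tx):
            changed[iface] = (after[iface][0] - rx, after[iface][1] - tx)
    return changed


if __name__ == "__main__":
    import time

    def snapshot():
        with open("/proc/net/dev") as f:
            return parse_drop_counters(f.read())

    first = snapshot()
    time.sleep(60)  # assumed interval; a real watch would run much longer
    print(new_drops(first, snapshot()))
```

An empty result after the observation window means the interface counters recorded no new drops. It does not rule out loss elsewhere on the path.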