(If you think that this entry is about life coaching: no, it's not. Just (very) technical stuff here.)
Quirky technical problems are my daily bread. This one was a toughie and I hope that its solution will help others with similar symptoms. I will first sketch the basic infrastructure, then the problem, our debugging steps and the solution(s).
The affected system consists of two Windows Server 2003 (W2K3) machines with SP2 installed, running as a two-node NLB cluster. Each cluster member hosts an IBM WebSphere 6.0 environment which is controlled by a third Network Deployment server. The WebSphere cell consists of a configured cluster of two nodes (both of our servers).
Each server (we're just talking about the cluster here, forget that ND machine) has a Xeon processor at 3 GHz and is packed with 10 GB of RAM. The network interfaces (4 in total; each server has a frontend and a backend NIC) are HP NC373F PCIe Multifunction Gigabit Server Adapters with the newest drivers. Load balancing happens at the frontend network interfaces: the web servers' IP addresses are bound to these interfaces, and their load is balanced there. The backend interfaces are unclustered and unbalanced. Usually they're used for maintenance and administration; here they are also used to send the web servers' answers back to the client.
This cluster is connected to a host machine (an IBM system) via Connect Direct (this realizes the connection to a DB2 database on the host) and MQ Series. Our deployed web application (running under WAS 6 and thus a Java app) gets its requests from a client application (another Java app running on the user's desktop), sends messages, and reads data from and writes data to the host.
The client application could be used for approximately one minute (sometimes longer) before it crashed with a "socket write error" exception, claiming that the TCP connection to the web server was lost. In the web server's log we found a "Socket Timeout Error". Obviously the TCP connection was broken! This behaviour was completely new to us, because previous tests in the test environment had run without problems. The architecture of the test system was identical to that of the production environment. The only difference was the hardware: the production servers are HP G5 machines, while the test servers are G3 machines.
After checking the whole application environment and every WebSphere setting available, we finally found the error in the NIC driver configuration. Apparently our very particular software combination (Windows Server 2003 SP2, WAS 6) didn't get along with the NIC driver settings on this particular hardware. The culprit was the "Receive-Side Scaling" option, which we found activated. After switching it off, the problem disappeared and the TCP connection between client application and web server was no longer interrupted. Since not every NIC driver lets you change this setting manually, there's also a registry key where you can configure it on the affected server (after SP2 is installed): start regedit and look for the key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters. Now create a new DWORD value EnableRSS and set it to 0 (zero). If problems persist, you may create another DWORD value DisableTaskOffload and set it to 1 (one).
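If you have to apply this to more than one server, clicking through regedit gets tedious. The following is only a minimal sketch of how the two values described above could be written into a .reg file that you can review and then import on each machine. The function name build_reg_file and its parameter are my own invention for illustration; the key path and value names are the ones given above. I'm assuming here that a restart is needed before the change takes effect, which is usually the case for settings under Tcpip\Parameters.

```python
# Sketch: generate a .reg file that disables Receive-Side Scaling and,
# optionally, TCP task offload. The script only builds text; it does not
# touch the registry itself.

TCPIP_PARAMETERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
)


def build_reg_file(disable_task_offload: bool = False) -> str:
    """Return the text of a .reg file for the settings described above.

    build_reg_file is a hypothetical helper name, not part of any tool.
    """
    # EnableRSS = 0 switches Receive-Side Scaling off.
    values = {"EnableRSS": 0}
    if disable_task_offload:
        # Second step, only if the problem persists.
        values["DisableTaskOffload"] = 1

    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{TCPIP_PARAMETERS_KEY}]",
    ]
    for name, value in values.items():
        # DWORD values are written as eight hexadecimal digits.
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_reg_file(disable_task_offload=True))
```

Save the output as, say, disable-rss.reg, check it, and import it with a double click or via regedit. Back up the key first, as always.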
This error was extremely difficult to detect, because all other network operations worked flawlessly. It was possible to connect and administer via RDP, to copy huge amounts of data, and all WebSphere handling including deployment and node synchronisation via the ND server worked like a charm. The problems occurred only while running the application. We suspect some very strange side effects somewhere between driver settings and Java network operations.
Of course, the problems didn't stop here. We learned that the queue wasn't served correctly, because an older version of our software was still running on another (older) server. After stopping this old installation, everything finally went fine!