|
ON THE WIRE
Big Bad Watchdog Bites Network
by Bill Alderson and J. Scott Haugdahl
Nearly every protocol in networking has some form of "keep alive" or polling mechanism during periods of inactivity between a host and workstation or a server and client. These protocols are designed to drop connections and free resources whenever a workstation drops off the net. Unfortunately, if the protocol doesn't work properly, workstations can drop off the network for no apparent reason.
Scott:
Such was the case during our recent investigation of a customer's campus local area network where users' NetWare connections were constantly being dropped by their file servers.
Bill:
The problem appeared to be random and unrelated to any significant event.
Scott:
Trying to troubleshoot a random problem with no consistent repeatability is a network manager's worst nightmare.
Bill:
Some users appeared to lose their connection more than others. Some would be fine for a few weeks before having the problem again. Others wouldn't report the problem, since having to reboot your machine a few times a day while running Windows had become an accepted practice.
Scott:
What? Having to reboot Windows? We never have to do that, right?
Bill:
This problem had been going on for several months, as previous attempts to solve the problem had been unsuccessful.
Scott:
One such previous solution was to segment the LAN further using a switch, thinking that it was a load-related problem.
Bill:
I guess it's time for a little forensic analysis, right Dr. Quincy?
Scott:
Right, Sam.
Bill:
After observing the network for several hours and not seeing the usual "bad things" that we see in networks, we decided to focus on dropped connections due to the NetWare watchdog process.
Scott:
The watchdog process begins whenever a server hasn't seen a packet from a workstation in a given amount of time. A server will send a keep alive packet to a workstation it hasn't heard from in five minutes (by default). The workstation then replies with a keep alive response packet back to the server.
Bill:
If the server sees this packet, then it won't bother the workstation again for another five minutes. But if it doesn't see the packet...
Scott:
...then it will try again in one minute. This process continues until a reply is seen or a fixed number of keep alives are sent with no replies (nine more by default). At that point, the server will drop the connection.
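The timing just described can be sketched in a few lines of Python. This is only an illustration of the default values mentioned above (five minutes, one minute, nine more tries). The function names are hypothetical, and the assumption that the server waits one further retry interval after the last keep alive before dropping the connection is ours, not something taken from NetWare documentation.

```python
def watchdog_schedule(first_delay=5, retry_delay=1, extra_retries=9):
    """Minutes of silence at which the server sends each keep alive."""
    return [first_delay + i * retry_delay for i in range(extra_retries + 1)]


def minutes_until_drop(reply_heard, first_delay=5, retry_delay=1, extra_retries=9):
    """Walk the schedule and report when the connection would be dropped.

    reply_heard(n) should return True if the server actually buffered the
    reply to the n-th keep alive (0-based). Returns None if the connection
    survives, otherwise the minutes of silence at which it is dropped.
    """
    sends = watchdog_schedule(first_delay, retry_delay, extra_retries)
    for n, _minute in enumerate(sends):
        if reply_heard(n):
            return None  # reply seen, the idle timer starts over
    # Assumption: the server waits one more retry interval before giving up.
    return sends[-1] + retry_delay
```

Calling `minutes_until_drop(lambda n: False)` models a workstation whose replies are never seen by the server. Passing `lambda n: n == 3` models one whose fourth reply finally gets through.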
Scott:
It wasn't known for sure if the server always dropped a connection due to the watchdog process. For instance, changing the watchdog parameters didn't seem to have much effect on the problem. The user also showed us a trace where the workstation did reply, and the reply reached the server segment properly, so how could it have been a watchdog problem? Could it have been a NetWare bug?
Bill:
We proceeded to capture packets, watched the server console for timeouts, and contacted users whenever the server reported a watchdog timeout.
Scott:
By correlating theory, actual packet transactions and user feedback, we found the answer.
Bill:
One thing that became clear was that on servers with hundreds of connections, users who did not access those servers for an extended period of time (such as when accessing local data, doing terminal emulation or even going to lunch) would get the boot.
Scott:
Normally this shouldn't happen, because of the watchdog process.
Bill:
Another observation, seen early in our analysis, was occasional receiver congestion being reported by servers on the 16-Mbps server ring.
Scott:
Normally, a few receiver congestion errors now and then are not a big deal, and we usually just blow them off. After all, the server ring was extremely busy, and there just weren't enough receiver congestion reports to be overly concerned.
Bill:
But we knew that a receiver congestion error meant a server was unable to receive a packet because of a full receive buffer.
Scott:
By plugging our analyzer into the concentrator downstream from one of the servers, we could see packets addressed to that server. We could then observe when the address recognized bit was set, but not the frame copied bit, indicating receiver congestion.
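For readers who want to apply the same check to their own traces, here is a rough Python sketch of how the two bits can be read. It assumes the IEEE 802.5 frame status byte layout `A C r r A C r r`, with the address recognized (A) and frame copied (C) bits each appearing twice. The function name and the result strings are hypothetical and do not come from any particular analyzer.

```python
A_BITS = 0x88  # address recognized bit, duplicated in both nibbles
C_BITS = 0x44  # frame copied bit, duplicated in both nibbles


def classify_frame_status(fs):
    """Interpret a Token-Ring frame status byte seen downstream of a station."""
    a_hi, a_lo = bool(fs & 0x80), bool(fs & 0x08)
    c_hi, c_lo = bool(fs & 0x40), bool(fs & 0x04)
    if a_hi != a_lo or c_hi != c_lo:
        return "invalid"  # the duplicated bits should always agree
    if a_hi and c_hi:
        return "copied"
    if a_hi:
        # The station saw its own address but could not take the frame:
        # the signature of receiver congestion described above.
        return "recognized, not copied"
    if c_hi:
        return "invalid"  # copied without being recognized makes no sense
    return "not recognized"
```

Under this layout, a frame status of `0x88` on a watchdog reply addressed to the server is the case to look for.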
Bill:
As it turned out, there was a high correlation between the frame copied bit not being set on watchdog packets returned from workstations and the server reporting receiver congestion two seconds (by default) later.
Scott:
The problem was that these watchdog return packets were small and returned from a hundred or so workstations simultaneously, at a rate faster than the server's adapter and driver could process.
Bill:
It's kind of like loading a bow with 100 arrows and shooting them all at the target at the same time.
Scott:
If I were a server, I'd quiver, too. The next correlation as evidenced by our packet trace was that whenever a returned watchdog packet couldn't be buffered, the server would begin the one minute timeout cycle.
Bill:
Now and then, a few unlucky workstations never got their replies buffered, even with multiple tries, and the server thought it never heard from them and dropped their connection!
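A little back-of-the-envelope arithmetic shows why only a few workstations were hit, and only now and then. The Python sketch below assumes, purely for illustration, that each watchdog reply independently has the same chance of missing the server's receive buffer. Neither that miss rate nor the independence was measured on the customer's network, and the function names are hypothetical.

```python
def drop_probability(miss_rate, attempts=10):
    """Chance that one idle workstation misses every attempt in a cycle.

    attempts=10 reflects the first keep alive plus nine more by default.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return miss_rate ** attempts


def expected_drops(idle_stations, miss_rate, attempts=10):
    """Expected number of connections dropped in one watchdog cycle."""
    return idle_stations * drop_probability(miss_rate, attempts)
```

Even with a hypothetical miss rate as high as 50 percent, `drop_probability(0.5)` comes to 1 in 1,024 per cycle. That is rare for any one user, but with a hundred or so idle workstations and a new cycle every few minutes, somebody eventually loses.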
Scott:
There are many potential solutions to this problem. To name a few: increase the receive buffer space in the server's adapter (you couldn't do that in this case); increase the comm buffer space in NetWare (it was already at several hundred); get a more powerful adapter in the server; or keep the server happy some other way, so it doesn't have to go through the watchdog process for all those workstations.
Bill:
The short-term solution was to install a small TSR in each workstation that would periodically (within five minutes) send an unsolicited keep alive packet back to the server. The longer-term solution is to try a different Token-Ring adapter in the server that can process back-to-back small packets more efficiently.
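The idea behind the short-term fix can be sketched as a simple idle timer. The Python below is not the TSR that was installed; it only illustrates the logic of sending something before the server's five-minute timer expires. The class name, the `send` callback and the 240-second interval are all hypothetical choices.

```python
class IdleKeepalive:
    """Send an unsolicited keep alive once the link has been idle too long."""

    def __init__(self, send, interval_s=240.0):
        self.send = send              # hypothetical callback that transmits the packet
        self.interval_s = interval_s  # assumed margin under the five-minute default
        self.last_traffic = 0.0       # time of the last packet sent to the server

    def note_traffic(self, now):
        """Record that ordinary traffic went to the server at time `now`."""
        self.last_traffic = now

    def tick(self, now):
        """Call periodically; returns True if a keep alive was sent."""
        if now - self.last_traffic >= self.interval_s:
            self.send()
            self.last_traffic = now
            return True
        return False
```

Because each workstation's timer runs from its own last traffic, the keep alives tend to arrive spread out over time rather than as one simultaneous burst of replies.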
Scott:
Of course, when the customer upgrades to NetWare 4.1, they'll have to check for this problem all over again.
Bill and Scott are principals at Pine Mountain Group and can be reached at otw@pmg.com. Portions of the actual trace file from selected columns are available via Pine Mountain Group's Home Page on the Web (http://www.pmg.com).
November 1, 1995
|