Network Computing Network Computing Powered by InformationWeek Business Technology Network

ON THE WIRE

Big Bad Watchdog Bites Network

by Bill Alderson and J. Scott Haugdahl

Nearly every protocol in networking has some form of "keep alive" or polling mechanism during periods of inactivity between a host and workstation or a server and client. These protocols are designed to drop connections and free resources whenever a workstation drops off the net. Unfortunately, if the protocol doesn't work properly, workstations can drop off the network for no apparent reason.

Scott: Such was the case during our recent investigation of a customer's campus local area network where users' NetWare connections were constantly being dropped by their file servers.

Bill: The problem appeared to be random and unrelated to any significant event.

Scott: Trying to troubleshoot a random problem with no consistent repeatability is a network manager's worst nightmare.

Bill: Some users appeared to lose their connection more than others. Some would be fine for a few weeks before having the problem again. Others wouldn't report the problem, since having to reboot your machine a few times a day while running Windows had become an accepted practice.

Scott: What? Having to reboot Windows? We never have to do that, right?

Bill: This problem had been going on for several months, as previous attempts to solve the problem had been unsuccessful.

Scott: One such previous solution was to segment the LAN further using a switch, thinking that it was a load-related problem.

Bill: I guess it's time for a little forensic analysis, right Dr. Quincy?

Scott: Right, Sam.

Bill: After observing the network for several hours and not seeing the usual "bad things" that we see in networks, we decided to focus on dropped connections due to the NetWare watchdog process.

Scott: The watchdog process begins whenever a server hasn't seen a packet from a workstation in a given amount of time. A server will send a keep alive packet to a workstation it hasn't heard from in five minutes (by default). The workstation then replies with a keep alive response packet back to the server.

Bill: If the server sees this packet, then it won't bother the workstation again for another five minutes. But if it doesn't see the packet...

Scott: ...then it will try again in one minute. This process continues until a reply is seen or a fixed number of keep alives are sent with no replies (nine more by default). At that point, the server will drop the connection.
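
The timing described above can be laid out as a small Python sketch. The five-minute delay, one-minute retry and nine retries are the defaults quoted in the conversation; the function names are hypothetical, and the one-interval wait after the last keep alive before the drop is an assumption, since the exact moment of the drop isn't stated here.

```python
FIRST_DELAY_MIN = 5     # idle time before the first keep alive (default quoted above)
RETRY_INTERVAL_MIN = 1  # gap between unanswered keep alives
RETRIES = 9             # "nine more by default"


def watchdog_schedule(first_delay=FIRST_DELAY_MIN,
                      retry_interval=RETRY_INTERVAL_MIN,
                      retries=RETRIES):
    """Idle-time offsets (in minutes) at which the server sends keep alives."""
    return [first_delay + i * retry_interval for i in range(retries + 1)]


def drop_time(replies_seen,
              first_delay=FIRST_DELAY_MIN,
              retry_interval=RETRY_INTERVAL_MIN,
              retries=RETRIES):
    """Minute at which the connection is dropped, or None if it survives.

    replies_seen[i] is True when the server actually buffered the reply to
    keep alive number i. Assumption: the server waits one more retry
    interval after the final keep alive before dropping the connection.
    """
    schedule = watchdog_schedule(first_delay, retry_interval, retries)
    if any(replies_seen[:len(schedule)]):
        return None  # a reply was seen, so the idle timer starts over
    return schedule[-1] + retry_interval
```

Under these assumptions, a workstation whose replies are never seen receives keep alives at minutes 5 through 14 of idle time before the server gives up on it.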

Scott: It wasn't known for sure if the server always dropped a connection due to the watchdog process. For instance, changing the watchdog parameters didn't seem to have much effect on the problem. The user also showed us a trace where the workstation did reply, and the reply reached the server segment properly, so how could it have been a watchdog problem? Could it have been a NetWare bug?

Bill: We proceeded to capture packets, watched the server console for timeouts, and contacted users whenever the server reported a watchdog timeout.

Scott: By correlating theory, actual packet transactions and user feedback, we found the answer.

Bill: One thing that became clear was that, on servers with hundreds of connections, users who had not accessed the server for an extended period of time (such as when accessing local data, doing terminal emulation or even going to lunch) would get the boot.

Scott: Normally this shouldn't happen, because of the watchdog process.

Bill: Another observation, seen early in our analysis, was occasional receiver congestion being reported by servers on the 16-Mbps server ring.

Scott: Normally, a few receiver congestion errors now and then are not a big deal, and we usually just blow them off. After all, the server ring was extremely busy, and there just weren't enough receiver congestion reports to be overly concerned.

Bill: But we knew that a receiver congestion error meant a server was unable to receive a packet because of a full receive buffer.

Scott: By plugging our analyzer into the concentrator downstream from one of the servers, we could see packets addressed to that server. We could then observe when the address recognized bit was set, but not the frame copied bit, indicating receiver congestion.
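
As a rough illustration of that check, the Python sketch below classifies a Token-Ring frame status byte. It assumes the "A C r r A C r r" layout of the frame status field, with the first-transmitted bit treated as the most significant bit; an analyzer may present the byte differently, so the masks and the function name are assumptions rather than anything taken from our trace.

```python
A_BITS = 0x88  # address recognized bits (the field carries two copies)
C_BITS = 0x44  # frame copied bits (two copies)


def classify_frame_status(fs):
    """Interpret the A and C bits of a frame status byte."""
    a = fs & A_BITS
    c = fs & C_BITS
    if a == A_BITS and c == C_BITS:
        return "copied"                  # destination saw and buffered the frame
    if a == A_BITS and c == 0:
        return "recognized, not copied"  # the receiver congestion signature
    if a == 0 and c == 0:
        return "not recognized"          # no station claimed the address
    return "inconsistent"                # the duplicate bits disagree
```

A frame that comes back "recognized, not copied" is the case of interest: the server's adapter saw its own address but had nowhere to put the packet.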

Bill: As it turned out, there was a high correlation between the frame copied bit not being set on watchdog packets returned from workstations and the server reporting receiver congestion two seconds (by default) later.

Scott: The problem was that these watchdog return packets were small and returned from a hundred or so workstations simultaneously, at a rate faster than the server's adapter and driver could process.
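
To make the mechanism concrete, here is a minimal Python model of a fixed-size receive buffer hit by a back-to-back burst. It is a sketch under simplifying assumptions (evenly spaced arrivals, a constant per-packet service time, and a packet holding its buffer slot until the driver finishes with it); the function and parameter names are hypothetical and none of the numbers were measured on the customer's server.

```python
from collections import deque


def simulate_burst(n_packets, arrival_gap_us, service_time_us, buffer_slots):
    """Return (buffered, dropped) for a burst of n_packets small frames.

    Packet i arrives at i * arrival_gap_us. The adapter and driver need
    service_time_us per packet and handle packets one at a time. A packet
    that arrives while every buffer slot is occupied is dropped.
    """
    pending = deque()  # completion times of packets still holding a slot
    last_done = 0.0    # time at which the driver finishes its current backlog
    buffered = dropped = 0
    for i in range(n_packets):
        now = i * arrival_gap_us
        while pending and pending[0] <= now:
            pending.popleft()  # slot freed: the driver finished this packet
        if len(pending) < buffer_slots:
            start = max(now, last_done)
            last_done = start + service_time_us
            pending.append(last_done)
            buffered += 1
        else:
            dropped += 1       # frame would come back recognized, not copied
    return buffered, dropped
```

When the service time is no longer than the arrival gap, nothing is dropped; once replies arrive faster than they can be serviced, the buffer fills and every further arrival is lost until a slot frees up.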

Bill: It's kind of like loading a bow with 100 arrows and shooting them all at the target at the same time.

Scott: If I were a server, I'd quiver, too. The next correlation, as evidenced by our packet trace, was that whenever a returned watchdog packet couldn't be buffered, the server would begin the one-minute timeout cycle.

Bill: Now and then, a few unlucky workstations never got their replies buffered, even with multiple tries, and the server thought it never heard from them and dropped their connection!
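
A back-of-the-envelope calculation shows why only a few workstations would be hit, and only now and then. The Python sketch below assumes each reply is lost independently with the same probability, which is a simplification (real losses are likely to be bursty and correlated), and the loss probability itself is a made-up input, not a figure from our trace.

```python
def prob_all_replies_lost(p_lost, tries=10):
    """Chance that one workstation has every reply go unbuffered.

    tries=10 is the first keep alive plus the nine retries (the defaults).
    Assumes each reply is lost independently with probability p_lost.
    """
    return p_lost ** tries


def expected_drops(n_idle, p_lost, tries=10):
    """Expected number of dropped connections among n_idle workstations."""
    return n_idle * prob_all_replies_lost(p_lost, tries)
```

Even with a hypothetical loss rate as high as one reply in two, a single workstation loses all ten replies only about once in a thousand watchdog cycles, so drops would look rare and random, much like the symptoms the users described.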

Scott: There are many potential solutions to this problem. To name a few: increase the receive buffer space in the server's adapter (you couldn't do that in this case); increase the comm buffer space in NetWare (it was already at several hundred); get a more powerful adapter in the server; or keep the server happy some other way, so it doesn't have to go through the watchdog process for all those workstations.

Bill: The short-term solution was to install a small TSR in each workstation that would periodically (within five minutes) send an unsolicited keep alive packet back to the server. The longer-term solution is to try a different Token-Ring adapter in the server that can process back-to-back small packets more efficiently.
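
The idea behind the short-term fix can be sketched in a few lines of Python. This is not the TSR itself, which was a DOS resident program; it only illustrates the timing. The one-minute safety margin and all of the names are hypothetical, and `send` stands in for whatever actually puts a keep alive on the wire.

```python
import time

WATCHDOG_DELAY_S = 5 * 60  # server's default idle threshold, as described earlier
SAFETY_MARGIN_S = 60       # hypothetical margin so the server never has to ask


def keepalive_interval(watchdog_delay_s=WATCHDOG_DELAY_S,
                       margin_s=SAFETY_MARGIN_S):
    """Pick a send interval that stays inside the server's idle threshold."""
    if not 0 < margin_s < watchdog_delay_s:
        raise ValueError("margin must be positive and below the watchdog delay")
    return watchdog_delay_s - margin_s


def run_keepalive(send, interval_s, rounds, sleep=time.sleep):
    """Call send() once per interval, `rounds` times.

    send is a caller-supplied function that transmits one unsolicited keep
    alive; sleep can be replaced to test the loop without waiting.
    """
    for _ in range(rounds):
        sleep(interval_s)
        send()
```

Because each workstation sends on its own clock, the server hears from everyone before its five-minute timer expires and never has to poll a hundred stations at once.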

Scott: Of course, when the customer upgrades to NetWare 4.1, they'll have to check for this problem all over again.

Bill and Scott are principals at Pine Mountain Group and can be reached at otw@pmg.com. Portions of the actual trace file from selected columns are available via Pine Mountain Group's Home Page on the Web (http://www.pmg.com).

November 1, 1995






