ON THE WIRE / BILL ALDERSON & J. SCOTT HAUGDAHL
The Case Of The Disappearing Print Jobs
Q: Our network consists largely of Sun servers and workstations, NetWare servers and personal computer workstations. We have an application where the personal computer workstations print to a print queue on a NetWare server. That queue is then read by a Sun system that prints the file to a PostScript printer attached to that Sun. The problem is that the print jobs sometimes "disappear." The print files appear to get to the NetWare queue, but sometimes they never make it to the Sun's printer. Help!
Bill: At first blush, this appears to be an application layer problem.
Scott: Right. There are two application layer protocols involved (and, of course, all of the underlying protocols needed to carry the application layer requests): first, the printing protocol from the workstation to the Novell server; and, second, the queue service protocol between the NetWare server and the Sun. The flow of print data is illustrated in "Sun Reading NetWare Print Queues" (next page).
Bill: Our customer has a Windows-based protocol analyzer, so he captures a good print flow and a failed print flow, and sends the traces to the vendors. The vendors disagree about the cause of the problem, each blaming the other. With time passing and the problem remaining unresolved, we become involved as the trace files are downloaded to our BBS.
Scott: One of the first details we note is that the PCs send the print data to the Novell server over IPX, while the Sun reads the data from the NetWare queue using TCP/IP. Since we verified the customer's claim that the data always makes it to the NetWare queue, we concentrate on analyzing the print data between the Sun and NetWare, eliminating the IPX analysis for now.
Bill: Since several of our analyzers are equipped with an expert system, we convert the file for those analyzers using TraceTool, Pine Mountain Group's file conversion utility that converts among different protocol analyzer formats, and voila…
Scott: None of the expert systems finds anything wrong!
Bill: Right!
Scott: We note that expert systems are not intended to replace the network analyst! Rather, they are intended to assist the human expert in finding fundamental problems faster. The human expert must always make the final call on the diagnosis and never assume there are no problems with a given set of packets just because a machine-based expert system didn't find anything wrong.
Bill: The expert tools are great for what they do, but don't panic when the system "finds" potentially bogus problems -- always question its findings or lack thereof. This certainly isn't the first time a problem slipped by our protocol analyzer's expert system.
Scott: That goes without saying. Now, what was the problem with our customer's print job?
Bill: We analyze the customer's trace and discover a number of packet retransmissions at the end of the print job. The NetWare server sends the same TCP sequence number four times, with an increasing time-out delay between attempts, before giving up. We note that on previous packets, TCP is breaking up 2,048 bytes of print data into two packets -- since in this case, the DLC layer is Ethernet with a maximum transmission unit (MTU) of 1,500 bytes of encapsulated data per packet -- and the Sun sends one TCP packet back to the NetWare server, acknowledging both TCP packets. The Sun normally responds with an acknowledgment in about 1.5 ms.
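The two-packet split follows from simple arithmetic, sketched here in Python. The sketch assumes 20-byte IP and 20-byte TCP headers with no options, which the trace description does not confirm; the function name `segment_sizes` is ours, and the actual split in the customer's trace may have been different.

```python
# Assumed header sizes: no IP options and no TCP options.
ETHERNET_MTU = 1500
IP_HEADER = 20
TCP_HEADER = 20

def segment_sizes(payload_len, mtu=ETHERNET_MTU):
    """Return the TCP payload sizes needed to carry payload_len bytes."""
    # Largest TCP payload that fits in one Ethernet frame under the assumptions above.
    mss = mtu - IP_HEADER - TCP_HEADER
    sizes = []
    remaining = payload_len
    while remaining > 0:
        chunk = min(mss, remaining)
        sizes.append(chunk)
        remaining -= chunk
    return sizes

print(segment_sizes(2048))  # [1460, 588]
```

Under these assumptions, a 2,048-byte write cannot fit in one frame, so it travels as two TCP packets, and a single ACK from the receiver can cover both.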
Scott: For some reason, the Sun fails to acknowledge (ACK) the last packet of a print job, causing the entire job not to print. So the Sun looks guilty.
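The retransmission pattern Bill describes can be spotted mechanically. The sketch below is only an illustration: it assumes the trace has already been reduced to (timestamp, sequence number, payload length) records for one direction of one connection, and both the helper name `find_retransmissions` and the sample values are invented, not taken from the customer's trace.

```python
from collections import defaultdict

def find_retransmissions(packets):
    """Map each repeated TCP sequence number to the times it was sent.

    packets: iterable of (timestamp_seconds, seq_number, payload_len)
    for a single direction of a single TCP connection.
    """
    seen = defaultdict(list)
    for timestamp, seq, payload_len in packets:
        if payload_len > 0:  # ignore pure ACKs, which carry no data
            seen[seq].append(timestamp)
    return {seq: times for seq, times in seen.items() if len(times) > 1}

# Invented sample: the last data packet is sent four times with growing gaps.
sample_trace = [
    (0.000, 1000, 1460),
    (0.001, 2460, 588),
    (0.010, 3048, 512),
    (1.010, 3048, 512),
    (3.010, 3048, 512),
    (7.010, 3048, 512),
]

for seq, times in find_retransmissions(sample_trace).items():
    gaps = [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]
    print(seq, len(times), gaps)
```

A sequence number that shows up several times with widening gaps and no ACK in between is the signature to look for; it says the receiver is not accepting the packet, but not why.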
Bill: Exactly. But sometimes that last packet would be ACKed OK, as was the case with the "good" print trace. So we scrutinized that last packet from the bad print trace and discovered that...
Scott: Spare me the suspense!
Bill: ...the TCP checksum was invalid, causing the Sun to reject the packet! Meanwhile, Novell was accusing Sun and Sun was accusing Novell!
Scott: I guess Sun was right this time.
Bill: Armed with the evidence, our client went back to Novell and explained that it was calculating the TCP checksum incorrectly on packets of certain lengths. This is a case where checksums other than the standard 32-bit CRC in all Ethernet and Token-Ring packets are important (see the "Error Coverage" chart, below).
Scott: Thankfully, almost all analyzers note whether or not the TCP checksum is correct in the packet decodes. Now it would be great if they would add it to their flags or expert analysis so we won't have to go hunting for it!
Bill and Scott are principals of Pine Mountain Group, Inc., and spend their time troubleshooting large networks, training end users in protocol analysis, and developing tools to allow users to make better use of their protocol analyzers. They can be reached at otw@pmg.com.
Checksums In An Ethernet Packet With TCP/IP
Unlike the 16-bit IP checksum, which covers only the IP header (20 bytes when there are no options) and none of its data, the 16-bit TCP checksum protects the TCP header and the data encapsulated in IP. Normally, the DLC checksum, in this case Ethernet's 32-bit CRC, protects all data in the packet, including the TCP/IP information. However, as packets go through routers, the DLC checksum changes, since each packet is forwarded to another router or end node with new DLC source and destination addresses. The IP checksum is also recalculated as the time-to-live (TTL) field is decremented. Therefore, an additional end-to-end checksum is desirable to cover the possibility that an error could occur in the application data (in this case, the data contained in the TCP segment).
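For readers who want to check a suspect packet by hand, here is a minimal Python sketch of the standard Internet checksum as TCP applies it over IPv4: a 16-bit ones' complement sum over a pseudo-header (source and destination addresses, protocol number 6, TCP length) followed by the TCP header and data. The function names, addresses, ports and payload are made up for illustration, and this is not the code of any particular analyzer or TCP stack.

```python
import socket
import struct

def ones_complement_sum(data):
    """16-bit ones' complement sum of data, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total

def pseudo_header(src_ip, dst_ip, tcp_len):
    """IPv4 pseudo-header that TCP includes in its checksum (protocol 6)."""
    return socket.inet_aton(src_ip) + socket.inet_aton(dst_ip) + struct.pack("!BBH", 0, 6, tcp_len)

def tcp_checksum(src_ip, dst_ip, segment):
    """Checksum for a TCP segment whose checksum field is currently zero."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ~ones_complement_sum(data) & 0xFFFF

def tcp_checksum_ok(src_ip, dst_ip, segment):
    """True if a received segment (checksum field filled in) verifies."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ones_complement_sum(data) == 0xFFFF

# Made-up segment: 20-byte TCP header (checksum field zero) plus payload.
header = struct.pack("!HHIIBBHHH", 1024, 2048, 1000, 2000, 5 << 4, 0x18, 4096, 0, 0)
payload = b"example print data"
checksum = tcp_checksum("10.0.0.1", "10.0.0.2", header + payload)
good = header[:16] + struct.pack("!H", checksum) + header[18:] + payload
bad = good[:-1] + bytes([good[-1] ^ 0x01])  # corrupt one payload bit

print(tcp_checksum_ok("10.0.0.1", "10.0.0.2", good))  # True
print(tcp_checksum_ok("10.0.0.1", "10.0.0.2", bad))   # False
```

A receiver that computes anything other than 0xFFFF over the pseudo-header and segment silently discards the packet, which matches what the Sun did with the last packet of the failed print job.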