TCP retransmits and window size “0”, a problem? Maybe, maybe not. By: Larry Brasher
A good network admin always monitors their network, constantly on the lookout for hiccups: flooded routers, cyclic redundancy check (CRC) errors, broken routes, general network congestion and on and on, anything the network gremlins can throw their way. One of the primary weapons of choice as they enter this arena of battle is the packet analyzer. They come in different flavors such as Netmon, Wireshark, tcpdump, etc. A question that comes up often, and with some uncertainty, is:
“I see TCP retransmits and TCP window size set to 0, is that a problem?”
The answer is “yes”, “no” and “maybe”. Believe it or not, it’s all of them.
TCP/IP has matured and improved over the years with many robust enhancements, auto-tuning and security features. That being said, TCP is reliable by nature thanks to its error correction, congestion control and data flow control. Many a chapter has been written in tech books on this very thing and the internet has no shortage of information on the subject.
Not to forget the numerous RFCs and other articles referenced at the bottom of this post.
Note: This article assumes the reader has a basic understanding of TCP’s inner workings, networking in general and reading network captures.
So back to our question: are TCP retransmits and a TCP window size of 0 a problem?
TCP retransmit
First let’s look at TCP retransmits and what they are. Simply put:
a.) If sent data is not acknowledged within a certain amount of time (the retransmission timeout), typically because of packet loss or latency, the data is retransmitted.
b.) If a sender receives duplicate acknowledgements, the fast retransmit mechanism suspects a loss of data and can trigger a retransmission without waiting for the timeout.
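To make those two triggers a bit more concrete, here is a minimal Python sketch. The timer arithmetic follows RFC 6298 and the three-duplicate-ACK threshold follows RFC 5681 (both listed in the references). The class and method names are made up for this illustration, clock granularity is ignored, and a real TCP stack applies more conditions than this before it counts an ACK as a duplicate.

```python
class RetransmitSketch:
    """Toy model of the two retransmit triggers: timeout (RTO) and duplicate ACKs."""

    ALPHA = 1 / 8          # smoothing gain for SRTT (RFC 6298)
    BETA = 1 / 4           # smoothing gain for RTTVAR (RFC 6298)
    K = 4                  # RTTVAR multiplier (RFC 6298)
    MIN_RTO = 1.0          # lower bound on the RTO, in seconds (RFC 6298)
    DUP_ACK_THRESHOLD = 3  # fast retransmit threshold (RFC 5681)

    def __init__(self):
        self.srtt = None      # smoothed round-trip time
        self.rttvar = None    # round-trip time variation
        self.rto = 1.0        # initial RTO before any RTT sample exists
        self.last_ack = None
        self.dup_acks = 0

    def on_rtt_sample(self, rtt):
        """Update the retransmission timeout from one RTT measurement (seconds)."""
        if self.srtt is None:
            # First measurement
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated using the previous SRTT, then SRTT itself
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = max(self.MIN_RTO, self.srtt + self.K * self.rttvar)
        return self.rto

    def timed_out(self, seconds_since_send):
        """Trigger a.): no acknowledgement arrived within the RTO."""
        return seconds_since_send >= self.rto

    def on_ack(self, ack_number):
        """Trigger b.): return True when the third duplicate ACK arrives."""
        if ack_number == self.last_ack:
            self.dup_acks += 1
        else:
            self.last_ack = ack_number
            self.dup_acks = 0
        return self.dup_acks == self.DUP_ACK_THRESHOLD
```

Feeding it one RTT sample and then the same ACK number several times shows both paths: the timeout grows or shrinks with the measured round trips, while the duplicate ACK counter fires well before the timer would.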
OK, so what causes this packet loss, and what causes the latency? Most reasons were already mentioned at the beginning of this article: routers flooded with too much traffic or without the memory capacity to keep up with demand, malformed packets, an issue with the endpoints or with any of the hops between the two endpoints. Yes, I know, it’s a chase and like a detective you’ll need to follow the breadcrumbs.
Let’s take a look at a Wireshark capture first.
Environment: Virtual machine.
Server: Windows Server 2012 R2
Server Role: WSUS
Action taking place: WSUS sync with Microsoft.
Example: In my home lab, here’s a capture of my WSUS server syncing updates with Microsoft.
The capture is of 118637 packets. I’ve gone ahead and filtered the capture to display just the TCP retransmits.
You see that out of 118637 packets, 8 show up as a retransmit. So the question now is, “Is this too much? Is this a problem?”
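If you would rather get that count from a saved capture on the command line, something like the following Python sketch could work. It assumes tshark (the command-line tool that ships with Wireshark) is installed and on the PATH, and it uses Wireshark's `tcp.analysis.retransmission` display filter. The function names and the capture file name are hypothetical, and this is not necessarily the exact filter used for the capture above.

```python
import subprocess


def tshark_filter_command(capture_path, display_filter):
    """Build a tshark command that prints one line per packet matching the filter."""
    # -r reads a saved capture file, -Y applies a Wireshark display filter
    return ["tshark", "-r", capture_path, "-Y", display_filter]


def count_matching_packets(capture_path, display_filter):
    """Run tshark and count the packets it prints (needs tshark installed)."""
    result = subprocess.run(
        tshark_filter_command(capture_path, display_filter),
        capture_output=True,
        text=True,
        check=True,
    )
    return sum(1 for line in result.stdout.splitlines() if line.strip())


# Hypothetical usage, with a made-up file name:
# retransmits = count_matching_packets("wsus_sync.pcapng", "tcp.analysis.retransmission")
```

Keeping the command builder separate from the function that runs it makes it easy to print the command and try it by hand first.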
In this case nothing failed: WSUS synced with no errors and no problems. 8 TCP retransmits is 0.0067% of 118637 packets, so retransmits in this case do not come anywhere near 1% of the traffic. It’s not even enough to raise an eyebrow of concern.
Now if there were 25K TCP retransmits, it’d be 21% of the traffic and that’s sure to get your heart racing, as now you have to roll up your sleeves and start chasing down a root cause.
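The arithmetic here is simple enough to script if you compare captures often. A small Python sketch, using the packet counts from my capture (the function name is just for illustration):

```python
def share_of_capture(matching_packets, total_packets):
    """Return the matching packets as a percentage of the whole capture."""
    if total_packets <= 0:
        raise ValueError("total_packets must be positive")
    return 100.0 * matching_packets / total_packets


# The real capture: 8 retransmits out of 118637 packets
print(f"{share_of_capture(8, 118637):.4f}%")

# The hypothetical bad day: 25K retransmits out of the same total
print(f"{share_of_capture(25000, 118637):.1f}%")
```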
How do you remediate this? There are several good third-party network monitoring applications and services, requiring varying agents, logging and polling, that can help narrow down exactly where the problem may lie. Let’s assume that you don’t have any of these, so it very well may become a process of elimination.
Things to check
NIC: Update the NIC driver and firmware. You can’t go wrong here as it helps the NIC operate more efficiently; after all, a NIC is only as good as the driver that’s written for it. Also check the NICs for a duplex mismatch and compare the configured speed against the actual link bandwidth. If the NIC is teamed via a third-party utility, you may need to see if there’s an upgrade on that front as well.
Routers/Switches/Network layout: Check your routers and switches to see if they are flooded and dropping packets. Update your router and switch firmware if possible. If you’ve reached the capacity of what the router can handle, it may be time to reassess your network architecture for a revamp. Give your network layout a look; you may need to consider a bandwidth upgrade. Remember, each hop not only adds latency but is also another possible point of packet loss. From my time supporting Microsoft Data Protection Manager in the distant past, imagine this scenario:
A heavily used server in an internal DMZ is being backed up, not only its flat files but also a BMR (bare metal recovery) snapshot. The backup itself is taking place from several states away. Yes, that’s right, states. The traffic goes through 4 firewalls and a handful of routers BEFORE it leaves the site. The receiving end, again several states away, is behind 2 other firewalls and a handful of routers. This is not taking into account all the connecting devices outside the network admin’s control. So, TCP retransmits? You bet, many, like grains of sand on a beach my friend.
In this case the admin realized it was more efficient to place the DPM server at the same site as the server being backed up. A comparison trace showed far fewer retransmits than before.
Firewalls: Check your firewalls. If your traffic is being torn down, analyzed, reassembled and sent on its way by an aggressive application-layer or deep packet inspection firewall, this would be an excellent place to look for packet loss, latency and/or packets malformed upon reassembly. Does your firewall have a software update? How about a firmware update? Check its logging. If the log shows a packet received at time “x” and then forwarded many milliseconds or even seconds later than the average for all the other packets, you have to ask yourself why. Why the big latency compared to the other packets? Firewalls are awesome and have come a long way in their capabilities. You can configure some to really tear down packets and compare the contents against a huge list of exploit rules. This in itself may very well be the bottleneck, which can possibly be resolved via a simple update provided by the vendor, so see if the firewall has an upgrade that provides functionality enhancements. Also, much like a router, it may even be possible to increase its memory for the task.
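The firewall log comparison can be scripted once the log is parsed. The Python sketch below assumes you have already turned the log into (packet id, received time, forwarded time) tuples in seconds; that parsing is vendor-specific and not shown. The function name and the factor of 10 over the median delay are arbitrary choices for illustration, not a recommended threshold.

```python
from statistics import median


def slow_forwards(entries, factor=10.0):
    """Flag packets the firewall held much longer than its typical delay.

    entries: iterable of (packet_id, received_s, forwarded_s) tuples.
    Returns a list of (packet_id, delay_s) for the outliers.
    """
    delays = [(pid, forwarded - received) for pid, received, forwarded in entries]
    if not delays:
        return []
    # The median is used as the "typical" delay so a few outliers do not skew it
    typical = median(delay for _, delay in delays)
    return [(pid, delay) for pid, delay in delays if delay > factor * typical]
```

Anything this returns is a candidate for the "why the big latency?" question, for example a packet that hit an expensive inspection rule.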
As you can see, all in all you’ll need to check every single hop if possible and gauge its performance.
TCP Window Size 0
Now on to TCP Window Size 0. The window size is the receiver’s advertisement of how much data the sender may send.
In short, the receiving side tells its counterpart: “Keep the connection open but don’t send me anything. You want to know how much to send me? Here, send me this much: 0.”
The connection stays open and does not close. Once more data can be handled, the receiver will advertise a window larger than “0”.
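A toy Python model may help picture this. It treats the advertised window as simply the free space left in the receive buffer, which is a simplification: real stacks also deal with window scaling and avoid advertising tiny windows. The class and method names are hypothetical.

```python
class ReceiveBuffer:
    """Toy receiver: the advertised window is the free space left in the buffer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queued = 0  # bytes received but not yet read by the application

    def advertised_window(self):
        """How many more bytes the receiver tells the sender it can take."""
        return self.capacity - self.queued

    def receive(self, nbytes):
        """Accept up to the advertised window; return the bytes actually accepted."""
        accepted = min(nbytes, self.advertised_window())
        self.queued += accepted
        return accepted

    def application_reads(self, nbytes):
        """The application drains data, which reopens the window."""
        drained = min(nbytes, self.queued)
        self.queued -= drained
        return drained
```

Fill the buffer and `advertised_window()` returns 0 while the connection object is still alive. As soon as the application reads some data, the window opens again, which is the window update the sender is waiting for.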
OK, so what causes the TCP window of 0?
Usually it comes down to performance, simple as that. The TCP receive buffer is full and the receiver cannot accept any more data because it has not finished processing what is currently in its buffer. Using the same capture as above, I’ve filtered the display for just a TCP window size of 0.
Here you see that this time the packets we are interested in are much more numerous than the TCP retransmits.
For window size 0 we have 755 packets out of 118637, but that’s only 0.6% of the capture, again not even 1%. As mentioned before, if this were 25K, which is 21%, then yes, that’d be a problem. A performance problem, but a problem nonetheless. Usually things will still work and traffic will continue, just at a snail’s pace.
How do you remediate this? You have to look at the endpoint sending the window of “0”. Many times a simple perfmon counter will show just how busy the server is. The TCP buffer is full, waiting on the server to free up enough resources to process what’s already queued before it asks for more data to be sent.
Things to look at
The NIC: In addition to the suggestions for the NIC in the TCP retransmit section, there are built-in mechanisms to enhance network traffic, such as TCP Chimney Offload and/or RSS (Receive Side Scaling).
RSS enables network receive processing to be distributed across multiple processors in the system, if your server has more than one, while the packets of any one TCP connection stay on the same processor.
TCP Chimney Offload is a networking technology that helps transfer the workload from the server CPU to the network adapter during network data transfer. This of course assumes the adapter supports it.
Even these are not always without issue, and there may be some functionality limitations, such as with teamed NICs.
Most often, if TCP Chimney is not functioning properly, an update of the NIC driver will correct the problem. This is important to remember, as updating the NIC driver is often overlooked as a possible solution. Again, a NIC is only as good as the driver that is written for it, and an updated driver may correct many underlying issues.
If you suspect that TCP Chimney or RSS is not operating as expected, try toggling them off one at a time for testing. Doing this is simple, and turning off TCP Chimney or RSS does NOT require a reboot of the server. When you have finished testing, they can easily be turned back on or switched back to whatever state they were in before. Note: Articles at the end discuss how to toggle these settings.
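As a rough sketch of that toggling, the Python helper below builds the `netsh int tcp set global` commands for chimney and RSS, and reads the current state with `netsh int tcp show global`. This is my understanding of the commands for servers of the 2012 R2 era; treat it as an assumption and confirm against the articles at the end for your Windows version, since TCP Chimney may not be present on newer releases. The function names are hypothetical, and the commands need an elevated prompt.

```python
import subprocess


def netsh_tcp_global_command(setting, state):
    """Build 'netsh int tcp set global <setting>=<state>' as an argument list."""
    if setting not in ("chimney", "rss"):
        raise ValueError("setting must be 'chimney' or 'rss'")
    if state not in ("enabled", "disabled"):
        raise ValueError("state must be 'enabled' or 'disabled'")
    return ["netsh", "int", "tcp", "set", "global", f"{setting}={state}"]


def show_tcp_globals():
    """Return the current global TCP settings (Windows only)."""
    result = subprocess.run(
        ["netsh", "int", "tcp", "show", "global"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical test sequence: note the state first, toggle one setting, retest.
# print(show_tcp_globals())
# subprocess.run(netsh_tcp_global_command("chimney", "disabled"), check=True)
```

Recording the output of `show_tcp_globals()` before changing anything gives you the state to switch back to when testing is done.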
Server Resources: General server resources need to be observed, including counters for logical disk, physical disk, RAM, pagefile, all the TCP counters, processes and CPU. Most likely this will be the smoking gun. More often than not, for TCP Window size “0” you’ll see the server being slammed for resources. It may be taxed to its limit for one resource or a combination of them. Look at the event log for warnings of resource exhaustion. If you’re not that adept at reading perfmon, there’s an inbox, system-defined set of counters that can be run to give you about a 60-second snapshot of the server’s resources. Keep in mind that this is not a sustained counter, it just runs for about 60 seconds, so you’ll need to start it during a time of normal to heavy use for the server.
Let’s take a look at an all too common scenario.
Example: You have a backend SQL server, and the database is on D: along with the pagefile. It’s running the minimum amount of RAM and the bare minimum of CPU. SQL is installed and running, and during times of stress you see RAM consumed and the D: drive being hammered with high reads/writes and high disk I/O. I’d expect nothing different. I’d also expect to see a lot of packets with a TCP window size of “0”, whether you see a failure in SQL traffic or not.
A network trace shows a TCP window of “0”. Well yeah, I can see that. The server is focusing the bulk of its resources on SQL operations. RAM is exhausted, so it’s paging to D:, where the pagefile is, while also trying to do SQL work on that same drive. So I’d expect it to thrash the drive, which in turn increases overall I/O.
In this case the receiving server is telling the sender: “Send me ‘0’, the buffer is still full. I’m behind on processing data, so keep the TCP session open but give me ‘0’. Once I can free up some resources to empty the buffer, I’ll send you an advertisement of how much to send me, but for now, just ‘0’.”
So is this a problem if things are still working in SQL? Well, as with the captures shown, it’ll depend on the percentage you see, as well as on the SQL transactions taking place and whether or not things in SQL are excessively queued and results delivered slowly.
If the SQL work in question isn’t critical or the performance hit is rare, you may very well want to just live with it as it is. If the SQL work is crucial to the business and/or the perf hit is constant and sluggishness is NOT acceptable, then a call to action is required.
At a minimum, consider the following:
a.) Move the pagefile to C: or to another drive other than the one where the SQL database resides.
b.) Increase the RAM.
c.) Increase the CPU horsepower if possible.
d.) While moving the pagefile to another drive would help, if the disk thrashing is still taking place, try upgrading to faster disks (SSD, RAID configuration, etc.).
e.) Consider offloading some of the work to another SQL server if possible.
There are other things you can do that I won’t get into, but ultimately it’ll depend on your business needs as to how far you need to tune your server to remove performance bottlenecks.
Filter drivers in place: I’ve already written two articles discussing the effects that too many and/or wayward filter drivers can have on server performance. No use reinventing the wheel, so here you go:
Summation: Our initial question was “I see TCP retransmits and TCP window size 0, is that a problem?”
And you can see now how the answer is truly “yes”, “no” and “maybe”. It’ll all come down to the percentage of that traffic you see in the capture.
As in my capture example, there were only a handful of each compared to the overall number of packets. TCP retransmits took place as part of TCP’s reliability characteristic, and TCP Window 0 was set every now and then, keeping the session open but not accepting data. There were no issues or problems on the application side. So in my case the answer would be “no”, it’s not a problem.
The tires on my car need 44 PSI of air to operate optimally. Every day I look at my tire pressure on the onboard console of my car and it fluctuates a pound or three one way or the other depending on the outside temperature. It’s not feasible to check and adjust the PSI every day to make sure it always stays at 44 PSI and never a pound over or under.
Likewise, it’s a fantastic goal to never have any TCP retransmits or TCP Window 0 packets, but it’s not very realistic. So you’re left with baselining, comparing and monitoring, always on the lookout in case the network gremlins get out of control. And if they do, you’re ready: you have an action plan in place.
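Baselining can be as simple as keeping the percentages from past captures and comparing each new one against them. A minimal Python sketch follows; the function name, the sample values and the factor of 3 are arbitrary illustrations and not a recommended alerting threshold.

```python
from statistics import mean


def exceeds_baseline(baseline_pcts, current_pct, factor=3.0):
    """True when the current percentage is more than `factor` times the baseline average."""
    if not baseline_pcts:
        raise ValueError("need at least one baseline sample")
    return current_pct > factor * mean(baseline_pcts)


# Made-up retransmit percentages from earlier captures of the same workload
baseline = [0.0067, 0.01, 0.008]
print(exceeds_baseline(baseline, 0.009))  # a normal day
print(exceeds_baseline(baseline, 21.0))   # the gremlins are loose
```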
References
Description of Windows TCP features: https://docs.microsoft.com/en-us/troubleshoot/windows-server/networking/description-tcp-features
RFC 675 – Specification of Internet Transmission Control Program, December 1974 Version
RFC 793 – TCP v4
STD 7 – Transmission Control Protocol, Protocol specification
RFC 1122 – includes some error corrections for TCP
RFC 1323 – TCP Extensions for High Performance
RFC 1379 – Extending TCP for Transactions—Concepts [Obsoleted by RFC 6247]
RFC 1948 – Defending Against Sequence Number Attacks
RFC 2018 – TCP Selective Acknowledgment Options
RFC 5681 – TCP Congestion Control
RFC 6298 – Computing TCP's Retransmission Timer