Explaining sch_cake's statistics
One universal complaint I have with nearly the entire fq_codel'd and cake'd userbase nowadays is the lack of statistics collection. While some graphing tools exist, few people post results, even though we sank ages of time into coming up with valuable and useful statistics on how the link is behaving, in the cake implementation in particular. But we haven't explained them all that well before, so perhaps the onus is more on us to describe their value to the user. So here's a quick run-through of what they mean, and perhaps you'll be inspired to take a look at your own stats.
root@turris:~# tc -s qdisc show dev eth2
qdisc cake 8010: root refcnt 9 bandwidth 35Mbit diffserv3 triple-isolate nat nowash ack-filter split-gso rtt 100.0ms raw overhead 0
... I went into what these options can mean over here: https://forum.mikrotik.com/viewtopic.php?t=179307
 Sent 597930110 bytes 2985748 pkt (dropped 958, overlimits 1327048 requeues 10)
 backlog 0b 0p requeues 10
 memory used: 140800b of 4Mb
... every drop generally represents saving a latency excursion measured in 100s of milliseconds, lasting for potentially many minutes. This particular link is not highly loaded at the moment. If there is a persistent backlog, you might consider looking harder at your traffic.
 capacity estimate: 35Mbit
... This shaper is configured for 35Mbit.
 min/max network layer size:          42 /   1514
 min/max overhead-adjusted size:      42 /   1514
 average network hdr offset:          14
... Doing framing right is especially important on DSL, PPPoE and cable.
... As for the below.. This instance is configured to take advantage of the most common diffserv markings and is a superset of the venerated wondershaper tool. Remarkably, some traffic on this network (somewhere) is actually trying to mark some packets appropriately!
                  Bulk Best Effort       Voice
 thresh      2187Kbit      35Mbit    8750Kbit
... Bulk is guaranteed a minimum of 1/16 (6.25%) of the bandwidth, which is the 2187Kbit shown here. Voice covers the most common set of diffserv marks for voice (some cell phones do use this) and gets a quarter of the bandwidth; in this case that is gross overkill (64Kbit is the limit for most voice), so it is hard to exceed this figure. Our hope was that more videoconferencing traffic would mark appropriately. Best Effort is where everything else goes.
 target         8.3ms       5.0ms       5.0ms
... this is the "codel target" for queuing latency. It is a target, not a fixed figure. At really low rates (below 4Mbit), cake autoscales the target to account for the largest packet possible.
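The 8.3ms in the Bulk column can be reproduced with a little arithmetic. This is a rough sketch, not the kernel code: it assumes the target becomes 1.5x the time needed to send one 1514-byte packet at the tin's threshold rate whenever that exceeds the configured 5ms, and that the interval (the next row in the output) stretches by the same amount. The function name and the 1.5x factor are my reading of the scaling, so treat them as assumptions.

```python
# Tin threshold rates in bit/s, taken from the "thresh" row above.
TIN_RATES = {"Bulk": 2_187_500, "Best Effort": 35_000_000, "Voice": 8_750_000}

def tin_target_interval(rate_bps, mtu=1514, target_ms=5.0, rtt_ms=100.0):
    """Return (target_ms, interval_ms) for a tin shaped to rate_bps.

    Assumed scaling: target = max(1.5 * one-MTU serialization time, 5ms),
    interval = max(rtt + (target - 5ms), 2 * target).
    """
    mtu_time_ms = mtu * 8 / rate_bps * 1000.0
    target = max(1.5 * mtu_time_ms, target_ms)
    interval = max(rtt_ms + target - target_ms, 2 * target)
    return target, interval

for name, rate in TIN_RATES.items():
    target, interval = tin_target_interval(rate)
    print(f"{name:12s} target {target:.1f}ms  interval {interval:.1f}ms")
```

Under these assumptions only the Bulk tin, at roughly 2.2Mbit, is slow enough for the scaling to kick in; the other two stay at the configured 5ms and 100ms.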
 interval     103.3ms     100.0ms     100.0ms
... Interval is an assumption of the max RTT on the path. Now we get into more detailed stats:
 pk_delay        11us        24us       1.3ms
 av_delay         3us         5us        90us
 sp_delay         3us         2us         2us
 backlog           0b          0b          0b
... Peak delay measures the impact of recent bursts on the system. Average delay is just that. Sparse delay is probably the most important stat of the three: if your sparse packets are getting delayed, you have a really large workload on the system. All of these are EWMAs, and to make sense of them they need to be sampled and plotted every few seconds.
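If you want to sample these yourself, here is a minimal sketch that pulls the three delay rows out of the text output and converts them to microseconds. It assumes the diffserv3 layout shown in this post (three tins) and works on a pasted sample; in practice you would feed it the output of `tc -s qdisc show dev eth2` every few seconds. The function names are made up for illustration.

```python
import re

# Rows copied from the output in this post.
SAMPLE = """\
                  Bulk Best Effort       Voice
 pk_delay        11us        24us       1.3ms
 av_delay         3us         5us        90us
 sp_delay         3us         2us         2us
"""

TINS = ("Bulk", "Best Effort", "Voice")  # assumes diffserv3
UNIT_TO_US = {"us": 1.0, "ms": 1_000.0, "s": 1_000_000.0}

def delay_to_us(token):
    """Convert a time token such as '1.3ms' to microseconds."""
    match = re.fullmatch(r"([0-9.]+)(us|ms|s)", token)
    if match is None:
        raise ValueError(f"unrecognised delay token: {token!r}")
    return float(match.group(1)) * UNIT_TO_US[match.group(2)]

def parse_delays(text):
    """Return {row_name: {tin_name: delay_in_us}} for the *_delay rows."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == len(TINS) + 1 and fields[0].endswith("_delay"):
            result[fields[0]] = {
                tin: delay_to_us(tok) for tin, tok in zip(TINS, fields[1:])
            }
    return result

delays = parse_delays(SAMPLE)
for row, per_tin in delays.items():
    print(row, per_tin)
```

Append each sample to a file with a timestamp and you have something you can plot.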
... a persistent backlog is not an error, but a sign you have one or more long-running flows, like a backup or bittorrent. If this is really big and stays that way, you might have an unresponsive flow on the network.
 pkts            7720     2965829       13157
 bytes        3919616   589592983     4665886
... just bytes and packets. Seeing stuff actually fall into these classes indicates you are using them. There are tools to reclassify certain kinds of traffic into these tins like https://forum.openwrt.org/t/qosify-new-package-for-dscp-marking-cake/111789/ - I note that I just prefer to slam cake on an interface first, get it configured properly, and then, maybe, worry about further classification.
 way_inds           0       41685           0
 way_miss           3      181466         281
 way_cols           0           0           0
... these are statistics on how well the 8-way set associativity of cake is working. A lot of way_cols means you have a LOT of different kinds of traffic flowing through and most likely a persistent backlog. It's really rare to see way_cols except under a sophisticated DDOS. Another subtle point: with big numbers here, the fair queuing portion of cake is doing its job much, much better than a FIFO ever could.
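To make the three counters concrete, here is a toy model of an 8-way set-associative flow table. It is a simplification and not the kernel code: the 1024-queue table size is an assumption, queues are never released, and the class and method names are invented. The point is only to show when each counter would tick.

```python
SET_WAYS = 8    # ways per set, as described above
QUEUES = 1024   # assumed number of flow queues in the table

class FlowTable:
    def __init__(self):
        self.tags = [None] * QUEUES  # flow hash that owns each queue
        self.way_inds = 0  # flow found in its set, but not in its home slot
        self.way_miss = 0  # flow not found; an empty slot in the set was claimed
        self.way_cols = 0  # flow not found and the set is full: queue is shared

    def lookup(self, flow_hash):
        """Return the queue index used for flow_hash, updating the counters."""
        home = flow_hash % QUEUES
        inner = home % SET_WAYS
        outer = home - inner
        if self.tags[home] == flow_hash:
            return home  # direct hit, no counter changes
        # Is some other slot in the set already reserved for this flow?
        for i in range(SET_WAYS):
            slot = outer + (inner + i) % SET_WAYS
            if self.tags[slot] == flow_hash:
                self.way_inds += 1
                return slot
        # Otherwise claim the first empty slot in the set.
        for i in range(SET_WAYS):
            slot = outer + (inner + i) % SET_WAYS
            if self.tags[slot] is None:
                self.way_miss += 1
                self.tags[slot] = flow_hash
                return slot
        # Set is full: fall back to the home slot and accept the collision.
        self.way_cols += 1
        self.tags[home] = flow_hash
        return home

table = FlowTable()
# Nine distinct flows that all land in the same set of eight slots.
for k in range(1, 10):
    table.lookup(k * QUEUES)
table.lookup(2 * QUEUES)  # an earlier flow, found away from its home slot
print(table.way_inds, table.way_miss, table.way_cols)
```

In this model it takes nine simultaneously tracked flows hashing into the same set before a single collision is recorded, which fits with way_cols staying at zero on a lightly loaded link.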
 drops              0         125           0
... We helped save on 125 latency excursions (bufferbloat) on this cake instance thus far. Not a lot, but I've only been running it for a few hours with just me as a workload!
 marks              0           0           0
... No ECN-enabled transports are in use on this link. We kind of expect to see this number be bigger in the future if L4S is rolled out.
 ack_drop           0         833           0
... This cake instance is configured to drop extra TCP acks under pressure. This helps more and more on asymmetric links as the down/up ratio gets worse than 10:1.
 sp_flows           1           2           1
 bk_flows           0           1           0
 un_flows           0           0           0
... These are the numbers of currently active flows in each category cake tracks: sparse, bulk, and unresponsive.
 max_len         1514        6056        1514
... The 6056 figure indicates this router does GSO/GRO - bulking up sequential packets into one big packet. While GSO and GRO save on CPU, the big packets really can hurt interflow latencies, so cake splits them back up into individual packets.
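To put a number on that, here is a quick back-of-the-envelope calculation of how long each packet size occupies the wire at this link's 35Mbit shaped rate. The function name is just for illustration.

```python
LINK_RATE_BPS = 35_000_000  # shaper rate from the output above

def serialization_ms(nbytes, rate_bps=LINK_RATE_BPS):
    """Time in milliseconds to put nbytes on the wire at rate_bps."""
    return nbytes * 8 / rate_bps * 1000.0

for size in (1514, 6056):
    print(f"{size} bytes: {serialization_ms(size):.3f} ms")
```

A 6056-byte aggregate is four 1514-byte packets back to back, so anything queued behind it waits four times as long as it would behind a single packet. The slower the link, the more that matters.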
 quantum          300        1068         300
... At low bandwidths, a smaller quantum interleaves packets better, but costs CPU. A 300 byte quantum costs 6x as many trips through a loop as a 1500 byte quantum. Cake has a heuristic to set this, which we could possibly extend to allow values larger than an MTU, but we haven't got around to that.
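The quantum row can be reproduced with a short sketch. The formula below is an assumption based on my reading of the heuristic (the tin's byte rate shifted right by 12, clamped between 300 and 1514 bytes), and the round count uses a textbook deficit round robin approximation, so take both as illustrative rather than as the kernel's exact behaviour.

```python
def flow_quantum(rate_bits_per_s, mtu=1514, floor=300):
    """Assumed heuristic: (bytes per second >> 12), clamped to [floor, mtu]."""
    rate_bytes = rate_bits_per_s // 8
    return max(min(rate_bytes >> 12, mtu), floor)

def drr_rounds(packet_len, quantum):
    """Rounds a textbook DRR scheduler needs to earn credit for one packet."""
    return -(-packet_len // quantum)  # ceiling division

# Tin threshold rates in bit/s, from the "thresh" row earlier in the output.
for name, rate in (("Bulk", 2_187_500),
                   ("Best Effort", 35_000_000),
                   ("Voice", 8_750_000)):
    q = flow_quantum(rate)
    print(f"{name:12s} quantum {q:5d}  rounds per 1514-byte packet {drr_rounds(1514, q)}")
```

With these assumptions the Bulk and Voice tins sit at the 300 byte floor, while Best Effort at 35Mbit lands on 1068.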
I hope y'all find this useful, and check your stats when your network is behaving badly... AND when it's behaving well!
"If you do not take risks for your ideas you are nothing. Nothing." N.N.T. | #LibreQoS & #bufferbloat :-) PS: Bandwidth is a lie!
ha!
@dtaht:matrix.org - Truly speeding up the Net, one smart ISP at a time
Trick question, does anyone know why there's a spike at T+30 in this plot?