ASHviz: Can you box that, please?
This installment explores the distribution of sampled event latencies from the ASH dump using `geom_boxplot( )`.
ASH fix-up and TIME_WAITED
When ASH samples a session that is waiting, the value of TIME_WAITED is set at 0. When the wait event completes, the exact time spent waiting is recorded and the ASH mechanism updates TIME_WAITED in the last of its samples for this event with this value. We refer to this "update after sampling" feature as the "ASH fix-up." A number of important values may be fixed-up after sampling by this mechanism, but TIME_WAITED is of especial interest and figures in many of the ASHviz investigations.
Events data frame
Given that only fixed-up samples in ASH will have a nonzero value for TIME_WAITED, we can create a data frame of all events sampled thus:
Events <- ashDF %>% filter( TIME_WAITED > 0 )
Box-whisker plots
The box-whisker plot produced by `geom_boxplot( )` in the ggpplot2 package is a high-level visual sketch of the distribution of a given set of data values. The "box" is drawn with upper-lower bounds at the 75th and 25th percentiles, meaning that half the given data values are within this box. The "whiskers" extend 1.5 * the inter-quartile range (range between 25th and 75th percentiles) above and below the box bounds, and "outlier" points beyond these limits are explicitly drawn. Finally, within the box the median value is indicated by a thick horizontal line. So this visual mechanism gives rapid coarse-grained insight into the compactness or spread of the data, range of values, existence of outliers, and skewness of distribution.
We can draw a box-whisker plot over the TIME_WAITED values in the Events data frame as follows:
ggplot(data = Events, aes(x = "ALL WAITS" , y = log10(TIME_WAITED))) +
geom_boxplot()
Which yields the following plot:
We actually plot over log10(TIME_WAITED) to characterize the distribution on a log scale. Under this scheme the value 3 corresponds to 10**3 microseconds or 1 millisecond, and the value 6 corresponds to 1 second.
Even though our code is about as simple as it gets, the plot is informative about the overall collection of events recorded in our ASH dump. The box hows half the event latencies lie between 10 milliseconds and 200 milliseconds or so. The median splits the box pretty evenly so there is not much skew in the bulk of the values, however there appear to be a strong cadre of outliers on the low-latency side, with more than a few values below 100 microseconds. On the high side notice events above 1 second in duration, important in that individual wait events may have been sampled multiple times.
Wait class differences
Wait events cover a large multitude of serialization and delay mechanisms, supporting many purposes and contexts in an operational Oracle database. Understanding event latencies is an important part of understanding system performance altogether, as well as solving specific issues. So lumping all events together and characterizing the group distribution may not really be that useful in a practical sense.
One natural breakdown of all events is by wait class. We can do this in geom_boxplot by setting the x aesthetic to WAITCLASS. Also, I added red reference lines at 1 millisecond and 1 second:
The result is one boxplot per wait class summarizing all event latencies from that class. There is definitely additional information to be gleaned from this more detailed view:
- the bulk of event latencies in all classes are between 1 millisecond and 1 second
- "Idle" events were sampled even though this is not supposed to occur
- "Network" events have much larger variation than any other class
- "Commit" events show tight distribution around 10 milliseconds but with a significant cluster of outliers well above 100 milliseconds
However in truth the view above is not really very satisfying. For one thing, there is no indication of how many events are summarized by each boxplot, and since all plots are given equal visual real estate this may lead to the classic "thinking too hard about too little" issue.
Another problem stems from the fact that there may be many or few event types in any given wait class and their latency characteristics may be very different. So a class-level plot may summarize many event types together, or only one or two, depending. Confusing.
It seems that when looking at event latencies, either looking at them all together or isolating at the event level are the two best options.
Unfortunately, when I changed the x aesthetic to use EVENT_ID instead of WAITCLASS, the unreadable mess in the header of this article was the result. I even went further in the notebook, also coloring the plots by EVENT_ID. Disaster.
Boxes of boxes
We can combine the grouping by wait class with the summarization by event using faceting. The plot below still uses EVENT_ID as the x aesthetic but facets by WAITCLASS, producing a separate collection of individual event boxplots for each class.
Here we see very different numbers of event types have been sampled per class, from a single type in both the "Commit" and "Idle" classes to numerous types in the "Cluster" and "Other" classes. Observe that the broad range of values in the "Network" class is really the union of several smaller ranges. Similarly, the "Application" class combines 3 event types with very different latency characteristics.
This is probably about as much mileage as we will get from this investigation. The boxplot has provided some good high-level insights into sampled event latencies by type and wait class. It's a good start anyway.
references:
ASH Architecture and Advanced Usage last and best presentation on ASH from my days at Oracle