ASHviz: Issue at the x-axis
Take another look at the plot in header above.
This plot aggregates ASH data by STATE_CLASS using SAMPE_TIME as the x-axis aesthetic. Take a second to think about what data points are plotted: for every distinct SAMPLE_TIME aggregate (count) all the rows (sessions) in each STATE_CLASS. How many distinct x-axis values will have one or more points plotted on the y-axis? Recall the ASH samplers on the 4 RAC instances start and operate independently, and - as we will see in later articles - they time their activities quite closely. There should thus be no overlapping in SAMPLE_TIME across instances, so each data point will be a within-instance aggregate. So just in the 40 minute period between 11:40 and 12:20 that makes 9600 distinct x-axis times:
- 60 (seconds/samples) * 40 (minutes) * 4 (instances) = 9600
The plot above suffers significant overplotting in both x and y dimensions, among other issues. It is certainly not devoid of meaning, however, and some color and point size manipulations might help better show that:
- Instances consistently have 5 or fewer active sessions on CPU
- User I/O ranges between 10 and 20 active sessions consistently
- Cluster waits are normally around 8 active sessions but experienced 2 episodes when they spiked on at least one instance
This is consistent with what we know about the workload from earlier investigations, if a bit less crisply evident.
OK, now I have to confess that I began these investigations using SAMPLE_ID for the x-axis rather than SAMPLE_TIME. You see, I've mostly used single instance Oracle databases, and even in RAC usually referenced the local V$ASH. So I'm used to thinking of SAMPLE_ID as a proxy for SAMPLE_TIME and with some nice properties because it is a monotonically and sequentially increasing integer (instead of an unwieldy datetime.) So for quick and dirty analyses it works quite nicely.
Look at the plot below, which simply swaps SAMPLE_ID for SAMPLE_TIME for the x-axis from the earlier plot.
It's kind of a featureless blob, truly a mess. What is going on? There are still the three bands of activity for CPU, Cluster wait, and User I/O wait. But notice that on the y-axis the range of these bands is about 4x the corresponding range on the earlier plot, and there are exactly 4 instances. So SAMPLE_ID must be common across the 4 independent streams of samples and the values for 4 instances are aggregated together for each. Completely absent from this plot is any sense of one or more incidents or unusual activity, so the picture is almost useless.
In the first plot instance activity was separated on the x-axis because sample times were distinct, in the second activity was aggregated across instances because SAMPLE_ID is common. So a confusing aggregation was introduced by carelessly choosing x-axis and not being fully cognizant of the shared SAMPLE_ID.
In this plot we separate by instance using color but keep using SAMPLE_ID as the x-axis:
Now some of the familiar structure emerges. Notice there seem to be 4 activity peaks, and these seem to show instances 1 and 2 peaking together twice and instances 3 and 4 peaking together twice. Curious. Since our prior investigation showed the peaks happening together on all instances at the same TIME, does this mean that the same SAMPLE_ID occurs close in time for instances 1 and 2, and likewise for instances 3 and 4? Just what is going on between SAMPLE_ID and SAMPLE_TIME anyway?
To answer this I created a simple yet informative visualization: a line chart plotting SAMPLE_TIME on x-axis and SAMPLE_ID on y-axis, coloring by instance.
Indeed, instances 1 and 2 form a pair and likewise instances 3 and 4, and the nature of their pairing is just as we surmised: generating the same SAMPLE_ID very closely in time. However, the two groups generate the same SAMPLE_ID some 5 minutes apart from each other. This largely answers why the earlier plot was such a mess: it was aggregating together activity on x-axis that did not happen concurrently. Time-shifted aggregation, meaningless.
So what are the take aways from this excursion? Well, certainly to not use SAMPLE_ID as a proxy for time and especially in RAC contexts where SAMPLE_ID from different instances will be offset in time, perhaps significantly. Some x-axis aggregation should always be done (especially in RAC contexts) to provide a unified time line. Aggregating 60 samples into minute-by-minute views of ASH is a favorite strategy, and the notebook ASH data frame provides the MINIT column for just this purpose.
When using ggplot the first two decisions to make relative to a visualization are:
- what data frame to plot?
- what variable is the x-axis?
Another lesson here: don't rush into plotting a visualization without being fully mindful of the x-axis.