登录查看更多内容

ASHviz: Dark matter 2

John Beresniewicz

Architect at Oracle

发布日期: 2019年5月16日

This article extends the discussion of "dark matter" in ASH by exploring a completely new source of data about event latencies that does not suffer from sampling bias toward longer latencies. This new and possibly independent source of event latency information could provide important additional insights into latency distributions during the period of the ASH dump. It may provide an interesting comparison and test of accuracy for the count estimator introduced previously.

ASH `TIME_WAITED` and dark matter

A previous Jupyter notebook and related articles (see references below) discussed the fact that the ASH sampling mechanism is biased toward sampling longer latency events and thus event latencies in ASH cannot be treated like a random sample. Descriptive statistics like median and average over these samples do not represent unbiased estimates, and should not be used.

We also saw a means whereby the event latency sampled in the ASH column `TIME_WAITED` can be combined with the sampling interval to provide an estimate of event occurrences. Aggregating these estimates over larger time intervals can give unbiased picture of the distribution of event latencies.

Sampled and estimated latency densities

Recall the violin plots of probability densities over event latencies both as sampled and weighted by the event count estimator:

Note again that the sampled density is almost entirely above 1 millisecond latency, while nearly half the estimated density is below 100 microseconds. This large proportion of low latency events indicated by the estimate is almost invisible in the raw ASH data as they have been sampled relatively few times. This is the basis for my "dark matter" reference.

There is something about the event count estimate that I don't quite understand fully, which is that our confidence in it must decrease with the sample latency. When the latency is high, larger than the sampling interval, we are certain the count is 1 as the sampler was guaranteed to see the event and only that event. On the other hand, it seems that if the sampler happens to collect a rare 100 microsecond event, the estimate projects that it happened 10000 times, perhaps far too many. That is, a "lucky hit" by the sampler of a very low latency event could throw the estimate way off from reality. My intuition is that we should be able to draw confidence intervals around the estimated density that would reflect this. I wonder how much to believe the large pool of probability below 100 microseconds in the estimate weighted density on the right. Maybe the confidence bands are so wide at these low latencies that virtually anything is possible.

CPU samples: the new dark matter?

So far the event latency analysis has focused on events sampled into ASH and the final latency recorded into `TIME_WAITED` by the fix-up mechanism. However, there is another potential source of event latency information in ASH, hiding in plain sight. And, significantly, this source is seemingly independent of event sampling bias to longer latencies.

`V$SESSION.WAIT_TIME`

`WAIT_TIME` has always been one of those problematic columns in `V$SESSION`, what to do with it? ASH samples active waits and fixes up samples with wait time in `TIME_WAITED`, but `V$SESSION` has been recording `WAIT_TIME` since well before ASH.

According to the Oracle 10.2 documentation, `WAIT_TIME` is interpreted as follows:

A nonzero value is the session's last wait time. A zero value means the session is currently waiting.

So if recollection serves, DBA 1.0 used this column to identify when sessions queried from `V$SESSON` were on CPU. The actual value in `WAIT_TIME` was less important than that it was nonzero and therefore session is using CPU. The context is interactive queries against `V$` tables to diagnose performance issues, and knowing the previous event latency incurred by a session currently gobbling CPU couldn't seem less relevant. So this column always seemed rather useless.

HOWEVER, now ASH exists. And `WAIT_TIME` is part of the sampled data. And sessions ON CPU are sampled. And these sessions will record (nonzero) event latencies into ASH (presumably in microseconds, like `TIME_WAITED`.)

Thus it seems that every ASH row for a session ON CPU may be hiding an event latency in `WAIT_TIME`. This is super interesting and completely relevant to investigating the distribution of event latencies using an ASH dump. These values are a completely independent source of event latencies from the event samples.

Density of `log10(WAIT_TIME)`

Going by the documentation, a simple filter for nonzero `WAIT_TIME` should be enough to isolate these values. Consider a violin density plot of these "hidden" latencies. We also filter out values larger than the sampling interval:

p <- ggplot( data = ashDF %>% filter(WAIT_TIME > 0 & WAIT_TIME < 1000001)

      ,aes( y = log10(WAIT_TIME))) +

  geom_violin(aes( x = "C:CPU_WAITTIME"))

With the result:

Well, at first glance this density looks very different from both earlier densities. There is still the narrow "neck" at 100 microseconds, which is interesting given the apparent independence of this sample from the other. The vast bulk of the density is between 100 microseconds and 10 milliseconds, but there is a nontrivial amount between 10 and 100 microseconds. So there are apparent similarities as well as differences with the other latency density plots.

Compare all three densities

We now have the following sources for describing the distribution of wait event latencies as sampled by the ASH dump:

Sampled event counts - ASH fixed-up events (`TIME_WAITED > 0`) counts by latency
Estimated event counts - ASH fixed-up events weighted by the inverse latency count estimate
CPU sampled event counts - ASH ON CPU wait samples (`WAIT_TIME > 0`) counts by latency

Here is the source code to plot the three densities together using `geom_violin` and with medians. The two "dark matter" densities are filled in two different grey scales.

p <- ggplot( data = Events %>% filter(TIME_WAITED < 1000001)
            ,aes( y = log10(TIME_WAITED))
            ) +
    geom_violin(aes( x = "B:ESTIMATED"
                    ,weight = EST_COUNT / sum(EST_COUNT)
                    ,fill = I("black")
                    )
                ,alpha = .5
                ,draw_quantiles = (.5))      +
    geom_violin(aes( x = "A:SAMPLED"
                    )
                   ,draw_quantiles = (.5))   +
    geom_violin(data = ashDF %>% filter(WAIT_TIME > 0 & WAIT_TIME < 1000001)
               ,aes( x = "C:CPU_WAITTIME"
                    ,y = log10(WAIT_TIME)
                    ,fill = I("grey")
                    )
                ,alpha = 1
                ,draw_quantiles = (.5))


  
p

And the resulting plot:

Observations

It is very interesting that the density plot based on `WAIT_TIME` from ON CPU samples in ASH is almost an "average" of the two densities based on event samples from the same ASH dump. It's as if the upper portion of the sampled density was squashed down and the lowest portion of the estimated density was squeezed up into a shape that is much fatter in the middle and much more evenly dispersed than the others. Also notice the median values of the CPU and event estimated densities are quite close.

The actual events registering latencies will in general be completely different, however it is possible for an event to be sampled as an event `TIME_WAITED` initially followed by a CPU sample in which it is the last event and thus recorded into `WAIT_TIME`. No reason to think this is an issue.

It's also worth noting and remembering that these event latencies can include IDLE events, unlike the ASH event samples. This is one reason for only using events shorter than the sampling interval (`TIME_WAITED < 1000001`) as IDLE events can be long.

Discussion

This new "dark matter" source of event latency samples is extremely interesting. There is a lot to think about and investigate, including:

what randomness properties or biases may pertain in these event latencies?
is other event information in the row consistent (same event?)
the magic millisecond

Sample radomness

It seems pretty clear that that these event latencies from CPU sample `WAIT_TIME` do not suffer the same bias to longer latencies that the raw ASH samples do. The sampler has sampled a session on CPU, the duration of the last wait event for the session can have absolutely no bearing on that.

So I am tempted to think of this new dark matter as a (relatively) random sample of event latencies, perhaps much better for describing the distribution than the event sample latencies.

However, it seems that there could be other forms of bias that issue from the nature of these new sample latencies. For instance, very long events may never be sampled. Or a single wait event followed by multiple seconds on CPU could be sampled multiple times. The proportion of time that sessions spend on CPU vs in wait events determines the relative volumes of events in the two streams: CPU-dominant sessions mostly sample ON CPU and into the `WAIT_TIME` stream, while WAIT-dominant sessions sample mostly into the `TIME_WAITED` stream.

Thinking about it a bit more, the new dark matter are events that immediately preceded the work currently being done by an ON CPU session. Each `WAIT_TIME` represents a working session that was last held up by an event of this latency.

Event information

A critical question is whether the `EVENT_ID` and `P1`, `P2`, `P3` variables in the CPU samples will match the latency in `WAIT_TIME`? It seems very likely that they do and if so that we may have random samples of event latencies by event and wait class in ASH. That is worth digging into a bit and testing / verifying. There could be new and interesting event analytics to dream up using these event latency samples, perhaps for monitoring use cases.

Magic millisecond

There is something special about the value 10^3 microseconds (1 millisecond) with respect to the count estimator function. This is the value at which the conversion from latency to estimated count of occurrences is the identity. Recall that for sampled event latencies shorter than 1 second (10^6 microseconds) the count is estimated by:

count = 10^6 / latency

For the latency value of 10^3 the count-latency conversion function is 1-1 since:

10^3 = 10^6 / 10^3

This is exactly the point where the count-latency function derivative is equal to -1. Not exactly sure but I think this means that latencies near 10^3microseconds have about equal impact on the sampled and estimated densities.

Conclusion

This consideration has been extremely interesting, and probably worth extending. In particular, the prospect of looking closely at events and their latencies from the ON CPU samples in ASH seems quite promising. As well there may be value for monitoring or other performance analysis use cases.

notebook

Link to this notebook on github:

darkMatter2

references

Previous related ASHviz articles:

要查看或添加评论，请登录

John Beresniewicz的更多文章

Estimating OLTP Execution Latencies Using ASH

2021年2月16日

Estimating OLTP Execution Latencies Using ASH

I want to share something super-useful about Active Session History that I came to understand only last week. Examining…

17 条评论
ASHviz: Fiddling with violins

2019年5月6日

ASHviz: Fiddling with violins

The last ASHviz installment, Densities and dark matter, was a bit of a cognitive burden, but the concepts introduced…
ASHviz: Densities and dark matter

2019年5月1日

ASHviz: Densities and dark matter

This installment gets into some deeper concepts relative to visualizing event latency distributions as well as using…
ASHviz: Can you box that, please?

2019年4月27日

ASHviz: Can you box that, please?

This installment explores the distribution of sampled event latencies from the ASH dump using `geom_boxplot( )`. ASH…
ASHviz: Issue at the x-axis

2019年4月25日

ASHviz: Issue at the x-axis

Take another look at the plot in header above. This plot aggregates ASH data by STATE_CLASS using SAMPE_TIME as the…
ASHviz: Accidentally good

2019年4月23日

ASHviz: Accidentally good

This is a short blurb about being sensitive to whether a visualization that works well in a specific case will…
ASHviz: Visualizing ASH dumps with Jupyter Notebooks

2019年4月22日

ASHviz: Visualizing ASH dumps with Jupyter Notebooks

This article begins what I hope will be an interesting series focusing on some data visualization research I have been…
Visualizing Performance Benchmarks (4) - Validate, analyze, conclude

2018年11月16日

Visualizing Performance Benchmarks (4) - Validate, analyze, conclude

In this final episode, we VALIDATE our suspicions about the file-based configurations bottlenecking on read I/O…

12 条评论
Simple SQL Injection Vulnerability Testing

2018年11月13日

Simple SQL Injection Vulnerability Testing

According to The Open Web Application Security Project (OWASP), injection remains the number one category of security…

3 条评论
Visualizing Performance Benchmarks (3) - Start Small and Predict

2018年11月12日

Visualizing Performance Benchmarks (3) - Start Small and Predict

So far in this series we've seen some nice visualizations of elapsed time data for loading a large number of 5GB files…

See all articles

ASHviz: Dark matter 2

John Beresniewicz

Architect at Oracle

ASH `TIME_WAITED` and dark matter

Sampled and estimated latency densities

CPU samples: the new dark matter?

`V$SESSION.WAIT_TIME`

Density of `log10(WAIT_TIME)`

Compare all three densities

Observations

Discussion

Sample radomness

Event information

Magic millisecond

Conclusion

notebook

references

John Beresniewicz的更多文章

社区洞察

其他会员也浏览了

Exploring Two-Sample Kolmogorov-Smirnov Test with Simulations

How to predict Healthy and Faulty sounds with MFCC (Mel-frequency Cepstral Coefficients), SVM and deploy them to Streamlit

Unlock Your Potential: Enroll in a Data Science Course in Uttam Nagar Today!

Unified

Statistical functions and methods from NumPy, pandas, and SciPy

Grind 75 - 23 - Maximum Depth of Binary Tree

Longest Subarray With Sum K

Get started with the CoFFee toolbox to explore your multibeam sonar raw data

BackTracking : WordSearch :4April POTD

?? Day 138 of 365: Principal Component Analysis (PCA) for Dimensionality Reduction ??

ASH `TIME_WAITED` and dark matter

Sampled and estimated latency densities

CPU samples: the new dark matter?

`V$SESSION.WAIT_TIME`

Density of `log10(WAIT_TIME)`

Compare all three densities

Observations

Discussion

Sample radomness

Event information

Magic millisecond

Conclusion

notebook

references

John Beresniewicz的更多文章

Estimating OLTP Execution Latencies Using ASH

ASHviz: Fiddling with violins

ASHviz: Densities and dark matter

ASHviz: Can you box that, please?

ASHviz: Issue at the x-axis

ASHviz: Accidentally good

ASHviz: Visualizing ASH dumps with Jupyter Notebooks

Visualizing Performance Benchmarks (4) - Validate, analyze, conclude

Simple SQL Injection Vulnerability Testing

Visualizing Performance Benchmarks (3) - Start Small and Predict

社区洞察

其他会员也浏览了

Exploring Two-Sample Kolmogorov-Smirnov Test with Simulations

How to predict Healthy and Faulty sounds with MFCC (Mel-frequency Cepstral Coefficients), SVM and deploy them to Streamlit

Unlock Your Potential: Enroll in a Data Science Course in Uttam Nagar Today!

Unified

Statistical functions and methods from NumPy, pandas, and SciPy

Grind 75 - 23 - Maximum Depth of Binary Tree

Longest Subarray With Sum K

Get started with the CoFFee toolbox to explore your multibeam sonar raw data

BackTracking : WordSearch :4April POTD

?? Day 138 of 365: Principal Component Analysis (PCA) for Dimensionality Reduction ??