Visualizing Performance Benchmarks (4) - Validate, analyze, conclude

In this final episode, we VALIDATE our suspicion that the file-based configurations were bottlenecking on read I/O saturation, using a simple but effective bar chart visualization. Next we ANALYZE a nice plot of the tradeoff between load server host CPU (hashing tuples) and the distributed database's write I/O capacity (persisting tuples), which is the essential tradeoff in the loading process once the input I/O bottleneck is removed. Finally, we CONCLUDE with a simple line chart comparing the overall ingest capacity of the three configurations, which was the original purpose of the study.

Validate: File Input is I/O Bound

We begin by justifying our conclusion that the two file-based benchmark configurations were bottlenecking on read I/O. The plots below show bar charts of Average Active Threads on both CPU and in I/O Wait over entire benchmark runs for each of the degrees of parallelism. This metric is an OS analog of the Average Active Sessions metric we compute for Oracle database time (follow the link for a brief talk on the subject). It's simply total time (on CPU or in I/O Wait) divided by total elapsed time, and represents the average "load" in Linux for each of these resources.
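To make the metric concrete, here is a minimal sketch of how Average Active Threads could be computed from per-interval OS samples. The column names (cpu_busy_sec, iowait_sec, interval_sec) are hypothetical stand-ins for whatever your atop/sar extraction produces, not the actual fields used in the benchmark.

```python
import pandas as pd

def average_active_threads(samples: pd.DataFrame) -> pd.Series:
    """Average Active Threads for one benchmark run.

    Assumes one row per sampling interval with hypothetical columns:
      cpu_busy_sec - thread-seconds on CPU during the interval
      iowait_sec   - thread-seconds in I/O wait during the interval
      interval_sec - wall-clock length of the interval
    """
    elapsed = samples["interval_sec"].sum()
    return pd.Series({
        "aat_cpu": samples["cpu_busy_sec"].sum() / elapsed,
        "aat_iowait": samples["iowait_sec"].sum() / elapsed,
    })
```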

These plots show CPU and I/O load on the load client host for the two configurations that input data to the load pipelines from files. Technically, the orange bars show I/O backlog as they measure the average number of threads waiting to do I/O. When any orange shows up, it means the system is I/O saturated and there is a backlog.

Recall that the load client host reads input files, parses them, and sends tuples on to the load server processes. In the top chart, both client and server are co-located on the same host. In the bottom chart, the load server is on a different host (the chart shows only the client host).
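As a rough sketch of how such charts can be produced, the snippet below stacks CPU and I/O-wait bars per degree of parallelism with matplotlib. The data frame and its values are purely illustrative placeholders, not the measured benchmark numbers.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative values only -- not the measured benchmark numbers.
load = pd.DataFrame({
    "parallelism": [1, 5, 10, 15, 20],
    "aat_cpu":     [0.8, 3.5, 4.0, 4.1, 4.2],
    "aat_iowait":  [0.0, 0.0, 2.5, 5.0, 7.5],
}).set_index("parallelism")

ax = load.plot(kind="bar", stacked=True, color=["steelblue", "orange"])
ax.set_xlabel("parallel load pipelines")
ax.set_ylabel("Average Active Threads")
ax.set_title("Load client host: CPU vs I/O wait load")
plt.tight_layout()
plt.show()
```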

First notice that I/O saturation (orange bars) occurs somewhere after 5 parallel load pipelines and before 10, in both cases. This is consistent with our earlier prediction from the single pipeline tests that input I/O would bottleneck at 6 or 7 pipelines, an encouraging validation. The growth in I/O backlog once saturation occurs is linear in the number of parallel load pipelines: every extra 5 pipelines increases the load by pretty much the same amount.

On the CPU side, notice the large difference in CPU load between the configurations. This is because in the top configuration, a single machine is hosting both the load client and load server processes. The load servers are basically hashing machines and use a lot of CPU. The CPU load is uniform after 5 parallel pipelines because the number of tuples being fed into load servers is constant once I/O saturation occurs. Note also that our single pipeline tests indicated load server CPU was approximately 5x client CPU, so the co-located host should show about 6x the CPU of the client-only host, and the top chart looks very consistent with being 6x the bottom chart, as predicted. This is another encouraging validation that observations from the single pipeline case hold firm in the multi-pipeline benchmark runs. There are apparently no unexpected side effects in moving from one to multiple concurrent ingest pipelines.

Analyze: Loader CPU vs DB Write I/O

Now we get to see my favorite plots of this whole effort. Here we plot the CPU load on the load server host (Y-axis) against the aggregate database write volume over all the cluster workers (X-axis). This is the essential tradeoff that we want to measure when assessing the maximum ingest capacity of the cluster: the tuple hashing capacity of the load server vs the write I/O capacity of the entire database cluster. I didn't discuss the benchmark data design in any detail, but here it is useful to mention that the design is such that tuples are evenly distributed among all database shards throughout the entire execution of an ingest pipeline. There should be no skew in load over shard hosts at any point in time or overall.

The data was collected from atop records captured during runs. Each minute of the run was assigned a degree of parallelism depending on when in the run it was measured. Load host CPU (active threads) and database host disk write throughput (MB/sec) were aggregated over these 1-minute buckets and plotted against each other in a dot plot. Color was used to distinguish degree of parallelism. A dotted red line is drawn at the number of cores on the load server host. The I/O capacity of the workers is not known, else we would have drawn a vertical boundary there too. A simple linear regression was drawn through the plot, just because. ;)
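For readers who want to reproduce this kind of plot, here is a minimal sketch under assumed inputs: a per-minute bucket table with hypothetical columns (parallelism, loader_cpu, db_write_mbs), a placeholder file name, and a placeholder core count. None of these are the actual benchmark artifacts.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# One row per 1-minute bucket, with hypothetical columns:
#   parallelism  - degree of parallelism active during that minute
#   loader_cpu   - active threads on the load server host
#   db_write_mbs - aggregate DB write throughput over all workers (MB/sec)
buckets = pd.read_csv("atop_minute_buckets.csv")   # placeholder file name

LOADER_CORES = 32   # placeholder: use the actual core count of the load host

fig, ax = plt.subplots()
points = ax.scatter(buckets["db_write_mbs"], buckets["loader_cpu"],
                    c=buckets["parallelism"], cmap="viridis", alpha=0.7)
fig.colorbar(points, ax=ax, label="parallel load pipelines")

# Dotted red boundary at the load host core count.
ax.axhline(LOADER_CORES, color="red", linestyle=":", label=f"{LOADER_CORES} cores")

# Simple least-squares line through all buckets, just because. ;)
slope, intercept = np.polyfit(buckets["db_write_mbs"], buckets["loader_cpu"], 1)
xs = np.linspace(buckets["db_write_mbs"].min(), buckets["db_write_mbs"].max(), 100)
ax.plot(xs, slope * xs + intercept, color="gray")

ax.set_xlabel("aggregate DB write throughput (MB/sec)")
ax.set_ylabel("load server host CPU (active threads)")
ax.legend()
plt.show()
```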

First let's look at representative runs from the two file-based configurations:

Both plots are remarkably similar, and this is not unexpected because both suffer from the aforementioned input I/O saturation. We see that there is an increase in both CPU and disk write throughput after 5 parallel loads, but both stall out into a big "cloud" of data points above that. Again, this is because both configurations are bottlenecked on input I/O at the client (above 7 concurrent pipelines), and thus loader CPU and database write I/O are capped at the bottleneck values. There is a slight difference in that the bottom plot shows more CPU usage, but this is expected as a consequence of that configuration hosting both the load client and server: the extra CPU is from the load client processes (which use some CPU parsing tuples from raw input).

In contrast, the plot for the stream-based input run below extends much farther into both X and Y dimensions: more CPU power is being leveraged to hash tuples and much more tuple data is making it onto disk. The input I/O bottleneck has been eliminated and we are truly pitting hashing CPU vs disk persistence capacities against each other.

It's inconclusive which antagonist is stronger and it almost looks like a tie. CPU is well above the core line, so hyperthreading is being leveraged and saturation is likely not far off. On the disk side, both 25 and 30 parallel loads seem to cover pretty much the same range (6500-7500 MB/sec) so that may be close to the limit. (I should have examined I/O wait on the database hosts here but didn't.)

I really like these plots as they clearly show how relieving congestion at the beginning of the pipeline broke the dam and unleashed the real power of the cluster to load data. It's worth noting that the streaming configuration was not originally designed into the testing strategy, but was implemented out of necessity once it became apparent that the parallel pipelines were stalling on input well before the load server hash power or database host write power were challenged. Another benchmarking principle: ADAPT to new information.

Conclude: Platform Ingest Capacity?

OK, we are almost home. This has been a long segment in a long series, but there is one thing left to do, and that is to answer the original question motivating the benchmark: what is the ingest capacity of this distributed database cluster?

So let's do a simple line chart of ingest capacity (TB/hr) vs load parallelism and plot each of the 3 configurations separately. Data are averaged across all runs for each configuration.
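A minimal sketch of such a chart, assuming a per-run results table with hypothetical columns (configuration, parallelism, ingest_tb_per_hr) and a placeholder file name rather than the actual benchmark output:

```python
import pandas as pd
import matplotlib.pyplot as plt

# One row per (configuration, parallelism, run); columns are hypothetical.
runs = pd.read_csv("benchmark_runs.csv")   # placeholder file name

# Average ingest capacity across runs, one line per configuration.
avg = (runs.groupby(["configuration", "parallelism"])["ingest_tb_per_hr"]
           .mean()
           .unstack("configuration"))

ax = avg.plot(marker="o")
ax.set_xlabel("parallel load pipelines")
ax.set_ylabel("ingest capacity (TB/hr)")
ax.set_title("Ingest capacity by configuration")
plt.show()
```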

There is nothing really unexpected in this plot, and that is good. It looks like we have a solid answer to our question, and a lot of confidence in our results. Regardless of how we configure the load pipelines, using file inputs caps ingest capacity at around 4.5 TB/hr, achieved at 7 degrees of parallelism. On the other hand, streaming data directly into the load pipelines enables the cluster to ingest almost 7.5 TB/hr at up to 30 degrees of parallelism. Since the curve is pretty flat at 30, the limit has been reached, and to go beyond this value we need to add load host CPU, database write capacity (additional shard hosts), or both.

Epilogue

This series has been about using data visualization in the context of understanding and communicating about performance benchmarks. I'm not making specific recommendations for how to visualize your performance data, but rather wanted to show how using visualization can really boost your understanding of what is going on. I didn't delve much into benchmark design, only enough for readers to understand what was being shown.

It's worth noting that the benchmark design made a crucial decision early on that profoundly affected the results: the size of each tuple in the 5GB input data. Clearly, the larger the tuple size the greater the disk write pressure per volume of tuples hashed (CPU pressure), whereas the smaller the tuples the greater the hashing (CPU) pressure per MB of data loaded. I decided that 1 KB tuples were a good compromise, not too big and not too small. The relationship of ingest capacity to tuple size is an interesting topic for future benchmarks.
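As a back-of-the-envelope illustration of that tradeoff (using the roughly 7.5 TB/hr streaming peak from above; the other tuple sizes are purely hypothetical):

```python
# At a fixed ingest volume, smaller tuples mean proportionally more tuples
# to hash (loader CPU pressure), while larger tuples mean fewer hashes for
# the same write volume (database disk pressure).
INGEST_TB_PER_HR = 7.5   # approximate streaming peak observed above

for tuple_kb in (0.5, 1, 2, 4):   # hypothetical tuple sizes; 1 KB was used
    tuples_per_sec = INGEST_TB_PER_HR * 1e9 / tuple_kb / 3600
    print(f"{tuple_kb:>4} KB tuples -> {tuples_per_sec / 1e6:5.1f}M tuples/sec to hash")
```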

Again, thanks for persisting if you made it this far. Comments, questions, and of course "likes" are welcome.

Prior articles in this series:

Visualizing Performance Benchmarks - Repeatability

Visualizing Performance Benchmarks (2) - First see ALL the Data

Visualizing Performance Benchmarks (3) - Start Small and Predict










Neil Gunther

Founder/Computer Scientist, Performance Dynamics

6 years ago

re: Your title. cf. VAMOOS: Visualize, Analyze, Modelize, Over and Over, until Satisfied https://goo.gl/KJMiEF :)

