Measurements and Metrics
Miguel Pinilla
Technology and Supply Chain Executive @ Salduba Technologies | PhD, Manufacturing Information Systems
Context
Measurements and metrics are essential for data-driven operation and management of systems, organizations or processes.
As mentioned in the introduction article to Performance Measurement, a common problem is that we don't pay enough attention to how those all-important metrics are obtained and how to interpret them properly.
The measurement process is the set of mechanisms, activities and calculations that take a system in operation and produce information to support decision making and improve the operation of the system itself.
To make the best decisions, it is not enough to look at the metrics and reports; we also need to understand what they mean and how they were produced. Pulling the thread from the desired results requires digging through four layers: reports, metrics, measurements and signals.
Graphically:
Starting with the reports that best convey the performance of a system, we can pull the thread and see what metrics and measurements a monitoring system needs to put in place and which signals should be monitored. Measurements and signals are easier to understand through an example, here the performance of an API endpoint, including an example of how they can be implemented. From this exercise we'll be able to extract the overall structure of the measurements and metrics process so that we can apply it to other situations and domains.
Recap of the System Model and Relevant Measurements
To explore how Reports are obtained from basic signals in the four layers described above, it is useful to recall the simplified system model from the introduction article. The summary of the model as a Job Flow diagram:
The system processes jobs as they arrive, consuming resources in the process and producing successful and unsuccessful outcomes. As the system has some finite capacity, arriving jobs may have to wait for resources to become available, in which case they queue up before the system starts working on them.
The performance of this type of system can be characterized along multiple dimensions, each of them with their corresponding indicators:
Reports
Reports are directly consumed by decision makers, so they need to be concise, understandable and effective in conveying the intended information. The Visual Display of Quantitative Information is a classic resource for designing effective reports in visual form.
There are several core principles shared by effective reports:
The performance characteristics described in the previous section can be made concrete with a set of graphical reports that follow these principles. The specific values in the examples below are generated with a simple G/G/k Queue simulator that can be found as a Jupyter Notebook and associated Python files in GitHub at perf_measurements.ipynb.
For these examples, ticks are an arbitrary unit of time. Different applications may use different units (e.g. milliseconds for API performance, hours for e-commerce fulfillment, …). Each report instance will cover a given time period of observations of the system.
Throughput Report
Throughput shows the amount of output produced by the system in the period. In these examples, it is simply represented by the number of tasks completed in the period. If the value from each task varies from task to task, this can be replaced by a sum of the values of tasks completed during the period.
The report provides:
The lower chart shows a histogram of the throughput values during the period. For throughput and other indicators, variability is as important as the average value, if not more so. Presenting the statistical distribution of values gives the users valuable information on the behavior of the system, particularly when used in combination with yield or other quality reports. The vertical axis can be labeled with the raw count of instances that fall in each bin (Frequency) or alternatively with a percentage of the total number of occurrences in the report.
Distribution information can also be computed for each point in the report, as its value is itself computed from multiple measurements as we will see later. In this case, Candlestick Charts similar to those used in stock pricing can be used, being careful to not overload a single chart with too much information that would make it too noisy.
Yield/Availability-Reliability Reports
Yield is the degree of success that a system has in producing its outputs. In the case of discrete outputs this is the simple percentage of successful outputs against the total number of jobs processed.
An example report is:
This report provides the following information:
In the case of yield, the values on the vertical axis are dimensionless percentages.
Reliability or Availability reports are very similar to Yield reports, but applied to a continuous output. The only difference is in the calculation of the percentage values in the report. In the case of continuous processes, the percent is computed by taking the quantity of acceptable product (or time in the case of pure reliability reports) against the total production of the period represented by the data point.
Time Reports
The service time indicators are all similar. Taking the Lead Time as the example, the associated reports can be made almost identical to the Throughput ones:
The vertical scale in the timeline chart and the horizontal scale in the histogram are in time units (simulation ticks in the example). For service time reports, the control levels can represent directly externally defined service goals or percentiles of the data as information provided to the decision makers.
Work In Progress (WIP) Report
Work In Progress follows the same pattern, with the units being the count of jobs currently in the system. Similarly to Throughput reports, the simple count of jobs can be replaced by the sum of the value assigned to each job or other dimensions like the expected time or cost to process them.
Correlation Reports
The reports described so far show the evolution of a metric of the system against time or a statistic (the distribution histogram) of that dimension. To understand the behavior of the system it is also useful to directly report relationships between two different metrics. For dynamic systems, two reports offer particular insights.
Lead Time vs. Throughput
This report helps to understand the behavior of the system under different loads. Lead time grows in proportion to the inverse of the spare capacity of the system, which goes to zero as the throughput approaches the maximum that the system can support, driving unbounded increases in lead times. All systems are constrained by their resources. To minimize their cost, they tend to operate at high utilization values, resulting in very high variability of lead times. As lead times are a key component of SLAs, it is particularly important to understand the actual values that the system is experiencing.
This report shows the Lead Time (time between arrival and completion) of jobs when plotted against the throughput in an interval. The Lead Time is computed as the average of all the jobs that complete in the specific interval. Similar reports could be done for the P95 value of the jobs lead times, or other relevant percentiles for SLA evaluation.
The report also shows a fitted curve that follows Kingman’s Formula for G/G/1 queues :
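For reference, the usual textbook statement of Kingman's approximation for the mean waiting time in queue of a G/G/1 system is shown here. The symbols are notation chosen for this sketch: \(\rho\) is the utilization, \(\tau\) the mean processing time, and \(c_a\), \(c_s\) the coefficients of variation of the inter-arrival and processing times. The mean lead time is then approximately this waiting time plus \(\tau\).

```latex
\mathbb{E}[W_q] \;\approx\; \left(\frac{\rho}{1-\rho}\right)
                            \left(\frac{c_a^{2} + c_s^{2}}{2}\right)\tau
```

The factor \(\rho/(1-\rho)\) is what produces the hyperbolic shape of the fitted curve as utilization approaches 100%.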
While this formula is not useful for predicting the behavior of more complex systems, it is a decent tradeoff to use it as a relatively simple curve shape to summarize noisy metric data.
When using the Utilization instead of the throughput, the limit effect of maximum capacity becomes very easy to see.
Lead Time vs. WIP
The second important insight when comparing metrics against each other is the behavior of lead time against the WIP in the system. This is a useful report because the statistical behavior of these two dimensions follows Little's Law (L = λW) for most systems, resulting in a linear relationship between them whose slope is set by the long-running average of the system's throughput. Deviations from this estimate indicate temporary changes to the job arrival rate or service times.
Following Little’s law, the fitted line is a linear regression.
Note that the choice of the curve to fit to the data is driven by a-priori knowledge or assumptions about the behavior of the system (hyperbolic for the previous two, linear for this one). This way, deviations from these assumptions can be spotted in the report as ill-fitting curves. Although not provided here, information associated with the report itself should include a measure of the curve fit (e.g. R²) and information to help interpret its shape and meaning. In the example above, the fit is less than perfect, particularly at the lower WIP levels. This may indicate that the simulation was not properly "warmed up" and shows transient effects in the data, or other operational anomalies if the data were to represent a real system.
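A minimal sketch of such a fit, assuming each report point supplies an average lead time and an average WIP for one interval. The sample values and the helper names `fit_through_origin` and `r_squared` are made up for illustration. Little's Law states L = λW, so regressing WIP on lead time with a line through the origin gives a slope that estimates the throughput λ; with the axes swapped, the slope would estimate 1/λ instead.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope of y = slope * x (line constrained through the origin)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx


def r_squared(xs, ys, slope):
    """Coefficient of determination of the fitted line against the data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical report points: average lead time (ticks) and average WIP (jobs).
lead_times = [2.0, 4.0, 6.0, 8.0]
wip = [1.1, 1.9, 3.2, 3.9]

throughput_estimate = fit_through_origin(lead_times, wip)  # jobs per tick
fit_quality = r_squared(lead_times, wip, throughput_estimate)
print(f"estimated throughput: {throughput_estimate:.3f} jobs/tick, R^2 = {fit_quality:.3f}")
```

Reporting the R² value next to the chart gives readers a quick check on whether the linear assumption holds for the period shown.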
Metrics
Metrics provide the values that get presented in the reports. The implementation of a metric results in a collection of numbers or symbols and represents a characteristic of the operation of a system. Metric values are associated with a time interval in which the metric is computed and a point in time which is the end of that interval.
To define a metric, we need to define the calculations to produce its values based on the information available during an interval. For the System model considered above, several metrics are commonly used:
Throughput
Throughput metrics represent the value created per unit of time by the system. This can be as simple as counting the number of successful jobs completed in a period, or adding a measure of value associated with each successful job.
The main Throughput Metric for a time interval \((T_s, T_e]\) is commonly represented by the Greek letter lambda (\(\lambda\)). The calculation is the sum of the value measurements of each job (V(J)) that is completed in that interval. Formally, for interval i:
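One way to write this calculation, consistent with the description above, is the following. The completion time \(C(J)\) of a job and the index \(i\) on the interval bounds are notation assumed for this sketch.

```latex
\lambda_i \;=\; \sum_{J \,:\, T_{s,i} \,<\, C(J) \,\le\, T_{e,i}} V(J)
```

When every job is given the value \(V(J) = 1\), this reduces to the count of jobs completed in the interval.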
Throughput metrics can also be expressed as a Utilization value by normalizing them against a stated Capacity for the system during the same interval, with Utilization being 100% when throughput reaches that value. Utilization is commonly referred to as rho in the operations literature. It is important to select the Capacity number so that utilization always stays below 100%, so that it can be fitted with curves with a shape following 1/(1-rho) without hitting the singular point at rho = 1.
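As an illustration, throughput and utilization for one interval could be computed as sketched here. The job list, the capacity figure and the function names are assumptions made for the example, with each job represented as a (completion time, value) pair.

```python
def throughput(jobs, t_start, t_end):
    """Sum the value of jobs completed in the half-open interval (t_start, t_end]."""
    return sum(value for completed_at, value in jobs if t_start < completed_at <= t_end)


def utilization(throughput_value, capacity):
    """Throughput normalized against the stated capacity for the same interval."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return throughput_value / capacity


# Hypothetical jobs: (completion time in ticks, value of the job).
jobs = [(5, 1.0), (10, 1.0), (12, 1.0), (20, 1.0), (21, 1.0)]

lam = throughput(jobs, t_start=10, t_end=20)  # only the jobs at ticks 12 and 20 count
rho = utilization(lam, capacity=4.0)          # capacity of 4 jobs per interval is assumed
print(f"throughput = {lam}, utilization = {rho:.0%}")
```

The job completing exactly at tick 10 belongs to the previous interval, so consecutive intervals never count the same job twice.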
Throughput metrics are frequently associated with business goals, leading to the definition of target or reference levels for the system performance like:
Yield
Yield metrics are computed based on the Yield measurements for the period, as the percentage of successful jobs with respect to all completed jobs: yield = N(successful) / (N(successful) + N(unsuccessful)).
Depending on the volatility of this percentage, Yield can also be computed using a calculation period that covers multiple trailing sampling periods. In this case, multiple yield metrics can be defined attending to the maximum, average, or percentile thresholds. For these more advanced statistics, it is critical to consider the number of sample points available during the calculation period to ensure a representative value.
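A minimal sketch of a yield calculation over trailing sampling periods, taking yield as the share of successful jobs among all completed jobs. The counts and the function names are hypothetical. Pooling the counts over the window, rather than averaging per-period percentages, keeps periods with few jobs from dominating the result.

```python
def yield_ratio(successes, failures):
    """Share of successful jobs among all completed jobs, or None if no jobs completed."""
    total = successes + failures
    return None if total == 0 else successes / total


def trailing_yield(counts, window):
    """Yield per sampling period, pooled over the last `window` periods.

    `counts` is a list of (successes, failures) pairs, one per sampling period.
    """
    results = []
    for i in range(len(counts)):
        recent = counts[max(0, i - window + 1): i + 1]
        successes = sum(s for s, _ in recent)
        failures = sum(f for _, f in recent)
        results.append(yield_ratio(successes, failures))
    return results


# Hypothetical (successes, failures) counts for three consecutive sampling periods.
period_counts = [(9, 1), (8, 2), (10, 0)]
yields = trailing_yield(period_counts, window=2)
print(yields)
```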
Service Times: Wait Time, Lead Time, Processing Time
Time metrics represent lengths of time that specific jobs or tasks take along their processing by the system and they are computed as statistics of those periods measured for a population of jobs. The population of jobs is typically all the jobs that complete in a given interval of time, which usually coincides with the data points presented in the reports above. In some cases the interval considered for the population of jobs is longer than the gap between consecutive report points leading to trailing average type of statistics, intended to smooth over noisy data or eliminate high frequency components of the measurement itself. Commonly used statistics for time metrics include centrality statistics like median, average or mode across all considered jobs, dispersion statistics like standard deviation or limit statistics like P90 or P95 of the population of jobs compared to a desired benchmark or SLA.
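The statistics mentioned above can be sketched with the Python standard library. The lead time values are invented, and the nearest-rank percentile used here is only one of several common percentile definitions, so other tools may report slightly different values.

```python
import math
import statistics


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def time_metric_summary(durations):
    """Centrality, dispersion and limit statistics for a population of job durations."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "stdev": statistics.pstdev(durations),
        "p95": percentile_nearest_rank(durations, 95),
    }


# Hypothetical lead times (ms) of the jobs completed in one metric interval.
lead_times_ms = [12, 15, 11, 40, 13, 14, 16, 12, 90, 13]
summary = time_metric_summary(lead_times_ms)
print(summary)
```

In this sample the mean is pulled well above the median by two slow jobs, which is why limit statistics such as P95 are usually reported alongside the average when evaluating an SLA.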
There are three specific metrics for a single-stage system like the one shown above: Lead time (W_t), Processing Time (tau) and Wait time (W_w):
Their formal definition:
From their definitions it is obvious that:
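A sketch of per-job definitions that are consistent with the measurement queries used later in the article. The arrival, start and completion timestamps of a job \(J\) are written here as \(A(J)\), \(S(J)\) and \(C(J)\), which is notation assumed for this sketch.

```latex
\begin{aligned}
W_w(J)  &= S(J) - A(J) && \text{(wait time)} \\
\tau(J) &= C(J) - S(J) && \text{(processing time)} \\
W_t(J)  &= C(J) - A(J) && \text{(lead time)} \\[4pt]
W_t(J)  &= W_w(J) + \tau(J)
\end{aligned}
```

The last line follows by adding the first two: the start time cancels, so lead time is always the sum of wait time and processing time for the same job.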
Work In Progress (WIP)
Work in progress is evaluated at the end of the interval over which the metric is computed. It is a simple count of the number of jobs that have arrived but have not completed yet. In the Operations literature, it is commonly designated as (L). Formally, for the interval:
It will be:
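One way to express this count, writing the arrival and completion timestamps of a job \(J\) as \(A(J)\) and \(C(J)\) and the end of the interval as \(T_e\) (notation assumed for this sketch):

```latex
L(T_e) \;=\; \bigl|\{\, J \;:\; A(J) \le T_e \,<\, C(J) \,\}\bigr|
```

A job that has not completed yet can be treated as having \(C(J) = \infty\), so it stays in the count until its completion is observed.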
Signals and Measurements
Signals and Measurements is where the interface with the real-life system happens. To make the description more concrete, let’s assume that the system is an API Endpoint in a system and we will be producing the metrics and reports described above with API calls being the jobs in the system.
Reactive computer systems, like those supporting websites, enterprise systems, or even control systems, deliver value by responding to external inputs with correct information and internal changes to their state (their successful outcomes) or occasionally producing an error due to incorrect inputs or internal conditions (their scrap). Observability of API performance is a very common topic in DevOps practice and can be formulated as a straightforward application of the concepts defined above.
This simplified model can be criticized as not covering information processing in multiple steps or more complex communication topologies, yet the core ideas stay applicable and are easily extended to cover those cases.
The metric definitions in the previous section rely on being able to identify Jobs, their arrival, start and complete times and fixing the metric intervals where the metrics themselves are calculated. The basic information for each job can be represented by a table with one row per job and the following values as columns:
Running systems don’t produce this neat representation directly. Observability tools usually produce a stream of events as a Job is being processed through the system. These events are the signals that the system emits.
Signals
The job signals available from an API are the request and response events for the API as captured in a log. A JSON example of such an event indicating a job arrival may be:
{
  "Id": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2020-02-08 09:30:26.123-08:00",
  "event": "REQUEST",
  "details": {
    "url": "https://example.server.com/at/path",
    "operation": "GET"
    [...]
  }
}
and the corresponding event indicating a successfully completed job may be:
{
  "Id": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2020-02-08 09:30:27.383-08:00",
  "event": "SUCCESS",
  "details": {
    "code": "200",
    [...]
  }
}
The distinction between a job arrival and a job start is frequently ignored in API monitoring but it can be an important improvement in understanding the behavior of the system. A very useful interpretation is to have the arrival event correspond to the arrival of the request to the HTTP protocol handler or Ingress component in an architecture and associate the start event with the moment the request begins processing by the business logic or back-end component.
The determination of Value for the job is dependent on the contents of the details section and the measurement system will need to perform a domain specific calculation to obtain it.
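A minimal sketch of how a stream of such events could be folded into one row per job. It assumes the event names REQUEST, START, SUCCESS and ERROR, timestamps already converted to milliseconds since the Unix Epoch, and illustrative column names (`arrival`, `start`, `completion`, `outcome`) that are not taken from the original table.

```python
# Map each event type to the job-table column it fills (column names are assumed).
EVENT_TO_COLUMN = {
    "REQUEST": "arrival",
    "START": "start",
    "SUCCESS": "completion",
    "ERROR": "completion",
}


def jobs_from_events(events):
    """Fold a stream of job events into a dict of per-job rows keyed by job Id."""
    jobs = {}
    for event in events:
        column = EVENT_TO_COLUMN.get(event["event"])
        if column is None:
            continue  # ignore event types that are not part of the job life cycle
        row = jobs.setdefault(
            event["Id"],
            {"arrival": None, "start": None, "completion": None, "outcome": None},
        )
        row[column] = event["timestamp"]
        if column == "completion":
            row["outcome"] = event["event"]
    return jobs


# Hypothetical event stream: job "a" runs to completion, job "b" has only arrived.
events = [
    {"Id": "a", "timestamp": 1000, "event": "REQUEST"},
    {"Id": "b", "timestamp": 1500, "event": "REQUEST"},
    {"Id": "a", "timestamp": 1200, "event": "START"},
    {"Id": "a", "timestamp": 2260, "event": "SUCCESS"},
]
job_table = jobs_from_events(events)
print(job_table)
```

Rows with missing columns, like job "b" here, are exactly the jobs that count towards Work In Progress at the time the stream is read.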
Resource signals depend heavily on the specifics of the architecture serving the API. Good candidates can be the number of IO operations associated with a job, the number of threads used, the CPU time and the memory consumed. These signals tend to be costly to obtain with the granularity of individual jobs, but can be obtained by sampling the state of the compute resources associated with the API at regular Sampling Periods using records similar to:
{
  "resource": "Threads",
  "period": {
    "from": "2020-02-08 09:30:20.000-08:00",
    "to": "2020-02-08 09:30:25.000-08:00"
  },
  "quantity": {
    "amount": 163,
    "unit": "count"
  }
}
This record shows a sampling period of 5 seconds and a signal of 163 active threads during that period. The sampling mechanism itself will determine whether this number should be interpreted as the maximum, minimum, average, value at start/end of period, etc.
Measurements
Measurements are about assigning values to the signals. To remove ambiguity without inventing a new notation, we will simply express the measurements as SQL statements against a table that has the Id of the job, the Timestamp expressed as milliseconds since the Unix Epoch, the event as a VARCHAR, and the details as additional columns. This is just for convenience of notation. Implementations may vary depending on technology choices and other optimizations. Popular choices for signal collection and storage are log analysis services like Splunk or New Relic. The natural key of such a table would then be (ID, TIMESTAMP), as no two events for the same job should be simultaneous.
With this convention, the measurements that can be directly obtained from the stream of events are:
Throughput
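-- Jobs arriving in the interval [@from, @to)
select count(1) from api_log
where event = 'REQUEST'
and timestamp >= @from and timestamp < @to;
-- Jobs starting to be processed in the interval [@from, @to)
select count(1) from api_log
where event = 'START' and timestamp >= @from and timestamp < @to;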
Yield
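-- Successful jobs; half-open interval so boundary events are not counted twice
select count(1) from api_log
where event = 'SUCCESS' and timestamp >= @from and timestamp < @to;
-- Unsuccessful jobs in the same interval
select count(1) from api_log
where event = 'ERROR' and timestamp >= @from and timestamp < @to;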
WIP
-- Total WIP: jobs that have arrived but not yet completed
select arrivals.n - completions.n from
  (select count(1) as n from api_log
    where event = 'REQUEST'
    and timestamp <= @measurement_time) as arrivals,
  (select count(1) as n from api_log
    where event in ('SUCCESS', 'ERROR')
    and timestamp <= @measurement_time) as completions;
-- Jobs in process: started but not yet completed
select starts.n - completions.n from
  (select count(1) as n from api_log
    where event = 'START'
    and timestamp <= @measurement_time) as starts,
  (select count(1) as n from api_log
    where event in ('SUCCESS', 'ERROR')
    and timestamp <= @measurement_time) as completions;
-- Jobs waiting: arrived but not yet started
select arrivals.n - starts.n from
  (select count(1) as n from api_log
    where event = 'REQUEST'
    and timestamp <= @measurement_time) as arrivals,
  (select count(1) as n from api_log
    where event = 'START'
    and timestamp <= @measurement_time) as starts;
Wait Time, Lead Time, Processing Time
-- Wait time: start minus arrival, for jobs that start within the period
select s2.start_time - s1.arrival_time as wait_time from
  (select id, timestamp as arrival_time from api_log
    where event = 'REQUEST') as s1,
  (select id, timestamp as start_time from api_log
    where event = 'START'
    and timestamp between @start_period and @end_period) as s2
where s1.id = s2.id;
-- Processing time: completion minus start, for jobs that complete within the period
select s2.completion_time - s1.start_time as processing_time from
  (select id, timestamp as start_time from api_log
    where event = 'START') as s1,
  (select id, timestamp as completion_time from api_log
    where event in ('SUCCESS', 'ERROR')
    and timestamp between @start_period and @end_period) as s2
where s1.id = s2.id;
-- Lead time: completion minus arrival, for jobs that complete within the period
select s2.completion_time - s1.arrival_time as lead_time from
  (select id, timestamp as arrival_time from api_log
    where event = 'REQUEST') as s1,
  (select id, timestamp as completion_time from api_log
    where event in ('SUCCESS', 'ERROR')
    and timestamp between @start_period and @end_period) as s2
where s1.id = s2.id;
Measurements as defined consider only the jobs that have their completion event recorded during the period and do not restrict when their begin event happened. Otherwise they would only consider jobs with a maximum duration equal to the sampling period.
Note
Taking the measurement based on completion events gives us the desirable property that the measurement will be stable, that is, it will not change when data from new signals become available. On the other hand, this makes the measurement a lagging indicator for changes in the signal, potentially for multiple sampling periods if typical processing times are much longer than the sampling period.
When computing measurements for historical data, the conditions of what jobs to consider may be chosen differently. For example, we can measure all the jobs that started during the sampling period instead, and allow the selection to peer into the future of the sampling period for the end event of the job.
Resources Consumed
To compute the resources consumed, we'll assume an additional table of signals, resource_usage, with the following columns: period_end as a Unix Epoch timestamp, a resource_label to identify the resource being consumed, and a decimal amount that represents the quantity of the resource consumed in a pre-defined unit of measure specific to the resource label (e.g. thread counts, memory Mbytes, CPU units, …).
select resource_label, period_end, sum(amount)
from resource_usage group by resource_label, period_end;
Any observation mechanism like the event log described here is itself subject to incidents and failures. In addition to the measurements of the process as defined, it is therefore useful to measure the quality of the observation process by counting, for example, jobs that have completion events without their corresponding arrival or start event:
-- Completions in the period with no matching arrival or start event
select count(s1.id) as mismatches from
  (select id, timestamp as completion_time from api_log
    where event in ('SUCCESS', 'ERROR')
    and timestamp between @period_start and @period_end) as s1
  left outer join
  (select id, timestamp as arrival_time from api_log
    where event = 'REQUEST') as s2
  on s2.id = s1.id and s2.arrival_time <= s1.completion_time
  left outer join
  (select id, timestamp as start_time from api_log
    where event = 'START') as s3
  on s3.id = s1.id and s3.start_time <= s1.completion_time
where s2.arrival_time is NULL or s3.start_time is NULL;
or similarly, start events that don’t have a completion event within a given timeout, which may indicate a failure of the data collection or an error/exception in processing the job that is not captured by the system.
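The second check can be sketched with Python's built-in sqlite3 module, so that the query can be run as written. The api_log layout follows the convention above (id, timestamp in epoch milliseconds, event); the sample rows, the timeout value and the function name `count_stale_starts` are assumptions made for the example.

```python
import sqlite3

# Starts that are older than the timeout and have no completion within it.
STALE_STARTS_SQL = """
select count(*) from api_log as s
left outer join api_log as c
  on c.id = s.id
 and c.event in ('SUCCESS', 'ERROR')
 and c.timestamp between s.timestamp and s.timestamp + :timeout
where s.event = 'START'
  and s.timestamp + :timeout <= :now
  and c.id is null
"""


def count_stale_starts(connection, now, timeout):
    """Count START events with no SUCCESS/ERROR event within `timeout` ms."""
    cursor = connection.execute(STALE_STARTS_SQL, {"now": now, "timeout": timeout})
    return cursor.fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("create table api_log (id text, timestamp integer, event text)")
conn.executemany(
    "insert into api_log values (?, ?, ?)",
    [
        ("a", 1000, "REQUEST"), ("a", 1100, "START"), ("a", 1500, "SUCCESS"),
        ("b", 2000, "REQUEST"), ("b", 2100, "START"),  # never completes
        ("c", 9000, "START"),                          # too recent to judge
    ],
)
stale = count_stale_starts(conn, now=10_000, timeout=5_000)
print(f"stale starts: {stale}")
```

Only job "b" is counted: job "a" completed in time, and job "c" started less than one timeout ago, so its outcome is still undetermined.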
Taking a Step back: The Measurement Process
The previous sections started by showing what kinds of charts or reports are useful in assessing a system's performance and then walked through the metrics that support them and how to compute these metrics from measurements on the observable signals of the system. This structure of Signals/Measurements/Metrics/Reports underlies all of performance measurement methodology and it is worth formalizing explicitly to make it easier to understand and apply to other systems. Refer to the diagram above, in the Recap section, for a graphical view of these concepts.
The act of Measurement is the quantification of attributes of an object or event. Measurements are obtained by assigning numbers or other symbols to observed phenomena following a consistent set of rules. In keeping up with the rigor we aspire to in the Impractical Engineer series, we will use the definitions:
Signal: An observable phenomenon which can be continuous (e.g. the voltage of a battery) or discrete (e.g. arrivals of jobs to a system for processing). Clearly, signals are constrained by the observation technology available and the nature of the observed phenomenon. This obvious statement is important when designing observability tools and for understanding the limitations of the observations. E.g. if a voltmeter is only able to take measurements every second, we'll never be able to properly measure oscillations in the megahertz range, and Measurements and Metrics built on these signals need to account for these limitations to avoid misrepresenting their results (e.g. a sinusoidal MHz signal would show basically as zero in this example). More practical examples are that we won't be able to detect traffic surges to a website if we only count requests once an hour, and similar situations.
Measurement: The act of assigning a number with its unit of measure to a Signal at a particular moment in time. E.g. Using a voltmeter to read the number of Volts in the voltage signal above or counting the number of jobs in the waiting queue in the system. Assigning a single measurement to a signal is necessarily susceptible to inaccuracies and noise from the source of the signal itself, the measurement instrument, environment, etc. and it is important to consider the nature of this noise when designing the metrics that will use the measurement in order to minimize the effect of the expected noise.
Metric: A Calculation based on one or multiple measurements that provides a value (a number or other symbol) that informs about the operation of the system. These calculations can vary from a simple assignment of a color to certain values (Green-Yellow-Red) to sophisticated descriptive statistics of multiple measurements like means, percentiles, standard deviations, etc.
Metric Presentation & Evolution (Report): The presentation of how a metric changes over time to the end users that need to exercise judgments and actions based on the values of the indicator metrics.
Signals, Measurements and Metrics take place within the passage of time and, although theoretically some could be considered instantaneous or continuous, in any practical implementation they all take a duration or can happen only at particular moments in time. From the discussion above we need to identify the following time periods:
Measurement Period or Sampling Period: The time between two consecutive measurements of the same signal. This needs to be short enough to capture the details we want from the underlying signal. A period of at most half the duration of the fastest expected changes is a widely accepted value, based on the Nyquist sampling theorem.
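A small illustration of what happens when the sampling period is too long. The frequencies are chosen arbitrarily for the example and the function name is hypothetical: a 3 Hz sine sampled at only 4 Hz produces exactly the same samples as an inverted 1 Hz sine, so the measurements cannot tell the two signals apart.

```python
import math


def sample_sine(frequency_hz, sampling_rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at t = k / sampling_rate for k = 0 .. n_samples - 1."""
    return [
        math.sin(2 * math.pi * frequency_hz * k / sampling_rate_hz)
        for k in range(n_samples)
    ]


# A 3 Hz signal needs more than 6 samples per second; 4 Hz is too slow.
fast_signal = sample_sine(frequency_hz=3, sampling_rate_hz=4, n_samples=8)
slow_signal = sample_sine(frequency_hz=1, sampling_rate_hz=4, n_samples=8)

# The under-sampled 3 Hz signal is indistinguishable from an inverted 1 Hz signal.
aliased = all(abs(f + s) < 1e-9 for f, s in zip(fast_signal, slow_signal))
print(f"3 Hz sampled at 4 Hz looks like an inverted 1 Hz signal: {aliased}")
```

The same effect applies to operational signals: a traffic pattern that repeats faster than twice the sampling period will show up in the measurements as a slower pattern that does not exist.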
Calculation Period: The time (or alternatively the number of consecutive measurements) that will be considered in the calculation of a metric value. The length of the calculation period is obviously bounded by the Measurement Period itself on the lower side and by the loss of resolution on the upper side. It needs to be long enough to reduce the expected noise in the measurements to an acceptable level. The distribution of the Sample Mean is a good guide to decide how big this period should be.
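As a rough guide, the standard error of the sample mean, σ/√n, can be used to size the calculation period. The sketch assumes independent measurements with a known standard deviation and a normal approximation (z = 1.96 for roughly 95% confidence); the numbers and function names are illustrative.

```python
import math


def standard_error(sigma, n):
    """Standard deviation of the mean of n independent measurements."""
    return sigma / math.sqrt(n)


def samples_needed(sigma, margin, z=1.96):
    """Measurements needed so the mean is within +/- margin at the confidence implied by z."""
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical case: lead times with a 20 ms standard deviation, mean wanted within 5 ms.
n_required = samples_needed(sigma=20.0, margin=5.0)
print(f"measurements per calculation period: {n_required}")
print(f"standard error with 100 measurements: {standard_error(20.0, 100)} ms")
```

Multiplying the required number of measurements by the sampling period gives a lower bound for the length of the calculation period. Correlated measurements, which are common in queueing systems, need a longer period than this estimate suggests.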
Metric Interval: The period between the calculation of two consecutive metrics. Although in many cases this is the same as the Calculation Period, some Metrics require different Metric Intervals. A well-known example is the annual monetary inflation rate, which is reported every month (Metric Interval) but computed over the trailing 12 months (Calculation Period).
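The difference between the two periods can be sketched with a trailing mean, where the Calculation Period is the window length and the Metric Interval is the step between consecutive metric values. Both are expressed here as a number of measurements; the data and the function name are invented for the example.

```python
def trailing_metric(measurements, calculation_period, metric_interval=1):
    """Trailing mean over `calculation_period` measurements, emitted every `metric_interval`.

    Returns (index of last measurement in the window, metric value) pairs.
    """
    values = []
    for end in range(calculation_period, len(measurements) + 1, metric_interval):
        window = measurements[end - calculation_period:end]
        values.append((end - 1, sum(window) / calculation_period))
    return values


# Hypothetical measurements, one per sampling period.
measurements = [1, 2, 3, 4, 5, 6]

# Same calculation period, two different metric intervals.
every_period = trailing_metric(measurements, calculation_period=3, metric_interval=1)
every_third = trailing_metric(measurements, calculation_period=3, metric_interval=3)
print(every_period)
print(every_third)
```

With a metric interval shorter than the calculation period, consecutive metric values share most of their measurements, which smooths the series but also makes neighboring points strongly correlated.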
Reporting Period: The length of time for which multiple metric values are displayed together for evaluation. This is the length of the (x) axis in most graphs that show metrics over time.
With these concepts in hand, it is pretty much trivial to define a Measurement Process in simple steps that summarize the work presented in this article:
Copyright: © Salduba Technologies, All rights reserved. License: This work is licensed under the Creative Commons License CC BY-NC-SA 4.0