EVM Meets Big Data?
Glen Alleman MSSM
Applying Systems Engineering Principles, Processes & Practices to Increase Probability of Program Success for Complex System of Systems, in Aerospace & Defense, Enterprise IT, and Process and Safety Industries
Abstract
We have lots of data. How can we use it to make predictions and prescriptive forecasts of future performance to increase the Probability of Program Success?
The Earned Value Management System (EVMS) maintains period-by-period data in its underlying databases. The contents of the Earned Value repository can be considered BIG DATA, characterized by three attributes: 1) Volume – large amounts of data; 2) Variety – data comes from different sources, including traditional databases, documents, and complex records; 3) Velocity – the content is continually being updated by absorbing other data collections, previously archived data, and data streamed from external sources. [1]
With this time series information in the repository, analyses of trends, cost and schedule forecasts, and the confidence levels of these performance estimates can be produced using statistical techniques such as the Autoregressive Integrated Moving Average (ARIMA) algorithm available in statistical programming environments such as R. ARIMA provides a statistically informed Estimate At Completion (EAC) and Estimate To Complete (ETC) for the program in ways not available using standard EVM calculations. Using ARIMA reveals underlying trends not visible through standard EVM reporting calculations. With ARIMA in place, and with additional data from risk, technical performance, and the Work Breakdown Structure, Principal Component Analysis can be used to identify the drivers of unanticipated EAC growth.
Introduction
This is not a post about Earned Value Management per se. It is not about the details of statistical forecasting using tools. It is not about the underlying mathematics of how these forecasts are made or the mathematics of how the “principal components” that drive the Estimate at Completion are applied.
This paper is about how to overcome the limitations of the traditional methods to identify and predict problems and Estimates at Completion. The literature is replete with analyses of the nature of cost growth and of how our "standard practices" of using EVM data fail to enlighten stakeholders so they can take corrective action [RAND12], [RAND11], [RAND13], [IDA11], [IDA11a], and [IDA10]. Some of these reasons include:
The Earned Value data is not based on the statistical behavior of the underlying work being performed.
With this background, we present two possible solutions that use existing data repositories to increase confidence in the EAC through statistical processes. These processes include applying ARIMA to the past Earned Value Management data submitted monthly, and Principal Component Analysis (PCA) to identify the drivers of unanticipated growth in the EAC.
The ARIMA process has been applied in multiple settings for some time – it is not new. What is needed is to apply this method to the existing data in the repository to produce a credible EAC. The PCA effort will require additional data not currently in most repositories, including technical performance measures, risk reduction, and assessments of the other "…ilities" called out in the Systems Engineering Management Plan (SEMP) of any large, complex, high-risk program.
Data Analytics
There are three types of data analytics that can be applied to data held in an Earned Value Management repository.
Descriptive – looking at the past, we can learn what happened, but it is too late to take corrective action.
Predictive – using past performance, we can answer the question of what will happen if we continue to do what we have done in the past.
Prescriptive – past performance data is used to make predictions and to suggest decision options that take advantage of those predictions.
It is the prescriptive analytics we are after. Prescriptive analytics not only anticipates what will happen and when it will happen, but why it will happen. The current descriptive analytics of the Earned Value Management reporting process does not point to possible sources of problems, other than by manually looking through each reported data element. This is a major complaint of Program Managers:
Nice EV data report you got there, but you did not tell me what to do about the problems it shows.
Earned Value Management in a Nutshell
Government programs larger than $50M use an Earned Value Management System defined in EIA–748–D. For Department of Defense, Department of Energy, and NASA programs, a monthly submittal (DI–MGMT–81865 or NASA 533M) is required, containing information about the Earned Value performance of the program. This information includes the Budgeted Cost of Work Scheduled (PV), the Budgeted Cost of Work Performed (EV), and the Actual Cost of Work Performed (AC). With these three data items, a Cost Performance Index (CPI) and a Schedule Performance Index (SPI) can be computed using familiar formulas:
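CPI = EV / AC    Eq. (1)

SPI = EV / PV    Eq. (2)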
These formulas represent the performance of the program in units of dollars. They describe the efficacy of the money budgeted (PV), the money spent (AC), and the monetary value of the work performed (EV). The CPI and SPI can then be used to forecast the Estimate At Completion (EAC) for the remaining work in the program using:
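EAC = AC + (BAC − EV) / (CPI × SPI)    Eq. (3)

where BAC is the Budget At Completion and the product CPI × SPI serves as a composite performance factor for the remaining work.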
This Eq. (3) is one of several available for calculating EAC from the Earned Value Data.
Drivers of Program Variance
We all know programs operate with wide ranges of variance. From tabloid reports to the RAND and IDA Nunn–McCurdy breach Root Cause Analyses, the drivers of these variances are well known at the macro level shown in Figure 1.
The First Step Toward Better Estimate At Completion
Using the data in a repository, the first step is to treat each monthly CPI and SPI report for each Work Breakdown Structure element (Format 1 of the IPMR) as a data sample in a time series. This time series of past performance is the source of the ARIMA forecast of EAC. With the time series and ARIMA, the analyst can produce "forecasts" of the possible values of the EAC based on the past statistical behavior of CPI and SPI at the lowest reporting levels – the lowest submitted SPI and CPI per WBS element. Many times these WBS elements map to the physical components being delivered by the program, so insight into forecasts for Configuration Items or End Item Deliverables is available from the current repository.
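To make this step concrete, here is a minimal Python sketch of turning a flat monthly repository export into per-WBS CPI and SPI time series. The file name and column names (wbs_id, period, bcws, bcwp, acwp) are assumptions for illustration, not an actual repository schema:

```python
import pandas as pd

# Assumed flat export from the EVM repository: one row per WBS element per reporting period.
# Column names (wbs_id, period, bcws, bcwp, acwp) are illustrative only.
ev = pd.read_csv("evm_repository_extract.csv", parse_dates=["period"])

# Period-by-period indices for every WBS element.
ev["cpi"] = ev["bcwp"] / ev["acwp"]   # CPI = EV / AC
ev["spi"] = ev["bcwp"] / ev["bcws"]   # SPI = EV / PV

# Pivot into time series: one row per reporting period, one column per WBS element.
cpi_series = ev.pivot(index="period", columns="wbs_id", values="cpi").sort_index()
spi_series = ev.pivot(index="period", columns="wbs_id", values="spi").sort_index()

# Each column is now a monthly time series suitable for ARIMA forecasting.
print(cpi_series.tail())
```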
In this section, we'll describe how ARIMA works and how it can be applied to the data in the repository to produce an EAC. Using ARIMA will allow us to answer the question: "What will this thing cost us when we are done?" And ARIMA will do this to a known confidence level that is statistically sound. We must remind ourselves that the current formulas for calculating EAC use linear, non-statistical, non-risk-adjusted arithmetic on the cumulative values of SPI and CPI, values that have had their past variances "wiped out." So no visibility into the past statistical excursions is available when computing one of the most critical numbers on any program.
Autoregressive Integrated Moving Average (ARIMA) in a Nutshell
Autoregressive Integrated Moving Average (ARIMA) models are a class of forecasting models that use time series data to better understand the behavior of that data and to predict future data points in the series. ARIMA models are also referred to as Box–Jenkins models [BOX70]. Let us look at how ARIMA works and confirm it can be used to forecast the Estimate at Completion using the repository data from the elements of the program, including CPI and SPI at the lowest Work Breakdown Structure level available.
ARIMA models have three parts:
AR – the Autoregressive part, in which the variable is regressed on its own prior values (the order of this part is denoted by p).
I – the Integrated part, the degree of differencing applied to make the series stationary (denoted by d).
MA – the Moving Average part, in which the forecast error is modeled as a linear combination of past error terms (the order of this part is denoted by q).
The Autoregressive part of ARIMA rests on the observation that the values in the time series are not independent – they are autocorrelated. Autocorrelation refers to the correlation of a time series with its own past and future values; it is sometimes called "lagged correlation" or "serial correlation", the correlation between members of a series of numbers arranged in time. Positive autocorrelation can be considered a specific form of "persistence", a tendency for a system to remain in the same state from one observation to the next. Autocorrelation can be exploited for prediction: an autocorrelated time series is predictable, probabilistically, because future values depend on current and past values. The differencing (Integrated) step then removes the trend and seasonal components, so what remains has a zero mean (after normalization) and no systematic structure left other than noise. The result is that only the irregular components of the time series are modeled, not the seasonal or trend components.
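Before fitting anything, it is worth checking that a CPI or SPI series actually carries exploitable autocorrelation. A minimal sketch using statsmodels, continuing the data frame from the earlier example (the WBS number is hypothetical):

```python
from statsmodels.tsa.stattools import acf

# Monthly CPI series for one (hypothetical) WBS element from the pivot above.
y = cpi_series["1.2.3"].dropna()

# Lagged (serial) correlation at lags 0..12; values well away from zero indicate the
# persistence that the AR and MA terms of an ARIMA model can exploit.
for lag, r in enumerate(acf(y, nlags=12)):
    print(f"lag {lag:2d}: {r:+.2f}")
```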
To summarize the ARIMA parameters in preparation for the use of ARIMA(p,d,q):
p – the number of autoregressive terms,
d – the number of non-seasonal differences needed to make the series stationary, and
q – the number of lagged forecast errors (moving average terms) in the prediction equation.
The model used in forecasting EAC starts with ARIMA(0,1,1) – simple exponential smoothing:

Ŷ(t) = Y(t−1) − θ₁e(t−1)    Eq. (4)

where e(t−1) denotes the forecast error at period (t−1) and θ₁ is the moving-average coefficient. We can add a constant μ to the ARIMA(0,1,1) forecast with:

Ŷ(t) = μ + Y(t−1) − θ₁e(t−1)    Eq. (5)
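As a sketch of what Eq. (4) and Eq. (5) look like in practice, the same series can be fit with a fixed ARIMA(0,1,1) model, without and with the constant drift term. statsmodels is used here as one convenient implementation; any ARIMA library would serve:

```python
from statsmodels.tsa.arima.model import ARIMA

# ARIMA(0,1,1) with no constant: simple exponential smoothing, Eq. (4).
ses = ARIMA(y, order=(0, 1, 1), trend="n").fit()

# ARIMA(0,1,1) with a drift term, the constant mu of Eq. (5).
ses_drift = ARIMA(y, order=(0, 1, 1), trend="t").fit()

# One-step-ahead forecast of the next period's CPI from each model.
print(ses.forecast(steps=1))
print(ses_drift.forecast(steps=1))
```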
ARIMA Used to Forecast EAC In This Post
With the ARIMA algorithm and the time series of Earned Value data from a repository, we can construct forecasts of the EAC based on the statistical data from each period of performance, rather than on the cumulative data and the current period of performance as reported in the IPMR. Our first serious problem is how to select the ARIMA parameters. It is beyond this short post to delve into that problem, so we will take a shortcut: use R's auto.arima function and have the tool figure out which model is best for our time series.
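For analysts working in Python rather than R, the auto.arima shortcut can be approximated with a small AIC grid search. This is only a sketch of the idea; R's auto.arima does considerably more, including seasonal terms and unit-root tests:

```python
import itertools
import warnings
from statsmodels.tsa.arima.model import ARIMA

def select_arima_order(series, max_p=3, max_d=2, max_q=3):
    """Pick the (p, d, q) order with the lowest AIC - a rough stand-in for R's auto.arima."""
    best_order, best_aic = None, float("inf")
    for p, d, q in itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                aic = ARIMA(series, order=(p, d, q)).fit().aic
        except Exception:
            continue                              # skip orders that fail to estimate
        if aic < best_aic:
            best_order, best_aic = (p, d, q), aic
    return best_order, best_aic

# y is a single WBS element's CPI (or SPI) series, as in the earlier sketches.
order, aic = select_arima_order(y)
cpi_forecast = ARIMA(y, order=order).fit().forecast(steps=6)   # next six reporting periods
print(order, aic)
print(cpi_forecast)
```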
We will skip all the background and theory and go straight to the outcomes.
What Can We Do With This Information?
Now that we have a mechanism for using the repository Earned Value Management data to forecast the future performance of that data, what can we do with it? First, we need to establish the data architecture for the repository contents. This starts with normalizing the Work Breakdown Structure topology. One approach is to use MIL–STD–881, whose appendices contain "notional" structures for the WBS of various products. These appendices are not all-inclusive, and in some cases they are not all that helpful, but they can be a notional start toward a standardized approach to capturing and recording data.
The beneficial outcomes of applying Time Series Forecasting to the data from the Earned Value repository include:
The Second Step Toward Better Estimate At Completion
With a forecast of EAC based on past performance of CPI and SPI using time series analysis and ARIMA, we can ask about improving that forecast by adding other measures of performance that should already be applied to programs through the Systems Engineering processes and described in the Systems Engineering Plan (SEP) and the Systems Engineering Management Plan (SEMP). The contents of these two documents are shown in Table 2.
From this systems engineering guidance, we can extract other measures of future behavior for the program:
These data elements are arranged in a matrix in preparation for the next step, Principal Component Analysis. Each column contains the data for one reporting period – for example, the (1) WBS element, along with the (2) CPI, (3) SPI, (4) TPM, (5) Risk, (6) MOE, (7) MOP, (8) KPP, and (9) Staffing values reported for that period – and each row holds one of these labeled data elements.
This general data structure is then used to find the Principal Components that are the primary variance generators.
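A minimal Python sketch of assembling and standardizing that matrix follows, arranged with one row per WBS element per reporting period and one column per measure, which is the layout most PCA libraries expect. The file name and the measure columns beyond CPI and SPI (tpm, risk, moe, mop, kpp, staffing) are assumptions about what an augmented repository would hold:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed augmented repository extract: one row per WBS element per reporting period.
# Columns beyond cpi/spi (tpm, risk, moe, mop, kpp, staffing) are illustrative; most
# repositories do not carry them today.
measures = ["cpi", "spi", "tpm", "risk", "moe", "mop", "kpp", "staffing"]
data = pd.read_csv("augmented_repository_extract.csv", parse_dates=["period"])

# Observation matrix: rows are (WBS element, period) observations, columns are measures.
X = data.set_index(["wbs_id", "period"])[measures].dropna()

# Standardize each measure to zero mean and unit variance so that no single unit scale
# (dollars, percent, head count) dominates the principal components.
X_std = StandardScaler().fit_transform(X)
```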
Principal Component Analysis in a Nutshell
If x is a random vector of dimension p with finite p × p variance-covariance matrix V[x] = Σ, then Principal Component Analysis finds the directions of greatest variance of the linear combinations of the x's. In other words, it seeks the orthonormal set of coefficient vectors a1, …, ak such that,
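ak = arg max Var(a′x), subject to ‖a‖ = 1 and a orthogonal to a1, …, ak−1.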
The kth such linear combination, ak′x, is referred to as the kth principal component [JOLL02], [EVER11]. This mathematical description of Principal Component Analysis is completely unactionable for a Program Manager, so the actionable description is simple:
Tell me which of the variables in my program, represented by the time series of their respective values, is driving the variances I see in the selected variables – the SPI and CPI – so I can go find a corrective action to keep my program GREEN.
The Third Step Toward Better Estimate At Completion
With improved forecasting tools, the third step is to make visible the connections between each measure to reveal the drivers of EAC. This means identifying the connections between the technical performance, risk, and other variables on the program, including the core Earned Value data.
This approach provides the Program Manager with insight into the dynamic behavior of the program in ways not available from Descriptive Analytics. The standard EAC calculations using CPI and SPI state only the estimated cost value. They do not reveal what is driving that cost or what the contribution of each driver is to the total.
When we add more variables to the collection of program performance data, we create a new problem: we need to identify which of these data items are the principal ones that can provide actionable information to the Program Manager. We will apply Principal Component Analysis (PCA) to identify patterns in the data and to express the data in a way that highlights similarities and differences.
Using Principal Component Analysis To Discover Drivers of Probabilistic EAC
Principal Component Analysis (PCA) decomposes a number of correlated variables within a dataset into a number of uncorrelated Principal Components. The result is a reduction of the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set [JOLL02]. This is achieved by transforming to a new set of variables, the Principal Components, which are uncorrelated and which are ordered so that the first few retain most of the variation present in all of the original variables.
The extracted Principal Components are estimated as the projections on the eigenvectors of the covariance or correlation matrix of the dataset. The variance of a dataset is an indicator of how spread out the data is; the larger the variance, the more information it carries. In practice, if 80–90% of the total variance in a multivariate dataset can be accounted for by the first few PCs, corresponding to the largest eigenvalues of the covariance matrix, then the remaining components can be rejected without much loss of information [SAMA99].
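Continuing the earlier sketch, the decomposition itself takes a few lines with scikit-learn. The explained-variance ratios tell us how many components to keep, and the loadings tell us which program measures dominate them:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Fit PCA on the standardized (observation x measure) matrix built earlier.
pca = PCA()
pca.fit(X_std)

# Share of total variance carried by each principal component.
explained = pd.Series(pca.explained_variance_ratio_,
                      index=[f"PC{i + 1}" for i in range(len(measures))])
print(explained.cumsum())             # keep the first few PCs covering ~80-90% of variance

# Loadings: the weight of each measure (CPI, SPI, TPM, risk, ...) in each component.
loadings = pd.DataFrame(pca.components_.T, index=measures, columns=explained.index)
print(loadings.iloc[:, :3].round(2))  # measures with the largest weights are the drivers
```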
Call to Action
After this very quick overview of the problem and of two proposed solutions for increasing visibility into unanticipated growth in the EAC, here are actionable steps known to address the issue:
Normalize Data In the Repository
For a "Big Data" repository to function properly, the data needs to be normalized. This means the periods of performance are consistent, the data scales are consistent, the primary data keys – the WBS numbers – are consistent, and the values of the data have defined ranges consistent with the elements they represent. Use the WBS structure from MIL–STD–881 as a start; no matter what the final structure turns out to be, certain attributes are needed. The WBS must be well formed, that is, it must possess transitive closure as a minimal attribute, so that navigation of the tree structure is consistent across all programs held in the repository.
Apply ARIMA
With a consistent set of data, known data elements with normalized values, no missing data – or, if data is missing, it is identified as missing – and a well-structured decomposition of the WBS, the time series analysis can take place.
This time series analysis can be one of many choices; a simple ARIMA(p,d,q) is a start, Holt–Winters is another popular approach, and others are available. Research will be needed to determine the appropriate approach. This can start with a long time series of data: apply the forecasting algorithm to forecast an intermediate value, then confirm that the forecast value matches the actual values. With the variances from that result, adjust the parameters to improve the forecasting ability.
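A minimal sketch of that check: hold out the last few reporting periods, forecast them with both ARIMA and Holt–Winters, and compare against the actuals. Both models come from statsmodels; the six-period holdout and the chosen orders are arbitrary illustrative choices:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

h = 6                                   # arbitrary holdout: the last six reporting periods
train, actual = y[:-h], y[-h:]          # y: one WBS element's CPI series, as before

# Candidate forecasters, fit on the truncated history only.
arima_fc = ARIMA(train, order=(0, 1, 1)).fit().forecast(steps=h)
hw_fc = ExponentialSmoothing(train, trend="add").fit().forecast(h)

# Mean absolute error of each forecast against what actually happened.
print("ARIMA(0,1,1) MAE:", np.mean(np.abs(arima_fc.values - actual.values)))
print("Holt-Winters MAE:", np.mean(np.abs(hw_fc.values - actual.values)))
```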
Adjust ARIMA
With actual programs, tested for intermediate forecasting, and comparisons of actual data to forecast data, sensitivity analysis of the tuning parameters can be observed. This is the basis of all closed-loop control tuning. Adaptive tuning, dynamic characterization, and feed-forward tuning are possibilities for early detection of unanticipated growth of the EAC [KEAT11].
Integrate External Data
Earned Value data alone provides a Descriptive assessment of past performance – assuming period-by-period rather than cumulative data is used. With the period of performance data, ARIMA can be used to forecast future values of CPI and SPI.
But our research has shown – and it is spoken about in a parallel presentation by the same authors – that connecting Technical Performance Measures with the assessment of Physical Percent Complete is a critical factor in creating a credible BCWP.
This external data starts with the contents of the Systems Engineering Management Plan (SEMP). This is where Measures of Effectiveness, Measures of Performance, Technical Performance Measures, and Key Performance Parameters are identified.
§ Measures of Effectiveness – operational measures of success, closely related to the achievement of the mission or operational objectives, evaluated in the operational environment under a specific set of conditions.
Apply Principal Components Analysis
With the baseline data from the Earned Value reporting process, augmented with the data in Figure 4, Principal Components can be found that drive the program's variance. This variance is the source of increases in the Estimate At Completion. This alone is not sufficient to provide the Program Manager with actionable information.
Connecting these variances with each Work Breakdown Structure element is the source of this actionable information. By connecting variance with WBS elements, the source of variance can be revealed. Then corrective actions can be taken.
Challenges To Our Call to Action
We have presented a possible solution, making use of existing data repositories, to increase confidence in the EAC using statistical processes. These processes include applying ARIMA to the past Earned Value Management data submitted on a monthly basis, and PCA to identify the drivers of unanticipated growth in the EAC. These techniques provide better descriptive and prescriptive qualities for forecasting the EAC.
For these benefits to be realized, the Performance Assessment community must make progress on the following:
§ Make the Integrated Master Plan (IMP) mandatory to collect MOE, MOP, KPP, and TPM – without the IMP there is no traceability from these measures to the work performed in the Integrated Master Schedule (IMS).
References
[1] An unpublished paper submitted in support of Big Data Meets Earned Value Management, Glen B. Alleman and Thomas J. Coonce, International Cost Estimating & Analysis Association (ICEAA), 2014 Professional Development & Training Workshop, June 10–13, 2014.
[EVER11] An Introduction to Applied Multivariate Analysis with R (Use R!), Brian Everitt and Torsten Hothorn, Springer, 2011.
[JOLL02] Principal Component Analysis, Second Edition, I. T. Jolliffe, Springer–Verlag, 2002
[KEAT11] “Using Earned Value Data to Detect Potential Problems in Acquisition Contracts,” C. Grant Keaton, Second Lieutenant, USAF, Air Force Institute of Technology, March 2011.
[BOX70] Time Series Analysis: Forecasting and Control, G. Box and G. Jenkins, Holden–Day, 1970.
[RAND13] Management Perspectives Pertaining to Root Cause Analyses of Nunn–McCurdy Breaches, Volume 4 Program Manager Tenure, Oversight of Acquisition Category II Programs, and Framing Assumptions, RAND Corporation, 2013
[RAND12] Root Cause Analyses of Nunn–McCurdy Breaches, Volume 2 Excalibur Artillery Projectile and the Navy Enterprise Resource Planning Program, with an Approach to Analyzing Complexity and Risk, RAND Corporation, 2012.
[RAND11] “Root Cause Analyses of Nunn–McCurdy Breaches, Volume 1 Zumwalt–Class Destroyer, Joint Strike Fighter, Longbow Apache, and Wideband Global Satellite,” RAND Corporation, 2011.
[IDA11] Expeditionary Combat Support System: Root Cause Analysis, Institute for Defense Analyses, 2011.
[IDA11a] Global Hawk: Root Cause Analysis of Projected Unit Cost Growth, Institute for Defense Analyses, 2011.
[IDA10] "Root Cause Analysis for the ATIRCM/CMWS Program," Institute for Defense Analyses, 2010.
[SAMA99] "Experimental comparison of data transformation procedures for analysis of principal components," M. Sámal, M. Kárny, H. Benali, W. Backfrieder, A. Todd–Pokropek, and H. Bergmann, Physics in Medicine and Biology, 44(11), 2821–2834, 1999.
Reader comment: I just came across your article Compendium of Resources for Managing Complex Systems, which was absolutely brilliant. WOW. I am a geologist by training, so I am used to dealing with highly complex systems 3 miles down and 60 million years ago, with very few data points, to find hydrocarbon reserves. And yet we do it. I loved this article as it highlighted and articulated some things I only had a felt sense for. Glad to be part of this Herding Cats newsletter. I don't have a hope in hell of getting the math, but if the abbreviations were spelled out instead of assumed, then I think I could follow along. These days I work in Technology Stewardship with early-stage companies and in designing future low-carbon energy systems.
Reader comment: There are several challenges with existing data repositories. First, data quality has been an issue; this is something the parametric folks, such as Goliath, confront all the time. A second problem is context. Much of the variability comes from missing context; context, such as In Domain or Super Domain (in DoD at least), needs to be segmented. A third is inconsistency in what is measured – for example, FP vs. LOC. Giving data scientists cross-sectional data has some risks; the situation is different with longitudinal data within a specific program.
A common challenge we are seeing is on the tracking side. As teams use agile estimation, joining team-level estimates with the top-down estimate has been a problem for a number of organizations. Stipulating that we are using Effort Days as the common currency, this is manageable, but it has built-in assumptions that are not always satisfied; these need to be made explicit. There is also the human-in-the-loop data problem. We are looking at opportunities to automate some of the data collection: https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=890538 Automation requires an explicit definition of the measurement model. Any critiques welcome.
Reader comment: Outstanding paper! Congratulations! Just a small correction inside the Introduction: the bulleted text about BCWP should use EV instead of PV – the assessment of the Budgeted Cost of Work Performed (EV) is not adjusted for the technical performance of the resulting work. I have also downloaded the PowerPoint presentation. Slide #14 is a great warning and should be very well understood by all readers claiming EVM knowledge, while in slide #20 the JROC/JCIDS concepts ought to be explained, as they are somewhat domain-restricted and tightly coupled to the DAS.
Author's reply: Anne Marie Gignac, PMP, CSM – there are several other newsletter posts on program planning and controls along the same topic, "increasing the probability of program success."