Operationalising Data Science #1 of 3 - Technical delivery workflows

Introduction

Evidence-based management aims to use data to develop deeper conceptual understanding and more fully inform business decision-making.

The volume of available digital data has been growing at exponential rates since at least 2013. This explosion in the availability of digital data is predicted to accelerate over the coming years with the increasing maturity of the internet of things, and big data is expected to revolutionise business operations by providing businesses with multiple opportunities to seize a competitive edge.

The potential which lies in the expanding oceans of digital data is, however, difficult to realise. The proliferation of digital data has created the need for hardware, software and human capital which can leverage business critical aspects of the data and filter out the noise.

The explosion of digital data has been coincident with the emergence of three key factors which have enabled the exploitation of digital data for business benefit. Firstly, the emergence of cloud-based solutions has removed the traditional limiting factors associated with hardware capex by providing easy access to affordable compute power and data storage capability. Secondly, software tooling capabilities have improved substantially over recent years, primarily as a result of the industry adoption of open-source software. Finally, data science has emerged as the knowledge-based discipline upon which the realisation of the benefits of digital data now depends.

The operationalisation of data science is, however, proving to be problematic across the industry and is inhibiting the realisation of the anticipated business value from digital data. Evidence demonstrates that over two-thirds of the software development industry is grappling with how best to integrate data science with existing capabilities in order to consistently and rapidly achieve business advantage by leveraging data assets.

This series of three articles examines the operationalisation of data science solutions by (1) investigating the existing and emergent technical delivery workflow paradigms (this article), (2) outlining paradigm integration challenges and opportunities (article 2) and (3) providing practical recommendations for operationalisation of data science solutions (article 3).

Existing and Emergent Delivery Workflows

Digital data is a software asset which data scientists manipulate, using software, to derive descriptive, predictive or prescriptive insights from structured and unstructured data. It is critical to understand the nature of traditional software and the ways in which it is currently delivered in order to understand the differences with data science solutions and the associated emergent software delivery workflows.

The Standard Technology Value Stream

A technology value stream is defined as the process required to convert a business hypothesis into a technology enabled service that delivers value to the customer (Kim, Humble, et al. 2016). A standard technology value stream has emerged over several decades, the fundamental principles of which have been adopted throughout the software industry.

The standard technology value stream (Fig. 1) has evolved through the systematic application of software engineering principles to the development of software applications.


Fig. 1 – Standard technology value stream (Kim, Humble, et al. 2016)



The software applications traditionally developed within the standard technology value stream are explicit by nature with well-defined boundaries and controllable dependencies.

Over the past two decades there have been a number of key influences on the evolution of the standard technology value stream.

The agile manifesto, published in 2001, led to a cultural shift in how software was developed (Beck et al. 2001). The manifesto, now widely adopted by software development teams across the industry, advocates a lean mindset, embraces uncertainty, encourages close collaboration among all stakeholders and promotes the incremental delivery of valuable software to the customer. Despite the celebrated cultural shift towards agile frameworks such as scrum, successful ‘software development’ did not always result in an acceptable outcome for the customer.

Early agile implementations focussed on activities associated only with the development of software applications and did not adequately consider other key aspects of the software delivery lifecycle such as software testing and operations, often resulting in customer dissatisfaction due to poor software quality and delays in outcome delivery.

The industry addressed the need to improve software delivery quality and efficiency by introducing lean principles. Today these lean principles are applied within the software industry as a suite of practices known as DevOps (Development Operations). Implemented and proven by industry leaders such as Netflix and Amazon, DevOps was initially established in 2009 to bridge the gap between software development activities and the continuous, efficient delivery of high-quality software to production environments, where the customer benefit is ultimately realised (Allspaw and Hammond 2009).

The standard technology value stream has therefore evolved from the confluence of traditional software engineering principles, agile frameworks and DevOps principles (Fig. 2). The process of integration of agile and DevOps practices into the standard technology value stream was eased by the fact that the capabilities required to achieve many of these integrations already existed within software development teams.


Fig. 2 – The confluence of factors in the standard technology value stream


The Software Engineering Workflow

The low-level details behind the implementation of an organisation’s technology value stream, i.e. the specific software engineering workflows implemented to realise the value stream, are typically unique to each organisation. Design dependencies include factors such as the industry domain (e.g. FinTech, MedTech etc.) and the capabilities of the organisation’s internal human capital.

Successful decomposition of the technology value stream into software engineering workflows, and their subsequent implementation within an organisation, is expensive and involves a prolonged period of learning-by-doing. Establishing effective software engineering workflows is a significant challenge within the industry, and only 32% of software development organisations are effective in this regard.

An example of a widely applied software engineering workflow is the deployment pipeline as outlined by Humble and Farley. This deployment pipeline describes the detail behind aspects of the technology value stream and illustrates a number of key principles of modern software engineering. Key characteristics of the deployment pipeline include well-defined stages (e.g. build & unit test, release etc.), the criticality of event-driven automation (i.e. triggers) and the concept of continuous feedback and iteration (Fig. 3).


Fig. 3 – The deployment pipeline software engineering workflow (Humble and Farley 2010)

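The gated, event-driven character of such a pipeline can be sketched in a few lines of Python. This is a purely illustrative stand-in, not a real build system: the stage names follow the deployment-pipeline pattern, but the stage functions and the `change` dictionary are invented for the example.

```python
# Illustrative sketch of a deployment pipeline: each stage runs only if the
# previous one succeeds, and any failure feeds back to the team immediately.
# The checks themselves are invented stand-ins for real build/test/release steps.

def build_and_unit_test(change):
    return "syntax_error" not in change       # stand-in for compile + unit tests

def acceptance_test(change):
    return change.get("behaviour") == "expected"   # stand-in for automated UAT

def release(change):
    return True                               # stand-in for the deploy step

PIPELINE = [build_and_unit_test, acceptance_test, release]

def run_pipeline(change):
    """Run each stage in order; stop and report on the first failure."""
    for stage in PIPELINE:
        if not stage(change):
            return f"failed at {stage.__name__}"   # feedback loop to the team
    return "released"

print(run_pipeline({"behaviour": "expected"}))   # all stages pass -> 'released'
```

The key property illustrated is that a change never skips a stage: promotion to the next stage is triggered only by success in the previous one, and failure produces immediate, specific feedback.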


Once successfully established, software engineering workflows become organisational assets which provide a platform for achieving sustained competitive advantage within the market due to their value, rarity, inimitability and non-substitutability. The ability to establish, maintain and evolve fit-for-purpose software engineering workflows has become a critical factor in differentiating software development organisations.

Neither the standard technology value stream nor the software engineering workflow has, to date, been widely adopted in the delivery of data science solutions. Data science workflows have instead emerged independently to facilitate the delivery of data science solutions.

The Data Science Workflow

While there are numerous workflows associated with the delivery of data science solutions, the machine learning workflow is commonly referenced and, due to its complex characteristics, will be used here as a representative data science workflow.

Machine Learning

Machine learning (ML), an artificial intelligence technique, leverages sample data to learn patterns and make predictions upon which business decisions can be based. ML practitioners use software programming languages to implement complex algorithms (also known as models) which can interrogate data sets. By analysing historical data (which can be structured and/or unstructured), these algorithms learn from patterns in the data and use these learnings to make future predictions.

Supervised learning, one category of ML, uses a training data set to learn about the relationship between input and output data. Once the relationships from the training data set are understood, the algorithm can then use this information to predict the output for completely new input data. The effectiveness of the ML algorithm is then validated with new data sets (known as the validation and test sets) to ensure that the algorithm performs within acceptable boundaries.
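The train-then-validate idea can be sketched in a few lines of Python. Everything below is illustrative: the 1-nearest-neighbour rule is a deliberately simple stand-in for a real learning algorithm, and the data is invented.

```python
# Minimal supervised-learning sketch: "train" on labelled examples, then
# measure accuracy on a held-out test set. The 1-nearest-neighbour rule
# stands in for a real model; the data is made up for illustration.

def predict(training_data, x):
    """Predict the label of x as the label of the closest training input."""
    nearest = min(training_data, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# (input, label) pairs: small inputs labelled "low", large labelled "high"
train = [(1, "low"), (2, "low"), (3, "low"), (8, "high"), (9, "high")]
test = [(2.5, "low"), (7.5, "high")]     # held-out data the model never saw

accuracy = sum(predict(train, x) == y for x, y in test) / len(test)
print(accuracy)   # 1.0 on this toy set
```

The essential point is the separation of data sets: the relationship is learned from the training set, and effectiveness is judged only on data the algorithm has not seen.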

The level of business value provided by machine learning depends on a number of key factors (Fig. 4).

Fig. 4 – Key ML success factors



The quantity of examples in the training data set typically correlates directly with the accuracy of the predictive model, i.e. the more high quality examples which exist in the training set, the more accurate the model. High variability of the data within the training set is also critical to ensure good model performance as all data classifications must be populated with an adequate number of high quality examples. The completeness of the data within the training set, i.e. data dimensionality, is critical to avoid empty spaces within the data which would prevent efficient learning. Validation and test data sets must be available to evaluate the algorithm (model) behaviour and to ensure the model is behaving within acceptable performance boundaries. Finally, subject matter experts are required to ensure the model predictions make sense and to identify the most valuable data points.

Machine Learning Workflow

While there are many ML workflows, the representation below describes a commonly used workflow which will be used here to illustrate the key fundamental characteristics of an ML workflow (Fig. 5).

Fig. 5 – Generic ML workflow (Amershi et al. 2019)



Amershi et al. describe their nine stage machine learning workflow in terms of two main characteristics, the “data-centered essence of the process and the multiple feedback loops among the different stages”.

The workflow clearly illustrates the importance of data by devoting three stages to data collection, cleaning and labelling activities. There is consensus regarding the non-linear character of the ML workflow: it contains several feedback loops, is fundamentally iterative, and analysts regularly cycle among its stages. The process of developing effective ML solutions requires continuous loopback and experimentation across many of the workflow stages.
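The loopback behaviour can be sketched as a loop in Python. The "model" (a mean) and the "evaluation" (a tolerance check) are deliberately trivial stand-ins; the point is the control flow, in which a failed evaluation sends the workflow back to an earlier stage.

```python
# Sketch of the non-linear ML workflow: evaluation can loop back to an
# earlier stage (here, collecting more data) until the model is acceptable.
# The model and evaluation below are toy stand-ins for illustration only.

def train(data):
    return sum(data) / len(data)              # toy "model": the mean

def evaluate(model, target=5.0, tolerance=0.5):
    return abs(model - target) <= tolerance   # toy quality check

def workflow(initial_data, extra_batches):
    data = list(initial_data)
    for batch in extra_batches:
        model = train(data)
        if evaluate(model):
            return model, len(data)           # good enough: exit the loop
        data.extend(batch)                    # feedback loop: collect more data
    return train(data), len(data)

model, n = workflow([1, 2], [[8, 9], [5, 5, 5]])
print(n)   # 4: one loopback added more data before evaluation passed
```

In a real workflow the loopback target varies (feature engineering, labelling, even problem redefinition), but the structure is the same: evaluation results drive iteration rather than a one-way progression through stages.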

Mapping the Paradigms

Having examined both the standard technology value stream and the ML workflow in isolation, a direct comparison of both paradigms is useful in terms of understanding the challenges and opportunities associated with paradigm integration.

When examined at the highest level, both paradigms follow a basic four stage process which involves developing an understanding of the business problem/requirements, preparing and developing the solution, testing the solution and deploying the solution (Fig. 6).

Fig. 6 – Basic four-stage process across both paradigms



Despite these high level similarities, fundamental differences between the paradigms exist. Mapping the standard technology value stream (Kim, Humble, et al. 2016) against the generic ML workflow (Amershi et al. 2019) provides a basis for explaining these differences (Table 1 and Fig. 7).

Table 1 – Standard technology value stream and ML workflow mapping (Kim, Humble, et al. 2016; Amershi et al. 2019)



Fig. 7 – Standard technology value stream and ML workflow mapping (Kim, Humble, et al. 2016; Amershi et al. 2019)



Basic Stage 1 Mapping - Understand Business Problem / Define Requirements

The ML workflow requirements stage loosely maps to the customer request/design & analysis/design approval phases of the standard technology value stream. There is, however, a higher degree of outcome uncertainty at the early stage of the ML workflow. Beyond the unknown state of several key data-oriented and model-oriented factors, the problem definition itself is typically not available in advance, as it depends on the downstream analysis activities. While uncertainty does exist at the early stages of the standard technology value stream, it is present to a significantly lesser degree due to the highly prescribed and deductive nature of the development activities. This elevated requirements uncertainty leaves gaps in backlog (requirements) and effort-estimation activities in the ML workflow when compared with the standard technology value stream.

Basic Stage 2 Mapping - Prepare and Develop

The data collection, data cleaning, data labelling, feature engineering and model training stages of the ML workflow are all loosely analogous to the Development (Incl. test automation) phase of the standard technology value stream. The three data-oriented stages in the ML workflow would be classified as environment preparation activities in the standard technology value stream but the scale of the preparation work is a key difference. Up to 80% of the effort devoted to data science is typically associated with data gathering, preparation, and exploration. The scale of preparation activities is therefore significantly higher in the ML workflow.
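A flavour of why preparation dominates the effort can be given with a small Python sketch. The field names and cleaning rules below are invented for illustration; real preparation involves many such rules applied at scale.

```python
# Illustrative data-preparation step: drop incomplete records and normalise
# formats and labels before any model work begins. The fields and rules
# here are invented examples of the kind of cleaning real data requires.

raw = [
    {"amount": "12.50", "label": " fraud "},
    {"amount": None,    "label": "ok"},      # incomplete record
    {"amount": "9,10",  "label": "OK"},      # comma used as decimal separator
]

def clean(records):
    cleaned = []
    for r in records:
        if r["amount"] is None:                        # drop incomplete rows
            continue
        amount = float(r["amount"].replace(",", "."))  # normalise number format
        label = r["label"].strip().lower()             # normalise label noise
        cleaned.append({"amount": amount, "label": label})
    return cleaned

print(clean(raw))
# [{'amount': 12.5, 'label': 'fraud'}, {'amount': 9.1, 'label': 'ok'}]
```

Each rule is trivial in isolation; the effort lies in discovering which rules a real data set needs, which is why these stages consume the majority of data science work.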

The feature engineering and model training stages of the ML workflow map to traditional software development activities within the standard technology value stream; however, the characteristics of the software are significantly different. The explicit, rule-based software associated with the standard technology value stream is straightforward to specify and develop, and its performance is relatively predictable. Estimating the effort and complexity associated with developing and tuning implicit/inference-based software (i.e. the ML algorithm/model), which is highly dependent on external factors (such as data quality), is not an easily predictable process. Implicit/inference-based software development therefore presents a higher level of development complexity than traditional software development.
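The distinction can be made concrete with a toy Python contrast. The "learned" threshold below is an invented stand-in for model training, but it shows the essential difference: in one case the developer fixes the behaviour, in the other the data does.

```python
# Explicit, rule-based software vs. implicit, inference-based software.
# The learned threshold is a toy stand-in for model training; with
# different training data the learned behaviour changes, the explicit
# rule does not.

def explicit_rule(x):
    return "high" if x > 5 else "low"        # behaviour fixed by the developer

def learn_threshold(examples):
    """Infer a decision boundary from labelled data: the midpoint between
    the largest 'low' input and the smallest 'high' input."""
    lows = [x for x, y in examples if y == "low"]
    highs = [x for x, y in examples if y == "high"]
    return (max(lows) + min(highs)) / 2

threshold = learn_threshold([(1, "low"), (4, "low"), (8, "high"), (9, "high")])

def learned_rule(x):
    return "high" if x > threshold else "low"

print(threshold)        # 6.0 -- depends entirely on the training data
print(learned_rule(7))  # high
```

Because the learned behaviour is a function of the data rather than of the source code alone, estimating how much tuning will be needed, and how the software will behave on unseen inputs, is inherently harder than for the explicit rule.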

Basic Stage 3 Mapping - Test

The model evaluation stage of the ML workflow maps somewhat to the testing related activities within the standard technology value stream (i.e. test automation, UAT and exploratory & performance testing). The ability to manage product risk within the standard technology value stream depends on highly systemised and well-established testing activities. Within the standard technology value stream, the product risk level of the proposed change is typically well understood before being promoted to a production environment. The product risks associated with introducing an ML model into production are elevated due to the inherent uncertainty associated with machine learning outcomes and the difficulties associated with ML testing as a practice. There is therefore a higher degree of product risk and production readiness ambiguity within the ML workflow.
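One common way to contain this risk, sketched below in Python, is an explicit evaluation gate: a candidate model is promoted only if its held-out metric clears a pre-agreed threshold. The threshold and metric values here are invented for illustration.

```python
# Sketch of a model-evaluation gate: promote a candidate model to production
# only if its metric on held-out data clears an agreed threshold. The
# threshold value and metric numbers are invented for this example.

ACCEPTANCE_THRESHOLD = 0.90   # agreed with the business before release

def promote_if_acceptable(candidate_accuracy, threshold=ACCEPTANCE_THRESHOLD):
    """Return the release decision for a candidate model."""
    if candidate_accuracy >= threshold:
        return "promote to production"
    return "loop back: retrain or gather more data"

print(promote_if_acceptable(0.93))   # clears the gate
print(promote_if_acceptable(0.84))   # blocked: product risk too high
```

Such a gate mirrors the promotion criteria of the standard technology value stream, though for ML the chosen metric only partially captures product risk, which is why production readiness remains more ambiguous.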

Basic Stage 4 Mapping - Deploy

The model deployment and model monitoring stages within the ML workflow closely map to the Deployment wait and Verify customer receives expected value stages within the standard technology value stream. Of note here is the higher degree of production deployment and operational expertise within existing software engineering teams when compared with data science teams: while similar challenges exist across both paradigms, deployment and monitoring expertise is far more established within the software engineering discipline (the standard technology value stream).


Fundamental Differences Between the Workflows

Traditional software applications, developed in an explicit, rule-based fashion, are expected to perform consistently under known conditions without the need for regular tuning. The ML workflow is specifically designed to facilitate high levels of continuous refinement of productionized assets (i.e. loopbacks as mentioned earlier). Feedback can occur from model training to feature engineering or from model evaluation and model deployment to any earlier stage of the workflow, even after deployment to production. Continuous refinement/tuning of productionized assets is more evident within the ML workflow than in the traditional technology value stream.
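A minimal Python sketch of what post-deployment monitoring might check is shown below. The drift test (comparing the mean of live inputs with the training-time mean) and all the numbers are simplified illustrations; production systems monitor many richer statistics.

```python
# Sketch of post-deployment model monitoring: flag drift when live input
# data has moved away from what the model saw at training time, triggering
# a loopback into the workflow. The drift test and numbers are illustrative.

from statistics import mean

def drifted(training_inputs, production_inputs, tolerance=1.0):
    """Flag drift when the live input mean departs from the training mean."""
    return abs(mean(production_inputs) - mean(training_inputs)) > tolerance

train_inputs = [4.8, 5.1, 5.0, 4.9]
live_stable = [5.2, 4.9, 5.0]
live_shifted = [8.9, 9.4, 9.1]   # the world has changed under the model

print(drifted(train_inputs, live_stable))    # False: keep serving
print(drifted(train_inputs, live_shifted))   # True: loop back and retrain
```

This is the mechanism behind continuous refinement of productionized assets: monitoring results feed directly back into earlier workflow stages rather than merely raising operational alerts.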

The ML workflow involves considerably more interdependent artefacts than the standard technology value stream. These artefacts include the data sets (training, validation, test and production), the algorithm or model (i.e. software), the parameters passed to the algorithm, and the application itself. The debugging process within the ML workflow is consequently more complex, as the highly coupled nature of these artefacts makes it inherently difficult to isolate the impact of any single change. Challenges such as this have led practitioners to resort to ad-hoc, trial-and-error debugging practices. High degrees of coupling between artefacts and ad-hoc approaches to debugging are antithetical to established practices within the standard technology value stream. The highly coupled nature of ML artefacts is therefore a fundamentally different architecture pattern between the paradigms.
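One practical consequence of this coupling, sketched below in Python, is that an ML release is meaningfully identified only by the combination of its artefacts. The fingerprinting scheme is a hypothetical illustration, not a reference to any particular tool.

```python
# Sketch of versioning coupled ML artefacts together: a release is
# identified by one content hash over the data, the parameters and the
# model code, so a change to any artefact produces a new version. This
# scheme is illustrative, not a real tool's behaviour.

import hashlib
import json

def release_fingerprint(dataset, params, model_source):
    """One deterministic content hash across all coupled artefacts."""
    payload = json.dumps(
        {"data": dataset, "params": params, "model": model_source},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

v1 = release_fingerprint([1, 2, 3], {"lr": 0.1}, "def predict(x): ...")
v2 = release_fingerprint([1, 2, 3], {"lr": 0.2}, "def predict(x): ...")

print(v1 != v2)   # True: a parameter change alone yields a new version
```

Treating data, parameters and code as a single versioned unit is one way to make the impact of change traceable despite the coupling, in contrast to traditional delivery where versioning the code alone usually suffices.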

There is significant uncertainty in the early stages of the ML workflow regarding the ultimate business value of predictive or prescriptive insights, due to the critical dependency on factors ranging from data collection and data quality through to the expertise of the data science practitioners. Many variables may not be understood until well into the project; the organisation will therefore typically have limited visibility of its likely return on investment. The return on investment for ML solutions thus carries a higher degree of uncertainty than explicit software development, which can lead to several business-related challenges (such as funding) for ML solution delivery.

The fundamental differences between the ML workflow and the standard technology value stream are summarised in Table 2.

Table 2 – Summary of differences between paradigms



Conclusion

This article described some dominant technical delivery workflows. The origins of the benchmark workflow for software engineering practice across the industry, the standard technology value stream, were described and its linear flow characteristics highlighted. The fundamentals of machine learning were then introduced, followed by a description of the ML workflow and its data-centric and non-linear nature. The paradigms were mapped, revealing high-level similarities but significant low-level differences across eight workflow characteristics: requirements, development activities, software characteristics, validation and change control, deployment, linearity, architecture and outcome uncertainty.

These differences can become significant progress inhibitors to operationalizing data science solutions.

The next article in this three-part series (next week) will examine the challenges and opportunities which exist in terms of integrating aspects of these paradigms in order to successfully operationalise data science solutions.

Fergal Hynes, July 2021

Bibliography

Accenture (2014) Big Success With Big Data.

Allspaw, J., Hammond, P. (2009) ‘10+ Deploys per Day: Dev and Ops Cooperation at Flickr’, presented at the Velocity Conference.

Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., Zimmermann, T. (2019) ‘Software Engineering for Machine Learning: A Case Study’, in Proceedings - 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP 2019.

Beck, K. et al. (2001) Manifesto for Agile Software Development, agilemanifesto.org.

Bharadwaj, A.S. (2000) ‘A resource-based perspective on information technology capability and firm performance: An empirical investigation’, MIS Quarterly: Management Information Systems.

Boh, W.F., Slaughter, S.A., Espinosa, J.A. (2007) ‘Learning from experience in software development: A multilevel analysis’, Management Science.

Bourque, P., Fairley, R.E. (2014) SWEBOK v.3 - Guide to the Software Engineering - Body of Knowledge., IEEE Computer Society.

Donoho, D. (2017) ‘50 Years of Data Science’, Journal of Computational and Graphical Statistics.

Hill, C., Bellamy, R., Erickson, T., Burnett, M. (2016) ‘Trials and tribulations of developers of intelligent systems: A field study’, in Proceedings of IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC, 162–170.

Humble, J., Farley, D. (2010) Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation, Continuous delivery.

Kandel, S., Paepcke, A., Hellerstein, J.M., Heer, J. (2012) ‘Enterprise data analysis and visualization: An interview study’, IEEE Transactions on Visualization and Computer Graphics.

Khomh, F., Adams, B., Cheng, J., Fokaefs, M., Antoniol, G. (2018) ‘Software Engineering for Machine-Learning Applications: The Road Ahead’, IEEE Software.

Kim, G., Humble, J., Debois, P., Willis, J. (2016) The DevOps Handbook : How to Create World-Class Agility, Reliability, and Security in Technology Organizations, The DevOps handbook.

Kim, M., Zimmermann, T., Deline, R., Begel, A. (2018) ‘Data scientists in software teams: State of the art and challenges’, IEEE Transactions on Software Engineering.

Kim, M., Zimmermann, T., DeLine, R., Begel, A. (2016) ‘The emerging role of data scientists on software development teams’, in Proceedings - International Conference on Software Engineering.

Lesser, E., Ban, L. (2016) ‘How leading companies practice software development and delivery to achieve a competitive edge’, Strategy and Leadership, 44(1), 41–47.

McNulty, K. (2018) What Is Machine Learning? [online], Towards Data Science, available: https://towardsdatascience.com/what-is-machine-learning-891f23e848da.

Naur, P. (1974) Concise Survey of Computer Methods, Petrocelli Books (1974).

Patel, K., Fogarty, J., Landay, J.A., Harrison, B. (2008) ‘Investigating statistical machine learning as a tool for software development’, in Conference on Human Factors in Computing Systems - Proceedings.

Patil, D.J. (2011) ‘Building Data Science Teams’, Science.

Pfeffer, J., Sutton, R.I. (2006) ‘Evidence-based management’, Harvard Business Review.

Riungu-Kalliosaari, L., Kauppinen, M., Männistö, T. (2017) ‘What can be learnt from experienced data scientists? A case study’, in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).

Schilling, D. (2013) ‘Knowledge Doubling Every 12 Months, Soon to be Every 12 Hours’, Industry Tap into News.

Volpi, M. (2019) ‘How open-source software took over the world’, techcrunch.com.

Wells, A.R., Chiang, K. (2017) Monetizing Your Data.

Zhang, J.M., Harman, M., Ma, L., Liu, Y. (2020) ‘Machine Learning Testing: Survey, Landscapes and Horizons’, IEEE Transactions on Software Engineering.
