Operationalising Data Science #2 of 3 - Integrating technical delivery workflows
Introduction
The previous article in this series described the two dominant technical delivery workflows (or 'paradigms'): (1) the standard technology value stream with its linear flow characteristics and (2) the data-centric, non-linear ML workflow. The paradigms were mapped, revealing high-level similarities but significant low-level differences. This article examines the challenges and opportunities involved in integrating aspects of these paradigms in order to successfully operationalize data science solutions.
Paradigm Integration Challenges and Opportunities
Paradigm Integration Challenges
Before examining the challenges associated with integrating the paradigms, it is important to first identify the challenges already associated with implementing the standard ML workflow.
Pre-existing Challenges Within Data Science Workflows
There are a number of existing data science-related challenges within the industry. On analysis, the most frequently cited challenges are associated with (a) data operations, (b) speed and (c) testing. These three areas are illustrated in Fig. 1 in relative terms of frequency of reference.
Fig. 1 – Three most referenced pre-existing challenges facing data scientists
Data Operations Challenges
Access to high quality data is a necessary antecedent to effective data science activities and achieving this access has proven to be highly problematic. A significant challenge is the lack of cooperation from key stakeholders in terms of data collection and preparation. The challenge of acquiring data was also highlighted in terms of difficulties in understanding the location and schema (format) of the data, which could often be encoded in non-extractable data formats. Data quality challenges exist in terms of missing, incorrect or inconsistent data values and difficulties with identifying data sources and integrating data from multiple sources. Poor data quality was also found to be a significant challenge for data scientists in Microsoft where the main issues identified included the presence of defects in the data collection and sampling procedures and missing or delayed data. Fields containing missing or null attributes were also commonly observed, as well as heterogeneous data in single columns.
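The data quality issues described above (missing or null attributes and heterogeneous data in single columns) can be surfaced early with a simple profiling pass. The following is a minimal sketch only, using the Python standard library; the records, the field names and the profile_column helper are hypothetical and are not taken from any of the studies cited.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"customer_id": 1, "age": 34, "country": "IE"},
    {"customer_id": 2, "age": None, "country": "IE"},
    {"customer_id": 3, "age": "41", "country": ""},
    {"customer_id": 4, "age": 29, "country": "FR"},
]

# Values treated as "missing" in this sketch (an assumption; real data may use other markers).
MISSING = (None, "")


def profile_column(rows, column):
    """Summarise missing values and the mix of Python types found in one column."""
    values = [row.get(column) for row in rows]
    missing = sum(1 for v in values if v in MISSING)
    types = Counter(type(v).__name__ for v in values if v not in MISSING)
    return {
        "missing": missing,
        "missing_ratio": missing / len(values) if values else 0.0,
        "types": dict(types),
        # More than one type in a column signals heterogeneous data.
        "heterogeneous": len(types) > 1,
    }


report = {col: profile_column(records, col) for col in ("customer_id", "age", "country")}
for col, summary in report.items():
    print(col, summary)
```

In this illustrative data the age column is flagged as heterogeneous because it mixes integers and strings, and both age and country contain one missing value. A check of this kind does not fix the data, but it makes the extent of the preparation work visible before analysis begins.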
Dealing with large-scale data sets is also frequently highlighted as a challenge for data scientists. Challenges exist regarding data scientists' ability to create scripts that run at scale and it has been noted that algorithms developed by data scientists may not scale as the data sets grow. The requirement for data scientists to become proficient programmers in order to work with large-scale data sets is frequently highlighted, as large data sets cannot be analysed “using Excel or R-like tools alone” (Kim et al. 2018, p. 14).
The negative impacts of issues involving data access, quality and scale have been highly significant for data science as a discipline. Data access and quality issues have resulted in data science personnel spending large amounts of time on set-up activities as opposed to activities associated with generating valuable insights from the data. Industry experts have highlighted that a large portion of the work involves cleaning and shaping data just to enable data analysis. It is thought that between 50% and 80% of a data scientist's time is spent collecting and preparing data before eliciting useful insights is possible.
Speed Challenges
The scale of data sets is also a factor when it comes to outcome delivery velocity, with large data sets negatively impacting workflow velocity. The inertia associated with using large data sets during model training is frequently referenced and often comes with the suggestion that model building speed could be improved, albeit at a cost of effectiveness, by using smaller data sets.
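One way to work with a smaller data set during early iterations is to draw a sample that preserves the proportion of each class. The following is a minimal sketch under stated assumptions: the stratified_sample helper, the 10% fraction and the synthetic rows are all hypothetical, and the sampling strategy is an illustration rather than a recommendation drawn from the cited studies.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=0):
    """Return roughly `fraction` of the rows, keeping at least one row per label."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))  # sampling without replacement
    return sample


# Hypothetical data set: 900 "ok" rows and 100 "fraud" rows.
rows = [("ok", i) for i in range(900)] + [("fraud", i) for i in range(100)]
subset = stratified_sample(rows, label_of=lambda r: r[0], fraction=0.1)
print(len(subset), sum(1 for r in subset if r[0] == "fraud"))
```

The subset trains faster but carries less information, which is the cost in effectiveness noted above; the full data set would still be needed before any production decision.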
Inadequate tooling is also referenced as a limiting factor in terms of achieving fast iteration on the analysis, and tooling deficiencies are considered to have a particularly negative impact on the more experienced team members, i.e. those who are typically challenged with executing the most difficult analysis tasks.
The essential issue is that the speed of outcome delivery from stages within the ML workflow can suffer from inertia due, in particular, to the ever increasing scale of the data and the deficiencies in tooling. This inertia can manifest itself in difficulties for data scientists in meeting scheduled delivery commitments.
Testing Challenges
Testing of data science solutions is certainly a major challenge and the current methods can be improved. As the desired outcomes of an ML solution are typically unknown at the outset, developing tests to validate the actual outcome against a predefined expected outcome is impossible. Also, due to the high behavioural interdependency which exists between ML artefacts, whereby a change in one artefact can alter the behaviour of another/all other artefacts, root cause analysis becomes highly challenging. Despite the fact that the requirement for adequate ML solution testing is highly referenced, the discipline of ML testing remains immature and there is consensus on the necessity to ensure it is an area of focus into the future.
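Where an exact expected output cannot be defined in advance, a test can instead assert a property that should hold regardless of the specific result. The sketch below illustrates this idea with a metamorphic-style check, a technique from the wider software testing literature: reordering the training rows should not change a model's predictions. The toy nearest-centroid classifier, the data and all function names are hypothetical and serve only to keep the example self-contained.

```python
import random


def fit_centroids(samples):
    """Train a toy nearest-centroid classifier: one mean point per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    # Sorting the labels makes tie-breaking independent of training order.
    return min(sorted(centroids), key=dist)


# Hypothetical, well-separated training data and probe points.
train = [((1, 1), "low"), ((2, 1), "low"), ((1, 2), "low"),
         ((8, 9), "high"), ((9, 8), "high"), ((9, 9), "high")]
probes = [(0, 0), (3, 2), (7, 7), (10, 10)]

baseline = [predict(fit_centroids(train), p) for p in probes]

# Metamorphic relation: reordering the training rows must not change predictions.
shuffled = train[:]
random.Random(42).shuffle(shuffled)
reordered = [predict(fit_centroids(shuffled), p) for p in probes]

order_invariant = baseline == reordered
print(order_invariant)
```

The test never states what the "correct" prediction is; it only states a relationship between two runs. Checks of this kind do not remove the root cause analysis difficulty described above, but they give test engineers something concrete to automate.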
Other Challenges
While data operations, speed and testing were the challenges most highly referenced, several other challenges have also been identified. Issues exist regarding identifying the right problem to solve, dealing with unrealistic customer expectations, measuring the business impact of the data science effort and effectively communicating outcome uncertainty. Efficient use of a data scientist's time is another frequently mentioned issue, with time-wasting mitigated by developing or leveraging a common data platform to enable easy data access. Productionizing and deploying the developed data science solution is also an issue outlined by commentators across the industry.
Challenges Introduced by Paradigm Integration
Having outlined the challenges which currently exist within the ML workflow, it is appropriate to examine the challenges associated specifically with the integration of the ML workflow with the standard technology value stream. The most significant paradigm integration challenges were identified and grouped by theme (Fig. 2).
Fig. 2 – Paradigm integration challenges by theme
Two particularly dominant paradigm challenges which emerged pertained to (1) role clarity and (2) education. Prominent but less intensively discussed themes included working within agile frameworks, communication and challenges associated with practitioner mindsets.
Role Clarity
Role clarity is a critical element of organisational success with positive correlations proven between role clarity, work satisfaction and low staff turnover rates. As previously discussed, high degrees of ambiguity surround the definition of both the discipline of data science and the roles fulfilled by data scientists within organisations. Role ambiguity such as this is unhelpful in terms of achieving efficient integration within established role-based organisational structures such as the agile team model, widely adopted within the standard technology value stream. A number of considerations pertaining to role clarity are now examined in terms of the integration challenges they are likely to present.
The lack of a data science workforce framework which clearly outlines the knowledge, skills and abilities of data scientists has been recognised as a significant gap. The role of a data scientist is often blurred to such a degree that it results in inappropriate expectations of data scientists and unrealized potential of data science.
Data Engineers vs. Data Scientists
Worthy of particular note is the requirement to clarify the relationship between data engineering and data science. Data scientists depend on data engineers to create the infrastructure that makes the data easily available to the data scientist. The criticality of this distinction is becoming more evident in terms of how organisations are now beginning to set up general and specific data scientist roles to support data science and engineering, one of which is that of ‘data engineer’. There is therefore a clear requirement to have a delineation of responsibilities between these two very distinct but interdependent roles within the existing ML workflow. The data engineer's responsibilities logically extend to the data-oriented stages of the ML workflow while the data scientist's responsibilities logically lie in the model-oriented stages (Fig. 3).
Fig. 3 – Delineation between data engineering and data science (Amershi et al. 2019)
Challenges to Established Roles
An integration of the paradigms also challenges the definition of existing roles within the technology value stream.
Software Testing
The changing role of software test professionals is commonly discussed in the context of supporting the validation tasks associated with delivering ML solutions. There is an emerging requirement for specialised ML testing expertise, a new skillset when compared with the existing test engineering skillset evident in software engineering teams. There is no doubt that new challenges will face software test professionals in terms of testing ML solutions, with substantial changes on the horizon for software engineering in the context of testing and debugging.
Business Analysts / Product Managers / Product Owners
The role of business analysts, product managers and product owners within the standard technology value stream will also likely change in the context of ML solution definition. The standard technology value stream typically begins with a market driven business need, which results in explicit requirements, but this relationship is redefined within the emergent ML workflow. The relationship between business value and the ML workflow can be described as typically being the exact opposite, wherein requirements are generated inductively from the insights derived from the data (Fig. 4). There are significant role changes implied by this inversion, including a change in the nature of the relationship between the business and technology organisations. The technology organisation will, over time, likely become a significant driver of business decision-making thereby implying an evolution of the role of the business analyst/product managers/product owners and a necessary change in their skillset.
Fig. 4 – Inductive vs. deductive requirements management
Education
The recency of the emergence of data science as a mainstream discipline within the technology industry means that a lack of understanding of data science still exists among several key stakeholder cohorts. Successful integration of aspects of the standard technology value stream and the ML workflow will require significant stakeholder education in the context of establishing an appreciation of data science as a discipline.
A fundamental challenge for data scientists will be convincing their non-data science colleagues of the value of data science. Significant confusion exists in agile teams regarding both the role of data scientists and the terminology they use. The characteristics of ML solutions differ from those of traditional software components in several areas (e.g. regarding complex component entanglement), and the suggestion is that these differences will require substantial changes to existing software engineering practices and that engineers with traditional software engineering backgrounds need to learn how to work alongside ML specialists.
The requirement for data scientists to educate their colleagues in both the lexicon of data science and the nuances of data science workflows is clearly identified. Microsoft has consciously addressed internal cross-functional education in a number of ways, including holding twice-yearly internal conferences on data science, hosting weekly open forums and leveraging internal mailing lists and online forums to allow employees to learn more about data science.
A further critical aspect of education is the requirement for data scientists to not only focus on cross-functional education but to continuously collaborate with other data scientists as this is considered one of the best ways a data scientist can grow and learn.
Working in Agile Teams
The acceptance of uncertainty, one of the principles of agile development within the standard technology value stream, would appear synergetic with the iterative nature of the ML workflow. Despite this seemingly obvious synergy, integration difficulties are likely to emerge with regard to the implementation of agile practices for the delivery of data science solutions.
Agile scrum is based on the regular delivery of valuable increments of software within defined sprint cycles. The concept of sprint deliverables is challenging for data scientists because they often work on hypotheses, the delivery schedule of which is difficult to estimate (Hukkelberg and Berntzen 2019). The creativity and freedom which are described as necessary elements for the effective application of data science can therefore lead to missed sprint goals, an undesirable situation for traditional scrum teams and wider business stakeholders such as product management, project managers, senior management and customers.
A further challenge regarding the integration of data scientists into agile teams relates to the scale of the data sets being analysed and their effect on estimation and velocity. The schedule impact is significant when, for example, batch processing jobs on large data sets must be executed and this can be highly disruptive within fixed duration sprint cycles.
It is clear that, due to the nature of the data science workflow, challenges do exist for data scientists in terms of integrating with agile scrum teams within the standard technology value stream.
Communication
Significant communication challenges also threaten the effective integration of data science. There are difficulties associated with explaining the exploratory nature of data science and the unpredictability of the results to non-data scientists. Communication challenges also exist in terms of conveying the resulting insights to leaders and stakeholders in an effective manner. The capacity to transform analytical concepts and outcomes into business friendly interpretations and the ability to communicate actionable insights to non-technical audiences are also important elements in being a good data scientist. There is clearly a critical responsibility on data scientists to communicate effectively with stakeholders in order to achieve the required levels of understanding and support within their organisations.
Mindset - Scientist vs Engineer
The integration of the standard technology value stream and ML workflow is also significant from a philosophical standpoint. The standard technology value stream has been predominantly designed by engineers while the ML workflow has largely been designed by researchers and scientists. The dichotomy between engineers and scientists has been studied for over half a century and mindset differences between these cohorts have been recognised.
The major difference between scientists and engineers is that scientists seek knowledge while engineers apply it, and the two groups differ fundamentally in terms of their needs satisfaction (Atkinson et al. 1969). Attitudes towards industrial taxonomies have evolved since 1969 and the appropriateness of classifying all professionals involved in any particular endeavour within a single homogeneous group (i.e. scientist or engineer) has been challenged (Rothaermel and Hess 2007). A more recent study did somewhat support the traditional view by highlighting a bias within the scientist cohort towards research and intrapreneur activities, with engineers being motivated more by enabling and bridging activities (Bignon and Szajnfarber 2015). Based on this inherent difference in professional philosophy, mindset challenges may also emerge during paradigm integration.
Challenge Resolution Responsibility
Examination of the main challenges introduced by paradigm integration from the perspective of resolution responsibility highlights the necessity to have a holistic approach to addressing the paradigm integration challenges. The association of each of the eleven challenges to a level within the organisation (Fig. 2) reveals that 45% of the challenges must be addressed at the organisational level, 63% at the team level and 72% at the individual level. In order to implement an effective paradigm integration strategy there is therefore a clear requirement to elicit support from actors across all levels of the organisation.
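The three percentages sum to more than 100% because a single challenge can require action at more than one level. The arithmetic can be sketched as follows. The total of eleven challenges comes from the text above; the per-level counts of 5, 7 and 8 are an assumption, inferred because they are the only whole numbers out of eleven that reproduce the quoted figures (which appear to be truncated rather than rounded).

```python
TOTAL_CHALLENGES = 11  # stated in the article

# Assumed counts, inferred from the quoted percentages; not stated in the article.
challenges_per_level = {"organisational": 5, "team": 7, "individual": 8}

shares = {level: count / TOTAL_CHALLENGES for level, count in challenges_per_level.items()}

# Truncating (not rounding) to whole percentages reproduces the quoted figures.
truncated = {level: int(share * 100) for level, share in shares.items()}

# Level assignments beyond one per challenge, i.e. the degree of overlap.
overlap = sum(challenges_per_level.values()) - TOTAL_CHALLENGES

print(truncated, overlap)
```

Under these assumed counts there are nine more level assignments than there are challenges, which is the quantitative basis for the claim that no single level of the organisation can resolve the integration challenges alone.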
Paradigm Integration Opportunities
The importance of collaboration among experts from different fields in order to deliver effective ML solutions is clear. While data science demands competences that go beyond software engineering, there is widespread recognition that aspects of traditional software engineering can play a significant role in improving the process of data science delivery.
The need for software engineering practices in order to cater for the emergence of data science is acknowledged while the need for software engineering for machine learning is also clearly highlighted. Data scientists can benefit significantly from the availability of capabilities such as software engineering (including systems design and analysis) and quality assurance.
The most significant paradigm integration opportunities include leveraging (1) the operationalizing/deployment and (2) validation capabilities of the technology value stream in the delivery of data science solutions.
Operationalizing and Deployment
The success of data science solutions has a critical dependency on the capability to operationalize the solution in a production environment. This step of going the last mile is often difficult for modelling specialists, both in terms of the production readiness of the developed application and in terms of the management of the development and operational infrastructure.
In terms of programming capability, the requirement exists for data scientists to have somebody who can create production-ready model code. The requirement to apply software engineering skills to building models in order to achieve customizability and extensibility is also clear. Fundamental software engineering practices including version control, re-use and continuous delivery are also considered valuable competencies which are more prevalent in traditional software engineering teams than in data science teams.
The development and operationalization of software application code and the management of software infrastructure are therefore highlighted as capabilities which commonly exist within traditional software engineering teams and which can benefit data science delivery.
Test / Validation
Test engineering, a well-established discipline within the standard technology value stream, plays an important role in the process of product risk mitigation by ensuring that the software product is adequately tested. Test and validation activities are common across both paradigms; however, there is evidence of significant deficiencies regarding the validation activities currently carried out within the data science domain. There is little in terms of industrial reference to the discipline of testing/validation within data science. While the inductive characteristics of ML testing differ from traditional test engineering methods, the existence of an established professional test engineering discipline within the technology value stream can clearly be leveraged and evolved to address aspects of the ML testing need.
Significant challenges which currently exist within data science can therefore potentially benefit from appropriate integration of the standard technology value stream. Core capabilities pertaining to operationalizing, deploying and validating software already exist within the standard technology value stream and it is likely that these capabilities could be better leveraged to benefit data science solution delivery (Fig. 5).
Fig. 5 – Integration opportunities
Conclusion
This article examined the challenges and opportunities which exist in terms of integrating aspects of the standard technology value stream and the non-linear ML workflow in order to more successfully operationalize data science solutions.
The three most prevalent existing data science delivery challenges were identified as data operations, speed of delivery and testing. Establishing role clarity for data scientists and redefining the roles of established professionals within the standard technology value stream were considered significant challenges. The challenge of educating non-data scientists in the discipline of data science was also considered to be significant. The other significant integration challenges identified involved data scientists working within agile teams, effective communication of data science across key stakeholder cohorts and bridging the gap between science and engineering mindsets.
Integration opportunities were identified whereby traditional software engineering could contribute to solving significant data science challenges pertaining to operationalizing, deployment and testing.
The next (and final) article in this series will provide practical recommendations in terms of next steps in developing a successful strategy to operationalize data science solutions within your organisation.
Fergal Hynes, July 2021
Bibliography
Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., Zimmermann, T. (2019) ‘Software Engineering for Machine Learning: A Case Study’, in Proceedings - 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP 2019.
Atkinson, A.C., Bobis, A.H., Farris, G.F., Harroid, R.W. (1969) ‘Difference Between Engineers and Scientists’, IEEE Transactions on Engineering Management.
Bignon, I., Szajnfarber, Z. (2015) ‘Technical Professionals’ Identities in the R and D Context: Beyond the Scientist Versus Engineer Dichotomy’, IEEE Transactions on Engineering Management.
Cao, L. (2017) ‘Data science: A comprehensive overview’, ACM Computing Surveys.
Ebert, C., Heidrich, J., Martinez-Fernandez, S., Trendowicz, A. (2019) ‘Data science: Technologies for better software’, IEEE Software.
Fisher, D., DeLine, R., Czerwinski, M., Drucker, S. (2012) ‘Interactions with big data analytics’, Interactions.
Hassan, S. (2013) ‘The importance of role clarification in workgroups: Effects on perceived role clarity, work satisfaction, and turnover rates’, Public Administration Review.
Hill, C., Bellamy, R., Erickson, T., Burnett, M. (2016) ‘Trials and tribulations of developers of intelligent systems: A field study’, in Proceedings of IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC, 162–170.
Hukkelberg, I., Berntzen, M. (2019) ‘Exploring the challenges of integrating data science roles in agile autonomous teams’, in Lecture Notes in Business Information Processing.
Kandel, S., Paepcke, A., Hellerstein, J.M., Heer, J. (2012) ‘Enterprise data analysis and visualization: An interview study’, IEEE Transactions on Visualization and Computer Graphics.
Khomh, F., Adams, B., Cheng, J., Fokaefs, M., Antoniol, G. (2018) ‘Software Engineering for Machine-Learning Applications: The Road Ahead’, IEEE Software.
Kim, M., Zimmermann, T., DeLine, R., Begel, A. (2018) ‘Data scientists in software teams: State of the art and challenges’, IEEE Transactions on Software Engineering.
Kim, M., Zimmermann, T., DeLine, R., Begel, A. (2016) ‘The emerging role of data scientists on software development teams’, in Proceedings - International Conference on Software Engineering.
Lohr, S. (2014) ‘For Data Scientists, “Janitor Work” is Hurdle to Insights.’, New York Times, Late Edition.
Riungu-Kalliosaari, L., Kauppinen, M., Männistö, T. (2017) ‘What can be learnt from experienced data scientists? A case study’, in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).
Rothaermel, F.T., Hess, A.M. (2007) ‘Building dynamic capabilities: Innovation driven by individual-, firm-, and network-level effects’, Organization Science.
Saltz, J.S., Grady, N.W. (2017) ‘The ambiguity of data science team roles and the need for a data science workforce framework’, in Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017.
Zhang, J.M., Harman, M., Ma, L., Liu, Y. (2020) ‘Machine Learning Testing: Survey, Landscapes and Horizons’, IEEE Transactions on Software Engineering.