Effectively Leading a Data Science Initiative
Bhalchandra (Bhal) Madhekar
Software Engineering Leader | Big Data | Advanced Analytics | AI/ML | Cloud Technologies | SaaS / PaaS | KaggleX BIPOC Mentor | GitHub.com/madhekar
New data science leaders aspiring to manage advanced analytics teams face a set of challenges and must think through key questions to make the best use of tools and teams and deliver ROI for the business. More businesses are seeking enhanced business performance from the insights and predictions drawn out of modern-day data lakes and data warehouses. Although data science methods are not quite novel in some industry segments, such as insurance and finance, they are new to many other organizations seeking to adopt data science to improve financial outcomes or business decisions. Now, the most obvious and daunting question is: what is the most effective way to start and lead a data science effort, and to get a team humming at optimal performance, producing business value?
Well, you may say, this is a subjective question, very much dependent on industry and business needs, objectives, and adoption strategy. But, without loss of generality, the most natural and effective way to begin, in my mind, is to self-reflect and whiteboard possible questions before embarking on building data science teams or initiatives. Here are some of the top questions I have grappled with and, over time, refined (the hard way!). Having drafted the specific objective and expected outcome of the data science initiative, the leader needs to pause and think through the next set of imperative questions up front:
- What is an effective communication strategy within the team, across business teams, and with IT?
- What skills are needed based on current and future business goals, objectives, and needs?
- What is the plan to achieve:
- Data { availability, quantity, quality, diversity }
- Business requirements, assumptions, limitations
- What is the plan for measuring the quality of insights produced?
- What is the plan or architecture for achieving reproducibility of models and analytic results?
- What is the strategy for preventing bias in data collection, algorithms, and decision making?
- What is the fallback strategy if data science outcomes are not actionable or useful?
- What is the security plan to ensure IP is not exposed and data is not misused?
In this article I have made an effort to go over some of these questions. I understand some may not be applicable across all industry segments, and there is certainly large variation in the level of interest and adoption. This article draws on lessons learned from my experience building data science teams to create hybrid data science products that deliver near real-time predictions. I have tried to distill the hardest lessons from the early days of those initiatives, including refactoring software, training models, and maintaining the relevance of predictions over time.
Building a comprehensive product or initiative strategy is the initial challenge, with short- and long-term objectives of delivering effective, repeatable detection capability with incremental learning, while acquiring and retaining the target customer personas. Once you have an initial blueprint of the initiative in place, it is important to create structure around it for long-term sustenance. Building a team structure around the effort is the next big step.
Data Science Team
The objectives behind a data science team include data acquisition, feature selection, model selection, statistical analysis, development processes, and a communication interface with the rest of the business. To arrive at these objectives, we need to break down the overall scope of the initiative. Keep in mind that a small team with clearly defined roles can drive early wins.
- Data Architect: designs and maintains data acquisition, storage, data models, data applications, and process workflows.
- Data Engineer: makes data available for data science efforts; designs, develops, and codes data applications for data capture and analysis.
- Data Scientist: researches the available datasets using tools such as data visualization and machine learning to understand the data and to build, validate, and test models.
- Data/Business Analyst: analyzes a large variety of data using visualization tools and extracts information about system, service, or organizational performance.
- IT Infrastructure Engineer: builds and manages IT systems for the organization.
- Business Stakeholder/Client: provides high-level guidance about what can be made operational, value judgements, etc.
Development Process
Once your team is in place, building a strong development process is necessary for long-term sustainability. An effective data science leader brings together two different types of work, namely software engineering and data science. This is a challenge, since the software development lifecycle differs from the data science process. Building software progresses more or less linearly toward development and delivery. Data science traces a non-linear trajectory, iterating over assumptions, ideas, and results, sometimes only to start all over again. It is difficult to run a data science team while matching the business reality of setting clear, time-bound expectations with stakeholders. A leader almost needs to juggle two different development processes. In my experience, I initially ended up using Agile/Scrum for the software and DevOps teams and the iterative CRISP-DM process for the data science team.
Quality of Insights
Another important judgement a leader has to make is assessing the quality of the data science outcome. They must answer the question: are my results the best possible, or merely adequate? Although there are a large number of tools to verify model quality and automate model validation and verification, it is still difficult to judge quality accurately. Results depend heavily on many factors, such as the requirement assumptions made in the design phase, the availability of high-quality, diverse data, and biases in data collection and model creation. Although daunting, it is better to discuss, brainstorm, and document the process decisions made, and then iterate on those choices to minimize biases.
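One concrete way to ground the "adequate versus best possible" judgement is cross-validation, which reports a distribution of scores rather than a single number. Below is a minimal sketch, assuming a scikit-learn workflow; the dataset and model here are purely illustrative stand-ins, not anything from a real initiative:

```python
# Minimal model-validation sketch using k-fold cross-validation.
# The synthetic dataset and logistic regression model are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic dataset standing in for real business data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

model = LogisticRegression(max_iter=1000)

# Five folds yield five scores; the spread between them is often more
# informative than the mean when deciding whether a model is "good enough".
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A wide spread across folds is an early warning sign that results depend on which slice of data the model saw, which ties back to the data diversity and bias concerns above.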
Reproducibility of Results
Reproducibility of results becomes very difficult if different tools are used in development than in production. Early on, I addressed this with a deployment and containerization process that packaged data science models as pluggable modules within a software application. This not only makes it possible to build, test, and deploy a module independently, but also helps with managing release cycles and versioning effectively. There are other reproducibility challenges, such as loss of traceability from incorrectly porting the model creation process to production, and regression in model results caused by new datasets with different data assumptions and diversity.
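Independent of the container toolchain, one lightweight practice that supports traceability is recording a manifest for each training run: the random seed, the environment version, and a hash of the training data. A minimal sketch follows; the manifest fields and the helper name are my own illustration, not a standard:

```python
# Sketch of a run manifest capturing what is needed to reproduce a
# model run later. Field names here are illustrative, not a standard.
import hashlib
import json
import random
import sys

def run_manifest(seed: int, train_rows: list) -> dict:
    """Record the ingredients of one training run."""
    random.seed(seed)  # fix the seed before any stochastic step
    # Hash a canonical serialization of the training data so a later
    # run can verify it is training on exactly the same inputs.
    data_bytes = json.dumps(train_rows, sort_keys=True).encode()
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }

manifest = run_manifest(seed=42, train_rows=[[1, 2], [3, 4]])
print(json.dumps(manifest, indent=2))
```

Storing such a manifest alongside the containerized model artifact makes it much easier to diagnose whether a result regression came from new data or from a change in the model creation process.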
These are the important questions data science leaders need to answer, and act on, to be successful. It is important to think through every aspect highlighted above before jumping into development and committing to timelines.