Effectively Leading a Data Science Initiative
Bhalchandra (Bhal) Madhekar
Software Engineering Leader | Big Data | Advanced Analytics | AI/ML | Cloud Technologies | SaaS / PaaS | KaggleX BIPOC Mentor | GitHub.com/madhekar
New data science leaders aspiring to manage advanced analytics teams face a set of challenges and must think through key questions to make the best use of tools and teams and deliver ROI for the business. More businesses are seeking enhanced business performance from the insights and predictions drawn out of modern-day data lakes and data warehouses. Although data science methods are not quite novel in some industry segments, such as insurance and finance, they are new to many other organizations seeking to adopt data science to improve financial outcomes or business decisions. Now, the most obvious and daunting question is: what is the most effective way to start and lead a data science effort, and to get a team humming at optimal performance, producing business value?
Well, you may say, this is a subjective question, very much dependent on industry and business needs, objectives, and adoption strategy. But, without loss of generality, the most natural and effective way to begin, in my mind, is to self-reflect and whiteboard possible questions before embarking on building data science teams or initiatives. Here are some of the top questions I have grappled with and, over time, refined (the hard way!). Having drafted the specific objective and expected outcome of the data science initiative, the leader needs to pause and think through the next set of imperative questions up front:
- What is an effective communication strategy within the team, across business teams, and with IT?
- What skills are needed based on current and future business goals, objectives, and needs?
- What is the plan to achieve:
- Data { availability, quantity, quality, diversity }
- Business requirements, assumptions, limitations
- What is the plan for measuring the quality of insights produced?
- What is the plan or architecture for achieving reproducibility of models and analytic results?
- What is the strategy for preventing bias in data collection, algorithms, and decision making?
- What is the fallback strategy if data science outcomes are not actionable or useful?
- What is the security plan to ensure IP is not exposed and data is not misused?
In this article I have made an effort to go over some of these questions. I understand some may not be applicable across all industry segments, and there is certainly large variation in the level of interest and adoption. This article draws on lessons learned from my experience building data science teams to create hybrid data science products that deliver near real-time predictions. I have tried to distill the hardest lessons from the early days of those initiatives, including refactoring software, training models, and maintaining the relevance of predictions over time.
Building a comprehensive product or initiative strategy is the initial challenge, with short- and long-term objectives of delivering effective, repeatable detection capability with incremental learning, while acquiring and retaining the target customer personas. Once you have an initial blueprint of the initiative in place, it is important to create structure around it for long-term sustenance. Building a team structure around the effort is the next big step.
Data Science Team
The objectives behind a data science team include data acquisition, feature selection, model selection, statistical analysis, development processes, and a communication interface with the rest of the business. To arrive at these objectives, we need to break down the overall scope of the initiative. Keep in mind that a small team with clearly defined roles can drive early wins.
- Data Architect: designs and maintains data acquisition, storage, data models, data applications, and process workflows.
- Data Engineer: makes data available for data science efforts; designs, develops, and codes data applications for data capture and analysis.
- Data Scientist: researches the available datasets using tools such as data visualization and machine learning to understand the data and to build, validate, and test models.
- Data/Business Analyst: analyzes a large variety of data using visualization tools and extracts information about system, service, or organizational performance.
- IT Infrastructure Engineer: builds and manages IT systems for the organization.
- Business Stakeholder/Client: provides high-level guidance about what can be made operational, value judgements, etc.
Development Process
Once your team is in place, building a strong development process is necessary for long-term sustainability. An effective data science leader brings together two different types of work, namely software engineering and data science. This is a challenge, since the software development lifecycle differs from the data science process. Building software progresses more or less linearly toward development and delivery. Data science traces a non-linear trajectory, iterating over assumptions, ideas, and results, sometimes only to start all over again. It is difficult to run a data science team while matching the business reality of setting clear, time-bound expectations with stakeholders. A leader almost needs to juggle two different development processes. In my experience, I initially ended up using Agile/Scrum for the software and DevOps teams and the iterative CRISP-DM process for the data science team.
Quality of Insights
Another important judgement a leader has to make is assessing the quality of the data science outcome. They must answer the question: are my results the best possible, or merely adequate? Although there are a large number of tools to verify model quality and automate model validation and verification, it is still difficult to judge quality accurately. Results depend heavily on many factors, such as the requirement assumptions made in the design phase, the availability of high-quality, diverse data, and biases in data collection and model creation. Although daunting, it is better to discuss, brainstorm, and document the process decisions made, and then iterate on those choices to minimize biases.
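One concrete way to ground the "adequate versus best possible" judgement is cross-validation, which reports a distribution of scores rather than a single number. Below is a minimal sketch, assuming a scikit-learn workflow; the dataset and model here are purely illustrative stand-ins, not anything from a real initiative:

```python
# Minimal model-validation sketch using k-fold cross-validation.
# The synthetic dataset and logistic regression model are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic dataset standing in for real business data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

model = LogisticRegression(max_iter=1000)

# Five folds yield five scores; the spread between them is often more
# informative than the mean when deciding whether a model is "good enough".
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A wide spread across folds is an early warning sign that results depend on which slice of data the model saw, which ties back to the data diversity and bias concerns above.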
Reproducibility of Results
Reproducibility of results becomes very difficult if different tools are used in development than in production. Early on, I addressed this with a deployment and containerization process that packaged data science models as pluggable modules within a software application. This not only makes it possible to build, test, and deploy a module independently, but also helps with managing release cycles and versioning effectively. There are other reproducibility challenges, such as loss of traceability from incorrectly porting the model creation process to production, and regression in model results caused by new datasets with different data assumptions and diversity.
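Independent of the container toolchain, one lightweight practice that supports traceability is recording a manifest for each training run: the random seed, the environment version, and a hash of the training data. A minimal sketch follows; the manifest fields and the helper name are my own illustration, not a standard:

```python
# Sketch of a run manifest capturing what is needed to reproduce a
# model run later. Field names here are illustrative, not a standard.
import hashlib
import json
import random
import sys

def run_manifest(seed: int, train_rows: list) -> dict:
    """Record the ingredients of one training run."""
    random.seed(seed)  # fix the seed before any stochastic step
    # Hash a canonical serialization of the training data so a later
    # run can verify it is training on exactly the same inputs.
    data_bytes = json.dumps(train_rows, sort_keys=True).encode()
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }

manifest = run_manifest(seed=42, train_rows=[[1, 2], [3, 4]])
print(json.dumps(manifest, indent=2))
```

Storing such a manifest alongside the containerized model artifact makes it much easier to diagnose whether a result regression came from new data or from a change in the model creation process.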
These are the important questions data science leaders need to answer, and act on, to be successful. It is important to think through every aspect highlighted above before jumping into development and committing to timelines.