How to Build a Data Science Business Function

This post was originally published on The Sampler, our in-house blog at Applied AI. Subscribe to our RSS or email feed for regular posts on machine learning, statistics, insurance and fintech.

Like any collaborative business effort involving research and development, a data science function should be built carefully to make the best use of expertise and technology.

Data science is a broad discipline and companies have differing data analysis requirements. These can vary from one-off, scenario-specific modelling exercises, through regular analysis (e.g. measuring the effectiveness of an advertising campaign), to live online predictive modelling of user actions. To achieve these goals, the bedrock of a data science function needs to combine great people, good software and solid business processes.

In this post I want to set out how we typically approach the building of a data science team.

At Applied AI most of our project work combines bespoke data analysis with lightweight high-level software built to interpret and explore the results. In our experience it is possible to achieve high-quality data insights and effect business change using the client's existing data without needing gigantic data sets. This can be achieved by using various techniques such as intelligent problem definition, representative sub-sampling and effective experimentation.
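One of those techniques, representative sub-sampling, can be sketched in a few lines: when drawing a smaller data set, preserve the proportions of each stratum rather than sampling blindly. The record structure and product mix below are invented purely for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=42):
    """Draw a sub-sample that preserves the proportions of each stratum.

    rows     -- list of dicts (e.g. customer records)
    key      -- field whose values define the strata
    fraction -- proportion of each stratum to keep
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Illustrative data: 1,000 policies, 90% 'motor' and 10% 'home'
policies = [{"product": "motor" if i < 900 else "home", "id": i}
            for i in range(1000)]
subset = stratified_sample(policies, key="product", fraction=0.1)
# The subset keeps the same 90/10 product mix as the full data
```

A 10% sample drawn this way lets many analyses run quickly on a laptop while remaining representative of the full book of business.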

The data science function in an organisation may be one person or many, spanning locations and project durations, mixing people in and out as skills and availability dictate. Projects should be designed to achieve a particular aim, making full use of available expertise, existing facilities and the latest technologies.

Set up a strong team

We've seen that the practising data scientist will generally use a wide variety of tools to:

  • acquire, manipulate, store and access data efficiently
  • design surveys and scientific experiments to test hypotheses
  • undertake statistically valid analyses
  • implement high-quality, optimised predictive models
  • derive and communicate actionable insights
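The experiment-and-analysis activities above can be made concrete with a toy example: a two-proportion z-test of the kind used to judge whether a new advertising campaign outperformed a control. The conversion counts are invented for the example; a real analysis would also consider power, multiple testing and practical significance.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value via the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Did the new campaign (B) beat the control (A)?
z, p = two_proportion_ztest(conv_a=120, n_a=2000, conv_b=165, n_b=2000)
significant = p < 0.05
```

Here the 6.0% vs 8.25% conversion difference yields a p-value well under 0.05, so the difference is unlikely to be noise at these sample sizes.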

These activities require diverse skills covering database management, software engineering, statistical analysis, machine learning, graphic design, business experience, ethics, social responsibility, domain knowledge and communication. However, the days of simply hiring a single, unicorn-like 'full-stack' data scientist to solve all your problems are pretty much gone - and probably never really existed.

The unicorn is dead. Whilst drafting this post I came across a handful of articles from the likes of HBR and Computerworld labouring the point that the data science business function requires a full team. I completely agree, noting that it's always best to hire iteratively and to size the team according to project scope and management buy-in.

For a start:

The team needs to be small, agile and focussed: 2-6 data scientists is ample, and they should be proven generalists, team-players and pragmatists able to cope with vague requirements, messy data and high failure rates (see below for project considerations). I like Forbes' opinion that the first hire(s) should "help get three things ready: your data; a clear problem to be solved; and a process to evaluate the business impact of any new solution".

  • Such highly-skilled people can be hard to find, but if you concentrate on the 'science' part of data science, it's possible to consider many candidates working in experimental science, industrial research & development and high-tech engineering - all disciplines which require a creative approach to learning from data
  • Ideally, some of these people will have experience within the company or at least strong experience in the industry - don't underestimate domain knowledge
  • The team also needs a well-respected sponsor within the organisation to help overcome failures, advertise successes and gain general buy-in from the board.


As the team grows:

The projects are likely to shift from one-off experiments to producing system-critical software, business process re-engineering, tailored marketing & advertising, and reports for senior management. The team must therefore grow and specialise accordingly, hiring, for example:

  • data engineers / database administrators to source, clean and store the data, making it accessible, reliable, reusable and well-documented
  • computer scientists and software engineers to help scale algorithms to larger data sets, implement business rules or develop analytical applications
  • specialist statisticians and mathematicians to improve experiments and fine-tune algorithms
  • interface / graphic designers to help communicate insights
  • technical project managers to help organise the teams and deliverables
  • experienced technical people from other parts of the business - e.g. marketing or finance - who have organisational and domain knowledge


The steady state:

The data science function may grow to a significant size within the organisation, operating as a service to other departments and/or creating core features of the product and business processes.

The company may want to appoint a Chief Data Officer (CDO), leading the whole data science function and bridging the gap between the executive, financial, information and marketing leaders at board level.


Define and operate projects

As I've mentioned before, a lot of practical data science looks like software engineering, and happily, there's a huge number of articles and established techniques discussing how to best manage these creative but technical projects. Our opinion is that any piece of research or development likely to last more than a few days and/or involve more than one person should be classified as a project, and should have:

  • A primary sponsor and a project leader, respectively responsible for commissioning and delivering the project. These may often be one and the same person, but it’s important to recognise the roles and thereby ensure that the project is well managed and the results used in the business
  • A well defined goal that is specific, measurable, achievable, realistic and time bounded (SMART), and a written & agreed specification document no matter how concise
  • Regular progress meetings throughout the project to validate and update the plan and scope, with full and frank communication between major stakeholders
  • Knowledge sharing upon completion - sharing lessons learned is very important but we often see this given low priority; formalised past learnings are incredibly useful for influencing future projects
  • Also consider maintaining a basic RACI (responsibility matrix) and a risks & issues register, so that problems can be addressed and resolved methodically.

 

According to its creator Drew Conway, the Data Science Venn Diagram could be updated for the modern day by adding communication as a separately defined, vital skill:

 

Ensure effective communication

Regular communication and team member visibility are also important, helping to ensure that the project stays on track and issues are spotted early. Useful communication techniques include:

  • Daily stand-up meetings, strictly limited to 10 minutes or less, where each team member shares immediate activities and issues with the rest of the team
  • An up-to-date communal task schedule showing which team members are working on which activities and when - the Kanban methodology for visual project management demonstrates this well
  • Simplified and centralised communications technology - try to move written discussions away from email and towards wikis, message boards and Slack
  • Time and space for data scientists and software engineers to get into a productive flow state, free of meetings and interruptions.

 

Systematise the data pipeline and analyses

Recently, I've noticed more articles and research papers written about systematising the data science function, such as the importance of automating workflows and dealing with technical debt when creating machine learning systems. This is a good sign: it means that data science really is maturing into an everyday business function - and there are plenty of best practices out there for us all to follow, including:

  • Understand and map the data 'pipeline' - the path from raw data sources to refined insights, and the human and machine consumers of this information - this will help reduce redundant analyses, identify fragile data and standardise processes
  • Stop when the models are good enough - avoid diminishing returns and overcomplexity - get to v1.0 quickly and iterate thereafter according to the needs of the information consumer
  • Encourage a systematic, shared approach to the creation of all machine learning tools and analyses - with proper source control and documentation, code reviews, 'lunch and learn' seminar sessions, and regular refactoring of algorithms, applications and data preparation scripts where appropriate.
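The pipeline-mapping advice above can be sketched in a few lines: make each stage a named, documented function and declare the ordering explicitly, so the path from raw data to insight is visible and fragile or redundant stages are easy to spot. The stages below are invented for illustration; in practice a team would reach for a workflow tool such as Airflow or Luigi rather than rolling its own.

```python
def extract():
    """Pull raw records from the source system (stubbed here)."""
    return [{"customer": "a", "spend": 120}, {"customer": "b", "spend": -5}]

def clean(rows):
    """Drop obviously invalid records (negative spend)."""
    return [r for r in rows if r["spend"] >= 0]

def summarise(rows):
    """Refined insight: total spend across valid customers."""
    return sum(r["spend"] for r in rows)

# The explicit stage list *is* the pipeline map
PIPELINE = [extract, clean, summarise]

def run(pipeline):
    data = pipeline[0]()
    for stage in pipeline[1:]:
        data = stage(data)
    return data

total = run(PIPELINE)
```

Even a map this simple makes the raw-to-refined path reviewable: a new joiner can read the stage list and docstrings instead of reverse-engineering a tangle of ad hoc scripts.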

 

In Summary

  • Start with a small team of capable generalists and work hard to define the business problems & success criteria, set timescales and understand & access the available data
  • Allow for and embrace failure, give data scientists time and space to research and experiment
  • Require a corporate sponsor with clout and encourage strong communication with the rest of the business
  • Specialise when necessary, automate where possible and embed into an ongoing cycle of development, maintenance and support.


We've covered a lot of ground here, so we will likely revisit some of these topics in future and really get into the details.

 

