Scaling Data Science

A look at the tools, processes, and infrastructure your business needs to derive value from data.

This white paper is also available for download at DataScience.com.

Introduction

Though it has deep roots in academia, data science is now very much a business process. Just like accounting or marketing, there is a cost to doing data science — and benefits that come from doing it well. In fact, 21% of executives now report that investing in big data has been transformative for their firms.

For businesses not seeing transformative improvements from data-driven work, the roadblocks are varied: organizational impediments, lack of alignment between stakeholders, resistance to technology changes, and more can hold up progress. 

What Does It Mean to ‘Scale Data Science’?

When we say “scaling data science,” we’re talking about making your data science team the engine that drives decision making across your organization with predictive models and comprehensive analyses. For instance, a predictive customer churn model built by a data scientist could forecast the likelihood that individual customers will stop buying from your business in a given time period. That information could help you identify high-risk customers and intervene before they defect — but only if it gets to the appropriate parties.

If you have the right processes in place, the outputs of that model can be integrated with your call center software (so customer service reps will be able to view risk scores when they’re on the phone with customers) or with your marketing automation system (allowing your marketers to create more targeted campaigns). This is just one example of how to scale your data science efforts. To make data science worth the investment, your company not only needs data scientists to do that high-value work in the first place, but the tools, processes, and infrastructure to support it. 
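As a minimal sketch of what that hand-off might look like (the threshold values, customer IDs, and function names here are hypothetical, not from any particular product), a churn model’s raw probabilities can be reduced to coarse risk tiers that a call center or marketing system can consume:

```python
# Hypothetical sketch: turning churn-model output into risk labels
# that downstream systems (call center, marketing automation) can use.

def risk_label(churn_probability, high=0.7, medium=0.4):
    """Map a model's churn probability to a coarse risk tier."""
    if churn_probability >= high:
        return "high"
    if churn_probability >= medium:
        return "medium"
    return "low"

def score_customers(predictions):
    """predictions: dict of customer_id -> churn probability."""
    return {cid: risk_label(p) for cid, p in predictions.items()}

scores = score_customers({"c001": 0.82, "c002": 0.15, "c003": 0.55})
```

In practice the thresholds would be chosen to balance intervention cost against retention value, and the resulting labels would be pushed to the downstream systems through whatever integration those systems expose.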

1. Tools & Processes

Identifying the tools you need to scale is both essential and complex. No two data scientists possess the same skill sets or tool preferences, and tool sprawl — in which the volume of tools in use exceeds an organization’s ability to use them effectively — is the number one problem data-driven companies face.

The typical enterprise data science project requires dozens of steps, from cleaning data and selecting model features to model validation and deployment, and there are dozens of tools designed to cater to each part of the process. The tools and processes you have in place should support your data scientists’ ability to:

Experiment: Not every data model is going to function perfectly right out of the gate. Experimenting and iterating is part of a typical data science process, which is why the tools your team uses should support the deployment of different model versions and compile metrics to measure the success of those versions. Experimentation also requires the collection and storage of massive amounts of data — the more data your data scientists have, the more opportunities for analysis they can uncover. 
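One way to make that kind of experimentation concrete is to record an evaluation metric for every model version and compare runs side by side. The sketch below is a deliberately minimal, hypothetical illustration of the idea; real teams would typically use an experiment-tracking tool rather than an in-memory list:

```python
# Hypothetical sketch: logging evaluation metrics per model version
# so experiments can be compared side by side.

metrics_log = []

def record_run(version, metric_name, value):
    """Record one evaluation result for a given model version."""
    metrics_log.append({"version": version, "metric": metric_name, "value": value})

def best_version(metric_name, higher_is_better=True):
    """Return the model version with the best recorded value for a metric."""
    runs = [r for r in metrics_log if r["metric"] == metric_name]
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["value"])["version"]

record_run("v1", "auc", 0.71)
record_run("v2", "auc", 0.78)
record_run("v3", "auc", 0.74)
```

However it is implemented, the point is the same: if every experiment leaves behind a comparable record, choosing which model version to promote stops being guesswork.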

Create work that is reproducible: If your data scientists are reinventing the wheel for every project, you’re wasting valuable time and resources. Because data scientists often store their work locally, much of what has already been done goes unshared. Having a central location for files, models, and code will help your team find and reuse data science work that has already been battle tested.

Collaborate and share across teams: Your processes and tools should encourage knowledge sharing, especially between technical and non-technical teams. Make sure it’s easy for data scientists to publish analyses as shareable reports and integrate model results into the dashboards or real-time applications that stakeholders rely on.

2. Infrastructure & Environments

When we talk about data science, we often forget to mention the infrastructure needed to support complex data analyses. IT can spend a lot of time spinning up the necessary resources to support data science work or setting up environments with the right packages (and waiting for those packages to compile).

But with the right approach, your data science team can get the resources they need without costing you — and your IT team — an excess of money and time. Here are just a few elements to consider: 

  • On-demand computing resources: Increasingly, companies are turning to on-demand computing (in which computing resources are spun up as needed) to reduce costs. If you’re working in a cloud environment, this is a great way to support data science workloads without keeping your computing resources “always on.” 
  • Standardized data science environments: Why set up a new data science environment for every project? Containers like Docker make it easy to build and share standard, repeatable environments with the tools and packages you need already installed. 
  • Access control: Two and a half quintillion bytes of data are created every day, and much of it comes from customer interactions with your business. It’s unlikely that every member of your team needs access to every type of data or every analysis. The ability to delegate access by role and protect sensitive data is an important one.

3. Production

Getting data science work into production is arguably the most important step in any data science process. Until models and analyses are actually running in production, you haven’t truly scaled data science at your organization.

Putting data science work into production means building a pipeline in which data science analyses and models are continuously running in real time to power your business. Deploying a model — like a recommendation engine — into production so that it can suggest products to shoppers on your site is just one part of the equation. That model will need to be retrained, A/B tested, and continuously fed new data. Making this work seamlessly is no small undertaking. The process should allow for: 
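To make the A/B-testing piece concrete, here is a hypothetical sketch of one common technique: deterministically routing each customer to one of two model versions by hashing the customer ID, so the same customer always sees the same variant across sessions:

```python
# Hypothetical sketch: stable A/B assignment of customers to two
# model versions, based on a hash of the customer id.

import hashlib

def assign_variant(customer_id, treatment_share=0.5):
    """Return 'a' or 'b' deterministically for a given customer id."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    # Map the hash into a bucket in [0.0, 1.0).
    bucket = (int(digest, 16) % 100) / 100.0
    return "b" if bucket < treatment_share else "a"
```

Because the assignment is a pure function of the customer ID, no assignment table has to be stored, and the treatment share can be adjusted by changing a single parameter.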

  • Automation of repetitive tasks: If your team is manually running the reports your stakeholders want regularly, cleaning data, and retraining models, you’re not efficiently putting work into production. Currently, three out of every five data scientists spend the majority of their time cleaning and organizing data. Setting up a system that automates many of these low-level tasks will give your data scientists more time to focus on building high-value analyses.
  • Ongoing monitoring of model performance: Collecting data from your models to monitor their performance is essential for identifying issues and addressing them. Set up a pipeline to deliver relevant data from activities like model API calls, training, and cross validation.
  • Constant improvement: A data scientist’s work is never done. Giving your data scientists the ability to deploy different versions of their predictive models to compare and iterate upon is a great way to constantly improve the results of your data science work. 
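The monitoring and automation points above can be combined into one small pattern: log a performance metric for the live model on a schedule, and flag the model for retraining when it degrades. The sketch below is a hypothetical, in-memory illustration (the model name, metric values, and threshold are invented); a real pipeline would persist these metrics and trigger a retraining job automatically:

```python
# Hypothetical sketch: logging live-model performance and flagging
# a model for retraining when recent metrics fall below a threshold.

performance_log = []

def log_metric(model, value):
    """Record one performance measurement (e.g., daily AUC) for a model."""
    performance_log.append({"model": model, "value": value})

def needs_retraining(model, threshold, window=3):
    """True if the last `window` logged values all fall below `threshold`."""
    recent = [r["value"] for r in performance_log if r["model"] == model][-window:]
    return len(recent) == window and all(v < threshold for v in recent)

log_metric("churn_v2", 0.78)
log_metric("churn_v2", 0.69)
log_metric("churn_v2", 0.66)
log_metric("churn_v2", 0.63)
```

Requiring several consecutive low readings (rather than reacting to a single dip) is a simple guard against retraining on noise.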

How a Data Science Platform Brings It All Together

The data science platform is a relatively new technology that is fast becoming a must-have for enterprise data science teams. In fact, platform adoption is expected to rise from 26% to 69% over the next two years as companies increasingly recognize the value of managing data science tools and processes in a centralized hub.

In a nutshell, a data science platform is a software hub around which all data science work takes place. That work usually includes integrating and exploring data from various sources, coding and building models that leverage that data, deploying those models into production, and serving up results, whether that’s through model-powered applications or reports. Platforms are designed to support this work to make scaling data science much more achievable.

Ready to start performing data science at scale? The DataScience.com Platform provides integrations with the tools your data scientists already use and love — like Jupyter notebooks and GitHub — intuitive project organization, easy report publishing, model deployment capabilities, and much more, backed by enterprise-grade security features and infrastructure. 

References:

  • Forrester Consulting, “Data Science Platforms Help Companies Turn Data Into Business Value,” December 2016
  • CrowdFlower, “2016 Data Science Report,” 2016
  • NewVantage Partners, “Big Data Executive Survey 2017,” January 2017

Credit to the team at DataScience.com!


