Scaling Data Science

A look at the tools, processes, and infrastructure your business needs to derive value from data.

This white paper is also available for download at DataScience.com.

Introduction

Though it has deep roots in academia, data science is now very much a business process. Just like accounting or marketing, there is a cost to doing data science — and benefits that come from doing it well. In fact, 21% of executives now report that investing in big data has been transformative for their firms.

For businesses not seeing transformative improvements from data-driven work, the roadblocks are varied: organizational impediments, lack of alignment between stakeholders, resistance to technology changes, and more can hold up progress. 

What Does It Mean to ‘Scale Data Science’?

When we say “scaling data science,” we’re talking about making your data science team the engine that drives decision making across your organization with predictive models and comprehensive analyses. For instance, a predictive customer churn model built by a data scientist could forecast the likelihood that individual customers will stop buying from your business in a given time period. That information could help you identify high-risk customers and intervene before they defect — but only if it gets to the appropriate parties.

If you have the right processes in place, the outputs of that model can be integrated with your call center software (so customer service reps will be able to view risk scores when they’re on the phone with customers) or with your marketing automation system (allowing your marketers to create more targeted campaigns). This is just one example of how to scale your data science efforts. To make data science worth the investment, your company not only needs data scientists to do that high-value work in the first place, but the tools, processes, and infrastructure to support it. 
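As a minimal sketch of what that hand-off might look like (the threshold values, customer IDs, and function names here are hypothetical, not from any particular product), a churn model’s raw probabilities can be reduced to coarse risk tiers that a call center or marketing system can consume:

```python
# Hypothetical sketch: turning churn-model output into risk labels
# that downstream systems (call center, marketing automation) can use.

def risk_label(churn_probability, high=0.7, medium=0.4):
    """Map a model's churn probability to a coarse risk tier."""
    if churn_probability >= high:
        return "high"
    if churn_probability >= medium:
        return "medium"
    return "low"

def score_customers(predictions):
    """predictions: dict of customer_id -> churn probability."""
    return {cid: risk_label(p) for cid, p in predictions.items()}

scores = score_customers({"c001": 0.82, "c002": 0.15, "c003": 0.55})
```

In practice the thresholds would be chosen to balance intervention cost against retention value, and the resulting labels would be pushed to the downstream systems through whatever integration those systems expose.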

1. Tools & Processes

Identifying the tools you need to scale is both essential and complex. No two data scientists possess the same skill sets or tool preferences, and tool sprawl — in which the volume of tools in use exceeds an organization’s ability to use them effectively — is the number one problem data-driven companies face.

The typical enterprise data science project requires dozens of steps, from cleaning data and selecting model features to model validation and deployment, and there are dozens of tools designed to cater to each part of the process. The tools and processes you have in place should support your data scientists’ ability to:

Experiment: Not every data model is going to function perfectly right out of the gate. Experimenting and iterating is part of a typical data science process, which is why the tools your team uses should support the deployment of different model versions and compile metrics to measure the success of those versions. Experimentation also requires the collection and storage of massive amounts of data — the more data your data scientists have, the more opportunities for analysis they can uncover. 
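One way to make that kind of experimentation concrete is to record an evaluation metric for every model version and compare runs side by side. The sketch below is a deliberately minimal, hypothetical illustration of the idea; real teams would typically use an experiment-tracking tool rather than an in-memory list:

```python
# Hypothetical sketch: logging evaluation metrics per model version
# so experiments can be compared side by side.

metrics_log = []

def record_run(version, metric_name, value):
    """Record one evaluation result for a given model version."""
    metrics_log.append({"version": version, "metric": metric_name, "value": value})

def best_version(metric_name, higher_is_better=True):
    """Return the model version with the best recorded value for a metric."""
    runs = [r for r in metrics_log if r["metric"] == metric_name]
    pick = max if higher_is_better else min
    return pick(runs, key=lambda r: r["value"])["version"]

record_run("v1", "auc", 0.71)
record_run("v2", "auc", 0.78)
record_run("v3", "auc", 0.74)
```

However it is implemented, the point is the same: if every experiment leaves behind a comparable record, choosing which model version to promote stops being guesswork.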

Create work that is reproducible: If your data scientists are reinventing the wheel for every project, you’re wasting valuable time and resources. Because data scientists often store their work locally, much of what has already been done goes unshared. Having a central location for files, models, and code will help your team find and reuse data science work that has already been battle tested.

Collaborate and share across teams: Your processes and tools should encourage knowledge sharing, especially between technical and non-technical teams. Make sure it’s easy for data scientists to publish analyses as shareable reports and integrate model results into the dashboards or real-time applications that stakeholders rely on.

2. Infrastructure & Environments

When we talk about data science, we often forget to mention the infrastructure needed to support complex data analyses. IT can spend a lot of time spinning up the necessary resources to support data science work or setting up environments with the right packages (and waiting for those packages to compile).

But with the right approach, your data science team can get the resources they need without costing you — and your IT team — an excess of money and time. Here are just a few elements to consider: 

  • On-demand computing resources: Increasingly, companies are turning to on-demand computing (in which computing resources are spun up as needed) to reduce costs. If you’re working in a cloud environment, this is a great way to support data science workloads without keeping your computing resources “always on.” 
  • Standardized data science environments: Why set up a new data science environment for every project? Containers like Docker make it easy to build and share standard, repeatable environments with the tools and packages you need already installed. 
  • Access control: Two and a half quintillion bytes of data are created every day, and much of it comes from customer interactions with your business. It’s unlikely that every member of your team needs access to every type of data or every analysis. The ability to delegate access by role and protect sensitive data is an important one.

3. Production

Getting data science work into production is arguably the most important step in any data science process. Until models and analyses are actually running in production, you haven’t truly scaled data science at your organization.

Putting data science work into production means building a pipeline in which data science analyses and models are continuously running in real time to power your business. Deploying a model — like a recommendation engine — into production so that it can suggest products to shoppers on your site is just one part of the equation. That model will need to be retrained, A/B tested, and continuously fed new data. Making this work seamlessly is no small undertaking. The process should allow for: 
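To make the A/B-testing piece concrete, here is a hypothetical sketch of one common technique: deterministically routing each customer to one of two model versions by hashing the customer ID, so the same customer always sees the same variant across sessions:

```python
# Hypothetical sketch: stable A/B assignment of customers to two
# model versions, based on a hash of the customer id.

import hashlib

def assign_variant(customer_id, treatment_share=0.5):
    """Return 'a' or 'b' deterministically for a given customer id."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    # Map the hash into a bucket in [0.0, 1.0).
    bucket = (int(digest, 16) % 100) / 100.0
    return "b" if bucket < treatment_share else "a"
```

Because the assignment is a pure function of the customer ID, no assignment table has to be stored, and the treatment share can be adjusted by changing a single parameter.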

  • Automation of repetitive tasks: If your team is manually running the reports your stakeholders want regularly, cleaning data, and retraining models, you’re not efficiently putting work into production. Currently, three out of every five data scientists spend the majority of their time cleaning and organizing data. Setting up a system that automates many of these low-level tasks will give your data scientists more time to focus on building high-value analyses.
  • Ongoing monitoring of model performance: Collecting data from your models to monitor their performance is essential for identifying issues and addressing them. Set up a pipeline to deliver relevant data from activities like model API calls, training, and cross validation.
  • Constant improvement: A data scientist’s work is never done. Giving your data scientists the ability to deploy different versions of their predictive models to compare and iterate upon is a great way to constantly improve the results of your data science work. 
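The monitoring and automation points above can be combined into one small pattern: log a performance metric for the live model on a schedule, and flag the model for retraining when it degrades. The sketch below is a hypothetical, in-memory illustration (the model name, metric values, and threshold are invented); a real pipeline would persist these metrics and trigger a retraining job automatically:

```python
# Hypothetical sketch: logging live-model performance and flagging
# a model for retraining when recent metrics fall below a threshold.

performance_log = []

def log_metric(model, value):
    """Record one performance measurement (e.g., daily AUC) for a model."""
    performance_log.append({"model": model, "value": value})

def needs_retraining(model, threshold, window=3):
    """True if the last `window` logged values all fall below `threshold`."""
    recent = [r["value"] for r in performance_log if r["model"] == model][-window:]
    return len(recent) == window and all(v < threshold for v in recent)

log_metric("churn_v2", 0.78)
log_metric("churn_v2", 0.69)
log_metric("churn_v2", 0.66)
log_metric("churn_v2", 0.63)
```

Requiring several consecutive low readings (rather than reacting to a single dip) is a simple guard against retraining on noise.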

How a Data Science Platform Brings It All Together

The data science platform is a relatively new technology that is fast becoming a must-have for enterprise data science teams. In fact, platform adoption is expected to rise from 26% to 69% over the next two years as companies increasingly recognize the value of managing data science tools and processes in a centralized hub.

In a nutshell, a data science platform is a software hub around which all data science work takes place. That work usually includes integrating and exploring data from various sources, coding and building models that leverage that data, deploying those models into production, and serving up results, whether that’s through model-powered applications or reports. Platforms are designed to support this work to make scaling data science much more achievable.

Ready to start performing data science at scale? The DataScience.com Platform provides integrations with the tools your data scientists already use and love — like Jupyter notebooks and GitHub — intuitive project organization, easy report publishing, model deployment capabilities, and much more, backed by enterprise-grade security features and infrastructure. 

References:

  • Forrester Consulting, “Data Science Platforms Help Companies Turn Data Into Business Value,” December 2016
  • CrowdFlower, “2016 Data Science Report,” 2016
  • NewVantage Partners, “Big Data Executive Survey 2017,” January 2017

Credit to the team at DataScience.com!


