Continuous Analytics Defined
Continuous Analytics is the extension of devops, continuous integration, and continuous delivery to big data development. It plugs data science and data engineering into the traditional software development model and Agile methodology, where coding, testing, and deployment all run as automated processes.
Continuous Analytics is related to the newer term DataOps, which prescribes a workflow for data analysts that improves their efficiency by reducing time-to-value in analytics projects. Continuous Analytics meshes neatly with that set of DataOps best practices.
This software engineering approach lets big data teams release software in shortened cycles, which is the goal of Agile iterations. The data scientists store their code in Git alongside the programmers who write the APIs that connect to data sources. The devops and big data engineers write Ansible playbooks and Dockerfiles to lay down the big data software, networks, and storage and to spin up virtual machines. Testing is automated and built into the process.
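To make that concrete, here is a minimal sketch of what an automated test might look like for a data scientist's code once it lives in the shared repository. The scoring function, feature names, and thresholds are hypothetical, invented for illustration; the point is that checks like these run on every commit, just like the rest of the team's tests.

```python
# test_churn_model.py -- a hypothetical example; the function and
# feature names are invented for illustration, not from any project.
import math

def churn_score(recency_days, purchase_count):
    """Toy model: a logistic curve over two features."""
    z = 0.05 * recency_days - 0.3 * purchase_count
    return 1.0 / (1.0 + math.exp(-z))

def test_score_is_a_probability():
    # The model must emit a value in [0, 1] for typical inputs.
    assert 0.0 <= churn_score(30, 5) <= 1.0

def test_frequent_buyers_score_lower():
    # A frequent buyer should be ranked less likely to churn.
    assert churn_score(30, 20) < churn_score(30, 1)
```

Run under a test runner such as pytest, failures here block the merge, which is exactly the continuous-integration discipline the rest of the team already follows.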
Adding Big Data Members to the Development Team
Big data development adds people to the development team who have not traditionally worked there. Data scientists are mathematicians and statisticians who write code, but they usually work in isolation from the traditional development team of programmers, product managers, project managers, testers, and architects. Continuous Analytics brings them onboard with everyone else.
In particular, there are two distinct roles related to big data and analytics:
- The big data analyst is trained in mathematical and statistical modelling and machine learning. Before big data technology made it possible to apply analytics to large amounts of data to solve business problems, you typically found these scientists working on, for example, clinical trials at university hospitals. Now they are part of tactical business planning: they work with live business data and outside data feeds and report the predictions and correlations they find to business managers.
- The big data engineer is a platform specialist responsible for designing the big data architecture. This includes selecting, configuring, and installing the big data tools: Kafka, HBase, Spark, Hadoop, and others. They work with virtualization and network architects and with programmers to write scripts that automate the installation of all of that in public and private clouds.
The Disjointed Big Data Development Process
Continuous Analytics solves the main problem with how analytics has been coded: without an overarching methodology, such as devops or Agile, and without a common set of tools.
In the siloed approach to big data development, the data analyst uploads a file or downloads a feed to do data exploration. They work independently of the data laid down by the big data engineer, who is installing Spark and configuring Kafka. They are also one step removed from the programmers who are working with Twitter APIs and writing code to retrieve data purchased from data brokers.
Someone looking at this approach from above would question it: it obviously makes more sense to work on the data in its final form up front. This is made easier because various big data systems like Spark have interactive command-line shells in Scala, Python, and R, the programming languages data scientists already use. With those shells the data scientist can build up their models interactively: they select data from the Spark RDD, group it, filter it, make cross tables, and create visualizations in a sandbox mirror of live production data.
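As a minimal sketch of that interactive loop, assume a pyspark shell where `spark` is the session the shell provides and `/sandbox/events` is a hypothetical mirror of production data; the DataFrame API is used here for brevity in place of raw RDD operations:

```python
# In the pyspark shell; `spark` is the session the shell creates.
# The events dataset and its columns are hypothetical.
df = spark.read.parquet("/sandbox/events")   # sandbox mirror of production data

# Select, group, and filter, building the model up interactively.
by_region = (df.filter(df.amount > 0)
               .groupBy("region")
               .sum("amount"))
by_region.show()

# A cross table of two categorical columns.
df.crosstab("region", "product").show()
```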
When data scientists are not working with the rest of the team, however, they are working on only a subset of the actual data, so they cannot see the complete results of their regression analysis or predictive model until it is pushed to production. It is better to work on the complete data the first time around.
When integrated with the rest of the development team, data scientists do not have to modify their code to switch from spreadsheet .csv files to Hadoop, Spark, or Hive formats. They keep their code in the same repository as the Java programmers and architects who write the build-out scripts, which lets it run against automated testing frameworks and be released to the main code branch. In any case, a code repository like Git is not meant to hold two different production-ready versions of the same code.
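Here is a sketch of why no code change is needed, under the assumption that Spark is the common substrate: only the reader line differs between a sample .csv file and a production Hive table, and everything downstream is untouched. The function and table names are hypothetical.

```python
# Hypothetical illustration: swap the data source without touching the
# analysis code. `spark` is an existing SparkSession.
def load_transactions(spark, source):
    if source.endswith(".csv"):
        return spark.read.csv(source, header=True, inferSchema=True)
    return spark.table(source)   # e.g. a Hive table such as "prod.transactions"

df = load_transactions(spark, "transactions_sample.csv")   # exploration
# df = load_transactions(spark, "prod.transactions")       # production
monthly = df.groupBy("month").avg("amount")   # identical either way
monthly.show()
```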
Analytics as Code
Continuous Analytics follows the same practice as devops, whose mantra is Infrastructure as Code; the Continuous Analytics mantra is Analytics as Code. Everything is abstracted so that it can be pushed out in a continuous-release fashion and deployed to any set of clusters at any time. There are no clumsy, time-consuming manual steps, because all of that is written as code.
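As one possible illustration of Analytics as Code, here is a sketch using Spark ML: the entire model pipeline, from feature preparation through training to the saved artifact, is expressed as code that can be versioned in Git and rebuilt on any cluster by the release process. The column names and paths are hypothetical.

```python
# A sketch of Analytics as Code with Spark ML; column names and paths
# are hypothetical. Everything is code, so the release process can
# rebuild and deploy it on any cluster with no manual steps.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["recency", "frequency"],
                            outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="spend")
pipeline = Pipeline(stages=[assembler, lr])

train = spark.read.parquet("/data/train")    # laid down by the build scripts
model = pipeline.fit(train)
model.write().overwrite().save("/models/spend-v1")  # artifact for the deploy step
```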
The requirement that everyone on the development team write code is new to mathematicians, statisticians, and physicists, but they adapt to the model quickly. It both produces clean, reliable code and boosts cross-team communication. It also imposes standards across the team, so handing work from one person to another is less difficult.
Breaking it Down into Steps
Here is how big data development progresses from data exploration to production build, and how each step is made smoother with Continuous Analytics.
Data Exploration—this is where data scientists prototype their ideas against actual live data. This tests their assumptions and yields insights that lead to new models and algorithms.
Development—data patterns and insights found in the exploration phase are transitioned to production-ready code with little to no refactoring.
Test—as algorithms and machine learning models are coded, they are pushed through unit and integration tests to enable continuous integration into the master branch.
Deploy—the automatic push to production comes next. At this stage, techniques such as canary releases, which roll changes out to a subset of the system rather than the whole, are applied. A canary release allows two or more versions of the model to run at the same time, which also makes it possible to perform QA validation without impacting production users.
Monitor—checks the performance and accuracy of the analytics models. For example, it verifies that the union of two disjoint sets of n tuples contains 2n tuples, and it tracks the error statistic in regression and other analyses, looking to drive it down over successive releases. Monitoring gathers all of this, along with data on system health and performance, and feeds it back into the development process, where changes are made and released back into production; a sketch of such checks follows below.
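Assuming plain Python for illustration, monitoring checks of the kind just described might look like this; the function names and the error threshold are hypothetical:

```python
# Hypothetical monitoring checks of the kind described above.
def disjoint_union_holds(set_a, set_b):
    """Two tuple sets expected to be disjoint should union to
    len(a) + len(b) elements; anything less signals duplicates
    or a faulty join upstream."""
    return len(set_a | set_b) == len(set_a) + len(set_b)

def regression_error_ok(y_true, y_pred, threshold):
    """Alert when mean squared error drifts above an agreed threshold."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse <= threshold

assert disjoint_union_holds({(1, "a")}, {(2, "b")})
assert regression_error_ok([1.0, 2.0], [1.1, 1.9], threshold=0.05)
```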
So Continuous Analytics takes the lessons learned from Agile development and devops and applies them to the big data development process. This is logical and natural, as writing statistical and modelling code fits into the traditional coding process. It imposes efficiencies and standards that shorten the release cycle and get the output from analytics into the hands of business planners faster.