Continuous Analytics Defined
Continuous Analytics is the extension of devops, continuous integration, and continuous delivery to big data development. It plugs data science and data engineering into the traditional software development model and Agile methodology, where coding, testing, and deployment all run as automated processes.
Continuous Analytics is related to the newer term DataOps, which prescribes a workflow for data analysts that improves their efficiency by reducing time-to-value in analytics projects. Continuous Analytics meshes neatly with that set of DataOps best practices.
This software engineering approach lets big data teams release software in shortened cycles, which is the goal of Agile iterations. The data scientists store their code in Git alongside the programmers who write the APIs that connect to data sources. The devops and big data engineers write Ansible playbooks and Dockerfiles to lay down the big data software, networks, and storage and to spin up virtual machines. Testing is automated and built into the process.
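To make that concrete, here is a minimal sketch of what an automated test might look like for a data scientist's code once it lives in the shared repository. The scoring function, feature names, and thresholds are hypothetical, invented for illustration; the point is that checks like these run on every commit, just like the rest of the team's tests.

```python
# test_churn_model.py -- a hypothetical example; the function and
# feature names are invented for illustration, not from any project.
import math

def churn_score(recency_days, purchase_count):
    """Toy model: a logistic curve over two features."""
    z = 0.05 * recency_days - 0.3 * purchase_count
    return 1.0 / (1.0 + math.exp(-z))

def test_score_is_a_probability():
    # The model must emit a value in [0, 1] for typical inputs.
    assert 0.0 <= churn_score(30, 5) <= 1.0

def test_frequent_buyers_score_lower():
    # A frequent buyer should be ranked less likely to churn.
    assert churn_score(30, 20) < churn_score(30, 1)
```

Run under a test runner such as pytest, failures here block the merge, which is exactly the continuous-integration discipline the rest of the team already follows.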
Adding Big Data Members to the Development Team
Big data development adds people to the development team who have not traditionally worked there. Data scientists are mathematicians and statisticians who write code, but they usually work in isolation from the traditional development team of programmers, product managers, project managers, testers, and architects. Continuous Analytics brings them onboard with everyone else.
In particular, there are two distinct roles related to big data and analytics:
- The big data analyst is trained in mathematical and statistical modelling and machine learning. Before big data technology made it possible to apply analytics to large amounts of data to solve business problems, you typically found these scientists working on, for example, clinical trials at university hospitals. Now they are part of tactical business planning: they work with live business data and outside data feeds and report the predictions and correlations they find to business managers.
- The big data engineer is a platform specialist responsible for designing the big data architecture. This includes selecting, configuring, and installing the big data tools: Kafka, HBase, Spark, Hadoop, and others. They work with virtualization and network architects and with programmers to write scripts that automate the installation of all of that in public and private clouds.
The Disjointed Big Data Development Process
Continuous Analytics solves the main problem with how analytics has been coded: without an overarching methodology, such as devops or Agile, and without a common set of tools.
In the siloed approach to big data development, the data analyst uploads a file or downloads a feed to do data exploration. They work independently of the data laid down by the big data engineer, who is installing Spark and configuring Kafka. They are also one step removed from the programmers who are working with Twitter APIs and writing code to retrieve data purchased from data brokers.
Someone looking at this approach from above would question it: it obviously makes more sense to work on the data in its final form up front. This is made easier because various big data systems like Spark have interactive command-line shells in Scala, Python, and R, the programming languages data scientists already use. With those shells the data scientist can build up their models interactively: they select data from the Spark RDD, group it, filter it, make cross tables, and create visualizations in a sandbox mirror of live production data.
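As a minimal sketch of that interactive loop, assume a pyspark shell where `spark` is the session the shell provides and `/sandbox/events` is a hypothetical mirror of production data; the DataFrame API is used here for brevity in place of raw RDD operations:

```python
# In the pyspark shell; `spark` is the session the shell creates.
# The events dataset and its columns are hypothetical.
df = spark.read.parquet("/sandbox/events")   # sandbox mirror of production data

# Select, group, and filter, building the model up interactively.
by_region = (df.filter(df.amount > 0)
               .groupBy("region")
               .sum("amount"))
by_region.show()

# A cross table of two categorical columns.
df.crosstab("region", "product").show()
```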
When data scientists are not working with the rest of the team, however, they are working on only a subset of the actual data, so they cannot see the complete results of their regression analysis or predictive model until it is pushed to production. It is better to work on the complete data the first time around.
When integrated with the rest of the development team, data scientists do not have to modify their code to switch from spreadsheet .csv files to Hadoop, Spark, or Hive formats. They keep their code in the same repository as the Java programmers and architects who write the build-out scripts, which lets it run against automated testing frameworks and be released to the main code branch. In any case, a code repository like Git is not meant to hold two different production-ready versions of the same code.
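Here is a sketch of why no code change is needed, under the assumption that Spark is the common substrate: only the reader line differs between a sample .csv file and a production Hive table, and everything downstream is untouched. The function and table names are hypothetical.

```python
# Hypothetical illustration: swap the data source without touching the
# analysis code. `spark` is an existing SparkSession.
def load_transactions(spark, source):
    if source.endswith(".csv"):
        return spark.read.csv(source, header=True, inferSchema=True)
    return spark.table(source)   # e.g. a Hive table such as "prod.transactions"

df = load_transactions(spark, "transactions_sample.csv")   # exploration
# df = load_transactions(spark, "prod.transactions")       # production
monthly = df.groupBy("month").avg("amount")   # identical either way
monthly.show()
```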
Analytics as Code
Continuous Analytics follows the same practice as devops, whose mantra is Infrastructure as Code; the Continuous Analytics mantra is Analytics as Code. Everything is abstracted so that it can be pushed out in a continuous-release fashion and deployed to any set of clusters at any time. There are no clumsy, time-consuming manual steps, because all of that is written as code.
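As one possible illustration of Analytics as Code, here is a sketch using Spark ML: the entire model pipeline, from feature preparation through training to the saved artifact, is expressed as code that can be versioned in Git and rebuilt on any cluster by the release process. The column names and paths are hypothetical.

```python
# A sketch of Analytics as Code with Spark ML; column names and paths
# are hypothetical. Everything is code, so the release process can
# rebuild and deploy it on any cluster with no manual steps.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["recency", "frequency"],
                            outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="spend")
pipeline = Pipeline(stages=[assembler, lr])

train = spark.read.parquet("/data/train")    # laid down by the build scripts
model = pipeline.fit(train)
model.write().overwrite().save("/models/spend-v1")  # artifact for the deploy step
```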
The requirement that everyone on the development team write code is new to mathematicians, statisticians, and physicists, but they adapt to the model quickly. It both produces clean, reliable code and boosts cross-team communication. It also imposes standards across the team, so handing work from one person to another is less difficult.
Breaking it Down into Steps
Here is how big data development progresses from data exploration to production build, and how each step is made smoother with Continuous Analytics.
Data Exploration—this is where data scientists prototype their ideas against actual live data. This tests their assumptions and yields insights that lead to new models and algorithms.
Development—data patterns and insights found in the exploration phase are transitioned to production-ready code with little to no refactoring.
Test—as algorithms and machine learning models are coded, they are pushed through unit and integration tests to enable continuous integration into the master branch.
Deploy—the automatic push to production comes next. At this stage, techniques such as canary releases, which roll changes out to a subset of the system rather than the whole, are applied. A canary release allows two or more versions of the model to run at the same time, which also makes it possible to perform QA validation without impacting production users.
Monitor—checks the performance and accuracy of the analytics models. For example, it verifies that the union of two disjoint sets of n tuples contains 2n tuples, and it tracks the error statistic in regression and other analyses, looking to drive it down over successive releases. Monitoring gathers all of this, along with data on system health and performance, and feeds it back into the development process, where changes are made and released back into production; a sketch of such checks follows below.
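Assuming plain Python for illustration, monitoring checks of the kind just described might look like this; the function names and the error threshold are hypothetical:

```python
# Hypothetical monitoring checks of the kind described above.
def disjoint_union_holds(set_a, set_b):
    """Two tuple sets expected to be disjoint should union to
    len(a) + len(b) elements; anything less signals duplicates
    or a faulty join upstream."""
    return len(set_a | set_b) == len(set_a) + len(set_b)

def regression_error_ok(y_true, y_pred, threshold):
    """Alert when mean squared error drifts above an agreed threshold."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return mse <= threshold

assert disjoint_union_holds({(1, "a")}, {(2, "b")})
assert regression_error_ok([1.0, 2.0], [1.1, 1.9], threshold=0.05)
```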
So Continuous Analytics takes the lessons learned from Agile development and devops and applies them to the big data development process. This is logical and natural, as writing statistical and modelling code fits into the traditional coding process. It imposes efficiencies and standards that shorten the release cycle and get the output from analytics into the hands of business planners faster.