Tear down the Wall between Data Science and DevOps

Stepan Pushkarev

Co-founder, CTO/CEO at Provectus

Published: May 9, 2016

Big data has moved beyond hype to production. After several years of education and promotion by the tech media and vendors, businesses now have a solid understanding of how they can use big data to solve problems and improve efficiency. What IT has not yet figured out is how to efficiently manage the big data development team and infrastructure. This mismatch drains productivity and keeps companies from reaping the benefits of big data analysis. It also makes rolling out new ideas slower and more difficult.

Big Data's Built-in Complications

Big data projects have built-in complications that make them more challenging to manage than other types of software development. The software on which they are based is new and changing rapidly. The very idea of streaming, unstructured data is hard to grasp for Java programmers accustomed to traditional row-and-column data. The esoteric algorithms, math, and statistics that power analytics are incomprehensible to most people other than data scientists.

Twitter, LinkedIn, Yahoo, and Google have taken their best ideas on how to process reams of unstructured streaming data and handed those over to open source projects. But there is no overarching set of tools to tie Spark, Kafka, Mesos, and Hadoop together and connect algorithms, code, and the platform. What is missing is a devops tool and process that makes it easier to build, integrate, automate, monitor, and upgrade the system, and to do so within the confines of a shortened release cycle.

The project team has changed too. People who write algorithms and build models, who used to work in isolated silos, are new additions to the team of Agile project managers, product managers, QA testers, architects, and developers. What data scientists, data engineers, and devops need are processes and automated tools that let them hand work off to each other and to those traditional Agile players.

The big data ecosystem has a lot of moving pieces and people to bring together. From the application perspective there are the forecasting app, reporting services, and business triggers that signal the business to make a price change or take other action. There is service discovery, log collection, and monitoring. From the architectural perspective the project needs to spin up and tear down storage, networks, and virtual machines, and to lay down and configure Spark, Kafka, and Hadoop, across clusters in DEV, QA, and production. And the private and public cloud operating models mean all of that needs to work with Amazon, OpenStack, Azure, and Google Cloud.
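To make the size of that provisioning matrix concrete, here is a minimal sketch in Python. It is not taken from any tool mentioned in this article; the `Step` type and the `build_plan` and `teardown_plan` functions are hypothetical names invented for illustration. It only enumerates the work implied above (three kinds of infrastructure and three services, repeated for each of three environments) and does not talk to any cloud provider.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Names taken from the paragraph above.
ENVIRONMENTS = ("DEV", "QA", "production")
INFRASTRUCTURE = ("storage", "networks", "virtual machines")
SERVICES = ("Spark", "Kafka", "Hadoop")


@dataclass(frozen=True)
class Step:
    """One unit of provisioning work (hypothetical type, for illustration)."""
    environment: str
    action: str
    target: str


def build_plan(environments: Sequence[str] = ENVIRONMENTS) -> List[Step]:
    """List the steps needed to bring up every environment, in order."""
    plan: List[Step] = []
    for env in environments:
        # Infrastructure first, then the services that run on top of it.
        for resource in INFRASTRUCTURE:
            plan.append(Step(env, "spin up", resource))
        for service in SERVICES:
            plan.append(Step(env, "install and configure", service))
    return plan


def teardown_plan(plan: Sequence[Step]) -> List[Step]:
    """Tear down infrastructure in the reverse of the order it was created."""
    return [
        Step(step.environment, "tear down", step.target)
        for step in reversed(plan)
        if step.action == "spin up"
    ]


if __name__ == "__main__":
    plan = build_plan()
    print(f"{len(plan)} provisioning steps across {len(ENVIRONMENTS)} environments")
    for step in plan[:6]:
        print(f"[{step.environment}] {step.action}: {step.target}")
```

Even this toy version produces 18 steps before a single cloud provider is taken into account, which is the kind of repetition that invites automation.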

Orchestration without the Orchestra

The build and release cycle goes something like this: first, business planners come up with a metric that they want to track, correlate, or forecast. Data scientists ask engineers to prepare a data set that they can experiment with. Next, data scientists code algorithms and test models. Then comes ETL development and data preparation. Next, programmers and big data engineers implement these models in Hadoop and Spark. Then there is the push to production. After that, the application begins gathering metrics and makes projections as it runs its models. If something fails or works incorrectly, it is escalated from ops to the data analyst to the data engineers. The feedback from all of this goes back into the business and IT decision-making loop, and the process repeats itself in a never-ending fashion.
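The cycle described above can be sketched as an ordered list of stages with an escalation path. This is only an illustration in Python, not how any particular product models it: `Stage`, `RELEASE_CYCLE`, `ESCALATION_PATH`, and `run_cycle` are hypothetical names, and the owner of a stage is left as `None` wherever the text does not name one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    owner: Optional[str]  # None where the text above does not name an owner


# Stages in the order the text lists them.
RELEASE_CYCLE: List[Stage] = [
    Stage("define metric", "business planners"),
    Stage("prepare experimental data set", "engineers"),
    Stage("code algorithms and test models", "data scientists"),
    Stage("ETL development and data preparation", None),
    Stage("implement models in Hadoop and Spark", "programmers and big data engineers"),
    Stage("push to production", None),
    Stage("gather metrics and run models", "application"),
]

# Who a failure is passed to, in order, per the text.
ESCALATION_PATH: List[str] = ["ops", "data analyst", "data engineers"]


def run_cycle(
    stages: Sequence[Stage],
    execute: Callable[[Stage], bool],
) -> Dict[str, object]:
    """Run stages in order and stop at the first one that fails.

    `execute` is a caller-supplied callback that returns True on success.
    """
    completed: List[str] = []
    for stage in stages:
        if not execute(stage):
            return {
                "completed": completed,
                "failed": stage.name,
                "escalation": list(ESCALATION_PATH),
            }
        completed.append(stage.name)
    return {"completed": completed, "failed": None, "escalation": []}


if __name__ == "__main__":
    # Simulate a cycle in which the production push fails.
    report = run_cycle(RELEASE_CYCLE, lambda s: s.name != "push to production")
    print(report)
```

Writing the stages down this way makes the hand-offs explicit: every boundary between two owners is a place where work can stall without a shared process.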

In the current approach to doing all of this, where there is no devops coordination, building the infrastructure and pushing out code is clumsy and slow. It can take up to six months to get the big data platform up and running and one month to go from release iteration, to analytics model, to production. All of this needs to work smoothly with the build and release cycle of traditional software projects, which people already know how to manage with devops.

There are a couple of ways that businesses have attempted to reduce the complexities and inefficiencies in all of this. One is to embrace the one-stop SaaS approach to gain access to Spark, Hadoop, and Kafka. But that ties a company to one vendor and one stack. Without access to the command line, IT cannot readily swap out components. We have already seen Spark replace MapReduce, so a company cannot be wedded to something that is not a complete abstraction. Also, SaaS offers just one building block of the big data platform: a Hadoop cluster by itself is not a complete analytics system.

Enter Hydrosphere to Automate the Build Cycle

Rising to the challenge of bringing Agile efficiency and devops thinking to big data projects, we started Hydrosphere.io to offer an open source, end-to-end solution for big data.

Hydrosphere.io does for big data and analytics what devops does for software development and continuous delivery: it builds out, tests, and configures the entire big data platform, bringing together QA, development, data scientists, and business line product managers.

As a devops platform, Hydrosphere gives data teams a tool for self-service data exploration. It ties together development, testing, deployment, and monitoring, using Mesos for dynamic resource allocation and distribution. It builds out DEV, QA, and production clusters and releases code into them, keeping the big data lake available across all three. It eliminates silos and dramatically improves the efficiency of data science teams.

Our Message to Business

At Hydrosphere.io we rise to this challenge by giving the project team automated tools that bring its many players together. The software delivers measurable improvements in the build and release cycle. We aim to reduce the time it takes to build the ecosystem by up to a factor of 10 and to shorten release cycles by a factor of 5 by enabling continuous integration and delivery. In sum, it shortens the time it takes for product managers, product planners, and data scientists to turn ideas they have sketched out on the whiteboard into a working solution. That's the message that business managers want to hear.

Request for Feedback

We are in the alpha stage now with Hydrosphere, working with early adopters and friends to implement this. The next step is to release our product as open source and make it available to the community. As we work through this we are open to feedback and criticism. Please feel free to send me an email or sign up for updates at https://hydrosphere.io.
