100 open source Big Data and ML architecture papers for data professionals (sequel).

Introduction

In the last decade, Big Data technology has been extremely disruptive, with open source playing a dominant role in shaping its evolution. It has led to a complex ecosystem where new frameworks, libraries and tools are released pretty much every day, creating confusion as technologists struggle to understand the intricacies of these systems. In 2015, I made an attempt at addressing that (see original article) by demystifying the space. Well, a lot has changed in the last 5 years: new ecosystems like Spark and Flink have disrupted the first-generation systems, AI and ML have advanced, and the offerings from cloud providers have matured. It seemed a good time to do a sequel.

This time, I have made an attempt to unpack the complexity in a 3-part series. The first part (current article) provides an overview of the architecture through the lens of the open source ecosystem. The second covers the realization of the architecture through the lens of the cloud providers, while the third will do the same through the lens of some leading Tech companies dealing with data at scale.

If you are a Big Data enthusiast or a technologist ramping up (or scratching your head), it is important to spend some serious time deeply understanding the architecture of key systems to appreciate their evolution. Understanding the architectural components and subtleties will also help you choose and apply the appropriate technology for your use case. In my journey over the last decade, some literature has helped me become a better-educated data professional. My goal here is not only to share that literature but also to use the opportunity to bring some sanity to the labyrinth of open source systems.

One caution: most of the referenced literature is heavily skewed towards deep architecture overviews (in most cases it points to the original research papers).

Architecture Layers

1. Data Foundational Services


a) Coordination - These systems provide coordination and state management across distributed data systems; both entries below trace their lineage to Paxos. A minimal usage sketch follows the list.

  • Zookeeper — inspired by Chubby, though it is a general coordination service rather than simply a locking service.
  • Raft — a consensus algorithm used in several modern databases, e.g. CockroachDB, MongoDB and InfluxDB.
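
To make the coordination primitive concrete, here is a minimal sketch of a distributed lock on ZooKeeper using the kazoo Python client; the ensemble address and znode path are illustrative.

```python
# A minimal distributed-lock sketch with the kazoo ZooKeeper client;
# the ensemble address and znode path are illustrative.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()                                        # open the session

# Competing processes acquiring a lock on the same znode path are
# serialized; kazoo implements the standard ZooKeeper lock recipe.
lock = zk.Lock("/app/locks/resource-42", identifier="worker-1")
with lock:                                        # blocks until acquired
    print("critical section: safe to mutate shared state")

zk.stop()
```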

b) Resource Managers - While the first generation of the Hadoop ecosystem started with monolithic schedulers like YARN, the evolution has been towards hierarchical schedulers (Mesos) that can manage distinct workloads across different kinds of compute paradigms to achieve higher utilization and efficiency.

  • YARN — a resource management and job scheduling framework.
  • Mesos — scheduling between multiple diverse cluster computing frameworks.
  • Peloton — a unified resource scheduler built by Uber that runs on top of Mesos to support distinct workloads: i) stateless long-running services, ii) stateful long-running services such as Cassandra, MySQL and Redis, iii) batch jobs, and iv) daemon jobs.

These are loosely coupled with Hadoop job schedulers, whose primary function is to schedule jobs based on scheduling policies/configuration. A couple of popular ones are the Capacity Scheduler and the FairShare Scheduler (background here); the sketch below illustrates the fair-sharing idea.
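
To illustrate fair sharing (this is a toy illustration of max-min fairness, not YARN's actual implementation): capacity is split evenly, and any queue that needs less than its share returns the surplus to the rest.

```python
# A toy max-min fairness allocator illustrating the FairShare idea
# (not YARN's actual implementation).
def max_min_fair_share(capacity, demands):
    """Allocate capacity across queues so no queue can gain without
    taking from a queue that already has less."""
    allocation = {q: 0.0 for q in demands}
    pending = dict(demands)
    while pending:
        share = capacity / len(pending)            # even split of what's left
        small = {q: d for q, d in pending.items() if d <= share}
        if not small:                              # all demands exceed the share
            for q in pending:
                allocation[q] = share
            return allocation
        for q, d in small.items():                 # grant small demands in full
            allocation[q] = d
            capacity -= d
            del pending[q]
    return allocation

# Queue C needs less than an even split; A and B absorb the surplus.
print(max_min_fair_share(100, {"A": 60, "B": 50, "C": 10}))
# -> {'A': 45.0, 'B': 45.0, 'C': 10}
```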

c) Engines - the execution runtimes which provide an environment for running distinct kinds of compute. The two most common engines are listed below, followed by a minimal Spark sketch.

  • Spark — enjoys popularity and widespread adoption, with a thriving ecosystem.
  • Flink — very similar to Spark; its strength over Spark is exactly-once stream processing.
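
As a concrete taste of an engine, here is a minimal PySpark word count, assuming a local Spark installation; the input lines are illustrative.

```python
# A minimal PySpark word count, assuming a local Spark installation;
# the input lines are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").master("local[*]").getOrCreate()
lines = spark.sparkContext.parallelize(["to be or not to be", "that is the question"])
counts = (
    lines.flatMap(lambda line: line.split())   # tokenize each line
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # sum counts per word
)
print(sorted(counts.collect()))
spark.stop()
```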

d) Messaging - Kafka is the most widely used messaging system for data processing; a minimal producer/consumer sketch follows.
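
Here is a minimal produce/consume sketch with the kafka-python client; it assumes a broker at localhost:9092, and the topic name and payload are illustrative.

```python
# A minimal produce/consume sketch with the kafka-python client; assumes
# a broker at localhost:9092 and an illustrative topic named "events".
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", value=b'{"user": 42, "action": "click"}')
producer.flush()                         # block until messages are sent

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",        # replay from the start of the topic
    consumer_timeout_ms=5000,            # stop iterating once idle
)
for message in consumer:
    print(message.offset, message.value)
```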

e) Security

  • Hadoop Security Design — the seminal paper which captures key aspects of Hadoop's security design. A good overview of the projects can be found here.
  • Apache Metron — a cyber security application framework that provides a centralized tool for security monitoring and anomaly detection, with capabilities for log aggregation, full packet capture indexing, storage, advanced behavioral analytics and data enrichment, applying the most current threat-intelligence information to security telemetry.
  • Apache Knox — the Web/REST API Gateway solution for Hadoop. It provides a single access point for all of Hadoop's resources over REST, acting as a virtual firewall that enforces authentication and usage policies on inbound requests and blocks everything else.
  • Apache Ranger — a policy administration tool for Hadoop clusters. It includes a broad set of management functions, including auditing, key management, and fine-grained data access policies across HDFS, Hive, YARN, Solr, Kafka and other modules.
  • Apache Sentry — fine-grained authorization for data stored in Apache Hadoop. It enforces a common set of policies across multiple data access paths in Hadoop, and overlaps with Apache Ranger since it also deals with authorization and permissions.

f) Operability - The operational frameworks provide capabilities for metrics, benchmarking and performance optimization to manage workloads.

  • OpenTSDB — a time series metrics system built on top of HBase.
  • Ambari — a system for collecting, aggregating and serving Hadoop and system metrics.
  • Helix — a generic cluster management system developed at LinkedIn.
  • Mantis — built at Netflix, the Mantis platform provides observability to help quickly identify issues, trigger alerts, and apply resiliency to minimize or completely avoid downtime.

2. Data Storage Services


a) File Systems - Distributed file systems provide storage, fault tolerance, scalability, reliability, and availability.

  • Google File System — the seminal work on distributed file systems, which shaped the Hadoop File System.
  • Hadoop File System — historical context/architecture on the evolution of HDFS.
  • Ceph File System — an alternative to HDFS which converges block, file and object storage.
  • Alluxio (formerly Tachyon) — an in-memory storage system for modern-day low latency data processing.

b) File Formats - File systems have also seen an evolution in file formats and compression techniques. Column-Oriented vs Row-Stores provides a good overview of data layout, compression and materialization. A good overview paper that compares and contrasts them can be found here, with a more detailed paper here. A related topic is compression techniques and their comparison in the Hadoop ecosystem. A short Parquet/Arrow sketch follows the list.

  • Avro — a row-based format for the Hadoop ecosystem, modeled around Protocol Buffers, the language-neutral serialization format popularized by Google.
  • Parquet — a column-oriented format first covered in Google's Dremel paper.
  • ORCFile — an improved column-oriented format used by Hive.
  • Arrow — an in-memory, language-independent columnar format for flat and hierarchical data that also supports zero-copy reads for lightning-fast data access without serialization overhead.
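
To see the columnar formats in practice, here is a small sketch that builds an Arrow table in memory, persists it as Parquet, and reads back only selected columns (column pruning). It assumes the pyarrow package; the file name and schema are illustrative.

```python
# Build an Arrow table in memory and persist it as Parquet; assumes the
# pyarrow package. The file name and schema are illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["US", "IN", "BR"],
    "spend": [12.5, 7.0, 3.2],
})
pq.write_table(table, "users.parquet", compression="snappy")

# Column pruning: read back only the columns the query needs.
subset = pq.read_table("users.parquet", columns=["user_id", "spend"])
print(subset.to_pydict())
```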

c) Table Format - Recently, a set of modern table formats such as Delta Lake, Hudi and Iceberg (comparison) have been solving for transactional guarantees through ACID compliance, snapshot isolation, safe schema evolution and performant support for CRUD operations. A short time-travel sketch follows the list.

  • Delta Lake — uses a transaction log, compacted into Apache Parquet format, to provide ACID properties and time travel.
  • Apache Iceberg — originally developed at Netflix; provides transactional guarantees, snapshot isolation, column projections and predicate pushdown.
  • Apache Hudi (Hadoop Upserts Deletes and Incrementals) — originally developed at Uber, again bringing ACID compliance to data lakes.
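
A minimal sketch of Delta Lake's transaction log and time travel from PySpark; it assumes the delta-spark package is on the classpath, and the table path and schema are illustrative.

```python
# A minimal Delta Lake sketch (PySpark). Assumes the delta-spark package
# is on the classpath; the table path and schema are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(0, 5).withColumnRenamed("id", "order_id")
df.write.format("delta").mode("overwrite").save("/tmp/orders")

# Every commit to the transaction log is a queryable snapshot: time travel.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/orders")
print(v0.count())
```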

3. Data Management Services


a) Data Query - The data query tools range from declarative languages like SQL to procedural languages like Pig; a small Spark SQL sketch follows the list.

  • Pig — one paper provides a good overview of Pig Latin, while another introduces how to build data pipelines using Pig.
  • Hive — one paper provides an introduction to Hive, while another shares the motivations behind Hive at Facebook.
  • Phoenix — SQL on HBase.
  • Spark SQL — relational data processing on Spark.
  • Calcite — provides optimized query processing over heterogeneous sources and is embedded in several Big Data systems.
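
As a small declarative-query example, here is a Spark SQL sketch that registers a DataFrame as a temporary view and queries it with SQL; the data and names are illustrative.

```python
# A minimal Spark SQL sketch: register a DataFrame as a temp view and
# query it declaratively. The data and names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [("checkout", 120), ("search", 35), ("checkout", 80)],
    ["event", "latency_ms"],
)
df.createOrReplaceTempView("events")
spark.sql("""
    SELECT event, AVG(latency_ms) AS avg_latency_ms
    FROM events
    GROUP BY event
""").show()
spark.stop()
```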

b) Visualization 

  • Superset — developed at Airbnb, it is a data exploration and data visualization tool for Big Data.

c) Data Integration - Data integration frameworks provide good mechanisms to move data into and out of Big Data systems.

  • Sqoop — a tool to move data between Hadoop and relational data stores.
  • Flume — a stream-oriented data flow system which collects logs from specified servers and loads them into central storage.
  • Gobblin — originally developed at LinkedIn; simplifies common aspects of big data integration such as data ingestion, replication, organization and lifecycle management for both streaming and batch data ecosystems.
  • Marmaray — developed at Uber; supports both ingestion and dispersal.

d) Orchestration - These tools help build data pipelines; a minimal Airflow sketch follows the list.

  • Apache NiFi — a data distribution and processing system; provides a way to move data from one place to another, making routing decisions and transformations as necessary along the way.
  • Apache Airflow — originally developed at Airbnb; helps author, schedule and monitor workflows.
  • Oozie — a workflow scheduler system to manage Hadoop jobs.
  • Azkaban — a workflow manager developed at LinkedIn.
  • Genie — a framework developed at Netflix for data pipelines.
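
Here is a minimal Airflow DAG sketch with two dependent Python tasks; it assumes Airflow 2.x, and the DAG id, schedule and callables are illustrative.

```python
# A minimal Airflow 2.x DAG: two Python tasks with a dependency.
# The DAG id, schedule, and callables are illustrative.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")

def load():
    print("write to the warehouse")

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,                  # don't backfill missed runs
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_load = PythonOperator(task_id="load", python_callable=load)
    t_extract >> t_load             # load runs only after extract succeeds
```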

e) Metadata Tools - provide a central place to ingest new data sources, discover datasets and insights, browse metadata, explore raw or fairly unprocessed data to understand data lineage, create and share insights, and monitor data quality with anomaly detection.

  • Ground — a system to manage all the information that informs the use of data.
  • Apache Atlas — a data governance platform designed to exchange metadata and track lineage with other tools and processes within the Hadoop stack.
  • DataHub — developed at LinkedIn; provides a complete solution with metadata discovery and a data portal.
  • Amundsen — developed at Lyft; provides data discovery and metadata management.
  • Metacat — developed at Netflix; provides a unified metadata exploration API service.

4. Data Processing Services


The data processing frameworks can broadly be classified based on the model and latency of processing.

a) Batch - MapReduce — the seminal paper from Google on MapReduce; a toy rendering of the model follows.
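
To make the model concrete, here is a toy single-process rendering of MapReduce's map, shuffle and reduce phases; it illustrates the programming model only, not a distributed runtime.

```python
# A toy, single-process rendering of the MapReduce model:
# map -> shuffle (sort/group by key) -> reduce. Not a distributed runtime.
from itertools import groupby
from operator import itemgetter

def map_fn(doc):
    for word in doc.split():
        yield (word, 1)                  # emit intermediate key/value pairs

def reduce_fn(word, counts):
    return (word, sum(counts))           # aggregate all values for one key

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = sorted(kv for doc in docs for kv in map_fn(doc))   # shuffle phase
result = [
    reduce_fn(word, (count for _, count in group))
    for word, group in groupby(pairs, key=itemgetter(0))   # group per key
]
print(result)  # [('brown', 1), ('dog', 1), ('fox', 2), ...]
```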

b) Streaming 

  • Apache Beam — inspired by Google's Dataflow and MillWheel; it unifies the model for defining batch and streaming data-parallel processing pipelines (a minimal pipeline sketch follows this list).
  • Flink — provides a unifying model for real-time analysis, continuous streams, and batch processing, both in the programming model and in the execution engine. It deals with state using Asynchronous Barrier Snapshotting.
  • Spark Streaming — introduced the micro-batch architecture, bridging traditional batch and interactive processing.
  • Twitter Heron — exposes the same API as Storm but improves upon it with higher scalability, better debuggability, better performance, and easier management.
  • Samza — the stream processing framework from LinkedIn.
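
A minimal Apache Beam pipeline sketch that runs on the local DirectRunner; the same pipeline can target Flink, Spark or Dataflow by swapping runners. The pipeline stages and input are illustrative.

```python
# A minimal Beam pipeline on the local DirectRunner; swapping runners
# targets Flink, Spark, or Dataflow without changing the pipeline code.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["spark streams", "flink streams"])
        | "Split" >> beam.FlatMap(str.split)       # tokenize lines into words
        | "Pair" >> beam.Map(lambda w: (w, 1))     # key each word
        | "Count" >> beam.CombinePerKey(sum)       # aggregate per key
        | "Print" >> beam.Map(print)
    )
```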

c) Interactive

  • Dremel — Google's paper on how it processes interactive big data workloads, which laid the groundwork for multiple open source SQL systems on Hadoop.
  • Presto — an open-source MPP SQL query engine developed at Facebook to quickly process large data sets (a query sketch follows this list).
  • Impala — MPP-style processing to make Hadoop performant for interactive workloads.
  • Drill — an open source implementation of Dremel.
  • Tez — an open source implementation of Dryad using YARN.
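
A sketch of issuing an interactive query against Presto from Python using the presto-python-client package; the host, catalog, schema and query are illustrative assumptions.

```python
# Query Presto from Python via the presto-python-client package;
# the host, catalog, schema, and query are illustrative.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT country, COUNT(*) FROM events GROUP BY country LIMIT 10")
for row in cur.fetchall():
    print(row)
```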

d) Real Time

  • Druid — a real-time OLAP data store that operationalizes time series analytics.
  • Pinot — developed at LinkedIn; an OLAP data store very similar to Druid.
  • AresDB — developed at Uber for real-time analytics; leverages GPUs.

e) Iterative 

  • Pregel — Google's paper on large scale graph processing (a toy vertex-centric sketch follows this list).
  • Giraph — a large-scale distributed graph processing system modelled around Google's Pregel.
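
A toy single-process sketch of Pregel's vertex-centric model, computing single-source shortest paths: each superstep, a vertex consumes incoming messages, updates its value if a shorter path arrived, and messages its neighbors; vertices with no messages stay halted.

```python
# A toy rendering of Pregel's vertex-centric model (single process, unit
# edge weights), computing single-source shortest paths from vertex "a".
INF = float("inf")
graph = {"a": ["b", "c"], "b": ["c"], "c": []}   # adjacency lists
dist = {v: INF for v in graph}                    # per-vertex state
inbox = {v: [] for v in graph}
inbox["a"] = [0]                                  # seed the source vertex

while any(inbox.values()):                        # supersteps until quiescence
    outbox = {v: [] for v in graph}
    for v, msgs in inbox.items():                 # each vertex acts independently
        if msgs and min(msgs) < dist[v]:
            dist[v] = min(msgs)                   # adopt the shorter path
            for nbr in graph[v]:
                outbox[nbr].append(dist[v] + 1)   # message neighbors
    inbox = outbox

print(dist)  # {'a': 0, 'b': 1, 'c': 1}
```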

5. ML Services


ML services provide the foundations for machine learning developers, data scientists, and data engineers to take their ML projects from ideation to production and deployment, quickly and cost-effectively.

a) ML Lifecycle - A typical ML lifecycle starts with data preparation, followed by feature discovery, developing and training the model, testing, deploying, and finally using the model for inference or predictions. Data preparation is usually associated with acquiring, deriving and cleaning enough training data to feed into an ML algorithm. Feature discovery and extraction identifies the key data attributes that matter most to the business domain. Deployment includes observability, debuggability, monitoring and productionization. There are several challenges in the modern ML lifecycle (see tech debt) which ML lifecycle systems are trying to solve.

The prominent open source ones are (good overview here):

  • MLflow — from Databricks; a key principle is its open interface design, where data scientists and engineers can bring their own models to operate in a structured environment (a tracking sketch follows this list).
  • Michelangelo — built by Uber; allows teams to seamlessly build, deploy, and operate machine learning solutions at scale.
  • Metaflow — a Python library used at Netflix for building and deploying data-science workflows.
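
A minimal MLflow tracking sketch that records a run's parameters and metrics for later comparison; it assumes mlflow is installed (it logs locally to ./mlruns by default), and the parameter names and values are illustrative.

```python
# Record a run's parameters and metrics with MLflow tracking; assumes
# mlflow is installed. Parameter names and values are illustrative.
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_metric("val_auc", 0.91, step=2)   # metrics can form a series
```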

b) Distributed Processing Frameworks - There has been a lot of progress in building distributed deep learning frameworks; a great introduction can be found here, with the popular ones being Google's GPipe, Uber's Horovod and DeepMind's TF-Replicator.

c) Feature Stores - A feature store allows distinct teams to manage, store, and discover features for use in machine learning projects. It acts as an API between data engineering and data science, enabling improved collaboration. A great introduction can be found here, with a list of feature stores here. A few popular ones are Feast from Gojek and Google Cloud (some background here), Hopsworks from Logical Clocks, Frame from LinkedIn and Zipline from Airbnb; a lookup sketch follows.
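
As a sketch of the feature-store API surface, here is an online feature lookup using Feast's Python API as of its 0.10+ redesign; the repo path, feature names and entity are illustrative, and the API has evolved across versions, so treat this as a sketch rather than a definitive recipe.

```python
# Online feature lookup with Feast (0.10+ style API); the repo path,
# feature names, and entity values are illustrative.
from feast import FeatureStore

store = FeatureStore(repo_path=".")          # reads feature_store.yaml
features = store.get_online_features(
    features=[
        "driver_stats:trips_today",
        "driver_stats:avg_rating",
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
print(features)
```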

d) ML Libraries - ML developers usually want to try every available algorithm to build the most accurate model, and a team working on an active project might try multiple libraries; popular ones include MXNet, TensorFlow, Clipper, Caffe, PyTorch, Theano and Chainer.

6. Data Serving Systems


Broadly, distributed data stores are classified, depending on the underlying schema and supported data structures, into Column, Key-Value, Document, Graph, Search, Relational and In-Memory.

a) Column Oriented Stores

  • BigTable — seminal paper from Google on distributed column oriented data stores.
  • HBase — while there is no definitive paper, this provides a good overview of the technology.
  • Hypertable — modeled around BigTable.

b) Key Value Stores

  • Dynamo — the seminal paper on key-value distributed storage systems.
  • Cassandra — inspired by Dynamo; a multi-dimensional key-value/column oriented data store.
  • Voldemort — another one inspired by Dynamo, developed at LinkedIn.
  • RocksDB — developed at Facebook; a log structured database engine optimized for fast, low latency storage such as flash drives and high-speed disk drives.

c) Document Stores

  • CouchDB - a popular document oriented data store.
  • MongoDB — a good introduction to MongoDB architecture.

d) Graph Stores

  • Neo4j — the most popular graph database.
  • Titan — a graph database under the Apache license.

e) Search Stores

f) Time Series

g) Relational

h) In-Memory 

  • Memcache — a scalable distributed cache; see its usage at Facebook.
  • Redis — another popular distributed cache, with an option of disk persistence (a caching sketch follows).
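
A minimal read-through caching sketch with the redis-py client, assuming a Redis server on localhost:6379; the key scheme and the stand-in "expensive lookup" are illustrative.

```python
# Read-through caching with redis-py; assumes Redis on localhost:6379.
# The key scheme and stand-in expensive lookup are illustrative.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_user_profile(user_id):
    key = f"profile:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return cached                        # cache hit
    profile = f"profile-for-{user_id}"       # stand-in for an expensive lookup
    r.set(key, profile, ex=300)              # cache it, expire after 5 minutes
    return profile

print(get_user_profile(42))
```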

Addendum

Some general papers which can provide a great background on NoSQL, Warehouse-Scale Computing and Distributed Systems.

Summary

I hope the papers prove useful as you embark on or strengthen your journey. In the second part we will look at the architecture through the lens of the cloud providers, while in the third we will look at how some leading Tech companies are using these (or have developed similar) technologies to solve their business problems.
