Big Data - AWS, Azure, GCP Offerings
In part 1 we covered how open source has been extremely disruptive in shaping Big Data’s evolution. Additionally, we mapped the open source systems into an architectural framework. This second part attempts a deep dive into the three prominent cloud providers (AWS, Azure and GCP) and maps their offerings into the same architectural framework. One caution: cloud platform offerings are a continuously moving target, so the article focuses on the key ones (not all), with an effort to bucket similar offerings based on their capabilities.
AWS
1) Foundational Services
Amazon EMR is the big data platform for processing vast amounts of data using open source tools such as Spark, Hive, HBase, Flink, Hudi and Presto. It handles provisioning, scaling, and reconfiguring of clusters, and offers notebooks for collaborative development.
Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a managed Kafka service used to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications. It automatically provisions and runs Kafka clusters.
2) Storage Services
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers scalability, availability, security, and performance. It is designed for 99.999999999% (eleven nines) of durability.
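To put the eleven nines figure in perspective, here is a small worked calculation. It assumes the durability target applies per object per year, and the object count is a hypothetical example, so treat it as a back-of-the-envelope sketch rather than a guarantee.

```python
from fractions import Fraction

# Design target quoted for S3: eleven nines of durability (assumed per year).
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_loss(object_count: int) -> Fraction:
    """Expected number of objects lost per year at the design durability."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_loss(object_count)


if __name__ == "__main__":
    stored = 10_000_000  # hypothetical number of stored objects
    print(float(expected_annual_loss(stored)))
    print(int(years_per_lost_object(stored)))
```

With ten million stored objects, the expected loss works out to one object roughly every 10,000 years.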
Elastic File System (Amazon EFS) provides a simple, scalable and fully managed elastic NFS file system.
3) Data Management Services
AWS Data Pipeline helps reliably process and move data between different AWS compute and storage services, as well as on-premises data sources. It helps users create complex data processing workloads that are fault tolerant, repeatable, and highly available.
AWS Glue is a data catalog service that helps users discover data and store the associated metadata (e.g. table definition and schema). Once cataloged, the data is immediately searchable, queryable, and available for ETL. It is a fully managed extract, transform and load (ETL) service that makes it easy to prepare data for analytics. It allows users to create and run an ETL job with a few clicks using the AWS Glue visual editor.
Amazon QuickSight is a business intelligence service that makes it easy to deliver insights. It lets users easily create and publish interactive dashboards that include ML Insights.
4) Data Processing Services
Amazon Kinesis enables users to ingest, buffer, and process streaming data in real time to derive insights. It is fully managed and allows continuous stream processing with low latency.
Amazon Redshift is a petabyte-scale data warehouse. It provides the ability to run queries against the data lake with Redshift Spectrum, and supports autoscaling and high concurrency with consistent performance.
Amazon Athena is an interactive, serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. It uses Presto with ANSI SQL support and works with a variety of standard data formats including CSV, JSON, ORC, Avro, and Parquet. Athena, while ideal for quick, ad-hoc querying, can also handle complex analysis, including large joins, window functions, and arrays.
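As an illustration of the kind of window-function query mentioned above, the sketch below computes a running total per customer. It runs locally on SQLite (version 3.25 or later) purely as a stand-in for Athena; the `orders` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical order data; in Athena this would be files in S3 behind a table.
ROWS = [
    ("alice", "2021-01-01", 10.0),
    ("alice", "2021-01-02", 5.0),
    ("bob", "2021-01-01", 7.0),
    ("alice", "2021-01-03", 2.5),
    ("bob", "2021-01-04", 3.0),
]

# Standard SQL window function: running total of amount per customer.
RUNNING_TOTAL_SQL = """
SELECT customer,
       order_day,
       SUM(amount) OVER (
           PARTITION BY customer
           ORDER BY order_day
       ) AS running_total
FROM orders
ORDER BY customer, order_day
"""


def running_totals(rows):
    """Load rows into an in-memory table and run the window query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (customer TEXT, order_day TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return conn.execute(RUNNING_TOTAL_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in running_totals(ROWS):
        print(row)
```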
5) Machine Learning Services
Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. It includes several components: i) SageMaker Studio, which provides a single, web-based visual interface where users can perform all ML development steps, with complete access, control, and visibility into each step required to build, train, and deploy models; ii) SageMaker Autopilot, which provides automated machine learning that inspects raw data, applies feature processors, picks the best set of algorithms, trains and tunes multiple models, tracks their performance, and then ranks the models based on performance; iii) SageMaker Ground Truth, a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning; and iv) SageMaker Neo, which enables developers to train machine learning models once and run them anywhere in the cloud and at the edge.
6) Data Serving Services
Relational - Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in the cloud. Amazon RDS provides six familiar database engines to choose from including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server.
Relational - Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud that combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases. It is backed by a distributed, fault-tolerant, self-healing storage system that auto-scales to deliver high performance and availability with low-latency reads.
Key-Value - Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It is a fully managed, multi-region, multi-master, durable database with built-in security, backup and restore, and in-memory caching.
Graph - Amazon Neptune is a reliable, fully managed graph database service. It is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latency. It supports popular graph query languages such as Apache TinkerPop Gremlin and SPARQL, allowing users to easily build queries that efficiently navigate highly connected datasets.
Document - Amazon DocumentDB is a document-based, fully managed database service designed for performance, scalability, and availability when operating mission-critical MongoDB workloads. Storage and compute are decoupled, allowing each to scale independently.
Cache - Amazon ElastiCache allows users to seamlessly set up, run, and scale popular open-source compatible in-memory data stores. It offers fully managed Redis and Memcached with sub-millisecond response times.
Search - Amazon Elasticsearch Service is a fully managed service that makes it easy for users to deploy, secure, and run Elasticsearch cost effectively at scale.
TimeSeries - Amazon Timestream provides a scalable and serverless time series database service that makes it easy to store and analyze trillions of events per day. It has a purpose-built query engine that lets users access and analyze recent and historical data together, without needing to specify explicitly in the query whether the data resides in the in-memory or cost-optimized tier. It has built-in time series analytics functions, helping identify trends and patterns in data in near real-time.
Azure
1) Foundational Services
Service Bus is a reliable, fully managed messaging as a service (MaaS) offering that provides publish/subscribe and asynchronous operations, along with structured first-in, first-out (FIFO) messaging capabilities.
Azure HDInsight provides the Hadoop ecosystem as a service with popular open source frameworks of Apache Hadoop, Spark, Kafka, HBase, Hive and Storm.
Azure Databricks provides the latest versions of the Databricks stack with Apache Spark, Delta Lake and MLflow, and allows seamless integration with open source libraries. It provides autoscaling and auto-termination to improve total cost of ownership.
2) Data Storage Services
Data Lake Storage Gen2 is the foundation for building data lakes. A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access. Operations such as renaming or deleting a directory become single atomic metadata operations on the directory.
Azure Files offers fully managed file shares in the cloud that are accessible via the industry standard Server Message Block (SMB) protocol or Network File System (NFS) protocol.
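The difference a hierarchical namespace makes to a directory rename can be sketched with two toy in-memory stores. This is only a model of the idea, not the Azure storage API: the function names and the dictionary layout are assumptions made for illustration, and atomicity itself is not modelled.

```python
def rename_prefix_flat(store: dict, old: str, new: str) -> int:
    """Flat namespace: every object under the prefix is re-keyed individually."""
    touched = [key for key in store if key.startswith(old + "/")]
    for key in touched:
        store[new + key[len(old):]] = store.pop(key)
    return len(touched)  # number of per-object operations


def rename_dir_hierarchical(root: dict, parent_path: list, old: str, new: str) -> int:
    """Hierarchical namespace: one directory entry is relinked, children untouched."""
    parent = root
    for part in parent_path:
        parent = parent[part]
    parent[new] = parent.pop(old)
    return 1  # a single metadata operation


if __name__ == "__main__":
    flat = {f"raw/2021/file{i}.csv": b"" for i in range(1000)}
    tree = {"raw": {"2021": {f"file{i}.csv": b"" for i in range(1000)}}}
    print(rename_prefix_flat(flat, "raw/2021", "raw/archive"))
    print(rename_dir_hierarchical(tree, ["raw"], "2021", "archive"))
```

In the flat model the cost of a rename grows with the number of objects under the prefix, while in the hierarchical model it stays at one operation regardless of directory size.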
3) Data Management Services
Azure Data Factory is the ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management.
Azure Data Explorer is a fully managed data analytics service for real-time analysis of large volumes of data streaming from applications. It allows collecting, storing, and analyzing diverse data, and makes it simple to ingest this data and run complex ad hoc queries on it.
Azure Data Catalog is a fully managed service that enables data discovery: any user, whether analyst, data scientist, or developer, can discover, understand, and consume data sources. Data Catalog includes a crowdsourcing model of metadata and annotations. It is a single, central place for all of an organization's users to contribute their knowledge and build a community and culture of data.
Azure Synapse is a Data Warehouse analytics service that brings together enterprise data warehousing and Big Data analytics. It provides the option to either use serverless or provisioned resources. Additionally, it provides capabilities to ingest, prepare, manage, and serve data for BI and machine learning needs.
4) Data Processing Services
Event Hubs is a fully managed, real-time messaging and data ingestion service. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters. Event Hubs for Kafka provides a Kafka endpoint so that any Kafka client can publish and subscribe events to and from Event Hubs with a simple configuration change.
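The "simple configuration change" can be sketched as a settings dictionary for a Kafka client. The property names below follow the librdkafka convention used by clients such as confluent-kafka, which is an assumption about the client in use; the namespace and connection string are placeholders, and the helper function name is hypothetical.

```python
def event_hubs_kafka_config(namespace: str, connection_string: str) -> dict:
    """Build client settings that point a Kafka client at an Event Hubs namespace."""
    return {
        # Event Hubs exposes its Kafka endpoint on port 9093 of the namespace host.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # The literal user name "$ConnectionString" signals that the password
        # field carries an Event Hubs connection string.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


if __name__ == "__main__":
    # Placeholder values; substitute a real namespace and connection string.
    config = event_hubs_kafka_config("mynamespace", "<connection-string>")
    for key, value in config.items():
        print(f"{key} = {value}")
```

The producer and consumer code itself stays unchanged; only the connection settings differ from those of a self-hosted Kafka cluster.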
Azure Stream Analytics is a fully managed continuous real-time analytics and complex event-processing engine that is designed to analyze and process high volumes of fast streaming data from multiple sources simultaneously. It supports guaranteed, “exactly once” event processing with 99.9% availability and built-in recovery capabilities.
Azure Data Lake Analytics is an on-demand analytics job service. Instead of deploying, configuring, and tuning hardware, users write queries to transform data and extract valuable insights.
Power BI Embedded simplifies Business Intelligence capabilities by helping users to quickly add stunning visuals, reports, and dashboards to their apps.
5) Machine Learning Services
Azure Machine Learning is a cloud-based environment that users can use to train, deploy, automate, manage, and track ML models.
Azure Machine Learning studio is the web portal for data scientists and developers in Azure Machine Learning. The studio combines no-code and code-first experiences for an inclusive data science platform.
6) Data Serving Services
Relational - Azure SQL Database is a fully-managed scalable, relational SQL database service with AI-powered and automated features that optimize performance and durability. It additionally supports serverless compute and Hyperscale storage options.
Relational - Azure Database for PostgreSQL, MySQL and MariaDB are relational database services based on their respective community editions. The cloud offerings automate the management and maintenance of the infrastructure and database server, including routine updates, backups, and security.
Key-Value - Azure Cosmos DB is a fully managed, multi-model database that supports key-value data. It is globally distributed and provides low latency and high throughput, with client SDKs available for .NET, Java, Python, and Node.js, APIs for SQL, MongoDB, Cassandra, and Gremlin, and no-ETL (extract, transform, load) analytics.
Cache - Azure Cache for Redis is a fully managed distributed cache with low latency, high throughput and performance to handle millions of requests per second.
GCP
1) Foundational Services
Cloud Pub/Sub is a managed, highly available service that replicates messages. It provides messaging and ingestion for event-driven systems and streaming analytics. It supports the semantics of both at-least-once and exactly-once delivery and processing.
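The gap between at-least-once delivery and exactly-once processing can be illustrated with a consumer that deduplicates on message ID. This is a plain-Python simulation, not the Pub/Sub client API; the class, the function, and the redelivery behaviour are all assumptions made for illustration.

```python
class IdempotentConsumer:
    """Applies each message ID once, even if the message is redelivered."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id: str, amount: int) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.total += amount
        return True


def deliver_at_least_once(messages, consumer, redeliver_ids):
    """Simulate a broker that delivers some messages twice (e.g. after a lost ack)."""
    applied = 0
    for message_id, amount in messages:
        attempts = 2 if message_id in redeliver_ids else 1
        for _ in range(attempts):
            applied += consumer.handle(message_id, amount)
    return applied


if __name__ == "__main__":
    consumer = IdempotentConsumer()
    messages = [("m1", 10), ("m2", 20), ("m3", 30)]
    print(deliver_at_least_once(messages, consumer, redeliver_ids={"m2"}))
    print(consumer.total)
```

Even though message `m2` arrives twice in this simulation, its amount is counted once, so the consumer's total matches what exactly-once processing would produce.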
Cloud Dataproc is a managed Spark and Hadoop service that packages the open source data tools for batch processing, querying, streaming, and machine learning. It offers fully configured environments for Flink, Solr, Zookeeper, Druid, Presto, and other open source software components related to the Apache Hadoop and Apache Spark ecosystem.
2) Data Storage Services
Cloud Storage provides object storage, lifecycle management and versioning. It provides tiered storage with i) Standard Storage, which is good for “hot” data that’s accessed frequently; ii) Nearline Storage, with lower cost, which is good for data that can be stored for at least 30 days; iii) Coldline Storage, with very low cost, which is good for data that can be stored for at least 90 days, including disaster recovery; and finally iv) Archive Storage, with the lowest cost, which is good for data that can be stored for at least 365 days, including regulatory archives.
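The tiering rule described above can be expressed as a small helper that picks the coldest class whose minimum storage duration the data will meet. The function name and the `accessed_frequently` flag are assumptions made for this sketch; a real decision would also weigh retrieval costs and access patterns.

```python
# Minimum storage durations (days) as listed in the tier descriptions above.
MIN_STORAGE_DAYS = {
    "ARCHIVE": 365,
    "COLDLINE": 90,
    "NEARLINE": 30,
    "STANDARD": 0,
}


def suggest_storage_class(expected_retention_days: int,
                          accessed_frequently: bool = False) -> str:
    """Pick the coldest tier whose minimum duration the data will satisfy."""
    if accessed_frequently:
        return "STANDARD"  # hot data stays in Standard regardless of retention
    # The dict is ordered from coldest to hottest, so the first match wins.
    for storage_class, min_days in MIN_STORAGE_DAYS.items():
        if expected_retention_days >= min_days:
            return storage_class
    return "STANDARD"


if __name__ == "__main__":
    for days in (7, 45, 100, 400):
        print(days, suggest_storage_class(days))
```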
Filestore is a high-performance, fully managed file storage that offers low latency for file operations.
3) Data Management Services
Cloud Data Fusion is a fully managed service that provides data integration at scale. It is built using open source project CDAP which ensures data pipeline portability. It offers pre-built transformations as well as the ability to create an internal library of custom connections and transformations that can be validated, shared, and reused across teams.
Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow, orchestrating workflows that cross between on-premises and the public cloud. It provides the ability to create workflows that connect data, processing, and services across clouds to give a unified data environment.
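Airflow models a workflow as a directed acyclic graph of tasks, and the core scheduling idea (run a task only after everything it depends on) can be sketched with the Python standard library. This is not the Airflow or Composer API; the task names are hypothetical and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "extract_on_prem": set(),
    "extract_cloud": set(),
    "transform": {"extract_on_prem", "extract_cloud"},
    "load_warehouse": {"transform"},
    "refresh_dashboard": {"load_warehouse"},
}


def execution_order(workflow: dict) -> list:
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    for task in execution_order(WORKFLOW):
        print(task)
```

The two extract tasks have no dependency on each other, so either may come first; an orchestrator is free to run such tasks in parallel.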
Cloud Datalab is an interactive data analysis and machine learning environment designed to explore, analyze, transform, and visualize data interactively and to build machine learning models. Cloud Datalab is packaged as a container and provides a notebook environment.
Cloud Data Catalog is a managed data discovery and metadata management service. It is a cataloging system for capturing both technical metadata (automatically) as well as business metadata (tags) in a structured format. It provides search and discovery through a simple UI. It also allows ingestion of technical metadata from non-Google Cloud data assets for a unified view of all data assets.
4) Data Processing Services
BigQuery is a serverless, scalable, multi-cloud data warehouse. It integrates with open source data science workloads (Spark, TensorFlow, Dataflow and Apache Beam, MapReduce, Pandas, and scikit-learn) directly using the Storage API. It provides multi-cloud capabilities with BigQuery Omni, which allows users to analyze data across clouds using standard SQL, and Data QnA, which makes it easy for anyone to access the data insights they need through natural language processing (NLP). BigQuery ML integrates with AI Platform Prediction and TensorFlow to enable training of models on structured data with SQL.
Cloud Dataflow is a serverless, unified (through Apache Beam) stream and batch data processing service. Apache Beam’s programming model simplifies the mechanics of large-scale data processing. It allows users to create pipelines that encapsulate the entire series of computations involved in reading input data, transforming that data, and writing output data.
5) Machine Learning Services
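The read, transform, write shape of such a pipeline can be sketched in plain Python with generators. This deliberately does not use the Apache Beam API; the step functions and the word-count task are assumptions chosen only to show how one pipeline object can encapsulate the whole series of computations.

```python
from collections import Counter


def read_lines(source):
    """Read step: yield raw records from an iterable source."""
    for line in source:
        yield line


def to_words(lines):
    """Transform step: split each line into lowercase words."""
    for line in lines:
        for word in line.lower().split():
            yield word


def count_words(words):
    """Aggregate step: count occurrences of each word."""
    return Counter(words)


def write_counts(counts, sink):
    """Write step: append 'word: count' lines to a list-like sink, sorted by word."""
    for word in sorted(counts):
        sink.append(f"{word}: {counts[word]}")
    return sink


def run_pipeline(source, sink):
    """Chain the steps so the pipeline covers read, transform, and write."""
    return write_counts(count_words(to_words(read_lines(source))), sink)


if __name__ == "__main__":
    for line in run_pipeline(["big data", "Big cloud"], []):
        print(line)
```

Because each step only consumes an iterable, the same chain works whether the source is a finite batch or a record-by-record feed, which is the intuition behind a unified batch and stream model.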
AI Platform makes it easy for developers, data scientists, and data engineers to streamline their ML workflows. It helps manage the training, validation, deployment, predictions (online and batch) and monitoring in a typical ML workflow. AI Pipelines streamlines ML Ops.
Cloud AutoML is a suite of machine learning products that enables developers with limited machine learning expertise to train high-quality models specific to their business needs. It relies on Google’s state-of-the-art transfer learning and neural architecture search technology.
6) Data Serving Services
Relational - Cloud Spanner is a managed relational database with scale, strong consistency, and up to 99.999% availability. It supports strong transactional consistency with schemas, SQL queries, and ACID transactions, and provides transparent, synchronous replication across regional and multi-region configurations. It automatically shards the data based on request load and size.
Relational - Cloud SQL is a fully-managed database service that helps set up, maintain, manage, and administer relational databases. It provides support for MySQL, PostgreSQL, and SQL Server database engines.
Key-Value - Cloud Bigtable is a fully managed and scalable NoSQL database service for large analytical and operational workloads. Bigtable is ideal for storing very large amounts of data in a key-value store and supports high read and write throughput at low latency for fast access to large amounts of data.
Cache - Cloud Memorystore is a fully managed in-memory data store service for Redis built on scalable, secure and highly available infrastructure.
3 Cloud Offerings at a glance
Summary
I hope this was useful in providing an overview of the key Big Data product offerings across AWS, Azure and GCP.