AWS and Open Source Big Data and Analytic Frameworks

Today, practically every firm makes considerable use of big data to gain a competitive advantage in the market.

With this in mind, open source frameworks for big data processing and analysis are often the most cost-effective and beneficial option for enterprises.

As firms build innovative solutions to gain a competitive edge, it pays to know the open source technologies driving the big data sector. Below you will find some of the popular frameworks, processing systems, and engines, a rough breakdown of enterprise data by type, and the most relevant AWS data analytics services. Enjoy!

Apache Spark

Apache Spark™ is an open-source distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution for fast analytic queries against data of any size.
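To make the processing model concrete, here is a minimal sketch in plain Python (no cluster, no PySpark) of the kind of map/reduce-style pipeline Spark distributes across executors; the sample lines and names are illustrative only:

```python
from collections import Counter

# Conceptual sketch: Spark parallelizes steps like these across executors
# and can cache intermediate results in memory for reuse between queries.
lines = ["big data on aws", "open source big data"]

# "map" step: split every line into words (in Spark: a flatMap transformation)
words = [w for line in lines for w in line.split()]

# "reduce" step: count occurrences per word (in Spark: reduceByKey)
counts = Counter(words)

print(counts["data"])  # 2 — "data" appears once in each sample line
```

In real Spark the same logic runs unchanged whether the input is two lines or two petabytes; the cluster manager handles partitioning and scheduling.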

Presto

Presto is an open source, distributed SQL query engine designed from the ground up for fast analytic queries against data of any size.

Apache Hive

Hive is an open source, Hadoop-based data warehousing and analytics package queried with HiveQL, a SQL-based language that lets users organize, summarize, and query data sources stored on Amazon S3.
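Because HiveQL is SQL-based, a Hive aggregation over S3-backed tables reads much like ordinary SQL. The sketch below runs the same style of query on an in-memory SQLite database purely for illustration; the table name and columns are invented for the example:

```python
import sqlite3

# Illustrative stand-in for a Hive table: in Hive, "page_views" would be an
# external table whose files live on S3, and HiveQL would compile the query
# into jobs on the cluster instead of executing it locally.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
con.executemany("INSERT INTO page_views VALUES (?, ?)",
                [("home", 10), ("home", 5), ("about", 3)])

# Summarize views per page — the same shape of statement works in HiveQL
rows = con.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('about', 3), ('home', 15)]
```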

Apache Hudi

Apache Hudi is an open-source data management framework that simplifies incremental data processing and data pipeline development. Hudi lets you manage data at the record level on S3 and provides a framework for data privacy use cases that require record-level updates and deletes.
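The record-level semantics Hudi layers on top of S3 can be sketched with a plain-Python model keyed by record key; real Hudi of course manages file groups, commits, and indexes rather than an in-memory dict, and the record fields here are made up:

```python
# Conceptual model of Hudi's record-level operations: writes upsert by key,
# and deletes remove individual records by key (e.g. a privacy erasure).
table = {}  # record key -> record

def upsert(records):
    for rec in records:
        table[rec["id"]] = rec  # insert a new key or overwrite the existing record

def delete(keys):
    for k in keys:
        table.pop(k, None)  # record-level delete

upsert([{"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "b@example.com"}])
upsert([{"id": 1, "email": "a+new@example.com"}])  # update record 1 in place
delete([2])  # remove one user's record, as a privacy use case requires

print(sorted(table))  # [1]
```

Plain S3 objects are immutable, which is why this update-or-delete-one-record capability requires a framework like Hudi rather than direct file writes.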

Impala

Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of MapReduce, it uses a massively parallel processing (MPP) engine similar to those found in traditional relational database management systems (RDBMSs).

Apache Pig

Pig is a free and open source analytics package that runs on top of Hadoop. It is scripted in Pig Latin, a SQL-like language that lets users organize, summarize, and query data sources stored on S3. Pig offers first-class support for map/reduce functions and extensible, user-defined data types, enabling analysis of complex and even unstructured sources such as text documents and log files.

Apache HBase

HBase is an open source, non-relational, distributed database modeled after Google's BigTable. It runs on top of the Hadoop Distributed File System (HDFS) to provide BigTable-like capabilities for Hadoop.
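HBase's BigTable-style data model is essentially a sorted map from row key to columns to values, rather than relational rows and joins. A minimal plain-Python sketch of that model (real HBase persists it in HDFS-backed regions; the row and column names below are invented):

```python
from collections import defaultdict

# Conceptual model: row key -> {"columnfamily:qualifier": value}.
# Rows are sparse — each row stores only the columns it actually has.
hbase_table = defaultdict(dict)

def put(row, column, value):
    hbase_table[row][column] = value

def get(row, column):
    return hbase_table[row].get(column)

put("user#42", "info:name", "Ada")
put("user#42", "metrics:logins", 7)

print(get("user#42", "info:name"))  # Ada
```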


In most data-driven businesses, the data breaks down roughly as follows:

10% Structured Data

10% Semistructured Data

80% Unstructured Data

- Structured data:

Amazon #RDS and Amazon #Aurora,

MySQL, MariaDB, PostgreSQL, Microsoft SQL Server, and Oracle.

- Semistructured data:

Amazon #Neptune, Amazon #DynamoDB and Amazon #ElastiCache

CSV, XML, JSON

- Unstructured data:

Amazon #S3 and Amazon #Redshift Spectrum.

Emails, photos, videos, clickstream data


AWS Data Analytics Services: A Brief Summary


1- #Athena - Interactive Analytics

2- #Elasticsearch - Operational Analytics

3- #EMR - Big Data Analytics

-Apache Hive - Data warehouse and analytics package

-Apache Pig - Analytics package

-Apache Spark - Distributed processing framework and programming model (machine learning, stream processing, or graph analytics)

-Apache HBase - Non-relational, distributed database modeled after Google's BigTable

-Presto - Distributed SQL query engine optimized for low-latency, ad-hoc analytics

-Kinesis Connector - Enables EMR to directly read and query data from Kinesis Data Streams.

4- #Kinesis #Analytics - Real Time Analytics

5- Kinesis #Firehose - Delivering Streaming Data

6- Kinesis #Streams - Ingest Large Data

7- Kinesis #Video Streams - Real Time Video Analytics

8- #Redshift - Data Warehousing

9- #Sagemaker - Predictive Analytics


Resources:

https://aws.amazon.com/emr/faqs/

https://www.whizlabs.com/blog/big-data-tools/
