AWS and Open Source Big Data and Analytic Frameworks

Today, practically every firm makes considerable use of big data to gain a competitive advantage in the market.

With this in mind, open source frameworks for big data processing and analysis are often the most cost-effective and beneficial option for enterprises.

As firms build innovative solutions to gain a competitive edge, it pays to know the open source technologies driving the big data sector. Below you will find some of the popular frameworks, processing systems, and engines, a rough breakdown of enterprise data by type, and the most relevant AWS data analytics services. Enjoy!

Apache Spark

Apache Spark™ is an open-source distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution for fast analytic queries against data of any size.
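To make the processing model concrete, here is a minimal sketch in plain Python (no cluster, no PySpark) of the kind of map/reduce-style pipeline Spark distributes across executors; the sample lines and names are illustrative only:

```python
from collections import Counter

# Conceptual sketch: Spark parallelizes steps like these across executors
# and can cache intermediate results in memory for reuse between queries.
lines = ["big data on aws", "open source big data"]

# "map" step: split every line into words (in Spark: a flatMap transformation)
words = [w for line in lines for w in line.split()]

# "reduce" step: count occurrences per word (in Spark: reduceByKey)
counts = Counter(words)

print(counts["data"])  # 2 — "data" appears once in each sample line
```

In real Spark the same logic runs unchanged whether the input is two lines or two petabytes; the cluster manager handles partitioning and scheduling.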

Presto

Presto is an open source, distributed SQL query engine designed from the ground up for fast analytic queries against data of any size.

Apache Hive

Hive is an open source, Hadoop-based data warehousing and analytics package queried with HiveQL, a SQL-based language that lets users organize, summarize, and query data sources stored on Amazon S3.
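Because HiveQL is SQL-based, a Hive aggregation over S3-backed tables reads much like ordinary SQL. The sketch below runs the same style of query on an in-memory SQLite database purely for illustration; the table name and columns are invented for the example:

```python
import sqlite3

# Illustrative stand-in for a Hive table: in Hive, "page_views" would be an
# external table whose files live on S3, and HiveQL would compile the query
# into jobs on the cluster instead of executing it locally.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
con.executemany("INSERT INTO page_views VALUES (?, ?)",
                [("home", 10), ("home", 5), ("about", 3)])

# Summarize views per page — the same shape of statement works in HiveQL
rows = con.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('about', 3), ('home', 15)]
```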

Apache Hudi

Apache Hudi is an open-source data management framework that simplifies incremental data processing and data pipeline development. Hudi lets you manage data at the record level on S3 and provides a framework for data privacy use cases that require record-level updates and deletes.
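The record-level semantics Hudi layers on top of S3 can be sketched with a plain-Python model keyed by record key; real Hudi of course manages file groups, commits, and indexes rather than an in-memory dict, and the record fields here are made up:

```python
# Conceptual model of Hudi's record-level operations: writes upsert by key,
# and deletes remove individual records by key (e.g. a privacy erasure).
table = {}  # record key -> record

def upsert(records):
    for rec in records:
        table[rec["id"]] = rec  # insert a new key or overwrite the existing record

def delete(keys):
    for k in keys:
        table.pop(k, None)  # record-level delete

upsert([{"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "b@example.com"}])
upsert([{"id": 1, "email": "a+new@example.com"}])  # update record 1 in place
delete([2])  # remove one user's record, as a privacy use case requires

print(sorted(table))  # [1]
```

Plain S3 objects are immutable, which is why this update-or-delete-one-record capability requires a framework like Hudi rather than direct file writes.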

Impala

Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of MapReduce, it uses a massively parallel processing (MPP) engine similar to those found in traditional relational database management systems (RDBMSs).

Apache Pig

Pig is a free and open source analytics package that runs on top of Hadoop. It is scripted in Pig Latin, a SQL-like language that lets users organize, summarize, and query data sources stored on S3. Pig offers first-class support for map/reduce functions and extensible, user-defined data types, enabling analysis of complex and even unstructured sources such as text documents and log files.

Apache HBase

HBase is an open source, non-relational, distributed database modeled after Google's BigTable. It runs on top of the Hadoop Distributed File System (HDFS) to provide BigTable-like capabilities for Hadoop.
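HBase's BigTable-style data model is essentially a sorted map from row key to columns to values, rather than relational rows and joins. A minimal plain-Python sketch of that model (real HBase persists it in HDFS-backed regions; the row and column names below are invented):

```python
from collections import defaultdict

# Conceptual model: row key -> {"columnfamily:qualifier": value}.
# Rows are sparse — each row stores only the columns it actually has.
hbase_table = defaultdict(dict)

def put(row, column, value):
    hbase_table[row][column] = value

def get(row, column):
    return hbase_table[row].get(column)

put("user#42", "info:name", "Ada")
put("user#42", "metrics:logins", 7)

print(get("user#42", "info:name"))  # Ada
```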


In most data-driven businesses, the data breaks down roughly as follows:

10% Structured Data

10% Semistructured Data

80% Unstructured Data

- Structured data:

Amazon #RDS and Amazon #Aurora,

MySQL, MariaDB, PostgreSQL, Microsoft SQL Server, and Oracle.

- Semistructured data:

Amazon #Neptune, Amazon #DynamoDB and Amazon #ElastiCache

CSV, XML, JSON

- Unstructured data:

Amazon #S3 and Amazon #Redshift Spectrum.

Emails, photos, videos, clickstream data


AWS Data Analytics Services: A Brief Summary


1- #Athena - Interactive Analytics

2- #Elasticsearch - Operational Analytics

3- #EMR - Big Data Analytics

-Apache Hive - Data warehouse and analytics package

-Apache Pig - Analytics package

-Apache Spark - Distributed processing framework and programming model (machine learning, stream processing, or graph analytics)

-Apache HBase - Non-relational, distributed database modeled after Google's BigTable

-Presto - Distributed SQL query engine optimized for low-latency, ad-hoc analytics

-Kinesis Connector - Enables EMR to directly read and query data from Kinesis Data Streams.

4- #Kinesis #Analytics - Real Time Analytics

5- Kinesis #Firehose - Delivering Streaming Data

6- Kinesis #Streams - Ingest Large Data

7- Kinesis #Video Streams - Real Time Video Analytics

8- #Redshift - Data Warehousing

9- #Sagemaker - Predictive Analytics


Resources:

https://aws.amazon.com/emr/faqs/

https://www.whizlabs.com/blog/big-data-tools/
