AWS EMR (Amazon Elastic MapReduce)

Rohit Singh

Associate Project Manager @ HuQuo

Published: October 3, 2024

Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Amazon EMR is based on Apache Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Using MapReduce, a core component of the Hadoop software framework, developers can write programs that process massive amounts of unstructured data across a distributed cluster of processors or standalone computers. The MapReduce model was developed by Google for indexing webpages; in 2004 it replaced the company's original indexing algorithms and heuristics.
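
To make the MapReduce model concrete, here is a minimal single-machine sketch of the classic word-count job in plain Python. It is only an illustration of the map, shuffle and reduce phases; the function names are hypothetical, and a real Hadoop job on EMR would distribute these phases across the cluster rather than run them in one process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # Each document could be mapped on a different node; here we simply chain the results.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


counts = word_count(["big data on AWS", "big clusters process big data"])
```

Because each call to map_phase depends only on its own input record, and each key is reduced independently, both phases can run in parallel on separate machines, which is what lets the model scale to large data sets.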

Amazon EMR processes big data across a Hadoop cluster of virtual servers running on Amazon Elastic Compute Cloud (EC2), and it can read and write data in Amazon Simple Storage Service (S3). The Elastic in EMR's name refers to its dynamic resizing ability, which enables administrators to increase or reduce resources, depending on their current needs. Amazon EMR is used for log analysis, web indexing, data warehousing, machine learning (ML), financial analysis, scientific simulation and bioinformatics. It also supports workloads based on Apache Spark, Apache Hive, Presto and Apache HBase. HBase integrates with Hive and Pig, which are open source data warehouse tools for Hadoop: Hive queries and analyzes data using a SQL-like language, and Pig offers a high-level mechanism for programming MapReduce jobs to be executed in Hadoop.
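
As a rough sketch of what resizing looks like in practice, the helper below builds the arguments for the boto3 EMR client's modify_instance_groups call, which changes the instance count of an existing instance group. The helper name and the cluster and group IDs are placeholders invented for this example, and the code only builds the request; it does not contact AWS.

```python
def build_resize_request(cluster_id, instance_group_id, target_count):
    """Build keyword arguments for the EMR ModifyInstanceGroups API.

    This helper is hypothetical; only the argument names follow the boto3 API.
    """
    if target_count < 0:
        raise ValueError("target_count must be non-negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": target_count}
        ],
    }


# Placeholder IDs; substitute the values of a real cluster.
request = build_resize_request("j-EXAMPLECLUSTER", "ig-EXAMPLEGROUP", 6)

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("emr").modify_instance_groups(**request)
```

Scaling down works the same way with a smaller target_count. Clusters can also be configured to scale automatically, which this sketch does not cover.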

Amazon EMR use cases

There are several ways enterprises can use Amazon EMR, including:

Machine learning. EMR's built-in ML tools use the Hadoop framework to create a variety of algorithms to support decision-making, including decision trees, random forests, support-vector machines and logistic regression.
Extract, transform and load. ETL is the process of moving data from one or more data stores to another. Data transformations -- such as sorting, aggregating and joining -- can be done using EMR.
Clickstream analysis. Clickstream data from Amazon S3 can be analyzed with Apache Spark and Apache Hive. Apache Spark is an open source data processing tool that can help make data easy to manage and analyze. Spark uses a framework that enables jobs to run across large clusters of computers and can process data in parallel. Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for working with data that Spark can analyze. Clickstream analysis can help organizations understand customer behaviors, find ways to improve a website layout, discover which keywords people are using in search engines and see which word combinations lead to sales.
Real-time streaming. Users can analyze events using streaming data sources in real time with Apache Spark Streaming and Apache Flink. This enables streaming data pipelines to be created on EMR.
Interactive analytics. EMR Notebooks is a managed service that provides a secure, scalable and reliable environment for data analytics. It is based on Jupyter Notebook, an open source web application that data scientists can use to create and share live code and equations, so users can prepare and visualize data to perform interactive analytics.
Genomics. Organizations can use EMR to process genomic data to make data processing and analysis scalable for industries including medicine and telecommunications.
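
The ETL transformations mentioned above (joining, aggregating and sorting) can be sketched in a few lines of plain Python. The records and field names are made up for illustration, and this is a local stand-in only: on EMR the same steps would more likely be expressed as Spark or Hive operations over data stored in S3 or HDFS.

```python
from collections import defaultdict

# Hypothetical records, standing in for data extracted from two source stores.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 30.0},
    {"order_id": 2, "customer_id": "c2", "amount": 12.5},
    {"order_id": 3, "customer_id": "c1", "amount": 20.0},
]
customers = [
    {"customer_id": "c1", "region": "EU"},
    {"customer_id": "c2", "region": "US"},
]


def transform(orders, customers):
    # Join: look up each order's region through its customer_id.
    region_by_customer = {c["customer_id"]: c["region"] for c in customers}

    # Aggregate: total order amount per region.
    totals = defaultdict(float)
    for order in orders:
        region = region_by_customer.get(order["customer_id"])
        if region is not None:  # inner join: skip orders with no matching customer
            totals[region] += order["amount"]

    # Sort: largest total first, ready to load into the target store.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


result = transform(orders, customers)
```

Here result is a list of (region, total) pairs ordered by total; the load step would write it to the destination data store.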

Amazon EMR deployment options

As a cloud service, Amazon EMR can be deployed in a variety of settings, such as:

Amazon EMR on Amazon EC2. Amazon EMR can quickly process large amounts of data using Amazon EC2. Users can configure Amazon EMR to take advantage of On-Demand, Reserved and Spot Instances.
Amazon EMR on Amazon Elastic Kubernetes Service (EKS). The Amazon EMR console enables users to run Apache Spark applications with other applications on the same EKS cluster. Organizations can share compute and memory resources across all applications and use a Kubernetes tool to monitor and manage the infrastructure.
Amazon EMR on AWS Outposts. AWS Outposts enables organizations to run EMR in their own data centers. This makes it easier to set up, deploy, manage and scale EMR in on-premises environments.
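
For the EMR on EC2 option, the mix of purchasing options is expressed in the cluster's instance group configuration. The sketch below builds an InstanceGroups list in the shape accepted by the boto3 EMR client's run_job_flow call, with On-Demand primary and core nodes and optional Spot task nodes. The helper name, group names and instance type are assumptions made for this example. Reserved Instances do not appear because, as far as the API is concerned, they are a billing discount applied to matching On-Demand usage rather than a separate market type.

```python
def build_instance_groups(instance_type, core_count, spot_task_count):
    """Return an InstanceGroups list mixing On-Demand and Spot capacity.

    Hypothetical helper; only the dictionary keys and values follow the EMR API.
    """
    groups = [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if spot_task_count > 0:
        # Task nodes hold no HDFS data, so losing a Spot instance is less disruptive.
        groups.append(
            {
                "Name": "Task (Spot)",
                "InstanceRole": "TASK",
                "Market": "SPOT",
                "InstanceType": instance_type,
                "InstanceCount": spot_task_count,
            }
        )
    return groups


groups = build_instance_groups("m5.xlarge", core_count=2, spot_task_count=4)
```

The resulting list would be passed as Instances={"InstanceGroups": groups} to run_job_flow, together with the other required arguments such as the cluster name, release label and IAM roles, which are omitted here.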
