Seeing the Big Picture: MapReduce, Hadoop and the Cloud

Dr. Shannon Block, CFE

Board Director | 3-time CEO | President | Chief Digital Officer | Chief Strategy Officer | COO | CBDO | Doctorate in Computer Science | M.S. Physics | B.S. Applied Mathematics | B.S. Physics

Published: August 30, 2019

Big data contains patterns that can inform companies about their customers and vendors and help improve business processes. Some of the biggest companies in the world, like Facebook, have used the MapReduce framework as a tool for their cloud computing applications, sometimes through Hadoop, an open-source implementation of MapReduce. MapReduce was designed by Google for parallel, distributed computing on big data.

Before MapReduce, companies needed to pay data modelers and buy supercomputers to extract timely insights from big data. MapReduce has been an important development in helping businesses solve complex problems across big data sets, such as determining the optimal price for products, understanding the return on investment of advertising, making long-term predictions, and mining web clicks to inform product and service development.

MapReduce works across a network of low-cost commodity machines, making actionable business insights more accessible than ever before. It is a strong computational tool for solving problems that involve pattern matching, social network analysis, log analysis, and clustering.

The logic behind MapReduce is to divide big problems into small, manageable tasks that are then distributed across hundreds or thousands of server nodes. The server nodes operate in parallel to generate results. From a programming standpoint, this involves writing a map script that turns the data into a collection of key-value pairs, and a reduce script that runs over all pairs with the same key. One challenge is the time it takes to convert and break the data into key-value pairs, which increases latency.
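To make the map and reduce steps concrete, here is a minimal single-machine sketch in Python that counts words in a few log lines. It is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are hypothetical, and the work that a real cluster spreads across many nodes simply runs in one process here.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values that share one key."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster each line (or block of lines) would be mapped on a
    # different node; here the map calls simply run one after another.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input, standing in for a large log file.
logs = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(logs)
print(counts["error"], counts["disk"])  # prints "2 2"
```

The conversion cost mentioned above shows up in `map_phase` and `shuffle`: every record has to be rewritten as key-value pairs and regrouped by key before any reduce work can begin.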

Hadoop is Apache’s open-source implementation of the MapReduce framework. In addition to the MapReduce distributed processing layer, Hadoop uses HDFS for reliable storage and YARN for resource management, and it is flexible in dealing with both structured and unstructured data. New nodes can be added to Hadoop without downtime, and if a machine goes down, its data can still be retrieved from copies stored on other nodes. Hadoop can be a cost-efficient solution for big data processing, allowing terabytes of data to be analyzed within minutes.

However, cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform offer similar MapReduce components, with the operational complexity handled by the cloud vendor instead of the individual business. Hadoop is known for its strong combination of computation with storage, but cloud-based object stores, such as those offered by AWS, can take the place of HDFS while still supporting computation, and container orchestration technology like Kubernetes can take the place of YARN. With the growing shift to cloud vendors, concerns about the long-term vision for Hadoop have increased.

Hortonworks was a data software company that supported open-source software, primarily Hadoop. In January 2019, however, Hortonworks closed an all-stock $5.2 billion merger with Cloudera. While Cloudera also supports open-source Hadoop, it has its own management suite, which carries vendor lock-in and is meant to help with both installation and deployment, whereas Hortonworks was 100% open source. In May 2019, another Hadoop provider, MapR, announced it was looking for a new source of funding. On June 6, 2019, Cloudera’s stock declined 43% and the CEO left the company.

Understanding the advantages and disadvantages of the MapReduce framework and Hadoop in big data analytics helps in making informed business decisions as this field continues to evolve. On the drawbacks of Hadoop, Monte Zweben, the CEO of Splice Machine, which builds relational databases for Hadoop, says, “When we need to transport ourselves to another location and need a vehicle, we go and buy a car. We don’t buy a suspension system, a fuel injector, and a bunch of axles and put the whole thing together, so to speak. We don’t go get the bill of materials.”

What do you think? Please DM me or leave your feedback in the comments below.

#Hadoop #MapReduce #CloudComputing

About the Author

Shannon Block is an entrepreneur, mother and proud member of the global community. Her educational background includes a B.S. in Physics and B.S. in Applied Mathematics from George Washington University, M.S. in Physics from Tufts University and she is currently completing her Doctorate in Computer Science. She has been the CEO of both for-profit and non-profit organizations. Currently as Executive Director of Skillful Colorado, Shannon and her team are working to bring a future of skills to the future of work. With more than a decade of leadership experience, Shannon is a pragmatic and collaborative leader, adept at bringing people together to solve complex problems. She approaches issues holistically, helps her team think strategically about solutions and fosters a strong network of partners with a shared interest in strengthening workforce and economic development across the United States. Prior to Skillful, Shannon served as CEO of the Denver Zoo, Rocky Mountain Cancer Centers, and World Forward Foundation. She is deeply engaged in the Colorado community and has served on multiple boards including the International Women's Forum, the Regional Executive Committee of the Young Presidents’ Organization, Children’s Hospital Quality and Safety Board, Women’s Forum of Colorado, and the Colorado-based Presbyterian/St. Luke’s Community Advisory Council. Follow her on Twitter @ShannonBlock or connect with her on LinkedIn.

Visit www.ShannonBlock.org for more on technology tools and trends.

Dr. Guy Pyke

Information Technology Consultant

5 years ago

Great article Shannon! I am intrigued to learn more about MapReduce and especially Hadoop as this falls directly in line with my dissertation research. Big data processing is vital in today’s technological society with almost everything now operating on cloud services and virtuality. Thank you!
