Hadoop
Surendra Bairagi
Global Head of Sales & Strategies | Cloud Consulting & Cybersecurity Specialist | Empowering Businesses with Digital Transformation @ IBN Technologies Ltd
Hadoop was designed specifically to handle big data. It is an open-source distributed processing framework that manages data processing and storage for big data applications running on clustered systems. It sits at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications. Hadoop can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.
As the World Wide Web grew in the late 1990s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results were returned by humans. But as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).
One such project was an open-source web search engine called Nutch, the brainchild of Doug Cutting and Mike Cafarella. They wanted to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. During this time, another search engine project called Google was in progress. It was based on the same concept of storing and processing data in a distributed, automated way so that relevant web search results could be returned faster.
In 2006, Cutting joined Yahoo and took the Nutch project with him. The project was then split in two: the web crawler portion remained Nutch, while the distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today, the Hadoop framework and its ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.
The Hadoop ecosystem can store and process huge amounts of any kind of data quickly. With data volumes and varieties constantly increasing, especially from social media and the Internet of Things (IoT), that's a key consideration. Hadoop's distributed computing model processes big data fast: the more computing nodes you use, the more processing power you have. Data and application processing are also protected against hardware failure. If a node goes down, jobs are automatically redirected to other nodes so the distributed computation does not fail, and multiple copies of all data are stored automatically. Unlike traditional relational databases, you don’t have to preprocess data before storing it; you can store as much data as you want, including unstructured data like text, images and videos, and decide how to use it later. The open-source framework is free and uses commodity hardware to store large quantities of data, the system can grow simply by adding nodes, and little administration is required.
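To make that storage model concrete, here is a minimal sketch that uses Hadoop's Java FileSystem API to write a raw file into HDFS with a replication factor of three. The NameNode address and file path are placeholders, and it assumes the Hadoop client libraries are on the classpath and a cluster is reachable.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write raw, unstructured text into HDFS as-is,
// letting the cluster replicate the blocks for fault tolerance.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; replace with your cluster's.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        // Keep three copies of every block (a common HDFS default),
        // so the data survives individual node failures.
        conf.set("dfs.replication", "3");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/raw/events.log"))) {
            // No schema is imposed at write time; the bytes are stored as-is
            // and can be interpreted later ("schema on read").
            out.write("click,homepage,user42\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Because the copies of each block land on different nodes, the file stays readable even if a node holding one copy goes down.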
Hadoop runs on clusters of commodity servers and can scale up to support thousands of hardware nodes and massive amounts of data. It uses a namesake distributed file system (HDFS) that's designed to provide rapid data access across the nodes in a cluster, plus fault-tolerant capabilities so applications can continue to run if individual nodes fail. Consequently, Hadoop became a foundational data management platform for big data analytics after it emerged in the mid-2000s. Initially, though, the MapReduce engine handled both data processing and cluster resource management, which limited Hadoop to batch-oriented workloads.
That changed in Hadoop 2.0, which became generally available in October 2013 when version 2.2.0 was released. It introduced Apache Hadoop YARN, a new cluster resource management and job scheduling technology that took over those functions from MapReduce. YARN -- short for Yet Another Resource Negotiator but typically referred to by the acronym alone -- ended the strict reliance on MapReduce and opened up Hadoop to other processing engines and various applications besides batch jobs.
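To show what a classic Hadoop batch job looks like, below is a minimal word-count sketch written against the standard MapReduce API. It is the canonical introductory example rather than anything specific to this article, and the input and output paths are supplied on the command line.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Canonical word-count batch job: mappers emit (word, 1) pairs,
// reducers sum the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

When the job is submitted (for example with hadoop jar), YARN allocates containers for the map and reduce tasks across the cluster and reruns any task whose node fails.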
Hadoop 3.0.0 was the next major version of Hadoop. Released by Apache in December 2017, it didn't expand Hadoop's set of core components. However, it added a YARN Federation feature designed to enable YARN to support tens of thousands of nodes or more in a single cluster, up from a previous 10,000-node limit. The new version also included support for GPUs and erasure coding, an alternative to data replication that requires significantly less storage space.
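To put rough numbers on that saving, the back-of-the-envelope sketch below compares the raw disk needed for one terabyte of data under 3x replication versus a Reed-Solomon 6+3 erasure coding layout, the scheme behind HDFS's default RS-6-3 policy. The figures are illustrative arithmetic, not measurements from any particular cluster.

```java
// Back-of-the-envelope comparison of raw storage needed for 1 TB of data.
public class StorageOverhead {
    public static void main(String[] args) {
        double dataTb = 1.0;

        // 3x replication: every block is stored three times -> 200% overhead.
        double replicated = dataTb * 3;

        // Reed-Solomon 6+3: every 6 data blocks get 3 parity blocks -> 50% overhead,
        // while still tolerating the loss of any 3 blocks in a stripe.
        double erasureCoded = dataTb * (6 + 3) / 6.0;

        System.out.printf("3x replication : %.2f TB on disk%n", replicated);
        System.out.printf("RS(6,3) coding : %.2f TB on disk%n", erasureCoded);
    }
}
```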
Data lakes generally serve different purposes than traditional data warehouses that hold cleansed sets of transaction data. But, in some cases, companies view their Hadoop data lakes as modern-day data warehouses. Either way, the growing role of big data analytics in business decision-making has made effective data governance and data security processes a priority in data lake deployments.
One of the most popular analytical uses among some of Hadoop's largest adopters is web-based recommendation systems: Facebook’s "People You May Know," LinkedIn’s "Jobs You May Be Interested In," and the suggestions served up by Netflix and eBay. These systems analyze huge amounts of data in real time to quickly predict preferences before customers leave the web page.
It’s no secret that there is a data explosion. An IDC analyst report from April 2014 indicated that the volume of data in the world, known as the digital universe, is doubling in size every two years, and that by 2020 there will be as many digital bits as there are stars in the universe. There are many reasons for that growth, from social media to the sensors and connected devices of the IoT.
Hadoop has matured to become a key part of the next-gen data management platforms for enterprises worldwide. The growing production use of Hadoop in the cloud, on-premises, and out to the edge demands seamless management, security, and governance of all data, regardless of its deployment or type.