Big Data: A BIG Problematic Friend

So we have all heard the term "BIG DATA", but what exactly does it mean? Before answering that, let's first see what data is.

Data are characteristics or information, usually numerical, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects.

In simple words, data is facts and statistics collected together for reference or analysis.

The units of data, in increasing order of size, are bytes, kilobytes, megabytes, gigabytes, terabytes, petabytes, exabytes, zettabytes, and yottabytes, each roughly a thousand times larger than the one before it.

Now we know what data is, so the question remains: what do we mean by Big Data?

According to a blog post on oracle.com:

Big data is larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can’t manage them. But these massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before.

According to a post on guru99.com:

Big Data is also data but with a huge size. Big Data is a term used to describe a collection of data that is huge in volume and yet growing exponentially with time. In short such data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.

From these two definitions, we can say that Big Data is also a form of data, but in huge volumes, i.e. in the range of terabytes, petabytes, or even more, which can't be stored or analyzed using conventional techniques.

Let's now see how much data some well-known companies are generating.

Google: 40,000 Google Web Searches Per Second

More than 3.7 billion humans now have regular access to and use the internet. That results in about 40,000 web searches per second — on Google alone.

Furthermore, over half of all those web searches take place on mobile devices. It is likely that web search totals will continue to grow as more and more people get their hands on mobile devices across the world.

Facebook: 500 Terabytes Per Day

In 2012, Facebook’s system was generating 2.5 billion pieces of content and more than 500 terabytes of data per day. There are just as many “likes,” photos, and data scans too. It was massive then, and it’s certainly grown over time.

Today, there are two billion active users on Facebook and counting, making it the largest social media platform in existence. About 1.5 billion people are active on the network per day, all generating data and content. Five new profiles join Facebook every second, and more than 300 million photos are uploaded, too.

Twitter: 12 Terabytes Per Day

One wouldn’t think that 140-character messages comprise large stores of data, but it turns out that the Twitter community generates more than 12 terabytes of data per day.

That equals 84 terabytes per week and 4368 terabytes — or 4.3 petabytes — per year. That’s a lot of data certainly for short, character-limited messages like those shared on the network.

Amazon: $258,751.90 in Sales Per Minute

Amazon generates data two-fold. The major retailer is collecting and processing data about its regular retail business, including customer preferences and shopping habits. But it is also important to remember that Amazon offers cloud storage opportunities for the enterprise world.

Amazon S3 — on top of everything else the company handles — offers a comprehensive cloud storage solution that naturally facilitates the transfer and storage of massive data troves. Because of this, it’s difficult to truly pinpoint just how much data Amazon is generating in total.

Instead, it’s better to look at the revenue flowing in for the company which is directly tied to data handling and storage. The company generates more than $258,751.90 in sales and service fees per minute.

General Stats: Per Minute Ratings

Here are some of the per-minute ratings for various social networks:

  • Snapchat: Over 527,760 photos shared by users
  • LinkedIn: Over 120 professionals join the network
  • YouTube: 4,146,600 videos watched
  • Twitter: 456,000 tweets sent or created
  • Instagram: 46,740 photos uploaded
  • Netflix: 69,444 hours of video watched
  • Giphy: 694,444 GIFs served
  • Tumblr: 74,220 posts published
  • Skype: 154,200 calls made by users

By 2025, it’s estimated that 463 exabytes of data will be created each day globally – that’s the equivalent of 212,765,957 DVDs per day!

With data at this scale, the obvious question is: how are the big companies able to store such vast amounts of data, and how do they process it?

Before seeing how they do so, let's first look at "The Five V's of Big Data".

Volume

If we see big data as a pyramid, volume is the base. The volume of data that companies manage skyrocketed around 2012, when they began collecting more than three million pieces of data every day. “Since then, this volume doubles about every 40 months,” Herencia said.

Velocity

In addition to managing data, companies need that information to flow quickly – as close to real-time as possible. So much so that the MetLife executive stressed that: “Velocity can be more important than volume because it can give us a bigger competitive advantage. Sometimes it’s better to have limited data in real-time than lots of data at a low speed.”

The data have to be available at the right time to make appropriate business decisions. Data analysis expert Gemma Muñoz provided an example: on the days when Champions League soccer matches are held, the food delivery company La Nevera Roja (which was taken over by Just Eat in 2016) decides whether to buy a Google AdWords campaign based on its sales data 45 minutes after the start of the game. Three hours later, this information is not nearly as important.

Variety

The third V of big data is variety. A company can obtain data from many different sources: from in-house devices to smartphone GPS technology or what people are saying on social networks. The importance of these sources of information varies depending on the nature of the business. For example, a mass-market service or product should be more aware of social networks than an industrial business.

These data can have many layers, with different values. As Muñoz explained, “When launching an email marketing campaign, we don’t just want to know how many people opened the email, but more importantly, what these people are like.”

Veracity

The fourth V is veracity, which in this context is equivalent to quality. We have all the data, but could we be missing something? Are the data “clean” and accurate? Do they really have something to offer?

Value

Finally, the V for value sits at the top of the big data pyramid. This refers to the ability to transform a tsunami of data into business.

Now that we are familiar with the 5 V's of Big Data, let's have a look at the advantages of using Big Data.

Advantages of Big Data

Benefits of Using Big Data Analytics

  • Identifying the root causes of failures and issues in real-time.
  • Fully understanding the potential of data-driven marketing.
  • Generating customer offers based on their buying habits.
  • Improving customer engagement and increasing customer loyalty.
  • Reevaluating risk portfolios quickly.

For all the problems it brings, Big Data is very useful to all the major businesses running around the globe, and that is the reason I called it a problematic friend.

Now let us see how we can manage Big Data.

Big Data can be managed by using the concept of a Distributed File System.

So the next question is: what is a Distributed File System?

A distributed file system (DFS) is a method of storing and accessing files based on a client/server architecture. In a distributed file system, one or more central servers store files that can be accessed, with proper authorization rights, by any number of remote clients in the network.

Much like an operating system organizes files in a hierarchical file management system, the distributed system uses a uniform naming convention and a mapping scheme to keep track of where files are located. When the client device retrieves a file from the server, the file appears as a normal file on the client machine, and the user is able to work with the file in the same ways as if it were stored locally on the workstation. When the user finishes working with the file, it is returned over the network to the server, which stores the now-altered file for retrieval at a later time.

Distributed file systems can be advantageous because they make it easier to distribute documents to multiple clients and they provide a centralized storage system so that client machines are not using their resources to store files.

A Distributed File System is a concept; to implement it, there are many software tools available in the market, and Apache Hadoop is one such tool that enables us to use this concept.

What is Hadoop?

Hadoop is an open-source, Java-based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance. Developed by Doug Cutting and Michael J. Cafarella, Hadoop uses the MapReduce programming model for faster storage and retrieval of data from its nodes. The framework is managed by Apache Software Foundation and is licensed under the Apache License 2.0.

For years, while the processing power of application servers has been increasing manifold, databases have lagged behind due to their limited capacity and speed. However, today, as many applications are generating big data to be processed, Hadoop plays a significant role in providing a much-needed makeover to the database world.

From a business point of view, too, there are direct and indirect benefits. By using open-source technology on inexpensive servers that are mostly in the cloud (and sometimes on-premises), organizations achieve significant cost savings.

Additionally, the ability to collect massive data, and the insights derived from crunching this data, results in better business decisions in the real-world—such as the ability to focus on the right consumer segment, weed out or fix erroneous processes, optimize floor operations, provide relevant search results, perform predictive analytics, and so on.

How Hadoop Improves on Traditional Databases

Hadoop solves two key challenges with traditional databases:

1. Capacity: Hadoop stores large volumes of data.

By using a distributed file system called HDFS (Hadoop Distributed File System), the data is split into chunks and saved across clusters of commodity servers. As these commodity servers are built with simple hardware configurations, they are economical and easily scalable as the data grows.

2. Speed: Hadoop stores and retrieves data faster.

Hadoop uses the MapReduce functional programming model to perform parallel processing across data sets. So, when a query is sent to the database, instead of handling data sequentially, tasks are split and concurrently run across distributed servers. Finally, the output of all tasks is collated and sent back to the application, drastically improving the processing speed.

5 Benefits of Hadoop for Big Data

For big data and analytics, Hadoop is a lifesaver. Data gathered about people, processes, objects, tools, etc. is useful only when meaningful patterns emerge that, in turn, result in better decisions. Hadoop helps overcome the challenge of the vastness of big data:

  1. Resilience — Data stored in any node is also replicated in other nodes of the cluster. This ensures fault tolerance. If one node goes down, there is always a backup of the data available in the cluster.
  2. Scalability — Unlike traditional systems that have a limitation on data storage, Hadoop is scalable because it operates in a distributed environment. As the need arises, the setup can be easily expanded to include more servers that can store up to multiple petabytes of data.
  3. Low cost — As Hadoop is an open-source framework, with no license to be procured, the costs are significantly lower compared to relational database systems. The use of inexpensive commodity hardware also works in its favor to keep the solution economical.
  4. Speed — Hadoop’s distributed file system, concurrent processing, and the MapReduce model enable running complex queries in a matter of seconds.
  5. Data diversity — HDFS has the capability to store different data formats such as unstructured (e.g. videos), semi-structured (e.g. XML files), and structured. While storing data, it is not required to validate against a predefined schema. Rather, the data can be dumped in any format. Later, when retrieved, data is parsed and fitted into any schema as needed. This gives the flexibility to derive different insights using the same data.

The Hadoop Ecosystem: Core Components

Hadoop is not just one application; rather, it is a platform with various integral components that enable distributed data storage and processing. These components together form the Hadoop ecosystem.

Some of these are core components, which form the foundation of the framework, while some are supplementary components that bring add-on functionalities into the Hadoop world.

The core components of Hadoop are:

HDFS: Maintaining the Distributed File System

HDFS is the pillar of Hadoop that maintains the distributed file system. It makes it possible to store and replicate data across multiple servers.

HDFS has a NameNode and DataNodes. DataNodes are the commodity servers where the data is actually stored. The NameNode, on the other hand, contains metadata about the data stored in the different nodes. The application interacts only with the NameNode, which communicates with the DataNodes as required.
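
To make this concrete, here is a minimal sketch (not from the original article) of writing and checking a file on HDFS with the standard Java FileSystem API; the NameNode address and file path are illustrative placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/hello.txt");
            // The client asks the NameNode where to place the blocks;
            // the bytes themselves are streamed to DataNodes.
            try (FSDataOutputStream out = fs.create(path)) {
                out.writeUTF("Hello, HDFS!");
            }
            System.out.println("File exists: " + fs.exists(path));
        }
    }
}
```

Note that the NameNode never stores file contents itself; it only answers the question of which DataNodes hold which blocks.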

YARN: Yet Another Resource Negotiator

YARN stands for Yet Another Resource Negotiator. It manages and schedules resources, and decides what should happen in each data node. The central master node that manages all processing requests is called the Resource Manager. The Resource Manager interacts with Node Managers; every worker DataNode has its own Node Manager to execute tasks.
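
As a small illustration of the Resource Manager / Node Manager split, the sketch below (an assumption for illustration, not part of the original article) uses the YarnClient Java API to ask the Resource Manager which Node Managers it knows about; it assumes a yarn-site.xml with the Resource Manager address is available on the classpath.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnClusterReport {
    public static void main(String[] args) throws Exception {
        // Reads the Resource Manager address from yarn-site.xml on the classpath (assumed).
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());
        yarn.start();

        // Ask the Resource Manager which Node Managers are registered
        // and how many containers each one is currently running.
        List<NodeReport> nodes = yarn.getNodeReports();
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " containers=" + node.getNumContainers());
        }
        yarn.stop();
    }
}
```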

MapReduce

MapReduce is a programming model that was first used by Google for indexing its search operations. It is the logic used to split data into smaller sets. It works on the basis of two functions — Map() and Reduce() — that parse the data in a quick and efficient manner.

First, the Map function groups, filters, and sorts multiple data sets in parallel to produce tuples (key, value pairs). Then, the Reduce function aggregates the data from these tuples to produce the desired output.
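
The classic word-count job shows both functions in practice. The sketch below follows the standard Hadoop MapReduce API: the mapper emits (word, 1) pairs and the reducer sums them per word; the input and output HDFS paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The job is typically packaged into a JAR and submitted with the hadoop jar command; the map tasks run in parallel across the cluster, and the reducers collate the results.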

The Hadoop Ecosystem: Supplementary Components

The following are a few supplementary components that are extensively used in the Hadoop ecosystem.

Hive: Data Warehousing

Hive is a data warehousing system that helps to query large datasets in HDFS. Before Hive, developers were faced with the challenge of creating complex MapReduce jobs to query the Hadoop data. Hive uses HQL (Hive Query Language), which resembles the syntax of SQL. Since most developers come from a SQL background, Hive is easier to get on board with.

The advantage of Hive is that a JDBC/ODBC driver acts as an interface between the application and the HDFS. It exposes the Hadoop file system as tables, converts HQL into MapReduce jobs, and vice-versa. So while the developers and database administrators gain the benefit of batch processing large datasets, they can use simple, familiar queries to achieve that. Originally developed by the Facebook team, Hive is now an open-source technology.
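
For instance, an application can talk to Hive over JDBC and submit HQL that Hive converts into MapReduce jobs behind the scenes. The sketch below is only illustrative: the HiveServer2 address, credentials, and the web_logs table are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint and credentials.
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // HQL looks like SQL; Hive compiles it into distributed jobs over HDFS data.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
            }
        }
    }
}
```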

Pig: Reduce MapReduce Functions

Pig, initially developed by Yahoo!, is similar to Hive in that it eliminates the need to write MapReduce functions to query HDFS. Like HQL, the language used here, called "Pig Latin", is close to SQL. "Pig Latin" is a high-level data-flow language layer on top of MapReduce.

Pig also has a runtime environment that interfaces with HDFS. Scripts in languages such as Java or Python can also be embedded inside Pig.
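
Pig Latin can also be driven from a Java program through the PigServer API. The rough sketch below (an illustration, not from the article) groups log records by page and counts them; the input path, output path, and field names are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements from Java against the Hadoop cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("logs = LOAD '/data/web_logs' USING PigStorage('\\t') "
                + "AS (user:chararray, page:chararray);");
        pig.registerQuery("by_page = GROUP logs BY page;");
        pig.registerQuery("hits = FOREACH by_page GENERATE group AS page, COUNT(logs) AS n;");
        // Materialize the result back into HDFS.
        pig.store("hits", "/data/page_hits");
    }
}
```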

Hive Versus Pig

Although Pig and Hive have similar functions, one can be more effective than the other in different scenarios.

Pig is useful in the data preparation stage, as it can perform complex joins and queries easily. It also works well with different data formats, including semi-structured and unstructured. Pig Latin is closer to SQL but also varies from SQL enough for it to have a learning curve.

Hive, however, works well with structured data and is therefore more effective during data warehousing. It’s used on the server-side of the cluster.

Researchers and programmers tend to use Pig on the client-side of a cluster, whereas business intelligence users such as data analysts find Hive as the right fit.

Flume: Big Data Ingestion

Flume is a big data ingestion tool that acts as a courier service between multiple data sources and the HDFS. It collects, aggregates, and sends huge amounts of streaming data (e.g. log files, events) generated by applications such as social media sites, IoT apps, and e-commerce portals into the HDFS.

Flume is feature-rich; it:

  • Has a distributed architecture.
  • Ensures reliable data transfer.
  • Is fault-tolerant.
  • Has the flexibility to collect data in batches or real-time.
  • Can be scaled horizontally to handle more traffic, as needed.

Data sources communicate with Flume agents — every agent has a source, channel, and a sink. The source collects data from the sender, the channel temporarily stores the data, and finally, the sink transfers data to the destination, which is a Hadoop server.

Sqoop: Data Ingestion for Relational Databases

Sqoop ("SQL to Hadoop") is another data ingestion tool like Flume. While Flume works on unstructured or semi-structured data, Sqoop is used to export data from and import data into relational databases. As most enterprise data is stored in relational databases, Sqoop is used to import that data into Hadoop for analysts to examine.

Database admins and developers can use a simple command-line interface to export and import data. Sqoop converts these commands to MapReduce format and sends them to the HDFS using YARN. Sqoop is also fault-tolerant and performs concurrent operations like Flume.

Zookeeper: Coordination of Distributed Applications

Zookeeper is a service that coordinates distributed applications. In the Hadoop framework, it acts as an admin tool with a centralized registry that has information about the cluster of distributed servers it manages. Some of its key functions are:

  • Maintaining configuration information (shared state of configuration data)
  • Naming service (assignment of name to each server)
  • Synchronization service (handles deadlocks, race condition, and data inconsistency)
  • Leader election (elects a leader among the servers through consensus)

The cluster of servers that the Zookeeper service runs on is called an “ensemble.” The ensemble elects a leader among the group, with the rest behaving as followers. All write-operations from clients need to be routed through the leader, whereas read operations can go directly to any server.

Zookeeper provides high reliability and resilience through fail-safe synchronization, atomicity, and serialization of messages.
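
As a brief sketch of the "centralized registry" idea, the snippet below uses the standard ZooKeeper Java client to store and read a piece of shared configuration; the ensemble address and znode path are assumptions for illustration.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperConfigExample {
    public static void main(String[] args) throws Exception {
        // Assumed ensemble address; writes are routed through the elected leader,
        // while reads can be served by any server in the ensemble.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> { });

        String path = "/feature-flag";
        if (zk.exists(path, false) == null) {
            // Create a persistent znode holding a shared configuration value.
            zk.create(path, "enabled".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] value = zk.getData(path, false, null);
        System.out.println("Shared config value: " + new String(value));
        zk.close();
    }
}
```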

Kafka: Faster Data Transfers

Kafka is a distributed publish-subscribe messaging system that is often used with Hadoop for faster data transfers. A Kafka cluster consists of a group of servers that act as an intermediary between producers and consumers.

In the context of big data, an example of a producer could be a sensor gathering temperature data to relay back to the server. Consumers are the Hadoop servers. The producers publish a message on a topic and the consumers pull messages by listening to the topic.

A single topic can be split further into partitions. All messages with the same key go to the same partition. A consumer can listen to one or more partitions.

By grouping messages under one key and getting a consumer to cater to specific partitions, many consumers can listen on the same topic at the same time. Thus, a topic is parallelized, increasing the throughput of the system. Kafka is widely adopted for its speed, scalability, and robust replication.
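
A minimal producer sketch using the Kafka Java client illustrates the key-to-partition behaviour described above; the broker address, topic name, and key are illustrative.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TemperatureProducer {
    public static void main(String[] args) {
        // Assumed broker address; keys and values are sent as plain strings.
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Messages with the same key ("sensor-42") land on the same partition,
            // so a consumer assigned that partition sees this sensor's readings in order.
            producer.send(new ProducerRecord<>("temperature-readings", "sensor-42", "21.7"));
        }
    }
}
```

A consumer subscribed to the temperature-readings topic would then pull these messages from whichever partitions it is assigned.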

HBase: Non-Relational Database

HBase is a column-oriented, non-relational database that sits on top of HDFS. One of the challenges with HDFS is that it can only do batch processing. So for simple interactive queries, data still has to be processed in batches, leading to high latency.

HBase solves this challenge by allowing queries for single rows across huge tables with low latency. It achieves this by internally using hash tables. It is modeled along the lines of Google Bigtable, which is built on top of the Google File System (GFS).

HBase is scalable, has failure support when a node goes down, and is good with unstructured as well as semi-structured data. Hence, it is ideal for querying big data stores for analytical purposes.
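
To illustrate the low-latency single-row access described above, here is a small sketch using the HBase Java client; the ZooKeeper quorum, table name, and column family are assumptions, and the table is presumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumed ZooKeeper quorum used by the HBase cluster.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell, then read the same row back with a single-row Get.
            Put put = new Put(Bytes.toBytes("user123"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Delhi"));
            table.put(put);

            Result row = table.get(new Get(Bytes.toBytes("user123")));
            byte[] city = row.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}
```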


I would like to thank the resources from which I got this information; some of the links are:

https://www.talend.com/resources/what-is-hadoop/

https://www.webopedia.com/TERM/D/distributed_file_system.html

and others too :)

Thank you for reading everyone :)
