Big Data-Big Problem

“Data is the new science. Big Data holds the answers.”

– Pat Gelsinger

The term Big Data is not as simple as it looks. It may seem like just a combination of two ordinary words, ‘Big’ and ‘Data’, but in the real world this term is a very big deal.

We live in the 21st century, and almost everything around us is rapidly being converted to digital form. With a world population of more than 7.8 billion, we generate nearly 2.5 quintillion bytes of data every day. So let’s take a moment to think about what the top multinational companies do to manage these huge volumes of data.

Before considering how they manage big data, let’s look at what big data is, why it is such a big deal, and the complications it poses.

First of all, what is Data?

“Before you can attempt to manage big data, you first have to know what the term means,” said Greg Satell in Forbes magazine.

In simple words, data is just a collection of facts (information), such as numbers, words, measurements, observations, or descriptions of things. This information may take the form of text documents, images, audio clips, software programs, or other types of data. As far as computer data is concerned, data is a series of bits (binary digits).
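The “series of bits” idea is easy to see in code. A minimal Python sketch (the sample text is purely illustrative):

```python
# Any piece of data -- here a short text -- is ultimately stored as bits.
text = "data"
raw = text.encode("utf-8")                      # 4 bytes
bits = "".join(f"{byte:08b}" for byte in raw)   # each byte = 8 binary digits

print(len(raw))   # 4 bytes
print(bits)       # a 32-character string of 0s and 1s
```

The same holds for images, audio, and programs: only the interpretation of the bits differs.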

Why is data important?

“War is 90% information.”

– Napoleon Bonaparte

Data can be structured or unstructured; either way, it is important. The types of data that get the most attention are:

Personal Data: This covers your email, location, social media accounts, and other identifying factors. If it is not managed properly, it can lead to a scandal.

Transaction Data: Transaction data is anything that requires an action to collect, such as making a purchase, visiting a website, or clicking on an ad. It is incredibly important for businesses because it helps them expose variability and optimize their operations for the highest-quality results.

Web Data: Web data is a collective term that refers to any type of data you might pull from the internet. Web data can be used to monitor competitors, track potential customers, keep track of channel partners, generate leads, build apps, etc.

Sensor Data: Sensor data is produced by objects and is often associated with the Internet of Things. It covers everything from your smartwatch measuring your heart rate to a building with external sensors that measure the weather. In privacy terms, this data is really sensitive.

So imagine what would happen if our data were not managed properly: bank transaction records would be lost, no one would know who owes what, and every digital application would grind to a halt. Data also drives companies’ decision making.

Big Data – Big Problem:



Storing, reading, processing, and analyzing huge volumes of data is more than a single hard disk can handle, and doing it on one machine consumes a lot of time.
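Back-of-the-envelope arithmetic shows why. Assuming a hard disk sustains roughly 150 MB/s of sequential reads (an illustrative figure, not a measured one):

```python
# Rough time to scan a data set from one disk vs. many disks in parallel.
DISK_THROUGHPUT_MB_S = 150          # assumed sequential read speed

def scan_hours(dataset_tb, disks=1):
    """Hours needed to read dataset_tb terabytes spread over `disks` disks."""
    total_mb = dataset_tb * 1_000_000
    seconds = total_mb / (DISK_THROUGHPUT_MB_S * disks)
    return seconds / 3600

print(f"{scan_hours(10):.1f} h on 1 disk")        # ~18.5 hours
print(f"{scan_hours(10, disks=100):.2f} h on 100 disks")
```

Merely reading 10 TB once takes the better part of a day on one disk, which is exactly why the distributed approaches below split data across many machines.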

The sub-problems of big data are:




Volume (Size): The sheer size of big data plays a crucial role in processing it. The problem appears when the data volume exceeds what a single hard disk can store, like dumping data beyond the disk’s capacity.

Velocity / Speed (I/O): In the digital world, processing has to be fast; otherwise people lose patience and move on to the next product. But no matter how large a hard disk we use, processing big data on a single machine only stretches the duration of the process, which leads to a critical situation.

Variety: Variety signifies the heterogeneous nature of the data: it may come in any form, e.g., emails, PDFs, documents, audio, images, or cookies. This variety of huge unstructured data causes inconvenience while storing and analyzing it.

Veracity: Data veracity, in general, is how accurate or truthful a data set is. In the context of big data, however, it takes on more meaning: it is not just the quality of the data itself but how trustworthy the data source, the data type, and its processing are.

There is also another sub-problem: cost. Huge hard disks cost considerably more, so companies would struggle to make a profit out of the data.

“You can have data without information, but you cannot have information without data.”

–Daniel Keys Moran

This quote indicates how prominent a role processing big data plays.

How do multinational companies overcome the issue of big data?


“Processed data is information. Processed information is knowledge. Processed knowledge is wisdom.”

- Ankala V. Subbarao


The technology used to overcome this issue is the distributed storage cluster (the master-slave topology). Let’s look at some of the strategies used by famous companies.
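The master-slave idea can be captured in a toy sketch: a master keeps only metadata about which slave holds each block, while the slaves hold the actual data. All class and method names here are illustrative, not any real HDFS API:

```python
class Slave:
    """A data node: holds the actual blocks."""
    def __init__(self):
        self.blocks = {}
    def store(self, key, block):
        self.blocks[key] = block
    def fetch(self, key):
        return self.blocks[key]

class Master:
    """A name node: keeps only metadata mapping blocks to slaves."""
    def __init__(self, slaves):
        self.slaves = slaves
        self.block_map = {}           # filename -> list of slaves, one per block

    def write(self, filename, data, block_size=4):
        blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        placement = []
        for i, block in enumerate(blocks):
            slave = self.slaves[i % len(self.slaves)]   # round-robin placement
            slave.store((filename, i), block)
            placement.append(slave)
        self.block_map[filename] = placement

    def read(self, filename):
        placement = self.block_map[filename]
        return "".join(s.fetch((filename, i)) for i, s in enumerate(placement))

cluster = Master([Slave(), Slave(), Slave()])
cluster.write("log.txt", "abcdefghij")
print(cluster.read("log.txt"))   # abcdefghij
```

No single slave holds the whole file, so reads and writes can proceed in parallel; real systems add replication on top so a failed node loses nothing.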

GOOGLE


For those of us who use the internet regularly, Google is the great answerer of questions. Have a question? Common or obscure, Google is sure to turn up an answer that, if nothing else, points you in the right direction.

With all of these products and services, and the unthinkable amount of data that comes with them, how does a company like Google store its information? If we get a little meta and turn to Google with our question, we learn that the answer lies in thousands upon thousands of servers. In August 2011, Data Center Knowledge reported that the number was close to 900,000. Pretty remarkable, right?


These servers don’t all serve the same purpose, of course. Instead, each server has designated tasks. Let’s look at some of Google’s server types and the tasks they are responsible for carrying out.


1. WEB SERVERS

Google’s web servers are those that will probably resonate most with the common user, as they are responsible for handling the queries that we enter into Google Search. When a user enters a query, web servers carry out the process of interacting with other server types (e.g. index, spelling, ad, etc.) and returning results/serving ads in HTML format. Web servers are the ‘results-gathering’ servers if you will.

2. DATA-GATHERING SERVERS

Data-gathering servers do the work of collecting and organizing information for Google. These servers “spider” or crawl the internet via Googlebot (Google’s web crawler), searching for newly-added and existing content. These servers have the responsibility of indexing content, updating the index, and ranking pages based on Google’s search algorithms.
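The crawl loop can be mimicked over a toy in-memory “web”, a dict mapping each page to its outgoing links (the pages are made up for illustration; a real crawler fetches over HTTP):

```python
from collections import deque

# Toy "web": page -> the links it contains (illustrative data).
WEB = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["post1", "post2"],
    "post1": ["blog"],
    "post2": [],
}

def crawl(start):
    """Breadth-first crawl from `start`, visiting each page exactly once."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        page = queue.popleft()
        order.append(page)               # here a real crawler would index the page
        for link in WEB.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("home"))   # ['home', 'about', 'blog', 'post1', 'post2']
```

The `seen` set is what keeps the crawler from looping forever on cyclic links like home ↔ about.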

3. INDEX SERVERS

Google’s index servers are where a lot of the “magic” behind Google Search happens. These servers are responsible for returning lists of document IDs that correspond to “documents” (or indexed web pages) wherein the user’s query is present.
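Conceptually, an index server holds an inverted index: a mapping from each term to the IDs of the documents containing it. A minimal sketch with made-up documents:

```python
from collections import defaultdict

docs = {
    1: "big data holds the answers",
    2: "data is the new science",
    3: "war is information",
}

# Build the inverted index: term -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return IDs of documents containing every term in the query."""
    results = set(docs)
    for term in query.split():
        results &= index.get(term, set())
    return sorted(results)

print(search("data"))           # [1, 2]
print(search("data science"))   # [2]
```

Looking up a term is a dictionary access rather than a scan of every document, which is what makes query-time retrieval fast.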

4. DOCUMENT SERVERS

Document servers store the document version of web page content. Each page has content saved in the form of JPEG files, PDF files, and more, all of which are stored in several servers depending on the type of information. Document servers provide snippets of information to users based on the search terms entered and are capable of returning entire documents, as well.

The document IDs returned by index servers correspond to documents housed by these servers. Due to the influx of indexed documents each and every day, these servers require more disk space than any other. If we had to answer the question “Where does Google store its data?” with one server type, it would most certainly be the document server.
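Snippet generation itself is simple in principle: return a short window of text around the matched term. A hypothetical helper:

```python
def snippet(text, term, width=20):
    """Return a short window of `text` around the first match of `term`."""
    pos = text.lower().find(term.lower())
    if pos == -1:
        return ""                        # term not in this document
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    return (("..." if start > 0 else "")
            + text[start:end]
            + ("..." if end < len(text) else ""))

doc = "Big Data holds the answers, and processed data is information."
print(snippet(doc, "answers"))
```

Real document servers do this over pre-parsed page content and highlight the query terms, but the shape of the operation is the same.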

5. AD SERVERS

Ad servers are vital to both Google’s revenue stream and the livelihood of thousands of businesses. These servers are responsible for managing advertisements that are integral to Google’s AdWords and AdSense services. Web servers interact with these ad servers when deciding which ads (if any) should be displayed for a particular query.

6. SPELLING SERVERS

We didn’t all get A’s in spelling during school, and some of us need a little help when searching. If you have ever searched for something in Google and the results came up with the phrase “Did you mean …” followed by a corrected spelling, know that a spelling server was at work. No matter how search terms are entered, spelling servers perform the search anyway, taking advantage of the opportunity to learn, correct, and better locate what users seek.
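One classic building block behind “Did you mean …” is edit distance: suggest the known word closest to what the user typed. A small sketch with a toy dictionary (Google's actual system is far more sophisticated, using query logs and statistics):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

DICTIONARY = ["google", "search", "server", "spelling"]   # illustrative

def did_you_mean(word):
    """Suggest the dictionary word with the fewest edits from `word`."""
    return min(DICTIONARY, key=lambda w: edit_distance(word, w))

print(did_you_mean("serch"))   # search
print(did_you_mean("gogle"))   # google
```

In practice the “dictionary” is mined from what users actually type and click, which is why the corrections feel uncannily good.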

FACEBOOK

We all know Facebook is one of the most prominent networking sites of all time, making it easy to connect and share with family and friends online. Today, Facebook is the world's largest social network, with more than 2 billion users worldwide. So how does it manage huge volumes of data?

"So just about everything we do turns out to be a big data problem,"

said Jay Parikh, Vice President of Infrastructure Engineering at Facebook. Facebook designs its own servers and networking. It designs and builds its own data centers. Its staff writes most of its own applications and creates virtually all of its own middleware. Everything about its operational IT unites it in one extremely large system used by internal and external folks alike.


Internet users also leave vast volumes of online data behind when they pass away, commonly referred to as digital remains. Facebook deals with more than 500 terabytes of data each day. While the world comes closer together on this platform, Facebook develops algorithms that track those connections and their presence on or outside its walls to fetch the most suitable posts for its users. Whether it is your wall post, your favorite books and movies, or your workplace, Facebook analyzes every bit of your data and offers better services each time you log in. The chief technologies it uses are:

1) Hadoop

2) Scuba

3) Cassandra

4) Hive

5) Prism

6) Corona

7) Peregrine

Hadoop

“Facebook runs the world’s largest Hadoop cluster"

 says Jay Parikh, Vice President of Infrastructure Engineering, Facebook.

Basically, Facebook runs the biggest Hadoop cluster, spanning more than 4,000 machines and storing hundreds of millions of gigabytes. This extensive cluster provides some key abilities to developers:

  • Developers can freely write map-reduce programs in any language.
  • SQL has been integrated to process extensive data sets, as most of the data in Hadoop’s file system is in table format. Hence, the data becomes easily accessible to developers who know even a small subset of SQL.

Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From search, log processing, recommendation systems, and data warehousing through to video and image analysis, Hadoop empowers this social networking platform in every way possible. Facebook developed its first user-facing Hadoop application, Facebook Messenger, on top of the Hadoop database Apache HBase, which has a layered architecture supporting a plethora of messages in a single day.

Scuba

Scuba could help the Hadoop developers dive into the massive data sets and carry on ad-hoc analyses in real-time. According to Jay Parikh, “Scuba gives us this very dynamic view into how our infrastructure is doing — how our servers are doing, how our network is doing, how the different software systems are interacting.”

Cassandra

“The amount of data to be stored, the rate of growth of the data, and the requirement to serve it within strict SLAs made it very apparent that a new storage solution was absolutely essential.”

- Avinash Lakshman, Search Team, Facebook

The objective was to develop a distributed storage system dedicated to managing a large amount of structured data across multiple commodity servers without any single point of failure.
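The core trick in such a store is partitioning keys across commodity servers so that no single machine is essential. A toy consistent-hashing sketch (a simplified illustration, not Cassandra's actual implementation):

```python
import hashlib
from bisect import bisect

class Ring:
    """Consistent-hash ring: each key maps to the next node clockwise."""
    def __init__(self, nodes):
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        # A stable hash; real systems use faster non-cryptographic hashes.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        hashes = [h for h, _ in self.ring]
        idx = bisect(hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # one of node-a / node-b / node-c, stable across runs
```

The payoff of the ring layout: adding or removing a node only remaps the keys adjacent to it on the ring, rather than reshuffling the entire data set.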

Prism

Initially, when Facebook implemented Hadoop, it was not designed to run across multiple data centers. That is when the Facebook team felt the need to develop Prism, a platform that brings out many namespaces instead of the single one governed by Hadoop. This in turn makes it possible to build many logical clusters.

This system is now expandable to as many servers as possible without worrying about increasing the number of data centers.

Hive

This tool improved the query capability of Hadoop by using a subset of SQL and soon gained popularity in the unstructured world. 
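The appeal is that a familiar SQL subset replaces hand-written map-reduce jobs. The flavor can be shown with Python's built-in sqlite3 on a tiny in-memory table (purely illustrative; Hive would run an equivalent query over files in HDFS, compiling it to map-reduce jobs):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE page_views (page TEXT, visitor TEXT)")
con.executemany("INSERT INTO page_views VALUES (?, ?)",
                [("home", "u1"), ("home", "u2"), ("blog", "u1")])

# The kind of aggregation Hive lets analysts express in SQL instead of code.
rows = con.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC").fetchall()
print(rows)   # [('home', 2), ('blog', 1)]
```

One declarative line replaces a mapper, a shuffle, and a reducer, which is exactly why Hive gained popularity so quickly.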

AMAZON

Analyzing large data sets requires significant computing capacity that can vary in size based on the amount of input data and the type of analysis. This characteristic of big data workloads is ideally suited to the pay-as-you-go cloud computing model, where applications can easily scale up and down based on demand. 



As requirements change, you can easily resize your environment (horizontally or vertically) on AWS to meet your needs, without having to wait for additional hardware or being required to over-invest to provision enough capacity. For mission-critical applications on a more traditional infrastructure, system designers have no choice but to over-provision, because a surge in additional data due to an increase in business need must be something the system can handle. By contrast, on AWS you can provision more capacity and compute in a matter of minutes, meaning that your big data applications grow and shrink as demand dictates, and your system runs as close to optimal efficiency as possible.

The following services for collecting, processing, storing, and analyzing big data are described in order:

  • Amazon Kinesis

  • AWS Lambda

  • Amazon EMR (Elastic MapReduce)

  • AWS Glue

  • Amazon Machine Learning

  • Amazon DynamoDB

  • Amazon Redshift

  • Amazon Athena

  • Amazon Elasticsearch Service

  • Amazon QuickSight

In addition to these services, Amazon EC2 instances are available for self-managed big data applications.

So, in conclusion: Hadoop clusters, web servers, cloud servers, and the like are all interconnected, and together they form the distributed storage clusters that overcome the problem of big data.
