BIG DATA – a problem
In a world full of technology, millions of users connect to these technologies every day, and as the number of users grows, the amount of data grows in parallel, running into petabytes. The more users there are, the more data is generated. This ever-growing mass of data is what we call Big Data.
What is Big Data?
Big Data is a term used to describe data that is huge in volume and keeps growing with time. Such data is difficult and time-consuming to process with traditional tools. Big Data is mainly characterized by:
· Volume – the quantity of data generated and stored
· Velocity – the speed at which data is produced and processed (I/O)
How and where is this Big Data managed?
To answer this question, I did some research on one of the biggest MNCs: Google. Google is one of the largest tech companies in the world and holds a huge database of user information. It stores almost every kind of data: pictures, videos, contacts, locations, documents, search history, download history and much more. A few days ago, Facebook was in the news for leaking the personal data of more than 50 crore (500 million) of its users; Google stores even more personal data than Facebook.
Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. With all of these products and services, and the unthinkable amount of data that comes with them, how does a company like Google go about storing its information? If we get a little meta and turn to Google itself with this question, we learn that the answer lies in the functionality of thousands upon thousands of servers. In August 2011, Data Center Knowledge reported that the number was close to 900,000. Pretty remarkable, right?
Google, like any other company that generates huge amounts of data, uses cloud-style storage, because the number of users is always volatile and so the amount of data generated on any given day is volatile too. Google therefore does not use an off-the-shelf storage product for its data.
1 GB of storage costs about $0.03.
20 petabytes therefore cost 0.03 × 20 × 1,000,000 = $600,000 per day.
That's quite a bit of money: over a year it comes to roughly $219 million, like hiring 2,190 employees at $100,000 a year!
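A quick back-of-the-envelope calculation, assuming the illustrative $0.03/GB price used above, makes the scale concrete:

# Back-of-the-envelope storage cost, using the illustrative $0.03/GB figure above.
COST_PER_GB = 0.03           # assumed price in dollars
GB_PER_PETABYTE = 1_000_000  # 1 PB = 1,000,000 GB (decimal units)

daily_ingest_pb = 20
daily_cost = COST_PER_GB * daily_ingest_pb * GB_PER_PETABYTE
yearly_cost = daily_cost * 365

print(f"Cost per day : ${daily_cost:,.0f}")                         # $600,000
print(f"Cost per year: ${yearly_cost:,.0f}")                        # $219,000,000
print(f"Equivalent $100k salaries: {yearly_cost / 100_000:,.0f}")   # 2,190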
Purchasing 20 PB of hardware every day is out of the question. Google needs storage that is not only scalable but also durable.
How does Google solve this problem?
A Distributed File System (DFS) is a way of storing and reading data across many servers through the same interface as accessing a local file. Google solves its scalability problem with its own distributed file system, built around chunk-based storage and known as the Google File System (GFS).
GFS consists of 3 layers (a minimal sketch of how a read flows through them follows the list):
· The Client – handles requests for data from applications.
· The Master – stores the metadata, mainly the names of data files and the locations of their chunks.
· The Chunk Server – stores the data itself; huge files are broken down into fixed-size chunks (64 MB in GFS) and stored across servers, with replicas kept for backup.
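To make the three layers concrete, here is a minimal, hypothetical sketch (not Google's actual code; all names and paths are made up) of how a read might flow: the client asks the master only for metadata, then fetches the bytes directly from a chunk server.

# Hypothetical sketch of a GFS-style read path; names and data are illustrative.
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed 64 MB chunks

class Master:
    """Stores only metadata: file name -> list of (chunk_id, [replica servers])."""
    def __init__(self):
        self.metadata = {
            "/logs/search.log": [("chunk-001", ["cs-a", "cs-b", "cs-c"]),
                                 ("chunk-002", ["cs-b", "cs-d", "cs-e"])],
        }

    def lookup(self, path, offset):
        chunk_index = offset // CHUNK_SIZE
        return self.metadata[path][chunk_index]   # (chunk_id, replicas)

class ChunkServer:
    """Holds the actual chunk bytes; here just a dict for illustration."""
    def __init__(self, chunks):
        self.chunks = chunks

    def read(self, chunk_id, offset_in_chunk, length):
        return self.chunks[chunk_id][offset_in_chunk:offset_in_chunk + length]

def client_read(master, chunk_servers, path, offset, length):
    # 1. Ask the master WHERE the data lives (metadata only).
    chunk_id, replicas = master.lookup(path, offset)
    # 2. Read the bytes directly from one replica; no data flows through the master.
    server = chunk_servers[replicas[0]]
    return server.read(chunk_id, offset % CHUNK_SIZE, length)

# Toy wiring: every chunk server holds a small stand-in for its chunks.
servers = {name: ChunkServer({"chunk-001": b"hello from chunk 1",
                              "chunk-002": b"hello from chunk 2"})
           for name in ["cs-a", "cs-b", "cs-c", "cs-d", "cs-e"]}
print(client_read(Master(), servers, "/logs/search.log", offset=0, length=5))  # b'hello'

The key design point this illustrates is that the master hands out only metadata, while the heavy chunk traffic goes straight between clients and chunk servers.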
If you are curious about how such large amounts of data flow from one part of the system to another, with multiple master and slave machines involved, the following excerpt gives a glimpse of how Google might handle this and share enormous volumes of information across a very widely distributed network –
“A system having a resource manager, and a plurality of slaves, interconnected by a communications network. To distribute data, a master determines that a destination slave of the plurality of slaves requires data. The master then generates a list of slaves from which to transfer data to the destination slave. The master transmits the list to the resource manager. The resource manager is configured to select a source slave from the list based on available system resources.”
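In plain terms: the master decides which existing replicas could supply the data, and the resource manager picks the least-loaded source. Here is a hypothetical sketch of that selection step (function and server names are illustrative, not taken from the excerpt):

# Hypothetical sketch of the distribution step described above.
def generate_source_list(replica_locations, destination):
    """Master: list every slave that already holds the data, except the destination."""
    return [s for s in replica_locations if s != destination]

def select_source(candidates, system_load):
    """Resource manager: pick the candidate with the most available resources
    (here simply the one with the lowest current load)."""
    return min(candidates, key=lambda slave: system_load[slave])

# Example: a chunk lives on cs-a, cs-b, cs-c and must be copied to cs-f.
load = {"cs-a": 0.9, "cs-b": 0.2, "cs-c": 0.6}
candidates = generate_source_list(["cs-a", "cs-b", "cs-c"], destination="cs-f")
print(select_source(candidates, load))   # -> "cs-b", the least-loaded source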
Google’s web servers are those that will probably resonate most with the common user, as they are responsible for handling the queries that we enter into Google Search. When a user enters a query, web servers carry out the process of interacting with other server types (e.g. index, spelling, ad, etc.) and returning results/serving ads in HTML format. Web servers are the ‘results-gathering’ servers. On a similar note, Google has servers designated to perform specific tasks –
1. Data-Gathering Servers
Data-gathering servers send out bots to crawl the web.
2. Index Servers
Google’s index servers hold the lists of document IDs for the documents that contain the user’s query terms.
3. Document Servers
Document servers store copies of web page content, saved in forms such as JPEG files, PDF files, and more.
4. Ad Servers
Ad servers manage the ads shown on the search results pages.
5. Spelling Servers
If you have ever searched for something on Google and the results came up with the phrase “Did you mean <correct spelling>?”, know that a spelling server was at work.
How can we implement such a Distributed File System?
Google uses distributed computing to satisfy its customers’ needs: more than 1,000 computers are involved in answering every query. The most popular open-source framework for distributed computing is Apache Hadoop, whose storage layer, the Hadoop Distributed File System (HDFS), is designed to run on commodity hardware. The Hadoop market has a compound annual growth rate of 58% and was projected to surpass $1 billion by 2020. A rough sketch of the block-and-replica idea behind HDFS follows.
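As a rough illustration (not the real HDFS code; the datanode names are made up), this sketch shows the core idea HDFS borrows from GFS: a file is cut into fixed-size blocks (128 MB by default in recent HDFS versions) and each block is replicated, by default onto three datanodes.

# Illustrative sketch of HDFS-style block placement; datanode names are invented.
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size
REPLICATION = 3                  # default HDFS replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def place_blocks(file_size_bytes):
    """Return a block -> [datanodes] placement for a file of the given size."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    rotation = itertools.cycle(DATANODES)
    placement = {}
    for block_id in range(num_blocks):
        placement[f"block-{block_id}"] = [next(rotation) for _ in range(REPLICATION)]
    return placement

# A 1 GB file becomes 8 blocks, each stored on 3 different datanodes.
for block, nodes in place_blocks(1024 ** 3).items():
    print(block, nodes)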
What is Hadoop?
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of Big Data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware. A minimal word-count example in the MapReduce style is sketched below.
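To give a feel for the MapReduce model, here is a minimal word-count example that simulates the map, shuffle and reduce phases locally in Python. It is a sketch of the programming model only, not an actual Hadoop job.

# Minimal word count in the MapReduce style: map emits (word, 1) pairs,
# the "shuffle" groups pairs by key, and reduce sums the counts per word.
from collections import defaultdict

def map_phase(line):
    for word in line.strip().lower().split():
        yield word, 1

def reduce_phase(word, counts):
    return word, sum(counts)

def run_job(lines):
    # Shuffle step: group every emitted value by its key.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(run_job(["big data is big", "hadoop processes big data"]))
# {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}

In a real Hadoop cluster the map and reduce functions run in parallel on many machines, with HDFS holding the input and output data.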
Thanks for reading!!