What is Big Data? How do Facebook, Google and other big companies store your data?

Have you ever wondered where your data is stored online, and how the big multinationals (Google, Microsoft, Facebook, WhatsApp) store such huge amounts of it?

In this blog I explain what goes on behind the scenes when you upload a picture to Instagram or chat with your girlfriend (if you have one).

Let’s start with Social Media

Social Media

A 2016 article from Excelacom looked at what happens on the internet in one minute. Here are some figures from it:

In one minute:

  • 700,000 logins on Facebook
  • Around 530,000 photos shared on Snapchat
  • Around 350,000 tweets posted on Twitter
  • 30,000 photos shared on Instagram
  • 21 million messages sent on WhatsApp

In 2012, Facebook revealed that it was generating more than 500 terabytes of data every day, including 2.7 billion likes and around 300 million photos. Another striking figure: Facebook was scanning around 105 terabytes of data every half hour.

Reference: How Big Is Facebook’s Data?
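
To put those 2012 figures in perspective, here is a small Python sketch that converts them into per-second rates. It uses only the numbers quoted above; the use of decimal units (1 TB = 1,000 GB) is my own assumption, so read the results as rough orders of magnitude.

```python
# Rough data rates implied by the 2012 Facebook figures quoted above.
# Assumption (ours): decimal units, 1 TB = 1,000 GB.
GB_PER_TB = 1_000
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HALF_HOUR = 30 * 60

generated_tb_per_day = 500        # "500+ terabytes of data every day"
scanned_tb_per_half_hour = 105    # "105 terabytes ... every half hour"

generated_gb_per_second = generated_tb_per_day * GB_PER_TB / SECONDS_PER_DAY
scanned_gb_per_second = scanned_tb_per_half_hour * GB_PER_TB / SECONDS_PER_HALF_HOUR
scanned_tb_per_day = scanned_tb_per_half_hour * 48  # 48 half hours in a day

print(f"Generated: about {generated_gb_per_second:.1f} GB every second")
print(f"Scanned:   about {scanned_gb_per_second:.1f} GB every second")
print(f"Scanned:   about {scanned_tb_per_day:,} TB every day")
```

In other words, the data being scanned each day was roughly ten times the amount of new data arriving each day.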

Facebook has recently launched six data centers around the globe to handle its immense data efficiently.

Google

It’s pretty hard to say how much data Google stores, given how sketchy Google has been about how many active Gmail accounts are out there. Back in 2016, Google announced over 1 billion active Gmail accounts: Google Has 7 Products With 1 Billion Users. That was then.

Let’s assume they still have 1 billion Gmail accounts, which doesn’t even come close to the number of actual Google accounts: after all, if you have ever owned an Android device, you needed a Google account to use it. So for shiggles, I’ll assume the two billion Android devices Google announced were active in 2016 (Google Says There Are Now More Than 2 Billion Monthly Active Android Devices) are on a 1.5:1 ratio of devices to users (that is, half of the users own 2 devices, a phone and a tablet for instance).

Each account gets 15 GB of storage for free, plus photos stored in .jpg format outside of that 15 GB. Again for shiggles, I’ll assume each account uses half of the space (emails plus backup space; I also assume no one buys more storage, which we know is false, but Google won’t divulge that) and stores 100 photos (64 KB per photo, so 6.4 MB in total).

Simple math: 1.5B accounts × (7.5 GB + 6.4 MB per account) ≈ 11.3 exabytes, or roughly 11,300 petabytes.
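
As a sanity check on this estimate, here is a small Python sketch that redoes the arithmetic with the assumptions stated above (1.5 billion accounts, 7.5 GB used per account, 100 photos of 64 KB each). It assumes decimal units and adds each account's photo storage to its used space; both choices are mine, so treat the result as a rough figure.

```python
# Back-of-envelope storage estimate using the assumptions stated in the text.
# Assumption (ours): decimal units, e.g. 1 GB = 10**9 bytes.
KB = 10**3
MB = 10**6
GB = 10**9
PB = 10**15
EB = 10**18

accounts = 1_500_000_000        # 1.5 billion accounts, as used in the text
used_quota_bytes = 7.5 * GB     # half of the free 15 GB
photos_per_account = 100
photo_size_bytes = 64 * KB

# Photo storage is counted on top of the used quota, per account.
photo_bytes_per_account = photos_per_account * photo_size_bytes
bytes_per_account = used_quota_bytes + photo_bytes_per_account
total_bytes = accounts * bytes_per_account

print(f"Photos per account: {photo_bytes_per_account / MB:.1f} MB")
print(f"Total: {total_bytes / PB:,.0f} PB (about {total_bytes / EB:.2f} EB)")
```

Under these assumptions the total lands in the exabyte range (about 11.3 EB), and the photos barely move the number: almost all of it comes from the 7.5 GB of used quota per account.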

Never mind what else Google needs to store (e.g. their search archive, YouTube videos, their knowledge base), and let’s not forget it all needs to run in some sort of RAID array to have redundancy against hard drive failure. Let’s just say they’re keeping Seagate and Western Digital in business for now.

Now you can see how much user data these multinationals collect.

Let’s understand how they store this huge amount of data.

These companies rely on distributed file systems, which address the storage problem that Big Data creates.

So, first, let’s understand: what is Big Data?

The simplest explanation of the big data phenomenon is that, on the one hand it’s all about large amounts of data, while on the other hand it is also almost always about running analytics on those large data sets.

On the face of it, neither the volume of data nor the analytics elements are really new. For many years, enterprise organisations have accumulated growing stores of data. Some have also run analytics on that data to gain value from large information sets.

Notable here are, for example, the oil and gas industry, which has, for decades now, run very large data sets through high-performance computing (HPC) systems to model underground reserves from seismic data.

There have also been analytics in data warehousing, for example, where businesses would interrogate large data sets for business value.

The other few factors that define Big Data are:

  • VOLUME refers to the huge size of the data sets.
  • VELOCITY is the speed at which the data is generated.
  • VARIETY accounts for the different sources of Big Data.
  • VERACITY is the quality of the data that has been generated.
  • VALUE is the useful information that businesses can extract from the data.

Now, what is a distributed file system?

A distributed system contains multiple nodes that are physically separate but linked together over a network. All the nodes in this system communicate with each other and handle processes in tandem. Each of these nodes contains a small part of the distributed operating system software. A distributed file system applies the same idea to storage: files are split into pieces and spread across many nodes.
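
To make this concrete for file storage, here is a minimal Python sketch of how a distributed file system might split a file into blocks and keep each block on more than one node. The block size, the node names, and the round-robin placement are illustrative assumptions of mine, not how any particular system (HDFS included) actually places data.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the file content into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(block_count: int, nodes: list[str], replicas: int = 2) -> dict[int, list[str]]:
    """Assign every block to `replicas` different nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replicas)]
        for block_id in range(block_count)
    }


photo = b"x" * 10  # stand-in for an uploaded file (hypothetical data)
blocks = split_into_blocks(photo, block_size=4)
placement = place_blocks(len(blocks), ["node-a", "node-b", "node-c"])

for block_id, holders in placement.items():
    print(f"block {block_id} ({len(blocks[block_id])} bytes) -> {holders}")
```

Real systems use far larger blocks and smarter placement, but the shape is the same: no single machine holds the whole file, and every block lives in more than one place.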

[Figure: diagram of a distributed system]


Types of Distributed Systems

The nodes in a distributed system can be arranged as a client/server system or as a peer-to-peer system. Details about these are as follows:

Client/Server Systems

In client/server systems, the client requests a resource and the server provides that resource. A server may serve multiple clients at the same time, while a client is in contact with only one server. The client and server usually communicate via a computer network, and so they are part of a distributed system.
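
The request/response pattern described above can be sketched in a few lines of Python using the standard library's socket and socketserver modules. The resource names and the one-line text protocol are made up for illustration; this is a toy that runs on one machine, not a model of how any real service is built.

```python
import socket
import socketserver
import threading


class ResourceHandler(socketserver.StreamRequestHandler):
    """Server side: read one requested name and reply with that resource."""

    RESOURCES = {"greeting": "hello from the server"}  # hypothetical data

    def handle(self):
        name = self.rfile.readline().decode().strip()
        reply = self.RESOURCES.get(name, "not found")
        self.wfile.write((reply + "\n").encode())


def request_resource(address, name):
    """Client side: connect, send the resource name, return the reply."""
    with socket.create_connection(address) as conn:
        conn.sendall((name + "\n").encode())
        return conn.makefile().readline().strip()


def demo():
    # Port 0 asks the OS for any free port. ThreadingTCPServer handles each
    # client in its own thread, so it can serve several clients at once.
    with socketserver.ThreadingTCPServer(("127.0.0.1", 0), ResourceHandler) as server:
        threading.Thread(target=server.serve_forever, daemon=True).start()
        replies = [
            request_resource(server.server_address, name)
            for name in ("greeting", "missing")
        ]
        server.shutdown()
    return replies


print(demo())
```

The client only knows the server's address and what to ask for; everything about where the resource lives stays on the server side.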

Peer to Peer Systems

Peer-to-peer systems contain nodes that are equal participants in data sharing. All the tasks are divided equally between all the nodes. The nodes interact with each other as required and share resources. This is done with the help of a network.

Advantages of Distributed Systems

Some advantages of distributed systems are as follows:

  • All the nodes in the distributed system are connected to each other. So nodes can easily share data with other nodes.
  • More nodes can easily be added to the distributed system i.e. it can be scaled as required.
  • Failure of one node does not lead to the failure of the entire distributed system. Other nodes can still communicate with each other.
  • Resources like printers can be shared with multiple nodes rather than being restricted to just one.
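
The point about node failure can be illustrated with a toy in-memory cluster in Python. Everything here (the class name, the way nodes are chosen for a key, two copies per value) is a simplified assumption for the sake of the example; real systems handle placement and recovery far more carefully.

```python
class TinyCluster:
    """Toy in-memory cluster: each value is copied to `replicas` nodes."""

    def __init__(self, node_names, replicas=2):
        self.nodes = {name: {} for name in node_names}  # node -> its local data
        self.down = set()                               # nodes currently failed
        self.replicas = replicas

    def holders(self, key):
        # Deterministic choice of nodes for a key (simple, not production-grade).
        names = sorted(self.nodes)
        start = sum(key.encode()) % len(names)
        return [names[(start + i) % len(names)] for i in range(self.replicas)]

    def put(self, key, value):
        # Write the value to every node that should hold a copy.
        for name in self.holders(key):
            self.nodes[name][key] = value

    def get(self, key):
        # Read from the first copy whose node is still up.
        for name in self.holders(key):
            if name not in self.down:
                return self.nodes[name][key]
        raise RuntimeError(f"all replicas of {key!r} are unavailable")

    def fail(self, name):
        self.down.add(name)


cluster = TinyCluster(["node-a", "node-b", "node-c"])
cluster.put("photo.jpg", b"pixels")

primary, backup = cluster.holders("photo.jpg")
cluster.fail(primary)                 # simulate one machine going down
print(cluster.get("photo.jpg"))       # still served from the other copy
```

The read succeeds because a second node holds a copy; only if every node holding that key failed would the data become unavailable.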


How Is Facebook Deploying Big Data?

There is a combined workforce of people and technology constantly working behind the successful implementation of this platform. Though the platform is continuously being enriched, below are the prime technological aspects:

Hadoop

“Facebook runs the world’s largest Hadoop cluster” says Jay Parikh, Vice President Infrastructure Engineering, Facebook.

Basically, Facebook runs the biggest Hadoop cluster, which spans more than 4,000 machines and stores hundreds of millions of gigabytes of data. This extensive cluster provides some key abilities to developers:

  • Developers can freely write MapReduce programs in any language.
  • SQL has been integrated to process extensive data sets, as most of the data in Hadoop’s file system is in table format. Hence, it is easily accessible to developers using small subsets of SQL.
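
For readers who have not seen a MapReduce program, here is a minimal single-machine sketch of the idea in Python, using the classic word-count example. The function names and the input lines are made up; on a real Hadoop cluster the framework runs the map and reduce steps on many machines and does the grouping in between for you.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]  # made-up input
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped).items())

print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In an SQL layer on top of Hadoop, the same aggregation would typically be written as a single GROUP BY query, which is why SQL access makes the data so much more approachable.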

Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From searching, log processing, recommendation systems, and data warehousing to video and image analysis, Hadoop is empowering this social networking platform in every way possible. Facebook developed its first user-facing application, Facebook Messenger, on the Hadoop database, i.e., Apache HBase, which has a layered architecture that supports a plethora of messages in a single day.

Why are big companies using Hadoop for Big Data?

Forrester once predicted that enterprise adoption of Hadoop will become mandatory. While some companies are still struggling with their Hadoop projects, others are using the big data framework to revolutionize their data storage and analytics.

The advantages of Hadoop — flexibility and lower costs — appeal to enterprises, so Hadoop has fundamentally changed how businesses process and store very large, fast-moving data sets. With additional software like Kognitio, organizations can also achieve high-speed BI and analytics on their Hadoop-based data.

But have you ever wondered which household-named brands and businesses have made a true success of Hadoop for big data analytics, and how?

Here are five businesses successfully using Hadoop:

1. Marks and Spencer

In 2015, Marks and Spencer adopted Cloudera Enterprise to analyze its data from multiple sources. The goal for the British retail business was to better understand its customers’ behavior.

Marks and Spencer uses Hadoop to plug gaps in campaign management and to manage customer loyalty data, and it uses data from digital assets to help create more personalized and targeted communications.

Thanks to its decision to use Hadoop, the company can now successfully predict stock demand and use business analytics to keep its shelves full during peak times.

2. Royal Mail

British postal service company Royal Mail used Hadoop to pave the way for its big data strategy, and to gain more value from its internal data.

The business used Hortonworks’ Hadoop analytics tools to transform the way it managed data across the organization. Royal Mail can now identify customers in particular industries who are most at risk of churn, allowing the sales and marketing teams to take proactive preventative steps. It also enables the company to find new ways of integrating the tech with its more conventional tools.

3. Royal Bank of Scotland

As a driver of enhanced customer experiences, Royal Bank of Scotland (RBS) decided to use Hadoop (Cloudera Enterprise) to gain intelligence from its online customer chat conversations.

RBS processes around 250,000 chat logs and associated metadata per month, storing this unstructured data in Hadoop. By using a big data management and analytics hub built on Hadoop, the business uses machine learning as well as data wrangling to map and understand its customers’ journeys.

The high street bank is also using big data analytics to delve into transactional data to analyze and identify where customers are paying twice for financial products, and deliver enhanced customer experiences.

4. British Airways

British Airways deployed Hadoop in April 2015 as a data archive for legal cases. Previously these were stored on an enterprise data warehouse, which was costly for the airline.

Since deploying Hortonworks HDP 2.2, British Airways has gained ROI within a year and is able to deliver 75% more free space for new projects, translating directly into cost reductions for the airline.

5. Expedia

Expedia makes use of Hadoop clusters using Amazon Elastic MapReduce (Amazon EMR) to analyze high volumes of data coming from Expedia’s global network of websites. These include clickstream, user interaction, and supply data. Highly valuable for allocating marketing spend, this data is merged from web bookings, marketing departments and marketing spend logs to analyze whether the outlay has equated to increased bookings.

The firm has seen costs drop and can process and analyze higher volumes of data.

There are many high profile businesses using Hadoop for lower-cost and big data BI and analytics, delivering enhanced customer insights, better user experiences and greater business returns.
