Why and How the World Is Facing the Problem of Big Data

What is Big Data?

Big Data is still data, just at a huge scale. The term is used to describe collections of data that are huge in volume and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.

The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time. But the concept of big data gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s:


Volume: Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media and more. In the past, storing it would have been a problem – but cheaper storage on platforms like data lakes and Hadoop have eased the burden.

Velocity: With the growth in the Internet of Things, data streams in to businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.

Variety: Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audios, stock ticker data and financial transactions.

Some Facts :


Big data is getting bigger every minute in almost every sector, be it tech, media, retail, financial service, travel, and social media, to name just a few. The volume of data processing we are talking about is mind-boggling. Here is some statistical information to give you an idea:

  • The Weather Channel receives 18,055,555 forecast requests every minute.
  • Netflix users stream 97,222 hours of video every minute.
  • Skype users make 176,220 calls every minute.
  • Instagram users post 49,380 photos every minute.

This is the sixth edition of DOMO's report, and according to their research:

"Over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there. By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth."

No doubt you've read the stat that some 90% of the world's data has been created in the last two years; the per-minute figures above give an amazing overview of the online usage growth behind it.

Each minute of every day, a staggering amount of activity takes place on the internet. If we do some quick calculations using the per-minute figures above, we can estimate how much data is created each day: there are 1,440 minutes per day, so each per-minute figure simply gets multiplied by 1,440.
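As a minimal sketch of that arithmetic (using only the per-minute numbers quoted earlier; real traffic fluctuates, so treat these as order-of-magnitude estimates):

```python
# Convert the per-minute statistics quoted above into rough per-day totals.
MINUTES_PER_DAY = 24 * 60  # 1,440

per_minute_stats = {
    "Weather Channel forecast requests": 18_055_555,
    "Netflix hours streamed": 97_222,
    "Skype calls": 176_220,
    "Instagram photos posted": 49_380,
}

for name, per_minute in per_minute_stats.items():
    print(f"{name}: ~{per_minute * MINUTES_PER_DAY:,} per day")
```

Multiplying out, that is roughly 26 billion forecast requests and 140 million hours of streamed video every single day, which is why these reports usually quote per-minute numbers.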

Facebook

The statistic shows that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is generated mainly through photo and video uploads, message exchanges, comments and so on. Handling data of this scale and variety is exactly why we need to understand what Big Data is; it is very difficult to manage with traditional tools.



Importance and Benefits :


Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

Data volumes are continuing to grow and so are the possibilities of what can be done with so much raw data available. However, organizations need to be able to know just what they can do with that data and how much they can leverage to build insights for their consumers, products, and services. Of the 85% of companies using Big Data, only 37% have been successful in data-driven insights. A 10% increase in the accessibility of the data can lead to an increase of $65Mn in the net income of a company.

The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making. When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:

  • Determining root causes of failures, issues and defects in near-real time.
  • Generating coupons at the point of sale based on the customer’s buying habits.
  • Recalculating entire risk portfolios in minutes.
  • Detecting fraudulent behavior before it affects your organization.

Common Big Data Challenges


Handling a Large Amount of Data

There has been a huge explosion in the data available. Look back a few years and compare it with today, and you will see an exponential increase in the data that enterprises can access. They have data for everything, from what a consumer likes, to how they react to a particular scent, to the amazing restaurant that opened in Italy last weekend.

This data exceeds what can comfortably be stored, computed and retrieved. The challenge is not so much the availability of data as its management. With statistics claiming that, by 2020, the accumulated data would stretch 6.6 times the distance between the earth and the moon, this is definitely a challenge.

Along with the rise in unstructured data, there has also been a rise in the number of data formats. Video, audio, social media and smart-device data are just a few examples.

Some of the newest ways developed to manage this data are a hybrid of relational databases combined with NoSQL databases. An example of this is MongoDB, which is an inherent part of the MEAN stack. There are also distributed computing systems like Hadoop to help manage Big Data volumes.

  • Netflix is a content streaming platform based on Node.js. With the increased load of content and the complex formats available on the platform, they needed a stack that could handle the storage and retrieval of the data. They used the MEAN stack, and with its non-relational (document) database model, they could in fact manage the data.
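As a minimal sketch of the document-store approach mentioned above (assuming a MongoDB instance running locally and the pymongo driver; the database, collection and field names are invented purely for illustration):

```python
# Minimal sketch: heterogeneous, schema-less records in MongoDB via pymongo.
# Assumes MongoDB is running locally on the default port; "media_app" and
# "events" are hypothetical names used only for this example.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["media_app"]["events"]

# Documents in the same collection do not need to share a schema,
# which is what makes document stores convenient for varied data formats.
events.insert_many([
    {"type": "photo_upload", "user": "alice", "size_mb": 2.4, "tags": ["travel"]},
    {"type": "comment", "user": "bob", "text": "Nice shot!", "on_post": 42},
    {"type": "video_view", "user": "carol", "duration_s": 310, "device": "mobile"},
])

# Query only the photo uploads, regardless of the other document shapes.
for doc in events.find({"type": "photo_upload"}):
    print(doc["user"], doc["size_mb"], "MB")
```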

Real-time can be Complex

When I say data, I’m not limiting this to the “stagnant” data available at common disposal. A lot of data keeps updating every second, and organizations need to be aware of that too. For instance, if a retail company wants to analyze customer behavior, real-time data from their current purchases can help. There are data analysis tools built to handle exactly this velocity (and veracity) of data; they come with ETL engines, visualization and computation engines, frameworks and other necessary inputs.

It is important for businesses to keep themselves updated with this data, along with the “stagnant” and always available data. This will help build better insights and enhance decision-making capabilities.

However, not all organizations are able to keep up with real-time data, as they are not updated with the evolving nature of the tools and technologies needed. Currently, there are a few reliable tools, though many still lack the necessary sophistication.
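To make "real-time" concrete, here is a minimal, self-contained sketch of the kind of rolling aggregation a retailer might run over a live purchase feed. It uses plain Python with a simulated event generator standing in for a real stream; in production this logic would typically live in a dedicated stream-processing engine, and the event fields here are hypothetical:

```python
# Minimal sketch: a rolling 60-second revenue total over a simulated stream
# of purchase events. The generator stands in for a real event feed (e.g. a
# message queue); field names such as "sku" and "amount" are made up.
import random
import time
from collections import deque

WINDOW_SECONDS = 60

def purchase_stream(n_events=1000):
    """Simulated live feed of purchase events."""
    for _ in range(n_events):
        yield {"ts": time.time(),
               "sku": random.choice(["A", "B", "C"]),
               "amount": round(random.uniform(5, 50), 2)}
        time.sleep(0.01)

window = deque()     # (timestamp, amount) pairs currently inside the window
running_total = 0.0

for event in purchase_stream():
    window.append((event["ts"], event["amount"]))
    running_total += event["amount"]

    # Evict events older than the window so the total stays "near real time".
    while window and window[0][0] < event["ts"] - WINDOW_SECONDS:
        _, old_amount = window.popleft()
        running_total -= old_amount

    print(f"revenue in the last {WINDOW_SECONDS}s: {running_total:.2f}")
```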


Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.



How do big MNCs like Google, Facebook, and Instagram store, manage, and manipulate thousands of terabytes of data with high speed and high efficiency?

The answer lies in technologies like Distributed Storage and Distributed Computing.


Distributed Storage :


Distributed Storage, here, collectively refers to "distributed data stores" (also called "distributed databases") and "distributed file systems".

The core concept is to form redundancy in the storage of data by splitting up data into multiple parts, and ensuring there are replicas across multiple physical servers (often in various storage capacities).

A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.
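A toy sketch of that core idea, splitting data into chunks and placing replicas of each chunk on several nodes; this is a conceptual illustration only, not the API of any real storage product:

```python
# Toy illustration of the distributed-storage concept: split data into chunks
# and place replicas of each chunk on several distinct "servers".
import itertools

CHUNK_SIZE = 64          # bytes here; real systems use blocks of e.g. 64-128 MB
REPLICATION_FACTOR = 3   # each chunk is stored on 3 different nodes
NODES = ["node-1", "node-2", "node-3", "node-4", "node-5"]

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(num_chunks: int):
    """Round-robin placement: each chunk gets REPLICATION_FACTOR distinct nodes."""
    placement = {}
    ring = itertools.cycle(range(len(NODES)))
    for chunk_id in range(num_chunks):
        start = next(ring)
        placement[chunk_id] = [NODES[(start + r) % len(NODES)]
                               for r in range(REPLICATION_FACTOR)]
    return placement

data = b"some large file contents ..." * 20
chunks = split_into_chunks(data)
for chunk_id, nodes in place_replicas(len(chunks)).items():
    print(f"chunk {chunk_id} ({len(chunks[chunk_id])} bytes) -> {nodes}")
```

Because every chunk lives on several machines, losing one server does not lose any data, and reads can be served by whichever replica is closest or least busy.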

Distributed Computing :


Distributed computing is a model in which components of a software system are shared among multiple computers to improve efficiency and performance. 

According to the narrowest of definitions, distributed computing is limited to programs with components shared among computers within a limited geographic area. Broader definitions include shared tasks as well as program components. In the broadest sense of the term, distributed computing just means that something is shared among multiple systems which may also be in different locations. 
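As a minimal sketch of the idea, here one large job is split into independent slices and farmed out to several worker processes on a single machine; on a real cluster the workers would be separate computers coordinated over the network:

```python
# Minimal sketch of distributed computing on a single machine: one large job
# (summing the squares of ten million numbers) is split into independent
# slices, farmed out to worker processes, and the partial results combined.
from multiprocessing import Pool

N = 10_000_000        # size of the overall job
NUM_WORKERS = 4       # stand-ins for separate machines

def sum_of_squares(bounds):
    """Each worker handles one independent slice of the range."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))

if __name__ == "__main__":
    step = N // NUM_WORKERS
    slices = [(i * step, N if i == NUM_WORKERS - 1 else (i + 1) * step)
              for i in range(NUM_WORKERS)]

    with Pool(processes=NUM_WORKERS) as pool:
        partial_results = pool.map(sum_of_squares, slices)

    # Combine the partial results into the final answer.
    print("total:", sum(partial_results))
```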

What is Hadoop?



 Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
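To make the MapReduce model concrete, here is the classic word-count job written as a Hadoop Streaming mapper and reducer in Python (a minimal sketch; real deployments often use higher-level tools such as Hive or Spark on top of the same storage):

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word on stdin.
# Hadoop Streaming feeds each mapper one split of the input file(s).
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts for each word.
# Hadoop Streaming sorts mapper output by key before it reaches the reducer,
# so all lines for the same word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this is submitted with the Hadoop Streaming jar, roughly `hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs input dir> -output <hdfs output dir>` (the exact jar path and flags depend on the installation). The framework then splits the input across DataNodes, runs mappers where the blocks live (data locality), and sorts the intermediate keys before the reducers run.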

Hadoop cluster :



Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. 

Such clusters run Hadoop's open source distributed processing software on low-cost commodity computers. Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker; these are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker; these are the slaves. Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between nodes is the network that connects them. 

Hadoop clusters are known for boosting the speed of data analysis applications. They also are highly scalable: If a cluster's processing power is overwhelmed by growing volumes of data, additional cluster nodes can be added to increase throughput. Hadoop clusters also are highly resistant to failure because each piece of data is copied onto other cluster nodes, which ensures that the data is not lost if one node fails.



Facebook has the world’s largest Hadoop cluster; other prominent users include Google, Yahoo, and IBM. Facebook uses Hadoop for data warehousing and runs the largest Hadoop storage cluster in the world. Some of the properties of Facebook's HDFS cluster are listed below (a quick back-of-the-envelope tally follows the list):

  • HDFS cluster of 21 PB storage capacity
  • 2000 machines (1200 machines with 8 cores each + 800 machines with 16 cores each)
  • 12 TB per machine and 32 GB of RAM per machine
  • 15 map-reduce tasks per machine
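A quick back-of-the-envelope tally of those figures (just arithmetic on the numbers quoted above, not additional data about the cluster):

```python
# Back-of-the-envelope arithmetic on the Facebook HDFS cluster figures above.
machines_8_core, machines_16_core = 1200, 800
disk_tb_per_machine = 12
ram_gb_per_machine = 32

total_machines = machines_8_core + machines_16_core            # 2,000 machines
total_cores = machines_8_core * 8 + machines_16_core * 16      # 22,400 cores
raw_disk_pb = total_machines * disk_tb_per_machine / 1000      # 24 PB of raw disk
total_ram_tb = total_machines * ram_gb_per_machine / 1000      # 64 TB of RAM

print(total_machines, total_cores, raw_disk_pb, total_ram_tb)
```

That works out to about 22,400 CPU cores, 24 PB of raw disk and 64 TB of RAM across the 2,000 machines, which puts the quoted 21 PB HDFS capacity figure in context.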
