Why and How the World Is Facing the Problem of Big Data

What is Big Data?

Big Data is still data, just at a huge scale. The term is used to describe collections of data that are huge in volume and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.

The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time. But the concept of big data gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s:


Volume: Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media and more. In the past, storing it would have been a problem – but cheaper storage on platforms like data lakes and Hadoop have eased the burden.

Velocity: With the growth in the Internet of Things, data streams in to businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.

Variety: Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audios, stock ticker data and financial transactions.

Some Facts :


Big data is getting bigger every minute in almost every sector, be it tech, media, retail, financial service, travel, and social media, to name just a few. The volume of data processing we are talking about is mind-boggling. Here is some statistical information to give you an idea:

  • The Weather Channel receives 18,055,555 forecast requests every minute.
  • Netflix users stream 97,222 hours of video every minute.
  • Skype users make 176,220 calls every minute.
  • Instagram users post 49,380 photos every minute.

This is the sixth edition of DOMO's report, and according to their research:

"Over 2.5 quintillion bytes of data are created every single day, and it’s only going to grow from there. By 2020, it’s estimated that 1.7MB of data will be created every second for every person on earth."

No doubt you've read the stat that some 90% of the world's data has been created in the last two years; the per-minute figures above give an amazing overview of the online usage growth behind it.

Each minute of every day, a staggering amount of activity takes place on the internet. If we do some quick calculations using the per-minute figures above, we can estimate how much data is created each day: there are 1,440 minutes per day, so each per-minute figure simply gets multiplied by 1,440.
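As a minimal sketch of that arithmetic (using only the per-minute numbers quoted earlier; real traffic fluctuates, so treat these as order-of-magnitude estimates):

```python
# Convert the per-minute statistics quoted above into rough per-day totals.
MINUTES_PER_DAY = 24 * 60  # 1,440

per_minute_stats = {
    "Weather Channel forecast requests": 18_055_555,
    "Netflix hours streamed": 97_222,
    "Skype calls": 176_220,
    "Instagram photos posted": 49_380,
}

for name, per_minute in per_minute_stats.items():
    print(f"{name}: ~{per_minute * MINUTES_PER_DAY:,} per day")
```

Multiplying out, that is roughly 26 billion forecast requests and 140 million hours of streamed video every single day, which is why these reports usually quote per-minute numbers.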

Facebook

The statistic shows that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is generated mainly through photo and video uploads, message exchanges, comments and so on. Handling data of this scale and variety is exactly why we need to understand what Big Data is; it is very difficult to manage with traditional tools.



Importance and Benefits :


Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

Data volumes are continuing to grow and so are the possibilities of what can be done with so much raw data available. However, organizations need to be able to know just what they can do with that data and how much they can leverage to build insights for their consumers, products, and services. Of the 85% of companies using Big Data, only 37% have been successful in data-driven insights. A 10% increase in the accessibility of the data can lead to an increase of $65Mn in the net income of a company.

The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making. When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:

  • Determining root causes of failures, issues and defects in near-real time.
  • Generating coupons at the point of sale based on the customer’s buying habits.
  • Recalculating entire risk portfolios in minutes.
  • Detecting fraudulent behavior before it affects your organization.

Common Big Data Challenges


Handling a Large Amount of Data

There has been a huge explosion in the data available. Look back a few years and compare it with today, and you will see an exponential increase in the data that enterprises can access. They have data for everything, from what a consumer likes, to how they react to a particular scent, to the amazing restaurant that opened in Italy last weekend.

This data exceeds what can comfortably be stored, computed and retrieved. The challenge is not so much the availability of data as its management. With statistics claiming that, by 2020, the accumulated data would stretch 6.6 times the distance between the earth and the moon, this is definitely a challenge.

Along with the rise in unstructured data, there has also been a rise in the number of data formats. Video, audio, social media and smart-device data are just a few examples.

Some of the newest ways developed to manage this data are a hybrid of relational databases combined with NoSQL databases. An example of this is MongoDB, which is an inherent part of the MEAN stack. There are also distributed computing systems like Hadoop to help manage Big Data volumes.

  • Netflix is a content streaming platform based on Node.js. With the increased load of content and the complex formats available on the platform, they needed a stack that could handle the storage and retrieval of the data. They used the MEAN stack, and with its non-relational (document) database model, they could in fact manage the data.
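As a minimal sketch of the document-store approach mentioned above (assuming a MongoDB instance running locally and the pymongo driver; the database, collection and field names are invented purely for illustration):

```python
# Minimal sketch: heterogeneous, schema-less records in MongoDB via pymongo.
# Assumes MongoDB is running locally on the default port; "media_app" and
# "events" are hypothetical names used only for this example.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["media_app"]["events"]

# Documents in the same collection do not need to share a schema,
# which is what makes document stores convenient for varied data formats.
events.insert_many([
    {"type": "photo_upload", "user": "alice", "size_mb": 2.4, "tags": ["travel"]},
    {"type": "comment", "user": "bob", "text": "Nice shot!", "on_post": 42},
    {"type": "video_view", "user": "carol", "duration_s": 310, "device": "mobile"},
])

# Query only the photo uploads, regardless of the other document shapes.
for doc in events.find({"type": "photo_upload"}):
    print(doc["user"], doc["size_mb"], "MB")
```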

Real-time can be Complex

When I say data, I’m not limiting this to the “stagnant” data available at common disposal. A lot of data keeps updating every second, and organizations need to be aware of that too. For instance, if a retail company wants to analyze customer behavior, real-time data from their current purchases can help. There are data analysis tools built to handle exactly this velocity (and veracity) of data; they come with ETL engines, visualization and computation engines, frameworks and other necessary inputs.

It is important for businesses to keep themselves updated with this data, along with the “stagnant” and always available data. This will help build better insights and enhance decision-making capabilities.

However, not all organizations are able to keep up with real-time data, as they are not updated with the evolving nature of the tools and technologies needed. Currently, there are a few reliable tools, though many still lack the necessary sophistication.
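To make "real-time" concrete, here is a minimal, self-contained sketch of the kind of rolling aggregation a retailer might run over a live purchase feed. It uses plain Python with a simulated event generator standing in for a real stream; in production this logic would typically live in a dedicated stream-processing engine, and the event fields here are hypothetical:

```python
# Minimal sketch: a rolling 60-second revenue total over a simulated stream
# of purchase events. The generator stands in for a real event feed (e.g. a
# message queue); field names such as "sku" and "amount" are made up.
import random
import time
from collections import deque

WINDOW_SECONDS = 60

def purchase_stream(n_events=1000):
    """Simulated live feed of purchase events."""
    for _ in range(n_events):
        yield {"ts": time.time(),
               "sku": random.choice(["A", "B", "C"]),
               "amount": round(random.uniform(5, 50), 2)}
        time.sleep(0.01)

window = deque()     # (timestamp, amount) pairs currently inside the window
running_total = 0.0

for event in purchase_stream():
    window.append((event["ts"], event["amount"]))
    running_total += event["amount"]

    # Evict events older than the window so the total stays "near real time".
    while window and window[0][0] < event["ts"] - WINDOW_SECONDS:
        _, old_amount = window.popleft()
        running_total -= old_amount

    print(f"revenue in the last {WINDOW_SECONDS}s: {running_total:.2f}")
```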


Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.



How do big MNCs like Google, Facebook, and Instagram store, manage, and manipulate thousands of terabytes of data with high speed and high efficiency?

The answer lies in technologies like Distributed Storage and Distributed Computing.


Distributed Storage :


Distributed Storage, here, collectively refers to "distributed data stores" (also called "distributed databases") and "distributed file systems".

The core concept is to form redundancy in the storage of data by splitting up data into multiple parts, and ensuring there are replicas across multiple physical servers (often in various storage capacities).

A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.
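A toy sketch of that core idea, splitting data into chunks and placing replicas of each chunk on several nodes; this is a conceptual illustration only, not the API of any real storage product:

```python
# Toy illustration of the distributed-storage concept: split data into chunks
# and place replicas of each chunk on several distinct "servers".
import itertools

CHUNK_SIZE = 64          # bytes here; real systems use blocks of e.g. 64-128 MB
REPLICATION_FACTOR = 3   # each chunk is stored on 3 different nodes
NODES = ["node-1", "node-2", "node-3", "node-4", "node-5"]

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(num_chunks: int):
    """Round-robin placement: each chunk gets REPLICATION_FACTOR distinct nodes."""
    placement = {}
    ring = itertools.cycle(range(len(NODES)))
    for chunk_id in range(num_chunks):
        start = next(ring)
        placement[chunk_id] = [NODES[(start + r) % len(NODES)]
                               for r in range(REPLICATION_FACTOR)]
    return placement

data = b"some large file contents ..." * 20
chunks = split_into_chunks(data)
for chunk_id, nodes in place_replicas(len(chunks)).items():
    print(f"chunk {chunk_id} ({len(chunks[chunk_id])} bytes) -> {nodes}")
```

Because every chunk lives on several machines, losing one server does not lose any data, and reads can be served by whichever replica is closest or least busy.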

Distributed Computing :


Distributed computing is a model in which components of a software system are shared among multiple computers to improve efficiency and performance. 

According to the narrowest of definitions, distributed computing is limited to programs with components shared among computers within a limited geographic area. Broader definitions include shared tasks as well as program components. In the broadest sense of the term, distributed computing just means that something is shared among multiple systems which may also be in different locations. 
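As a minimal sketch of the idea, here one large job is split into independent slices and farmed out to several worker processes on a single machine; on a real cluster the workers would be separate computers coordinated over the network:

```python
# Minimal sketch of distributed computing on a single machine: one large job
# (summing the squares of ten million numbers) is split into independent
# slices, farmed out to worker processes, and the partial results combined.
from multiprocessing import Pool

N = 10_000_000        # size of the overall job
NUM_WORKERS = 4       # stand-ins for separate machines

def sum_of_squares(bounds):
    """Each worker handles one independent slice of the range."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))

if __name__ == "__main__":
    step = N // NUM_WORKERS
    slices = [(i * step, N if i == NUM_WORKERS - 1 else (i + 1) * step)
              for i in range(NUM_WORKERS)]

    with Pool(processes=NUM_WORKERS) as pool:
        partial_results = pool.map(sum_of_squares, slices)

    # Combine the partial results into the final answer.
    print("total:", sum(partial_results))
```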

What is Hadoop?



 Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
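To make the MapReduce model concrete, here is the classic word-count job written as a Hadoop Streaming mapper and reducer in Python (a minimal sketch; real deployments often use higher-level tools such as Hive or Spark on top of the same storage):

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word on stdin.
# Hadoop Streaming feeds each mapper one split of the input file(s).
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts for each word.
# Hadoop Streaming sorts mapper output by key before it reaches the reducer,
# so all lines for the same word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this is submitted with the Hadoop Streaming jar, roughly `hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs input dir> -output <hdfs output dir>` (the exact jar path and flags depend on the installation). The framework then splits the input across DataNodes, runs mappers where the blocks live (data locality), and sorts the intermediate keys before the reducers run.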

Hadoop cluster :



Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. 

Such clusters run Hadoop's open source distributed processing software on low-cost commodity computers. Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker; these are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker; these are the slaves. Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between nodes is the network that connects them. 

Hadoop clusters are known for boosting the speed of data analysis applications. They also are highly scalable: If a cluster's processing power is overwhelmed by growing volumes of data, additional cluster nodes can be added to increase throughput. Hadoop clusters also are highly resistant to failure because each piece of data is copied onto other cluster nodes, which ensures that the data is not lost if one node fails.



Facebook has the world’s largest Hadoop cluster; other prominent users include Google, Yahoo, and IBM. Facebook uses Hadoop for data warehousing and runs the largest Hadoop storage cluster in the world. Some of the properties of Facebook's HDFS cluster are listed below (a quick back-of-the-envelope tally follows the list):

  • HDFS cluster of 21 PB storage capacity
  • 2000 machines (1200 machines with 8 cores each + 800 machines with 16 cores each)
  • 12 TB per machine and 32 GB of RAM per machine
  • 15 map-reduce tasks per machine
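A quick back-of-the-envelope tally of those figures (just arithmetic on the numbers quoted above, not additional data about the cluster):

```python
# Back-of-the-envelope arithmetic on the Facebook HDFS cluster figures above.
machines_8_core, machines_16_core = 1200, 800
disk_tb_per_machine = 12
ram_gb_per_machine = 32

total_machines = machines_8_core + machines_16_core            # 2,000 machines
total_cores = machines_8_core * 8 + machines_16_core * 16      # 22,400 cores
raw_disk_pb = total_machines * disk_tb_per_machine / 1000      # 24 PB of raw disk
total_ram_tb = total_machines * ram_gb_per_machine / 1000      # 64 TB of RAM

print(total_machines, total_cores, raw_disk_pb, total_ram_tb)
```

That works out to about 22,400 CPU cores, 24 PB of raw disk and 64 TB of RAM across the 2,000 machines, which puts the quoted 21 PB HDFS capacity figure in context.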
