
The World of “Big Data”

Tushar Dighe

DevOps/Cloud Engineer | Kubernetes | GCP | AWS | Azure | Wiz | Cycode | Security

Published: September 26, 2020

What is data?

Data is the quantities, characters, or symbols on which a computer performs operations. It may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

What is Big Data?

Big Data is a bundle of problems that arise from very large volumes of data. It was originally defined by the “3Vs”, but it is now common to speak of the “5Vs” of Big Data. These are the characteristics of Big Data, and also the problems we face when handling it. They are as follows:

1. Volume:

Volume refers to the sheer amount of data. Size plays a crucial role in determining the value of data, and whether a particular data set counts as Big Data depends on its volume. Volume is therefore the first characteristic to consider when dealing with Big Data.

2. Velocity:

Velocity refers to the speed at which data accumulates. Data flows in from sources such as machines, networks, social media, and mobile phones in a massive, continuous stream. Velocity determines the potential of the data: how fast it is generated and how fast it must be processed to meet demand.

You can see how fast data is being generated in real time on the InternetLiveStats website.

3. Variety:

Variety refers to the nature of the data: structured, semi-structured, or unstructured. It also refers to heterogeneous sources, since data arrives from new sources both inside and outside the enterprise.

Structured data: Organized data with a defined length and format.
Semi-structured data: Partly organized data that does not conform to a formal data structure. Log files are an example.
Unstructured data: Unorganized data that does not fit neatly into the rows and columns of a relational database. Text, pictures, and videos are examples.
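To make the three kinds of data concrete, here is a small Python sketch. The sample records, field names, and review text are invented for illustration and do not come from any real system.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured = io.StringIO("user_id,country,age\n1,IN,29\n2,US,41\n")
rows = list(csv.DictReader(structured))

# Semi-structured: JSON log lines; the fields may differ between records.
log_lines = [
    '{"ts": "2020-09-26T10:00:00", "level": "INFO", "msg": "login"}',
    '{"ts": "2020-09-26T10:00:05", "level": "ERROR", "msg": "timeout", "retry": 3}',
]
events = [json.loads(line) for line in log_lines]

# Unstructured: free text with no fields at all; we can only derive features.
review = "The delivery was late, but the support team was very helpful."
word_count = len(review.split())

print(rows[0]["country"])        # column lookup works on every row
print(sorted(events[1].keys()))  # this record carries an extra "retry" field
print(word_count)                # a derived feature, not a stored column
```

The structured rows can be queried by column name directly, the log records have to be inspected one by one because their fields vary, and the free text yields nothing until some processing is applied to it.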

4. Veracity:

Veracity refers to inconsistency and uncertainty in data: the available data can be messy, and its quality and accuracy are difficult to control. Big Data is also variable because of the multitude of data dimensions that result from many disparate data types and sources.

5. Value:

Once the other four Vs are taken into account, one more remains: Value. A bulk of data with no value is of no use to a company unless it is turned into something useful. Data in itself has little importance; it has to be converted into information to become valuable. Hence you can argue that Value is the most important of the 5Vs.

Before you move on, have a look at these stats:

International Data Corp. says that by 2020, each person on earth will generate an average of about 1.7 MB of data per second.

IBM says that worldwide, people are already generating 2.5 quintillion bytes of data each day.

International Data Corp. says advanced data analytics show that machine-generated data will grow to encompass more than 40% of internet data in 2020.

VisualCue says stored data will grow to 44 ZB by 2020.

BaseLine says nearly 90% of all data has been created in the last two years.
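Figures such as "quintillion bytes" and "ZB" are easier to compare once they are in the same units. The following is a minimal sketch that assumes decimal (SI) prefixes, that is, 1000 bytes per step; the helper name `human_readable` is made up for this example.

```python
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes):
    """Express a byte count with decimal (SI) prefixes, 1000 bytes per step."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

SECONDS_PER_DAY = 24 * 60 * 60

# 2.5 quintillion bytes per day (short scale: one quintillion = 10**18).
daily_total = 25 * 10**17
# 1.7 MB per second for one person, accumulated over a full day.
per_person_daily = 1_700_000 * SECONDS_PER_DAY
# 44 ZB of stored data.
stored_total = 44 * 10**21

print(human_readable(daily_total))
print(human_readable(per_person_daily))
print(human_readable(stored_total))
```

Under these assumptions, 2.5 quintillion bytes is 2.5 EB, and 1.7 MB per second adds up to roughly 147 GB per person per day.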

You might be wondering how such a huge amount of data is produced, and why it causes problems.

People are generating 2.5 quintillion bytes of data each day. The amount of data we produce every day is truly mind-boggling. We are constantly producing data; even our kitchen appliances are hooked up to the internet, sharing and storing mountains of data. The amount of information being collected around the globe is too hefty to process. I have gathered some stats that illustrate some of the ways we create these colossal amounts of data every single day.

Internet

With so much information at our fingertips, we’re adding to the data stockpile every time we turn to our search engines for answers.

We conduct more than half of our web searches from a mobile phone now.
More than 3.7 billion humans use the internet (that’s a growth rate of 7.5 percent over 2016).
On average, Google now processes more than 40,000 searches EVERY second (3.5 billion searches per day)!
While 77% of searches are conducted on Google, it would be remiss not to remember other search engines are also contributing to our daily data generation. Worldwide there are 5 billion searches a day.
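The per-second and per-day search figures above describe the same rate, which a quick calculation confirms. This is only a sketch of the arithmetic, treating the per-second number as a constant average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

searches_per_second = 40_000
searches_per_day = searches_per_second * SECONDS_PER_DAY

print(f"{searches_per_day:,} searches per day")
print(f"about {searches_per_day / 1e9:.1f} billion")
```

40,000 searches per second works out to 3,456,000,000 per day, which rounds to the 3.5 billion quoted in the list.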

Social Media

Our current love affair with social media certainly fuels data creation. According to Domo’s Data Never Sleeps 5.0 report, these are numbers generated every minute of the day:

Snapchat users share 527,760 photos
More than 120 professionals join LinkedIn
Users watch 4,146,600 YouTube videos
456,000 tweets are sent on Twitter
Instagram users post 46,740 photos

As of June 2019, Facebook reports an estimated 2.4 billion Monthly Active Users.
Facebook also says it has 1.6 billion Daily Active Users.
88% of Facebook’s user activity is from a mobile device.
The average amount of time a single user spends on Facebook every day is 58 minutes.
There are over 300 million photos uploaded to Facebook every day.
On average, 5 Facebook accounts are created every second.
Approximately 30% of Facebook users are aged between 25 and 34 years.
Facebook video is still in high demand with approximately 8 billion video views per day.

Currently, YouTube has more than 1.9 billion logged-in visits every month.
149 million people log in to YouTube daily.
The average duration of a YouTube visit is 40 minutes.
Viewers are spending an average of 1 hour per day watching YouTube videos.
On average, 300 hours of video are uploaded every minute on YouTube.
There are over 5 billion video views each day.

Instagram has over 1 billion monthly active users.
There are more than 600 million daily active users.
There are now 500 million daily Stories users.
Since its creation, more than 40 billion photos have been shared.
On average, 95 million photos are uploaded daily on Instagram.
There are approximately 4.2 billion likes per day.
Most Instagram users are between 18 and 29 years of age, and 32% of Instagram users are college students.

Twitter now has more than 330 million monthly active users (MAU).
There are 134 million daily active users, or at least that is how many “monetizable” daily active users (mDAU) Twitter reports.
Of its monthly active users, 68 million are from the United States.
The number of mDAU from the US is 26 million.
Close to 460,000 new Twitter accounts are registered every day.
Twitter users post 140 million tweets daily, which adds up to about a billion tweets a week.
Each Twitter user has 208 followers on average.
550 million accounts are reported to have sent at least one tweet.

Communication

We leave a data trail when we use our favorite communication methods today from sending texts to emails. Here are some incredible stats for the volume of communication we send out every minute:

We send 16 million text messages
There are 990,000 Tinder swipes
156 million emails are sent; worldwide it is expected that there will be 2.9 billion email users by 2019
15,000 GIFs are sent via Facebook messenger
Every minute there are 103,447,520 spam emails sent
There are 154,200 calls on Skype

Digital Photos

Now that our smartphones are exemplary cameras as well, everyone is a photographer, and the trillions of photos stored are the proof. Since there are no signs of this slowing down, expect these digital photo numbers to continue to grow:

People will take 1.2 trillion photos by the end of 2017
There will be 4.7 trillion photos stored

Services

There are some really interesting statistics coming out of businesses and service providers in our new platform-driven economy. Here are just a few numbers that are generated every minute that piqued my interest:

The Weather Channel receives 18,055,556 forecast requests
Venmo processes $51,892 in peer-to-peer transactions
Spotify adds 13 new songs
Uber riders take 45,788 trips!
There are 600 new page edits to Wikipedia

Internet of Things

The Internet of Things, connected “smart” devices that interact with each other and us while collecting all kinds of data, is exploding (from 2 billion devices in 2006 to a projected 200 billion by 2020) and is one of the primary drivers for our data vaults exploding as well.

Let’s take a look at just some of the stats and predictions for just one type of device, voice search:

There are 33 million voice-first devices in circulation
8 million people use voice control each month
Voice search queries in Google for 2016 were up 35 times over 2008

By now you should have an idea of where such huge volumes of data come from. Hundreds of companies like Facebook, Twitter, and LinkedIn generate enormous amounts of data.

Why store such huge data?

To gain a competitive advantage, organizations have to make the best use of the unstructured data they collect for profitable business decision making. This situation, where companies and institutions have to support, store, analyze, and make decisions using large amounts of data, is called Big Data.

Big Data Analytics is the process of analyzing large structured and unstructured data sets to discover unknown relations, hidden patterns, and any other valuable information that can be leveraged for better business decision making. It tackles even the most challenging business problems through high-performance analytics, and it drives innovation by helping organizations make the best possible decisions through high-performance data mining, predictive analytics, text mining, social sentiment analysis, forecasting, and optimization. In addition, organizations are realizing that the distinct properties of deep learning and machine learning are well suited to addressing their requirements in novel ways through big data analytics.

Who helps to store and process large datasets?

Apache Hadoop is a software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.

In simple words, Hadoop helps us apply the concept of distributed storage, where Big Data is split into smaller blocks before it is stored. This lets us store and retrieve data at high speed, without a single point of failure, because the data is spread across different nodes.
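The idea of splitting data into blocks and spreading copies across nodes can be sketched in a few lines of Python. This is a toy model, not Hadoop's actual code: the node names, the 256-byte block size, and the round-robin placement are all assumptions made for illustration (HDFS itself defaults to much larger blocks and uses rack-aware placement).

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }

data = b"x" * 1000                      # stand-in for a large file
blocks = split_into_blocks(data, 256)   # 4 blocks: 256 + 256 + 256 + 232 bytes
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
placement = place_blocks(len(blocks), nodes, replication=3)

# Simulate losing one node: every block should still have a surviving copy.
failed = "node-b"
survivors = {
    block_id: [n for n in holders if n != failed]
    for block_id, holders in placement.items()
}

for block_id, holders in survivors.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on three different nodes, losing any single node still leaves at least two copies of each block, which is the "no single point of failure" property described above.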

Hadoop is built to run on a cluster of machines.

Let’s start with an example. Let’s say that we need to store lots of photos. We will start with a single disk. When we exceed a single disk, we may use a few disks stacked on a machine. When we max out all the disks on a single machine, we need to get a bunch of machines, each with a bunch of disks. This is exactly how Hadoop is built. Hadoop is designed to run on a cluster of machines from the get-go.

Hadoop clusters scale horizontally

More storage and compute power can be achieved by adding more nodes to a Hadoop cluster. This eliminates the need to buy more and more powerful and expensive hardware.

Hadoop can handle unstructured/semi-structured data

Hadoop doesn’t enforce a schema on the data it stores. It can handle arbitrary text and binary data. So Hadoop can digest any unstructured data easily.

Hadoop clusters provide storage and computing

Keeping storage and processing in separate clusters is not the best fit for big data. Hadoop clusters, however, provide storage and distributed computing all in one.
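On the computing side, Hadoop is commonly programmed in the MapReduce style. The sketch below imitates the three phases (map, shuffle, reduce) for a word count in plain Python on a single machine. It does not use Hadoop's API; the function names and sample lines are invented for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reduce: combine the values for one key into a single result."""
    return word, sum(counts)

lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())

print(word_counts)
```

In a real cluster the map calls would run on the nodes that already hold the data blocks, and the shuffle would move the grouped pairs across the network to the reducers. That is why having storage and compute in the same cluster matters.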

Big Data Tools

Data Storage and Management Tools

MongoDB
Cassandra
Neo4j
Apache Hadoop
Apache HBase
Microsoft HDInsight
Apache ZooKeeper

Data Cleaning

Microsoft Excel
OpenRefine

Data Mining

Teradata
RapidMiner

Data Visualization

Tableau
IBM Watson Analytics
Plotly

Data Reporting

Power BI

Data Ingestion

Sqoop
Flume
Apache Storm

Data Analysis

Apache Hive
Apache Pig
Apache Hadoop MapReduce
Apache Spark

Distributed Computing

Apache YARN

Hadoop Case Studies in the Enterprise

1. BT

BT uses a Cloudera enterprise data hub powered by Apache Hadoop to cut down on engineer call-outs. By analyzing the characteristics of its network, BT can identify whether slow internet speeds are caused by a network issue or a customer issue, and then evaluate whether an engineer would be likely to repair the problem. The Cloudera hub provides a unified view of customer data stored in a Hadoop environment. BT earned a return on investment of between 200 and 250 percent within one year of the deployment. BT has also used it to create new services such as “View My Engineer”, an SMS and email alerting system that lets customers track the location of engineers. The company now wants to use predictive analytics to improve vehicle maintenance.

2. British Airways

British Airways deployed its first instance of Hadoop in April 2015, as a data archive for legal cases that were primarily stored, at a high cost, on its enterprise data warehouse (EDW) platform. According to Alan Spanos, British Airways’ data exploitation manager, since deploying Hortonworks HDP 2.2 his department has seen a return on its investment within a year and is able to deliver 75 percent more free space for new projects, which translates into cost reductions for the airline’s finance team. Spanos said: “In business intelligence, if you don’t adopt this technology to do at least part of your job role, you will not exist in a few years’ time. You can only go so far with traditional technology. It still has a place within your architecture, but quite frankly, this is where you need to be.”

3. Royal Bank Of Scotland

Royal Bank of Scotland (RBS) has been working with Silicon Valley company Trifacta to get its Hadoop data lake in order, so that it can gain insight from the chat conversations its customers have with the bank online. RBS stores approximately 250,000 chat logs, plus associated metadata, per month, and keeps this unstructured data in Hadoop. Before the bank turned to Trifacta, however, this was a huge and untapped source of information about its user base.

4. JPMorgan Chase & Co.

Morgan Stanley, with assets of over 350 billion, is one of the world’s biggest financial services organizations. It relies on the Hadoop framework to make industry-critical investment decisions. Hadoop provides scalability and better results, and it can manage petabytes of data, which is not practical with traditional database systems.

JPMorgan Chase is another financial giant, providing services in more than 100 countries. Large commercial banks like this can leverage big data analytics more effectively by using frameworks like Hadoop on massive volumes of structured and unstructured data. JPMorgan Chase has said on various channels that it prefers to use HDFS to support the exponentially growing size of its data, as well as for low-latency processing of complex unstructured data.

5. Nokia

Another example of Big Data management in the telecom industry comes from Nokia, which stores and analyses massive volumes of data from the mobile phones it manufactures. To paint a fair picture of Nokia’s Big Data: it manages 100 TB of structured data along with 500+ TB of semi-structured data. The Hadoop Distributed File System (HDFS), in a distribution provided by Cloudera, manages all varieties of Nokia’s data and processes it at petabyte scale.

Big Data is the name for all the problems that arise from the collection, storage, retrieval, and many other operations on data.

Thank you very much for your valuable time.

