How much data is used in companies, and how to resolve the storage issue?

What is Data?


In computing, data is information that has been translated into a form that is efficient for movement or processing. Relative to today's computers and transmission media, that means information converted into binary digital form. It is acceptable to treat "data" as either a singular or a plural subject. "Raw data" describes data in its most basic digital format.
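To make "information converted into binary digital form" concrete, here is a minimal Python sketch that encodes a short string and shows the underlying bits:

```python
# Text becomes binary data once it is encoded: each character maps to
# one or more bytes, and each byte is eight bits.
raw = "data"
encoded = raw.encode("utf-8")                    # 4 bytes
bits = " ".join(f"{b:08b}" for b in encoded)     # the raw binary form

print(encoded)   # b'data'
print(bits)      # 01100100 01100001 01110100 01100001
```

Every file, photo, or database row a company stores ultimately reduces to bit patterns like these.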

What is Big Data?


"Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

This definition clearly answers the “What is Big Data?” question – Big Data refers to complex and large data sets that have to be processed and analyzed to uncover valuable information that can benefit businesses and organizations.

Types of Big Data

Structured

Structured data is the first type of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format: highly organized information that can be readily stored in and accessed from a database by simple search-engine algorithms. For instance, the employee table in a company database is structured; the employee details, job positions, salaries, and so on are all present in an organized manner.
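A quick sketch of the employee-table example using Python's built-in `sqlite3` module (the names and salaries below are invented for illustration):

```python
import sqlite3

# A fixed schema is what makes this data "structured": every row has the
# same named, typed fields, so a simple query can retrieve it directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, position TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Engineer", 75000.0), ("Ravi", "Analyst", 62000.0)],
)

rows = list(conn.execute("SELECT name, salary FROM employees WHERE salary > 65000"))
print(rows)  # [('Asha', 75000.0)]
```

Because the format is fixed, retrieval is a one-line query; that is exactly the property unstructured data lacks.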

Unstructured

Unstructured data refers to data that lacks any specific form or structure, which makes it difficult and time-consuming to process and analyze. Email is a classic example of unstructured data. Structured and unstructured are two important types of big data.

Semi-structured

Semi-structured data is the third type of big data. It contains elements of both formats mentioned above, structured and unstructured. To be precise, it refers to data that has not been classified into a particular repository (database), yet carries vital tags that segregate individual elements within the data. That covers the types of big data; next, its characteristics.
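JSON is a typical semi-structured format: there is no fixed database schema, but tags (the keys) still segregate individual elements. The record below is an invented example of email-like metadata:

```python
import json

# No table schema, but the keys act as the "tags" that let a program
# pull out individual elements from an otherwise free-form record.
record = '{"id": 42, "subject": "Quarterly report", "tags": ["finance", "q3"]}'
doc = json.loads(record)

print(doc["subject"])  # Quarterly report
print(doc["tags"])     # ['finance', 'q3']
```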

Characteristics of Big Data

Back in 2001, Gartner analyst Doug Laney listed the three 'V's of Big Data: Variety, Velocity, and Volume. Let's discuss each in turn.

1) Variety

Variety refers to the structured, unstructured, and semi-structured data gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today it arrives in an array of forms: emails, PDFs, photos, videos, audio, social media posts, and much more. Variety is one of the important characteristics of big data.

2) Velocity

Velocity essentially refers to the speed at which data is being created in real time. In a broader sense, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.

3) Volume

Volume is one of the characteristics of big data. We already know that Big Data implies huge volumes of data generated daily from sources such as social media platforms, business processes, machines, networks, human interactions, and more. Such large amounts of data are stored in data warehouses.

How much data is used in companies

Google: 40,000 Google Web Searches Per Second


More than 3.7 billion humans have regular access to and use the internet. That results in about 40,000 web searches per second on Google alone.

Furthermore, over half of all those web searches take place on mobile devices. It is likely the web search totals will continue to grow as more and more people get their hands on mobile devices across the world.

Facebook: 500 Terabytes Per Day


In 2012, Facebook's systems were generating 2.5 billion pieces of content and more than 500 terabytes of data per day, along with vast numbers of "likes," photos, and data scans. It was massive then, and it has certainly grown over time.

Today, there are two billion active users on Facebook and counting, making it the largest social media platform in existence. About 1.5 billion people are active on the network per day, all generating data and content. Five new profiles join Facebook every second, and more than 300 million photos are uploaded, too.

Twitter: 12 Terabytes Per Day


One wouldn't think that 140-character messages comprise large stores of data, but it turns out that the Twitter community generates more than 12 terabytes of data per day.

That equals 84 terabytes per week and 4,368 terabytes, or roughly 4.4 petabytes, per year. That is a lot of data for short, character-limited messages like those shared on the network.
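The arithmetic behind those figures, scaling the daily number over a 7-day week and a 52-week year:

```python
# Scale 12 TB/day up to weekly and yearly totals.
tb_per_day = 12
tb_per_week = tb_per_day * 7      # 84 TB
tb_per_year = tb_per_week * 52    # 4368 TB
pb_per_year = tb_per_year / 1000  # ~4.37 PB in decimal units

print(tb_per_week, tb_per_year, pb_per_year)
```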

Amazon: $258,751.90 in Sales Per Minute


Amazon generates data two-fold. The major retailer is collecting and processing data about its regular retail business, including customer preferences and shopping habits. But it is also important to remember that Amazon offers cloud storage opportunities for the enterprise world.

Amazon S3— on top of everything else the company handles — offers a comprehensive cloud storage solution that naturally facilitates the transfer and storage of massive data troves. Because of this, it’s difficult to truly pinpoint just how much data Amazon is generating in total.

Instead, it's better to look at the revenue flowing in for the company, which is directly tied to data handling and storage. The company generates more than $258,751.90 in sales and service fees per minute.

General Stats: Per Minute Ratings

  • Snapchat: Over 527,760 photos shared by users
  • LinkedIn: Over 120 professionals join the network
  • YouTube: 4,146,600 videos watched
  • Twitter: 456,000 tweets sent or created
  • Instagram: 46,740 photos uploaded
  • Netflix: 69,444 hours of video watched
  • Giphy: 694,444 GIFs served
  • Tumblr: 74,220 posts published
  • Skype: 154,200 calls made by users

How to resolve the big data problem

One solution to all of these problems: Distributed Computing


A distributed system consists of a collection of autonomous computers, connected through a network and distribution middleware, which enables computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single, integrated computing facility.

Take the Google web server as an example: from the user's perspective, when they submit a search query, they perceive Google as a single system. Behind the curtain, however, Google runs a great many servers, distributed geographically and computationally, to return results within a few seconds.
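A toy illustration of that idea (not Google's actual architecture): the query fans out to several hypothetical index shards in parallel, and the merged result comes back as if from one machine. In a real deployment each shard would live on a separate server rather than in a local thread.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical index shards, invented for illustration.
SHARDS = [
    ["hadoop basics", "hdfs design"],
    ["mapreduce tutorial", "hadoop streaming"],
    ["yarn scheduler", "hadoop security"],
]

def search_shard(shard, query):
    # Each worker scans only its own slice of the index.
    return [doc for doc in shard if query in doc]

def search(query):
    # Fan the query out to every shard in parallel, then merge the hits,
    # so the caller sees one integrated answer from many workers.
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda s: search_shard(s, query), SHARDS)
        return [doc for part in parts for doc in part]

print(search("hadoop"))  # ['hadoop basics', 'hadoop streaming', 'hadoop security']
```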

Advantages of Distributed Computing?

  • Highly efficient
  • Scalability
  • Fault tolerance
  • High availability

Hadoop is a tool for implementing distributed computing.

USE HADOOP TO HANDLE BIG DATA

Hadoop is an open-source software framework for storing large amounts of data and performing computation on it. The framework is based on Java, with some native code in C and shell scripts.

The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools and solutions supplement or support these major elements, and together they provide services such as ingestion, analysis, storage, and maintenance of data.
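MapReduce, one of those four elements, is easiest to see in the classic word-count example. The sketch below expresses the two phases in plain Python on two invented input lines; it is an illustration of the model, not the Hadoop API, which runs the same phases across many machines.

```python
from collections import defaultdict
from itertools import chain

# Two invented lines standing in for a large file split across nodes.
lines = ["big data needs hadoop", "hadoop stores big data"]

# Map phase: every line independently emits (word, 1) pairs.
mapped = chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

# Shuffle + reduce phase: group the pairs by key and sum the counts.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'stores': 1}
```

Because each line is mapped independently, the map phase parallelizes trivially; only the reduce phase needs data grouped by key.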

Hadoop Distributed File System (HDFS)

  • HDFS is the primary component of the Hadoop ecosystem. It is responsible for storing large data sets of structured or unstructured data across various nodes, and it maintains the metadata in the form of log files.
  • HDFS consists of two core components:
  1. Name Node
  2. Data Node
  • The Name Node is the prime node; it holds the metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which store the actual data. The Data Nodes are commodity hardware in the distributed environment, which is what makes Hadoop cost-effective.
  • HDFS maintains all the coordination between the clusters and hardware, working at the heart of the system.
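The division of labor between Name Node and Data Nodes can be sketched as a toy model (not the real HDFS API): a file is split into fixed-size blocks, each block is replicated on several data nodes, and the name node keeps only the metadata map. The node names, block size, and placement rule below are all simplifications.

```python
# Toy HDFS-style storage: blocks here are 4 bytes for illustration,
# whereas real HDFS defaults to 128 MB blocks and 3 replicas.
BLOCK_SIZE = 4
REPLICATION = 2
DATA_NODES = ["dn1", "dn2", "dn3"]

def store(filename, data):
    # Split the file into fixed-size blocks.
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    # Name-node view: block id -> the data nodes holding a replica.
    metadata = {}
    for idx in range(len(blocks)):
        replicas = [DATA_NODES[(idx + r) % len(DATA_NODES)]
                    for r in range(REPLICATION)]
        metadata[f"{filename}#blk{idx}"] = replicas
    return metadata

meta = store("log.txt", b"hello big data")
print(meta)  # 4 blocks, each mapped to 2 of the 3 data nodes
```

Because every block lives on more than one data node, losing a single commodity machine loses no data, which is the property that lets HDFS run cheaply at scale.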


Thank you so much Vimal sir & Preeti ma'am for giving me this opportunity. ARTH2020
