Have You Ever Thought About Data? An Intro to Big Data

Specifically, the data that huge multinational companies like Google and Facebook, and many other social media platforms, are dealing with.

Confused? Let me take you through it step by step.

First, what is data?

Since the invention of computers, people have used the term data to refer to computer information, and this information was either transmitted or stored. But that is not the only data definition; there exist other types of data as well.

So, what is data? Data can be text or numbers written on paper, bytes and bits inside the memory of electronic devices, or facts stored in a person's mind.

In the field of science, the answer to “what is data” is that data is information of various kinds, usually formatted in a particular manner.

Coming to the present scenario, growth in technology, specifically in smartphones, has meant that text, video, and audio now count as data, along with web and log activity records.

Now, why should we think about this data?

I hope every one of us is familiar with, and attached to, social media, right?

It is where we post, share, chat, update our status, and so on. But have you ever wondered where all of this data goes, where it is stored, and who manages it so that it can be shared with our connections within seconds?

So, let me take Facebook as a case study to explain the above in detail.

Facebook facts and stats
  • If Facebook were a country, it would be the most populous nation on earth.
  • With over 2.7 billion monthly active users as of the second quarter of 2020, Facebook is the biggest social network worldwide.
  • 4.3 billion Facebook messages are posted daily.
  • 5.76 billion Facebook likes are generated every day.
  • 8 billion hours of video views are generated daily.

Looking at these numbers, just imagine how much data Facebook has to store about its users, their posts, and their activities. For example:

  • Statistics show that 500+ terabytes of new data are ingested into Facebook's databases every day. This data mainly comes from photo and video uploads, message exchanges, comments, and so on.

Now, think about how Facebook is going to store this much data every day. Is there any single device big enough to store it? Absolutely not. Here comes the problem of **Big Data**: storing huge amounts of data.
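
To get a feel for the scale, here is a rough back-of-the-envelope calculation in Python. The 500 TB per day figure comes from the statistic above; the 16 TB disk capacity is purely an assumption chosen for illustration, not a figure from Facebook.

```python
import math

DAILY_INGEST_TB = 500   # from the statistic quoted above
DISK_TB = 16            # assumption: capacity of one large commodity disk

def disks_needed(days, daily_tb=DAILY_INGEST_TB, disk_tb=DISK_TB):
    """Whole disks needed to hold `days` of newly ingested data (no copies, no overhead)."""
    return math.ceil(days * daily_tb / disk_tb)

yearly_tb = 365 * DAILY_INGEST_TB
print(f"New data per year: {yearly_tb:,} TB (~{yearly_tb / 1000:.1f} PB)")
print(f"Disks filled per day: {disks_needed(1)}")
print(f"Disks filled per year: {disks_needed(365)}")
```

Even under these simplified assumptions, dozens of disks fill up every single day, which is why no single device can keep up.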

Now, what is Big Data, in brief?
  • Big Data is also data, but of a huge size. The term describes a collection of data that is huge in volume and still growing exponentially with time. In short, such data is so large and complex that traditional data management tools cannot store or process it efficiently.
  • It includes data mining, data storage, data analysis, data sharing, and data visualization.
  • The term is an all-comprehensive one including data, data frameworks, along with the tools and techniques used to process and analyze the data.

Types of Big Data

Structured

  • Structured data is data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily stored in and accessed from a database with simple queries.
  • For instance, the employee table in a company database is structured: the employee details, job positions, salaries, etc., are present in an organized manner.

Unstructured

  • Unstructured data refers to the data that lacks any specific form or structure whatsoever. This makes it very difficult and time-consuming to process and analyze unstructured data.
  • Email is an example of unstructured data.

Semi-structured

  • Semi-structured data is the third type of big data. It combines both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that has not been organized into a particular repository (database), yet contains vital information or tags that separate the individual elements within the data.
  • CSV, XML, and JSON documents are examples of semi-structured data, and NoSQL databases are generally considered semi-structured as well.
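
To make the three types concrete, here is a small Python sketch using only the standard library. The table, the posts, and the email text are all made-up sample data for illustration.

```python
import json
import sqlite3

# Structured: a fixed schema, like the employee table mentioned above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, name TEXT, position TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Engineer", 90000), (2, "Ravi", "Analyst", 70000)],
)
total_salary = conn.execute("SELECT SUM(salary) FROM employee").fetchone()[0]

# Semi-structured: fields are tagged, but records need not share the same shape.
posts = json.loads('[{"user": "asha", "text": "hello", "tags": ["intro"]},'
                   ' {"user": "ravi", "text": "hi"}]')
tag_counts = [len(post.get("tags", [])) for post in posts]

# Unstructured: free text with no schema; any analysis needs ad hoc parsing.
email_body = "Hi team, the quarterly report is attached. Please review by Friday."
word_count = len(email_body.split())

print(total_salary, tag_counts, word_count)
```

The structured query is a one-liner because the schema is known in advance, the JSON needs a default for the missing field, and the email offers nothing to query except raw words.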

Whenever we have to store huge data, we generally face a few problems. One is Volume: we cannot find a single device that is large enough. And even if one existed, it would cause a problem of Velocity, that is, how fast the data can be loaded and shared, also known as I/O processing.

Thus, if you consider Big Data as a problem, there are a few sub-problems under its umbrella. These are termed the 4 V's of Big Data.

4 V's of Big Data

(i) Volume - The name Big Data itself relates to enormous size. The size of data plays a crucial role in determining the value that can be drawn from it. Whether a particular data set can be considered Big Data at all also depends on its volume. Hence, 'Volume' is one characteristic that needs to be considered when dealing with Big Data.

(ii) Variety - Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days, spreadsheets and databases were the only sources of data considered by most of the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.

(iii) Velocity - The term 'velocity' refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential in the data.

Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

(iv) Veracity - Veracity is all about making sure the data is accurate, which requires processes to keep the bad data from accumulating in your systems. The simplest example is contacts that enter your marketing automation system with false names and inaccurate contact information. How many times have you seen Mickey Mouse in your database? It’s the classic “garbage in, garbage out” challenge.
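
As a small illustration of the veracity point, here is a Python sketch that keeps obviously bad contacts out of a system. The blocklist of fake names and the email pattern are hypothetical and deliberately simple; a real pipeline would need far more careful rules.

```python
import re

# Hypothetical rules, chosen only for this illustration.
FAKE_NAMES = {"mickey mouse", "donald duck", "test test"}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_plausible(contact):
    """Return True if the contact has a non-blocklisted name and an email-shaped address."""
    name = contact.get("name", "").strip().lower()
    email = contact.get("email", "").strip()
    return bool(name) and name not in FAKE_NAMES and EMAIL_RE.match(email) is not None

contacts = [
    {"name": "Asha Rao", "email": "asha@example.com"},
    {"name": "Mickey Mouse", "email": "mickey@example.com"},
    {"name": "Ravi Kumar", "email": "not-an-email"},
]

clean = [c for c in contacts if is_plausible(c)]
rejected = [c for c in contacts if not is_plausible(c)]
print(f"kept {len(clean)}, rejected {len(rejected)}")
```

Filtering at the point of entry is one way to stop the "garbage in" before it becomes "garbage out".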

What is the solution for these?

Here, we use the concept of distributed storage systems as the solution.

Distributed Storage Cluster

Distributed Storage Systems:

A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.

Distributed storage systems can store several types of data:

  • Files—a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
  • Block storage—a block storage system stores data in volumes known as blocks. This is an alternative to a file-based structure that provides higher performance. A common distributed block storage system is a Storage Area Network (SAN).
  • Objects—a distributed object storage system wraps data into objects, identified by a unique ID or hash.
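
The idea of identifying an object by a hash can be sketched in a few lines of Python. This `ObjectStore` class is a toy, in-memory illustration and not the API of any real object storage product.

```python
import hashlib

class ObjectStore:
    """Toy in-memory object store keyed by the SHA-256 hash of each object's content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The ID is derived from the content, so identical data gets an identical ID.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

store = ObjectStore()
photo_id = store.put(b"raw bytes of a photo")
same_id = store.put(b"raw bytes of a photo")  # storing the same bytes again

print(photo_id == same_id)
print(store.get(photo_id))
```

Because the ID depends only on the content, any node can compute it independently, and duplicate uploads naturally collapse into one stored object.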

Distributed storage systems have several advantages:

  • Scalability—the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
  • Redundancy—distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes.
  • Cost—distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at low cost.
  • Performance—distributed storage can offer better performance than a single server in some scenarios, for example, it can store data closer to its consumers, or enable massively parallel access to large files.
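
The scalability and redundancy points can be sketched with a toy cluster in Python. The node names, the tiny block size, and the round-robin placement rule are all assumptions for illustration; real systems use much larger blocks and smarter placement.

```python
BLOCK_SIZE = 4                           # assumption: tiny blocks to keep the demo readable
REPLICAS = 2                             # assumption: two copies of every block
NODES = ["node-a", "node-b", "node-c"]   # hypothetical storage nodes

def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replicas=REPLICAS):
    """Round-robin placement: block i is copied to nodes i, i+1, ... (mod number of nodes)."""
    cluster = {node: {} for node in nodes}
    for index, block in enumerate(blocks):
        for r in range(replicas):
            cluster[nodes[(index + r) % len(nodes)]][index] = block
    return cluster

def read_file(cluster, n_blocks, failed=frozenset()):
    """Reassemble the file, skipping failed nodes and using any surviving copy."""
    parts = []
    for index in range(n_blocks):
        for node, stored in cluster.items():
            if node not in failed and index in stored:
                parts.append(stored[index])
                break
        else:
            raise IOError(f"block {index} is unavailable")
    return b"".join(parts)

data = b"hello distributed storage"
blocks = split_blocks(data)
cluster = place_blocks(blocks)
recovered = read_file(cluster, len(blocks), failed={"node-b"})
print(recovered == data)
```

With two copies of each block on different nodes, losing any single node still leaves one readable copy, and adding more entries to `NODES` adds capacity without changing the code.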

To implement this concept, there is a great product on the market known as **Apache Hadoop**.

For more alternatives, refer to this link.

What is Hadoop?

Hadoop is an open-source framework that allows big data to be stored and processed in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
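
The best-known of these programming models is MapReduce. The sketch below imitates its three steps (map, shuffle, reduce) in plain Python on a single machine. It is not Hadoop's API; the function names are made up for illustration, and in a real cluster each step would run in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Each line can be mapped independently and each word can be reduced independently, which is what lets the same logic spread across thousands of machines.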

Here is how Hadoop plays a key role in my case study:

“Facebook runs the world’s largest Hadoop cluster,” says Jay Parikh, Vice President Infrastructure Engineering, Facebook.

Basically, Facebook runs the biggest Hadoop cluster, which spans more than 4,000 machines and stores hundreds of millions of gigabytes. This extensive cluster provides some key abilities to developers.

Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From searching, log processing, recommendation systems, and data warehousing to video and image analysis, Hadoop empowers this social networking platform in every way possible. Facebook developed its first user-facing application, Facebook Messenger, based on the Hadoop database, i.e., Apache HBase, which has a layered architecture that supports a plethora of messages in a single day.

HDFS (Hadoop Distributed File System) cluster

Thus, Facebook is managed effectively in this Big Data world.

It is not only Facebook. Consider the following figures:

  • YouTube has 2 billion monthly active users.
  • WhatsApp has 2 billion monthly active users.
  • Facebook Messenger has 1.3 billion monthly active users.
  • WeChat has 1.203 billion monthly active users.
  • Instagram’s potential advertising reach is roughly 1.08 billion.
  • Snapchat’s potential advertising reach is roughly 397 million.
  • Pinterest has 367 million monthly active users.
  • Twitter’s potential advertising reach is roughly 326 million.

These figures show that, given their user counts and the data those users share, the Big Data challenge is present everywhere, and these platforms solve it using these kinds of solutions.

Finally, here are the benefits of Big Data processing.

The ability to process Big Data brings multiple benefits, such as:

  • Businesses can utilize outside intelligence while making decisions

Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.

  • Improved customer service

Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer responses.

  • Early identification of risk to the product/services, if any
  • Better operational efficiency

Big Data technologies can be used for creating a staging area or landing zone for new data before identifying what data should be moved to the data warehouse. In addition, such integration of Big Data technologies and data warehouse helps an organization to offload infrequently accessed data.

To understand more, go through these case studies.

Finally, we can say that:

“Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

That's all for the article. Thanks for reading!

Feel free to connect if you have suggestions. Signing off.


