How are we living in a data oriented world?
The last few years have seen a huge leap in the amount of data being produced. More data exists today than at any earlier point in history. Billions of devices and services, such as smartphones, wireless sensors, cameras, payment systems, digital platforms, and virtual reality applications, generate data every single moment. In 2014, there were 2.4 billion internet users. That number grew to 3.4 billion by 2016, and another 300 million users were added in 2017. As of June 2019, there are over 4.4 billion users. This is an 83% increase in the number of people using the internet in just five years!
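The 83% figure can be checked with a few lines of Python. This is only a small sketch of the arithmetic; the function name `percent_increase` is my own and the user counts are the rounded figures quoted above.

```python
def percent_increase(old: float, new: float) -> float:
    """Return the relative growth from `old` to `new`, in percent."""
    return (new - old) / old * 100


# Rounded internet user counts quoted in the paragraph above.
users_2014 = 2.4e9
users_2019 = 4.4e9

growth = percent_increase(users_2014, users_2019)
print(f"{growth:.0f}% more internet users in 2019 than in 2014")
```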
Each day the following happens on the internet:
- 1,209,600 new data-producing social media users!
- 682 million tweets per day!
- More than 4 million hours of content are uploaded to YouTube every day, with users watching 5.97 billion hours of YouTube videos each day!
- 67,305,600 Instagram posts are uploaded each day!
- There are over 2 billion monthly active Facebook users, compared to 1.44 billion at the start of 2015 and 1.65 billion at the start of 2016!
- Facebook has 1.58 billion daily active users on average as of Q2 2019!
- 4.3 billion Facebook messages are posted daily!
- 5.76 billion Facebook likes every day!
- 500+ terabytes of data are produced by Facebook each day!
And these are just a few examples of the volume of data being produced on social media every day.
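To get a feel for these daily totals, it can help to convert them into per-second rates. The sketch below does only that conversion; the dictionary keys are my own labels, and the numbers are the daily figures quoted in the list above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Daily totals taken from the list above.
daily_totals = {
    "new social media users": 1_209_600,
    "tweets": 682_000_000,
    "Instagram posts": 67_305_600,
}

# Divide each daily total by the number of seconds in a day.
per_second = {name: total / SECONDS_PER_DAY for name, total in daily_totals.items()}

for name, rate in per_second.items():
    print(f"{name}: about {rate:,.0f} per second")
```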
Therefore, Big Data is not the name of a technology. It is the name of a problem.
Big data is a term that describes the large volume of data — both structured and unstructured — that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
The IT industry, in an attempt to quantify what is and isn’t Big Data, has come up with what are known as the “V’s” of Big Data. They are:
- Volume: The amount of data is immense, which makes input/output processing difficult.
- Velocity: The speed at which data arrives and must be processed (for example, analyzing streaming data to produce real-time or near-real-time results).
- Variety: The type and nature of the data. Earlier technologies such as RDBMSs handle structured data efficiently and effectively, but Big Data also includes semi-structured and unstructured data.
- Veracity: An extension of the original definition of Big Data that refers to data quality and data value. The quality of captured data can vary greatly, which affects the accuracy of the analysis.
- Value: The utility that can be extracted from the data.
How do we solve the problem of Big Data?
We know that whenever a problem arises in the technical world, a new technology is born to solve it. To solve the problem of Big Data, experts proposed distributed storage.
A Distributed File System (DFS), as the name suggests, is a file system that is distributed across multiple file servers or locations. Data is accessed and processed as if it were stored on the local client machine. A DFS makes it convenient to share information and files among users on a network in a controlled and authorized way. The server lets client users share files and store data just as if they were storing the information locally. However, the servers retain full control over the data and decide what access each client is given.
One of the most popular distributed file systems, used by many large multinational companies (MNCs), is the Hadoop Distributed File System.
Hadoop distributed file system (HDFS) is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN. HDFS should not be confused with or replaced by Apache HBase, which is a column-oriented non-relational database management system that sits on top of HDFS and can better support real-time data needs with its in-memory processing engine.
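To make the idea of distributed storage more concrete, here is a minimal sketch of how a large file might be split into blocks and copied onto several machines. It is not how HDFS is implemented: the function `place_blocks`, the node names, and the simple round-robin placement are my own assumptions for illustration (real HDFS placement is decided by the NameNode and is rack-aware). The 128 MB block size and replication factor of 3 are the commonly cited HDFS defaults.

```python
def place_blocks(file_size_mb, nodes, block_size_mb=128, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes using simple round-robin placement.

    This is a toy model, not the real HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")

    # Ceiling division: a partially filled last block still counts as a block.
    num_blocks = -(-file_size_mb // block_size_mb)

    placement = {}
    for block_id in range(num_blocks):
        # Pick `replication` consecutive nodes, wrapping around the node list.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical four-node cluster storing a 500 MB file.
nodes = ["node-1", "node-2", "node-3", "node-4"]
placement = place_blocks(500, nodes)

for block_id, replicas in placement.items():
    print(f"block {block_id} -> {replicas}")
```

Because every block lives on three different machines in this sketch, losing any single node still leaves two copies of each block, which is the basic reason distributed file systems can run on commodity hardware.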
So, the question that now arises is: how do tech giants like Facebook and Google store, manage, and manipulate thousands of terabytes of data with high speed and high efficiency?
Let's see, one by one, how these big MNCs are solving their Big Data problem:
There is no doubt that Facebook is one of the largest Big Data practitioners, dealing with petabytes of historical and real-time data, and that volume will keep growing. While the world is coming closer together on this platform, Facebook develops algorithms to track those connections and their presence on or outside its walls in order to fetch the most suitable posts for its users. Whether it is your wall post, your favorite books and movies, or your workplace, Facebook analyzes every bit of your data and offers you better services each time you log in.
"Facebook runs the world’s largest Hadoop cluster," says Jay Parikh, Vice President of Infrastructure Engineering at Facebook. The cluster spans more than 4,000 machines and stores hundreds of millions of gigabytes. This extensive cluster provides some key abilities to developers:
- The developers can freely write map-reduce programs in any language.
- SQL has been integrated to process extensive data sets, as most of the data in Hadoop’s file system is in table format. Hence, the data becomes easily accessible to developers using small subsets of SQL.
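For readers who have not seen a map-reduce program before, the classic word count example shows the idea. The sketch below runs on a single machine in plain Python and only imitates the three stages (map, shuffle, reduce); it does not use Hadoop itself, and the function names and sample lines are my own.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Tiny made-up input; on a real cluster each line could live on a different node.
lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

On a real cluster the map and reduce steps run in parallel on many machines, close to where the data blocks are stored, and the framework takes care of the shuffle step.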
Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From search, log processing, recommendation systems, and data warehousing to video and image analysis, Hadoop powers this social networking platform in many ways. Facebook developed its first user-facing application, Facebook Messenger, on the Hadoop database, Apache HBase, whose layered architecture supports a plethora of messages in a single day.
Google
Google is actually a mountain of data and a set of tools for working with it. It has evolved from an index of web pages to a central hub for real-time data feeds on just about anything that can be measured (think: weather information, travel delays, stocks and shares, shopping… and countless other things). Google uses Big Data tools and techniques to understand our requirements based on several parameters such as search history, location, and trends. This input goes through an algorithm that performs complex calculations, and Google then displays search results sorted and ranked by relevance and authority to match the user’s requirement.
Google always wanted to develop a search engine that can think like a human and understand the phrasing, logic, and goal of any search query. Semantics has helped Google look beyond the literal meaning of the phrases in a search query. Google also invokes other built-in algorithms that are themselves based on Big Data. For example, Google’s translation service studies millions of other pieces of translated text or speech to determine the most accurate interpretation.
Amazon
As the world’s largest online store, Amazon is also one of the world’s largest data-driven organizations. Once again, the differences between Amazon and the other internet giants mentioned here are largely down to marketing. Like Google and Facebook, Amazon offers a wide range of online services including information search, following friends and family, and advertising – however its brand is built on the service it first became famous for – shopping.
Amazon compares products we browse and buy with millions of other customers around the world. By building a profile of our habits, it is able to match us with products and recommendations from others which will most likely fit our needs. The Big Data tech at work here is known as a recommendation engine and Amazon’s was one of the first, and most sophisticated.
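One simple way to picture a recommendation engine is "customers who bought this also bought that". The sketch below counts how often items appear together in past orders and suggests the most frequent companions. It is only an illustration of the general idea, not Amazon's actual algorithm, and the functions `build_cooccurrence` and `recommend` as well as the sample baskets are made up.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each pair of items was bought together."""
    co = {}
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            co.setdefault(a, Counter())[b] += 1
            co.setdefault(b, Counter())[a] += 1
    return co


def recommend(co, owned, top_n=3):
    """Suggest items most often bought together with what the customer owns."""
    scores = Counter()
    for item in owned:
        for other, count in co.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    return [item for item, _ in scores.most_common(top_n)]


# Made-up purchase history of four customers.
baskets = [
    ["laptop", "mouse", "laptop bag"],
    ["laptop", "mouse"],
    ["laptop", "mouse", "keyboard"],
    ["phone", "phone case"],
]

co = build_cooccurrence(baskets)
suggestions = recommend(co, {"laptop"})
print(suggestions)  # "mouse" comes first: it was bought with a laptop three times
```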
As well as shopping, Amazon lets us take advantage of its platform to make money ourselves (for a cut, of course). Anyone who sets up as a trader on their platform benefits from the data-driven recommendations which will, in theory, drive suitable customers towards their listings.
LinkedIn
If you are an employer, or a person looking for work, LinkedIn gives access to Big Data that can be of help. Applicants can be matched to job vacancies based on their skills and experience, and even find data on how they compare to other employees at a company, as well as others who may be competing for the position.
For recruiters, LinkedIn’s Big Data allows talent which matches a particular profile – for example, successful current or former employees – to be discovered.
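Matching applicants to a vacancy based on skills can be pictured as comparing two sets. The sketch below scores each candidate by the Jaccard similarity (shared skills divided by all skills mentioned) between their skills and the job's required skills. This is a toy illustration under my own assumptions, not LinkedIn's matching method, and all names and skill lists are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Return the overlap between two sets as a value between 0 and 1."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(job_skills, candidates):
    """Sort candidates by how closely their skills match the job, best first."""
    scored = [(name, jaccard(job_skills, skills)) for name, skills in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical vacancy and applicants.
job_skills = {"python", "sql", "hadoop"}
candidates = {
    "candidate_a": {"python", "sql", "excel"},
    "candidate_b": {"python", "sql", "hadoop", "spark"},
    "candidate_c": {"java"},
}

ranking = rank_candidates(job_skills, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```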
LinkedIn takes a “walled garden” approach to its data and this brings up one important difference worth considering when choosing where to find and use your Big Data. LinkedIn’s recruiter and applicant services all operate on data which is internal and controlled by the service itself, whereas Google (which also offers job listings, in the US) sucks in data from a large number of external sources. One approach offers potentially higher quality information, with the flipside that it may not be as comprehensive. The other offers larger volumes, which may or may not be what you are looking for.
These are just a few ways in which Big Data – far from being a tool of the well-resourced corporations and the technological elite – is something that many of us are already benefiting from in our day-to-day lives. As more and more data becomes accessible and increasingly sophisticated tools emerge to gain value from it, it is certain we will see the emergence of many more.
Netflix
With over 115 million subscribers, there is little doubt that Netflix is the uncrowned king of the online streaming world. Netflix’s phenomenal rise to streaming dominance has taken industry leaders aback, forcing them to ask: how could one single platform take on all of Hollywood? The answer is simple: Big Data.
According to the Wall Street Journal, Netflix uses Big Data analytics to optimize the quality and stability of its video streams, and also to assess customers' entertainment preferences and viewing patterns. This allows Netflix to target its users with offers for shows they might like watching. These collective efforts have been pivotal in helping the streaming giant make a successful transition from renting DVDs to delivering digital video over the last decade.
Netflix even awarded $1 million to a developer group for an algorithm that increased the accuracy of the company’s recommendation engine by 10 percent. The algorithm helped Netflix save $1 billion a year through customer retention. Netflix knows more about your viewing habits than you think. This might sound scary, but it’s pure statistics. The prediction systems powered by these algorithms know what we prefer to watch before we do. Analyzing data and gaining insights have been the pillars of Netflix's success in recent years. The company is able to gather insights, adjust algorithms, and optimize the streaming experience. Viewing habits are crucial for predicting user behavior, including time spent selecting movies, time spent on playback, the number of times a show was watched, and much more. Conventional calculus gave Netflix the required foundation to start studying its users and provide them with appropriate and personalized content.
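What does "10 percent more accurate" mean in practice? In the public Netflix Prize competition, accuracy was measured with the root mean squared error (RMSE) between predicted and actual ratings, so a lower value is better. The sketch below computes RMSE for two sets of predictions and the relative improvement between them. The ratings are invented for illustration, and the function names are my own.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_error, new_error):
    """How much smaller the new error is than the baseline, in percent."""
    return (baseline_error - new_error) / baseline_error * 100


# Invented star ratings, for illustration only.
actual = [4, 3, 5, 2]
baseline_predictions = [3, 3, 4, 3]
improved_predictions = [3.5, 3, 4.5, 2.5]

baseline_rmse = rmse(baseline_predictions, actual)
improved_rmse = rmse(improved_predictions, actual)
improvement = relative_improvement(baseline_rmse, improved_rmse)

print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"improved RMSE: {improved_rmse:.3f}")
print(f"improvement:   {improvement:.0f}%")
```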
Netflix is one prime example of how technological advancement can work together with human creativity. Netflix debunks the misinterpreted theories about content preference by flashing the concealed potential of user data that can read a user’s mind with incredible accuracy.
Instagram
Instagram, the social networking app for sharing photos and videos, launched in 2010. Today, it boasts 800 million active users and is owned by Facebook. There are 70 million photos uploaded to Instagram every day. People interact with each of those posts by showing their love with a heart, commenting and using hashtags. What all of this activity does is create an enormous amount of data. Once analyzed, by humans as well as increasingly through artificial intelligence algorithms, it can provide incredible business intel and insights into human behavior causing Instagram CEO Kevin Systrom to say, “We’re also going to be a big data company.”
Via the use of tags and trending information, Instagram users are able to find photos for a particular activity, topic or event or discover experiences, restaurants and places around the world that are trending.
In a survey conducted by Ditch the Label, 42% of more than 10,000 UK youth between ages 12 and 25 reported that Instagram was the platform where they were most bullied. With this unfortunate distinction of having the biggest cyberbullying problem of any social media site, Instagram became the first to use machine learning to automatically remove offensive posts, whereas Facebook and Twitter rely on users to report abusive language. Based on the success of using DeepText to identify and remove spam, Instagram officials began to see it as a way to identify and eliminate comments that violate Instagram’s Community Guidelines. Humans reviewed and tagged actual Instagram posts to help DeepText learn what would and would not be considered offensive content in certain contexts. If the algorithm finds something offensive, it is immediately removed.
From enhancing its platform for users and advertisers to finding and removing fake or offensive content, Instagram uses the insights it extracts from all the data it collects to improve, while others see great potential in that data for uncovering insights about human behavior, cultures, and more.
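The idea of learning from human-tagged examples can be shown with a very small text classifier. The sketch below is a naive Bayes classifier written in plain Python. It is far simpler than DeepText and is not how Instagram's system works; the labels, the training comments, and the function names are all made up for illustration.

```python
import math
from collections import Counter


def train(labeled_comments):
    """Count words per label from human-tagged (text, label) examples."""
    word_counts = {}
    label_counts = Counter()
    for text, label in labeled_comments:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_examples = sum(label_counts.values())

    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from the prior probability of the label.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up examples standing in for comments tagged by human reviewers.
labeled = [
    ("you are stupid and ugly", "offensive"),
    ("nobody likes you loser", "offensive"),
    ("what a lovely photo", "ok"),
    ("great shot love the colors", "ok"),
]

word_counts, label_counts = train(labeled)
print(classify("you are a loser", word_counts, label_counts))
print(classify("lovely colors", word_counts, label_counts))
```

With more tagged examples the word counts become more reliable, which is why the human review step described above matters so much.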
I hope this gave you some good insight into one of today's big problems, Big Data, and how it is being solved.
THANK YOU FOR READING!!
A special thanks to World Record Holder Vimal Daga sir for his extraordinary teaching skills and for providing a platform where we can develop ourselves and learn new technologies and their integration, drawing on his years of hard work and research. I consider myself really lucky to be part of his trainings, where I get to improve myself and learn new and exciting things every day.