Is Big Data a Technology or a Problem?
Udit Agarwal
Software Engineer | Python | GCP Cloud | DevOps | Kubernetes | Grafana | AWS Cloud | Java enthusiast | Web Developer | Docker | RHEL 8
We hear the term "Big Data" all the time. What is it? What is it used for? Why do we need it? Is it even necessary to understand?
How do big MNCs like Google, Facebook, and Instagram store, manage, and manipulate thousands of terabytes of data with high speed and high efficiency?
What is Big Data?
Big data is a term that describes the large volume of data — both structured and unstructured — that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
Where does big data come from?
Whereas in the past most customer data could be categorized as well-structured (such as bank transactions), today, the massive "exhaust" that organizations produce daily in the form of unstructured online customer interaction data dwarfs what was produced only a few years ago. The recent emergence of the "Internet of Things," the term describing the global network of billions of interconnected devices and sensors, has caused an explosion in the volume of data in the form of text, video, images, and even audio. Finally, in some regulated industries, access to data that would otherwise be archived is now often needed for compliance reasons.
Why is big data important?
The ability to consistently get business value from data is now a trait of successful organizations across every industry, and of every size. In some industries (such as Retail, Advertising, and Financial Services, with more constantly joining the list), it’s even a matter of survival.
Data analytics returns more value when you have access to more data, so organizations across multiple industries have found big data to be a rich resource for uncovering profound business insights. And, because machine-learning models become more effective as they are "trained" with more data, machine learning and big data are highly complementary.
How will I know if my data is “big”?
Although many enterprises have yet to reach petabyte scale with respect to data volumes, their data may still have one of the other defining characteristics of big data, such as high velocity or high variety. And if there is any single guarantee, it's that your data will grow over time, probably exponentially. In that sense, all "big data" starts as "small data."
Use Cases of Big Data
Big data can help you address a range of business activities, from customer experience to analytics. Here are just a few. (More use cases can be found at Oracle Big Data Solutions.)
Product Development: Companies like Netflix and Procter & Gamble use big data to anticipate customer demand. They build predictive models for new products and services by classifying key attributes of past and current products or services and modelling the relationship between those attributes and the commercial success of the offerings. In addition, P&G uses data and analytics from focus groups, social media, test markets, and early store rollouts to plan, produce, and launch new products.
Predictive Maintenance: Factors that can predict mechanical failures may be deeply buried in structured data, such as the year, make, and model of equipment, as well as in unstructured data that covers millions of log entries, sensor data, error messages, and engine temperature. By analysing these indications of potential issues before the problems happen, organizations can deploy maintenance more cost-effectively and maximize parts and equipment uptime.
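As a toy illustration of the idea (not any specific vendor's pipeline), the sketch below joins hypothetical structured equipment records with hypothetical sensor logs using pandas and flags machines whose average engine temperature exceeds an assumed safe limit:

```python
import pandas as pd

# Hypothetical structured equipment records: one row per machine.
equipment = pd.DataFrame({
    "machine_id": [1, 2, 3],
    "model_year": [2015, 2019, 2012],
})

# Hypothetical unstructured-ish sensor log: many readings per machine.
readings = pd.DataFrame({
    "machine_id": [1, 1, 2, 3, 3, 3],
    "engine_temp_c": [82, 97, 76, 101, 104, 99],
})

# Average recent temperature per machine, joined back to the records.
avg_temp = readings.groupby("machine_id")["engine_temp_c"].mean()
report = equipment.merge(avg_temp.rename("avg_temp_c"), on="machine_id")

# Flag machines whose average temperature exceeds an (assumed) safe limit,
# so maintenance can be scheduled before a failure occurs.
report["needs_maintenance"] = report["avg_temp_c"] > 95
print(report)
```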
Customer Experience: The race for customers is on. A clearer view of customer experience is more possible now than ever before. Big data enables you to gather data from social media, web visits, call logs, and other sources to improve the interaction experience and maximize the value delivered. Start delivering personalized offers, reduce customer churn, and handle issues proactively.
Fraud and Compliance: When it comes to security, it's not just a few rogue hackers; you're up against entire expert teams. Security landscapes and compliance requirements are constantly evolving. Big data helps you identify patterns in data that indicate fraud, and aggregate large volumes of information to make regulatory reporting much faster.
Machine Learning: Machine learning is a hot topic right now. And data, specifically big data, is one of the reasons why. We are now able to teach machines instead of programming them. The availability of big data to train machine-learning models makes that possible.
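To make "teaching machines with data" concrete, here is a minimal scikit-learn sketch (a toy example of my own, with invented usage-hours data, not tied to any company mentioned above). The model learns the mapping from examples rather than from hand-written rules:

```python
from sklearn.linear_model import LogisticRegression

# Toy training data: weekly product-usage hours -> did the customer churn?
X_train = [[1], [2], [3], [10], [12], [15]]   # feature: weekly usage hours
y_train = [1, 1, 1, 0, 0, 0]                  # label: 1 = churned, 0 = stayed

# The model is "taught" from examples instead of being explicitly programmed.
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict churn risk for new customers; given the clear separation in the
# toy data, the likely output is [1 0] (low usage -> churn).
print(model.predict([[2], [11]]))
```

More (and more varied) training data is exactly what lets such a model generalize, which is why the paragraph above calls machine learning and big data complementary.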
Operational Efficiency: Operational efficiency may not always make the news, but it's an area in which big data is having the most impact. With big data, you can analyse and assess production, customer feedback and returns, and other factors to reduce outages and anticipate future demands. Big data can also be used to improve decision-making in line with current market demand.
Drive Innovation: Big data can help you innovate by studying interdependencies among humans, institutions, entities, and processes, and then determining new ways to use those insights. Use data insights to improve decisions about financial and planning considerations. Examine trends and what customers want in order to deliver new products and services. Implement dynamic pricing. The possibilities are endless.
Challenges in Big Data
While big data holds a lot of promise, it is not without its challenges.
First, big data is…big. Although new technologies have been developed for data storage, data volumes are doubling in size about every two years. Organizations still struggle to keep pace with their data and find ways to effectively store it.
But it’s not enough to just store the data. Data must be used to be valuable and that depends on curation. Clean data, or data that’s relevant to the client and organized in a way that enables meaningful analysis, requires a lot of work. Data scientists spend 50 to 80 percent of their time curating and preparing data before it can be used.
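Here's a tiny pandas sketch of what that curation work typically looks like in practice; the data and cleaning rules are invented for illustration:

```python
import pandas as pd

# Invented raw data with the usual problems: duplicates, missing values,
# and inconsistent casing.
raw = pd.DataFrame({
    "customer": ["Alice", "alice", "Bob", None, "Carol"],
    "spend": [120.0, 120.0, None, 50.0, 90.0],
})

clean = (
    raw.dropna(subset=["customer"])                           # drop rows with no customer
       .assign(customer=lambda d: d["customer"].str.title())  # normalize names
       .drop_duplicates()                                     # remove exact duplicates
       .fillna({"spend": 0.0})                                # impute missing spend
)
print(clean)
```

Multiply this by hundreds of columns and billions of rows, and it becomes clear why preparation consumes so much of a data scientist's time.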
Finally, big data technology is changing at a rapid pace. A few years ago, Apache Hadoop was the popular technology used to handle big data. Then Apache Spark was introduced in 2014. Today, a combination of the two frameworks appears to be the best approach. Keeping up with big data technology is an ongoing challenge.
The Four V’s of Big Data
Volume: The amount of data matters. With big data, you’ll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, click streams on a webpage or a mobile app, or sensor-enabled equipment. For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of petabytes.
Velocity: Velocity is the rate at which data is generated and received, and (perhaps) acted on. Normally, the highest-velocity data streams directly into memory rather than being written to disk. Some internet-enabled smart products operate in real time or near real time and require real-time evaluation and action. Data generated at high velocity, such as Twitter messages or Facebook posts, arrives at such a pace that it requires distinct (distributed) processing techniques.
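A minimal sketch of velocity in code: acting on each record the moment it arrives, in memory, rather than batching everything to disk first. The stream here is simulated with a Python generator; in production it would come from a system such as Kafka or Spark Streaming:

```python
import random
import time

def sensor_stream(n):
    """Simulate a high-velocity stream of sensor readings."""
    for _ in range(n):
        yield {"temp_c": random.uniform(60, 110)}
        time.sleep(0.01)  # readings arrive continuously, not in batches

# Act on each record the moment it arrives (real-time evaluation).
for reading in sensor_stream(100):
    if reading["temp_c"] > 100:
        print("ALERT: overheating detected:", round(reading["temp_c"], 1))
```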
Variety: Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data comes in new unstructured data types. Unstructured and semi-structured data types, such as text, audio, and video, require additional pre-processing to derive meaning and support metadata.
Veracity: Veracity refers to the quality of the data being analysed. High-veracity data contains many records that are valuable to analyse and that contribute meaningfully to the overall results. Low-veracity data, on the other hand, contains a high percentage of meaningless records; the non-valuable records in these data sets are referred to as noise. An example of a high-veracity data set would be data from a medical experiment or trial.
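A small sketch of a veracity check, with made-up readings and an assumed plausible range: records that fail a basic sanity check are treated as noise and excluded before analysis:

```python
# Invented body-temperature readings: some are physically implausible (noise).
readings = [36.6, 37.1, -5.0, 36.9, 120.0, 37.4]

# Keep only values inside a plausible range (the bounds are an assumption
# chosen for this illustration).
valid = [r for r in readings if 30.0 <= r <= 45.0]
noise_ratio = 1 - len(valid) / len(readings)

print(f"valid records: {valid}")
print(f"noise ratio: {noise_ratio:.0%}")  # a high ratio means low veracity
```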
Data that is high volume, high velocity and high variety must be processed with advanced tools (analytics and algorithms) to reveal meaningful information. Because of these characteristics of the data, the knowledge domain that deals with the storage, processing, and analysis of these data sets has been labelled Big Data.
Most big companies are using Big Data to enhance their services
1. Google
Google handles a staggering 1.2 trillion searches every year and currently processes over 20 petabytes of data per day. How much is that per day? Research stats show roughly 3.5 billion queries every 24 hours. Although the leading search engine seems invincible at this point, it is surprisingly not peerless: Amazon's ad revenue share in the US is poised to reach 15.9% by 2021 at the expense of Google.
2. Facebook
Facebook's systems process 2.5 billion pieces of content and 500+ terabytes of data each day. The platform pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data every half hour.
3. Instagram
95 million photos and videos are shared on Instagram per day. Over 40 billion photos and videos have been shared on the platform since its inception.
4. LinkedIn
91 percent of executives rate LinkedIn as their first choice for professionally relevant content, and 280 billion feed updates are viewed annually.
There are 9 billion content impressions in LinkedIn feeds every week, and 2 million posts, articles, and videos are published on LinkedIn every day.
5. YouTube
As of May 2019, 500 hours of video were uploaded to YouTube every minute.
So, how much content is created every day? The simplest answer is: countless, as anybody can be a content creator these days and succeed financially as a reviewer, analyst, actor, or any other profession on YouTube.
6. Netflix
Based on an average of 71 minutes of viewing per day, Streaming Observer calculates that a cumulative 165 million hours of Netflix are watched daily across the globe (as of April 2019). That average is set to increase to 115 minutes by 2021.
Netflix has over 100 million subscribers, and with that comes a wealth of data it can analyse to improve the user experience. Big data helps Netflix decide which programs will be of interest to you, and the recommendation system influences 80% of the content we watch on Netflix. Big data has helped Netflix massively in its mission to become the king of streaming.
7. Other mediums
Snapchat: Snaps created on Snapchat fell from 2.4 million per minute in 2018 to 2.1 million in 2019.
Netflix: In 2019, nearly 695,000 hours' worth of Netflix content was watched per minute across the world.
Email: In 2019, the number of emails sent every minute was 188 million.
Play Store: The number of apps downloaded from the Google Play Store and App Store every 60 seconds jumped from 375,000 in 2018 to 390,030 in 2019.
80% of online content is available in just one-tenth of all languages.
Text messages: 18.1 million text messages were sent every minute through LINE last year.
Games: As of January 2019, 30% of internet users played games streamed live online, 23% watched live streams of other games, and 16% watched esports tournaments every month.
Smart speaker: As of January 2019, more than 26% of the US population owned a smart speaker.
5G can increase data transmission speeds by up to 100 times and reduce latency from about 20 milliseconds to one millisecond.
By 2025, there will be 75 billion IoT devices.
By 2030, 90% of the people on the planet aged six and older will be online.
The Big Data World: Big, Bigger and Biggest
A collection of large and complex data sets that is difficult to store and process using traditional databases and data-processing tools is considered big data. Big data is collected from traditional and digital sources and, when refined properly, can be used for research and analysis. Organizations are growing over time, and the data generated by these organizations is increasing exponentially. One challenge is to have a platform that can provide a single, consistent view of the complete data. Another challenge is to organize this data so that it makes sense and can be utilized as useful information. Everything around us generates BIG DATA continuously: social media websites and other digital sources are responsible for producing this huge amount of data, while sensors, mobile devices, and systems are the channels through which it is transmitted.
What is Hadoop?
Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
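As a minimal sketch of the "simple programming models" mentioned above, here is the classic word-count job written as a Hadoop Streaming mapper and reducer in Python (the file names are my own choice, and the jar path in the usage note below is an assumption that varies by installation):

```python
#!/usr/bin/env python3
# mapper.py -- read lines from stdin and emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- input arrives sorted by key, so counts can be summed per word.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, _, value = line.rstrip("\n").partition("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)

if current_word is not None:
    print(f"{current_word}\t{count}")
```

These can be tested locally without a cluster (`cat input.txt | python3 mapper.py | sort | python3 reducer.py`). On a cluster, they would run under Hadoop Streaming with something like `hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out`, and Hadoop transparently distributes the work across the machines and handles any failures.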
Technologies:
In this blog I will not deep dive into any of the technologies; I will just mention a few of the technologies that are used to solve the Big Data problem.