An Overview of Big Data and Case Studies of Various MNCs on Big Data Management
In this article, I am going to explain the core concepts of Big Data and solutions to Big Data problems. I will also discuss case studies of various MNCs on Big Data management.
Big Data
Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. Big data was originally associated with three key concepts: volume, variety, and velocity. When we handle big data, we may not sample but simply observe and track what happens. Therefore, big data often includes data with sizes that exceed the capacity of traditional software to process within an acceptable time and value.
History of Big Data
The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time. But the concept of big data gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s:
- Volume: Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media and more. In the past, storing it would have been a problem – but cheaper storage on platforms like data lakes and Hadoop has eased the burden.
- Velocity: With the growth in the Internet of Things, data streams in to businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.
- Variety: Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audios, stock ticker data and financial transactions.
Why is Big Data Important?
The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making. When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:
- Determining root causes of failures, issues and defects in near-real time.
- Generating coupons at the point of sale based on the customer’s buying habits.
- Recalculating entire risk portfolios in minutes.
- Detecting fraudulent behavior before it affects your organization.
Examples of Big Data
Big data comes from myriad different sources, such as business transaction systems, customer databases, medical records, internet clickstream logs, mobile applications, social networks, scientific research repositories, machine-generated data and real-time data sensors used in internet of things (IoT) environments. The data may be left in its raw form in big data systems or preprocessed using data mining tools or data preparation software so it's ready for particular analytics uses.
Using customer data as an example, the different branches of analytics that can be done with the information found in sets of big data include the following:
- Comparative analysis. This includes the examination of user behavior metrics and the observation of real-time customer engagement in order to compare one company's products, services and brand authority with those of its competition.
- Social media listening. This is information about what people are saying on social media about a specific business or product that goes beyond what can be delivered in a poll or survey. This data can be used to help identify target audiences for marketing campaigns by observing the activity surrounding specific topics across various sources.
- Marketing analysis. This includes information that can be used to make the promotion of new products, services and initiatives more informed and innovative.
- Customer satisfaction and sentiment analysis. All of the information gathered can reveal how customers are feeling about a company or brand, if any potential issues may arise, how brand loyalty might be preserved and how customer service efforts might be improved.
Big data challenges
Besides the processing capacity and cost issues, designing a big data architecture is another common challenge for users. Big data systems must be tailored to an organization's particular needs, a DIY undertaking that requires IT teams and application developers to piece together a set of tools from all the available technologies. Deploying and managing big data systems also require new skills compared to the ones possessed by database administrators (DBAs) and developers focused on relational software.
Both of those issues can be eased by using a managed cloud service, but IT managers need to keep a close eye on cloud usage to make sure costs don't get out of hand. Also, migrating on-premises data sets and processing workloads to the cloud is often a complex process for organizations.
Making the data in big data systems accessible to data scientists and other analysts is also a challenge, especially in distributed environments that include a mix of different platforms and data stores. To help analysts find relevant data, IT and analytics teams are increasingly working to build data catalogs that incorporate metadata management and data lineage functions. Data quality and data governance also need to be priorities to ensure that sets of big data are clean, consistent and used properly.
Big data collection practices and regulations
For many years, companies had few restrictions on the data they collected from their customers. However, as the collection and use of big data have increased, so has data misuse. Concerned citizens who have experienced the mishandling of their personal data or have been victims of a data breach are calling for laws around data collection transparency and consumer data privacy.
The outcry about personal privacy violations led the European Union to pass the General Data Protection Regulation (GDPR), which took effect in May 2018; it limits the types of data that organizations can collect and requires opt-in consent from individuals or compliance with other specified lawful grounds for collecting personal data. GDPR also includes a right-to-be-forgotten provision, which lets EU residents ask companies to delete their data.
While there aren't similar federal laws in the U.S., the California Consumer Privacy Act (CCPA) aims to give California residents more control over the collection and use of their personal information by companies. CCPA was signed into law in 2018 and is scheduled to take effect on Jan. 1, 2020. In addition, government officials in the U.S. are investigating data handling practices, specifically among companies that collect consumer data and sell it to other companies for unknown use.
Solutions to Big Data Problems
- Apache Hadoop: Open-source software whose main purpose is to manage huge amounts of data in a short span of time and with great ease. Hadoop divides data among multiple machines for processing, and a map of the content is created so the data can be easily accessed and found. Tools like Hadoop are great for managing massive volumes of structured, semi-structured and unstructured data. However, because it is a relatively new technology, many professionals are unfamiliar with it; learning Hadoop takes significant resources, which can divert attention from solving the main problem toward learning the tool itself.
- Visualization: Another way to perform analyses and reporting, although the granularity of the data can sometimes make it harder to reach the level of detail needed.
- Grid Computing: Grid computing is represented by a number of servers interconnected by a high-speed network, with each server playing one or many roles. The two main benefits of grid computing are its high storage capability and its processing power, which translate into data grids and computational grids.
- Apache Spark: Platforms like Spark use in-memory computing to create huge performance gains for high-volume and diversified data. All these approaches allow firms and organizations to explore huge data volumes and derive business insights from them. There are two broad ways to deal with the volume problem: we can either shrink the data or invest in good infrastructure, and based on our budget and requirements we can select from the technologies and methods described above. If we have people with expertise in Hadoop, we can always use it. (A minimal PySpark sketch follows this list.)
- OLAP Tools (Online Analytical Processing Tools): OLAP tools process data by establishing connections between pieces of information and assembling it in a logical way so that it can be accessed easily. OLAP specialists can achieve high speed and low latency when processing high-volume data. One drawback is that OLAP tools process all the data provided to them, whether relevant or not.
- SAP HANA:-SAP HANA is an in-memory data platform that is deployable as an on-premise appliance, or in the cloud. It is a revolutionary platform that's best suited for performing real-time analytics, and developing and deploying real-time applications. New DB and indexing architectures make sense of disparate data sources swiftly.
- Cloud Computing: A solution in the cloud scales much more easily and quickly than an on-premises solution. Keeping data in the cloud solves a large part of the problem, because the cloud can be secured and offers storage capacity that is, for practical purposes, nearly unlimited.
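As a minimal illustration of Spark's in-memory approach (referenced from the Apache Spark item above), the PySpark sketch below caches a dataset once and then runs two aggregations over it without re-reading from disk. It assumes PySpark is installed; the file path and column names are placeholders, not anything prescribed by Spark itself.

```python
# Minimal PySpark sketch: cache a large dataset in memory and run two
# aggregations over it without re-reading from disk each time.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("volume-demo").getOrCreate()

# "events.csv" is a placeholder for any large structured dataset.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

events.cache()  # keep the DataFrame in memory across the queries below

daily_counts = events.groupBy("event_date").count()
top_users = (events.groupBy("user_id")
                   .agg(F.count("*").alias("events"))
                   .orderBy(F.desc("events"))
                   .limit(10))

daily_counts.show()
top_users.show()

spark.stop()
```

Because the data is cached after the first read, the second aggregation reuses the in-memory copy, which is where much of Spark's speed advantage over disk-based batch processing comes from.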
Big Data Tools and Technologies
The Hadoop ecosystem consists of the Hadoop Distributed File System (HDFS) and a number of related components such as Apache Hive, HBase, Oozie, Pig and ZooKeeper. These components are explained below:
- HDFS: A highly fault-tolerant distributed file system responsible for storing data on the cluster. It is designed to be deployed on low-cost hardware, provides high-throughput access to application data and is suitable for applications with large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. It was originally built as infrastructure for the Apache Nutch web search engine project and is now part of the Apache Hadoop Core project (https://hadoop.apache.org/).
- MapReduce: A processing technique and programming model for distributed computing, originally based on Java. A MapReduce job contains two important tasks, Map and Reduce: Map takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs), and Reduce aggregates those tuples into the final result. It is a parallel programming technique for distributed processing of huge amounts of data on clusters (see the word-count sketch after this list).
- HBase: A column-oriented distributed NoSQL database for random read/write access. Its data model is similar to Google's Bigtable and is designed to provide quick random access to huge amounts of structured data. HBase runs on top of the Hadoop file system, can be operated interactively through the HBase shell, and can be accessed programmatically (for example from Java) to perform basic read and write operations (a small client sketch follows this list).
- Pig: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. In short, Pig is a high-level data programming language for analyzing data in Hadoop.
- Hive: Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides data query and analysis. Hive gives an SQL-like interface (HiveQL) to query data stored in various databases and file systems that integrate with Hadoop. Without it, SQL-style queries would have to be implemented in the low-level MapReduce Java API; Hive provides the necessary abstraction so that they do not. Since most data warehousing applications work with SQL-based query languages, Hive aids the portability of SQL-based applications to Hadoop. In short, it is a data warehousing application that provides SQL-like access and a relational model (see the query sketch after this list).
- Sqoop: Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export it back into an RDBMS.
- Oozie: Apache Oozie is a tool with which all sorts of programs can be pipelined in a desired order to run in Hadoop's distributed environment; it also provides a mechanism to run a job on a given schedule. In short, it is an orchestration and workflow manager for dependent Hadoop jobs.
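To make the Map and Reduce steps above concrete, here is a minimal word-count sketch written as two Hadoop Streaming scripts in Python. The file names are illustrative; on a real cluster you would submit the scripts through the hadoop-streaming jar, but they can also be tested locally with a shell pipe.

```python
# mapper.py -- the Map step: read raw text from stdin and emit one
# (word, 1) key/value pair per word, tab-separated, one per line.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- the Reduce step: Hadoop Streaming delivers the mapper
# output sorted by key, so equal words arrive consecutively and can be
# summed with a simple running counter.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally, piping a text file through `mapper.py`, `sort` and `reducer.py` approximates what the framework does across many machines in parallel.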
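As a small illustration of HBase's column-oriented, random-access model, the following sketch uses the happybase Python client. It assumes an HBase Thrift server is reachable, and the host, table and column family names are invented for the example (the table would have to exist already).

```python
# Minimal HBase read/write sketch using the happybase client.
import happybase

# Hypothetical Thrift server address.
connection = happybase.Connection(host="hbase-thrift.example.com", port=9090)
table = connection.table("user_profiles")

# Column-oriented write: values live under a column family ("info" here).
table.put(b"user-42", {
    b"info:name": b"Alice",
    b"info:last_login": b"2019-11-04",
})

# Random read by row key -- the access pattern HBase is designed for.
row = table.row(b"user-42")
print(row[b"info:name"])

connection.close()
```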
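And here is a minimal HiveQL query issued from Python via the PyHive client, assuming a reachable HiveServer2 instance; the host, database, table and column names are placeholders rather than anything mandated by Hive.

```python
# Minimal HiveQL query from Python via PyHive. Hive compiles the SQL-like
# query into distributed jobs over data stored in HDFS.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, database="default")
cursor = conn.cursor()

cursor.execute("""
    SELECT page_url, COUNT(*) AS visits
    FROM clickstream_logs
    GROUP BY page_url
    ORDER BY visits DESC
    LIMIT 10
""")

for page_url, visits in cursor.fetchall():
    print(page_url, visits)

conn.close()
```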
Future Scope of Big Data
We have a plethora of data coming from many sources, and that data can prove to be a valuable asset for decisions that save money or help an organization earn millions of dollars in profit. Big data has unmatched potential to change the way organizations work. This revolutionary era also brings problems, which can be addressed and resolved by the right use of software and hardware, and demand is so high that all the major companies are investing billions of dollars in big data strategies. Looking ahead:
- The current lack of expertise in big data best practices will fade, and a wide variety of products and services will cover every phase of big data.
- Many companies will provide cloud platforms, with intense competition among cloud service providers. We will be able to carry all our data in a personal cloud, so files and folders may no longer need to live on our PCs – the data will move with the person.
- Visual data discovery tools will grow faster than traditional BI tools because of the high demand for them, and big data end-user self-service will become a mandatory offering for all enterprises.
- Cloud-based big data and analytics (BDA) solutions will be strongly preferred, on-premises-only storage will decline, and hybrid (cloud plus on-premises) deployments will be in great demand.
- Big data jobs will be far more numerous than today, and data platforms will unify across information management and technology everywhere.
- There will be a sharp rise in the use and development of applications with advanced and predictive analytics, including machine learning. Purchasing data could become even more costly than today, and organizations will begin selling their data to other firms.
- Analyzing events through Internet of Things (IoT) analytics will no longer be a big deal; it will be a routine thing to do with data. Decision management platforms will become more robust and widespread, providing better data consistency and better results, and media analytics could increase roughly tenfold compared to today.
- Cognitive computing will be used by people on a regular basis, Hadoop will keep improving, and new technologies and tools will be introduced that can store, monitor, measure and integrate all the forms and formats of data around us. Tools for anonymizing, analyzing, sharing and managing our own personal data will also be developed.
Case Studies of Big Data
The current industry depends on data for its earnings; data is money for most companies. As we constantly create and update our data on various online platforms, the problem of Big Data arises. Here I present case studies of the Big Data problems faced by several MNCs and how they manage them.
Google
Google.com is the most visited website on the planet, followed by YouTube.com; both are owned by Google. Besides these two, Google owns several other online services with over a billion users each, such as Gmail, Google Ads, Google Play, Google Maps, Google Drive and Google Chrome.
On a day to day basis, Google has to deal with petabytes of data. Just YouTube alone needs more than a petabyte of new storage every single day.
Google Search is the core service of the company. The search service indexes & caches trillions of web pages, pdfs, images and more containing terabytes of data to enable users to quickly find the information they run a search for. Google search receives approx. 5.4 billion searches every single day. By the year 2010, Google had over 10 billion images indexed in its database.
Google Photos enables users to upload their photos to the cloud. It has become very popular, with over 1.2 billion photos uploaded to the service every single day; collectively that data amounts to approximately 14 petabytes of storage. The service has over a billion users.
Google Ads is an advertising service run by Google. It serves ads over various forms of media created by content creators and earns a share from it. This service is the main source of revenue (86%) for the company.
Gmail & Google Drive have over 1.5 billion users. Google Play has over 1 billion users, it has had over 100 billion app downloads and approx. 3.5 million apps published. Google Maps has over 1 billion users. Google Analytics the website analytics service is the most widely used analytics service on the web. Google Assistant is installed on over 400 million devices. Google Chrome is the most used web browser in the world. Besides these, there are several other add on services offered by Google such as Google docs, sheets, slides, calendar etc.
BigTable is a distributed storage system for managing structured data across thousands of commodity servers. More than 60 Google services like Google Earth, Google Finance, Google Search, Google Analytics, Writely etc. store data in Bigtable.
BigTable efficiently handles the data demands of these services providing scalability, high availability & performance whether it is for indexing urls, processing real-time data or latency-sensitive data serving.
The tech uses Google File System to store log and data files. The BigTable cluster runs on a shared pool of machines & relies on cluster management for scheduling jobs, managing resources on shared machines, dealing with machine failures and monitoring machine status. It uses a distributed lock service called Chubby for locking resources in a distributed environment.
Facebook
Arguably the world’s most popular social media network, with more than two billion monthly active users worldwide, Facebook stores enormous amounts of user data, making it a massive data wonderland. It was estimated that there would be more than 183 million Facebook users in the United States alone by October 2019. Facebook is also among the top 100 public companies in the world, with a market value of approximately $475 billion.
Every day, we feed Facebook’s data beast with mounds of information. Every 60 seconds, 136,000 photos are uploaded, 510,000 comments are posted, and 293,000 status updates are posted. That is a LOT of data.
Facebook Inc. analytics chief Ken Rudin says, “Big Data is crucial to the company’s very being.” He goes on to say that, “Facebook relies on a massive installation of Hadoop, a highly scalable open-source framework that uses clusters of low-cost servers to solve problems. Facebook even designs its hardware for this purpose. Hadoop is just one of many Big Data technologies employed at Facebook.”
Apache Hadoop is the ideal open-source utility to manage big data & Facebook uses it for running analytics, distributed storage & for storing MySQL database backups.
Besides Hadoop, there are also other tools like Apache Hive, HBase and Apache Thrift that are used for data processing.
Facebook has open-sourced the exact versions of Hadoop which they run in production. They have possibly the biggest Hadoop cluster deployment in the world, processing approximately 2 petabytes of data per day across multiple clusters at different data centres.
Facebook Messages uses a distributed database called Apache HBase to stream data to Hadoop clusters. Another use case is collecting user activity logs in real time in Hadoop clusters.
LinkedIn
LinkedIn tracks every move users make on the site, and the company analyses this mountain of data in order to make better decisions and design data-powered features. Clearly, LinkedIn uses Big Data right across the company, but here are just a couple of examples of it in action.
Like other social media networks, LinkedIn uses data to make suggestions for users (“people you may know”). LinkedIn uses machine learning techniques to refine its algorithms and make better suggestions for users. So, if the site regularly suggested people you may know from Company A (which you worked at nine years ago) and Company B (which you worked at four years ago), but you almost never clicked on the company A profiles, LinkedIn would tailor its suggestions going forward with that in mind. This personalised approach enables users to build the networks that work best for them.
Also, the site is constantly gathering and displaying new data for users. LinkedIn uses stream-processing technology to display the most up-to-date information when users are on the site – from who got a new job to useful articles that contacts have shared. Not only does this constant streaming of data add interest, it also speeds up the analytic process. Instead of capturing data and storing it to be analysed at a later time, real-time stream-processing technology allows LinkedIn to stream data direct from the source (user activity) and analyse it on the fly.
LinkedIn tracks every move its users make on the site, from everything liked and shared to every job clicked on and every contact messaged. Hadoop forms the core of LinkedIn’s Big Data infrastructure, but other key parts of the LinkedIn Big Data jigsaw include Oracle, Pig, Hive, Kafka, Java and MySQL. In order to ensure high availability and avoid a single point of failure, the company operates out of three main data centres.
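Kafka (mentioned above) is the messaging backbone that makes this kind of real-time stream processing possible. As a hedged, minimal illustration of analysing an activity stream on the fly rather than storing it for later, here is a kafka-python sketch; the topic name, broker address and event fields are invented, and this is not LinkedIn's actual code.

```python
# Minimal real-time consumer sketch using kafka-python: reads user-activity
# events from a Kafka topic and updates a simple in-memory counter as events
# arrive, instead of storing the data and analysing it later.
import json
from collections import Counter

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-activity",                       # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

clicks_per_page = Counter()

for message in consumer:
    event = message.value                  # e.g. {"user": 1, "page": "/jobs"}
    clicks_per_page[event["page"]] += 1
    print(clicks_per_page.most_common(5))  # continuously updated view
```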
Amazon
Amazon has thrived by adopting an “everything under one roof” model. However, when faced with such a huge range of options, customers can often feel overwhelmed. They effectively become data-rich, with tons of options, but insight-poor, with little idea about what would be the best purchasing decision for them.
To combat this, Amazon uses Big Data gathered from customers while they browse to build and fine-tune its recommendation engine. The more Amazon knows about you, the better it can predict what you want to buy. And, once the retailer knows what you might want, it can streamline the process of persuading you to buy it – for example, by recommending various products instead of making you search through the whole catalogue.
Amazon’s recommendation technology is based on collaborative filtering, which means it decides what it thinks you want by building up a picture of who you are, then offering you products that people with similar profiles have purchased.
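As a rough, hedged illustration of collaborative filtering (not Amazon's actual algorithm), the sketch below builds item-to-item cosine similarities from a tiny made-up purchase matrix and suggests the items most similar to one a customer has already bought.

```python
# Toy item-based collaborative filtering: recommend items similar to one
# the customer already purchased. Purely illustrative data and logic.
import math

# Rows = users, columns = items; 1 means the user bought the item.
purchases = {
    "alice": {"camera": 1, "tripod": 1, "sd_card": 1},
    "bob":   {"camera": 1, "sd_card": 1},
    "carol": {"novel": 1, "bookmark": 1},
    "dave":  {"camera": 1, "tripod": 1},
}

def item_vector(item):
    """Return the column of the purchase matrix for one item."""
    return {user: row.get(item, 0) for user, row in purchases.items()}

def cosine(a, b):
    dot = sum(a[u] * b[u] for u in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similar_items(item, top_n=3):
    items = {i for row in purchases.values() for i in row}
    target = item_vector(item)
    scores = {other: cosine(target, item_vector(other))
              for other in items if other != item}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# A customer who bought a camera will see the tripod and SD card suggested
# first, because similar customers bought those items together.
print(similar_items("camera"))
```

Real systems work at a vastly larger scale and blend many more signals (browsing history, demographics, reviews), but the underlying idea is the same: people with similar purchase vectors are used as proxies for each other's tastes.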
Amazon gathers data on every one of its customers while they use the site. As well as what you buy, the company monitors what you look at, your shipping address (Amazon can take a surprisingly good guess at your income level based on where you live), and whether you leave reviews/feedback.
This mountain of data is used to build up a “360-degree view” of you as an individual customer. Amazon can then find other people who fit into the same precise customer niche (employed males between 18 and 45, living in a rented house with an income of over $30,000 who enjoy foreign films, for example) and make recommendations based on what those other customers like.
Amazon collects data from users as they navigate the site, such as the time spent browsing each page. The retailer also makes use of external datasets, such as census data for gathering demographic details.
Amazon’s core business is handled in its central data warehouse, which consists of Hewlett-Packard servers running Oracle on Linux.
Flipkart
Flipkart, India’s leading e-commerce platform, uses analytics and algorithms to get better insights into its business during any type of sale or festival season. This section explains how Flipkart leverages its Big Data Platform to process big data in both streams and batches. The service-oriented architecture empowers the user experience, optimizes logistics and improves product listings, and it gives an insight into how such an ingenious big data platform is able to process such large amounts of data.
The Flipkart Data Platform (FDP) is a service-oriented architecture capable of computing batch data as well as streaming data. It comprises various micro-services that support the user experience through efficient product listings and price optimization, and it maintains several types of data stores – Redis, HBase, SQL, etc. The FDP stores around 35 petabytes of data and manages 800+ Hadoop nodes on the server side. This is just a brief look at how Big Data is helping Flipkart.
OLA
Ola’s platform generates terabytes of data daily. Deriving insights from data is at the heart of everything they do at Ola. They have a big data platform that ingests, stores and processes terabytes of data every day. Their in-house platform runs on open-source software that is horizontally scalable to parallel-process billions of rows of data coming in from a multitude of data sources: micro-services, application logs, relational database systems and message queues.
Their platform runs on the Hortonworks implementation of the Hadoop framework. Micro-services (the customer app, partner app, etc.) pump data into relational/transactional databases such as MySQL and PostgreSQL. Apache Hive is their data warehouse, hosted on top of Amazon S3 as the storage layer, which ensures they don’t have to worry about infrastructure and storage. This lets data scientists focus on other critical elements like experimentation and innovation.
Analysts query data via Apache Hue. Ola also recently opened up Presto, essentially a distributed SQL query engine, for supersonic-speed querying; Presto’s speed comes from its ability to cache data in memory. Apache Ranger, integrated with LDAP for identity and access management, gives them the ability to democratize data through row-based filtering and column-based masking capabilities.
Netflix
With a 51 percent market share of the American streaming industry and over 148 million streaming subscribers worldwide as of Q4 2018, Netflix is certainly a force to be reckoned with. More interestingly, Netflix is on track to be profitable: its annual revenue figures from 2002 to 2018, as compiled by Statista, make one thing clear – Netflix has been growing consistently and exponentially.
While many organizations have yet to effectively leverage the data available to them, Netflix is a noteworthy exception. At one point, more than 30 million potential users lived in countries where Netflix’s service was unavailable without a VPN or other location-masking services (regions where Netflix is now recording most of its subscription gains). Netflix also hiked its prices and refused to back down despite protests and the loss of hundreds of thousands of subscribers. Yet Netflix has only grown since.
Netflix is betting big on content and user experience; the larger chunk of its budget is spent on content. In 2019, Netflix committed a $15 billion budget to content, compared with a meager $2.9 billion for marketing.
Netflix doesn’t use a traditional data center-based Hadoop data warehouse. In order to allow it to store and process a rapidly increasing data set, it uses Amazon’s S3 to warehouse its data, allowing it to spin up multiple Hadoop clusters for different workloads accessing the same data. In the Hadoop ecosystem, it uses Hive for ad hoc queries and analytics and Pig for ETL (extract, transform, load), and algorithms.
It then created its own Genie project to help handle increasingly massive data volumes as it scales. All this points to one thing: Netflix is very particular about collecting a lot of data and being able to process it to understand exactly what its users want. The result has been nothing short of amazing: 90 percent of Netflix users have engaged with its original content.
Netflix’s big data approach to content is so successful that, compared to the TV industry, where just 35 percent of shows are renewed past their first season, Netflix renews 93 percent of its original series.
Instagram
Instagram is the most popular photo-oriented social network on the planet today. With over a billion users, it has become the first choice for businesses to run their marketing campaigns on.
The server-side code is powered by Django Python. All the web & async servers run in a distributed environment & are stateless.
PostgreSQL is the primary database of the application, it stores most of the data of the platform such as user data, photos, tags, meta-tags etc.
As the platform gained popularity and the data grew huge over time, the engineering team at Instagram evaluated different NoSQL solutions for scaling and finally decided to shard the existing PostgreSQL database, as that best suited their requirements.
The main database cluster of Instagram contains 12 replicas in different zones, running on 12 quadruple extra-large memory instances.
Hive is used for data archiving. It’s a data warehousing software built on top of Apache Hadoop for data query & analytics capabilities. A scheduled batch process runs at regular intervals to archive data from PostgreSQL DB to Hive.
Vmtouch, a tool for learning about and managing the file system cache of Unix and Unix-like systems, is used to manage in-memory data when moving from one machine to another.
Using Pgbouncer to pool PostgreSQL connections when connecting with the backend web server resulted in a huge performance boost.
Redis, an in-memory database, is used to store the activity feed, sessions and other real-time app data.
Memcached, an open-source distributed memory caching system, is used for caching throughout the service.
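To illustrate the kind of Redis usage described above (this is not Instagram's actual code; the host address and key layout are invented), the sketch below keeps a per-user activity feed as a capped Redis list using the redis-py client, so the newest events stay in memory and old ones fall away.

```python
# Toy activity-feed storage in Redis: newest events at the head of a
# per-user list, capped at 100 entries.
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def push_activity(user_id, event):
    key = f"feed:{user_id}"
    r.lpush(key, json.dumps(event))   # newest first
    r.ltrim(key, 0, 99)               # keep only the 100 most recent items

def recent_activity(user_id, count=10):
    key = f"feed:{user_id}"
    return [json.loads(item) for item in r.lrange(key, 0, count - 1)]

push_activity(42, {"type": "like", "photo_id": 1337})
print(recent_activity(42))
```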
Conclusion
Big Data helps various technological sectors and companies improve their growth and meet the ever-growing demand for data storage and processing efficiently. It also helps companies analyze large stacks of data about various events and scale up their customer base in the market.
I hope you find this article informative. Thanks for reading!