Big Data Problems and Solutions

Big data is a field that deals with ways to analyze, systematically extract information from, or otherwise handle data sets that are too large or complex for traditional data-processing application software. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.[2] Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sources. Big data was originally associated with three key concepts: volume, variety, and velocity. When we handle big data, we may not sample but simply observe and track what happens. Therefore, big data often includes data sets whose size exceeds the capacity of traditional software to process within an acceptable time and at an acceptable value.

Big data can be described by the following characteristics:

Volume

The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.

Variety

The type and nature of the data. Earlier technologies such as RDBMSs were capable of handling structured data efficiently and effectively. However, the shift from structured to semi-structured and unstructured data challenged the existing tools and technologies. Big data technologies evolved with the prime intention of capturing, storing, and processing semi-structured and unstructured data (variety) generated at high speed (velocity) and huge in size (volume). Later, these tools and technologies were also explored for structured data, though mainly for storage; processing of structured data remained optional, handled either by big data platforms or by traditional RDBMSs. This supports analysis and the effective use of the hidden insights in data collected from social media, log files, sensors, and other sources. Big data draws from text, images, audio, and video, and it completes missing pieces through data fusion.
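As a minimal sketch of what "variety" means in practice, the Python snippet below (using entirely made-up log records) flattens semi-structured JSON events whose fields differ from record to record into one uniform, table-like structure, the kind of step a fixed relational schema cannot absorb as easily.

```python
import json

# Hypothetical semi-structured log lines: the fields present vary from record to
# record, which is exactly the "variety" that a fixed relational schema struggles with.
raw_lines = [
    '{"user": "u1", "action": "like", "ts": 1700000000}',
    '{"user": "u2", "action": "comment", "text": "nice!", "ts": 1700000005}',
    '{"user": "u3", "action": "upload", "media": {"type": "photo", "bytes": 204800}}',
]

def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries so every record maps to one flat row."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

rows = [flatten(json.loads(line)) for line in raw_lines]

# The union of all observed keys becomes the "schema"; missing values stay None.
columns = sorted({key for row in rows for key in row})
table = [{col: row.get(col) for col in columns} for row in rows]

for row in table:
    print(row)
```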

Velocity

The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real time. Compared to small data, big data is produced more continually. Two kinds of velocity relate to big data: the frequency of generation and the frequency of handling, recording, and publishing.
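The sketch below is a toy simulation, not any real system: events are generated in bursts faster than a fixed-capacity handler can drain them, so a backlog builds up, which is the essence of the velocity challenge.

```python
from collections import deque
import random
import time

backlog = deque()          # events generated but not yet handled
HANDLE_PER_TICK = 3        # hypothetical processing capacity per tick

for tick in range(10):
    # Simulate bursty generation: 0 to 6 new events arrive this tick.
    generated = random.randint(0, 6)
    for i in range(generated):
        backlog.append(f"event-{tick}-{i}")

    # Handle at most HANDLE_PER_TICK events this tick.
    handled = 0
    while backlog and handled < HANDLE_PER_TICK:
        backlog.popleft()
        handled += 1

    print(f"tick {tick}: generated={generated} handled={handled} backlog={len(backlog)}")
    time.sleep(0.1)
```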

Veracity

Veracity is an extension of the original definition of big data, referring to data quality and data value. The quality of captured data can vary greatly, which affects the accuracy of any analysis.
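As a small, hypothetical illustration of a veracity check, the snippet below counts missing and implausible values in a handful of made-up sensor records before they reach any analysis; the field name and thresholds are invented for the sketch.

```python
# Hypothetical captured records; field names and thresholds are illustrative only.
records = [
    {"sensor_id": "s1", "temperature": 21.4},
    {"sensor_id": "s2", "temperature": None},      # missing reading
    {"sensor_id": "s3", "temperature": 999.0},     # implausible value
    {"sensor_id": "s1", "temperature": 22.1},
]

def quality_report(rows, field, lo, hi):
    """Count missing and out-of-range values for one numeric field."""
    missing = sum(1 for r in rows if r.get(field) is None)
    out_of_range = sum(
        1 for r in rows
        if r.get(field) is not None and not (lo <= r[field] <= hi)
    )
    usable = len(rows) - missing - out_of_range
    return {"total": len(rows), "missing": missing,
            "out_of_range": out_of_range, "usable": usable}

print(quality_report(records, "temperature", lo=-40.0, hi=60.0))
```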

Let's talk about some companies that are facing such problems:

Facebook

Arguably the world's most popular social media network, with more than two billion monthly active users worldwide, Facebook stores enormous amounts of user data, making it a massive data wonderland. It was estimated that there would be more than 183 million Facebook users in the United States alone by October 2019. Facebook is also among the top 100 public companies in the world, with a market value of approximately $475 billion.

Every day, we feed Facebook's data beast with mounds of information. Every 60 seconds, 136,000 photos are uploaded, 510,000 comments are posted, and 293,000 status updates are published. That is a LOT of data. At first, this information may not seem to mean very much, but with data like this, Facebook knows who our friends are, what we look like, where we are, what we are doing, our likes, our dislikes, and so much more. Some researchers even say Facebook has enough data to know us better than our therapists!
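As a quick back-of-the-envelope calculation, scaling the per-minute figures above to a full day (60 × 24 = 1,440 minutes) gives a sense of the daily volume:

```python
# Scaling the article's per-minute figures to a single day.
PER_MINUTE = {
    "photos uploaded": 136_000,
    "comments posted": 510_000,
    "status updates": 293_000,
}
MINUTES_PER_DAY = 60 * 24

for name, per_min in PER_MINUTE.items():
    per_day = per_min * MINUTES_PER_DAY
    print(f"{name}: {per_min:,}/minute = {per_day:,}/day")

# photos uploaded: 136,000/minute = 195,840,000/day
# comments posted: 510,000/minute = 734,400,000/day
# status updates: 293,000/minute = 421,920,000/day
```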

Apart from Google, Facebook is probably the only company that possesses this high level of detailed customer information. The more users who use Facebook, the more information it amasses. Facebook has invested heavily in its ability to collect, store, and analyze data, and it does not stop there: beyond analyzing user data, Facebook has other ways of determining user behavior.

Tracking cookies: Facebook tracks its users across the web by using tracking cookies. If a user is logged into Facebook and simultaneously browses other websites, Facebook can track the sites they are visiting.

Facial recognition: One of Facebook's latest investments has been in facial recognition and image processing capabilities. Using the image data users share, Facebook can track them across the internet and across other Facebook profiles.

Tag suggestions: Facebook suggests who to tag in user photos through image processing and facial recognition.

Analyzing the Likes: A recent study showed that it is possible to accurately predict a range of highly sensitive personal attributes just by analyzing a user's Facebook Likes. Work conducted by researchers at Cambridge University and Microsoft Research shows how patterns of Facebook Likes can very accurately predict your sexual orientation, satisfaction with life, intelligence, emotional stability, religion, alcohol and drug use, relationship status, age, gender, race, and political views, among many others.

Facebook Inc. analytics chief Ken Rudin says, "Big Data is crucial to the company's very being." Facebook relies on a massive installation of Hadoop, a highly scalable open-source framework that uses clusters of low-cost servers to solve problems; Facebook even designs its own hardware for this purpose. Hadoop is just one of many big data technologies employed at Facebook.
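Facebook's actual Hadoop jobs are not public, but the MapReduce model that Hadoop implements can be illustrated with the classic word-count example. The sketch below is written in the Hadoop Streaming style (the mapper and reducer read stdin and write stdout), so it can also be tested locally with Unix pipes; the file name wordcount.py in the comment is just a placeholder.

```python
#!/usr/bin/env python3
# A minimal word-count job in the Hadoop Streaming style: the mapper emits
# "word<TAB>1" pairs and the reducer sums the counts for each word. Hadoop sorts
# the mapper output by key before the reducer sees it; locally you can simulate
# that with:
#   cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
import sys
from itertools import groupby

def mapper(stream):
    for line in stream:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

def reducer(stream):
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if mode == "map" else reducer)(sys.stdin)
```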

Google

Google has not only significantly influenced the way we can now analyse big data (think MapReduce, BigQuery, etc.), but it is probably more responsible than anyone else for making big data part of our everyday lives. I believe that many of the innovative things Google is doing today are things most companies will be doing in years to come. Many people, particularly those who didn't get online until this century had started, will have had their first direct experience of manipulating big data through Google. Although these days Google's big data innovation goes well beyond basic search, search is still its core business. Google processes 3.5 billion requests per day, and each request queries a database of 20 billion web pages.

That index is refreshed daily as Google's bots crawl the web, copying down what they see and taking it back to be stored in Google's index database. What pushed Google ahead of other search engines was its ability to analyse wider data sets for search. Initially this was PageRank, which incorporated information about the sites linking to a particular site in the index to help measure that site's importance in the grand scheme of things. Previously, leading search engines worked almost entirely on the principle of matching relevant keywords in the search query to sites containing those words; PageRank revolutionized search by incorporating other elements alongside keyword analysis.

Google's aim has always been to make as much of the world's information available to as many people as possible (and get rich trying, of course), and the way Google search works has been constantly revised and updated to keep up with this mission. The current aim is to move further away from keyword-based search and towards semantic search: analysing not just the "objects" (words) in the query but the connections between them, to determine what the query means as accurately as possible. To this end, Google throws a whole heap of other information into the mix. In 2007 it launched Universal Search, which pulls in data from hundreds of sources, including language databases, weather forecasts and historical data, financial data, travel information, currency exchange rates, sports statistics, and a database of mathematical functions.
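Google's production ranking is vastly more sophisticated and not public, but the core PageRank idea described above, that a page's importance depends on the importance of the pages linking to it, can be sketched as a simple power iteration over a hypothetical link graph:

```python
# A toy PageRank computation (not Google's production algorithm): each page's score
# is redistributed to the pages it links to, and the process is repeated until the
# scores settle.
links = {                      # hypothetical link graph: page -> pages it links to
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["blog"],
}
DAMPING = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}    # start from a uniform distribution

for _ in range(50):                            # 50 iterations is plenty for a toy graph
    new_rank = {p: (1 - DAMPING) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = DAMPING * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```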

Cornerstone

Cornerstone is a software tool that helps assess and understand employees and candidates by crunching half a billion data points on everything from gas prices and unemployment rates to social media use. Clients such as Xerox use it to predict, for example, how long an employee is likely to stay in his or her job, and remarkable insights gleaned include the fact that in some careers, such as call centre work, employees with criminal records perform better than those without. Its prowess has made Cornerstone a huge success, with sales growing by 150% from 2012 to 2013 and the software being put to use by 20 of the Fortune 100 companies.

The "data points" are measurements taken from employees working across 18 industries in 13 different countries, providing information on everything from how long they take to travel to work to how often they speak to their managers. Data collection methods include the controversial "smart badges" that monitor employee movements and track which employees interact with each other.

Cornerstone has certainly caused positive change in the companies using it. Bank of America reportedly improved performance metrics by 23% and decreased stress levels (measured by analysing workers' voices) by 19%, simply by allowing more staff to take their breaks together. And Xerox reduced call centre turnover by 20% by applying analytics to prospective candidates, finding among other things that creative people were more likely than inquisitive people to remain with the company for the 6 months necessary to recoup the $6,000 cost of their training.

So far, data gathering and analysis has focused mainly on customer-facing members of staff, who in larger organizations tend to be those with less responsibility and decision-making power. Could even greater benefits come from applying the same principles to the movers and shakers in the boardroom, who hold the keys to wider-reaching business change? Certainly some companies are starting to think that way.
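Cornerstone's models and data are proprietary, so purely as a sketch of the kind of attrition prediction described above, the snippet below fits a logistic regression to synthetic data; the feature names (commute_minutes, manager_chats_per_week, creativity_score) and the relationships baked into the fake labels are invented for illustration.

```python
# A sketch of attrition prediction in the spirit of the case study above. The data
# is synthetic and the features are made up; nothing here reflects Cornerstone's
# actual models.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

commute_minutes = rng.normal(35, 15, n).clip(5, 120)
manager_chats_per_week = rng.poisson(2, n)
creativity_score = rng.uniform(0, 1, n)

# Synthetic ground truth: longer commutes hurt retention, while creativity and
# regular manager contact help, loosely echoing the article's anecdotes.
logit = -0.03 * commute_minutes + 0.4 * manager_chats_per_week + 1.5 * creativity_score
stayed_6_months = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([commute_minutes, manager_chats_per_week, creativity_score])
model = LogisticRegression().fit(X, stayed_6_months)

candidate = np.array([[60, 1, 0.9]])   # hypothetical applicant
print("P(stays >= 6 months):", model.predict_proba(candidate)[0, 1].round(2))
```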

Microsoft

Since it was founded in 1975 by Bill Gates and Paul Allen, Microsoft has been a key player in just about every major advance in the use of computers, at home and in business. Just as it anticipated the rise of the personal computer, the graphical operating system and the internet, it wasn't taken by surprise by the dawn of the big data era. It might not always be the principal source of innovation, but it has always excelled at bringing innovation to the masses and packaging it into a user-friendly product (even though many would argue against this). It has caused controversy along the way, though: at one time it was called an "abusive monopoly" by the US Department of Justice over its packaging of Internet Explorer with Windows operating systems, and in 2004 it was fined over $600m by the European Union following anti-trust action.

The company's fortunes have wavered in recent years. Notably, it was slow to come up with a solid plan for capturing a significant share of the booming mobile market, causing it to lose ground (and brand recognition) to competitors Apple and Google. However, it remains a market leader in business and home computer operating systems, office productivity software, web browsers, games consoles and search, with Bing having overtaken Yahoo as the second most-used search engine. It is now angling to become a key player in big data, too, offering businesses a suite of services and tools including data hosting and Hadoop-based analytics services.

But Microsoft had a substantial head start over the competition; in fact, its first forays into the world of big data started well before even the first version of MS-DOS. Gates and Allen's first business venture, launched two years before Microsoft, was a service providing real-time reports for traffic engineers using data from roadside traffic counters. It's clear that the founders of what would grow into the world's biggest software company knew how important information (specifically, getting the right information to the right people, at the right time) would become in the digital age.

Microsoft competed in the search engine wars from the beginning, rebranding its engine along the way from MSN Search to Windows Live Search and Live Search before finally arriving at Bing in 2009. Although most of the changes it brought in appeared designed to ape the undisputed champion of search, Google (such as incorporating various indexes, public records and relevant paid advertising into its results), there are differences. Bing places more importance on how widely shared information is on social networks when ranking it.
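Bing's real ranking formula is not public; purely to illustrate the idea of blending keyword relevance with a social-sharing signal, the sketch below combines the two with an invented weighting and a logarithm to dampen very large share counts.

```python
import math

# Purely illustrative: the scoring formula, weight, and page data below are invented
# for the sketch and do not describe Bing's actual ranking.
def ranking_score(keyword_relevance, social_shares, social_weight=0.3):
    """Blend a 0-1 keyword-relevance score with a dampened social-sharing signal."""
    social_signal = math.log1p(social_shares) / 10.0   # dampen huge share counts
    return (1 - social_weight) * keyword_relevance + social_weight * social_signal

pages = [
    {"url": "example.com/a", "keyword_relevance": 0.90, "social_shares": 12},
    {"url": "example.com/b", "keyword_relevance": 0.75, "social_shares": 8_500},
    {"url": "example.com/c", "keyword_relevance": 0.80, "social_shares": 300},
]

ranked = sorted(
    pages,
    key=lambda p: -ranking_score(p["keyword_relevance"], p["social_shares"]),
)
for page in ranked:
    score = ranking_score(page["keyword_relevance"], page["social_shares"])
    print(f"{page['url']}: {score:.3f}")
```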



