Big Data: What is it?
Hello, connections!
In today’s world, social media has developed into a very powerful tool. We use Facebook, Google, and Instagram every day for social connectivity or for gathering information. These platforms accumulate a massive amount of data daily. So a question arises: how do they face such a big challenge? How do they handle such tremendous data? If you are curious about these questions, read on:
Before diving into Big Data, let us first understand what data actually is:
Data refers to the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
As a general concept, data means that some existing information or knowledge is represented or encoded in a form suitable for storage and processing.
Big Data:
Big data is a field that deals with ways to analyze, systematically extract information from, or otherwise handle datasets that are too large or complex for traditional data-processing application software. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate.
However, there are certain basic tenets of Big Data that make it simpler to answer what Big Data is:
- It refers to a massive amount of data that keeps on growing exponentially with time.
- It is so voluminous that it cannot be processed or analyzed using conventional data processing techniques.
- It includes data mining, data storage, data analysis, data sharing, and data visualization.
Types of Big Data:
Structured Big Data: Structured data is data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in, and accessed from, a database by simple search-engine algorithms. Relational database tables, such as an employee table with names, IDs, and salaries, are an example of structured data.
Unstructured Big Data: Unstructured data refers to data that lacks any specific form or structure whatsoever, which makes it very difficult and time-consuming to process and analyze. Text documents, images, audio, and video files are examples of unstructured data.
Semi-Structured Big Data: Semi-structured is the third type of big data. It pertains to data containing both of the formats mentioned above, structured and unstructured. To be precise, it refers to data that has not been classified under a particular repository (database), yet contains tags or markers that segregate individual elements within the data. Email is an example of semi-structured data, as the small sketch below shows.
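To illustrate the semi-structured case, here is a minimal Python sketch: an email’s header fields are structured and queryable, while its body is free text. The message content below is made-up example data.

```python
# A tiny sketch of semi-structured data: an email has structured header
# fields (tags that segregate elements) plus an unstructured free-text body.
# The message below is made-up example data.
from email.parser import Parser

raw = """From: alice@example.com
To: bob@example.com
Subject: Quarterly report

Hi Bob, the numbers look good this quarter. See you Monday."""

msg = Parser().parsestr(raw)
print(msg["From"], "->", msg["To"])   # structured, queryable fields
print(msg.get_payload())              # unstructured body text
```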
Characteristics of Big Data:
Volume
Volume refers to the unimaginable amounts of information generated every second from social media, cell phones, cars, credit cards, M2M sensors, images, video, and more. We currently use distributed systems to store data in several locations, brought together by a software framework like Hadoop.
Facebook alone generates billions of messages a day; the “like” button is recorded about 4.5 billion times, and over 350 million new posts are uploaded each day. Such a huge amount of data can only be handled by Big Data technologies.
Variety
As discussed before, Big Data comes in multiple varieties. Compared to traditional data like phone numbers and addresses, the latest data arrives in the form of photos, videos, audio, and much more, making about 80% of all data completely unstructured.
Structured data is just the tip of the iceberg.
Veracity
Veracity refers to the degree of reliability the data has to offer. Since a major part of the data is unstructured or irrelevant, Big Data systems need ways to filter or transform it, because reliable data is crucial to business decisions.
Value
Value is the aspect we ultimately need to concentrate on. It is not just the amount of data that we store or process; it is the amount of valuable, reliable, and trustworthy data that must be stored, processed, and analyzed to find insights.
Velocity
Last but not least, Velocity plays a major role compared to the others: there is no point in investing so much only to end up waiting for the data. A major aspect of Big Data is therefore to provide data on demand and at a faster pace.
Applications of Big Data
Big Data is considered the most valuable and powerful fuel running the massive IT industry of the 21st century. It is among the most widespread technologies, used in almost every business sector. Let us now check out a few applications below.
Travel and tourism is one of the biggest users of Big Data technology. It has enabled providers to predict the demand for travel facilities in many places and to improve business through dynamic pricing, among other things.
The financial and banking sector uses Big Data technology extensively. Big data analytics can help banks understand customer behaviour based on inputs such as investment patterns, shopping trends, motivation to invest, and personal or financial backgrounds.
Big Data has already started to make a huge difference in the healthcare sector. With the help of predictive analytics, medical professionals and healthcare personnel are now able to provide personalized healthcare services to individual patients.
The telecommunication and multimedia sector is one of the primary users of Big Data. Zettabytes of data are generated every day, and handling such volumes requires nothing less than Big Data technologies.
Government and military organisations also use Big Data technology at a high rate. Consider the amount of data a government generates in its records; in the military, a fighter jet has to process petabytes of data during a flight.
Advantages of Big Data
Big Data technology has given us multiple advantages, of which we will now discuss a few.
· Big Data has enabled predictive analysis, which can save organisations from operational risks.
· Predictive analysis has helped organisations grow their business by analysing customer needs.
· Big Data has enabled many multimedia platforms to share data, e.g. YouTube and Instagram.
· Medical and healthcare sectors can keep patients under constant observation.
· Big Data has changed the face of customer-based companies and the worldwide market.
Here are some case studies of popular companies around the world related to data storage:
1]Google:
Google.com is the most visited website on the planet, followed by YouTube.com; both services are owned by Google. Besides these two, Google owns multiple other online services, each with over a billion users, such as Gmail, Google Ads, Google Play, Google Maps, Google Drive, and Google Chrome. By the year 2010, Google had over 10 billion images indexed in its database.
Google Photos has become quite popular, with over 1.2 billion photos uploaded to the service every single day. Collectively, the data amounts to approximately 14 petabytes of storage. The service has over a billion users.
YouTube is a social video-sharing platform and the second most visited website on the planet. With over 2 billion users, the platform generates billions of views, with over 1 billion hours of video watched every single day.
2] Facebook:
Arguably the world’s most popular social media network, with more than two billion monthly active users worldwide, Facebook stores enormous amounts of user data, making it a massive data wonderland. It was estimated that there would be more than 183 million Facebook users in the United States alone by October 2019. Facebook is also among the top 100 public companies in the world, with a market value of approximately $475 billion. Here are some details about the data Facebook receives on a daily basis:
- Users share 2.5 billion pieces of content every day.
- Users generate 2.7 billion “likes” every day.
- More than 250 billion photos have been uploaded to Facebook.
- 100+ PB of disk space in a single HDFS cluster.
- Hive is Facebook’s data warehouse, with 300 petabytes of data.
- 70,000+ queries executed per day.
- 500+ TB of new data ingested per day.
- Users spend an average of 20 minutes per day on the site.
- Facebook now sees 100 million hours of daily video watch time.
- 30% of internet users use Facebook more than once a day.
- Facebook generates 4 new petabytes of data per day.
3] Instagram:
95 million photos and videos are shared on Instagram per day, and over 40 billion photos and videos have been shared on the platform since its inception.
· 1 billion Instagram monthly users as of June 2018
· 500 million daily Instagram Stories users
· 110 million Instagram US users, 70 million in Brazil, and 69 million in India
· 34% of Instagram users are aged 25-34; 31% are 18-24
· 51.2% of the global Instagram user base are female, 48.8% male
· US Instagram penetration at 37%
· 75% of US 18-24 year olds are Instagram users
· 35% of US teenagers say Instagram is their favourite social media
· Two-thirds of 18-24-year-old Instagram users use the platform multiple times per day, compared to 60% of 25-34-year-olds and 49% of 35-44-year-olds
· Instagrammers under the age of 25 spend 32 minutes per day on the platform; those older spend 24 minutes, according to Facebook.
4] LinkedIn:
LinkedIn tracks every move users make on the site, and the company analyses this mountain of data in order to make better decisions and design data-powered features. Clearly, LinkedIn uses Big Data right across the company, but here are just a couple of examples of it in action. Like other social media networks, LinkedIn uses data to make suggestions for users (“people you may know”), and it uses machine learning techniques to refine its algorithms and make better suggestions. So, if the site regularly suggested people you may know from Company A (which you worked at nine years ago) and Company B (which you worked at four years ago), but you almost never clicked on the Company A profiles, LinkedIn would tailor its suggestions going forward with that in mind, as the toy sketch below illustrates. This personalized approach enables users to build networks that work best for them.
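Here is a deliberately tiny Python sketch of the feedback idea described above: suggestions from a company whose profiles the user rarely clicks get downweighted. The data and the scoring rule are illustrative assumptions, not LinkedIn’s actual algorithm.

```python
# Toy re-ranking of "people you may know" suggestions by observed
# click-through rate per company. All data here is made up.
suggestions = [
    {"name": "Alice", "company": "Company A"},
    {"name": "Bob",   "company": "Company B"},
]
click_rate = {"Company A": 0.02, "Company B": 0.35}  # clicks / impressions

# Rank suggestions so that companies the user actually engages with
# float to the top; unknown companies get a neutral default score.
ranked = sorted(suggestions,
                key=lambda s: click_rate.get(s["company"], 0.1),
                reverse=True)
for s in ranked:
    print(s["name"], s["company"])
```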
Also, the site is constantly gathering and displaying new data for users. LinkedIn uses stream-processing technology to display the most up-to-date information when users are on the site – from who got a new job to useful articles that contacts have shared. Not only does this constant streaming of data add interest, but it also speeds up the analytic process. Instead of capturing data and storing it to be analyzed at a later time, real-time stream-processing technology allows LinkedIn to stream data directly from the source (user activity) and analyze it on the fly.
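The general pattern behind this kind of on-the-fly analysis can be sketched with Apache Kafka, a stream-processing platform that was in fact originally developed at LinkedIn. The topic name, broker address, and event fields below are illustrative assumptions, not LinkedIn’s actual pipeline.

```python
# A minimal sketch of consuming a stream of user-activity events with
# Apache Kafka. Requires: pip install kafka-python. The topic name,
# broker address, and event fields are hypothetical.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-activity",                     # hypothetical topic
    bootstrap_servers="localhost:9092",  # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# Analyze each event as it arrives instead of storing it for later;
# this loop blocks and waits for new messages from the stream.
for event in consumer:
    activity = event.value
    print(activity.get("user_id"), activity.get("action"))
```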
5] Walmart:
Walmart is the largest retailer in the world and the world’s largest company by revenue, with more than 2 million employees and around 20,000 stores in 28 countries. It started making use of big data analytics well before the term “Big Data” came into the picture.
Walmart uses data mining to discover patterns that can be used to provide product recommendations to the user, based on which products were bought together; a toy sketch of this idea follows below. By applying effective data mining, Walmart has increased its customer conversion rate. It has been speeding along big data analysis to provide best-in-class e-commerce technologies with a motive to deliver a superior customer experience. The main objective of using big data at Walmart is to optimize the shopping experience of customers when they are in a Walmart store. Big data solutions at Walmart are developed with the intent of redesigning global websites and building innovative applications to customize the shopping experience for customers whilst increasing logistics efficiency. Hadoop and NoSQL technologies are used to provide internal customers with access to real-time data collected from different sources and centralized for effective use.
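To make the “bought together” idea concrete, here is a minimal market-basket sketch in Python. The transactions are made-up example data; a real retailer would run this kind of co-occurrence counting at scale, for example with Spark.

```python
# A toy sketch of market-basket analysis behind "products bought together"
# recommendations. Transactions below are made-up example data.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs", "butter"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in transactions:
    # Count every unordered pair of items appearing in the same basket.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pairs are candidates for "customers also bought" hints.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```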
The solution to Big Data problems
Now, what is the traditional method we have been using? Generally, we store our data on an HDD or SSD, but big companies have to manage enormous volumes of data (hundreds of terabytes) on a daily basis. Is there any HDD or SSD available with 500 TB of storage? The American data storage company Nimbus Data holds the record for the world’s biggest SSD, with a whopping 100 TB of capacity. Its price used to be available only on demand, as per a report by TechRadar, but the company has since revealed it for everyone to see: $40,000, or about Rs. 30 lakh. And it still falls 400 TB short of what such a company needs to store at a time.
Here comes the first sub-problem under Big Data, called Volume. Volume refers to the problem of having a huge amount of data to store without the resources to store it. Do you think that if a company wanted to create an HDD with 500 TB of storage, it could not have built one by now? Of course it could! But it would then face an I/O problem: a single device can only read and write data so fast. The most optimal solution to this problem, used by almost all companies today, is distributed storage.
A distributed object store is made up of many individual object stores, normally consisting of one or a small number of physical disks. These object stores run on commodity server hardware, which might be the compute nodes or might be separate servers configured solely for providing storage services. As such, the hardware is relatively inexpensive.
The disk of each virtual machine is broken up into a large number of small segments, typically a few megabytes in size each, and each segment is stored several times (often three) on different object stores. Each copy of each segment is called a replica. The system is designed to tolerate failure: as relatively inexpensive hardware is used, failure of individual object stores is comparatively frequent; indeed, with enough object stores, failure becomes inevitable.
However, as it would require every replica to become unavailable for data to be lost, the failure of an individual object store is not an ‘emergency event’ requiring the call-out of storage engineers, but something handled through routine maintenance. Performance does not noticeably degrade, and the under-replicated data is gradually and automatically re-replicated from the existing replicas. There is no ‘re-silvering’ operation to perform when the defective object store is replaced, in the way that would happen with a replacement RAID disk.
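To illustrate the segmentation-and-replication idea, here is a minimal Python sketch that hashes each segment to three distinct object stores. The store names, segment size, and ring-style placement rule are simplifying assumptions, not any particular product’s algorithm.

```python
# A minimal sketch of replica placement in a distributed object store:
# each fixed-size segment is hashed to pick 3 distinct stores. Names,
# segment size, and the placement rule are illustrative assumptions.
import hashlib

OBJECT_STORES = [f"store-{i}" for i in range(8)]  # hypothetical nodes
REPLICAS = 3
SEGMENT_SIZE = 4 * 1024 * 1024  # "a few megabytes in size each"

def place_segment(segment_id):
    """Pick REPLICAS distinct object stores for one segment."""
    digest = int(hashlib.sha256(segment_id.encode()).hexdigest(), 16)
    start = digest % len(OBJECT_STORES)
    # Walk around the ring of stores so replicas land on distinct nodes;
    # losing one node still leaves two live replicas to re-replicate from.
    return [OBJECT_STORES[(start + i) % len(OBJECT_STORES)]
            for i in range(REPLICAS)]

# A virtual disk breaks into many small segments, each placed separately.
for seg in range(3):
    segment_id = f"disk1/segment-{seg}"
    print(segment_id, "->", place_segment(segment_id))
```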
Best Big Data Tools: here is a list of the top 10 big data tools:
- Apache Hadoop
- Apache Spark
- Flink
- Apache Storm
- Apache Cassandra
- MongoDB
- Kafka
- Tableau
- RapidMiner
- R Programming
Apache Hadoop:
Apache Hadoop is one of the most widely used tools in the Big Data industry. Hadoop is an open-source framework from Apache that runs on commodity hardware and is used to store, process, and analyze Big Data. It is written in Java. Apache Hadoop enables parallel processing of data, as it works on multiple machines simultaneously using a clustered architecture; a cluster is a group of systems connected via a LAN. The sketch below illustrates the MapReduce model at the heart of Hadoop.
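Here is a minimal word-count sketch of the MapReduce model in Python. It runs locally in a single process purely for illustration; real Hadoop distributes the map and reduce phases across the machines of a cluster.

```python
# A minimal word-count sketch in the MapReduce style that Hadoop uses.
# Runs locally in one process to illustrate the model; real Hadoop
# distributes the map and reduce phases across a cluster.
from collections import defaultdict

def map_phase(line):
    # Map: emit (word, 1) for every word in a line of input.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: sum the counts collected for one word.
    return word, sum(counts)

lines = ["Hadoop stores Big Data", "Hadoop processes Big Data in parallel"]

grouped = defaultdict(list)  # shuffle step: group emitted values by key
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

for word in sorted(grouped):
    print(reduce_phase(word, grouped[word]))
```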
Conclusion
To conclude, Data Scientists are the backbone of data-intensive companies. Their purpose is to extract, preprocess, and analyze data, and through this, companies can make better decisions. Various companies have their own requirements and use data accordingly. In the end, the goal of a Data Scientist is to make businesses grow better: with the decisions and insights provided, companies can adopt appropriate strategies and customize themselves for an enhanced customer experience.
Thank you for reading my article!