Big Data & Big Management
Tejal Bangali
Associate SAP Analytics Cloud Consultant | SAP SAC | SAC Analytic Application
What is Big Data?
To really understand big data, it’s helpful to have some historical background. Here is Gartner’s definition: Big data is data that contains greater variety arriving in increasing volumes and with ever-higher velocity. This is known as the three Vs. Put simply, big data is larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can’t manage them. But these massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before.
The Three Vs of Big Data
- Volume: The amount of data matters. With big data, you’ll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a webpage or a mobile app, or sensor-enabled equipment. For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of petabytes.
- Velocity: Velocity is the fast rate at which data is received and perhaps acted on. Normally, the highest velocity of data streams directly into memory versus being written to disk. Some internet-enabled smart products operate in real-time or near real-time and will require real-time evaluation and action.
- Variety: Variety refers to the many types of data that are available. Traditional data types were structured and fit neatly in a relational database. With the rise of big data, data arrives in new unstructured forms. Unstructured and semi-structured data types, such as text, audio, and video, require additional preprocessing to derive meaning and support metadata (a short parsing sketch follows this list).
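To make the variety point concrete, here is a minimal Python sketch of normalizing mixed inputs into one shape before analysis. The record formats and field names are invented for illustration; a real pipeline would route the unstructured path into NLP or media-processing stages.

```python
import json

# Two invented records: the same pipeline may receive semi-structured and
# unstructured inputs side by side.
raw_records = [
    '{"user": 42, "action": "click", "page": "/home"}',  # semi-structured JSON
    "2024-01-15 09:30:01 ERROR disk full on node-7",     # unstructured log text
]

def normalize(record: str) -> dict:
    """Coerce a raw record into one uniform shape, whatever its source."""
    try:
        # Structured path: the record parses as JSON.
        return {"kind": "event", **json.loads(record)}
    except json.JSONDecodeError:
        # Unstructured path: keep the raw text for later extraction.
        return {"kind": "text", "raw": record}

for rec in raw_records:
    print(normalize(rec))
```

The point the sketch makes is that with variety, most of the cost sits in this preprocessing step rather than in storage.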
Did you know? According to research:
- LinkedIn crunches more than 120 billion relationships per day and uses Hadoop to blend large-scale data computation with high-volume, low-latency site serving. The 'People You May Know' feature is the result of scanning through billions of user triggers and activities via a pipeline of 82 MapReduce jobs, each processing 16 TB of data. This pipeline uses a statistical model to predict the probability of two people knowing each other (a toy sketch of the idea appears after this list). LinkedIn also builds an index structure in its Hadoop pipeline; this produces a multi-terabyte lookup structure that uses perfect hashing, requiring only 2.5 bits per key, which yields faster server-cluster responses. It takes LinkedIn about 90 minutes to build a 900 GB datastore on a 45-node development cluster.
- Facebook has revealed that its systems process 2.5 billion pieces of content and 500+ terabytes of data each day. It pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data each half-hour. Facebook has also revealed that it stores over 100 petabytes of data in a single Hadoop disk cluster, the single largest Hadoop system in the world. User actions such as a "like" or a status update are stored in a highly distributed, customized MySQL database, while applications such as Facebook Messaging run on top of HBase, the NoSQL database built on Hadoop.
- Instagram operates one of the world's largest Hadoop clusters and very likely the world's largest MySQL deployment. Its engineers have developed everything from a PHP-optimized platform to a NoSQL database to a tool for auto-provisioning and configuring the tens of thousands of servers where all of its images are stored and processed. There are 500 million daily Instagram Stories users, and 73.5% of the content is images, 13.7% is video, and 12.7% is carousels.
- Twitter uses Hadoop, especially Pig and HBase, for data analysis. Twitter has large data storage and processing requirements, so it uses Hadoop to optimize its data storage and workflow solutions. It runs Hadoop and Pig on top of LZO-compressed files, enabling quick analysis of the data. This approach has helped it tolerate infrastructure failures almost completely.
- Pinterest's engineering team has created a self-serve platform, built from homegrown, open-source, and commercial tools, that hooks into Hadoop to orchestrate and process all of the company's enormous amounts of data, including 30 billion pins. It makes extensive use of MapReduce to process the stored data, which is synced to AWS, where all of its data currently resides.
- Over 1 billion hours of YouTube video are watched globally per day. At an upload rate of 20 hours of video per minute, YouTube would need to grow its storage capacity by up to 21.0 terabytes per day, or 7.7 petabytes per year, roughly 4x the total amount of data generated per year by the NCSA's supercomputers in Urbana, IL.
- Google doesn't publish numbers on how much data it stores. It now processes over 40,000 search queries every second on average, which translates to roughly 3.5 billion searches per day and 1.2 trillion searches per year worldwide. All of this data is stored and handled in Google's data centers; Google doesn't operate the biggest individual data centers, yet it handles a huge amount of data, with a typical data center holding petabytes to exabytes. Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. (The per-day and per-year figures here and for YouTube above are sanity-checked in the arithmetic sketch after this list.)
- Netflix: a core component of its stream-processing system is the Keystone data pipeline, which moves upwards of 12 PB of data per day into its S3 data warehouse, where 100 PB of highly compressed data reside. The Keystone Router, a key piece of software, distributes the 3 trillion events per day across 2,000 routing jobs and 200,000 parallel operators to the other data sinks in Netflix's S3 repository (a minimal routing sketch follows this list).
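LinkedIn's actual 'People You May Know' pipeline is 82 proprietary MapReduce jobs over billions of edges, so the following is only a toy sketch of the core idea it rests on: two members who share many connections are likely to know each other. The graph and names are invented, and the raw shared-connection count stands in for the trained statistical model a real system would use.

```python
from collections import Counter
from itertools import combinations

# Toy connection graph (invented data); the real input is billions of edges.
connections = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave"},
    "dave":  {"bob", "carol"},
}

# "Map" step: for each member, emit every pair of their connections --
# two people who share a connection are candidates for knowing each other.
def map_phase():
    for member, friends in connections.items():
        for a, b in combinations(sorted(friends), 2):
            yield (a, b), 1

# "Reduce" step: sum the emitted counts per candidate pair; the number of
# shared connections serves as a crude likelihood score.
scores = Counter()
for pair, count in map_phase():
    scores[pair] += count

for (a, b), shared in scores.most_common():
    if b not in connections[a]:  # only suggest people not already connected
        print(f"{a} may know {b} ({shared} shared connections)")
```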
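The per-day and per-year figures quoted above for Google and YouTube follow from simple arithmetic; here is a quick sanity check using Python as a calculator.

```python
# Back-of-the-envelope check of the figures quoted above.

# Google: 40,000 queries/second sustained over a day and a year.
searches_per_day = 40_000 * 60 * 60 * 24    # 3,456,000,000 (~3.5 billion/day)
searches_per_year = searches_per_day * 365  # ~1.26 trillion/year

# YouTube: 21 TB of new storage per day, accumulated over a year.
tb_per_year = 21.0 * 365                    # 7,665 TB, i.e. ~7.7 PB/year

print(f"Google:  {searches_per_day:,} searches/day, {searches_per_year:,}/year")
print(f"YouTube: {tb_per_year:,.0f} TB/year (~{tb_per_year / 1000:.1f} PB)")
```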
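Netflix has described Keystone Router as software that spreads trillions of events across parallel operators feeding different sinks. As a rough illustration of the idea, not Netflix's implementation, here is a minimal hash-based fan-out in Python; the sink names and partition count are placeholders.

```python
import hashlib

# Placeholder sink names and partition count; the real router feeds sinks
# such as S3 and downstream streams through ~200,000 parallel operators.
SINKS = ["s3", "elasticsearch", "kafka"]
PARTITIONS_PER_SINK = 8

def route(event_key: str) -> dict:
    """Fan one event out to every sink, choosing a partition per sink by
    hashing the key, so equal keys always land on the same operator."""
    assignment = {}
    for sink in SINKS:
        digest = hashlib.sha1(f"{sink}:{event_key}".encode()).hexdigest()
        assignment[sink] = int(digest, 16) % PARTITIONS_PER_SINK
    return assignment

for key in ["user-123:play", "user-456:pause", "user-123:stop"]:
    print(key, "->", route(key))
```

Hashing on the event key keeps routing stateless and deterministic, which is what lets a router like this scale horizontally across thousands of routing jobs.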
How big is Big Data really?
From time to time, various organizations brag about how much data they have, how big their clusters are, how many requests per second they serve, etc. Every time I come across these statistics, I make note of them. It's quite amazing to see how these numbers change over time.