Data Problem of Big Tech Companies
Naitik Shah
Data Scientist | Expert in Predictive Modeling, Machine Learning & Data Engineering | Python, SQL, Azure, Databricks | Achieved 15% Cost Reduction & Optimized Operations
Every hour, 30,000 hours of video are uploaded to YouTube. Crazy, isn't it? And that figure is from March 2019, so I am sure it has only gone up since then. The uploads range anywhere from 144p to 8K, so let's see how many terabytes of data 30,000 hours of footage works out to, using 1080p video as the baseline:
An average hour of 1080p video uploaded to YouTube is a whopping 8.64 GB! Multiply that by 30,000 and you get roughly 259 terabytes per hour, or about 6,200 terabytes per day. So how do they store this data efficiently and quickly? Remember, they need to keep it practically forever, and the speed needs to be high, because you can't wait for YouTube to load a video just because the load on the data center is heavy. Any layman will say: make faster network connections in the data centers, make lots of folders. But that's not how it is done. The biggest challenge is that all the data collected from you is unstructured, so how do you structure it and store it more efficiently?
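Just to make that arithmetic concrete, here is a tiny Python sketch of the same back-of-the-envelope estimate (the 8.64 GB/hour figure is the rough average quoted above, not an exact number):

```python
# Rough back-of-the-envelope estimate, assuming ~8.64 GB per hour of 1080p footage
# and ~30,000 hours of video uploaded every hour (the figures quoted above).
GB_PER_HOUR_1080P = 8.64
HOURS_UPLOADED_PER_HOUR = 30_000

gb_per_hour = GB_PER_HOUR_1080P * HOURS_UPLOADED_PER_HOUR  # GB arriving each hour
tb_per_hour = gb_per_hour / 1_000                          # ~259 TB per hour
tb_per_day = tb_per_hour * 24                              # ~6,200 TB per day

print(f"~{tb_per_hour:,.0f} TB per hour, ~{tb_per_day:,.0f} TB per day")
```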
As you can see, Big Data is a huge problem for these companies. In the earlier days there weren't many laptops, smartphones, and so on, and internet connections were slow, so big tech companies collected far less data and ordinary hard drives solved the problem. But as more and more data came in, efficiency started dropping. Imagine typing something into the Google search box, hitting enter, and waiting hours for the results to show up. The experience would be frustrating, right?
Just look at that speed, and I am not the only one who hit search; there were probably millions of us doing it at the same time.
Let's solve the first problem: storage. Any layman will say, buy more storage, more hard drives, but the problem is that hard drives are really slow. So we move on to SSDs, but right now the fast SSDs commonly available top out at around 2 TB:
But what if your data is 3 TB? You can even get a 100 TB SSD now, but the problems are that it is physically larger, it has speeds closer to a hard drive's, and it is far more expensive. If you want to learn more about that insane drive, head over to this video on YouTube:
https://www.youtube.com/watch?v=ZFLiKClKKhs&t=39s
And the next problem is the speed of the storage: the storage solution needs to be super fast. The cost problem is huge too; we are not talking about 1 TB of storage, we are talking about thousands of petabytes.
But, as you might know, there is no problem that technology cannot solve, even when the problem is about technology itself.
So, the solution to that problem is:
Distributed storage solutions.
But what is a distributed storage system? How do we implement it?
Well, I've got you covered. For the last two days I have been researching it, and I will take you through it.
Image source: Medium
Imagine you have 10 laptops or 10 storage servers, usually called slave nodes. Each of them is attached over the network to one main machine, usually known as the master node. Now imagine every machine has 10 GB of capacity; if 20 GB of data arrives, we can't store it on any single machine, and this is where distributed storage comes into action.
- The master collects the incoming data and distributes it among the slaves. That means we don't have to worry about volume issues anymore: no matter how huge the data is, we can conveniently spread it across the slaves, and we don't need to buy bigger storage devices either.
- Since we don't buy bigger storage devices, our expenses decrease too. We can buy several small storage servers and connect them to the master. If the data grows in the future, we simply buy more storage servers and keep connecting them to the master.
- Lastly, speed. If one storage system takes 1 minute to store 10 GB of data, then with 10 storage servers working in parallel we only need a few seconds to store the same 10 GB (1 GB on each system). And it's not only about saving the data; it's also about how quickly you can read it back. With 10 storage servers in parallel, reading the same 10 GB takes just a couple of seconds, whereas a single storage device would take over a minute. These are basic examples; in industry these architectures are much larger, with loads of nodes connected to each other, as the sketch below illustrates.
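To make the master/slave idea concrete, here is a minimal Python sketch of how a master might split a file into chunks and spread them across slave nodes. The class names, chunk sizes, and in-memory "nodes" are all hypothetical and only for illustration; real systems such as HDFS add replication, metadata services, fault tolerance, and genuinely parallel I/O.

```python
# Minimal illustration of distributed storage: a "master" splits incoming data
# into chunks and hands one chunk to each "slave" node. Names and sizes are
# hypothetical; real systems (e.g. HDFS) add replication and fault tolerance.

def split_into_chunks(data: bytes, num_nodes: int) -> list[bytes]:
    """Split data into at most num_nodes roughly equal chunks."""
    chunk_size = -(-len(data) // num_nodes)  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

class MasterNode:
    def __init__(self, num_slaves: int):
        # each slave is just a dict here; in reality it is a separate server
        self.slaves = [dict() for _ in range(num_slaves)]

    def write(self, filename: str, data: bytes) -> None:
        # distribute the chunks: slave i stores chunk i of the file
        for i, chunk in enumerate(split_into_chunks(data, len(self.slaves))):
            self.slaves[i][filename] = chunk

    def read(self, filename: str) -> bytes:
        # in a real cluster the slaves are read in parallel, so the total time
        # is roughly the time to read one chunk, not the whole file
        return b"".join(s[filename] for s in self.slaves if filename in s)

master = MasterNode(num_slaves=10)
master.write("video.mp4", b"x" * 10_000)   # pretend this is 10 GB of footage
print(len(master.read("video.mp4")))       # 10000 -- reassembled from 10 nodes
```

The design choice to note is that both writes and reads touch every node at once, which is exactly why the 10 GB example above drops from about a minute to a few seconds.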
Technologies used to solve the big data problem:
- Hadoop.
- Cassandra.
- MongoDB.
- Apache Hive.
As I stated above, there is no problem that technology cannot solve, and here is the solution. It is quite literally three problems, one solution: the Distributed File System.
Google was the first company to showcase a distributed file system at scale, back in 2003, and since then distributed file systems have only gotten better and better.
All right, fellas, that's all from my side!