Data Problem of Big Tech Companies

Every hour, roughly 30,000 hours of video are uploaded to YouTube. Crazy, isn't it? And that figure is from March 2019, so I am sure it has only gone up since. The uploads range from 144p all the way to 8K, so let's see how many terabytes 30,000 hours of footage amounts to. I'll calculate using 1080p video:

An average hour of 1080p video uploaded to YouTube is a whopping 8.64 GB! Multiply that by 30,000 hours and you get roughly 259 terabytes per hour, or about 6,200 terabytes per day. So how do they store this data efficiently and quickly? Remember, they need to keep it essentially forever, and access has to stay fast: you can't have YouTube take ages to load a video just because the load on the data center is high. Any layman will say, build faster network connections in the data centers, make many folders, but that's not how it is done. The biggest challenge is that most of the data collected from you is unstructured, so the real problem is how to structure it and store it efficiently.
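To put numbers on that, here is the back-of-the-envelope arithmetic in Python, using the 8.64 GB/hour figure above:

```python
# Rough estimate of YouTube's upload volume, assuming (as above)
# ~8.64 GB per hour of 1080p video and 30,000 hours uploaded per hour.
GB_PER_HOUR_1080P = 8.64
HOURS_UPLOADED_PER_HOUR = 30_000

tb_per_hour = GB_PER_HOUR_1080P * HOURS_UPLOADED_PER_HOUR / 1000  # GB -> TB
tb_per_day = tb_per_hour * 24

print(f"{tb_per_hour:,.0f} TB per hour")  # ~259 TB
print(f"{tb_per_day:,.0f} TB per day")    # ~6,221 TB
```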

As you can see, Big Data is a huge problem for these companies. In the earlier days there weren't many laptops, smartphones, and so on, and internet connections were slow, so the data big tech companies collected was small and ordinary hard drives solved the problem. But as more and more data came in, efficiency started decreasing. Imagine typing something into Google's search box, hitting Enter, and waiting hours for the results to show. Frustrating, right?

[Image: Google search results showing how fast the query returned]

Just look at that speed. And I'm not the only one who hit search; there were probably millions of searches at that moment.

Let's tackle the first problem: storage. Any layman will say, just buy more storage, more hard drives. But hard drives are really slow, so we move on to SSDs. The catch is that the fast SSDs available right now top out at around 2 TB of capacity.


But what if your data is 3 TB? You can even get a 100 TB SSD now, but those drives are physically larger, have transfer speeds closer to hard drives, and cost far more than hard drives. If you want to learn more about that insane drive, head over to this video on YouTube:


https://www.youtube.com/watch?v=ZFLiKClKKhs&t=39s

The next problem is the speed of the storage: the storage solution needs to be super fast. And the cost problem is huge. We are not talking about 1 TB of storage; we are talking about thousands of petabytes.

But, as you might know, there is no problem that technology cannot solve, even if the problem is about technology itself.

So, the solution to that problem is:

Distributed storage solutions.

But what is a distributed storage system? How do we implement it?

Well, I've got you covered. For the last two days I have been researching it, and I will take you through it.

[Image: master/slave distributed storage architecture. Source of the image: Medium]

Imagine you have 10 laptops or 10 storage servers, usually called slave nodes. Each of them is connected over the network to one main machine, usually known as the Master Node. Now imagine every node has 10 GB of capacity. If 20 GB of data arrives, we can't store it on any single node, and this is where distributed storage comes into action.

  • The master receives the data and distributes it among the slaves. That means we no longer have to worry about volume: no matter how huge the data is, we can conveniently spread it across the slaves, and we don't need to buy bigger individual drives either.
  • Since we don't buy bigger drives, our expenses decrease too. We can buy several small storage servers and connect them to the master. If the data grows in the future, we simply buy more storage servers and connect them to the master as well.
  • Lastly, speed. Suppose one storage system takes 1 minute to store 10 GB of data. With 10 storage servers working in parallel, storing 1 GB on each, the same 10 GB takes only a few seconds. And it's not only about saving the data; reads get faster too. With 10 servers in parallel, reading the same 10 GB takes just a couple of seconds, whereas a single server would take over a minute. These are simplified cases; in industry these architectures are far larger, with loads of nodes connected to each other.
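To make the idea concrete, here is a toy model in Python of the scenario above. All names and numbers are illustrative (10 nodes, 10 GB of data, 10 GB per minute per node), not taken from any real system:

```python
# Toy model of striping data across slave nodes and estimating
# the parallel transfer time, as described in the bullets above.

def stripe(data_gb, num_nodes):
    """Split the data evenly across the slave nodes."""
    per_node = data_gb / num_nodes
    return [per_node] * num_nodes

def transfer_time(data_gb, num_nodes, gb_per_second=10 / 60):
    """Nodes work in parallel, so total time is set by the busiest node.
    Default throughput: 10 GB per 60 seconds per node."""
    chunks = stripe(data_gb, num_nodes)
    return max(chunks) / gb_per_second

print(transfer_time(10, 1))   # one node: 60 seconds
print(transfer_time(10, 10))  # ten nodes: 6 seconds
```

The same math applies to reads, which is why both storing and interpreting the data speed up as nodes are added.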

Technologies used to solve the big data problem:

  1. Hadoop.
  2. Cassandra.
  3. MongoDB.
  4. Apache Hive.

As I stated above, there is no problem that technology cannot solve, and here is the solution. It is quite literally three problems, one solution: the Distributed File System.
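As a taste of how one of these systems lays data out, here is a rough sketch of HDFS-style block placement in Python. The 128 MB block size and replication factor of 3 are HDFS defaults; everything else (function names, node labels) is made up for illustration:

```python
import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor

def place_blocks(file_size_mb, nodes):
    """Split a file into fixed-size blocks and assign each block,
    plus its replicas, to distinct nodes in round-robin fashion."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for b in range(num_blocks):
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        placement[f"block-{b}"] = replicas
    return placement

nodes = ["node1", "node2", "node3", "node4"]
layout = place_blocks(1000, nodes)  # a 1 GB file splits into 8 blocks
print(len(layout), "blocks,", REPLICATION, "copies each")
```

Replication means any single node can fail without losing data, which matters when your "storage servers" are thousands of cheap machines.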

Google was the first company to showcase a distributed file system at this scale, with the Google File System back in 2003, and distributed file systems have only gotten better since.

All right, fellas, that's all from my side!

