Data Problem of Big Tech Companies
Naitik Shah
Data Scientist | Expert in Predictive Modeling, Machine Learning & Data Engineering | Python, SQL, Azure, Databricks | Achieved 15% Cost Reduction & Optimized Operations
Every hour, 30,000 hours of video are uploaded to YouTube. Crazy, isn't it? And that figure is from March 2019, so I am sure it has only gone up since then. The uploads range anywhere from 144p to 8K, so let's see how many terabytes of data 30,000 hours of footage works out to, using 1080p video as the baseline:
An average hour of 1080p video uploaded to YouTube is a whopping 8.64 GB! Multiply that by 30,000 and you get roughly 259 terabytes per hour, or about 6,200 terabytes per day. So how do they store this data efficiently and quickly? Remember, they need to keep it practically forever, and the speed needs to be high, because you can't wait for YouTube to load a video just because the load on the data center is heavy. Any layman will say: make faster network connections in the data centers, make lots of folders. But that's not how it is done. The biggest challenge is that all the data collected from you is unstructured, so how do you structure it and store it more efficiently?
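Just to make that arithmetic concrete, here is a tiny Python sketch of the same back-of-the-envelope estimate (the 8.64 GB/hour figure is the rough average quoted above, not an exact number):

```python
# Rough back-of-the-envelope estimate, assuming ~8.64 GB per hour of 1080p footage
# and ~30,000 hours of video uploaded every hour (the figures quoted above).
GB_PER_HOUR_1080P = 8.64
HOURS_UPLOADED_PER_HOUR = 30_000

gb_per_hour = GB_PER_HOUR_1080P * HOURS_UPLOADED_PER_HOUR  # GB arriving each hour
tb_per_hour = gb_per_hour / 1_000                          # ~259 TB per hour
tb_per_day = tb_per_hour * 24                              # ~6,200 TB per day

print(f"~{tb_per_hour:,.0f} TB per hour, ~{tb_per_day:,.0f} TB per day")
```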
As you can see, Big Data is a huge problem for these companies. In the earlier days there weren't many laptops, smartphones, and so on, and internet connections were slow, so big tech companies collected far less data and ordinary hard drives solved the problem. But as more and more data came in, efficiency started dropping. Imagine typing something into the Google search box, hitting enter, and waiting hours for the results to show up. The experience would be frustrating, right?
Just look at that speed, and I am not the only one who hit search; there were probably millions of us doing it at the same time.
Let's solve the first problem: storage. Any layman will say, buy more storage, more hard drives, but the problem is that hard drives are really slow. So we move on to SSDs, but right now the fast SSDs commonly available top out at around 2 TB:
But what if your data is 3 TB? You can even get a 100 TB SSD now, but the problems are that it is physically larger, it has speeds closer to a hard drive's, and it is far more expensive. If you want to learn more about that insane drive, head over to this video on YouTube:
https://www.youtube.com/watch?v=ZFLiKClKKhs&t=39s
And the next problem is the speed of the storage: the storage solution needs to be super fast. The cost problem is huge too; we are not talking about 1 TB of storage, we are talking about thousands of petabytes.
But, as you might know, there is no problem that technology cannot solve, even when the problem is about technology itself.
So, the solution to that problem is:
Distributed storage solutions.
But what is a distributed storage system? How do we implement it?
Well, I've got you covered. For the last two days I have been researching it, and I will take you through it.
Image source: Medium
Imagine you have 10 laptops or 10 storage servers, usually called slave nodes. Each of them is attached over the network to one main machine, usually known as the master node. Now imagine every machine has 10 GB of capacity; if 20 GB of data arrives, we can't store it on any single machine, and this is where distributed storage comes into action.
- The master collects the incoming data and distributes it among the slaves. That means we don't have to worry about volume issues anymore: no matter how huge the data is, we can conveniently spread it across the slaves, and we don't need to buy bigger storage devices either.
- Since we don't buy bigger storage devices, our expenses decrease too. We can buy several small storage servers and connect them to the master. If the data grows in the future, we simply buy more storage servers and keep connecting them to the master.
- Lastly, speed. If one storage system takes 1 minute to store 10 GB of data, then with 10 storage servers working in parallel we only need a few seconds to store the same 10 GB (1 GB on each system). And it's not only about saving the data; it's also about how quickly you can read it back. With 10 storage servers in parallel, reading the same 10 GB takes just a couple of seconds, whereas a single storage device would take over a minute. These are basic examples; in industry these architectures are much larger, with loads of nodes connected to each other, as the sketch below illustrates.
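To make the master/slave idea concrete, here is a minimal Python sketch of how a master might split a file into chunks and spread them across slave nodes. The class names, chunk sizes, and in-memory "nodes" are all hypothetical and only for illustration; real systems such as HDFS add replication, metadata services, fault tolerance, and genuinely parallel I/O.

```python
# Minimal illustration of distributed storage: a "master" splits incoming data
# into chunks and hands one chunk to each "slave" node. Names and sizes are
# hypothetical; real systems (e.g. HDFS) add replication and fault tolerance.

def split_into_chunks(data: bytes, num_nodes: int) -> list[bytes]:
    """Split data into at most num_nodes roughly equal chunks."""
    chunk_size = -(-len(data) // num_nodes)  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

class MasterNode:
    def __init__(self, num_slaves: int):
        # each slave is just a dict here; in reality it is a separate server
        self.slaves = [dict() for _ in range(num_slaves)]

    def write(self, filename: str, data: bytes) -> None:
        # distribute the chunks: slave i stores chunk i of the file
        for i, chunk in enumerate(split_into_chunks(data, len(self.slaves))):
            self.slaves[i][filename] = chunk

    def read(self, filename: str) -> bytes:
        # in a real cluster the slaves are read in parallel, so the total time
        # is roughly the time to read one chunk, not the whole file
        return b"".join(s[filename] for s in self.slaves if filename in s)

master = MasterNode(num_slaves=10)
master.write("video.mp4", b"x" * 10_000)   # pretend this is 10 GB of footage
print(len(master.read("video.mp4")))       # 10000 -- reassembled from 10 nodes
```

The design choice to note is that both writes and reads touch every node at once, which is exactly why the 10 GB example above drops from about a minute to a few seconds.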
Technologies used to solve the big data problem:
- Hadoop.
- Cassandra.
- MongoDB.
- Apache Hive.
As I stated above, there is no problem that technology cannot solve, and here is the solution. It is quite literally three problems, one solution: the Distributed File System.
Google was the first company to showcase a distributed file system at scale, back in 2003, and since then distributed file systems have only gotten better and better.
All right, fellas, that's all from my side!