Play by Play: Hadoop.AI.ML.


Hello Readers,

Let’s talk about how everything in tech is interrelated: why we care about the Hadoop Ecosystem, the Apache Foundation’s famous open-source software, and how AI and ML have evolved around it, so we can understand the buzz and the use cases. This is going to be a historical ride, and it can be a dazzling circus too, since analytics has both its good and evil sides.

Let’s talk Hadoop, which evolved into a distributed storage and processing ecosystem through projects from companies like Yahoo (Media Analytics Warehouse and work beyond Hadoop), Google (BigQuery, Google File System), and Facebook (Scribe, Hive, Hadoop contributions).

Let’s now think about how this Hadoop analytics ecosystem has grown, and how FAANG companies adopted open-source technologies and built their own custom versions. Here’s a quick taste of interacting with HDFS from Python:

import pydoop.hdfs as hdfs

# Open a file on HDFS (illustrative path; requires a running
# HDFS cluster and the pydoop package)
with hdfs.open('/media/us/colarado/dataset1') as f:
    # Read the entire file into memory
    data = f.read()

print(data)

In Yahoo’s case, the data warehousing solution is quite interesting! Here’s what we know:

  • Yahoo! uses a custom-built data warehouse called MAW (Media Analytics Warehouse). This solution isn’t based on a single, off-the-shelf product but rather leverages a combination of technologies.
  • A core component is likely Hadoop: Similar to Facebook, Yahoo! might utilize Apache Hadoop for distributed storage and processing of their massive datasets. This allows them to handle the huge volume of data efficiently across numerous servers.
  • Focus on Scalability: Yahoo! is known for its enormous data warehouse, once claimed to be the world’s largest. Hadoop’s horizontal scaling capabilities are perfect for such needs.
  • Beyond Hadoop: While Hadoop plays a role, there’s likely more to the story. Yahoo! has a reputation for innovation and might use other open-source technologies or custom-developed solutions alongside Hadoop for specific functionalities within MAW.

Unfortunately, specific details about Yahoo’s data warehouse architecture are not publicly available. However, the use of a custom solution built around Hadoop seems to be the general consensus.

Hadoop plays a crucial role in supporting AI and Machine Learning (ML) models in several ways:

1. Handling Massive Datasets: Traditional data storage solutions struggle with the immense volume of data required for training complex AI and ML models. Hadoop’s distributed file system, HDFS, allows you to store and manage these massive datasets efficiently across clusters of commodity servers. This provides the raw material needed to train and refine your models.
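To make the block-storage idea concrete, here is a minimal, hypothetical sketch in plain Python (no cluster required) of how a file gets split into fixed-size blocks the way HDFS does. HDFS defaults to 128 MB blocks replicated across DataNodes; a tiny block size is used here purely for illustration:

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split raw bytes into fixed-size blocks, HDFS-style.

    HDFS additionally replicates each block (three copies by
    default) across different DataNodes; this sketch only
    models the splitting step.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# A toy "file" and a tiny block size (HDFS defaults to 128 MB).
payload = b"x" * 300
blocks = split_into_blocks(payload, block_size=128)
print(len(blocks))       # 3 blocks: 128 + 128 + 44 bytes
print(len(blocks[-1]))   # 44
```

The NameNode then only needs to track block-to-node mappings, which is what lets HDFS scale to files far larger than any single disk.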

2. Parallel Processing Power: Training AI and ML models can be computationally intensive. Hadoop’s MapReduce framework enables you to parallelize the processing tasks across multiple machines in a cluster. This significantly reduces training times compared to running them on a single machine.
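The MapReduce pattern itself is easy to sketch in plain Python. This hypothetical word-count example mimics the map, shuffle, and reduce phases on a single machine; on a real cluster, Hadoop runs the map and reduce calls in parallel across many nodes:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data in parallel"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"])   # 2
print(counts["data"])     # 2
```

Because each map call touches only its own input split and each reduce call only its own key group, the phases parallelize naturally, which is exactly the property Hadoop exploits.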

3. Data Preprocessing and Feature Engineering: Before training, data often needs cleaning, transformation, and feature engineering. Tools like Pig and Hive within the Hadoop ecosystem can handle these tasks efficiently on large datasets. This ensures the quality and relevance of data fed to your models.
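In Pig or Hive these steps are written as scripts or SQL, but the kind of transformation involved can be sketched in plain Python. This hypothetical example cleans raw CSV-style records (dropping malformed rows) and derives one simple feature, roughly what a Hive query with a WHERE clause and a computed column would do:

```python
raw_rows = [
    "alice,34,120.5",
    "bob,,99.0",        # missing age -> dropped
    "carol,29,abc",     # non-numeric spend -> dropped
    "dave,41,210.0",
]

def clean_and_engineer(rows):
    cleaned = []
    for row in rows:
        name, age, spend = row.split(",")
        # Cleaning: skip rows with missing or non-numeric fields.
        if not age.isdigit():
            continue
        try:
            spend = float(spend)
        except ValueError:
            continue
        # Feature engineering: spend per year of age.
        cleaned.append({"name": name, "age": int(age),
                        "spend_per_year": round(spend / int(age), 2)})
    return cleaned

features = clean_and_engineer(raw_rows)
print([f["name"] for f in features])   # ['alice', 'dave']
```

On Hadoop, Pig or Hive would compile the equivalent logic into MapReduce (or Tez/Spark) jobs so the same cleaning runs across the whole cluster.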

4. Scalability and Flexibility: As your data volume and processing needs grow, Hadoop can easily scale up by adding more nodes to the cluster. This flexibility allows your AI and ML workloads to adapt to changing requirements.

5. Integration with AI and ML Frameworks: Hadoop integrates well with popular AI and ML frameworks like TensorFlow, PyTorch, and scikit-learn. This allows you to leverage HDFS for data storage and utilize these frameworks for model development and training within the same ecosystem.
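As a hypothetical illustration of that hand-off, the sketch below parses text exactly as it would come back from `hdfs.open(...).read()` (a local string stands in for the cluster here) into the `X`, `y` arrays that frameworks like scikit-learn expect for `model.fit(X, y)`:

```python
# A stand-in for the text returned by hdfs.open('/path').read();
# on a real cluster this string would come from HDFS.
hdfs_text = """1.0,2.0,0
3.0,4.0,1
5.0,6.0,1"""

def to_training_arrays(text):
    """Parse CSV text into a feature matrix X and label vector y."""
    X, y = [], []
    for line in text.strip().splitlines():
        *features, label = line.split(",")
        X.append([float(v) for v in features])
        y.append(int(label))
    return X, y

X, y = to_training_arrays(hdfs_text)
print(X[0], y)   # [1.0, 2.0] [0, 1, 1]

# These lists can be passed straight to e.g.
# sklearn.linear_model.LogisticRegression().fit(X, y)
```

The point is simply that HDFS handles storage at scale while the ML framework of your choice handles modeling, with a thin parsing layer in between.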

That being said, share your love and support. Thanks for reading, and keep being awesome.

Disclaimer: Made with love, with help from Gemini AI.
