MinIO DataPod, Architecting a Modern Data Lake, Apache Iceberg: The August 2024 MinIO Newsletter

MinIO

High Performance, Kubernetes Storage Built for AI

Published: August 12, 2024

Forget about the dog days of summer: the team was in overdrive last month, with a broad range of content covering modern data lakes, open table formats and AI reference architectures. Shall we get into it?

It wouldn’t be right to start with anything other than the release of MinIO DataPod, a first-of-its-kind reference architecture for building data infrastructure to support exascale AI and large-scale data lake workloads. Object storage has become foundational for AI infrastructure, and with DataPod, we are building the blueprint for storing massive volumes of structured and unstructured data while providing performance at scale.

MinIO’s CMO Jonathan Symonds says it best: “The simplicity of the play and broad availability of hardware from NVIDIA, Supermicro, Hewlett Packard Enterprise and Dell Technologies make this an ultra-compelling reference architecture. The key is to think bigger. 100 PiB building blocks make total sense in an exascale world.”
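To put the 100 PiB figure in perspective, here is a back-of-the-envelope sketch of how many such building blocks an exascale target would take. The function name `pods_needed` and its parameters are ours for illustration, not part of DataPod, and the calculation treats 100 PiB as the capacity per block without accounting for erasure-coding overhead or headroom.

```python
PIB = 2**50  # bytes in one pebibyte
EIB = 2**60  # bytes in one exbibyte


def pods_needed(target_bytes: int, pod_capacity_pib: int = 100) -> int:
    """Return how many fixed-size building blocks cover target_bytes.

    Hypothetical helper: assumes every block contributes exactly
    pod_capacity_pib pebibytes toward the target.
    """
    if target_bytes < 0 or pod_capacity_pib <= 0:
        raise ValueError("target must be >= 0 and pod capacity must be > 0")
    pod_bytes = pod_capacity_pib * PIB
    # Integer ceiling division avoids float rounding at this magnitude.
    return -(-target_bytes // pod_bytes)


if __name__ == "__main__":
    print(pods_needed(EIB))     # one exbibyte (2**60 bytes) -> 11
    print(pods_needed(10**18))  # one decimal exabyte -> 9
```

In other words, roughly ten blocks of this size reach the exabyte range, which is what makes 100 PiB a sensible planning unit at that scale.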

As always, feel free to opt out. Conversely, if this was forwarded to you, you can always opt in to stay current.

The Corner Office

Jonathan breaks down Insight Partners’ State of Enterprise Tech 2024 report, focusing primarily on its Data & AI and Infrastructure & Dev Ecosystem sections (aka the stuff you all care about). It’s a great report that echoes much of what we have been talking about for the last year or so, and Jonathan makes it super digestible.

For a few years there, the term “private cloud” had a negative connotation. But as we know, technology is more of a wheel than an arrow, and right on cue, the private cloud is getting a ton of attention and it’s all positive. MinIO’s CTO Ugur Tigli’s piece, The Architect’s Guide to the New Private Cloud, discusses the right architecture to repatriate to, the engineering-first principles of the private cloud and how to design data infrastructure for AI requirements.

The AI Corner

The world of data storage is all AI all the time these days, and we are setting the pace. The following is must-read stuff.

ARM Scalable Vector Extension (SVE) represents a major technological improvement that enhances real-world performance for object storage and AI data infrastructure. Frank Wessels gets into the weeds of what ARM SVE is, why it’s important for the MinIO server and how we enabled it.

An embedding subsystem is one of four subsystems needed to implement RAG. In this post, Keith Pijanowski builds a distributed embedding subsystem that can run on engineering workstations and in a fully distributed cloud-native production environment. MinIO is the best storage solution for generative AI: embedding models need a high-speed, scalable storage solution.

In this article about data-centric AI with Snorkel, Keith deconstructs the myth that good AI is solely a matter of model design. Good AI requires properly constructed training and testing data. He discusses data-centric AI and how to use Snorkel Flow with MinIO to create a training pipeline for your AI workload.

ICYMI: The most-read article of last month’s newsletter was Keith’s Architect’s Guide to the GenAI Tech Stack. This was a fan favorite, and rightfully so: Keith makes it very simple and concrete.

Databases, Datalakes and Other Data Musings

The semantic layer is an important part of modern datalake architectures. It not only simplifies data management but also enhances the security, quality, and usability of the data, all key features of a successful AI implementation. Brenna Buuck’s article provides a high-level overview of the tools that are either designed for or play well with modern data lakes.

AutoMQ is an open-source fork of Kafka that replaces Kafka's storage layer with a shared storage architecture based on object storage. This co-authored piece by Brenna and Kaiming Wan from AutoMQ guides you through how to deploy an AutoMQ cluster and store its data on MinIO.

Earlier this summer, Databricks announced its acquisition of Tabular, a data platform by the original creators of Apache Iceberg, ultimately illustrating the increasing importance of open frameworks in the data landscape. Brenna takes you through the details.

In a modern data lake, catalogs serve as the backbone for organizing and querying data efficiently. Brenna believes the era of fragmented data catalogs is ending. As the industry rallies around open standards, like the Apache Iceberg REST API, the focus can shift to innovation and user-centric development.

Open table formats like Apache Iceberg, Apache Hudi, and Delta Lake have become standard for query processors, with Iceberg gaining ground fast. This shift highlights the importance of scalable, cloud-native storage solutions, as opposed to query engines, which in turn have become commoditized. This change offers users more flexibility and cost-efficiency in their modern data architectures. Truly a win for the modern datalake.

MinIO Internals:

AJ delivered two stellar posts highlighting features from the MinIO Enterprise Object Store, specifically Firewall and Observability. The first post focuses on how the Load Balancer in the EOS Firewall solves the network bottleneck problem by taking a sidecar approach. The second post dives deeper into each of the different features of Observability and how we can use them to have production-grade monitoring ready to roll out of the box.

When enterprises in crucial sectors like banking, healthcare, and oil and gas repatriate their data from the cloud, using a bastion host is essential. AJ introduces HashiCorp Boundary as a way for actions on bastion hosts to be recorded and audited.

As data scales and cloud costs rise, industries are looking to repatriate data from the cloud while retaining cloud-native management benefits. OpenShift OperatorHub addresses this need by offering cloud-native management for Kubernetes on your own infrastructure. AJ demonstrates how to install the MinIO Operator using OperatorHub.

Bits and Bytes:

Data Engineer Georgios (George) Zefkilis created a very thorough tutorial on building a local data lake with MinIO, Iceberg, Spark, StarRocks, Mage and Docker.

OpenNebula on Medium introduces MinIO as the open source object storage available on the OpenNebula Marketplace, allowing users and customers to efficiently manage their data within OpenNebula cloud deployments.

Next up we have an article written in Turkish by Data Engineer ‘myoztiryaki’ covering ETL Project Steps with Airflow, Apache Spark, MinIO and Delta Lake.

Zuda P. created a two-part series exploring the integration of MinIO, MySQL, Jasper and GraphQL with Spring Boot. In the first part, Zuda covers the initial steps in building a Spring Boot project with MySQL and GraphQL integration.

The second part dives deeper into the GraphQL integration, the MinIO implementation and creating a reporting API using JasperReports.

Bounphone Kosada takes you through how to install MinIO with Docker on a new Ubuntu cloud server.

‘Romanchechyotkin’ shows you how to integrate PostgreSQL and MinIO.

Soumil S. created an awesome Apache Hudi Streamer master tutorial featuring MinIO. Check it out here.

We saw two different LinkedIn posts about MinIO written in Portuguese (with English translation available). The first one is by Software Engineer Thales Morato and introduces MinIO as a local alternative to S3. The second one is by Data Engineer Daniel Pereira and covers the process of creating an ETL pipeline using Python, DuckDB and MinIO.
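For readers wondering what "a local alternative to S3" looks like in practice, here is a minimal sketch that points a standard S3 client at a local MinIO server. It is not taken from either post. It assumes the `boto3` package is installed and that MinIO is listening on `http://localhost:9000`. The helper names `s3_client_kwargs`, `make_client` and `round_trip` are ours, and the credentials in the usage note are placeholders.

```python
from urllib.parse import urlparse


def s3_client_kwargs(endpoint_url, access_key, secret_key, region="us-east-1"):
    """Build keyword arguments for boto3.client("s3", ...) on a custom endpoint.

    Hypothetical helper; the region default is an assumption for illustration.
    """
    parsed = urlparse(endpoint_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"endpoint must be an http(s) URL, got {endpoint_url!r}")
    return {
        "endpoint_url": endpoint_url,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region_name": region,
    }


def make_client(**kwargs):
    """Create the S3 client; boto3 is imported lazily so it stays optional."""
    import boto3

    return boto3.client("s3", **kwargs)


def round_trip(client, bucket, key, data):
    """Create a bucket, write one object, and read it back as bytes."""
    client.create_bucket(Bucket=bucket)
    client.put_object(Bucket=bucket, Key=key, Body=data)
    return client.get_object(Bucket=bucket, Key=key)["Body"].read()
```

With a server running, `round_trip(make_client(**s3_client_kwargs("http://localhost:9000", "EXAMPLE_KEY", "EXAMPLE_SECRET")), "demo", "hello.txt", b"hello")` should return the bytes that were written. Only the endpoint differs from code written against S3 itself, which is the point of the comparison.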

Thank you for tuning in, everyone! See you next month.

MinIO Team