MinIO DataPod, Architecting a Modern Data Lake, Apache Iceberg: The August 2024 MinIO Newsletter
Forget about the dog days of summer: the team was in overdrive last month with a broad range of content, from modern data lakes and open table formats to AI reference architectures. Shall we get into it?
It wouldn’t be right to start with anything other than the release of MinIO DataPod, a first-of-its-kind reference architecture for building data infrastructure to support exascale AI and large-scale data lake workloads. Object storage has become foundational for AI infrastructure, and with DataPod, we are building the blueprint for storing massive volumes of structured and unstructured data while providing performance at scale.
MinIO’s CMO Jonathan Symonds says it best: “The simplicity of the play and broad availability of hardware from NVIDIA, Supermicro, Hewlett Packard Enterprise and Dell Technologies make this an ultra-compelling reference architecture. The key is to think bigger. 100 PiB building blocks make total sense in an exascale world.”
As always, feel free to opt out. Conversely, if this was forwarded to you, you can always opt in to stay current.
The Corner Office
Jonathan breaks down Insight Partners’ State of Enterprise Tech 2024 report, focusing primarily on its Data & AI and Infrastructure & Dev Ecosystem sections (aka the stuff you all care about). It’s a great report that echoes much of what we have been talking about for the last year or so, and Jonathan makes it super digestible.
For a few years there, the term “private cloud” had a negative connotation. But as we know, technology is more of a wheel than an arrow, and right on cue, the private cloud is getting a ton of attention, all of it positive. MinIO’s CTO Ugur Tigli’s piece, The Architect’s Guide to the New Private Cloud, discusses the right architecture to repatriate to, the engineering-first principles of the private cloud and how to design data infrastructure for AI requirements.
The AI Corner
The world of data storage is all AI all the time these days, and we are setting the pace. The following is must-read stuff.
ARM Scalable Vector Extension (SVE) represents a major technological improvement that enhances real-world performance for object storage and AI data infrastructure. Frank Wessels gets into the weeds of what ARM SVE is, why it’s important for the MinIO server and how we enabled it.
An embedding subsystem is one of four subsystems needed to implement RAG. In this post, Keith Pijanowski builds a distributed embedding subsystem that can run on engineering workstations and in a fully distributed cloud-native production environment. MinIO is the best storage solution for generative AI, since embedding models need a high-speed, scalable storage solution.
In this article about data-centric AI with Snorkel, Keith deconstructs the myth that good AI is solely a matter of model design. Good AI requires properly constructed training and testing data. He discusses data-centric AI and how to use Snorkel Flow with MinIO to create a training pipeline for your AI workload.
ICYMI: The most-read article of last month’s newsletter was Keith’s Architect’s Guide to the GenAI Tech Stack. This was a fan favorite, and rightfully so: Keith makes it very simple and concrete.
Databases, Datalakes and Other Data Musings
The semantic layer is an important part of modern data lake architectures. It not only simplifies data management but also enhances the security, quality, and usability of the data, all key features of a successful AI implementation. Brenna Buuck’s article provides a high-level overview of the tools that are either designed for or play well with modern data lakes.
AutoMQ is an open-source fork of Kafka that replaces Kafka's storage layer with a shared storage architecture based on object storage. This co-authored piece by Brenna and Kaiming Wan from AutoMQ guides you through how to deploy an AutoMQ cluster and store its data on MinIO.
Earlier this summer, Databricks announced its acquisition of Tabular, a data platform by the original creators of Apache Iceberg, illustrating the increasing importance of open frameworks in the data landscape. Brenna takes you through the details.
In a modern data lake, catalogs serve as the backbone for organizing and querying data efficiently. Brenna believes the era of fragmented data catalogs is ending. As the industry rallies around open standards, like the Apache Iceberg REST API, the focus can shift to innovation and user-centric development.
Open table formats like Apache Iceberg, Apache Hudi, and Delta Lake have become standard for query processors, with Iceberg gaining ground fast. This shift highlights the importance of scalable, cloud-native storage solutions, as opposed to query engines, which in turn have become commoditized. This change offers users more flexibility and cost-efficiency in their modern data architectures. Truly a win for the modern data lake.
MinIO Internals:
AJ hit with two stellar posts highlighting features from the MinIO Enterprise Object Store, specifically Firewall and Observability. The first post focuses on how the Load Balancer in the EOS Firewall solves the network bottleneck problem by taking a sidecar approach. The second post dives deeper into each of the different features of Observability and how we can use them to have production-grade monitoring ready to roll out of the box.
When enterprises in crucial sectors like banking, healthcare, and oil and gas repatriate their data from the cloud, using a bastion host is essential. AJ introduces HashiCorp Boundary as a way for actions on bastion hosts to be recorded and audited.
As data scales and cloud costs rise, industries are looking to repatriate data from the cloud while retaining cloud-native management benefits. OpenShift OperatorHub addresses this need by offering cloud-native management for Kubernetes on your own infrastructure. AJ demonstrates how to install the MinIO Operator using OperatorHub.
Bits and Bytes:
Data Engineer Georgios (George) Zefkilis created a very thorough tutorial on building a local data lake with MinIO, Iceberg, Spark, StarRocks, Mage and Docker.
OpenNebula on Medium introduces MinIO as the open source object storage available on the OpenNebula Marketplace, allowing users and customers to efficiently manage their data within OpenNebula cloud deployments.
Next up we have an article written in Turkish by Data Engineer ‘myoztiryaki’ covering ETL Project Steps with Airflow, Apache Spark, MinIO and Delta Lake.
Zuda P. created a two-part series exploring the integration of MinIO, MySQL, Jasper and GraphQL with Spring Boot. In the first part, Zuda covers the initial steps in building a Spring Boot project with MySQL and GraphQL integration.
The second part dives deeper into the integration with GraphQL, the MinIO implementation and creating a reporting API using Jasper Report.
Bounphone Kosada takes you through how to install MinIO on your new cloud server with Ubuntu OS using Docker.
‘Romanchechyotkin’ shows you how to integrate PostgreSQL and MinIO.
Soumil S. created an awesome Apache Hudi Streamer master tutorial featuring MinIO. Check it out here.
We saw two different LinkedIn posts about MinIO written in Portuguese (with English translation available). The first one is by Software Engineer Thales Morato and introduces MinIO as a local alternative to S3. The second one is by Data Engineer Daniel Pereira and covers the process of creating an ETL pipeline using Python, DuckDB and MinIO.
Thank you for tuning in, everyone! See you next month.
MinIO Team