Announcing my new book - "Engineering Lakehouses with Open Table Formats"
TBH, I have been thinking about this for quite some time.
Often, in conversations with folks exploring table formats, questions have come up around choosing the right table format, understanding use cases, and designing the overall lakehouse architecture.
So, the goal with this book is to provide a comprehensive resource for data/software engineers, architects, and decision-makers to understand the essentials of these formats.
But also to elaborate on some of the less-talked-about 'core' stuff (beyond the marketing jargon).
Specifically, the book will target 4 angles:
• Table format internals - e.g. how ACID transactions work, what a storage engine is, performance optimization methods, etc.
• Decisions on selecting a table format - factors to consider from a technical standpoint, ecosystem, features.
• Use cases and how to implement them - streaming/batch, single-node workloads, CDC, integration with MLflow, etc.
• What's happening next - interoperability (Apache XTable (Incubating), UniForm), catalogs (Hive to newer ones such as Unity Catalog, Apache Polaris (Incubating))
I’ve been fortunate to have first-hand experience working with open table formats, primarily Apache Iceberg and Apache Hudi, and in some capacity with Delta Lake (circa 2019).
And I intend to bring those experiences to the book and touch upon the intricacies, along with some of the pain points of getting started.
I am also thrilled to have Vinoth Govindarajan as a co-author, who brings a wealth of experience building lakehouses at scale with these formats at organizations like Uber and Apple.
We have drafted the first few chapters, but there's still work to do.
We’d love to take this opportunity to learn more from the community about any additional topics of interest for the book.
I'll be opening a formal feedback channel in a few days.
Oh, and the book is already available for pre-order on Amazon (link in comments).
Thanks to Packt for their continuous support in making this a solid effort!
#dataengineering #softwareengineering