How I’d Learn Apache Iceberg (if I Had To Start Over)

Apache Iceberg is everywhere. Major cloud providers and data platform vendors, including Google, Confluent, and Snowflake, have bundled Apache Iceberg support into their managed service offerings, making it an essential skill in every data professional's toolkit, whether you like it or not.

Want to learn Iceberg but struggling with where to start? Let me help. I've created a comprehensive 7-week study plan that balances Iceberg's theoretical concepts with hands-on practice. Though I'm still working through it, I wanted to share my learning roadmap, both to help others grasp the basics and to gather feedback for improvements. Here’s the gist of it. You can find the long version here: https://lnkd.in/eS8Kqj94

Week 1: Understanding the problem context. I will spend the first week studying what led to the creation of Apache Iceberg. By reading articles and books and watching videos, I'll build a mental model of Iceberg and understand why it exists.

Week 2: What is Iceberg? I will spend the second week understanding the architecture of Apache Iceberg: what it is made of and how it works.

Week 3: Getting hands-on. The third week is all about putting everything I’ve learned so far into practice. I will set up a local Iceberg environment where I can experiment with basic table-level operations (a minimal setup sketch follows at the end of this post).

Week 4: Working with Apache Spark, partitioning, and time travel. I will dedicate week 4 to exploring how query engines work with core Iceberg features, starting with Apache Spark.

Week 5: Record-level operations and version control for tables. In the fifth week, I will further explore the core Iceberg features using a different query engine and catalog: Dremio and Nessie.

Week 6: Streaming with Apache Flink, schema evolution. Now that I understand Iceberg’s core capabilities, it’s time to explore how Iceberg integrates batch and real-time processing. I will experiment with Apache Flink.

Week 7: Advanced concepts. I will wrap up my study in week 7, focusing on advanced Iceberg concepts.

Even after completing this 7-week schedule, I won't feel fully confident until I apply this knowledge practically. Therefore, I plan to finish by building a real-world data lakehouse project that incorporates batch and real-time data processing, a BI dashboard, and a machine learning use case.

I hope this learning plan is helpful. If you're an expert in this field, I welcome your feedback on any topics I may have missed.
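For Week 3, this is roughly the kind of local sandbox I have in mind: PySpark with a file-based (Hadoop) catalog. Treat it as a minimal sketch; the Iceberg runtime version, catalog name, and warehouse path are assumptions you would adapt to your own setup.

```python
from pyspark.sql import SparkSession

# Minimal local Iceberg sandbox (assumed versions and paths; adjust to your setup).
spark = (
    SparkSession.builder
    .appName("iceberg-sandbox")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "file:///tmp/iceberg-warehouse")
    .getOrCreate()
)

# Basic table-level operations: create, insert, query, and inspect snapshots.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, name STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'signup'), (2, 'login')")
spark.sql("SELECT * FROM local.db.events").show()
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
```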
Dunith Danushka's activity
Most relevant posts
-
What is Apache Flink? Let's demystify it!

From the official documentation: Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

What are bounded and unbounded data streams?
Bounded data streams: the data has a clear beginning and end. E.g. a movie, a song, a file.
Unbounded data streams: this type of data is more real-time and has no fixed limit or ending. E.g. data from social media posts, live sports scores, or weather reports. Consider a viral video from Taylor Swift or Katy Perry: there will be many comments and likes, and these events have to be processed in real time.

What does stateful computation mean?
Flink manages state (data context) across multiple events efficiently. This feature is essential for complex processing tasks that depend on previous data. For example, managing transactions: every transaction changes an account balance, and Flink can persist and update the balance as transactions arrive. (A minimal sketch of this idea follows after this post.) Also, whenever you hear "distributed", it means the computation/processing is spread over multiple machines.

Exploring Flink in AWS is very straightforward - https://lnkd.in/gaj3jH4Z

GCP does not offer a comparable managed service, but there are a couple of options:
1. Deploy Apache Flink on Google Kubernetes Engine (GKE).
2. Manually set up Flink clusters on Compute Engine virtual machines.
3. Use Google Cloud Dataflow to run Flink applications, since Apache Beam has a runner that supports Flink.

Similarly, in Azure you can explore:
1. HDInsight: Azure HDInsight is a cloud service that simplifies, enhances, and manages Apache Hadoop-based projects, including Apache Flink.
2. Azure Kubernetes Service (AKS): similar to GKE on Google Cloud, AKS lets you deploy containerized Flink applications using Kubernetes.

Official website - https://flink.apache.org/
Learn more about the key use cases of Flink here - https://lnkd.in/e5F_b-ii
1. Event-driven applications
2. Data analytics applications
3. Data pipeline applications

Guess the best part of Apache Flink: it has a nice squirrel logo. Do share your experience with Apache Flink if you have used it or explored it!

#Apache #Flink #Data #Cloud #AWS #GCP #Azure #Stream #Batch
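To make "stateful" concrete, here is a rough PyFlink sketch of the running-balance example above. The account IDs and amounts are made up, it reads from a small in-memory collection rather than a real stream, and exact API details can vary between PyFlink versions, so take it as an illustration only.

```python
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor


class RunningBalance(KeyedProcessFunction):
    """Keeps one ValueState per account key and emits the updated balance."""

    def open(self, runtime_context: RuntimeContext):
        self.balance = runtime_context.get_state(
            ValueStateDescriptor("balance", Types.DOUBLE()))

    def process_element(self, value, ctx):
        current = self.balance.value() or 0.0
        current += value[1]                 # apply this transaction's amount
        self.balance.update(current)        # persist the new balance in state
        yield value[0], current


env = StreamExecutionEnvironment.get_execution_environment()

# Hypothetical transactions: (account_id, amount). A real job would read a stream.
transactions = env.from_collection(
    [("acct-1", 100.0), ("acct-1", -40.0), ("acct-2", 250.0)],
    type_info=Types.TUPLE([Types.STRING(), Types.DOUBLE()]))

(transactions
    .key_by(lambda t: t[0], key_type=Types.STRING())
    .process(RunningBalance(),
             output_type=Types.TUPLE([Types.STRING(), Types.DOUBLE()]))
    .print())

env.execute("running-balance")
```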
-
Had another productive week diving deep into Apache Spark's higher-level APIs. Here's a brief description of the topics covered in Week 5:

1. Importance of higher-level APIs in Apache Spark: Higher-level APIs like DataFrames and Spark SQL provide a more user-friendly way to work with data. RDDs are raw data distributed across the partitions in the cluster without any schema attached. Higher-level APIs are more performant because the data carries some structure, which helps Spark handle it more effectively.

2. Overview of DataFrames: DataFrames are essentially RDDs with a schema attached. They are not persistent and can be used only in the session in which they are created. Data is stored in memory and metadata is stored in a temporary metadata catalog. Working with DataFrames: 1) load the data file and create a Spark DataFrame, 2) perform transformations, 3) write the results back to storage. The DataFrame reader is the standard way to create a DataFrame; there are also several shortcut methods for different file formats.

3. Overview of Spark SQL/tables: Spark tables are structured data that can be queried using SQL syntax. Data files are stored on disk and the schema is stored in a metastore. These are persistent and can be used across different sessions.

4. Types of SQL tables: There are mainly two types of tables in Spark SQL, namely managed tables and external tables. Managed tables are controlled by Spark, meaning Spark manages both the metadata and the data; once a managed table is dropped, both the data and the metadata are dropped. External tables reference data stored outside of Spark; Spark manages only the metadata, while the data itself is managed by the user. Hence, when we drop such a table, only the metadata is lost. (A short sketch contrasting DataFrames, managed tables, and external tables follows below.)

5. Overview of Apache Spark optimizations:
A) Application/code-level optimizations: using cache, preferring reduceByKey over groupByKey.
B) Resource-level optimizations: efficient use of cluster resources to enhance performance. Resources here mean CPU cores (compute) and memory (RAM). A Spark executor/container is a container of resources (CPU and RAM). There are different strategies for creating executors:
A) Thin executors: the intention is to create more executors, each with minimal resources. Limitations: a) no multi-threading; b) for shared variables (like broadcast variables), a separate copy must be maintained per executor.
B) Fat executors: maximum resources are given to each executor. Drawbacks: a) it is observed that if an executor holds more than 5 CPU cores, HDFS throughput suffers; b) garbage collection (removal of unused objects in memory) takes a lot of time.
The right/balanced approach to creating executors is to consider these points:
A) HDFS throughput shouldn't suffer, which can be ensured by giving each executor 5 CPU cores.
B) Multi-threading within the executor.

Thanks Sumit Mittal and TrendyTech.

#dataengineering #bigdataengineer #bigdata
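A small PySpark sketch tying points 2–4 together. The file path, column names, database, and table names are all hypothetical placeholders, not from the course material.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframes-and-tables").getOrCreate()

# DataFrame: load, transform, write back (session-scoped, schema attached).
orders = spark.read.option("header", True).csv("data/orders.csv")  # hypothetical path
daily = (orders
         .withColumn("amount", F.col("amount").cast("double"))
         .groupBy("order_date")
         .agg(F.sum("amount").alias("revenue")))
daily.write.mode("overwrite").parquet("/tmp/daily_revenue")

spark.sql("CREATE DATABASE IF NOT EXISTS sales")

# Managed table: Spark owns both data and metadata; DROP TABLE removes both.
daily.write.mode("overwrite").saveAsTable("sales.daily_revenue_managed")

# External table: Spark tracks only the metadata; DROP TABLE leaves the files on disk.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.daily_revenue_external
    USING parquet
    LOCATION '/tmp/daily_revenue'
""")
```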
-
A few months ago, ParadeDB released pg_analytics, a Postgres extension which embeds DuckDB inside Postgres to enable fast analytics over data lakes (AWS S3, GCS, etc.) and open table formats (Apache Iceberg, Delta Lake, etc.).

Initially licensed as AGPL-3.0 and part of the wider ParadeDB project, we've since decided to separate it into its own subproject (https://lnkd.in/dN5VJqTZ), pg_analytics, and license it permissively under the PostgreSQL license, which is similar to the popular Apache-2.0. The goal is for the wider community to be able to adopt and benefit from it without any legal restriction.

Over time, we'd like to move pg_analytics outside of the ParadeDB GitHub organization and into a foundation or its own top-level organization to enable community-led governance. If you run Postgres, whether as a service or internally, you are now able to adopt, host, and resell pg_analytics as you wish.

If you're interested in helping us move pg_analytics into a community-led governance organization, or are experienced with open-source governance, please reach out.
-
Greetings, LinkedIn family! Here is a simple explanation of Apache Spark jobs and the configuration properties that matter most for distributed data processing. Let's examine Apache Spark's fundamental parts and functions to understand how it operates internally.

Spark jobs: 10 crucial configuration properties!

spark.sql.shuffle.partitions
- Example: spark.conf.set("spark.sql.shuffle.partitions", "200")
- Usage: Configures the number of partitions to use when shuffling data during joins or aggregations.

spark.executor.memory
- Example: spark.conf.set("spark.executor.memory", "4g")
- Usage: Sets the amount of memory to allocate per Spark executor.

spark.shuffle.file.buffer
- Example: spark.conf.set("spark.shuffle.file.buffer", "64k")
- Usage: Specifies the buffer size for reading and writing shuffle files.

spark.task.maxFailures
- Example: spark.conf.set("spark.task.maxFailures", "5")
- Usage: Defines the maximum number of failures allowed for a single task before the whole job is considered failed.

spark.reducer.maxSizeInFlight
- Example: spark.conf.set("spark.reducer.maxSizeInFlight", "96m")
- Usage: Sets the maximum size of map output data to fetch simultaneously from each reduce task.

spark.sql.autoBroadcastJoinThreshold
- Example: spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10m")
- Usage: Controls the size at which Spark will automatically broadcast the smaller side of a join.

spark.sql.broadcastTimeout
- Example: spark.conf.set("spark.sql.broadcastTimeout", "300")
- Usage: Sets the maximum time to wait for broadcast data to be sent to all nodes.

spark.databricks.delta.autoCompact
- Example: spark.conf.set("spark.databricks.delta.autoCompact", "true")
- Usage: Enables automatic compaction of Delta table files for better performance.

spark.databricks.delta.retentionDurationCheck.enabled
- Example: spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
- Usage: Enables or disables the retention duration check when vacuuming Delta tables.

spark.databricks.io.cache.enabled
- Example: spark.conf.set("spark.databricks.io.cache.enabled", "true")
- Usage: Enables or disables the Databricks disk cache for improved I/O performance.

In summary, Spark properties control most application parameters and can be set using a SparkConf object or through Java system properties (a short example follows below).

#Dataengineering #sparkjobs #databricks #azuredataengineering
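As a quick illustration of that summary, here is a minimal sketch that sets a few of these properties via a SparkConf object at startup and adjusts SQL-level ones on a live session. The values are simply the examples from the post, not tuning recommendations for any particular workload.

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative values only; the right settings depend on your cluster and workload.
conf = (SparkConf()
        .set("spark.sql.shuffle.partitions", "200")
        .set("spark.executor.memory", "4g")
        .set("spark.task.maxFailures", "5")
        .set("spark.shuffle.file.buffer", "64k"))

spark = SparkSession.builder.appName("tuned-job").config(conf=conf).getOrCreate()

# SQL-related properties can also be changed at runtime on an existing session.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10m")
spark.conf.set("spark.sql.broadcastTimeout", "300")

print(spark.conf.get("spark.sql.shuffle.partitions"))
```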
-
New blog post: Integrating Apache Iceberg with AWS and Snowflake!

Hey #DataEnthusiasts and #CloudGeeks! I've just published an article on how MRH Trowe integrates Apache Iceberg with Amazon Web Services (AWS) Glue, S3, Athena, Matillion, and Snowflake.

What to expect:
- Collecting data via an API and storing it in S3 as Parquet files
- Using Glue Crawler and Athena to update Iceberg tables (a small sketch of this step follows below)
- Orchestrating the entire process with Matillion
- Querying the data in Snowflake
- Tips to avoid compatibility issues with non-ASCII characters

Read the full article here: https://lnkd.in/gVX7HTrW

#DataEngineering #AWS #Snowflake #Matillion #ApacheIceberg
Integration of Apache Iceberg in S3, Glue, Athena, Matillion, and Snowflake – Part 2 (dev.to)
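Not part of the article itself, but as a rough sketch of the "Athena updates the Iceberg table" step: a boto3 call that runs an INSERT from a Parquet staging table into an Iceberg table. The region, database, table, and bucket names are placeholders to adapt to your own Glue catalog and S3 layout.

```python
import boto3

athena = boto3.client("athena", region_name="eu-central-1")  # assumed region

# Placeholder database/table/bucket names.
response = athena.start_query_execution(
    QueryString=(
        "INSERT INTO lakehouse.claims_iceberg "
        "SELECT * FROM lakehouse.claims_parquet_staging"
    ),
    QueryExecutionContext={"Database": "lakehouse"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print(response["QueryExecutionId"])  # poll this ID to check query status
```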
-
Integrating MongoDB with Elasticsearch is common but can be complex and time-consuming. In our latest blog, we explore different integration methods and introduce a simpler alternative: SingleStore Kai. It offers real-time analytics on MongoDB data without the need for extra tools or complex setups, all while using the familiar Mongo syntax. Check out the article to learn more: https://lnkd.in/evhYRcz5
Easy Integration of MongoDB and Elasticsearch (singlestore.com)
-
PHASE ONE UPDATE

A few days ago, I announced my commitment to a personal project aimed at building a comprehensive machine learning lifecycle for a regression task, spanning everything from data ingestion to monitoring. I'm happy to share that I've successfully completed Phase One of the project. Here's a summary of my accomplishments so far:

> GitHub repository setup: created a structured GitHub repository to host and manage the project.
> Folder organization: established a well-structured directory for efficient navigation.
> MongoDB Atlas connection: successfully connected to MongoDB Atlas, setting up a robust database management system.
> ETL pipeline development: built a basic ETL (Extract, Transform, Load) pipeline (a minimal sketch of this step follows below) that:
1. Extracts data from my local machine.
2. Transforms the data into JSON format.
3. Loads the data into MongoDB Atlas.

This process posed some challenges, as it was my first attempt at implementing an ETL pipeline. However, overcoming these challenges provided valuable insights and a strong foundation for future tasks. Note: while this pipeline extracts data from a local source, future iterations may involve data from various other sources.

Tools and technologies utilized so far:
> pymongo: for MongoDB operations.
> MongoDB Atlas: as the database management platform.
> Git: for version control.

The next phase involves developing a data ingestion pipeline to automate data collection and preprocessing. Stay tuned for updates as I continue this journey.

Additionally, I'm excited to share a preview of an upcoming collaborative project with my amazing team, where we will:
1. Use Apache Airflow to extract data from the YouTube API.
2. Store the data in a PostgreSQL data warehouse.
3. Use the extracted data to perform sentiment analysis.

I'd love your feedback! You can check out the repository and review the work completed so far.
GitHub repo: https://lnkd.in/diYhcCfz

Thank you for following my progress, and I look forward to sharing more updates as the project unfolds!

#MachineLearning #DataScience #ETLPipeline #MongoDB #PythonDevelopment #GitHubProjects #AIProjects #DataEngineering #pymongo #MongoDBAtlas #DataIngestion #ProjectUpdate #AIJourney #TechInnovation #DataPipeline #APIAutomation #PostgreSQL #ApacheAirflow #YouTubeAPIIntegration #FullStackML #PythonProgramming
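For anyone curious what such a first-pass ETL step can look like, here is a minimal pymongo sketch under assumed names: the connection string, database, collection, file path, and column names are all placeholders, not the project's actual code.

```python
import csv
from pymongo import MongoClient

# Placeholder Atlas URI and names; substitute your own connection string and schema.
client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
collection = client["ml_project"]["raw_data"]


def extract(path):
    """Extract: read rows from a local CSV file as dictionaries."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)


def transform(row):
    """Transform: cast numeric fields into a JSON-like document."""
    return {
        "feature_1": float(row["feature_1"]),
        "feature_2": float(row["feature_2"]),
        "target": float(row["target"]),
    }


def load(docs):
    """Load: insert the documents into MongoDB Atlas in one batch."""
    collection.insert_many(list(docs))


load(transform(row) for row in extract("data/train.csv"))
```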
-
What the heck are Apache Iceberg, Apache Hudi, and Delta Lake?

Apache Iceberg, Apache Hudi, and Delta Lake are open-source table formats designed to bring ACID (Atomicity, Consistency, Isolation, Durability) transactions and advanced data management features to data lakes (hooray!). They address the limitations of traditional data lakes by enabling efficient data ingestion, storage, and query processing at scale. Despite sharing common goals, they differ in architecture, features, and optimal use cases.

Apache Iceberg is best for:
- Organizations dealing with petabyte-scale datasets requiring schema evolution.
- Environments that utilize multiple processing engines (e.g., Spark, Flink, Trino) and need consistent table behavior across them.
- Scenarios where read performance and scalability are critical, and write operations are mostly batch-oriented.

Apache Hudi is best for:
- Use cases requiring real-time data ingestion and availability.
- Scenarios needing record-level mutations, such as upserts and deletes.
- Environments where incremental data processing and streaming are essential.

Delta Lake is best for:
- Organizations already leveraging Apache Spark for data processing.
- Workloads that require a unified approach to batch and streaming data.
- Scenarios where time travel and historical data analysis are needed.

Which should you choose?

Use Apache Iceberg if:
- You need scalable read performance over massive datasets.
- Consistent table behavior across multiple query engines is important.
- Complex schema evolution without table rewrites is required.

Use Apache Hudi if:
- Real-time data ingestion and immediate query availability are necessary.
- Your workload involves frequent record-level updates and deletes.
- Incremental data processing capabilities are crucial.

Use Delta Lake if:
- You're heavily invested in Apache Spark and want seamless integration.
- Combining batch and streaming data processing is a priority.

Which table format do you use?

#dataanalytics #dataengineering #tableformat
-
Introducing ParadeDB

ParadeDB is an alternative to Elasticsearch, built on Postgres. Experience modernized real-time search and analytics like never before.

https://lnkd.in/eCU3tK-w
GitHub - paradedb/paradedb: Postgres for Search and Analytics (github.com)
-
Now is the time to share this article I wrote a few months ago. At Milsat, we are building software that tracks real-time analytic data to help organizations and individuals monitor real-time electricity usage. I wrote an article on how we use MongoDB aggregation to aggregate complex data, manipulate large datasets in real time, and run complex analytics queries.

MongoDB aggregation is a way of processing a large number of documents in a collection by passing them through a series of stages, called a pipeline (a rough sketch follows after this post).

Some of the reasons why we use MongoDB aggregation:
- Joining data together from different collections on the "server-side"
- Real-time analytics
- Real-time queries where deeper "server-side" data post-processing is required than provided by the MongoDB Query Language (MQL)

Check out the article: https://lnkd.in/dgMuDDSJ

#nodejs #mongodb
MongoDB Aggregation, is really powerful (dev.to)
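The article's examples are in Node.js; purely as an illustration of the pipeline idea, here is a rough pymongo equivalent that buckets readings per device per hour. The connection string, collection, and field names are made up, and the $dateTrunc stage requires MongoDB 5.0 or newer.

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
readings = client["energy"]["meter_readings"]        # hypothetical collection

cutoff = datetime.now(timezone.utc) - timedelta(days=1)

# Stages run in order: filter recent documents, group per device/hour, sort newest first.
pipeline = [
    {"$match": {"timestamp": {"$gte": cutoff}}},
    {"$group": {
        "_id": {
            "device": "$device_id",
            "hour": {"$dateTrunc": {"date": "$timestamp", "unit": "hour"}},
        },
        "total_kwh": {"$sum": "$kwh"},
    }},
    {"$sort": {"_id.hour": -1}},
]

for doc in readings.aggregate(pipeline):
    print(doc)
```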