Enterprise DataHub
Tomasz Krol, Lead Data Engineer | Digital Hub Warsaw at Bayer

Introduction

As we are all aware, the world around us is becoming faster every day. Things that used to change over decades now change in weeks. At Bayer, we strive to align with that pace and equip our teams with the most competitive tools on the market. Therefore, we have developed a global, cutting-edge data streaming platform: Enterprise DataHub (EDH).

This platform empowers engineers across the entire company to focus solely on delivering business value, instead of spending time building infrastructure, monitoring, and other supporting capabilities.

At the core of EDH is Apache Kafka, along with other components such as connectors, monitoring & alerting systems, Portal (WebUI), and robust auditing tools. In this article, I will highlight the main aspects of these components.

Currently, users work with both production and non-production environments, processing thousands of topics that carry tens of gigabytes of data in and out daily. Some of these topics are replicated across AWS US, AWS EU and GCP clusters.

Architecture

High-Level Architecture Diagram


Apache Kafka Brokers

As mentioned in the introduction, the platform comprises multiple components. All of them are important, but there is one without which the whole organism could not live: Apache Kafka. This distributed system is responsible for storing and processing all the data sent through EDH.

There are many reasons why Kafka was chosen; below I describe the most important ones:

  • High Throughput - Kafka is designed to handle large volumes of data with high throughput. It can process millions of records per second, making it an ideal choice for use cases at Bayer, where delivering real-time data is a common requirement.

  • Scalability - Kafka is highly scalable. It can be scaled out by adding brokers and scaled up by using more memory and CPUs. Our brokers are hosted on virtual machines, allowing us to achieve scalability through Auto-Scaling Groups.

  • Durability and Fault Tolerance - Kafka replicates data across multiple brokers, so even if some brokers fail, the data remains consistent and available. The replication mechanism provides quality attributes such as high availability and reliability.

  • Real-Time Processing - Kafka supports real-time publishing and consumption of data. This type of processing is a must-have in any modern data-driven organisation.

  • Decoupling of Systems - Kafka acts as an intermediary between producers and consumers, which operate as independent components (see the sketch after this list). This type of architecture allows for high flexibility and scalability.

  • Integration with Ecosystem - Kafka has a rich ecosystem of external tools such as Kafka Connect, Schema Registry, and Zookeeper (we are in the process of transitioning to KRaft). While some users rely on Kafka Connect in their data pipelines, we created a similar tool, EDH Wire, which is better aligned with Bayer’s environment and is widely used.


  • Maturity, Community and Support - Kafka has been on the market for years and is used by many organisations, some of which process petabytes of data. It is actively developed, extensively documented, and, thanks to its popularity, has significant community support.
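
To make the decoupling point above more concrete, here is a minimal sketch using the standard Kafka Java client. It is only an illustration: the broker address, topic name, and consumer group are placeholder values, not actual EDH settings.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class DecouplingSketch {

    // Producer side: knows only the broker address and the topic name,
    // nothing about who (if anyone) consumes the data downstream.
    static void produce() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("order-events", "order-123", "{\"status\":\"CREATED\"}"));
        }
    }

    // Consumer side: subscribes to the same topic independently;
    // it can be deployed, scaled, or replaced without touching the producer.
    static void consume() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "order-processing"); // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("order-events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("key=%s value=%s%n", record.key(), record.value());
            }
        }
    }
}
```

Neither side references the other directly; either can be deployed, scaled, or replaced independently, which is exactly what makes Kafka suitable as a company-wide integration layer.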

Kafka, along with Zookeeper and Schema Registry, currently runs on virtual machines across AWS US, AWS EU and GCP US. In each cloud, there are five brokers in the non-production environment and seven in production. We continuously review the possibility of migrating to Kubernetes. However, Kubernetes still does not offer enough advantages in key areas such as stability, performance, scalability, and maintenance to give us a clear green light for that change.

We perform regular upgrades to bring new enhancements and security patches into the clusters. Typically, we install the release just before the latest one to avoid potential issues introduced in the most recent releases.


Replicant

Replicant is a simple but very powerful tool. Its main function is to transport data from one cluster to one or more other clusters. There are many reasons for having such functionality; the main ones are listed below:

  • Geographical redundancy - spreading clusters across different locations protects against outages in a specific region (e.g. natural disasters, power outages, or network issues).

  • Cloud redundancy - using multiple cloud providers reduces dependency on a single vendor and gives more flexibility to take advantage of different providers’ offerings.

  • Improved latency and performance - locating data closer to users can significantly decrease latency and improve performance for users near a specific data centre.

  • Compliance and Data Privacy - some industries or regional regulations impose strict restrictions on where data may be processed and stored.

  • Scalability - spreading processing and storage across multiple clusters distributes the load, which enhances performance and prevents any single cluster from being overloaded.

  • Business Continuity - in case of failures in a specific region or cloud provider, business continuity is assured by having the data available on other clusters.


Replicant runs on at least three instances in each cluster to ensure high availability and fault tolerance. It is an application developed in Java, which consumes data from a topic in one cluster and produces that data to the same topic in another cluster.
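
Conceptually, the core of such a replication process boils down to a consume-then-produce loop. The sketch below shows that pattern with the standard Kafka Java client; it is not Replicant’s actual implementation, and the cluster addresses, topic name, and consumer group are hypothetical.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ReplicationLoopSketch {
    public static void main(String[] args) {
        String topic = "edh.example.topic"; // hypothetical topic name

        // Consumer connected to the source cluster (placeholder address).
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "source-cluster:9092");
        consumerProps.put("group.id", "replicant-edh.example.topic");
        consumerProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
        consumerProps.put("value.deserializer", ByteArrayDeserializer.class.getName());

        // Producer connected to the target cluster (placeholder address).
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "target-cluster:9092");
        producerProps.put("key.serializer", ByteArraySerializer.class.getName());
        producerProps.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of(topic));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Write each record to the same topic name on the target cluster,
                    // preserving the original key so partitioning stays consistent.
                    producer.send(new ProducerRecord<>(topic, record.key(), record.value()));
                }
                consumer.commitSync(); // commit only after records have been handed to the producer
            }
        }
    }
}
```

Running several such instances per cluster, as EDH does, provides the high availability and fault tolerance mentioned above.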

Users make all configurations through the Portal, where they decide on a few parameters (a sketch of such a configuration follows this list):

  • Source and target cluster(s) between which the topic should be replicated

  • How many instances of the process should run, depending on the size of the topic and the number of partitions

  • Whether to use the round-robin method for topics with multiple partitions

  • Starting offset
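
For illustration only, these parameters could be modelled as a small configuration object like the one below. The field names and example values are assumptions for this sketch, not the actual Portal or EDH data model.

```java
import java.util.List;

// Hypothetical shape of a replication request as configured in the Portal.
// Field names and example values are illustrative, not the actual EDH API.
public record ReplicationConfig(
        String topic,                 // topic to be replicated
        String sourceCluster,         // e.g. "aws-eu-prod" (placeholder name)
        List<String> targetClusters,  // one or more target clusters
        int instances,                // number of instances, sized to topic volume and partition count
        boolean roundRobin,           // distribute partitions across instances round-robin
        String startingOffset         // e.g. "earliest" or "latest"
) {}
```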
