Enterprise DataHub
Introduction
As we are all aware, the world around us is moving faster every day. Things that used to change over decades now change in weeks. At Bayer, we strive to keep up with that pace and equip our teams with the most competitive tools on the market. That is why we have developed a global, cutting-edge data streaming platform: Enterprise DataHub (EDH).
This platform empowers engineers across the entire company to focus solely on delivering business value, instead of spending time building infrastructure, monitoring, and other supporting capabilities.
At the core of EDH is Apache Kafka, along with other components such as connectors, monitoring & alerting systems, the Portal (WebUI), and robust auditing tools. In this article, I will highlight the main aspects of these components.
Currently, teams use both production and non-production environments, processing thousands of topics that carry tens of gigabytes of data in and out daily. Some of these topics are replicated across AWS US, AWS EU, and GCP clusters.
Architecture
High-Level Architecture Diagram
Apache Kafka Brokers
As mentioned in the introduction, the platform comprises multiple components. All of them are important, but there is one without which the whole organism could not live: Apache Kafka. This distributed system is responsible for storing and processing all the data sent through EDH.
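To give a sense of how an application might publish events to an EDH topic, here is a minimal sketch using the standard Kafka Java client. The bootstrap servers, topic name, and record payload are placeholder assumptions for illustration, not the actual EDH configuration, which would also include security settings provided by the platform.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EdhProducerExample {
    public static void main(String[] args) {
        // Placeholder connection settings; a real EDH client would use the
        // bootstrap servers and security configuration issued by the platform.
        Properties props = new Properties();
        props.put("bootstrap.servers", "edh-broker-1:9092,edh-broker-2:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait for all in-sync replicas to acknowledge

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "orders.events" is a hypothetical topic name used for illustration.
            producer.send(new ProducerRecord<>("orders.events", "order-123", "{\"status\":\"CREATED\"}"));
            producer.flush();
        }
    }
}
```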
There are many reasons why Kafka was chosen; below I describe the most important ones:
Kafka, along with ZooKeeper and Schema Registry, currently runs on virtual machines across AWS US, AWS EU, and GCP US. In each cloud, there are five brokers in the non-production environment and seven in production. We are continuously evaluating a migration to Kubernetes; however, Kubernetes does not yet offer enough advantages in key factors such as stability, performance, scalability, and maintenance to give us a clear green light for that change.
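In practice, topics on EDH are provisioned through the Portal, but the underlying operation can be sketched with Kafka's AdminClient. The bootstrap server, topic name, partition count, and replication factor below are illustrative assumptions; the only hard constraint is that the replication factor cannot exceed the broker count (five or seven in our clusters).

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class EdhTopicProvisioningExample {
    public static void main(String[] args) throws Exception {
        // Placeholder bootstrap server; a real cluster would also require security settings.
        Properties props = new Properties();
        props.put("bootstrap.servers", "edh-broker-1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions and replication factor 3 are illustrative values only.
            NewTopic topic = new NewTopic("orders.events", 12, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```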
We perform regular upgrades to bring new enhancements and security patches into the clusters. Typically, we install the version just behind the latest release to avoid potential issues introduced in the most recent one.
Replicant
Replicant is a simple but very powerful tool. Its main function is to transport data from one cluster to one or more other clusters. There are many reasons for having such functionality, but the main ones are listed below:
Replicant runs on at least three instances in each cluster to ensure high availability and fault tolerance. It is an application developed in Java that consumes data from a topic in one cluster and produces that data to the same topic in another cluster, as sketched below.
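The core idea can be illustrated with a minimal consume-and-forward loop using the standard Kafka Java clients. The cluster addresses, topic name, and consumer group below are placeholder assumptions, and the real Replicant includes error handling, offset management, multi-topic routing, and monitoring that are omitted here.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ReplicantSketch {
    public static void main(String[] args) {
        // Placeholder source/target clusters and topic; not the actual EDH configuration.
        String topic = "orders.events";

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "source-cluster:9092");
        consumerProps.put("group.id", "replicant");
        consumerProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
        consumerProps.put("value.deserializer", ByteArrayDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "target-cluster:9092");
        producerProps.put("key.serializer", ByteArraySerializer.class.getName());
        producerProps.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (true) {
                // Read records from the source cluster and forward them unchanged
                // to the same topic on the target cluster.
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    producer.send(new ProducerRecord<>(topic, record.key(), record.value()));
                }
                consumer.commitSync();
            }
        }
    }
}
```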
Users can manage all of this configuration through the Portal, where they can decide on a few parameters: