Cheers to Real-time Analytics with Apache Flink : Part 3 of 3

Wer Ordnung hält, ist nur zu faul zum Suchen (If you keep things tidily ordered, you’re just too lazy to go searching.) —German proverb

Drawing inspiration from the principles explored in Chapter 3 of Designing Data-Intensive Applications, we embark on our quest to harness the power of Elasticsearch and Kibana for real-time data storage and visualisation.

Building on the first two parts (Part 1, Part 2), where we explored real-time data ingestion with Apache Flink and storage in PostgreSQL, Part 3 delves into the exciting world of real-time data visualisation.


Project Architecture Recap

Part 3 focuses on Elasticsearch and Kibana

What is Elasticsearch?

Here's a quick rundown on Elasticsearch:

  • Search Engine Powerhouse: It's an open-source, distributed search and analytics engine built for speed and scalability. Imagine a supercharged search bar that can handle massive amounts of data across different formats (text, numbers, location data, etc.).
  • Real-time Ready: Elasticsearch excels at processing information close to real-time, making it ideal for situations where you need up-to-date insights.
  • Data Buffet: It can store a wide variety of data, structured or unstructured, making it a versatile tool for various tasks.
  • Part of the Gang: Elasticsearch is the core component of the Elastic Stack, a suite of tools for data ingestion, processing, visualisation, and analysis. It works alongside Kibana (visualisation), Logstash (data processing), and Beats (data collection) for a complete data management solution.

In a nutshell, Elasticsearch lets you store, search, and analyze large amounts of data quickly and efficiently, making it a valuable tool for tasks like log analysis, website search, and real-time analytics.


What is Kibana?

Here's a quick explanation of Kibana:

  • Visual Insights Partner: Kibana is an open-source data visualization tool designed specifically to work with Elasticsearch. It acts as the front-end interface, allowing you to explore and analyze the data stored within Elasticsearch.
  • Dashboard Dynamo: Kibana excels at creating informative dashboards that present complex data in a clear and understandable way. Think of it as a customizable cockpit for your data, letting you see trends, patterns, and anomalies at a glance.
  • Visualization Powerhouse: It offers a wide range of visualizations like charts, graphs, maps, and time series analysis. This allows you to tailor the data presentation to best suit your needs.
  • Elastic Stack Teammate: Kibana is part of the Elastic Stack, a comprehensive suite of tools for data management. It works alongside Elasticsearch (search and analytics engine), Logstash (data processing), and Beats (data collection) for a streamlined data analysis workflow.

In essence, Kibana takes the power of Elasticsearch and transforms it into visually compelling insights. It allows you to interact with, analyze, and communicate your data effectively.


Why Elasticsearch and Kibana for Real-time Data Visualisation?

Here are some of the key reasons why Elasticsearch and Kibana are a popular choice for data visualization:

Perfect Match:

  • Built to Work Together: Elasticsearch and Kibana are designed specifically to complement each other. Elasticsearch stores and analyzes the data, while Kibana provides a user-friendly interface for visualizing and interacting with that data. This tight integration streamlines the process and ensures smooth data exploration.

Elasticsearch Strengths:

  • Speed Demon: Elasticsearch excels at handling massive datasets and performing searches incredibly quickly. This is crucial for real-time data visualization, where insights need to be generated and displayed with minimal lag.
  • Data Buffet: It can store a wide variety of data formats, including structured (databases) and unstructured (logs, text files). This flexibility makes it suitable for different data visualization needs.
  • Scalability Powerhouse: Elasticsearch can be easily scaled horizontally by adding more nodes to the cluster. This allows it to handle growing data volumes without sacrificing performance, making it ideal for future-proof data visualization solutions.

Kibana's Visualization Prowess:

  • Visual Storytelling: Kibana offers a wide range of visualization options, including charts, graphs, maps, and time series analysis. This allows you to present complex data in a clear and engaging way, making it easier to identify trends, patterns, and anomalies.
  • Customization King: Kibana dashboards are highly customizable. You can create dashboards tailored to specific audiences or use cases, focusing on the most relevant data points and visualizations.
  • Interactive Insights: Kibana allows users to interact with the visualizations by filtering data, zooming in on details, and drilling down for deeper analysis. This interactivity fosters a more engaging and insightful data exploration experience.

Additional Considerations:

  • Open Source Advantage: Both Elasticsearch and Kibana are open-source tools, making them a cost-effective option for many organizations. There is also a large and active community that provides support and resources.
  • Elastic Stack Ecosystem: They are part of the broader Elastic Stack, which includes tools for data ingestion (Logstash, Beats) and machine learning. This creates a comprehensive data management solution for organizations looking to go beyond simple visualization.

Overall, Elasticsearch and Kibana offer a powerful and versatile combination for data visualization. Their speed, scalability, data handling capabilities, and rich visualization features make them a compelling choice for a variety of use cases.


Index in Elasticsearch

In Elasticsearch, an index is a fundamental unit of organization for your data. It's like a giant filing cabinet within the larger storage system. Here's a breakdown of what an index entails:

  • Collection of Documents: An index holds a collection of documents, similar to folders containing files in a filing cabinet. Each document represents a single piece of data, like a product description, a customer record, or a log entry.
  • Logical Namespace: An index has a unique name that acts as a logical identifier. It helps you categorize and manage your data efficiently. Imagine labeling each folder in the filing cabinet for easy retrieval.
  • Schema Definition: Optionally, you can define a schema for your index (called a mapping in Elasticsearch), specifying the structure and data types of the fields within your documents. This schema acts like a blueprint for your documents, ensuring consistency and enabling efficient searching.
  • Shards and Replicas: Internally, Elasticsearch splits each index into smaller partitions called shards. These shards distribute the data load across your cluster of machines, improving search performance and scalability. Additionally, you can configure replicas, which are copies of shards on different nodes, to ensure data redundancy and fault tolerance.
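
To make the mapping and shard/replica ideas concrete, here is a minimal sketch of creating an index from the Kibana DevTools console. The index name transactions_demo, the shard and replica counts, and the field types are illustrative assumptions, not settings taken from this project:

PUT transactions_demo
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "transactionId":   { "type": "keyword" },
      "transactionDate": { "type": "date" }
    }
  }
}

With three primary shards, documents are spread across up to three nodes, and each shard keeps one replica on a different node for redundancy.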

Here's an analogy to solidify the concept:

Imagine a library with different sections (indices) like fiction, non-fiction, and biographies. Each section contains books (documents), with each book having chapters and pages (fields within a document). The library might organize the books by genre or author (similar to defining a schema).

In essence, Elasticsearch indexes help you structure, organize, and efficiently search your data. They provide a scalable and fault-tolerant way to store and retrieve information.


Getting Started With Apache Flink and Elasticsearch

The Apache Flink Elasticsearch connector acts as a bridge between your Flink streaming or batch data processing pipelines and the Elasticsearch search engine. It allows you to:

  • Sink Data into Elasticsearch: You can use the connector to send data streams or processed datasets from your Flink program directly to Elasticsearch as new documents. This is useful for real-time data analysis or storing processed data for further exploration in Kibana.

Here are some key features of the connector:

  • Fault Tolerance: Flink's checkpointing mechanism ensures at-least-once delivery of data to Elasticsearch. This means that data is not lost even if failures occur during the transfer process.
  • Bulk Indexing: The connector efficiently transmits data in bulk to Elasticsearch, optimizing performance for large datasets.
  • Flexibility: You can configure various aspects of how data is indexed in Elasticsearch, such as the target index name and how document IDs are generated.
  • Upsert Mode (Optional): If your data has a primary key, the connector can operate in upsert mode. This allows it to handle updates and deletes along with inserts, keeping your Elasticsearch data synchronized with your Flink processing results.
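
A note on the fault-tolerance point above: the at-least-once guarantee only applies when checkpointing is enabled on the Flink execution environment, since the sink flushes its buffered requests as part of checkpointing. Here's a minimal sketch (the checkpoint interval is an illustrative assumption, not a value from the linked repository):

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Checkpoint every 10 seconds; without checkpointing, requests still
// buffered in the Elasticsearch sink can be lost on failure.
env.enableCheckpointing(10000L)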


Define An Elasticsearch Sink

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.flink.connector.elasticsearch.sink.{Elasticsearch7SinkBuilder, ElasticsearchSink}
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests
import org.elasticsearch.common.xcontent.XContentType

// Serialises a Transaction to JSON and wraps it in an IndexRequest
// targeting the "transactions" index, keyed by transactionId.
// (The emitter below inlines the same logic.)
def createIndexRequest(element: Transaction): IndexRequest = {
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  val json: String = mapper.writeValueAsString(element)
  Requests.indexRequest().index("transactions").id(element.transactionId).source(json, XContentType.JSON)
}

def writeToElastic(transactionStream: DataStream[Transaction]): Unit = {

  val sink: ElasticsearchSink[Transaction] = new Elasticsearch7SinkBuilder[Transaction]
    .setHosts(new HttpHost("localhost", 9200, "http"))
    .setBulkFlushMaxActions(2)  // flush after every 2 buffered actions
    .setBulkFlushInterval(10L)  // or after 10 ms, whichever comes first
    .setEmitter[Transaction] {
      (transaction, context, indexer) => {
        val mapper = new ObjectMapper()
        mapper.registerModule(DefaultScalaModule)
        val json: String = mapper.writeValueAsString(transaction)

        val indexRequest = Requests.indexRequest()
          .index("transactions")
          .id(transaction.transactionId)
          .source(json, XContentType.JSON)
        indexer.add(indexRequest)
      }
    }.build()

  transactionStream.sinkTo(sink)
}

Details of the Transaction class are available here :: https://github.com/snepar/flink-ecom/blob/master/src/main/scala/ecom/generators/Dto/Transaction.scala


Write to Elasticsearch Sink

val transactionStream = readCustomData()
transactionStream.print()
writeToElastic(transactionStream)
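
Nothing is sent until the job is actually executed; assuming the StreamExecutionEnvironment is held in env, as in the linked file, the job is kicked off with (the job name string is illustrative):

env.execute("Flink Ecom Kafka-Postgres-Elasticsearch")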

Refer to the complete code here :: https://github.com/snepar/flink-ecom/blob/master/src/main/scala/ecom/KafkaPGESIntegrationEcom.scala


Spin Up Your Playground

Follow the instructions here :: https://github.com/snepar/flink-ecom-infra

Elasticsearch will be available at localhost:9200, and Kibana at localhost:5601

Then execute this application to get started with the data ingestion :: https://github.com/snepar/flink-ecom


Search Using Elastic

Here’s the structure of the transactions index on Elasticsearch. You can get it by running

GET transactions         

in the DevTools.
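
The response describes the index's mappings and settings. Here's a trimmed, illustrative shape under dynamic mapping (your exact output will differ):

{
  "transactions": {
    "aliases": {},
    "mappings": {
      "properties": {
        "transactionId": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        },
        "transactionDate": { "type": "long" }
      }
    },
    "settings": {
      "index": { "number_of_shards": "1", "number_of_replicas": "1" }
    }
  }
}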


Query In Elastic

GET transactions/_search        
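This returns matching documents (all of them, by default). You can also narrow the search with a query body; for example, a term query on the keyword subfield that dynamic mapping creates for transactionId (the id value below is a placeholder):

GET transactions/_search
{
  "query": {
    "term": { "transactionId.keyword": "some-transaction-id" }
  }
}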

Re-Index Data in Elastic

To get a readable transaction date, we need to reindex into a different index. To reindex data in Elasticsearch, we use the _reindex API:

POST _reindex
{
  "source": {"index": "transactions"},
  "dest": {"index": "transaction_part1"},
  "script": {
    "source": """
      ctx._source.transactionDate = new Date(ctx._source.transactionDate).toString();
    """
  }
}

GET transaction_part1/_search

However, toString() gives us little control over the format, so we need a more robust way to format the date.

POST _reindex
{
  "source": {"index": "transactions"},
  "dest": {"index": "transaction_part2"},
  "script": {
    "source": """
      SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
      formatter.setTimeZone(TimeZone.getTimeZone('UTC'));
      ctx._source.transactionDate = formatter.format(new Date(ctx._source.transactionDate));
    """
  }
}

GET transaction_part2/_search        

Dashboarding in Real-time with Kibana

First, create a Kibana index pattern on transaction_part2 so the re-indexed data is available to visualisations.

Creating a Donut Chart

Number of Transactions (a count metric)

Top Brands (a bar chart)
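
Under the hood, Kibana derives a chart like this from a terms aggregation. A roughly equivalent DevTools query is sketched below; the field name productBrand (and its .keyword subfield) is an assumption about the Transaction schema, so substitute your actual brand field:

GET transaction_part2/_search
{
  "size": 0,
  "aggs": {
    "top_brands": {
      "terms": { "field": "productBrand.keyword", "size": 10 }
    }
  }
}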

Final Dashboard (Keeping It All Together)

Thank You for Spending Some Time Here

The complete post is available here :: https://dev.to/snehasish_dutta_007/apache-flink-e-commerce-data-pipeline-usecase-3ha9

Repositories ::

flink-ecom-infra :: https://github.com/snepar/flink-ecom-infra
flink-ecom :: https://github.com/snepar/flink-ecom

In Case You Missed Them

Part 1 :: https://www.dhirubhai.net/pulse/cheers-real-time-analytics-apache-flink-part-1-snehasish-dutta-rv8le/

Part 2 :: https://www.dhirubhai.net/pulse/cheers-real-time-analytics-apache-flink-part-2-snehasish-dutta-ig1pe/


Acknowledgements and Inspirations



#data #bigdata #apacheflink #scala #java #kafka #streaming #python #elasticsearch #kibana

