Cheers to Real-time Analytics with Apache Flink : Part 3 of 3

Wer Ordnung hält, ist nur zu faul zum Suchen (If you keep things tidily ordered, you’re just too lazy to go searching.) —German proverb

Drawing inspiration from the principles explored in Chapter 3 of Designing Data-Intensive Applications, we embark on our quest to harness the power of Elasticsearch and Kibana for real-time data storage and visualisation.

Building on the first two parts (Part 1, Part 2), where we explored real-time data ingestion with Apache Flink and storage in PostgreSQL, Part 3 delves into the exciting world of real-time data visualisation.


Project Architecture Recap

Part 3 focuses on Elasticsearch and Kibana

What is Elasticsearch?

Here's a quick rundown on Elasticsearch:

  • Search Engine Powerhouse: It's an open-source, distributed search and analytics engine built for speed and scalability. Imagine a supercharged search bar that can handle massive amounts of data across different formats (text, numbers, location data, etc.).
  • Real-time Ready: Elasticsearch excels at processing information close to real-time, making it ideal for situations where you need up-to-date insights.
  • Data Buffet: It can store a wide variety of data, structured or unstructured, making it a versatile tool for various tasks.
  • Part of the Gang: Elasticsearch is the core component of the Elastic Stack, a suite of tools for data ingestion, processing, visualisation, and analysis. It works alongside Kibana (visualisation), Logstash (data processing), and Beats (data collection) for a complete data management solution.

In a nutshell, Elasticsearch lets you store, search, and analyze large amounts of data quickly and efficiently, making it a valuable tool for tasks like log analysis, website search, and real-time analytics.


What is Kibana?

Here's a quick explanation of Kibana:

  • Visual Insights Partner: Kibana is an open-source data visualization tool designed specifically to work with Elasticsearch. It acts as the front-end interface, allowing you to explore and analyze the data stored within Elasticsearch.
  • Dashboard Dynamo: Kibana excels at creating informative dashboards that present complex data in a clear and understandable way. Think of it as a customizable cockpit for your data, letting you see trends, patterns, and anomalies at a glance.
  • Visualization Powerhouse: It offers a wide range of visualizations like charts, graphs, maps, and time series analysis. This allows you to tailor the data presentation to best suit your needs.
  • Elastic Stack Teammate: Kibana is part of the Elastic Stack, a comprehensive suite of tools for data management. It works alongside Elasticsearch (search and analytics engine), Logstash (data processing), and Beats (data collection) for a streamlined data analysis workflow.

In essence, Kibana takes the power of Elasticsearch and transforms it into visually compelling insights. It allows you to interact with, analyze, and communicate your data effectively.


Why Elasticsearch and Kibana for Real-time Data Visualisation?

Here are some of the key reasons why Elasticsearch and Kibana are a popular choice for data visualization:

Perfect Match:

  • Built to Work Together: Elasticsearch and Kibana are designed specifically to complement each other. Elasticsearch stores and analyzes the data, while Kibana provides a user-friendly interface for visualizing and interacting with that data. This tight integration streamlines the process and ensures smooth data exploration.

Elasticsearch Strengths:

  • Speed Demon: Elasticsearch excels at handling massive datasets and performing searches incredibly quickly. This is crucial for real-time data visualization, where insights need to be generated and displayed with minimal lag.
  • Data Buffet: It can store a wide variety of data formats, including structured (databases) and unstructured (logs, text files). This flexibility makes it suitable for different data visualization needs.
  • Scalability Powerhouse: Elasticsearch can be easily scaled horizontally by adding more nodes to the cluster. This allows it to handle growing data volumes without sacrificing performance, making it ideal for future-proof data visualization solutions.

Kibana's Visualization Prowess:

  • Visual Storytelling: Kibana offers a wide range of visualization options, including charts, graphs, maps, and time series analysis. This allows you to present complex data in a clear and engaging way, making it easier to identify trends, patterns, and anomalies.
  • Customization King: Kibana dashboards are highly customizable. You can create dashboards tailored to specific audiences or use cases, focusing on the most relevant data points and visualizations.
  • Interactive Insights: Kibana allows users to interact with the visualizations by filtering data, zooming in on details, and drilling down for deeper analysis. This interactivity fosters a more engaging and insightful data exploration experience.

Additional Considerations:

  • Open Source Advantage: Both Elasticsearch and Kibana are open-source tools, making them a cost-effective option for many organizations. There is also a large and active community that provides support and resources.
  • Elastic Stack Ecosystem: They are part of the broader Elastic Stack, which includes tools for data ingestion (Logstash, Beats) and machine learning. This creates a comprehensive data management solution for organizations looking to go beyond simple visualization.

Overall, Elasticsearch and Kibana offer a powerful and versatile combination for data visualization. Their speed, scalability, data handling capabilities, and rich visualization features make them a compelling choice for a variety of use cases.


Index in Elasticsearch

In Elasticsearch, an index is a fundamental unit of organization for your data. It's like a giant filing cabinet within the larger storage system. Here's a breakdown of what an index entails:

  • Collection of Documents: An index holds a collection of documents, similar to folders containing files in a filing cabinet. Each document represents a single piece of data, like a product description, a customer record, or a log entry.
  • Logical Namespace: An index has a unique name that acts as a logical identifier. It helps you categorize and manage your data efficiently. Imagine labeling each folder in the filing cabinet for easy retrieval.
  • Schema Definition: Optionally, you can define a schema for your index (called a mapping in Elasticsearch), specifying the structure and data types of the fields within your documents. This schema acts like a blueprint for your documents, ensuring consistency and enabling efficient searching.
  • Shards and Replicas: Internally, Elasticsearch splits each index into smaller partitions called shards. These shards distribute the data load across your cluster of machines, improving search performance and scalability. Additionally, you can configure replicas, which are copies of shards on different nodes, to ensure data redundancy and fault tolerance.
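
To make the mapping and shard/replica ideas concrete, here is a minimal sketch of creating an index from the Kibana DevTools console. The index name transactions_demo, the shard and replica counts, and the field types are illustrative assumptions, not settings taken from this project:

PUT transactions_demo
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "transactionId":   { "type": "keyword" },
      "transactionDate": { "type": "date" }
    }
  }
}

With three primary shards, documents are spread across up to three nodes, and each shard keeps one replica on a different node for redundancy.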

Here's an analogy to solidify the concept:

Imagine a library with different sections (indices) like fiction, non-fiction, and biographies. Each section contains books (documents), with each book having chapters and pages (fields within a document). The library might organize the books by genre or author (similar to defining a schema).

In essence, Elasticsearch indexes help you structure, organize, and efficiently search your data. They provide a scalable and fault-tolerant way to store and retrieve information.


Getting Started With Apache Flink and Elasticsearch

The Apache Flink Elasticsearch connector acts as a bridge between your Flink streaming or batch data processing pipelines and the Elasticsearch search engine. It allows you to:

  • Sink Data into Elasticsearch: You can use the connector to send data streams or processed datasets from your Flink program directly to Elasticsearch as new documents. This is useful for real-time data analysis or storing processed data for further exploration in Kibana.

Here are some key features of the connector:

  • Fault Tolerance: Flink's checkpointing mechanism ensures at-least-once delivery of data to Elasticsearch. This means that data is not lost even if failures occur during the transfer process.
  • Bulk Indexing: The connector efficiently transmits data in bulk to Elasticsearch, optimizing performance for large datasets.
  • Flexibility: You can configure various aspects of how data is indexed in Elasticsearch, such as the target index name and how document IDs are generated.
  • Upsert Mode (Optional): If your data has a primary key, the connector can operate in upsert mode. This allows it to handle updates and deletes along with inserts, keeping your Elasticsearch data synchronized with your Flink processing results.
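
A note on the fault-tolerance point above: the at-least-once guarantee only applies when checkpointing is enabled on the Flink execution environment, since the sink flushes its buffered requests as part of checkpointing. Here's a minimal sketch (the checkpoint interval is an illustrative assumption, not a value from the linked repository):

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Checkpoint every 10 seconds; without checkpointing, requests still
// buffered in the Elasticsearch sink can be lost on failure.
env.enableCheckpointing(10000L)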


Define An Elasticsearch Sink

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.flink.connector.elasticsearch.sink.{Elasticsearch7SinkBuilder, ElasticsearchSink}
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests
import org.elasticsearch.common.xcontent.XContentType

// Serialises a Transaction to JSON and wraps it in an IndexRequest
// targeting the "transactions" index, keyed by transactionId.
// (The emitter below inlines the same logic.)
def createIndexRequest(element: Transaction): IndexRequest = {
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)
  val json: String = mapper.writeValueAsString(element)
  Requests.indexRequest().index("transactions").id(element.transactionId).source(json, XContentType.JSON)
}

def writeToElastic(transactionStream: DataStream[Transaction]): Unit = {

  val sink: ElasticsearchSink[Transaction] = new Elasticsearch7SinkBuilder[Transaction]
    .setHosts(new HttpHost("localhost", 9200, "http"))
    .setBulkFlushMaxActions(2)  // flush after every 2 buffered actions
    .setBulkFlushInterval(10L)  // or after 10 ms, whichever comes first
    .setEmitter[Transaction] {
      (transaction, context, indexer) => {
        val mapper = new ObjectMapper()
        mapper.registerModule(DefaultScalaModule)
        val json: String = mapper.writeValueAsString(transaction)

        val indexRequest = Requests.indexRequest()
          .index("transactions")
          .id(transaction.transactionId)
          .source(json, XContentType.JSON)
        indexer.add(indexRequest)
      }
    }.build()

  transactionStream.sinkTo(sink)
}

Details of the Transaction class are available here :: https://github.com/snepar/flink-ecom/blob/master/src/main/scala/ecom/generators/Dto/Transaction.scala


Write to Elasticsearch Sink

val transactionStream = readCustomData()
transactionStream.print()
writeToElastic(transactionStream)
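
Nothing is sent until the job is actually executed; assuming the StreamExecutionEnvironment is held in env, as in the linked file, the job is kicked off with (the job name string is illustrative):

env.execute("Flink Ecom Kafka-Postgres-Elasticsearch")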

Refer to the complete code here :: https://github.com/snepar/flink-ecom/blob/master/src/main/scala/ecom/KafkaPGESIntegrationEcom.scala


Spin Up Your Playground

Follow the instructions here :: https://github.com/snepar/flink-ecom-infra

Elasticsearch will be available at localhost:9200, and Kibana at localhost:5601

Then execute this application to get started with the data ingestion :: https://github.com/snepar/flink-ecom


Search Using Elastic

Here’s the structure of the transactions index on Elasticsearch. You can get it by running

GET transactions         

in the DevTools.
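
The response describes the index's mappings and settings. Here's a trimmed, illustrative shape under dynamic mapping (your exact output will differ):

{
  "transactions": {
    "aliases": {},
    "mappings": {
      "properties": {
        "transactionId": {
          "type": "text",
          "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } }
        },
        "transactionDate": { "type": "long" }
      }
    },
    "settings": {
      "index": { "number_of_shards": "1", "number_of_replicas": "1" }
    }
  }
}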


Query In Elastic

GET transactions/_search        
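This returns matching documents (all of them, by default). You can also narrow the search with a query body; for example, a term query on the keyword subfield that dynamic mapping creates for transactionId (the id value below is a placeholder):

GET transactions/_search
{
  "query": {
    "term": { "transactionId.keyword": "some-transaction-id" }
  }
}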

Re-Index Data in Elastic

To get a readable transaction date, we need to reindex into a different index. To reindex data in Elasticsearch, we use the _reindex API:

POST _reindex
{
  "source": {"index": "transactions"},
  "dest": {"index": "transaction_part1"},
  "script": {
    "source": """
      ctx._source.transactionDate = new Date(ctx._source.transactionDate).toString();
    """
  }
}

GET transaction_part1/_search

However, toString() gives us little control over the format, so we need a more robust way to format the date.

POST _reindex
{
  "source": {"index": "transactions"},
  "dest": {"index": "transaction_part2"},
  "script": {
    "source": """
      SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
      formatter.setTimeZone(TimeZone.getTimeZone('UTC'));
      ctx._source.transactionDate = formatter.format(new Date(ctx._source.transactionDate));
    """
  }
}

GET transaction_part2/_search        

Dashboarding in Real-time with Kibana

First, create a Kibana index pattern on transaction_part2 so the re-indexed data is available to visualisations.

Creating a Donut Chart

Number of Transactions (a count metric)

Top Brands (a bar chart)
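
Under the hood, Kibana derives a chart like this from a terms aggregation. A roughly equivalent DevTools query is sketched below; the field name productBrand (and its .keyword subfield) is an assumption about the Transaction schema, so substitute your actual brand field:

GET transaction_part2/_search
{
  "size": 0,
  "aggs": {
    "top_brands": {
      "terms": { "field": "productBrand.keyword", "size": 10 }
    }
  }
}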

Final Dashboard (Keeping It All Together)

Thank You for Spending Some Time Here

The complete post is available here :: https://dev.to/snehasish_dutta_007/apache-flink-e-commerce-data-pipeline-usecase-3ha9

Repositories ::

flink-ecom-infra :: https://github.com/snepar/flink-ecom-infra
flink-ecom :: https://github.com/snepar/flink-ecom

In Case You Missed Them

Part 1 :: https://www.dhirubhai.net/pulse/cheers-real-time-analytics-apache-flink-part-1-snehasish-dutta-rv8le/

Part 2 :: https://www.dhirubhai.net/pulse/cheers-real-time-analytics-apache-flink-part-2-snehasish-dutta-ig1pe/


Acknowledgements and Inspirations



#data #bigdata #apacheflink #scala #java #kafka #streaming #python #elasticsearch #kibana

