Real-Time Sentiment Analysis with TCP Socket, Apache Spark, OpenAI, Kafka and Elasticsearch | Data Engineering pipeline project
A Python-based real-time data streaming pipeline processing a dataset of 7 million records


Introduction:

In this article, I will share my experience implementing a real-time sentiment analysis project. Following Yusuf Ganiyu's article and updating a few small aspects of it, I ventured into this intriguing project using Yelp's Customer Reviews Dataset (7 million records), Apache Spark, OpenAI, Kafka and Elasticsearch.


Project Overview:

The project's aim was to analyse opinions in real time using technologies like Apache Spark, Kafka, and Elasticsearch. The process began with setting up a Docker environment and installing necessary dependencies such as OpenAI, PySpark, and Confluent Kafka.

Project Development:

I suggest you follow along with the code in my repo:

https://github.com/Rafavermar/SparkStreaming


  • Initial Setup:

First, using PyCharm as the IDE, create a Python virtual environment.

Second, create a 'src' directory and configure Docker with 'docker-compose.yml' and 'Dockerfile.spark'.

Essential Python packages like openai, pyspark, confluent_kafka, and fastavro (see assets/cmd_commands.txt) were installed.


The requirements.txt file pins the versions of all installed dependencies and is needed by Dockerfile.spark.

Project's directory structure


You may notice that config.py is missing from my repo (it is excluded by .gitignore); no worries, here it is:

from dotenv import load_dotenv
import os

load_dotenv()  # Load environment variables from .env

config = {
    "openai": {
        "api_key": os.getenv("OPENAI_API_KEY")
    },
    "kafka": {
        "sasl.username": os.getenv("KAFKA_USERNAME"),
        "sasl.password": os.getenv("KAFKA_PASSWORD"),
        "bootstrap.servers": os.getenv("KAFKA_SERVERS"),
        'security.protocol': 'SASL_SSL',
        'sasl.mechanisms': 'PLAIN',
        'session.timeout.ms': 50000
    },
    "schema_registry": {
        "url": "https://psrc-5j7x8.us-central1.gcp.confluent.cloud",
        "basic.auth.user.info": os.getenv("SCHEMA_REGISTRY_USER") + ":" + os.getenv("SCHEMA_REGISTRY_PASSWORD")
    }
}        
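
For reference, the .env file that load_dotenv() reads only needs to define the variables referenced above; something like this (the values are placeholders, of course):

OPENAI_API_KEY=...
KAFKA_USERNAME=...
KAFKA_PASSWORD=...
KAFKA_SERVERS=...
SCHEMA_REGISTRY_USER=...
SCHEMA_REGISTRY_PASSWORD=...
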
Docker Desktop view showing the Spark containers up and running.


  • Challenges and Solutions:

Adapting to OpenAI Changes:

The original project used an earlier version of OpenAI. Due to changes in the OpenAI API, it was necessary to update the sentiment analysis code to use the gpt-3.5-turbo model. This involved modifying the sentiment_analysis method to suit the new API.
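
For reference, a minimal sketch of the updated call with the openai>=1.x Python client looks roughly like this (the prompt wording and fallback value are illustrative; the exact implementation lives in the repo's sentiment analysis job):

from openai import OpenAI
from config import config

client = OpenAI(api_key=config["openai"]["api_key"])

def sentiment_analysis(comment: str) -> str:
    # Ask gpt-3.5-turbo to classify the review and return a single word
    if not comment:
        return "NEUTRAL"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the customer review as "
                        "POSITIVE, NEGATIVE or NEUTRAL. Reply with one word only."},
            {"role": "user", "content": comment},
        ],
    )
    return response.choices[0].message.content.strip()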


Docker Configuration:

The 'docker-compose.yml' was adjusted to expose port 9999, necessary for the socket within Docker.


Credential Management:

To safely handle credentials, a '.env' file was used and integrated into Docker, altering the Dockerfile and the config.py script.

At this point, remember to install the dependencies needed to load environment variables inside the Docker container:

cd src
docker exec -it spark-master /bin/bash
pip install pandas
pip install python-dotenv
        

  • Implementation and Execution:

Socket and Streaming Setup:

The first step was setting up a socket to transmit data in real time. This was done by opening a specific port (9999) in Docker, allowing communication between the container and the host.
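
A simplified sketch of what streaming-socket.py does on the sending side (the file path, function name, chunk size and throttling here are illustrative; the full script, including the resume logic described further down, is in the repo):

import json
import socket
import time

def send_data_over_socket(file_path, host="0.0.0.0", port=9999, chunk_size=2):
    # Open the port that docker-compose exposes so Spark can connect to it
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    print(f"Listening on {host}:{port}")

    conn, addr = server.accept()
    print(f"Connection accepted from {addr}")

    with open(file_path, "r") as file:
        sent_in_chunk = 0
        for line in file:
            record = json.loads(line)  # one Yelp review per line
            conn.sendall((json.dumps(record) + "\n").encode("utf-8"))
            sent_in_chunk += 1
            if sent_in_chunk == chunk_size:  # stream in small batches of 2
                sent_in_chunk = 0
                time.sleep(1)                # throttle the stream a little
    conn.close()

# send_data_over_socket("datasets/yelp_academic_dataset_review.json")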

Spark Streaming was used to process the data transmitted through the socket. This included configuring a StreamingContext and defining a schema to interpret the JSON data from the Yelp dataset.

cd src
docker exec -it spark-master /bin/bash
python jobs/streaming-socket.py
Command-line execution of streaming-socket.py


The image above shows the code fragment where the Spark streaming DataFrame is configured to read from the socket, together with its schema.
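
In case the screenshot is hard to read, here is a rough sketch of that configuration (the schema fields mirror the Avro schema shown further down; the host name is an assumption and depends on where streaming-socket.py runs, so the exact code in the repo may differ):

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, FloatType

spark = SparkSession.builder.appName("YelpSentimentStreaming").getOrCreate()

review_schema = StructType([
    StructField("review_id", StringType()),
    StructField("user_id", StringType()),
    StructField("business_id", StringType()),
    StructField("stars", FloatType()),
    StructField("date", StringType()),
    StructField("text", StringType()),
])

# The socket source yields a single string column called "value"; parse it as JSON
raw_df = (spark.readStream
          .format("socket")
          .option("host", "spark-master")   # adjust to wherever the socket script runs
          .option("port", 9999)
          .load())

stream_df = (raw_df
             .select(from_json(col("value"), review_schema).alias("data"))
             .select("data.*"))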

Command-line output of streaming-socket.py


In the image above you can see the socket streaming records in batches of size 2.


Command-line execution of spark-streaming.py


Command-line output (continued) from the execution of spark-streaming.py



Spark UI showing the master running and a single worker alive, as configured.


Just highlighting: the ability to resume sending from the last sent message, instead of starting over from the beginning, is achieved through these key components (see streaming-socket.py):

# Initialize an index to keep track of the number of messages already sent
last_sent_index = 0

# Skipping already sent lines: before reading new lines, the code skips over
# the lines that have been previously sent, ensuring that reading starts
# right after the last sent line
for _ in range(last_sent_index):
    next(file)

# Updating last_sent_index after each send: the index is incremented as
# messages are sent, so the program remembers how many lines have been
# processed and resumes from the correct position in the file on the next
# iteration
last_sent_index += 1


Sentiment Analysis with OpenAI:

A sentiment analysis function was implemented using OpenAI's gpt-3.5-turbo model. This function takes a comment as input and returns a sentiment classification (positive, negative, neutral).

The OpenAI integration was carried out via the Python API, making calls to the chat.completions service to obtain the model's responses.
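
One way to wire a function like the sentiment_analysis sketch above into the stream is as a plain UDF that fills the "feedback" column expected by the Avro schema (the repo may structure this differently):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

sentiment_udf = udf(sentiment_analysis, StringType())

# Each incoming review text is classified and stored in the "feedback" column
enriched_df = stream_df.withColumn("feedback", sentiment_udf(col("text")))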

Code fragment setting up the back-and-forth communication and prompting between Spark and OpenAI


OpenAI platform API Keys management


Data Processing and Storage:

The data processed by Spark Streaming were sent to Apache Kafka for queue management and then stored in Elasticsearch (indexed) for analysis and visualisation.

The data flow was configured to ensure efficient processing and smooth transmission from data capture to final storage.
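
As an illustration of that handoff, a foreachBatch sink using confluent_kafka's SerializingProducer with Avro serialization could look like the sketch below (the schema file path and function names are assumptions; check spark-streaming.py in the repo for the real wiring):

from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer
from config import config

schema_registry_client = SchemaRegistryClient(config["schema_registry"])

# Avro schema for the customers_review topic (shown in full below)
with open("schemas/customers_review.avsc") as f:       # path is illustrative
    avro_serializer = AvroSerializer(schema_registry_client, f.read())

producer = SerializingProducer({
    "bootstrap.servers": config["kafka"]["bootstrap.servers"],
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": config["kafka"]["sasl.username"],
    "sasl.password": config["kafka"]["sasl.password"],
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": avro_serializer,
})

def send_batch_to_kafka(batch_df, batch_id):
    # Used as a foreachBatch sink: every micro-batch is produced to Kafka
    for row in batch_df.collect():
        producer.produce(topic="customers_review",
                         key=row["review_id"],
                         value=row.asDict())
    producer.flush()

# enriched_df.writeStream.foreachBatch(send_batch_to_kafka).start().awaitTermination()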

Apache Kafka cluster overview:

Confluent Kafka cluster overview


the topic: customers_review

Kafka topic Messages metrics overview


For this topic, a schema needs to be registered in the Schema Registry, defining all the key-value fields in Avro format.

{
  "doc": "Schema for customer reviews",
  "fields": [
    {
      "name": "review_id",
      "type": "string"
    },
    {
      "name": "user_id",
      "type": "string"
    },
    {
      "name": "business_id",
      "type": "string"
    },
    {
      "name": "stars",
      "type": "float"
    },
    {
      "name": "date",
      "type": "string"
    },
    {
      "name": "text",
      "type": "string"
    },
    {
      "name": "feedback",
      "type": "string"
    }
  ],
  "name": "customers_review_schema",
  "namespace": "com.rvm",
  "type": "record"
}        
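
The schema can be registered from the Confluent Cloud UI, or programmatically; a minimal sketch with confluent_kafka's SchemaRegistryClient, assuming the JSON above is saved locally as schemas/customers_review.avsc, would be:

from confluent_kafka.schema_registry import SchemaRegistryClient, Schema
from config import config

client = SchemaRegistryClient(config["schema_registry"])

# Subject name follows the default TopicNameStrategy: <topic>-value
with open("schemas/customers_review.avsc") as f:
    schema_id = client.register_schema("customers_review-value",
                                       Schema(f.read(), schema_type="AVRO"))

print(f"Registered schema with id {schema_id}")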

The Elasticsearch sink connector's advanced configuration:

sink connector to Elasticsearch main metrics overview


sink connector to Elasticsearch advanced configuration overview


Elasticsearch deployment performance overview

Elasticsearch deployment performance. Main metrics


Elasticsearch Management dev tools (queries)

Performing queries against the indexed data: count aggregation by feedback


As soon as I change the data type of the date field from StringType to Timestamp, I will be able to run aggregations by date correctly.
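
One way to fix this upstream in Spark, assuming the Yelp date strings follow the "yyyy-MM-dd HH:mm:ss" pattern, is a simple conversion before the data leaves the stream (the Avro schema and the Elasticsearch mapping would then need to change accordingly):

from pyspark.sql.functions import to_timestamp, col

enriched_df = enriched_df.withColumn("date", to_timestamp(col("date"), "yyyy-MM-dd HH:mm:ss"))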


Kibana dashboard made on the fly:

Things to improve:

-- The date data type needs to be changed from StringType to Timestamp in order to filter and visualise more powerful insights.

-- Tokenise and index the text.keyword field differently, so that a proper tag cloud can be added to this dashboard.

-- And, of course, spend more time creating relevant and insightful visualisations in Kibana.


  • Project Execution:

To initiate the process, I ran several commands in the PyCharm terminal, starting with activating Docker and continuing with the execution of specific scripts for the socket and Spark Streaming.

Technologies and Tools Used:

  • Languages and Frameworks: Python, PySpark, OpenAI.
  • Platforms and Tools: Docker, Apache Kafka, Apache Spark, Elasticsearch-Kibana
  • Datasets: Yelp Customer Reviews Dataset.


Commands Used:

(For the complete list of commands, see assets/cmd_commands.txt).


Conclusions and Learnings:

This project was an excellent opportunity to learn about real-time data streaming and sentiment analysis. The challenges faced, such as adapting to the new version of OpenAI and configuring Docker, added valuable practical experience in solving problems in data engineering.


Acknowledgement:

This project not only replicates but also builds upon Yusuf's foundational work, providing an up-to-date, real-world application of data engineering techniques. It was inspired by airscholar's RealtimeStreamingEngineering.


