Incremental Data Streaming from an Oracle Database to Apache Kafka using Python

This Python script integrates an Oracle database with Apache Kafka: it retrieves incremental data from the database and publishes it to a specified Kafka topic. The script connects to Oracle and executes SQL statements using the cx_Oracle library, interacts with Kafka using the confluent_kafka library, and handles time-related operations and file management using the standard time and os libraries, respectively. To track the most recent data sent to Kafka, the script saves the last updated identifier in a file, so subsequent runs fetch only new or updated rows from the database. It then polls the database in a loop, sleeping for a fixed interval between polls, until a configured maximum number of iterations is reached.

The code makes use of the following libraries:

  1. cx_Oracle: This library allows us to connect to an Oracle database and execute SQL statements. It provides a convenient interface for working with Oracle databases.
  2. confluent_kafka: This library provides a high-level API for working with Apache Kafka. It supports consuming and producing messages, making it an ideal choice for this integration project.
  3. time: The time library provides various time-related functions in Python, such as the sleep() function used in this script to wait for a specified interval before polling the database again.
  4. os: The os library provides a variety of methods for interacting with the operating system, such as the path.exists() function used in this script to check if a file exists.
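Before the full script, here is a minimal connectivity check for both systems. This is an illustrative sketch only: the Oracle credentials, DSN, broker address, and topic name are placeholders to replace with values from your own environment.

import cx_Oracle
import confluent_kafka

# Placeholder credentials and DSN -- replace with your environment's values
connection = cx_Oracle.connect("user", "password", "host:port/service_name")
cursor = connection.cursor()
cursor.execute("SELECT 1 FROM dual")  # trivial query to verify the session works
print(cursor.fetchone())

producer = confluent_kafka.Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("test_topic", value=b"hello")  # throwaway message to a test topic
producer.flush()

cursor.close()
connection.close()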

import cx_Oracle
import confluent_kafka
import time
import os


last_updated_id_file = "last_updated_id.txt"


def get_last_updated_id():
    if os.path.exists(last_updated_id_file):
        with open(last_updated_id_file, "r") as f:
            return int(f.read().strip())
    return None


def set_last_updated_id(last_updated_id):
    with open(last_updated_id_file, "w") as f:
        f.write(str(last_updated_id))

connection = cursor = consumer = producer = None

try:
    # Connect to the Oracle database
    connection = cx_Oracle.connect("user", "password", "host:port/service_name")
    cursor = connection.cursor()


    # Fetch the last updated unique identifier from the Apache Kafka topic
    consumer = confluent_kafka.Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": "group_id",
        "auto.offset.reset": "latest"
    })


    consumer.subscribe(["table_name"])
    last_updated_id = get_last_updated_id()


    msg = consumer.poll(1.0)
    if msg is not None:
        if msg.error() is None:
            # Kafka message values are bytes; convert to an integer identifier
            last_updated_id = int(msg.value())
        else:
            raise Exception(f"Error fetching last updated identifier: {msg.error()}")

    # Connect to Apache Kafka
    producer = confluent_kafka.Producer({
        "bootstrap.servers": "localhost:9092",
        "acks": "all",
        "compression.type": "snappy"
    })


    iteration_count = 0
    max_iterations = 100


    while iteration_count < max_iterations:
        iteration_count += 1

        # Fetch the incremental data from the Oracle database
        if last_updated_id is not None:
            # Use a bind variable instead of string formatting to avoid SQL injection
            cursor.execute(
                "SELECT * FROM table_name WHERE unique_id > :last_id",
                last_id=last_updated_id,
            )
        else:
            cursor.execute("SELECT * FROM table_name")
        rows = cursor.fetchall()


        # Prepare the data for sending to Apache Kafka
        data = list(rows)


        # Send the data to Apache Kafka
        for row in data:
            try:
                # produce() expects str or bytes; serialize the row tuple
                producer.produce("table_name", key="table_name".encode(), value=str(row).encode())
            except Exception as e:
                print(f"Error producing record to Apache Kafka: {e}")


        # Wait for the data to be sent to Apache Kafka
        producer.flush()


        # Update the last updated identifier
        if data:
            last_updated_id = max(row[0] for row in data)
            set_last_updated_id(last_updated_id)

        # Wait for a specified interval before polling the database again
        time.sleep(5)


finally:
    # Flush any buffered messages; confluent_kafka's Producer has no close() method
    if producer is not None:
        producer.flush()
    if consumer is not None:
        consumer.close()

    # Close the connection to the Oracle database
    if cursor is not None:
        cursor.close()
    if connection is not None:
        connection.close()

Here are the key features of the code:

  • Statefulness: The code maintains the state of the last updated identifier in a text file and uses it to fetch only incremental data from the Oracle database.
  • Incremental Data Transfer: The code only fetches new data from the Oracle database that has been updated since the last run. This helps to minimize the amount of data transferred, which improves performance.
  • Performance Optimization: The code produces records asynchronously and flushes them to Apache Kafka once per polling cycle rather than after every record, which batches network writes; snappy compression further reduces the amount of data sent over the wire. A max_iterations cap of 100 polling cycles bounds how long a single run lasts.
  • Data Consistency: Every record is produced with the same key ("table_name" encoded as bytes), which routes all records to the same partition and preserves their relative order. Additionally, the "acks" configuration option is set to "all", so the producer waits for all in-sync replicas to acknowledge each write before considering it successful.
  • Error Handling: The code raises an exception if there is an error fetching the last updated identifier from Apache Kafka, catches and logs exceptions raised while producing records, and wraps the main logic in a try/finally block so the Oracle and Kafka connections are closed even when an exception occurs. A sketch of confluent_kafka's per-message delivery-report pattern follows this list.
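One caveat worth noting: the except clause around produce() only catches local errors, such as a full internal queue; broker-side delivery failures are reported asynchronously. A common confluent_kafka pattern is to pass a delivery callback so the outcome of each message is reported once the broker acknowledges it or delivery fails. Here is a minimal sketch of that pattern, reusing the same placeholder topic and broker address as the main script:

import confluent_kafka


def delivery_report(err, msg):
    # Invoked once per message, after the broker acknowledges it or delivery fails
    if err is not None:
        print(f"Delivery failed for record {msg.key()}: {err}")
    else:
        print(f"Record delivered to {msg.topic()} [partition {msg.partition()}] at offset {msg.offset()}")


producer = confluent_kafka.Producer({"bootstrap.servers": "localhost:9092", "acks": "all"})
producer.produce("table_name", key=b"table_name", value=b"example row", on_delivery=delivery_report)
producer.flush()  # blocks until every outstanding delivery report has fired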
