How to build a vector embedding pipeline in Clojure with a locally running LLM

After reading this article, the reader should feel comfortable in camel and llama herding.

Intro

Hey, everyone! I have two pieces of news for you: good and not-so-good. The good news is that I'm going to give you $55,000. The bad news is that instead of cash or a check, you'll get an equivalent value in the form of source code developed by one developer over 4.5 months and valued at $55,000. This valuation is based on estimates provided by the command-line tool SCC. See the screenshot below.

Actually, it is a little bit more than $55,000.

Alright, jokes aside. In this article, I continue the series of guides on writing simple microservices in the Clojure programming language. The solution's codebase and ideology are mainly based on the codebase from the previous article. I suggest pausing here, reading it first, and then coming back!

From a business logic standpoint, the microservice does two things:

  1. Ingests any number of textual documents, creates embeddings via a locally running LLM, and persists the embeddings in a vector database.
  2. Provides REST API for searching documents similar to an arbitrary given text.

This educational microservice project provides, as in the previous article, a Swagger descriptor for REST API with a nice Swagger UI console, Postgres-based persistence (now it is also a vector database!), a REPL-friendly development setup, and something not used in the previous article—a DSL for describing Data Pipelines using the message-oriented middleware framework Apache Camel: naturally, we use a thin Clojure wrapper for this framework.

GitHub repository with source code: https://github.com/dzer6/wsc (needless to say, you should clone it to your local machine to start exploring the project).

This article was partially inspired by the article "Feeding LLMs efficiently: data ingestion to vector databases with Apache Camel."

Preparation

I use macOS, but I'm sure everything explained here will work well on Linux, too. In any case, you should install the following on your machine (latest major versions):

  1. Git
  2. JVM
  3. Leiningen
  4. Docker (I personally like Docker Desktop also)

Leiningen

The project.clj file contains pretty much the same dependencies regarding configuration, HTTP server, logging, stateful resources management, Clojure data schema validation, database migration, and database persistence. In comparison with the microservice from the previous article, there are two new dependencies:

  1. org.apache.camel libraries (with the thin Clojure-wrapper com.dzer6/clj-camel) for the data pipeline management.
  2. dev.langchain4j libraries for LLM operations.

Don't fear the number of Camel libraries – the framework developers deliberately keep it granular.

To build the project, go to the cloned repository folder and run the command in a terminal:

lein uberjar        

The path to the resulting fat jar (a synonym for uberjar) with all needed dependencies:

./target/app.jar        

When your app.jar is ready, you can build a Docker image.

Docker

It is pretty much the same as the previous time:

This time, I left out the OpenTelemetry agent to keep the educational material simple.

The Dockerfile sets up a containerized environment for the application, leveraging Amazon Corretto 22 on Alpine Linux (don't forget to build the uberjar before building the image). It exposes port 8080 and specifies the command that starts our service in the JVM.
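The original shows the Dockerfile as a screenshot; as an illustration, a minimal sketch matching the description above might look like this (the exact base-image tag and in-container paths are assumptions):

```dockerfile
# Sketch only: Amazon Corretto 22 on Alpine Linux, as described above.
FROM amazoncorretto:22-alpine

WORKDIR /app

# The uberjar must be built beforehand with `lein uberjar`.
COPY ./target/app.jar /app/app.jar

# The service listens on port 8080.
EXPOSE 8080

CMD ["java", "-jar", "/app/app.jar"]
```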

Docker Compose

At a high level, it looks like this:

Last time, it was only Postgres. Now, in addition to Postgres, the most essential service is Ollama (which runs the LLM).
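The compose file itself appears as a screenshot in the original; as a rough illustration, the topology described here could be sketched like this (image tags, service names, ports, and mount paths are assumptions inferred from the log lines quoted later in the article):

```yaml
# Sketch only: a hypothetical docker-compose.yml matching the description.
services:
  wsc_postgres:
    image: pgvector/pgvector:pg16   # Postgres with the pgvector extension
    environment:
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"

  wsc_ollama:
    image: ollama/ollama            # runs the LLM locally
    ports:
      - "11434:11434"

  wsc_ollama_init:
    image: curlimages/curl
    depends_on:
      - wsc_ollama
    # Ask Ollama to pull the llama3.1 model once the service is up.
    command: >
      curl -s http://wsc_ollama:11434/api/pull -d '{"name": "llama3.1"}'

  wsc_app:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - ./target/inbound:/app/inbound   # watched ingestion folder
    depends_on:
      - wsc_postgres
      - wsc_ollama
```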

So, we want to make embeddings via locally running LLM, right? Wait no more! Go to the terminal and run:

docker compose up        

You will see something similar to:

And next similar to this:

This line means that the Ollama service is downloading LLM to run it locally in the context of the Docker container:

wsc_ollama_init-1  | {"status":"pulling 8eeb52dfb3bb","digest":"sha256:8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe","total":4661216384,"completed":359708416}        

When the downloading process successfully finishes, you will see:

wsc_ollama_init-1  | {"status":"success"}        

As our microservice is also part of the docker-compose descriptor, it is up and running in a Docker container, and we can immediately feed it with a subset of Wikipedia data. Let's download the data sample from https://www.kaggle.com/datasets/conjuring92/wiki-stem-corpus/data. If you have a spare day or two to wait until everything is processed and stored in the vector database, copy the downloaded archive.zip into the folder "./target/inbound" (inside the cloned repo) – the microservice watches this folder, and any file placed there is expected to be a zipped CSV file with the column set explained at https://www.kaggle.com/datasets/conjuring92/wiki-stem-corpus/data.

But if you don't want to wait hours and just want to check that the pipeline works, copy the tiny file wiki_stem_corpus_sample.zip to the folder with the path "./target/inbound". It will be processed immediately.

Here is why exactly this folder: it is mounted into the microservice's Docker container:

By default, Ollama running in a Docker container has no preloaded LLMs. It can make embeddings only after being pre-initialized with a concrete LLM. That is why we use the auxiliary wsc_ollama_init service – it waits for the Ollama container to start and requests the "llama3.1" model download via the HTTP endpoint "http://wsc_ollama:11434/api/pull". Model downloading does not happen immediately – it may take several minutes.

To read more on Ollama, check out the official Github repo: https://github.com/ollama/ollama and their blog: https://ollama.com/blog.

To know more about the llama3.1 LLM, read its official web page: https://ai.meta.com/blog/meta-llama-3-1/

Read-Eval-Print Loop

Before playing with the REPL, make sure you have stopped the microservice running in the Docker container:

Docker Desktop is an excellent tool!

There are several ways to start a Read-Eval-Print Loop in Clojure. One way is to run the command in a terminal:

lein run        

This is what you see if the configuration and environment are well set up:

Most logs you see here come from initializing stateful resources right after the REPL starts.

Another way is the same one we used in the previous article. To start a REPL session in an IDE (Cursive / IntelliJ IDEA again), you need to configure the local REPL:

Click on the "+" button to see:

Click on the "Local" menu item to see:

Change "Unnamed" to "Local" and press the "OK" button to see it in the toolbar:

Now click on the green bug button to see:

Type the (init) command and press Enter:

This is what you see if the configuration and environment are well set up:

The session logs show that the application loads configurations and establishes a connection to a PostgreSQL database. This involves initializing a HikariCP connection pool and Flyway for database migrations. The logs confirm that the database schema validation and migration checks were successful. Next, the Apache Camel context (with two routes, "dead-letter-route" and "data-ingestion-pipeline") and the Embedding Model/Store start. The startup of the Jetty HTTP server follows, and the server becomes operational and ready to accept requests on the specified port.

To apply any code change, type (reset) and press Enter.

To run tests, you should type (run-tests) and press Enter.
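Helpers like (init), (reset), and (run-tests) typically live in a dev-time user namespace; here is a hypothetical sketch of how they can be wired with mount and tools.namespace (the project's actual helpers may differ):

```clojure
;; Sketch only: a hypothetical dev/user namespace; names are assumptions.
(ns user
  (:require [mount.core :as mount]
            [clojure.tools.namespace.repl :as tn]
            [clojure.test :as test]))

(defn init
  "Start all mount-managed stateful resources."
  []
  (mount/start))

(defn reset
  "Stop resources, reload changed namespaces, and start again."
  []
  (mount/stop)
  (tn/refresh :after 'user/init))

(defn run-tests
  "Run all project test namespaces from the REPL."
  []
  (test/run-all-tests #"wiki-stem-corpus\..*-test"))
```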

Stateful Resources

The approach is the same – we use the mount Clojure library to manage the application state. The microservice contains the following stateful resources:

  1. Web server on top of Eclipse Jetty 12: wiki-stem-corpus.server
  2. Configuration manager on top of Clojure cprop library: wiki-stem-corpus.config
  3. Postgres client on top of HikariCP pooled data source, low-level Clojure wrapper next-jdbc for JDBC-based access to databases, HugSQL library for a clean separation of SQL and Clojure code: wiki-stem-corpus.db.postgres
  4. Database migrations manager on top of Flyway: wiki-stem-corpus.migrations
  5. HTTP client for a model that can convert a given text into an embedding (vector representation of the text) on top of LangChain4j library: wiki-stem-corpus.langchain.embedding-model
  6. Client to PGVector embeddings store, also known as a vector database, on top of LangChain4j library: wiki-stem-corpus.langchain.embedding-store
  7. Apache Camel context on top of thin Clojure-wrapper library clj-camel (configures routes and policies during message exchanges between endpoints): wiki-stem-corpus.camel.state
  8. Data ingestion pipeline in the form of Apache Camel route: wiki-stem-corpus.camel.routes.data-ingestion-pipeline
  9. Dead Letter Channel (enterprise integration pattern of the same name) implementation in the form of Apache Camel route: wiki-stem-corpus.camel.routes.dead-letter

REST API

I use Compojure to define a REST API, but it is a question of taste—you can use Reitit or something else. Compojure looks a little bit outdated and has some known complications with complex routing, but it works well in 99% of cases.

The microservice has a trivial REST API with several HTTP endpoints.

  1. "/search" endpoint – for finding the most semantically similar Wikipedia descriptions to a given text (don't feed it with just a few words; try to input at least a paragraph of a term explanation to see some relevant results). The current vector embedding store search implementation supports cosine similarity only.
  2. "/metadata" endpoint – for obtaining the ingested Wiki STEM Corpus metadata. It intentionally returns very little information: only the number of ingested items.
  3. "/health/live" and "/health/ready" endpoints are named after two classical health checks.
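For illustration, endpoints like these could be declared in Compojure roughly as follows (the handler functions here are hypothetical, not the project's actual code):

```clojure
;; Sketch only: handler fns are hypothetical placeholders.
(ns wiki-stem-corpus.rpc.routes-sketch
  (:require [compojure.core :refer [defroutes GET POST]]
            [ring.util.response :as response]))

(declare search-similar ingested-count) ; hypothetical business fns

(defroutes app-routes
  ;; Find descriptions semantically similar to the given text.
  (POST "/search" {{:keys [text]} :body}
    (response/response (search-similar text)))

  ;; Intentionally minimal: only the number of ingested items.
  (GET "/metadata" []
    (response/response {:count (ingested-count)}))

  ;; Classical liveness and readiness probes.
  (GET "/health/live" [] (response/response {:status "ok"}))
  (GET "/health/ready" [] (response/response {:status "ok"})))
```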

Swagger UI console looks like this:

Business Logic

Current business logic is much more interesting than in the previous article. Here, we have not only a REST API controller but also a data ingestion pipeline.

The wiki-stem-corpus.rpc.controller.langchain controller houses the business logic that defines two primary operations: searching for similar terms and obtaining metadata about the ingested documents.

Let's start with similar terms search. To picture the process, take a look at a high-level sequence diagram:

Here is the search handler:

The handler calls the embedding store and destructures the obtained list of Java beans to return it as an HTTP response.

Inside the embedding store search function, we encapsulate two operations:

  1. Vector embedding creation via a remote call to LLM running as a Docker container locally on the dev machine: lines 29-32 in the following screenshot.
  2. Search for the most similar (closest in the embedding space) embeddings: lines 33-36. It is an SQL query to the Postgres database running locally as a Docker container.
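As a sketch of those two steps, a search function using LangChain4j interop might look like this (method names follow LangChain4j's Java API, but exact signatures depend on the library version, so treat this as an illustration rather than the project's code):

```clojure
;; Sketch only: two-step search via LangChain4j interop.
(ns wiki-stem-corpus.langchain.search-sketch
  (:import [dev.langchain4j.store.embedding EmbeddingSearchRequest]))

(defn search
  "1) Embed the query text via the locally running LLM,
   2) find the closest embeddings (cosine similarity) in the store."
  [embedding-model embedding-store text max-results]
  (let [query-embedding (.content (.embed embedding-model text))
        request         (-> (EmbeddingSearchRequest/builder)
                            (.queryEmbedding query-embedding)
                            (.maxResults (int max-results))
                            (.build))]
    ;; Returns a list of EmbeddingMatch beans, ordered by score.
    (.matches (.search embedding-store request))))
```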

Here is the metadata handler:

The handler invokes an SQL query via a runtime-generated Clojure function created by the HugSQL library at microservice startup.

And now, the most exciting part – the data ingestion pipeline.

To picture the process, take a look at a high-level sequence diagram:

And the code itself:

The data-ingestion-route function defines a data processing route using Apache Camel. Here's a detailed breakdown:

  1. Route Initialization: the route starts by specifying a file endpoint (directory) from which files are read.
  2. Unzipping Files: the route uses the unmarshal zip file feature to unzip files, allowing for further processing of the uncompressed data.
  3. Splitting CSV Data: once the files are unzipped, the route splits the data based on CSV rows. This splitting is crucial for processing each row individually.
  4. Processing Rows: each row is then processed using the persist-embedding processor function to store the needed part of the CSV record in the vector database. Similar to the search REST API handler, it is a two-step operation: embedding creation via a remote call to LLM and persisting it in the embedding store.
  5. Error Handling: the route may include error handling to manage any issues that arise during the file reading, unzipping, or processing stages, ensuring the pipeline is robust.

This data ingestion pipeline is designed to handle large datasets that may not fit into memory efficiently—the Apache Camel framework provides this functionality out of the box.

This is a pretty trivial data ingestion pipeline, and someone might say, "Why do I need Apache Camel at all? I can easily reimplement the same thing, let's say, in Python with the same level of expressiveness!"

Apache Camel and Python both offer powerful tools for building data ingestion pipelines, but they serve different needs and excel in different scenarios.

Why Use Apache Camel:

  1. Integration Focus: Camel is designed explicitly for enterprise integration patterns (EIPs), making it ideal for complex routing, transformation, and orchestration of messages across diverse systems.
  2. Declarative Approach: With Camel, you declare the entire route in a structured manner, making the code easier to read, maintain, and extend, especially in large systems.
  3. Built-in Components: Camel provides over 300 out-of-the-box components for different protocols and systems (e.g., file systems, databases, messaging queues, REST services), reducing the need to write custom connectors or adapters in Python.
  4. Error Handling and Reliability: Camel has robust built-in error handling, retries, and transaction management, essential in enterprise-grade applications where reliability is critical.
  5. Scalability: Camel can scale easily with the underlying infrastructure, particularly in environments where high throughput and low latency are required. It also supports clustering and distributed deployments.

When Camel Outperforms Python:

  1. Complex Integration Scenarios: When you need to integrate multiple systems using different protocols, Camel’s extensive library of components and support for EIPs make it a superior choice.
  2. Enterprise-Grade Applications: For applications that require high availability, fault tolerance, and scalability, Camel’s mature ecosystem and features are more robust than a custom Python solution.
  3. Maintainability: In projects with large teams or long lifespans, Camel’s declarative nature and standardized integration patterns result in more maintainable code than hand-written Python scripts.

Persistence Layer

Again, no ORMs, no overcomplication, pure HugSQL-joy:

Only one query, which turns into one Clojure function at application startup. We have it here for demonstration and educational purposes, of course.
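For illustration, a HugSQL definition for the query described here (counting ingested items for the /metadata endpoint) might look like this; the table and column names are assumptions:

```sql
-- :name count-ingested-items :? :1
-- Returns the number of ingested documents (used by /metadata).
SELECT count(*) AS count
FROM embeddings;
```

At startup, hugsql.core/def-db-fns reads the SQL file and generates a count-ingested-items Clojure function from such a definition.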

Configuration

The same approach as in the previous article is used: the cprop library loads the configuration as a plain Clojure map. The microservice follows the fail-fast approach and stops immediately if the loaded configuration does not pass validation against the data schema.
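With cprop, such a configuration is just data; a hypothetical EDN fragment (all keys invented for illustration, not the project's actual config) could look like:

```clojure
;; Sketch only: hypothetical config.edn keys.
{:http   {:port 8080}
 :db     {:jdbc-url "jdbc:postgresql://localhost:5432/wsc"}
 :ollama {:base-url "http://localhost:11434"
          :model    "llama3.1"}
 :ingest {:inbound-dir "./target/inbound"}}
```

Environment variables can then override any of these keys at deploy time, which is the usual cprop workflow.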

Logging

We use org.clojure/tools.logging over Logback and Slf4j—plain text logs in the dev mode and JSON formatted logs in the production. Logging configuration in the logback.xml file.

Tests

We have two integration tests in the codebase: for the REST API controller and the Camel route.

The REST API controller test is identical to one from the previous article—it loads the test config, runs a Postgres instance in a Docker container via TestContainers, rolls out database migrations, loads seeding data into the database, starts an embedded web server, and performs several HTTP calls to check REST API endpoints. Nothing special.

The camel route test is a little bit more interesting:

I like the ChatGPT-generated test data: text-1, text-2...

It loads the test config, runs a Postgres instance in a Docker container via TestContainers, rolls out database migrations, starts a new instance of the Camel context, copies a small data sample to the ingestion input folder, and initializes the data ingestion Camel route. After that preparation, the test uses a custom asynchronous assertion (stolen from the metosin/testit library), as the data ingestion process is asynchronous by nature. You can read an explanation of this assertion here. Additionally, we do not want to use a real implementation of the embedding store, so we mock the "add" function using the with-redefs macro and a spy stub. As a result, we have an elegant, clean-looking, declarative integration test that is highly expressive.
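A condensed, hypothetical sketch of that mocking approach (all names here are invented, and the real test uses the testit-style asynchronous assertion instead of a hand-rolled polling loop):

```clojure
;; Sketch only: illustrates with-redefs + a spy stub for the store's "add".
(ns wiki-stem-corpus.camel.routes.ingestion-test-sketch
  (:require [clojure.test :refer [deftest is]]))

(declare embedding-store-add copy-sample-to-inbound!) ; hypothetical fns

(deftest ingestion-persists-embeddings
  (let [calls (atom [])]                      ; spy: records every call
    (with-redefs [embedding-store-add         ; mock the real "add"
                  (fn [& args] (swap! calls conj args))]
      (copy-sample-to-inbound!)               ; triggers the Camel route
      ;; Poor man's asynchronous assertion: poll until the spy
      ;; has been called or a timeout of ~10 seconds elapses.
      (loop [n 0]
        (when (and (empty? @calls) (< n 100))
          (Thread/sleep 100)
          (recur (inc n))))
      (is (seq @calls) "the route should persist at least one embedding"))))
```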

Conclusion

You can modify the microservice codebase described here according to your requirements and ultimately obtain a quality component that accounts for most aspects necessary for a modern microservice. For simplicity, I did not add an authorization and authentication layer. Perhaps in the following article, I will show the options for handling authorization and authentication elegantly for REST APIs.

Using the code example from this article, you can create much more complex pipelines based on the Message Oriented Middleware framework Apache Camel, leveraging the full power and expressiveness of the functional language Clojure. In this microservice, the codebase uses less than 1% of all the possible features and functions that Apache Camel provides. I intentionally did not add idempotency, persistent messaging with JMS, multithreaded message processing in Camel Route, or complex error handling and retry logic. However, all these features are often used in complex production-ready real applications. Camel essentially provides a Lego-like toolkit for creating complex integration scenarios with third-party services or for building data processing pipelines for validation, transformation, enrichment, and data aggregation. Camel is one of the oldest and most mature Java frameworks. Using this framework saves you an incalculable amount of time that the framework's authors and the community have spent on debugging and fixing errors.

