How to build a vector embedding pipeline in Clojure with a locally running LLM

After reading this article, the reader should feel comfortable in camel and llama herding.

Intro

Hey, everyone! I have two pieces of news for you: good and not-so-good. The good news is that I'm going to give you $55,000. The bad news is that instead of cash or a check, you'll get an equivalent value in the form of source code developed by one developer over 4.5 months and valued at $55,000. This valuation is based on estimates provided by the command-line tool SCC. See the screenshot below.

Actually, it is a little bit more than $55,000.

Alright, jokes aside. In this article, I continue the series of guides on writing simple microservices in the Clojure programming language. The solution's codebase and ideology are mainly based on the codebase from the previous article. I suggest pausing here, reading it first, and then coming back!

From a business logic standpoint, the microservice does two things:

  1. Ingests any number of textual documents, creates embeddings via a locally running LLM, and persists the embeddings in a vector database.
  2. Provides REST API for searching documents similar to an arbitrary given text.

This educational microservice project provides, as in the previous article, a Swagger descriptor for REST API with a nice Swagger UI console, Postgres-based persistence (now it is also a vector database!), a REPL-friendly development setup, and something not used in the previous article—a DSL for describing Data Pipelines using the message-oriented middleware framework Apache Camel: naturally, we use a thin Clojure wrapper for this framework.

GitHub repository with source code: https://github.com/dzer6/wsc (needless to say, you should clone it to your local machine to start exploring the project).

This article was partially inspired by the article "Feeding LLMs efficiently: data ingestion to vector databases with Apache Camel."

Preparation

I use macOS, but I'm sure everything explained here will work well on Linux, too. In any case, you should install the following on your machine (latest major versions):

  1. Git
  2. JVM
  3. Leiningen
  4. Docker (I personally like Docker Desktop also)

Leiningen

The project.clj file contains pretty much the same dependencies regarding configuration, HTTP server, logging, stateful resources management, Clojure data schema validation, database migration, and database persistence. In comparison with the microservice from the previous article, there are two new dependencies:

  1. org.apache.camel libraries (with the thin Clojure-wrapper com.dzer6/clj-camel) for the data pipeline management.
  2. dev.langchain4j libraries for LLM operations.

Don't fear the number of Camel libraries – the framework developers deliberately keep it granular.

To build the project, go to the cloned repository folder and run the command in a terminal:

lein uberjar        

The path to the resulting fat jar (a synonym for uberjar) with all needed dependencies:

./target/app.jar        

When your app.jar is ready, you can build a Docker image.

Docker

It is pretty much the same as the previous time:

This time, I left out the OpenTelemetry agent to keep the educational material simple.

The Dockerfile sets up a containerized environment for the application, leveraging Amazon Corretto 22 on Alpine Linux (don't forget to build the uberjar before building the image). It exposes port 8080 and specifies the command that starts our service in the JVM.
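The original shows the Dockerfile as a screenshot; as an illustration, a minimal sketch matching the description above might look like this (the exact base-image tag and in-container paths are assumptions):

```dockerfile
# Sketch only: Amazon Corretto 22 on Alpine Linux, as described above.
FROM amazoncorretto:22-alpine

WORKDIR /app

# The uberjar must be built beforehand with `lein uberjar`.
COPY ./target/app.jar /app/app.jar

# The service listens on port 8080.
EXPOSE 8080

CMD ["java", "-jar", "/app/app.jar"]
```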

Docker Compose

At a high level, it looks like this:

Last time, it was only Postgres. Now, in addition to Postgres, the most essential service is Ollama (which runs the LLM).
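The compose file itself appears as a screenshot in the original; as a rough illustration, the topology described here could be sketched like this (image tags, service names, ports, and mount paths are assumptions inferred from the log lines quoted later in the article):

```yaml
# Sketch only: a hypothetical docker-compose.yml matching the description.
services:
  wsc_postgres:
    image: pgvector/pgvector:pg16   # Postgres with the pgvector extension
    environment:
      POSTGRES_PASSWORD: postgres
    ports:
      - "5432:5432"

  wsc_ollama:
    image: ollama/ollama            # runs the LLM locally
    ports:
      - "11434:11434"

  wsc_ollama_init:
    image: curlimages/curl
    depends_on:
      - wsc_ollama
    # Ask Ollama to pull the llama3.1 model once the service is up.
    command: >
      curl -s http://wsc_ollama:11434/api/pull -d '{"name": "llama3.1"}'

  wsc_app:
    build: .
    ports:
      - "8080:8080"
    volumes:
      - ./target/inbound:/app/inbound   # watched ingestion folder
    depends_on:
      - wsc_postgres
      - wsc_ollama
```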

So, we want to make embeddings via locally running LLM, right? Wait no more! Go to the terminal and run:

docker compose up        

You will see something similar to:

And next similar to this:

This line means that the Ollama service is downloading LLM to run it locally in the context of the Docker container:

wsc_ollama_init-1  | {"status":"pulling 8eeb52dfb3bb","digest":"sha256:8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe","total":4661216384,"completed":359708416}        

When the downloading process successfully finishes, you will see:

wsc_ollama_init-1  | {"status":"success"}        

As our microservice is also part of the docker-compose descriptor, it is up and running in a Docker container, and we can immediately feed it with a subset of Wikipedia data. Let's download the data sample from https://www.kaggle.com/datasets/conjuring92/wiki-stem-corpus/data. If you have a spare day or two to wait until everything is processed and stored in the vector database, copy the downloaded archive.zip into the folder "./target/inbound" (inside the cloned repo) – the microservice watches this folder, and any file placed there is expected to be a zipped CSV file with the column set explained at https://www.kaggle.com/datasets/conjuring92/wiki-stem-corpus/data.

But if you don't want to wait hours and just want to check that the pipeline works, copy the tiny file wiki_stem_corpus_sample.zip to the folder with the path "./target/inbound". It will be processed immediately.

Here is why exactly this folder: it is mounted into the microservice's Docker container:

By default, Ollama running in a Docker container has no preloaded LLMs. It can make embeddings only after being pre-initialized with a concrete LLM. That is why we use the auxiliary wsc_ollama_init service – it waits for the Ollama container to start and requests the "llama3.1" model download via the HTTP endpoint "http://wsc_ollama:11434/api/pull". Model downloading does not happen immediately – it may take several minutes.

To read more on Ollama, check out the official Github repo: https://github.com/ollama/ollama and their blog: https://ollama.com/blog.

To know more about the llama3.1 LLM, read its official web page: https://ai.meta.com/blog/meta-llama-3-1/

Read-Eval-Print Loop

Before playing with the REPL, make sure you have stopped the microservice running in the Docker container:

Docker Desktop is an excellent tool!

There are several ways to start a Read-Eval-Print Loop in Clojure. One way is to run the command in a terminal:

lein run        

This is what you see if the configuration and environment are well set up:

Most logs you see here come from initializing stateful resources right after the REPL starts.

Another way is the same one we used in the previous article. To start a REPL session in an IDE (Cursive / IntelliJ IDEA again), you need to configure the local REPL:

Click on the "+" button to see:

Click on the "Local" menu item to see:

Change "Unnamed" to "Local" and press the "OK" button to see it in the toolbar:

Now click on the green bug button to see:

Type the (init) command and press Enter:

This is what you see if the configuration and environment are well set up:

The session logs show that the application loads configurations and establishes a connection to a PostgreSQL database. This involves initializing a HikariCP connection pool and Flyway for database migrations. The logs confirm that the database schema validation and migration checks were successful. Next, the Apache Camel context (with two routes, "dead-letter-route" and "data-ingestion-pipeline") and the Embedding Model/Store start. The startup of the Jetty HTTP server follows, and the server becomes operational and ready to accept requests on the specified port.

To apply any code change, type (reset) and press Enter.

To run tests, you should type (run-tests) and press Enter.
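Helpers like (init), (reset), and (run-tests) typically live in a dev-time user namespace; here is a hypothetical sketch of how they can be wired with mount and tools.namespace (the project's actual helpers may differ):

```clojure
;; Sketch only: a hypothetical dev/user namespace; names are assumptions.
(ns user
  (:require [mount.core :as mount]
            [clojure.tools.namespace.repl :as tn]
            [clojure.test :as test]))

(defn init
  "Start all mount-managed stateful resources."
  []
  (mount/start))

(defn reset
  "Stop resources, reload changed namespaces, and start again."
  []
  (mount/stop)
  (tn/refresh :after 'user/init))

(defn run-tests
  "Run all project test namespaces from the REPL."
  []
  (test/run-all-tests #"wiki-stem-corpus\..*-test"))
```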

Stateful Resources

The approach is the same – we use the mount Clojure library to manage the application state. The microservice contains the following stateful resources:

  1. Web server on top of Eclipse Jetty 12: wiki-stem-corpus.server
  2. Configuration manager on top of Clojure cprop library: wiki-stem-corpus.config
  3. Postgres client on top of HikariCP pooled data source, low-level Clojure wrapper next-jdbc for JDBC-based access to databases, HugSQL library for a clean separation of SQL and Clojure code: wiki-stem-corpus.db.postgres
  4. Database migrations manager on top of Flyway: wiki-stem-corpus.migrations
  5. HTTP client for a model that can convert a given text into an embedding (vector representation of the text) on top of LangChain4j library: wiki-stem-corpus.langchain.embedding-model
  6. Client to PGVector embeddings store, also known as a vector database, on top of LangChain4j library: wiki-stem-corpus.langchain.embedding-store
  7. Apache Camel context on top of thin Clojure-wrapper library clj-camel (configures routes and policies during message exchanges between endpoints): wiki-stem-corpus.camel.state
  8. Data ingestion pipeline in the form of Apache Camel route: wiki-stem-corpus.camel.routes.data-ingestion-pipeline
  9. Dead Letter Channel (enterprise integration pattern of the same name) implementation in the form of Apache Camel route: wiki-stem-corpus.camel.routes.dead-letter

REST API

I use Compojure to define a REST API, but it is a question of taste—you can use Reitit or something else. Compojure looks a little bit outdated and has some known complications with complex routing, but it works well in 99% of cases.

The microservice has a trivial REST API with several HTTP endpoints.

  1. "/search" endpoint – for finding the most semantically similar Wikipedia descriptions to a given text (don't feed it with just a few words; try to input at least a paragraph of a term explanation to see some relevant results). The current vector embedding store search implementation supports cosine similarity only.
  2. "/metadata" endpoint – for obtaining the ingested Wiki STEM Corpus metadata. It intentionally returns very little information: only the number of ingested items.
  3. "/health/live" and "/health/ready" endpoints are named after two classical health checks.
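For illustration, endpoints like these could be declared in Compojure roughly as follows (the handler functions here are hypothetical, not the project's actual code):

```clojure
;; Sketch only: handler fns are hypothetical placeholders.
(ns wiki-stem-corpus.rpc.routes-sketch
  (:require [compojure.core :refer [defroutes GET POST]]
            [ring.util.response :as response]))

(declare search-similar ingested-count) ; hypothetical business fns

(defroutes app-routes
  ;; Find descriptions semantically similar to the given text.
  (POST "/search" {{:keys [text]} :body}
    (response/response (search-similar text)))

  ;; Intentionally minimal: only the number of ingested items.
  (GET "/metadata" []
    (response/response {:count (ingested-count)}))

  ;; Classical liveness and readiness probes.
  (GET "/health/live" [] (response/response {:status "ok"}))
  (GET "/health/ready" [] (response/response {:status "ok"})))
```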

Swagger UI console looks like this:

Business Logic

Current business logic is much more interesting than in the previous article. Here, we have not only a REST API controller but also a data ingestion pipeline.

The wiki-stem-corpus.rpc.controller.langchain controller houses the business logic that defines two primary operations: searching for similar terms and obtaining metadata about the ingested documents.

Let's start with similar terms search. To picture the process, take a look at a high-level sequence diagram:

Here is the search handler:

The handler calls the embedding store and destructures the obtained list of Java beans to return it as an HTTP response.

Inside the embedding store search function, we encapsulate two operations:

  1. Vector embedding creation via a remote call to LLM running as a Docker container locally on the dev machine: lines 29-32 in the following screenshot.
  2. Search for the most similar (closest in the embedding space) embeddings: lines 33-36. It is an SQL query to the Postgres database running locally as a Docker container.
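As a sketch of those two steps, a search function using LangChain4j interop might look like this (method names follow LangChain4j's Java API, but exact signatures depend on the library version, so treat this as an illustration rather than the project's code):

```clojure
;; Sketch only: two-step search via LangChain4j interop.
(ns wiki-stem-corpus.langchain.search-sketch
  (:import [dev.langchain4j.store.embedding EmbeddingSearchRequest]))

(defn search
  "1) Embed the query text via the locally running LLM,
   2) find the closest embeddings (cosine similarity) in the store."
  [embedding-model embedding-store text max-results]
  (let [query-embedding (.content (.embed embedding-model text))
        request         (-> (EmbeddingSearchRequest/builder)
                            (.queryEmbedding query-embedding)
                            (.maxResults (int max-results))
                            (.build))]
    ;; Returns a list of EmbeddingMatch beans, ordered by score.
    (.matches (.search embedding-store request))))
```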

Here is the metadata handler:

The handler invokes an SQL query via a runtime-generated Clojure function created by the HugSQL library at microservice startup.

And now, the most exciting part – the data ingestion pipeline.

To picture the process, take a look at a high-level sequence diagram:

And the code itself:

The data-ingestion-route function defines a data processing route using Apache Camel. Here's a detailed breakdown:

  1. Route Initialization: the route starts by specifying a file endpoint (directory) from which files are read.
  2. Unzipping Files: the route uses the unmarshal zip file feature to unzip files, allowing for further processing of the uncompressed data.
  3. Splitting CSV Data: once the files are unzipped, the route splits the data based on CSV rows. This splitting is crucial for processing each row individually.
  4. Processing Rows: each row is then processed using the persist-embedding processor function to store the needed part of the CSV record in the vector database. Similar to the search REST API handler, it is a two-step operation: embedding creation via a remote call to LLM and persisting it in the embedding store.
  5. Error Handling: the route may include error handling to manage any issues that arise during the file reading, unzipping, or processing stages, ensuring the pipeline is robust.

This data ingestion pipeline is designed to handle large datasets that may not fit into memory efficiently—the Apache Camel framework provides this functionality out of the box.

This is a pretty trivial data ingestion pipeline, and someone might say, "Why do I need Apache Camel at all? I can easily reimplement the same thing, let's say, in Python with the same level of expressiveness!"

Apache Camel and Python both offer powerful tools for building data ingestion pipelines, but they serve different needs and excel in different scenarios.

Why Use Apache Camel:

  1. Integration Focus: Camel is designed explicitly for enterprise integration patterns (EIPs), making it ideal for complex routing, transformation, and orchestration of messages across diverse systems.
  2. Declarative Approach: With Camel, you declare the entire route in a structured manner, making the code easier to read, maintain, and extend, especially in large systems.
  3. Built-in Components: Camel provides over 300 out-of-the-box components for different protocols and systems (e.g., file systems, databases, messaging queues, REST services), reducing the need to write custom connectors or adapters in Python.
  4. Error Handling and Reliability: Camel has robust built-in error handling, retries, and transaction management, essential in enterprise-grade applications where reliability is critical.
  5. Scalability: Camel can scale easily with the underlying infrastructure, particularly in environments where high throughput and low latency are required. It also supports clustering and distributed deployments.

When Camel Outperforms Python:

  1. Complex Integration Scenarios: When you need to integrate multiple systems using different protocols, Camel’s extensive library of components and support for EIPs make it a superior choice.
  2. Enterprise-Grade Applications: For applications that require high availability, fault tolerance, and scalability, Camel’s mature ecosystem and features are more robust than a custom Python solution.
  3. Maintainability: In projects with large teams or long lifespans, Camel’s declarative nature and standardized integration patterns result in more maintainable code than hand-written Python scripts.

Persistence Layer

Again, no ORMs, no overcomplication, pure HugSQL-joy:

Only one query, which turns into one Clojure function at application startup. We have it here for demonstration and educational purposes, of course.
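For illustration, a HugSQL definition for the query described here (counting ingested items for the /metadata endpoint) might look like this; the table and column names are assumptions:

```sql
-- :name count-ingested-items :? :1
-- Returns the number of ingested documents (used by /metadata).
SELECT count(*) AS count
FROM embeddings;
```

At startup, hugsql.core/def-db-fns reads the SQL file and generates a count-ingested-items Clojure function from such a definition.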

Configuration

The same approach as in the previous article is used: the cprop library loads the configuration as a plain Clojure map. The microservice follows the fail-fast approach and stops immediately if the loaded configuration does not pass validation against the data schema.
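With cprop, such a configuration is just data; a hypothetical EDN fragment (all keys invented for illustration, not the project's actual config) could look like:

```clojure
;; Sketch only: hypothetical config.edn keys.
{:http   {:port 8080}
 :db     {:jdbc-url "jdbc:postgresql://localhost:5432/wsc"}
 :ollama {:base-url "http://localhost:11434"
          :model    "llama3.1"}
 :ingest {:inbound-dir "./target/inbound"}}
```

Environment variables can then override any of these keys at deploy time, which is the usual cprop workflow.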

Logging

We use org.clojure/tools.logging over Logback and Slf4j—plain text logs in the dev mode and JSON formatted logs in the production. Logging configuration in the logback.xml file.

Tests

We have two integration tests in the codebase: for the REST API controller and the Camel route.

The REST API controller test is identical to one from the previous article—it loads the test config, runs a Postgres instance in a Docker container via TestContainers, rolls out database migrations, loads seeding data into the database, starts an embedded web server, and performs several HTTP calls to check REST API endpoints. Nothing special.

The camel route test is a little bit more interesting:

I like the ChatGPT-generated test data: text-1, text-2...

It loads the test config, runs a Postgres instance in a Docker container via TestContainers, rolls out database migrations, starts a new instance of the Camel context, copies a small data sample to the ingestion input folder, and initializes the data ingestion Camel route. After that preparation, the test uses a custom asynchronous assertion (stolen from the metosin/testit library), as the data ingestion process is asynchronous by nature. You can read an explanation of this assertion here. Additionally, we do not want to use a real implementation of the embedding store, so we mock the "add" function using the with-redefs macro and a spy stub. As a result, we have an elegant, clean-looking, declarative integration test that is highly expressive.
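A condensed, hypothetical sketch of that mocking approach (all names here are invented, and the real test uses the testit-style asynchronous assertion instead of a hand-rolled polling loop):

```clojure
;; Sketch only: illustrates with-redefs + a spy stub for the store's "add".
(ns wiki-stem-corpus.camel.routes.ingestion-test-sketch
  (:require [clojure.test :refer [deftest is]]))

(declare embedding-store-add copy-sample-to-inbound!) ; hypothetical fns

(deftest ingestion-persists-embeddings
  (let [calls (atom [])]                      ; spy: records every call
    (with-redefs [embedding-store-add         ; mock the real "add"
                  (fn [& args] (swap! calls conj args))]
      (copy-sample-to-inbound!)               ; triggers the Camel route
      ;; Poor man's asynchronous assertion: poll until the spy
      ;; has been called or a timeout of ~10 seconds elapses.
      (loop [n 0]
        (when (and (empty? @calls) (< n 100))
          (Thread/sleep 100)
          (recur (inc n))))
      (is (seq @calls) "the route should persist at least one embedding"))))
```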

Conclusion

You can modify the microservice codebase described here according to your requirements and ultimately obtain a quality component that accounts for most aspects necessary for a modern microservice. For simplicity, I did not add an authorization and authentication layer. Perhaps in the following article, I will show the options for handling authorization and authentication elegantly for REST APIs.

Using the code example from this article, you can create much more complex pipelines based on the Message Oriented Middleware framework Apache Camel, leveraging the full power and expressiveness of the functional language Clojure. In this microservice, the codebase uses less than 1% of all the possible features and functions that Apache Camel provides. I intentionally did not add idempotency, persistent messaging with JMS, multithreaded message processing in Camel Route, or complex error handling and retry logic. However, all these features are often used in complex production-ready real applications. Camel essentially provides a Lego-like toolkit for creating complex integration scenarios with third-party services or for building data processing pipelines for validation, transformation, enrichment, and data aggregation. Camel is one of the oldest and most mature Java frameworks. Using this framework saves you an incalculable amount of time that the framework's authors and the community have spent on debugging and fixing errors.

