How to build a vector embedding pipeline in Clojure with a locally running LLM
Andrew Panfilov
Programmer | Software Architect | Early-stage startup enthusiast
Intro
Hey, everyone! I have two pieces of news for you: good and not-so-good. The good news is that I'm going to give you $55,000. The bad news is that instead of cash or a check, you'll get the equivalent value in the form of source code: roughly 4.5 months of work by one developer, estimated to be worth $55,000 by the command-line tool SCC. See the screenshot below.
Alright, jokes aside. In this article, I continue the series of guides on writing simple microservices in the Clojure programming language. The solution's codebase and ideology largely build on the codebase from the previous article, so I suggest pausing here, reading it first, and then coming back!
From a business logic standpoint, the microservice does two things: it ingests a Wikipedia STEM corpus into a Postgres-based vector store as embeddings, and it serves a similar-terms search (plus metadata about ingested documents) over a REST API.
This educational microservice project provides, as in the previous article, a Swagger descriptor for REST API with a nice Swagger UI console, Postgres-based persistence (now it is also a vector database!), a REPL-friendly development setup, and something not used in the previous article—a DSL for describing Data Pipelines using the message-oriented middleware framework Apache Camel: naturally, we use a thin Clojure wrapper for this framework.
GitHub repository with the source code: https://github.com/dzer6/wsc (needless to say, you should clone it to your local machine to start exploring the project).
This article was partially inspired by the article "Feeding LLMs efficiently: data ingestion to vector databases with Apache Camel."
Preparation
I use macOS, but I'm sure everything explained here will work well on Linux, too. Either way, you should pre-install the following on your machine (latest major versions):
Leiningen
The project.clj file contains pretty much the same dependencies regarding configuration, HTTP server, logging, stateful resources management, Clojure data schema validation, database migration, and database persistence. In comparison with the microservice from the previous article, there are two new dependencies:
To build the project, go to the cloned repository folder and run the command in a terminal:
lein uberjar
The path to the resulting fat jar (a synonym for uberjar) with all the needed dependencies:
./target/app.jar
When your app.jar is ready, you can build a Docker image.
Docker
It is pretty much the same as last time: the Dockerfile sets up a containerized environment for the application, leveraging Amazon Corretto 22 on Alpine Linux (don't forget to build the uberjar before building the image). It exposes port 8080 and specifies the command that starts our service in the JVM.
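For reference, a minimal sketch of a Dockerfile of this kind (the base image tag and paths are my assumptions; the actual file is in the repo):

FROM amazoncorretto:22-alpine
WORKDIR /app
# build the fat jar first: lein uberjar
COPY target/app.jar app.jar
EXPOSE 8080
CMD ["java", "-jar", "app.jar"]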
Docker Compose
At a high level, it looks like the following:
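The actual descriptor is in the repository; as a hedged sketch, the shape of such a compose file (service names, images, and mounts here are assumptions) is roughly:

services:
  wsc_postgres:
    image: pgvector/pgvector:pg16        # Postgres with the pgvector extension
    ports: ["5432:5432"]
  wsc_ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
  wsc_ollama_init:                       # one-shot job that asks Ollama to pull the model
    image: curlimages/curl
    depends_on: [wsc_ollama]
    command: ["curl", "-s", "http://wsc_ollama:11434/api/pull", "-d", "{\"name\": \"llama3.1\"}"]
  wsc:                                   # the microservice itself
    build: .
    ports: ["8080:8080"]
    volumes:
      - ./target/inbound:/app/target/inbound   # watched ingestion folder
    depends_on: [wsc_postgres, wsc_ollama]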
So, we want to make embeddings via locally running LLM, right? Wait no more! Go to the terminal and run:
docker compose up
You will see something similar to:
And next similar to this:
This line means that the Ollama service is downloading the LLM to run it locally in the context of the Docker container:
wsc_ollama_init-1 | {"status":"pulling 8eeb52dfb3bb","digest":"sha256:8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe","total":4661216384,"completed":359708416}
When the downloading process successfully finishes, you will see:
wsc_ollama_init-1 | {"status":"success"}
As our microservice is also part of the docker-compose descriptor, it is already up and running in a Docker container, and we can immediately feed it a subset of Wikipedia data. Let's download the data sample from https://www.kaggle.com/datasets/conjuring92/wiki-stem-corpus/data. If you have a spare day or two to wait until the full set is processed and stored in the vector database, copy the downloaded file archive.zip into the "./target/inbound" folder (inside the folder with the cloned repo). The microservice watches this folder and expects every file there to be a zipped CSV with the column set explained on the Kaggle page above.
But if you don't want to wait hours and just want to check that the pipeline works, copy the tiny file wiki_stem_corpus_sample.zip into "./target/inbound" instead. It will be processed immediately.
Why exactly this folder? Because it is mounted into the Docker container with the microservice:
By default, Ollama running in a Docker container has no preloaded LLMs and can produce embeddings only after being initialized with a concrete model. That is why we use the auxiliary wsc_ollama_init service: it waits for the Ollama container to start and requests the "llama3.1" model download via the HTTP endpoint "http://wsc_ollama:11434/api/pull". The model download does not happen instantly; it may take several minutes.
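If you ever need to trigger the same model pull manually from a REPL, a small sketch with clj-http (assuming it is on the classpath, and that Ollama's port is exposed locally) could look like this:

(require '[clj-http.client :as http])

;; Ask Ollama to download the llama3.1 model; the endpoint streams
;; progress JSON lines and finishes with {"status":"success"}.
(http/post "http://localhost:11434/api/pull"
           {:content-type :json
            :body "{\"name\": \"llama3.1\"}"})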
To read more on Ollama, check out the official GitHub repo: https://github.com/ollama/ollama and their blog: https://ollama.com/blog.
To learn more about the llama3.1 LLM, read its official web page: https://ai.meta.com/blog/meta-llama-3-1/
Read-Eval-Print Loop
Before playing with the REPL, make sure you have stopped the microservice running in the Docker container:
There are several ways to start a Read-Eval-Print Loop in Clojure. One way is to run the command in a terminal:
lein repl
This is what you see if the configuration and environment are well set up:
Another way is the same one we used in the previous article. To start a REPL session in an IDE (again, Cursive / IntelliJ IDEA), you need to configure a local REPL:
Click on the "+" button to see:
Click on the "Local" menu item to see:
Change "Unnamed" to "Local" and press the "OK" button to see it in the toolbar:
Now click on the green bug button to see:
Type the (init) command and press Enter:
This is what you see if the configuration and environment are well set up:
The session logs show that the application loads its configuration and establishes a connection with a PostgreSQL database. This involves initializing a HikariCP connection pool and Flyway for database migrations. The logs confirm that the database schema validation and migration checks were successful. Next, the Apache Camel context (with two routes, "dead-letter-route" and "data-ingestion-pipeline") and the embedding model/store start. The startup of the Jetty HTTP server follows, and the server becomes operational, ready to accept requests on the specified port.
To apply any code change, type (reset) and press Enter.
To run the tests, type (run-tests) and press Enter.
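Under the hood, (init), (reset), and (run-tests) are typically small helpers in the user namespace. A hedged sketch of such a namespace (the names and details in the repo may differ):

(ns user
  (:require [mount.core :as mount]
            [clojure.tools.namespace.repl :as tn]
            [clojure.test :as test]))

(defn init
  "Start all mount-managed stateful resources."
  []
  (mount/start))

(defn reset
  "Stop everything, reload changed namespaces, start again."
  []
  (mount/stop)
  (tn/refresh :after 'user/init))

(defn run-tests
  "Run all project tests from the REPL."
  []
  (test/run-all-tests #".*-test$"))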
Stateful Resources
The approach is the same – we use the mount Clojure library to manage the application state. The microservice contains the following stateful resources:
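As a hedged illustration of the pattern, here is how two such resources might be declared with mount (the names and config values below are assumptions):

(ns app.state
  (:require [mount.core :refer [defstate]]
            [hikari-cp.core :as hikari])
  (:import [org.apache.camel.impl DefaultCamelContext]))

;; HikariCP connection pool: created on (mount/start), closed on (mount/stop)
(defstate datasource
  :start (hikari/make-datasource {:jdbc-url "jdbc:postgresql://localhost:5432/wsc"})
  :stop  (hikari/close-datasource datasource))

;; Apache Camel context hosting the ingestion routes
(defstate camel-context
  :start (doto (DefaultCamelContext.) (.start))
  :stop  (.stop camel-context))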
REST API
I use Compojure to define a REST API, but it is a question of taste—you can use Reitit or something else. Compojure looks a little bit outdated and has some known complications with complex routing, but it works well in 99% of cases.
The microservice has a trivial REST API with several HTTP endpoints.
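To give a flavor of it, here is a minimal hedged sketch in the compojure-api style (the route, schema, and handler below are hypothetical, not the repo's actual endpoints):

(ns app.api
  (:require [compojure.api.sweet :refer [api GET]]
            [ring.util.http-response :refer [ok]]
            [schema.core :as s]))

(def app
  (api
    {:swagger {:ui   "/"
               :spec "/swagger.json"
               :data {:info {:title "wiki-stem-corpus API"}}}}
    ;; hypothetical endpoint: terms similar to the query string
    (GET "/api/terms/similar" []
      :query-params [query :- s/Str]
      :return [s/Str]
      (ok [(str "terms similar to " query)]))))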
Swagger UI console looks like this:
Business Logic
Current business logic is much more interesting than in the previous article. Here, we have not only a REST API controller but also a data ingestion pipeline.
The wiki-stem-corpus.rpc.controller.langchain controller houses the business logic and defines two primary operations: searching for similar terms and retrieving metadata about ingested documents.
Let's start with similar terms search. To picture the process, take a look at a high-level sequence diagram:
Here is the search handler:
Inside the embedding store search function, we encapsulate two operations:
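In langchain4j terms, those two operations are: embed the query text with the model, then run a nearest-neighbour search against the store. A hedged interop sketch (method names follow the public langchain4j API; the wrapper in the repo may differ):

(defn search-similar
  "Embed the query, then fetch the closest matches from the vector store."
  [embedding-model embedding-store query max-results]
  (let [query-embedding (.. embedding-model (embed query) content)]
    ;; findRelevant comes from the langchain4j EmbeddingStore interface
    (.findRelevant embedding-store query-embedding (int max-results))))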
Here is the metadata handler:
And now, the most exciting part – the data ingestion pipeline.
To picture the process, take a look at a high-level sequence diagram:
And the code itself:
The data-ingestion-route function defines a data processing route using Apache Camel. Here's a detailed breakdown:
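Since the wrapper mirrors Camel's Java DSL, the overall shape of such a route can be sketched with plain interop (the endpoint options and the per-row processor here are assumptions, not the repo's exact code):

(ns app.route
  (:import [org.apache.camel.builder RouteBuilder]
           [org.apache.camel Exchange Processor]))

(defn ingest-row!
  "Hypothetical per-row step: the real code would embed and persist the row."
  [^Exchange exchange]
  (println "ingesting:" (.. exchange getIn (getBody String))))

(defn data-ingestion-route
  "Watch the inbound folder, unzip, split CSV lines lazily, process each row."
  []
  (proxy [RouteBuilder] []
    (configure []
      (.. this
          (from "file:./target/inbound")                 ; folder watcher
          (unmarshal)
          (zipFile)                                      ; unpack the zipped CSV
          (split (.. this body (tokenize "\n")))         ; one message per CSV line
          (streaming)                                    ; do not load the whole file into memory
          (process (reify Processor
                     (process [_ exchange]
                       (ingest-row! exchange))))))))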
This data ingestion pipeline is designed to handle large datasets that may not fit into memory efficiently—the Apache Camel framework provides this functionality out of the box.
This is a pretty trivial data ingestion pipeline, and someone might say, "Why do I need Apache Camel at all? I could easily reimplement the same thing, let's say, in Python with the same level of expressiveness!"
Apache Camel and Python both offer powerful tools for building data ingestion pipelines, but they serve different needs and excel in different scenarios.
Why Use Apache Camel:
When Camel Outperforms Python:
Persistence Layer
Again, no ORMs, no overcomplication, pure HugSQL-joy:
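To make that concrete, here is a hedged sketch of the idea (the file name, query name, and columns are assumptions; <-> is the pgvector distance operator):

-- sql/queries.sql
-- :name find-similar-documents :? :*
select id, text
from documents
order by embedding <-> :query-embedding
limit :limit

(require '[hugsql.core :as hugsql])

;; generates a Clojure function per ":name" entry in the SQL file
(hugsql/def-db-fns "sql/queries.sql")

;; hypothetical call; query-embedding is a pgvector-compatible value
(find-similar-documents db {:query-embedding query-embedding :limit 5})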
Configuration
The same approach as in the previous article is used: the cprop library loads the configuration as a plain Clojure map. The microservice follows the fail-fast approach and stops immediately if the loaded configuration does not pass validation against the data schema.
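A hedged sketch of this fail-fast loading (the schema shape below is an assumption; the real schema in the repo is richer):

(ns app.config
  (:require [cprop.core :refer [load-config]]
            [schema.core :as s]))

(def Config
  {:http-port s/Int
   :database  {:jdbc-url s/Str}
   s/Keyword  s/Any})

(defn load-validated-config
  "Load config with cprop and throw immediately if it does not match the schema."
  []
  (s/validate Config (load-config)))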
Logging
We use org.clojure/tools.logging over Logback and SLF4J—plain-text logs in dev mode and JSON-formatted logs in production. The logging configuration lives in the logback.xml file.
Tests
We have two integration tests in the codebase: for the REST API controller and the Camel route.
The REST API controller test is identical to the one from the previous article: it loads the test config, runs a Postgres instance in a Docker container via TestContainers, rolls out database migrations, loads seed data into the database, starts an embedded web server, and performs several HTTP calls to check the REST API endpoints. Nothing special.
The Camel route test is a little more interesting:
It loads the test config, runs a Postgres instance in a Docker container via TestContainers, rolls out database migrations, starts a new instance of the Camel context, copies a small data sample into the ingestion input folder, and initializes the data ingestion Camel route. After that preparation, the test uses a custom asynchronous assertion (borrowed from the metosin/testit library), since the data ingestion process is asynchronous by nature; you can read an explanation of this assertion here. Additionally, we do not want to use a real implementation of the embedding store, so we mock its "add" function using the with-redefs macro and a spy stub. As a result, we get an elegant, clean-looking, declarative integration test that is highly expressive.
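The mechanics of the spy plus asynchronous assertion can be sketched in isolation like this (a simplified, hypothetical demo with a simulated async producer, not the repo's actual test):

(ns app.route-test
  (:require [clojure.test :refer [deftest is]]))

(defn add-document!
  "Stand-in for the real embedding-store \"add\" function."
  [store doc]
  (throw (ex-info "not wired in this sketch" {})))

(defn eventually?
  "Poll pred until it returns true or the timeout elapses."
  [pred timeout-ms]
  (let [deadline (+ (System/currentTimeMillis) timeout-ms)]
    (loop []
      (cond
        (pred)                                  true
        (< (System/currentTimeMillis) deadline) (do (Thread/sleep 100) (recur))
        :else                                   false))))

(deftest spy-and-async-assertion-demo
  (let [added (atom [])]                        ; spy: records every call
    (with-redefs [add-document! (fn [_store doc] (swap! added conj doc))]
      ;; simulate the asynchronous ingestion calling the (redefined) function
      (future (Thread/sleep 200) (add-document! nil {:term "demo"}))
      (is (eventually? #(= 1 (count @added)) 5000)))))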
Conclusion
You can modify the microservice codebase described here according to your requirements and ultimately obtain a quality component that accounts for most aspects necessary for a modern microservice. For simplicity, I did not add an authorization and authentication layer. Perhaps in the following article, I will show the options for handling authorization and authentication elegantly for REST APIs.
Using the code example from this article, you can create much more complex pipelines based on the message-oriented middleware framework Apache Camel, leveraging the full power and expressiveness of the functional language Clojure. The codebase of this microservice uses less than 1% of the features Apache Camel provides. I intentionally did not add idempotency, persistent messaging with JMS, multithreaded message processing in a Camel route, or complex error handling and retry logic; all of these are commonly used in complex, production-ready applications. Camel essentially provides a Lego-like toolkit for building complex integration scenarios with third-party services, or data processing pipelines for validation, transformation, enrichment, and aggregation. Camel is one of the oldest and most mature Java frameworks, and using it saves you the incalculable amount of time that the framework's authors and the community have already spent on debugging and fixing errors.