What the Heck is LanceDB?

Introduction

I started seeing LanceDB early in 2023, and my first thought was that it might be an attractive fit in the Apache Iceberg ecosystem or a general replacement for the Parquet columnar file format. I was thinking too small, though. LanceDB is an open-source, cloud-native, in-process, serverless, multi-modal vector database written in Rust. It’s not to be confused with the Lance columnar format (which is exactly what I was doing). Lance is also written in Rust and is the more appropriate comparison to Parquet, but it is the foundation that LanceDB is built on. LanceDB is also Apache Arrow compatible.

So, what we have is a SQL-compatible vector database that supports vectors, images, text, and videos, along with full-text search. It is also said to be very fast, but you’ll need to run your own tests to see how it performs on your own stack.

Why a Vector Database?

Two letters: AI. A vector database sits at the core of the data repository used to build applications on Large Language Models (LLMs). LanceDB has gone the extra mile to provide a GitHub repository with a few vector recipes that you can find here. The idea is that you store embeddings from a machine learning model so that you can, for example, search for images using written descriptions.

The challenge here is the “curse of dimensionality”: as a rule, you can be fast or accurate, but not both, so you have to pick one. LanceDB first went for speed, then did a lot of tuning to improve accuracy. This brings us to “embeddings,” which are high-dimensional floating-point vector representations of a query or the data. You can embed anything using an appropriate embedding model or function. The position of an embedding in the vector space has semantic significance that depends on the type of model and training you are using. LanceDB supports both “explicit” and “implicit” data vectorization methods.
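To make the idea of "position has semantic significance" a little more concrete, here is a minimal sketch in plain Python. It does not use LanceDB at all. The file names, the three-dimensional vectors, and the query vector are made up for illustration; a real embedding model would produce vectors with hundreds of dimensions. It ranks stored "image" embeddings by cosine similarity to a "text" query embedding, which is the exact (brute-force) way to search.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: names and values are invented for this example.
image_embeddings = {
    "beach_sunset.jpg": [0.9, 0.1, 0.0],
    "city_skyline.jpg": [0.1, 0.9, 0.1],
    "mountain_lake.jpg": [0.7, 0.0, 0.6],
}


def search(query_vector, k=2):
    """Brute-force search: score every stored vector, keep the k best."""
    ranked = sorted(
        image_embeddings.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Pretend an embedding model turned "sunset over the ocean" into this vector.
query = [0.8, 0.2, 0.1]
top_matches = search(query)
print(top_matches)
```

Scoring every stored vector like this is accurate but gets slow as the collection grows, which is where the speed-versus-accuracy trade-off comes from.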

At this stage, we’re getting into some deep water concerning how this all works, and it is beyond the scope of what I’m trying to convey in this blog. I’ll share an image from the LanceDB docs that illustrates how similar entries cluster within a vector system.

Vector data clusters
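Clustering is also one common way to trade a little accuracy for speed. The sketch below is a deliberately simplified, partition-based search in plain Python. It is not LanceDB's actual index implementation, and the vectors, keys, and centroids are all hypothetical. Each vector is assigned to its nearest centroid, and a query only looks inside the partition of its own nearest centroid.

```python
import math

# Hypothetical 2-D vectors; "f" sits between the two clusters.
vectors = {
    "a": (1.0, 1.0), "b": (1.2, 0.8), "c": (0.9, 1.3),
    "d": (5.0, 5.0), "e": (5.2, 4.9), "f": (3.2, 3.2),
}
# Hypothetical, hand-picked centroids; a real index would learn these.
centroids = {"low": (1.0, 1.0), "high": (5.0, 5.0)}


def nearest_centroid(vec):
    """Return the name of the centroid closest to vec."""
    return min(centroids, key=lambda name: math.dist(vec, centroids[name]))


# Build the partitions once, up front.
partitions = {name: [] for name in centroids}
for key, vec in vectors.items():
    partitions[nearest_centroid(vec)].append(key)


def exact_search(query):
    """Compare the query against every vector: slow but always correct."""
    return min(vectors, key=lambda k: math.dist(query, vectors[k]))


def approximate_search(query):
    """Compare the query only against vectors in its nearest partition."""
    candidates = partitions[nearest_centroid(query)]
    return min(candidates, key=lambda k: math.dist(query, vectors[k]))


easy_query = (1.1, 1.0)    # well inside the "low" cluster
border_query = (2.9, 2.9)  # near the boundary between partitions

print(exact_search(easy_query), approximate_search(easy_query))
print(exact_search(border_query), approximate_search(border_query))
```

For the query deep inside a cluster, both searches agree. For the query near the boundary, the approximate search only scans the "low" partition and misses "f", which lives in the "high" partition. Real systems soften this by probing more than one partition, at the cost of more comparisons.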

The LanceDB ecosystem provides all the latest and most commonly used tools for this space to make it as convenient as possible to get started.

LanceDB Ecosystem

You can use Python and JavaScript to process your data into LanceDB, using popular Python machine learning and columnar data packages. A native TypeScript SDK is available so that you can run vector search as part of serverless functions. What isn’t shown here are the new LangChain integrations, which make LLM work even easier. That’s not all, though. To put it in a list, we also see:

  • DuckDB
  • LangChain JS/TS
  • LlamaIndex
  • Voxel51

There is a great recent blog, “Serverless Multi-Modal search application” by Ayush Chaurasia, that I highly recommend. He gives a quick walkthrough using Next.js, LanceDB, and Roboflow’s CLIP inference API. If you’d like to go a little deeper, it’s a focused deep dive with example code.

Summary

Obviously, LanceDB isn’t a general-purpose database. It has a very specific use case, and from what I can tell, it’s a solid solution for that use case: a vector database built for speed and aimed squarely at AI applications. I can envision some really useful scenarios, such as building your own LLM application around product blogs and documentation so that users can ask specific questions and get more tailored responses than they would by reading through tons of blogs and docs. This could be the beginning of a tide shift in how we provide information to the public.

Finally, LanceDB Cloud is coming, and at the time of this writing, in October 2023, you can sign up to be notified about it at this link. I’m excited to give it a try and run some tests on ideas that have been bouncing around in my head since ChatGPT started grabbing everyone’s attention.

You can read the other “What the heck” articles at these links:

What The Heck Is DuckDB? (I was pretty out front on this one.)

What the Heck Is Malloy? (I was out front on this one, too.)

What the Heck is PRQL? (slower, but also growing)

What the Heck is GlareDB? (growing quickly)

What the Heck is SeaTunnel? (interest is hot)
