How I Built a Document Chatbot From the Ground Up and Learned 4 Valuable Lessons About AI
You've been wrong before

Like many other AI-curious developers, the first thing I did with an open-source LLM was to create the “hello world” of this era: I built an “Ask My Document” chatbot. Let’s call it a Docbot.

For those of you who haven’t seen this application of a Large Language Model (“LLM”), it basically goes like this:

[Image: Documents are converted to text, vectorized, and used as context for the LLM.]
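
At a high level, the whole pipeline fits in a dozen lines. This is a toy sketch of the shape of the flow, using trivial stand-ins (keyword overlap instead of real embeddings, no actual LLM call); each comment names the real tool described below:

```python
def extract_text(path: str) -> str:
    # Real version: Unstructured or eparse (see below)
    with open(path, encoding="utf-8") as f:
        return f.read()

def split_into_chunks(text: str, size: int = 500) -> list[str]:
    # Real version: a splitter that respects sentence and section boundaries
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(chunks: list[str], question: str, k: int = 3) -> list[str]:
    # Real version: embed chunks and question, rank by vector similarity (ChromaDB)
    def score(chunk: str) -> int:
        return len(set(chunk.lower().split()) & set(question.lower().split()))
    return sorted(chunks, key=score, reverse=True)[:k]

def ask(path: str, question: str) -> str:
    context = "\n\n".join(retrieve(split_into_chunks(extract_text(path)), question))
    # Real version: send this prompt to the LLM served by FastChat
    return f"Context:\n{context}\n\nQuestion: {question}"
```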

To build Docbot, I used the following open-source tools:

  • FastChat to serve the LLM and provide its API;
  • eparse and Unstructured to convert documents to text;
  • ChromaDB as the vectorstore;
  • LangChain to orchestrate the pieces; and
  • Gradio for the web interface.

Everything runs locally except for the LLM, which I run on a private AWS EC2 server instance with a GPU.

FastChat provides a slick interface to Docbot and works as a stand-alone LLM for experimentation:

[Image: the FastChat web interface. Source: FastChat]

To convert documents to text, I used eparse for Excel files and Unstructured for everything else. To test the application, I ran a set of documents through Docbot, and about a third of them had some kind of extraction issue, or the encoding failed to convey the key points of the document.
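
For reference, here is roughly what the extraction loop looked like with Unstructured’s auto-partitioner (a simplified sketch; the “documents” directory and the error handling are illustrative):

```python
from pathlib import Path

# Unstructured picks a parser based on the detected file type.
from unstructured.partition.auto import partition

def extract_text(path: Path) -> str:
    """Convert a document to plain text, one element per block."""
    elements = partition(filename=str(path))
    return "\n\n".join(str(el) for el in elements)

texts, failures = {}, []
for path in Path("documents").glob("*"):
    if not path.is_file():
        continue
    try:
        texts[path.name] = extract_text(path)
    except Exception as exc:
        # About a third of my test set hit some kind of issue here.
        failures.append((path.name, exc))
```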

Document extraction, transformation, and loading is not a one-size-fits-all process, even across the same file types. The same goes for the way data is encoded for the LLM, as every dataset tells a different story.

Once converted to text, document content must then be encoded so Docbot’s LLM brain can understand it. This process is known as “vectorization” and requires a special kind of storage service known as a “vectorstore”. Unlike a relational database such as PostgreSQL, which was released almost 30 years ago, vectorstores are still in their infancy. The one I used is called ChromaDB and has been in active open development for about 6 months.
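
Here is a minimal example of what that looks like with ChromaDB; the collection name, IDs, and document text are made up for illustration:

```python
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to keep data
collection = client.create_collection("docbot")

# Chroma embeds the text with a default embedding model and stores the
# vectors alongside the raw chunks and their metadata.
collection.add(
    ids=["income-2022-01-p1", "income-2022-01-p2"],
    documents=["Revenue for January 2022 ...", "Expenses for January 2022 ..."],
    metadatas=[{"source": "income-2022-01.xlsx"}] * 2,
)

# A query embeds the question and returns the nearest stored chunks.
results = collection.query(query_texts=["What was January revenue?"], n_results=2)
```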

The last step of the process was to integrate the vectorstore, the LLM, and a web interface that allows the user to pick a document and have a conversation. For that, I used LangChain and Gradio. Fortunately, since FastChat provides an API server that wraps the LLM, it was easy to set up this last bit (1).
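
Because FastChat can expose an OpenAI-compatible API server, LangChain’s standard OpenAI client can simply be pointed at it. Here is a rough sketch of the wiring, not my exact code: the host, model name, and metadata are placeholders, and the Gradio portion assumes a version with gr.ChatInterface available:

```python
import gradio as gr
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.embeddings import HuggingFaceEmbeddings  # needs sentence-transformers
from langchain.vectorstores import Chroma

# FastChat's openai_api_server exposes an OpenAI-compatible endpoint,
# so LangChain's OpenAI client can point at the private EC2 instance.
llm = ChatOpenAI(
    openai_api_base="http://my-ec2-host:8000/v1",  # placeholder host
    openai_api_key="EMPTY",  # FastChat does not check the key
    model_name="vicuna-13b",  # whichever model FastChat is serving
)

# The Chroma collection built during vectorization, wrapped for LangChain.
vectorstore = Chroma(
    collection_name="docbot",
    embedding_function=HuggingFaceEmbeddings(),
)

# A chain that retrieves relevant chunks and feeds them to the LLM
# along with the running conversation.
chain = ConversationalRetrievalChain.from_llm(
    llm, retriever=vectorstore.as_retriever()
)

def chat(message, history):
    # Gradio passes history as [user, bot] pairs; the chain wants tuples.
    result = chain({
        "question": message,
        "chat_history": [tuple(pair) for pair in history],
    })
    return result["answer"]

gr.ChatInterface(chat).launch()
```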

Rise, Docbot, and tell me about my financial statements.

[Image: Things are about to get real.]


As a CPA with a partner who is also a CPA, we prepare and maintain monthly personal financial statements (about an 8-day close, no need to DM me). To facilitate that process, it would be helpful to be able to talk to the statements from previous months. I loaded all my income statements from 2022 and started chatting.

Here’s a clip of the conversation I had with Docbot about it:

[Image: the conversation with Docbot]

This might seem like magic. It certainly did to me. Soon, however, this capability will be standard across all document and productivity software such as Microsoft Word and Adobe Acrobat.

Yesterday’s billion-dollar idea is today’s “hello world”.

If I showed this demo to someone a year ago, they would probably have suggested that I turn this project into a business. A year ago, they would have been right. Today, however, this demonstration is just an experiment to learn about AI in a practical way.

And that’s the point of generative AI. It’s going to enable individuals to do more, just like the personal computer did 40 years ago and the internet did 20 years ago. Levers do leverage things.

But there are some valuable takeaways from this exercise:

#1: Extracting and ingesting quality data is hard.

This isn’t news to data scientists, but it can be easily taken for granted until you actually try to ingest dozens of files with varying file types and sizes. It will probably be a long time before you can just upload a massive cache of files and expect an LLM to magically give you all the (correct) answers. Knowing your files, knowing your data, and knowing the semantics of that data is key to getting it encoded the proper way.

#2: Vectorstores are easy to understand and difficult to get working correctly.

The biggest challenge I ran into, other than things like file encoding, file size, and error handling, was getting the right document-related vectors into the brain of the LLM such that when I asked for a summary, I didn’t get a summary of everything the LLM knows that was related to my question. Said another way, I needed a simple filter.

I recently came across this image which sums this problem up well:

[Image: credit: “The Missing WHERE Clause in Vector Search”, James Briggs, Pinecone]

It was also harder than I expected to apply a simple “where field = value” filter to the vectorstore. I ended up submitting a proposed code change to LangChain about it (which they thankfully accepted (2)), so in my future projects, this process should be a lot easier.
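
With metadata attached at ingestion time, the fix amounts to one extra argument. A sketch using the Chroma wrapper from the earlier snippet (the “source” field and file name are my hypothetical naming):

```python
# Restrict retrieval to one document by filtering on its metadata,
# the vector-search equivalent of "WHERE source = '...'".
docs = vectorstore.similarity_search(
    "Summarize the income statement",
    filter={"source": "income-2022-01.xlsx"},  # hypothetical field and value
)

# The same filter can be baked into a retriever for use in a chain.
retriever = vectorstore.as_retriever(
    search_kwargs={"filter": {"source": "income-2022-01.xlsx"}}
)
```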

#3: There’s a trade-off between security, privacy, and LLM performance.

The US House Committee on Science, Space, and Technology recently held open hearings on AI.

One of the topics discussed during those hearings was the tendency of LLMs to “leak” data from their training set. Imagine a person who learned the law by litigating cases and who can be tricked into mentioning confidential details of those cases when discussing what they know about the law.

There are generally two approaches to “fine-tuning” an LLM:

  • Train the LLM on new data, which becomes a part of its “long term” memory; or
  • Stage the LLM’s short-term ephemeral memory with new data.

The problem with the latter approach, which is better for security and privacy, is that the model’s short-term memory (called a “context window”) is limited. And vice versa for the former approach, which can produce a better model for large datasets or specialized inference, but risks leaking the data on which it was trained and increases the security risk profile when scaling out the underlying infrastructure.

Choosing the right option will depend largely on how the AI is being applied. Obviously, the context memory option was more than adequate for Docbot.
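
For the curious, here is a toy sketch of the staging approach: retrieved text rides along in the prompt for one request only, and the context window is the hard limit (character counts stand in for tokens here):

```python
# "Short-term memory" staging: retrieved text is packed into the prompt
# for a single request and never enters the model's weights.
MAX_CONTEXT_CHARS = 6000  # stand-in for the model's real token limit

def build_prompt(question: str, chunks: list[str]) -> str:
    context, used = [], 0
    for chunk in chunks:  # assume chunks are pre-ranked by relevance
        if used + len(chunk) > MAX_CONTEXT_CHARS:
            break  # the context window is the hard constraint
        context.append(chunk)
        used += len(chunk)
    return "Context:\n" + "\n\n".join(context) + f"\n\nQuestion: {question}"
```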

#4: There is a big opportunity for AI-forward domain experts.

Generalized apps and toolsets will provide a lot of foundational capabilities, like a generalized Docbot or intelligent searching of a corpus of data. Perplexity AI is a great example of what can be done, by analogy, on a private dataset. Bing recently integrated ChatGPT into its search engine, with similar capabilities.

Where things will get interesting over the next 3-18 months is in the domain-specific space. As non-tech organizations learn to use AI at a micro level and see how it can be woven into specific points of their existing workflows, new and specialized capabilities will emerge. Generalized applications of AI will not be sufficient, and the market will favor curated tools that can be used to provide superior services and create better products. Depending on the domain, these discoveries can either create value in use or be separately monetizable.

“Goodbye, world” – Docbot

Application lifespans are getting shorter, and Docbot is no exception. Once the big enterprise software and cloud companies announce their updated AI-powered offerings, everyone will have a Docbot. Just like Bishop the android, Docbot “could be reworked, but will never be top of the line again.”

When asked what direction we should go in next, my response is now: let’s do what we do, just do it better with AI, and see where it goes. Each nuance we learn about and can factor into our workflows is another point of differentiation at our disposal.


(1) If you’re looking for a tutorial on how to build your own document chatbot, this one from LangChain is great. And this is a template for adding a Gradio web interface.

(2) I would sincerely like to thank the teams behind the development of LangChain, Unstructured, FastChat, Chroma, and Gradio for doing what they do, and being so helpful to people like me who use their technology. For free.
