Making data smoothies
This article is written for the technically inclined general contractor interested in how AI is leveraged in modern cloud software, specifically for construction projects.
In a previous article, I talked about how Constructable uses semantic search to find the needle in the haystack of construction data. In this article, I’ll talk in more depth about how we prepare and maintain data for effective search.
Embeddings are like smoothies
As a quick review, semantic search is a method of searching data by meaning rather than keyword. At the heart of semantic search is the concept of the embedding, a numeric representation of a chunk of data. In Constructable, we convert all of our customer data, whether it lives in a PDF or as structured data in a database (daily logs, for example), into embeddings that can be searched by meaning and then fed to an LLM to answer questions, summarize content, or take actions.
Making an embedding is kind of like making a data smoothie. When you make a smoothie, you take a bunch of ingredients (pieces of data), you throw them in the blender (large language model), and what comes out is a delicious uniform liquid that has the flavor of all of the original ingredients, but in a completely different form (a tasty embedding). If you taste the smoothie (compare the distance between embeddings), you can tell that, for example, a banana was one of the ingredients even though the banana no longer exists in its true banana form. A semantic search in a database is like tasting a bunch of smoothies and finding the ones that taste the most like a banana (assuming you happened to be searching for a banana).
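To make the "tasting" step concrete, here is a minimal sketch of comparing embeddings by cosine similarity. The three-number vectors and the smoothie names are made up for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and this is not necessarily how Constructable computes distances.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, stored):
    """Sort stored embeddings from most to least similar to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy, hand-written "embeddings" (assumption: the first axis loosely means "banana-ness").
smoothies = {
    "banana-strawberry": [0.9, 0.4, 0.0],
    "kale-apple": [0.1, 0.2, 0.9],
    "banana-peanut-butter": [0.8, 0.1, 0.3],
}
banana_query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(banana_query, smoothies)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The two banana smoothies score far higher than the kale one, which is the whole idea: the search never looks for the word "banana", only for vectors that point in a similar direction.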
Mixing in more ingredients
Search is only as good as the data that is put into the blender. At Constructable, we found that the accuracy of AI search can be improved by enriching embeddings with contextual information, similar to putting more related ingredients into one smoothie. For example, a paragraph of text in a PDF is useful for search, but it’s even more useful if we combine it with the title of the document, the name of the person who uploaded the PDF, the title of the section from which the chunk is taken, and a whole host of other information. This way, if you search for information about the compressive strength of the concrete provided by Tom at Concrete Pros, our search will automatically boost the relevance of paragraphs that not only talk about concrete strength, but also come from documents associated with our contact Tom at the Concrete Pros subcontractor. It’s like putting jalapeños in a banana smoothie: you are more likely to choose this smoothie if you are searching for a spicy tropical smoothie.
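As a rough sketch of what "mixing in more ingredients" can look like, the snippet below prepends context to a PDF paragraph before it is handed to an embedding model. The `PdfChunk` class, its field names, and the sample values are hypothetical, and the call to the embedding model itself is left out.

```python
from dataclasses import dataclass

@dataclass
class PdfChunk:
    # Hypothetical fields, chosen to mirror the context described above.
    text: str
    document_title: str
    section_title: str
    uploaded_by: str
    company: str

def build_embedding_input(chunk: PdfChunk) -> str:
    """Combine a paragraph with its surrounding context into one string to embed."""
    lines = [
        f"Document: {chunk.document_title}",
        f"Section: {chunk.section_title}",
        f"Uploaded by: {chunk.uploaded_by} ({chunk.company})",
        "",  # blank line separates the context header from the paragraph itself
        chunk.text,
    ]
    return "\n".join(lines)

chunk = PdfChunk(
    text="The 28-day compressive strength of the mix shall be tested per the project specification.",
    document_title="Concrete Mix Submittal",
    section_title="Strength Requirements",
    uploaded_by="Tom",
    company="Concrete Pros",
)

enriched = build_embedding_input(chunk)
print(enriched)
```

Because the uploader and company are now part of the text that gets embedded, a query mentioning Tom or Concrete Pros lands closer to this chunk than to an otherwise similar paragraph from someone else.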
This works great for unstructured data, and it works wonders for structured data too. In fact, it’s even easier to provide context in embeddings for structured data, because the relationships between different pieces of data are explicitly defined. Take daily logs, for example. In our database, daily logs can have notes, weather notes, attached photos, comments from other people in the system, connections to markup on the drawings, and connections to topics of conversation between people collaborating on the project. All of this information represents context that can be used to enrich the basic notes that were entered for a project on a particular day. If we throw all of this information into the blender along with the notes for any given daily log, suddenly our search can handle a much broader range of queries.
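For structured data, the related records can be pulled together by following known relationships. The sketch below flattens a daily log and some of its related items into one block of text ready for embedding. The `DailyLog` shape is an assumption for illustration (for instance, it represents photos by caption), not Constructable's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DailyLog:
    # Hypothetical shape: a real record would reference related rows by id.
    date: str
    notes: str
    weather_notes: str = ""
    photo_captions: list[str] = field(default_factory=list)
    comments: list[str] = field(default_factory=list)
    topics: list[str] = field(default_factory=list)

def render_daily_log(log: DailyLog) -> str:
    """Flatten a daily log and its related records into text for embedding."""
    parts = [f"Daily log for {log.date}", f"Notes: {log.notes}"]
    if log.weather_notes:  # skip empty fields so they add no noise
        parts.append(f"Weather: {log.weather_notes}")
    related = (
        ("Photo", log.photo_captions),
        ("Comment", log.comments),
        ("Topic", log.topics),
    )
    for label, items in related:
        parts.extend(f"{label}: {item}" for item in items)
    return "\n".join(parts)

log = DailyLog(
    date="2024-03-14",
    notes="Poured footings on the north side.",
    weather_notes="Light rain until 10am.",
    comments=["Pump truck arrived an hour late."],
    topics=["Footing inspection schedule"],
)
print(render_daily_log(log))
```

With the weather and comments blended in, a query like "days when rain delayed the pour" has something to match even though the basic notes never mention rain.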
Keeping your smoothies from going bad
While this is a pretty easy and effective way to boost search accuracy, there is one main challenge. The data in a system is not static; it changes every day. For example, someone might edit a comment associated with a daily log that we previously generated an embedding for. Or maybe they attach new photos to the daily log, or delete weather information. Now our smoothie for this daily log has spoiled. In other words, we will prioritize the wrong things in a search based on stale information. So we need a way to ensure that we re-blend our smoothies anytime one of the ingredients is updated. This requires that we track all of the dependencies for each embedding, and regenerate embeddings dynamically.
Fortunately, we were able to come up with a way to automatically track these dependencies (and keep them up to date as we build new features!). Anytime we create an embedding, we tell our system to go out and find all of the related pieces of information that we want to include. We use unique identifiers (UUIDs) for absolutely everything in our system, so we simply record, in our database, the id of every piece of information that contributed to the context of an embedding. Then, any time something in our system changes, we perform an efficient search of this list for the id of the changed item. Anywhere it appears, we know that we have a new smoothie to blend, so we can blend and serve these new smoothies up in a timely fashion and keep our search system healthy and accurate.
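One way to picture this bookkeeping is as a reverse index from each ingredient's id to the embeddings that used it. The in-memory sketch below is only an illustration under that assumption; the class and method names are hypothetical, and in a real system the same mapping would live in a database table with an index on the source id.

```python
from collections import defaultdict
from uuid import UUID, uuid4

class EmbeddingDependencyIndex:
    """Tracks which source records contributed to which embeddings."""

    def __init__(self) -> None:
        self._embeddings_by_source: dict[UUID, set[UUID]] = defaultdict(set)
        self._sources_by_embedding: dict[UUID, set[UUID]] = {}

    def record(self, embedding_id: UUID, source_ids: set[UUID]) -> None:
        """Store (or replace) the list of ingredients for one embedding."""
        # Remove links left over from a previous version of this embedding.
        for old_source in self._sources_by_embedding.get(embedding_id, set()):
            self._embeddings_by_source[old_source].discard(embedding_id)
        self._sources_by_embedding[embedding_id] = set(source_ids)
        for source_id in source_ids:
            self._embeddings_by_source[source_id].add(embedding_id)

    def stale_embeddings(self, changed_source_id: UUID) -> set[UUID]:
        """Return every embedding that must be re-blended after a change."""
        return set(self._embeddings_by_source.get(changed_source_id, set()))

# Usage: a daily log embedding built from the log itself, a comment, and a photo.
index = EmbeddingDependencyIndex()
daily_log_id, comment_id, photo_id = uuid4(), uuid4(), uuid4()
embedding_id = uuid4()
index.record(embedding_id, {daily_log_id, comment_id, photo_id})

# Someone edits the comment, so look up which embeddings are now stale.
to_regenerate = index.stale_embeddings(comment_id)
print(to_regenerate == {embedding_id})
```

Calling `record` again after regenerating an embedding replaces its old ingredient list, so a photo that was detached from the daily log no longer triggers a re-blend.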
Get in touch!
At Constructable, we are passionate about freeing customer data from the confines of PDFs and databases so it can be used to accomplish amazing things using AI. If you’d like a demo, please fill out our contact form or email me directly at [email protected]. I’d love to meet you, and if you are local to the California central coast, maybe we could go grab a smoothie sometime :).