GPT, Intent Detection, and Embeddings in pepite.cc
My Experience Building pepite.cc: Exploring the Power of OpenAI APIs
Over the past two weeks, I have been immersed in the development of pepite.cc, a project I embarked upon to explore the capabilities of OpenAI's APIs and dive into the enigmatic world of GPT. It all started when I came across a post about Microsoft's plan to integrate LinkedIn data with ChatGPT. As always, however, French profiles were not yet covered, which turned out to be great news for me. Though I had no clear starting point, I knew that indexing as many profiles as possible was crucial. Little did I know that LinkedIn would turn this seemingly simple task into a nightmarish ordeal.
I began by developing a straightforward web crawler to search the web for public links to LinkedIn profiles. My crawler, powered by Puppeteer, would visit a page, identify all the links within it, and store them in a LevelDB database for later crawling. LinkedIn links pointing to an account (those containing "/in/") were treated separately and stored in their own database. In no time, I had gathered 500K links to public LinkedIn profiles, and that's when the problems started.

After writing a simple Puppeteer script to visit the pages and retrieve their text content, I excitedly ran it, thinking the job was done. The next day, when I checked my database, I was in for a surprise: only 5% of the accounts were actually there. This was my first encounter with the authwall, a fascinating piece of code implemented by Microsoft to prevent exactly the kind of scraping I was attempting. LinkedIn shows the authwall to a visitor who has viewed several pages in a row, even if those profiles are public and indexed by Google.

Determined to find a solution, I tried searching for profiles directly from Google, and to my surprise I could see the accounts. After some digging, I discovered that Google includes a special token in its result links that disables the authwall. Problem solved, right? Not quite. When I switched to going through Google search, I ran into a different challenge: the famous "I'm not a robot" verification from Google. After hours of searching, I finally found a workaround that worked about 80% of the time, a big improvement over the roughly 10% I was getting before. I had to use a VPN to rotate my IP address quickly, make sure Puppeteer cleared its cache, and set a random user agent for each profile. These adjustments slowed down the scraping, but by launching multiple instances I managed to obtain the information I needed.
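To make the crawling loop concrete, here is a minimal sketch of the link-collection step, assuming the Puppeteer and level packages. The database names, seed URL, and user-agent list are placeholders of my own, and the VPN rotation and cache clearing mentioned above are not shown.

```typescript
// crawl.ts — minimal sketch of the link crawler (illustrative, not the production code)
import puppeteer from 'puppeteer';
import { Level } from 'level';

const queue = new Level<string, string>('./queue-db');       // links still to visit
const profiles = new Level<string, string>('./profiles-db'); // LinkedIn "/in/" links

const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
];

async function crawl(startUrl: string) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // A random user agent per page, one of the tweaks mentioned above.
  await page.setUserAgent(USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)]);

  await page.goto(startUrl, { waitUntil: 'domcontentloaded' });
  // Collect every href on the page.
  const links: string[] = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => (a as HTMLAnchorElement).href)
  );

  for (const link of links) {
    if (link.includes('linkedin.com/in/')) {
      await profiles.put(link, ''); // public profile link: store separately
    } else {
      await queue.put(link, '');    // anything else: crawl later
    }
  }
  await browser.close();
}

crawl('https://example.com/some-seed-page').catch(console.error);
```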
With a sufficient amount of data in hand, it was time for the less glamorous part—cleaning, tokenizing, removing stop words and HTML tags, and finally obtaining the embeddings. These "magical things," as I like to call them (though I'm no expert, so don't quote me on this), are compressed representations of the meaning of the original text in the form of vectors (lists of numbers). The fascinating aspect of embeddings is that the distance between two vectors indicates how closely related the texts they represent are. There are various distance functions available, and I opted for cosine distance, as it was mentioned in the OpenAI docs and conveniently built into Redis.
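As a rough sketch of the embedding step, assuming the OpenAI Node SDK (the text-embedding-ada-002 model name is my assumption, not something stated here):

```typescript
// embeddings.ts — sketch of getting an embedding and comparing two of them
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: 'text-embedding-ada-002', // assumed model, not confirmed in the post
    input: text,
  });
  return res.data[0].embedding;
}

// Cosine similarity: 1 means the vectors point the same way, 0 means unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Cosine similarity only compares directions, not magnitudes, which is part of why it is a common default for text embeddings.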
Once I had all the embeddings for the profiles, it was time to have some fun and write the ChatGPT plugin. Unfortunately, I couldn't get on the waitlist, so I had to find a workaround. Luckily, I enjoy working on UI, so I quickly developed a Svelte app to integrate with the API and created a basic chat interface. With that in place, I needed a way to detect user intent, which is a whole field of research in itself. Fortunately, I found a way to sidestep this challenge by crafting prompts that guide GPT to perform the intent detection for me. If GPT believed the user was searching for a profile, it would always reply in a specific manner: wrap the summary of the search between designated keywords. By checking for those keywords in the response, I could extract the relevant text and obtain its embeddings. That was it! The rest of the process involved performing a K-nearest neighbors (KNN) search and sending the enhanced response back to the user. Additionally, I implemented a simpler prompt that injected the user's profile and asked GPT to summarize it.
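To illustrate the sentinel-keyword trick, here is a hedged sketch. The exact prompt wording, the <<SEARCH>>/<<END>> markers, and the model name are placeholders I made up; only the overall flow (prompt, check for markers, embed the extracted summary, run a KNN lookup) mirrors what I described above.

```typescript
// intent.ts — sketch of prompt-based intent detection with sentinel keywords
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const SYSTEM_PROMPT = `You are a recruiting assistant.
If the user is searching for a profile, reply ONLY with the search summary
wrapped like this: <<SEARCH>> summary of what they are looking for <<END>>.
Otherwise, answer normally.`;

async function handleMessage(userMessage: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo', // assumed model
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: userMessage },
    ],
  });

  const reply = completion.choices[0].message.content ?? '';
  const match = reply.match(/<<SEARCH>>([\s\S]*?)<<END>>/);

  if (match) {
    // Search intent detected: embed the extracted summary and look up nearby profiles.
    const summary = match[1].trim();
    const hits = await knnSearch(summary);
    return `Here are some matching profiles:\n${hits.join('\n')}`;
  }
  return reply; // not a search: pass the answer through as-is
}

// Stub: the real version embeds the query and runs a KNN lookup in Redis
// (see the HNSW sketch further down).
async function knnSearch(query: string): Promise<string[]> {
  return [];
}
```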
However, I encountered some limitations along the way. One significant drawback was that sometimes GPT would misinterpret the intention and provide responses that were not aligned with what the user intended. To address this, I found that lowering the temperature parameter helped to some extent, but it also made the responses more robotic and closely tied to the prompt itself. Another approach I took was to increase the number of examples for possible responses, which significantly improved the accuracy and reduced the chances of off-track replies.
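Both of those knobs end up in the same chat-completion call; a minimal sketch, with invented example exchanges and an assumed model name:

```typescript
// fewshot.ts — sketch of the two mitigations: lower temperature plus few-shot examples
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function classifyIntent(userMessage: string) {
  return openai.chat.completions.create({
    model: 'gpt-3.5-turbo', // assumed model
    temperature: 0.3,       // lower temperature: more predictable, but more "robotic"
    messages: [
      { role: 'system', content: 'Wrap profile searches like <<SEARCH>> ... <<END>>; otherwise answer normally.' },
      // Few-shot examples showing the expected format (invented for illustration):
      { role: 'user', content: 'I need a senior React developer in Paris' },
      { role: 'assistant', content: '<<SEARCH>> senior React developer, Paris <<END>>' },
      { role: 'user', content: 'What does pepite.cc do?' },
      { role: 'assistant', content: 'pepite.cc helps you find relevant public profiles.' },
      // The real user message goes last:
      { role: 'user', content: userMessage },
    ],
  });
}
```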
Here is the final layout of the underlying structure I ended up using:
During this project, I also learned valuable lessons. For instance, running Docker, Nginx, Redis, and Node.js on a limited setup with just 1 CPU and 1GB of RAM proved to be a nightmare. Redis kept being killed by the OOM (Out of Memory) killer as it attempted to load all the embeddings into memory. As a workaround, I had to enable swap space, even though Redis strongly discourages it, and it slowed down the searches.
I also discovered the importance of using optimized algorithms for KNN searches. While brute-forcing through hundreds of thousands of vectors was relatively fast, using Hierarchical Navigable Small World (HNSW) indexing provided a noticeable speed boost, and the search results were just as accurate as the brute force method.
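For reference, here is roughly what that looks like with RediSearch's vector fields, sent as raw commands through node-redis. The index name, key prefix, and embedding dimension (1536) are my assumptions rather than pepite.cc's actual settings.

```typescript
// hnsw.ts — sketch of an HNSW vector index in Redis (RediSearch) and a KNN query
import { createClient } from 'redis';

const redis = createClient({ url: 'redis://localhost:6379' });

// Create the index once: one VECTOR field using the HNSW algorithm and cosine distance.
async function createIndex() {
  await redis.sendCommand([
    'FT.CREATE', 'idx:profiles', 'ON', 'HASH', 'PREFIX', '1', 'profile:',
    'SCHEMA', 'embedding', 'VECTOR', 'HNSW', '6',
    'TYPE', 'FLOAT32', 'DIM', '1536', 'DISTANCE_METRIC', 'COSINE',
  ]);
}

// KNN query: find the 5 profiles whose embeddings are closest to the query vector.
async function knnSearch(queryEmbedding: number[]) {
  const blob = Buffer.from(new Float32Array(queryEmbedding).buffer);
  return redis.sendCommand([
    'FT.SEARCH', 'idx:profiles',
    '*=>[KNN 5 @embedding $vec AS score]',
    'PARAMS', '2', 'vec', blob,
    'SORTBY', 'score',
    'DIALECT', '2',
  ]);
}

async function main() {
  await redis.connect();
  await createIndex();
}
main().catch(console.error);
```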
Another lesson was that training a custom model for intent detection on data that changes relatively frequently is not an ideal option. I experimented with training a custom model, but the results were almost identical to using prompts. Generating enough diverse training data with different examples also posed a challenge, and the cost was relatively higher. On the other hand, sticking with the default model and prompts meant slower responses. I may still switch to a custom model in the future, as current response times can sometimes reach 10 seconds. Streaming the tokens as they become available could also reduce the perceived latency, so I'll have to experiment more with both options before settling on one.
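Streaming is mostly a one-flag change with the OpenAI Node SDK; a minimal sketch, with an assumed model name:

```typescript
// stream.ts — sketch of streaming tokens as they arrive, to reduce perceived latency
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function streamReply(userMessage: string) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo', // assumed model
    stream: true,           // tokens are delivered incrementally instead of all at once
    messages: [{ role: 'user', content: userMessage }],
  });

  for await (const chunk of stream) {
    // Each chunk carries a small delta of the reply; forward it to the UI immediately.
    process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
  }
}

streamReply('Summarize this profile for me').catch(console.error);
```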
In conclusion, my journey with pepite.cc has been both challenging and rewarding. I've gained a deeper understanding of the capabilities and limitations of OpenAI's APIs, and I've honed my skills in web crawling, text preprocessing, embeddings, and creating chat interfaces. While there were obstacles to overcome and lessons learned, the process of building this project has been an invaluable learning experience that has fueled my curiosity to explore further possibilities with GPT and continue pushing the boundaries of what can be achieved with AI.
Some extra reading: