ChatGPT Plugins and Factual Knowledge Alignment
Is the output of ChatGPT just a bunch of bull?
Note: this article was originally published on my Substack, where I publish more material, a few days earlier than on LinkedIn.
Earlier this year I posted about some of the trends surrounding open source language models, namely that any "moat building" Google or OpenAI may be attempting with Bard and ChatGPT may be rendered moot by open source language models.
Enter ChatGPT Plugins.
Plugins are a way to allow ChatGPT to interact with the web, or with particular parts of the web that expose APIs, such as a real estate website with data on listed houses.
So now it's time to take a look at some of the plugins released so far and try to get an idea of how much more capable and centralized ChatGPT may become, upon mass release of plugins, compared to open source language models and developers building their own integrations.
Seeking the Killer App
ChatGPT's plugins feature appears, at least to a small extent, to be OpenAI's attempt at seeking a series of "killer apps," almost as if to say: a language model is not good enough as a stand-alone product; we need to create an interface with the internet itself.
The problem with Large Language Models (LLMs), which I covered extensively in a YouTube video back in December 2022, is that they are highly probabilistic in nature, which means they do not excel at dealing with discrete values that convey quantity or quality, e.g. statements of fact. What I found in these plugins is that ChatGPT is largely accessing APIs, which are essentially gateways to databases, which contain, well...data: discrete values that convey quantity or quality, e.g. statements of fact.
Now, I know I'm really just a crank with a newsletter and a YouTube channel asking you, dear reader, to believe me on this. In my defense, a recent survey of studies on LLM factual inaccuracies was published in ACM Computing Surveys, a journal with an impact factor of around 14+, which seems to be quite high among computing journals. What that basically means is that the highest-prestige groups in this field rank this particular journal as especially prestigious.
So first off, there are many different types of what LLM researchers term "hallucinations": incorrect information produced by the process underlying the LLM. The type of hallucination we're dealing with here in the survey is:
Per the paper:
Innate divergence. Some NLG tasks by nature do not always have factual knowledge alignment between the source input text and the target reference […]. For instance, it is acceptable for open-domain dialogue systems to respond in chit-chat style, subjective style, […]– this improves the engagingness and diversity of the dialogue generation. However, researchers have discovered that such dataset characteristic leads to inevitable extrinsic hallucinations.
So basically, what this is saying is, as I discussed in a previous article, LLMs are often optimized for human engagement, which reduces factuality in favor of chit-chat.
That being said, let's jump in and look at some of OpenAI's plugins to see how they do on various tasks that require actual factual knowledge alignment.
Real Estate: Redfin
Redfin is a real estate website. I suppose all of the information that ChatGPT can give you from the Redfin plugin could be gleaned from browsing the Redfin website itself. The advantage here seems to be the interface through which we're receiving the data, which is more like a command line tool than a webpage, which is kind of nice because it reduces the noise.
But can we create any additional knowledge or information from this browsing capability?
The above is essentially a sample for all of Minneapolis, which might not be representative of what a particular home buyer may be looking for at a particular time because typically buyers are looking in a specific area. So, I used a zip code radius map to grab the zip codes of a particular area in Minneapolis and fed those in, as shown below.
Searching by explicit zip codes should ostensibly provide much more exacting results. However, when we look at the source map for a particular zip code reported as having a $350,000 average list price, we see quite different results: $350,000 is in fact the lower bound for this zip code, not the average:
Let’s take a look at what ChatGPT is actually doing under the hood. What it’s doing is writing an API call to Redfin, asking for the zip code 55419, with a maximum number of beds and bathrooms.
Then the API response is sent back to ChatGPT, containing the data requested, including the number of beds and baths as well as some other information. ChatGPT then ostensibly calculates an "average" across the prices of all the returned listings, but for some reason it gets the calculation wrong, likely because, under the hood, ChatGPT is a Large Language Model making a probabilistic prediction of what the result of a calculation should look like rather than performing an actual calculation. It's possible that we could engineer a prompt that helps ChatGPT focus in on just that price.
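To make the contrast concrete, here is a minimal sketch of the deterministic calculation ChatGPT would need to perform. The listing structure below is an illustrative assumption, not Redfin's actual API schema; the point is that averaging is a trivial, exact operation once you treat the response as data rather than text to predict over.

```python
# Deterministic average of list prices from a Redfin-style API response.
# The listing fields here are assumptions for illustration, not Redfin's schema.
listings = [
    {"address": "123 Elm St", "zip": "55419", "price": 350000, "beds": 3, "baths": 1},
    {"address": "456 Oak Ave", "zip": "55419", "price": 425000, "beds": 3, "baths": 2},
    {"address": "789 Pine Rd", "zip": "55419", "price": 515000, "beds": 4, "baths": 2},
]

prices = [home["price"] for home in listings]
average_price = sum(prices) / len(prices)  # exact arithmetic, no "prediction"
print(f"Average list price: ${average_price:,.0f}")  # → Average list price: $430,000
```

A plugin (or a code-execution tool) that hands this arithmetic off to actual code sidesteps the probabilistic failure mode entirely.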
What about for finding and scheduling open houses? Using the prompt:
Find houses which are 3 bed, 1 bath with open houses schedule
within the following zip codes:
55419,55410,55409,55424,55408,55423,55407,55435
I got the response:
When I went in and checked the links to find the open house times, only one of the links provided actually had an open house; the rest listed "No upcoming open houses."
That's all well and good, but it would obviously be better to have the open house times laid out in a table rather than just listed. However, when asked for that, we get:
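The tabulation itself is trivial once the underlying data is verified; the failure is in the data, not the formatting. A minimal sketch, with entirely hypothetical listing fields:

```python
# Hypothetical: tabulate verified open-house times from listing data.
# Field names and values are illustrative, not the Redfin plugin's schema.
open_houses = [
    {"address": "123 Elm St", "zip": "55419", "date": "2023-06-10", "time": "1-3 PM"},
    {"address": "456 Oak Ave", "zip": "55410", "date": "2023-06-11", "time": "12-2 PM"},
]

# Fixed-width columns make a readable plain-text table.
header = f"{'Address':<16}{'Zip':<8}{'Date':<12}{'Time':<8}"
rows = [
    f"{h['address']:<16}{h['zip']:<8}{h['date']:<12}{h['time']:<8}"
    for h in open_houses
]
table = "\n".join([header] + rows)
print(table)
```

This is the kind of glue logic one would hope the plugin handled server-side, so the LLM only has to render results it was actually given.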
Simple Coding: CreatiCode
From what I can tell, simple coding appears to be an application where LLMs do not hallucinate to a massive extent, as long as the code being written is confined to the algorithms themselves and to a language and framework that were actively reflected in the training process. That is to say, as soon as you get into obscure languages or esoteric requirements like versioning, it doesn't work out well anymore.
Sure enough, the CreatiCode plugin seems to do an OK job of creating simple Scratch programs.
Hypothetically, we could push this plugin to its logical extent and see if it can generate code to steer a Raspberry Pi-based robot, per an existing example block of code.
After directly copying and pasting a long string of unorganized code from a tutorial, the ChatGPT CreatiCode Scratch plugin gets activated, but it largely errors out because the code calls a library specific to the Raspberry Pi. My guess is that while ChatGPT and CreatiCode could likely write and compile Scratch code, there may not be support for certain edge cases. Below is the error I got.
Visual Diagramming: DiagramIt
This is an application I'm already familiar with, having previously asked ChatGPT to create visual diagrams with a Python library called GraphViz, a task it has done a fairly good job with.
Let’s see what happens if we ask DiagramIt to follow a prompt that ChatGPT provides as an example.
Uhh...weird ontology there, but I guess that's a loose way to describe how a car works. We may just be experiencing some odd hallucinations which aren't strictly incorrect, but aren't precise enough to be of any use in increasing understanding of how cars work. We could instead design our own diagram, which I have seen work fairly well with the GraphViz Python library.
So how does DiagramIt work under the hood? When I follow the link provided to edit the diagram, it flows through to a website called kroki.io, which does indeed show that GraphViz is the underlying code being used to create these diagrams. From that perspective, it might just be better to use GraphViz directly and own the code it produces, depending on what you're trying to do.
My experience with GraphViz in the past has shown that once you go beyond three or four levels of complexity, the graph you're trying to build starts to fall apart, but it can still be useful for putting together a nice quick graph.
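Owning the code is straightforward: the DOT language that GraphViz (and kroki.io) consumes is just text. Here is a minimal sketch that emits DOT source directly, so it runs without the GraphViz binary or Python bindings installed; the car "ontology" is my own simplified invention for illustration.

```python
# Build DOT source for GraphViz by hand, no plugin needed.
# Render the output with `dot -Tpng car.dot -o car.png` if GraphViz is installed.
nodes = {"engine": "Engine", "trans": "Transmission", "wheels": "Wheels"}
edges = [("engine", "trans", "torque"), ("trans", "wheels", "drive")]

lines = ["digraph car {"]
lines += [f'    {name} [label="{label}"];' for name, label in nodes.items()]
lines += [f'    {src} -> {dst} [label="{lbl}"];' for src, dst, lbl in edges]
lines.append("}")
dot_source = "\n".join(lines)
print(dot_source)
```

Because the diagram is plain text, it can be versioned, diffed, and edited by hand when the generated ontology goes sideways, which is exactly the ownership advantage over a plugin-hosted link.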
FiscalNote
FiscalNote seems to be a policy-wonk news database that scrapes the web, stores news stories specific to policies and legislation, and allows users to query them.
What I feel like I'm doing here is basically querying yet another web search platform: essentially a wrapper around a thing that searches news stories. This appears analogous to a Google advanced search restricted to one type of website, where that website has hypothetically already "reviewed" the stories it shows somehow. Under the hood, ChatGPT is accessing FiscalNote's API, which then returns a load of articles. I'm not clear on how valuable this might be, but I don't work in this industry. At the very least, it may be a faster way to look up regulatory questions on a particular topic to put into a blog post.
AI Ticker Chat
This was actually the most interesting plugin for me personally, perhaps for a niche reason, and it worked immediately. I put together a request asking AITickerChat to summarize some risk factors for a random company, AT&T.
Under the hood, the task it undertook was to extract forward-looking statements from a highly structured, pre-existing SEC data source.
Since summarizing text is something LLMs fundamentally do well, it's reasonable to expect they would do a good job here. Here's an example of some of the underlying data that ChatGPT is drawing from to create its summary above.
{
"id": "b50ee629486f_81",
"text": "CAUTIONARY LANGUAGE CONCERNING FORWARD-LOOKING STATEMENTS Information set forth in this report contains forward-looking statements that are subject to risks and uncertainties, and actual results could differ materially. Many of these factors are discussed in more detail in the “Risk Factors” section. We claim the protection of the safe harbor for forward-looking statements provided by the Private Securities Litigation Reform Act of 1995. The following factors could cause our future results to differ materially from those expressed in the forward-looking statements: The severity, magnitude and duration of the COVID-19 pandemic and containment, mitigation and other measures taken in response, including the potential impacts of these matters on our business and operations. Our inability to predict the extent to which the COVID-19 pandemic and related impacts will continue to impact our business operations, financial performance and results of operations.",
"metadata": {
"source": "SEC",
So the natural extension of this plugin, in my mind, would be to create industry summaries or investor pitches from 10-K forms, taking into account summarizations from a wide variety of similar sources to build up a picture of different cross-sections of industry trends. That sounds like a good use case to me.
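A sketch of what that extension might look like: collect filing chunks like the one above, group them per company, and assemble each group into a single summarization prompt. The chunk shape loosely mirrors the plugin output shown earlier, but the `ticker` field and the grouping step are my own assumptions; the actual LLM call is stubbed out.

```python
# Hypothetical pipeline: group 10-K risk-factor chunks by ticker and build
# one summarization prompt per company. Chunk fields beyond "text" and
# "metadata.source" are assumptions, not the AITickerChat plugin's schema.
from collections import defaultdict

chunks = [
    {"text": "COVID-19 pandemic impacts on operations...",
     "metadata": {"source": "SEC", "ticker": "T"}},
    {"text": "Competition in wireless and broadband markets...",
     "metadata": {"source": "SEC", "ticker": "T"}},
    {"text": "Rising content licensing costs...",
     "metadata": {"source": "SEC", "ticker": "NFLX"}},
]

by_ticker = defaultdict(list)
for chunk in chunks:
    by_ticker[chunk["metadata"]["ticker"]].append(chunk["text"])

prompts = {
    ticker: f"Summarize the risk factors for {ticker}:\n" + "\n".join(texts)
    for ticker, texts in by_ticker.items()
}
# An LLM summarization call would go here for each assembled prompt.
```

Feeding one consolidated prompt per company, rather than loose chunks, is what would let the summaries compare like-for-like across an industry.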
What Didn’t Work Due to Errors or Server Problems
Data Science: Notable
Notable is evidently a code notebook platform that allows users to run data science experiments. For those who are familiar, it's basically a fancy Jupyter Notebook or Google Colab notebook. I thought this could be a fairly powerful tool, but unfortunately it doesn't seem to be connected to ChatGPT properly: ChatGPT asks the user to set a default project, but there is no way to set a default project in Notable.
Attempting to Combine Webpilot and Speechki
I attempted a prompt that combines the two tools into something that would automatically create a podcast from my previously written blog post.
The web browser plugin, Webpilot did appear to pick up the text from my previous blog post fairly well, as can be seen in the image below.
However, every attempt to get the Speechki plugin to convert the text of my blog post into a recording failed with an API error. I tried to log into Speechki directly and could not, so this may have been a problem with Speechki itself more than anything else.
PDFs: ChatWithPDF
I was particularly excited to check this one out because I have worked on PDF document text extraction, but alas, there was a system error when I tried it.
Final Thoughts
So, after having used the plugins, here are my initial thoughts, in no particular order of importance.