Case Study: Explorations with my “private” ChatGPT client
[Image: very basic UI for the test client]

Many organizations are taking a “wait-and-see” approach to cloud-based generative AI services such as OpenAI’s ChatGPT, and some are banning these tools outright. Just last week, Apple proclaimed it would completely ban the use of ChatGPT and other similar cloud-based generative AI services. Ostensibly this was a preemptive step to protect “data confidentiality” — but who really knows?

And yes — there are currently no assurances that sensitive or proprietary query data will not leak into the logs or the training sets of future model iterations.

RECAP: So on the one hand you have a toolset that is simply too compelling to ignore. And on the other hand you have companies reflexively imposing bans on these tools.

Where is the optimal path forward here?

Q: What if organizations could run ChatGPT “locally” — just the same as any other proprietary in-house tool? Could it work? If so, HOW would it work?

That question is the basis for what follows.

PROBLEM STATEMENT:

Can “ChatGPT” (or a close analog) be configured to run locally? And if so, would it be robust and performant enough to provide the same rich user experience as its cloud-based variant?

USE CASE:

Evaluate how a local instance of ChatGPT performs as a “Q&A” knowledge-bot.

BACKGROUND:

An issue some large (and some small) enterprises struggle with is the fragmentation of knowledge across the organization. So-called “tribal knowledge” can reside in isolated pockets or in the heads of a few select specialists.

Delivery cadence can be impacted by how efficiently (or inefficiently) individuals can get access to that knowledge in time-critical moments.

As a result, the business may have some process which is generally not well understood — but nonetheless critical to delivery. And so “sherpas” (a.k.a. specialists) arise to provide guidance and to facilitate task completion.

What if a “smart Q&A” agent could be written which effectively performs this function?

METHODOLOGY:

I created a “private” LLM, running locally, that does not exchange data of any sort with any online resources.

The LLM I chose to employ is called “GPT4All”. Other tools used were LangChain, LlamaCpp, Chroma, and SentenceTransformers.
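To make the architecture concrete, here is a minimal sketch of how these pieces can be wired together into a local, retrieval-augmented Q&A chain. The model file name, embedding model, and parameter values are illustrative assumptions rather than the exact configuration used in this experiment; the point is simply that nothing in the chain calls out to an online service.

```python
# Minimal local Q&A sketch (file names and parameters are assumptions, not the exact setup used here).
from langchain.embeddings import HuggingFaceEmbeddings  # SentenceTransformers under the hood
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA

# Embeddings are computed locally via sentence-transformers.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Re-open a previously built Chroma index (see the ingestion sketch further below).
db = Chroma(persist_directory="db", embedding_function=embeddings)

# Local GPT4All weights; the .bin path is a placeholder.
llm = GPT4All(model="./models/ggml-gpt4all-j-v1.3-groovy.bin", verbose=False)

# "Stuff" the top-k retrieved chunks into the prompt and let the local model answer.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

result = qa({"query": "How are CIs related to services in ServiceNow?"})
print(result["result"])
```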

OBJECTIVE & GOAL:

My goal was to see if I could create a specialized knowledge-base LLM able to provide the following functionality:

  • Users are able to get authoritative answers on procedural questions.
  • Users are able to interact with the system in a natural Q&A conversational format.
  • Users are able to make inquiries, follow-up inquiries, and drill down ever deeper on topics.
  • Users are able to get clarification on any facet of any topic which is vague or unclear.


TRAINING DATA:

I trained my LLM on the following publicly available ServiceNow documents:

  • Knowledge Base Article - Service Portal.html
  • Understanding CI's in ServiceNow - ServiceHub.html
  • servicenow-rome-it-asset-management-enus.pdf
  • servicenow-rome-it-business-management-enus.pdf
  • servicenow-rome-it-operations-management-enus.pdf
  • servicenow-sandiego-it-service-management-enus.pdf

All told, I fed it close to 10,000 pages of documentation. These documents were vectorized into a set of embeddings which were made queryable via a basic text UI.
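For illustration, the ingestion step could look roughly like the sketch below: load the PDFs and HTML files, split them into overlapping chunks, embed each chunk with a SentenceTransformers model, and persist the vectors to a local Chroma store. The directory names, chunk sizes, and embedding model here are assumptions for the sketch, not the precise values used in this test.

```python
# Hypothetical ingestion sketch: directory names, chunk sizes, and model names are assumptions.
import os

from langchain.document_loaders import PyPDFLoader, UnstructuredHTMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

SOURCE_DIR = "source_documents"  # the ServiceNow PDFs and HTML files listed above
PERSIST_DIR = "db"               # on-disk Chroma index

# Load every PDF and HTML file in the source directory.
docs = []
for name in os.listdir(SOURCE_DIR):
    path = os.path.join(SOURCE_DIR, name)
    if name.lower().endswith(".pdf"):
        docs.extend(PyPDFLoader(path).load())
    elif name.lower().endswith(".html"):
        docs.extend(UnstructuredHTMLLoader(path).load())

# Split thousands of pages into overlapping chunks small enough for the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Embed the chunks locally and persist the vector store for later querying.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embeddings, persist_directory=PERSIST_DIR)
db.persist()
print(f"Indexed {len(chunks)} chunks from {len(docs)} document pages.")
```

The “basic text UI” can then be as simple as a loop that reads a question from stdin and passes it to the RetrievalQA chain sketched earlier.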

RESULTS:

First-pass results were definitely underwhelming.

While the model had an impressive ability to make sense of natural language queries and to respond with answers which bore some relevance to the original query, it was only partially able to make sense of the documents it was trained on.

TEST CRITERIA:

  • Can it respond to natural language queries?
  • Is it accurate?
  • Is it able to answer obvious questions from the text?
  • Is it able to make inferences despite typos and grammatical mistakes?
  • Is it consistent? If you repeat a question, how much variability is there between responses? (See the sketch below.)
  • Does it perform as well as ChatGPT or Bard?
  • Does it preserve context from one request to the next?
  • Is it able to abstract to higher levels of conceptual “understanding”?
  • Is it able to “drill-down” on topics?
  • Is it able to generate code?
  • Is it able to create tables?

You can view the test prompts and the corresponding output on this Google Sheet.
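As a simple illustration of the consistency check in particular, the same prompt can be re-asked several times against the RetrievalQA chain sketched earlier and the responses compared. The loop below is a hypothetical harness, not the actual script behind the linked sheet, and it assumes the `qa` chain from the earlier sketch is already constructed.

```python
# Hypothetical consistency harness: re-ask the same question and compare answers.
# Assumes `qa` is the RetrievalQA chain constructed in the earlier sketch.
QUESTION = "What is a Configuration Item (CI) in ServiceNow?"

answers = []
for i in range(3):
    result = qa({"query": QUESTION})
    answers.append(result["result"].strip())
    print(f"--- run {i + 1} ---\n{answers[-1]}\n")

# Crude variability signal: how many distinct answers came back?
print(f"{len(set(answers))} distinct answer(s) across {len(answers)} runs")
```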

NEXT STEPS:

By performing some targeted optimizations, I believe performance can be radically improved. Finding and applying those optimizations is the subject of a future evaluation. But generally speaking, I will likely explore these areas:

  • Create a collection of “canonical” definitions — and give these the highest relevance scores.
  • Pay more attention to signals such as headings and sub-links.
  • Use formatting as a signal.
  • Create a glossary of TLAs (three-letter acronyms) with similarly higher-weighted scores.
  • Experiment with higher weightings for the supplied training docs as opposed to the LLM’s “innate” knowledge.
  • Experiment with tweaked model parameters such as context window size (number of tokens), the vector size of embeddings, and the number of simultaneous attention heads (see the sketch below).
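As one concrete example of that last item, the sketch below shows what tuning the generation-side parameters might look like if the local model were served through LangChain’s LlamaCpp wrapper. The parameter names shown (n_ctx, n_batch, n_threads, temperature) exist on that wrapper, but the specific values are illustrative guesses rather than tuned settings; embedding vector size and attention-head count, by contrast, are properties of the chosen models rather than runtime flags.

```python
# Illustrative parameter tweaks (values are guesses, not tuned settings).
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./models/ggml-model-q4_0.bin",  # placeholder path to local quantized weights
    n_ctx=2048,       # context window size in tokens; a larger window fits more retrieved text
    n_batch=512,      # prompt tokens processed per batch
    n_threads=8,      # CPU threads to devote to inference
    temperature=0.2,  # lower temperature -> more deterministic, repeatable answers
    verbose=False,
)

# A larger k stuffs more retrieved chunks into the (now larger) context window.
# retriever = db.as_retriever(search_kwargs={"k": 8})
```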

It’s all so reminiscent of search, in a way. The difference is that the end result is not a URL but rather a collection of relevant, distilled text.

Performance is also a factor, as every query takes a minimum of 27 seconds to run locally on a multi-core M1 Max.

CONCLUSION:

I see tremendous promise for tools like GPT4All. However, it is going to take considerable work to customize and tune these models to boost predictability and accuracy so that they might become de facto “Subject Matter Experts”. And that will be the subject of future posts.

