Domain-Driven Semantic Applications on ChatGPT

Pingping Xiu

Data Engineer Leader @ Caltrans | Data Engineering / AI

Published: March 14, 2023

Abstract

This post introduces domain modeling using semantics for building trustworthy ChatGPT applications. The author formalizes the content of a publisher website using ChatGPT prompts and Professor Zhaohui Luo's Formal Semantics with Modern Type Theories. Then the author uses a small vocabulary with semantic markers as the "Ubiquitous Language" for domain-specific semantics to ensure trust safety.

Background

After the arrival of GPT-4, I searched social media to learn more about its new capabilities. From the information I gathered, the focus appears to be on its multimodal capabilities and its aggregated performance on exams, among other benchmarks.

The GPT-4 report highlights, in its "5 Limitations" section, that despite its capabilities GPT-4 shares similar limitations with its predecessors. It is still not fully reliable: it often "hallucinates" facts and makes reasoning errors. Therefore, extra caution must be exercised when using language model outputs, especially in high-stakes situations.

Given that numerous industry experts have already made reasonable predictions about GPT-4, I prefer not to delve deeper into the topic. However, it is worth noting that Pingping Consultation is currently working to address the limitations on trust and safety. It is a significant gap that cannot be filled overnight.

Let's refocus on the main topic. Typically, domain experts have the most experience in their respective fields. In this post, I will provide an example of domain modeling using semantics and emphasize the significance of semantic representation of in-domain concepts in ensuring the trustworthiness of ChatGPT applications. It is crucial to consider this aspect in practice, along with other factors such as the depth of linguistic theory used in the semantic framework.

Modeling a Simple O'Reilly Website Content

Traditionally, there has been a gap between theory and practice, particularly in the field of linguistics. In my conversations with career mentors in computational linguistics, they said that applying their frameworks to formalize a large corpus takes significant effort.

I believe that ChatGPT has the potential to be useful, particularly in terms of efficiency and scalability, even though it may not necessarily improve the quality of outputs.

For example, O'Reilly is one of my favorite publishers, so I picked its website as an example for my study (O'Reilly friends, please allow me). On their website, https://www.oreilly.com/, I picked several articles and formalized them using the following ChatGPT prompt format:

Summarize the following sentence into one phrase, containing only words in the following list: trustworthy, application, system, concept, building, [...Put whatever preferred vocabulary here...], one is an adjective and the other a noun "[...Put whatever text you want to formalize here..]"
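The prompt format above can be filled in programmatically. Below is a minimal Python sketch, assuming the vocabulary is kept as a plain list of words; the function name `build_prompt` and the sample sentence are hypothetical and only serve as illustration. The sketch builds the prompt string and does not call ChatGPT.

```python
def build_prompt(vocabulary, text):
    """Fill the prompt format from this post with a vocabulary and a text."""
    if not vocabulary:
        raise ValueError("vocabulary must not be empty")
    word_list = ", ".join(vocabulary)
    return (
        "Summarize the following sentence into one phrase, "
        "containing only words in the following list: "
        f"{word_list}, one is an adjective and the other a noun "
        f'"{text}"'
    )


# Hypothetical inputs, for illustration only.
vocabulary = ["trustworthy", "application", "system", "concept", "building"]
prompt = build_prompt(vocabulary, "We need systems people can rely on.")
print(prompt)
```

Keeping the vocabulary in one list means the same words can later be reused as the typed vocabulary of the domain.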

And boom, I got all the formalized topical noun phrases in an Adjective + Noun format, which summarize the content reasonably well.

We then use Professor Zhaohui Luo's Formal Semantics with Modern Type Theories to turn those noun phrases into program terms. We get the following snippet (use Coq at https://coq.vercel.app/ to verify it and learn more):

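Definition CN := Set.
Parameter technology trend search programming progress software ethics : CN.
Definition ADJ := CN -> Prop.
Parameter transformative tech trustworthy ai_generated ai automating : ADJ.

(* Well-formed: an adjective applied to a common noun. *)
Check transformative technology.

(* Both checks below are expected to fail: "great" is not in the vocabulary, and a noun cannot be applied to an adjective. *)
Fail Check great technology.
Fail Check technology transformative.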

To formalize the corpus further, we can follow a similar approach to incorporate more vocabulary from user questions. However, since users come to O'Reilly with specific interests, their vocabulary won't be extensive. Additionally (warning: empirical judgement), the prompt already directs ChatGPT to prioritize certain terms, which limits the variety of its output to those terms. For instance, if we teach ChatGPT to use only "bias", it will still generate "unbiased" if required, but it will avoid random synonyms like "unbalanced".

I borrowed the illustration from Domain-Driven Design, which uses a "ubiquitous language" for its own purposes. I will steal the term and use it here: the small vocabulary above, with its semantic markers, becomes our "Ubiquitous Language" for domain-specific semantics.

Use the Domain Specific Semantics for Trust Safety

We must consider trust and safety because ChatGPT tends to hallucinate: it produces plausible-looking answers whenever a user question merely seems readable.

Using the Domain Specific Semantics as a modeling technique, we can construct all semantically valid noun phrases as "well-formed" topics. For example, you can verify that "trustworthy software" and "tech ethics" are both semantically valid in this domain.

Check trustworthy software.
Check tech ethics.

However, O'Reilly does not have content on those topics.
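The gap between "well-formed" and "has content" can be made concrete with a small Python sketch. It mimics the Coq `Check` on Adjective + Noun phrases with a plain membership test, which is a simplification and not the type checker itself. The vocabulary is copied from the Coq snippet in this post; the content index `INDEXED_TOPICS` is a hypothetical placeholder, since the real list would come from the formalized corpus.

```python
from itertools import product

# Vocabulary copied from the Coq snippet in this post.
NOUNS = {"technology", "trend", "search", "programming",
         "progress", "software", "ethics"}
ADJECTIVES = {"transformative", "tech", "trustworthy",
              "ai_generated", "ai", "automating"}

# Hypothetical content index: topics assumed to have matching articles.
INDEXED_TOPICS = {("transformative", "technology")}


def is_well_formed(phrase):
    """Mimic `Check adj noun.`: one known adjective, then one known noun."""
    words = phrase.split()
    return len(words) == 2 and words[0] in ADJECTIVES and words[1] in NOUNS


def lacks_content(phrase):
    """True for well-formed topics that the assumed index does not cover."""
    return is_well_formed(phrase) and tuple(phrase.split()) not in INDEXED_TOPICS


# Enumerate every well-formed topic, then keep the ones without content.
all_topics = [f"{a} {n}" for a, n in product(sorted(ADJECTIVES), sorted(NOUNS))]
uncovered = [t for t in all_topics if lacks_content(t)]
print(len(all_topics), "well-formed topics,", len(uncovered), "without content")
```

Even with this tiny vocabulary, most well-formed topics have no content behind them under the assumed index, which is why the cases below need a deliberate strategy.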

This raises practical concerns when building applications such as a recommender for frequently asked user questions, which must deal with "well-formed-but-lack-content" topics.

To prevent mis-categorization of user queries related to "well-formed-but-lack-content" topics, each of these topics would need a deep learning vector associated with a label stating "O'Reilly does not have this content", which would make the index very large.
Alternatively, you can require user queries to match one of the available content options.

Neither option is perfect.

The two options above apply when implementing a closed-domain knowledge recommender with fixed, indexed content. However, if you wish to extend the application so that ChatGPT switches between general user conversations and O'Reilly topics, this raises safety concerns.

If you do not redefine your vocabulary (or full semantic sentences) to include safety messages, ChatGPT may hallucinate based on user prompts, and generate output that is not in your definition list. This can happen when you let ChatGPT generate high probability outputs that are relevant to your business application, but not explicitly defined in your vocabulary. As a result, ChatGPT is left to its own devices.

Takeaway

We have discussed the remaining details in our previous posts. Our chosen Formal Semantics framework, Prof. Luo's MTT Semantics, has a type-theoretic property that enables it to eliminate invalid sequences (both out-of-domain and in-domain) through type checking. As a result, there is no need to manually fill in every possible gap with a vector to guard against hallucinations. This approach is efficient and scalable, and it helps address the governance challenge of ChatGPT.
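As a rough illustration of this takeaway, the Python sketch below routes a candidate topic phrase by checking it against the typed vocabulary first and the content index second, so no stored vector is needed for topics that lack content. It is my reading of the idea, not the Coq type checker: the membership test only imitates type checking, and the names `Route`, `route`, and `INDEX` (with its placeholder article id) are hypothetical.

```python
from enum import Enum


class Route(Enum):
    ANSWER = "answer from indexed content"
    NO_CONTENT = "O'Reilly does not have this content"
    REJECT = "not a well-formed topic in this domain"


# Vocabulary copied from the Coq snippet in this post.
NOUNS = {"technology", "trend", "search", "programming",
         "progress", "software", "ethics"}
ADJECTIVES = {"transformative", "tech", "trustworthy",
              "ai_generated", "ai", "automating"}

# Hypothetical index of topics that have content behind them.
INDEX = {"transformative technology": "placeholder-article-id"}


def route(phrase):
    """Decide how to handle a candidate Adjective + Noun topic phrase."""
    words = phrase.lower().split()
    # Step 1: the stand-in for type checking rejects ill-formed phrases.
    if len(words) != 2 or words[0] not in ADJECTIVES or words[1] not in NOUNS:
        return Route.REJECT
    # Step 2: well-formed phrases are looked up in the content index.
    if " ".join(words) in INDEX:
        return Route.ANSWER
    # Step 3: well-formed but not indexed, computed instead of stored.
    return Route.NO_CONTENT


for candidate in ["transformative technology", "trustworthy software",
                  "great technology"]:
    print(candidate, "->", route(candidate).value)
```

The "no content" outcome is derived from the vocabulary and the index at query time, so the index only has to hold topics that actually exist.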

Related Read

Domain Driven Reasoning on ChatGPT Trust

March 20, 2023

Pingping's Productive Week: Improving ChatGPT Trust and Semantic Governance

March 12, 2023

Design Patterns for ChatGPT Governance System

March 10, 2023

Incremental Formalization Strategy for ChatGPT Governance

March 9, 2023

How Organizations Establish ChatGPT Safe Zone

March 7, 2023

Formal Semantics on Prompt Engineering

March 6, 2023

"Tree-of-Thoughts" as an alternative to "Chain-of-Thoughts"

March 5, 2023

Designing a Random Number Generator for Coq: A Practical Solution for A Language Engineering Toolbox

March 3, 2023

Another step towards ChatGPT Semantic Testing: Manifest /w Subtyping

March 3, 2023

How to prove your work is not ChatGTP's? Using your content DNA -- Formal Semantics

March 2, 2023
