Another step towards ChatGPT Semantic Testing: Manifest w/ Subtyping
Cover image: a vintage double-headed claw hammer with a custom handle.


Abstract

Language engineering involves systematically exploring a vast space of valid prompts and gathering insights from ChatGPT's responses. To handle the large amounts of data that lack clear type distinctions, the concept of "signatures" in a Proof Assistant's proof context is used to define global constants. This allows a more dynamic and flexible approach to defining language concepts and their relationships.

Background

Previously, we mentioned that adopting ChatGPT in an organization requires a language engineering toolbox to specify its expected input-output behavior and business rules. Formal Semantics can help to accurately represent human meanings in policies and business rules.

One way to apply the language engineering toolbox is to extract a prompt's semantic structure and then apply a fairness rule based on ethical and bias-sensitive attributes, as sketched below.
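A minimal Coq sketch of such an inquiry (every name here, such as Prompt, mentions, and decision_depends_on, is a hypothetical illustration rather than part of an actual toolbox):

(* Hypothetical sketch: a prompt's semantic structure mentions attributes;
   fairness forbids the decision from depending on bias-sensitive ones. *)
Parameter Prompt Attribute : Type.
Parameter mentions : Prompt -> Attribute -> Prop.
Parameter bias_sensitive : Attribute -> Prop.
Parameter decision_depends_on : Prompt -> Attribute -> Prop.

Definition fair (p : Prompt) : Prop :=
  forall a : Attribute,
    mentions p a -> bias_sensitive a -> ~ decision_depends_on p a.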

Challenges

A Proof Assistant is necessary for expressing the formal semantics of human language, as only such a tool can accommodate dependent typing and multi-sorted type systems. After Montague Semantics (1974), the next one-step advance in formalizing human-language semantics came with "Formal Semantics in Modern Type Theories" (2019), in which the modeling of human-language meanings is accomplished with a Proof Assistant. The use of a Proof Assistant is thus unavoidable in this context.

Language engineering involves systematically exploring a vast space of valid prompts and gathering insights from ChatGPT's responses. To accomplish this, the Proof Assistant must provide access to stored language patterns and allow easy enumeration of structures.

The question is, how can we achieve this in the pure functional language used by a Proof Assistant? Contrast this with a procedural language like Python, where elementary data structures such as lists and dictionaries make it straightforward to express data with a large number of elements. For instance, a list in Python is a "first-class citizen" and may hold elements of the same type or of different types:


ls = [ "Alice", "Bob", "Charlie", ... ]        


We can then apply a function such as "def process(x: str):" to a list whose elements are all of type "str".

It may come as a surprise, but within a Proof Assistant, expressing distinct elements without type distinctions is strongly discouraged. One notable example is the standard definition of "list" in the Coq Standard Library, in which a list is defined recursively as one element consed onto a one-element-smaller list. In Coq, you can never define a list the Python way. If you insist on a Python-style literal, syntactic sugar in the Coq Standard Library can provide some utility.
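For reference, the Standard Library's definition is essentially the following:

(* From Coq.Init.Datatypes: a list is either empty (nil), or one element
   consed onto a one-element-smaller list. *)
Inductive list (A : Type) : Type :=
 | nil : list A
 | cons : A -> list A -> list A.

The bracket notation below is the sugar that recovers a Python-like literal syntax on top of cons and nil: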

Notation "[ x ; y ; .. ; z ]" := (cons x (cons y .. (cons z nil) ..))??(format "[ '[' x ; '/' y ; '/' .. ; '/' z ']' ]") : list_scope.        

If your application requires handling a significant amount of data that lacks clear type distinctions, a reasoning system like a Proof Assistant may not be the ideal tool: such systems are primarily designed for rule-crunching, not for storing and enumerating this kind of data.

When it comes to language engineering, it is necessary to store the entire vocabulary of a human language and to enumerate potential synthetic sentences. In other words, our application requires precisely the ability to handle a large amount of data that lacks clear type distinctions.

Solution

In "Using signatures in type theory to represent situations", Prof. Zhao et al. presented several useful design patterns for representing "situations" in a language world with pre-defined knowledge, including vocabularies and facts. The concept of "signatures" in Proof Assistant's proof context is used to define global constants. One specific pattern introduced by the authors is called "Manifests" with Sub-typing, which can be thought of as a placeholder or syntax sugar that enables easy access to distinct types, which provide a concise symbolic definition of the "situations" that need to be expressed.

Consider a small language for the purpose of illustration. Suppose that this hypothetical language consists of only two nouns and one adjective: the nouns are "intellectual property" and "chatgpt," and the adjective "interesting" modifies them. For the sake of clarity, we will use the semantic marker "Concept" to represent these two nouns; "Concept" itself is not included in the vocabulary.

We can demonstrate with working code (verifiable at https://coq.vercel.app/scratchpad.html) that a Concept can be referred to as "any," or as "concept_0" and "concept_1," in phrasal examples such as "interesting any" or "interesting concept_1."

The manifests "any," "concept_0," and "concept_1" are not given full meanings at the time of their declaration (hence the name "manifests"). Instead, their meanings are determined in the context where they are used, through the coercive subtyping that binds them. This allows a more dynamic and flexible approach to defining language concepts and their relationships.

It is worth mentioning that the mapping between the finite set of manifests, Finite_2, and the corresponding concrete constructors of "Concept" is achieved through tactical programming in Coq's Ltac. This is an improvement over the original design proposed in the paper, which utilized the standard "Fin(n)" and required additional introduction rules. With this approach, the code for mapping manifests onto a specific set of elements is simpler, making the system more efficient and easier to work with.
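For contrast, here is roughly what the indexed alternative looks like with Coq's standard Fin.t (a sketch; the paper's actual encoding may differ):

Require Import Coq.Vectors.Fin.

(* Fin.t 2 is the canonical two-element index type. Its inhabitants play the
   role of concept_0 and concept_1, but every use threads an arity through
   the types. *)
Check (F1 : Fin.t 2).      (* position 0 of 2 *)
Check (FS F1 : Fin.t 2).   (* position 1 of 2 *)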

We prefer the names concept_0 and concept_1 to distinguish the constructors that stand for Concepts, rather than using indexed dependent typing, because indexed typing provides no additional benefit toward keeping the symbolic representation of generated languages concise. Generating programming code with "concept_*" hardcoded can be easily accomplished with tools such as Copilot or even GPT-3.

(* The domain-specific vocabulary: the two nouns of our toy language. *)
Inductive Concept :=
| Intellectual_Property
| ChatGPT
.

(* A manifest standing for an arbitrary concept. *)
Inductive Placeholder : Set :=
  any.

(* Manifests standing for the concepts, by position. *)
Inductive Finite_2 :=
| concept_0
| concept_1
.

(* Bind "any" to a Concept; the constructor tactic picks the first
   applicable constructor. *)
Definition placeholder_concept (r : Placeholder) : Concept.
  constructor. Defined.
Coercion placeholder_concept : Placeholder >-> Concept.

Parameter interesting : Concept -> Prop.

(* Map concept_0 and concept_1 onto the constructors of Concept, by position. *)
Definition finite_concept (r : Finite_2) : Concept.
  destruct r.
  constructor 1. constructor 2. Defined.
Coercion finite_concept : Finite_2 >-> Concept.

Check interesting any.
Check interesting concept_1.
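Once the manifests live in an ordinary inductive type, enumerating synthetic phrases over them is plain list processing. A small sketch, assuming the listing above has been loaded:

Require Import List.
Import ListNotations.

Definition all_manifests : list Finite_2 := [concept_0; concept_1].

(* Each element is coerced through finite_concept, yielding the propositions
   "interesting concept_0" and "interesting concept_1". *)
Check map (fun c : Finite_2 => interesting c) all_manifests.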

The advantage of using the "manifest" reference rather than a direct reference is that it allows for the "decoupling" of the language engineering toolbox. This means that the domain-specific vocabulary can be separated from the domain-agnostic processes, which can improve the overall flexibility and modularity of the language engineering system.
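Concretely, a domain-agnostic process can be written against the manifests alone, while the domain-specific vocabulary stays hidden behind the coercions. A sketch (the name synthesize is hypothetical):

(* Domain-agnostic: works for any predicate over the vocabulary; the manifest
   is coerced into Concept only at the point of use. *)
Definition synthesize (f : Concept -> Prop) (c : Finite_2) : Prop := f c.

Check synthesize interesting concept_0.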

Summary

By solving this scalability issue, we can now design software components for language engineering that effectively generate large numbers of valid synthetic human-language sentences with rule-defined behaviors. This will enable us to use language engineering in practical applications such as automated semantic tests for ChatGPT.
