Using Gherkin as an SCoT Prompt with o1-preview (Strawberry)
I've been using User Stories and Gherkin with AI for a while now, and the results have come along nicely in terms of consistent outputs that match what I'm looking for. I've had good luck getting models to work with these formats in different ways, primarily using GPT models.
With the release of o1-preview, it was nice to see the level of detail this prompt captured; with minimal prompting, the model understood what I was looking for: describe code in Gherkin and vice versa.
I wanted to see if o1-preview could repurpose Gherkin as a Structured Chain-of-Thought (SCoT) prompt for use as a boilerplate to generate code. A script in and of itself could do this; however, I like the natural language that Gherkin uses, and I'm already familiar with it from automated testing and human-legible requirements.
I don't think this is quite as optimal as a purpose-built SCoT prompt, because an SCoT prompt can address complex functions in a more granular and specific way. If I ask the model to repurpose a syntax, I need to give it a context that fits. In this case, I needed the model to not think in terms of Behavior-Driven Development (BDD), because the feature files would be constructed much differently in the output. I found that it understood how I wanted SCoT applied, and the Gherkin it created worked well.
Overall Experience
With minimal prompting, I watched the o1 model apply SCoT structure to Gherkin syntax and produce a very descriptive, uniform Gherkin file as an example. I was happy with what appeared to be an easy understanding of the task without haggling.
I was able to use the Gherkin feature file as a boilerplate to generate the same script in different languages, with the resulting code requiring only a few changes to work.
With this release I was hoping for more time to work with the model, but overall I got done in a couple of hours what might have taken a couple of days of intermittent use, haggling, and copying out the parts I could use.
It will be interesting to check the consistency and quality of code generated over multiple iterations from a single source of truth with o1.
Prior to this release, I hadn't been able to get this level of quality with this particular approach.
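When I revisit that consistency check, the sketch below is roughly how I'd script the round trip with the OpenAI Python client. The model name, message framing, and file path are illustrative placeholders, not what I actually ran in the chat sessions.
```python
# Rough sketch of automating the code -> Gherkin -> code round trip so the
# regenerated script can be compared against the original across iterations.
# Model name, prompt wording, and file path are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

SCOT_INSTRUCTIONS = (
    "You will no longer understand Gherkin syntax for use in BDD. Rather, "
    "consider the full syntax for its logic and model it to be used as a "
    "Structured Chain of Thought prompt."
)

def ask(task: str, content: str) -> str:
    # o1-preview accepts plain user messages, so the instructions go inline.
    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": f"{SCOT_INSTRUCTIONS}\n\n{task}\n\n{content}"}],
    )
    return response.choices[0].message.content

with open("llm_evaluation.py") as f:
    source_code = f.read()

# Pass 1: code -> Gherkin (the single source of truth)
gherkin = ask("Encapsulate this code in Gherkin:", source_code)

# Pass 2: Gherkin -> code, which can then be diffed against the original
regenerated = ask("Generate the exact code this Gherkin describes:", gherkin)
print(regenerated)
```
Diffing the regenerated script against the original over several runs would give a rough read on how stable the Gherkin holds up as a single source of truth.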
Prompt
I sent o1-preview this prompt I put together:
You will no longer understand Gherkin syntax for use in BDD. Rather, consider the full syntax for its logic and then model it to be used as a Structured Chain of Thought prompt. Your primary responsibilities will be to read code and be able to thoroughly encapsulate it into Gherkin so that the Gherkin will be interpreted by an LLM as an SCoT to generate the exact same code.
Script
I used the script from this project: https://github.com/jadm11/LLM-iterative-completion-deepeval-test
That project includes a Gherkin file that was output by GPT-4o. Note the difference in the level of detail captured between the feature file there and the one here.
Code to Gherkin
I started a new session, sent the prompt, and then sent the raw script after it acknowledged. Here is the Gherkin it generated.
Feature: Custom LLM Evaluation Script

  Background:
    Given the standard libraries are imported:
      | Library |
      | os |
      | warnings |
      | logging |
    And the third-party libraries are imported:
      | Library |
      | openai.OpenAI |
      | deepeval.test_case.LLMTestCase |
      | sentence_transformers.SentenceTransformer |
      | sentence_transformers.util |
      | colorama.init |
      | colorama.Fore |
      | colorama.Style |
    And colorama is initialized with auto-reset enabled
    And warnings of category RuntimeWarning are suppressed
    And logging is configured with the following parameters:
      | Parameter | Value |
      | filename | "llm_evaluation.log" |
      | level | logging.DEBUG |
      | format | "%(asctime)s - %(levelname)s - %(message)s" |
    And the OpenAI API key is obtained from the environment variable "OPENAI_API_KEY"
    And the OpenAI client is initialized with the API key
    And the SentenceTransformer model "paraphrase-MiniLM-L6-v2" is loaded into embedding_model

  Scenario: Defining the fetch_response function
    Given a function fetch_response that accepts a prompt and context
    When the function calls client.chat.completions.create with the following parameters:
      | Parameter | Value |
      | model | "gpt-4o" |
      | messages | [{"role": "system", "content": f"Context: {context}"}, {"role": "user", "content": prompt}] |
      | max_tokens | 100 |
    Then the function returns the stripped content of the first message in response.choices

  Scenario: Defining the semantic_similarity function
    Given a function semantic_similarity that accepts actual_output, expected_outputs, and an optional threshold (default is 0.4)
    When the function encodes actual_output and expected_outputs using embedding_model.encode with convert_to_tensor=True
    And calculates cosine similarity scores using util.pytorch_cos_sim(actual_embedding, expected_embeddings)
    Then the function returns a tuple containing a boolean indicating if all scores meet or exceed the threshold and the cosine_scores

  Scenario: Creating the CustomLLMEvaluation class
    Given a class CustomLLMEvaluation with an __init__ method that accepts prompt, context, threshold (default 0.4), and use_dynamic_responses (default False)
    When the class is initialized
    Then it sets self.prompt, self.context, self.threshold, and self.use_dynamic_responses
    And it calls self._fetch_actual_output to obtain self.actual_output
    And it calls self._generate_expected_responses to obtain self.expected_responses
    And it calls self._initialize_test_case to create self.llm_test_case

  Scenario: Implementing _fetch_actual_output method
    Given the method self._fetch_actual_output
    When called
    Then it returns the result of fetch_response with self.prompt and self.context

  Scenario: Implementing _generate_expected_responses method
    Given the method self._generate_expected_responses
    When self.use_dynamic_responses is True
    Then it returns a list containing two responses obtained by calling fetch_response with self.prompt and self.context twice
    Else
    Then it returns a list containing the predefined responses:
      | Expected Response |
      | "To get to the other side." |
      | "Because its dopaminergic neurons fired synchronously across the synapses of its caudate nucleus, triggering motor contractions propelling the organism forward, to a goal predetermined by its hippocampal road mappings." |

  Scenario: Implementing _initialize_test_case method
    Given the method self._initialize_test_case
    When called
    Then it creates an instance of LLMTestCase with input=self.prompt, actual_output=self.actual_output, and expected_output=self.expected_responses[0]

  Scenario: Running the evaluation
    Given the method run_evaluation in CustomLLMEvaluation
    When called
    Then it checks if the LLM test passed by comparing stripped actual_output and expected_output
    And it calls semantic_similarity with actual_output, expected_responses, and threshold to obtain semantic_passed and cosine_scores
    And it determines overall_passed as the logical AND of llm_test_passed and semantic_passed
    And it calls self._generate_report with overall_passed and cosine_scores
    And it returns overall_passed

  Scenario: Generating the evaluation report
    Given the method _generate_report that accepts passed and cosine_scores
    When called
    Then it defines formatting styles using colorama constants
    And it constructs a report string containing:
      | Element | Content |
      | Test Report Title | Styled with BOLD and BLUE |
      | Separator Line | Styled with DIM |
      | Context | Displayed with CYAN label and BOLD value |
      | Dynamic Responses Enabled | Displayed with CYAN label and BOLD value |
      | Similarity Threshold | Displayed with CYAN label and BOLD value |
      | Cosine Scores | Displayed with CYAN label and BOLD value |
      | Input Prompt | Displayed with CYAN label and BOLD value |
      | Expected Responses | Listed with indices |
      | Actual Output | Displayed with CYAN label and BOLD value |
      | Result | Displayed with BOLD and colored based on pass/fail |
    And it prints the report

  Scenario: Executing the evaluation script
    Given the variables are set:
      | Variable | Value |
      | context | "Humor" |
      | use_dynamic_responses | True |
      | threshold | 0.5 |
      | prompt | "Why did the chicken cross the road?" |
    When an instance of CustomLLMEvaluation is created with these variables
    And run_evaluation is called on the instance
    Then the evaluation is performed
    And the report is generated and displayed
Gherkin to Code
I started a new session, sent the prompt, and then sent the Gherkin after it acknowledged. Here is the script it generated.
```python
# Standard libraries
import os
import warnings
import logging

# Third-party libraries
import openai
from deepeval.test_case import LLMTestCase
from sentence_transformers import SentenceTransformer, util
from colorama import init as colorama_init, Fore, Style

# Initialize colorama with auto-reset enabled
colorama_init(autoreset=True)

# Suppress warnings of category RuntimeWarning
warnings.filterwarnings("ignore", category=RuntimeWarning)

# Configure logging
logging.basicConfig(
    filename="llm_evaluation.log",
    level=logging.DEBUG,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

# Obtain the OpenAI API key from the environment variable "OPENAI_API_KEY"
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialize the OpenAI client with the API key
openai.api_key = openai_api_key

# Load the SentenceTransformer model "paraphrase-MiniLM-L6-v2" into embedding_model
embedding_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')


def fetch_response(prompt, context):
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": prompt}
        ],
        max_tokens=100
    )
    return response.choices[0].message.content.strip()


def semantic_similarity(actual_output, expected_outputs, threshold=0.4):
    actual_embedding = embedding_model.encode(actual_output, convert_to_tensor=True)
    expected_embeddings = embedding_model.encode(expected_outputs, convert_to_tensor=True)
    cosine_scores = util.pytorch_cos_sim(actual_embedding, expected_embeddings)
    scores = cosine_scores.squeeze().tolist()
    passed = all(score >= threshold for score in scores)
    return passed, scores


class CustomLLMEvaluation:
    def __init__(self, prompt, context, threshold=0.4, use_dynamic_responses=False):
        self.prompt = prompt
        self.context = context
        self.threshold = threshold
        self.use_dynamic_responses = use_dynamic_responses
        self.actual_output = self._fetch_actual_output()
        self.expected_responses = self._generate_expected_responses()
        self.llm_test_case = self._initialize_test_case()

    def _fetch_actual_output(self):
        return fetch_response(self.prompt, self.context)

    def _generate_expected_responses(self):
        if self.use_dynamic_responses:
            response1 = fetch_response(self.prompt, self.context)
            response2 = fetch_response(self.prompt, self.context)
            return [response1, response2]
        else:
            return [
                "To get to the other side.",
                "Because its dopaminergic neurons fired synchronously across the synapses of its caudate nucleus, triggering motor contractions propelling the organism forward, to a goal predetermined by its hippocampal road mappings."
            ]

    def _initialize_test_case(self):
        return LLMTestCase(
            input=self.prompt,
            actual_output=self.actual_output,
            expected_output=self.expected_responses[0]
        )

    def run_evaluation(self):
        llm_test_passed = self.actual_output.strip() == self.expected_responses[0].strip()
        semantic_passed, cosine_scores = semantic_similarity(
            self.actual_output, self.expected_responses, self.threshold
        )
        overall_passed = llm_test_passed and semantic_passed
        self._generate_report(overall_passed, cosine_scores)
        return overall_passed

    def _generate_report(self, passed, cosine_scores):
        BOLD = Style.BRIGHT
        DIM = Style.DIM
        BLUE = Fore.BLUE
        CYAN = Fore.CYAN
        GREEN = Fore.GREEN
        RED = Fore.RED

        report = []
        report.append(f"{BOLD}{BLUE}=== LLM Evaluation Test Report ==={Style.RESET_ALL}")
        report.append(f"{DIM}{'-'*40}{Style.RESET_ALL}")
        report.append(f"{CYAN}Context:{Style.RESET_ALL} {BOLD}{self.context}{Style.RESET_ALL}")
        report.append(f"{CYAN}Dynamic Responses Enabled:{Style.RESET_ALL} {BOLD}{self.use_dynamic_responses}{Style.RESET_ALL}")
        report.append(f"{CYAN}Similarity Threshold:{Style.RESET_ALL} {BOLD}{self.threshold}{Style.RESET_ALL}")
        report.append(f"{CYAN}Cosine Scores:{Style.RESET_ALL} {BOLD}{cosine_scores}{Style.RESET_ALL}")
        report.append(f"{CYAN}Input Prompt:{Style.RESET_ALL} {BOLD}{self.prompt}{Style.RESET_ALL}")
        report.append(f"{CYAN}Expected Responses:{Style.RESET_ALL}")
        for idx, response in enumerate(self.expected_responses):
            report.append(f" [{idx+1}] {response}")
        report.append(f"{CYAN}Actual Output:{Style.RESET_ALL} {BOLD}{self.actual_output}{Style.RESET_ALL}")
        result_color = GREEN if passed else RED
        report.append(f"{BOLD}{result_color}Test Result: {'Passed' if passed else 'Failed'}{Style.RESET_ALL}")
        print('\n'.join(report))


if __name__ == "__main__":
    # Set the variables
    context = "Humor"
    use_dynamic_responses = True
    threshold = 0.5
    prompt = "Why did the chicken cross the road?"

    # Create an instance of CustomLLMEvaluation
    evaluation = CustomLLMEvaluation(
        prompt=prompt,
        context=context,
        threshold=threshold,
        use_dynamic_responses=use_dynamic_responses
    )

    # Run the evaluation and generate the report
    evaluation.run_evaluation()
```
This script defines a custom LLM evaluation process based on your specifications. It imports the necessary libraries, sets up the environment, and defines functions and classes to perform the evaluation and generate a detailed report.
**Key Components:**
- **Imports and Setup:** Imports standard and third-party libraries, initializes colorama, suppresses warnings, configures logging, sets up the OpenAI client, and loads the embedding model.
- **`fetch_response` Function:** Fetches a response from the OpenAI API based on a given prompt and context.
- **`semantic_similarity` Function:** Calculates the semantic similarity between the actual output and expected outputs using cosine similarity.
- **`CustomLLMEvaluation` Class:** Encapsulates the evaluation logic, including fetching actual outputs, generating expected responses, initializing test cases, running the evaluation, and generating reports.
- **Execution Block:** Sets up the variables, creates an instance of `CustomLLMEvaluation`, and runs the evaluation.
**Usage:**
1. Ensure all the required packages are installed:
```bash
pip install openai sentence-transformers colorama
```
2. Set your OpenAI API key in the environment variable `OPENAI_API_KEY`:
```bash
export OPENAI_API_KEY='your-api-key-here'
```
3. Run the script:
```bash
python your_script_name.py
```
**Note:** Replace `'your-api-key-here'` with your actual OpenAI API key.
Conversion Comparison
In Chat one, I'm converting code to Gherkin.
In Chat two, I'm converting the Gherkin to code.
Code Test
Gherkin to Python
I couldn't run the script at first because the generated code didn't use the OpenAI Python library API that matched the version on my machine (1.45), and when I rolled the library back to openai<=1.0.0 there were still some issues I had to resolve manually. The script worked with only a few changes.
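For reference, the call against the current 1.x client interface looks roughly like the sketch below; it mirrors the model and parameters from the generated script and is meant as an illustration, not the exact change I made.
```python
# Client-based interface used by openai>=1.0; the generated script called the
# legacy module-level openai.ChatCompletion.create instead.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fetch_response(prompt, context):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": prompt},
        ],
        max_tokens=100,
    )
    return response.choices[0].message.content.strip()
```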
Gherkin to Ruby
I used the same prompt to generate a Ruby script. I ran into a similar issue: I had to update my local Ruby and change the generated script to use the right syntax for calling the OpenAI API.
Here's the Ruby script produced: https://github.com/jadm11/LLM-iterative-completion-deepeval-test/blob/development/llm_evaluation.rb
Gherkin to JavaScript
I didn't get a chance to run this one because I hit my o1-preview usage limit.
Use Cases
I asked the model to generate use cases in a separate chat, and it came back with some thoughtful ideas. Personally, I like the portability of Gherkin and could envision it being used to scale different types of code across different environments with structured intent.