Using Gherkin as an SCoT Prompt with o1-preview (Strawberry)
I've been using User Stories and Gherkin with AI for a while now, and the results have come along nicely in terms of consistent outputs that match what I'm looking for. I've had good luck getting models to work with these formats in different ways, primarily using GPT models.
With the release of o1-preview, it was nice to see the level of detail this prompt captured; with minimal prompting, the model understood what I was looking for: describe code in Gherkin and vice versa.
I wanted to see if o1-preview could repurpose Gherkin as a Structured Chain-of-Thought (SCoT) prompt for use as a boilerplate to generate code. A script in and of itself could do this; however, I like the natural language that Gherkin uses, and I'm already familiar with it from automated testing and human-legible requirements.
I don't think this is quite as optimal as a purpose-built SCoT prompt, because an SCoT prompt can address complex functions in a more granular and specific way. If I ask the model to repurpose a syntax, I need to give it a context that fits. In this case, I needed the model to not think in terms of Behavior-Driven Development (BDD), because the feature files would be constructed much differently in the output. I found that it understood how I wanted SCoT applied, and the Gherkin it created worked well.
Overall Experience
With minimal prompting, I watched the o1 model apply SCoT structure to Gherkin syntax and produce a very descriptive, uniform Gherkin file as an example. I was happy with what appeared to be an easy understanding of the task without haggling.
I was able to use the Gherkin feature file as a boilerplate to generate the same script in different languages, with the resulting code requiring only a few changes to work.
With this release I was hoping for more time to work with the model, but overall I got done in a couple of hours what might have taken a couple of days of intermittent use, haggling, and copying out the parts I could use.
It will be interesting to check the consistency and quality of code generated over multiple iterations from a single source of truth with o1.
Prior to this release, I hadn't been able to get this level of quality with this particular approach.
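When I revisit that consistency check, the sketch below is roughly how I'd script the round trip with the OpenAI Python client. The model name, message framing, and file path are illustrative placeholders, not what I actually ran in the chat sessions.
```python
# Rough sketch of automating the code -> Gherkin -> code round trip so the
# regenerated script can be compared against the original across iterations.
# Model name, prompt wording, and file path are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

SCOT_INSTRUCTIONS = (
    "You will no longer understand Gherkin syntax for use in BDD. Rather, "
    "consider the full syntax for its logic and model it to be used as a "
    "Structured Chain of Thought prompt."
)

def ask(task: str, content: str) -> str:
    # o1-preview accepts plain user messages, so the instructions go inline.
    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": f"{SCOT_INSTRUCTIONS}\n\n{task}\n\n{content}"}],
    )
    return response.choices[0].message.content

with open("llm_evaluation.py") as f:
    source_code = f.read()

# Pass 1: code -> Gherkin (the single source of truth)
gherkin = ask("Encapsulate this code in Gherkin:", source_code)

# Pass 2: Gherkin -> code, which can then be diffed against the original
regenerated = ask("Generate the exact code this Gherkin describes:", gherkin)
print(regenerated)
```
Diffing the regenerated script against the original over several runs would give a rough read on how stable the Gherkin holds up as a single source of truth.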
Prompt
I sent o1-preview this prompt I put together:
You will no longer understand Gherkin syntax for use in BDD. Rather, consider the full syntax for its logic and then model it to be used as a Structured Chain of Thought prompt. Your primary responsibilities will be to read code and be able to thoroughly encapsulate it into Gherkin so that the Gherkin will be interpreted by an LLM as an SCoT to generate the exact same code.
Script
I used the script from this project: https://github.com/jadm11/LLM-iterative-completion-deepeval-test
That project includes a Gherkin file that was output by GPT-4o. Note the difference in the level of detail captured between the feature file there and the one here.
Code to Gherkin
I started a new session, sent the prompt, and then sent the raw script after it acknowledged. Here is the Gherkin it generated.
Feature: Custom LLM Evaluation Script

  Background:
    Given the standard libraries are imported:
      | Library |
      | os |
      | warnings |
      | logging |
    And the third-party libraries are imported:
      | Library |
      | openai.OpenAI |
      | deepeval.test_case.LLMTestCase |
      | sentence_transformers.SentenceTransformer |
      | sentence_transformers.util |
      | colorama.init |
      | colorama.Fore |
      | colorama.Style |
    And colorama is initialized with auto-reset enabled
    And warnings of category RuntimeWarning are suppressed
    And logging is configured with the following parameters:
      | Parameter | Value |
      | filename | "llm_evaluation.log" |
      | level | logging.DEBUG |
      | format | "%(asctime)s - %(levelname)s - %(message)s" |
    And the OpenAI API key is obtained from the environment variable "OPENAI_API_KEY"
    And the OpenAI client is initialized with the API key
    And the SentenceTransformer model "paraphrase-MiniLM-L6-v2" is loaded into embedding_model

  Scenario: Defining the fetch_response function
    Given a function fetch_response that accepts a prompt and context
    When the function calls client.chat.completions.create with the following parameters:
      | Parameter | Value |
      | model | "gpt-4o" |
      | messages | [{"role": "system", "content": f"Context: {context}"}, {"role": "user", "content": prompt}] |
      | max_tokens | 100 |
    Then the function returns the stripped content of the first message in response.choices

  Scenario: Defining the semantic_similarity function
    Given a function semantic_similarity that accepts actual_output, expected_outputs, and an optional threshold (default is 0.4)
    When the function encodes actual_output and expected_outputs using embedding_model.encode with convert_to_tensor=True
    And calculates cosine similarity scores using util.pytorch_cos_sim(actual_embedding, expected_embeddings)
    Then the function returns a tuple containing a boolean indicating if all scores meet or exceed the threshold and the cosine_scores

  Scenario: Creating the CustomLLMEvaluation class
    Given a class CustomLLMEvaluation with an __init__ method that accepts prompt, context, threshold (default 0.4), and use_dynamic_responses (default False)
    When the class is initialized
    Then it sets self.prompt, self.context, self.threshold, and self.use_dynamic_responses
    And it calls self._fetch_actual_output to obtain self.actual_output
    And it calls self._generate_expected_responses to obtain self.expected_responses
    And it calls self._initialize_test_case to create self.llm_test_case

  Scenario: Implementing _fetch_actual_output method
    Given the method self._fetch_actual_output
    When called
    Then it returns the result of fetch_response with self.prompt and self.context

  Scenario: Implementing _generate_expected_responses method
    Given the method self._generate_expected_responses
    When self.use_dynamic_responses is True
    Then it returns a list containing two responses obtained by calling fetch_response with self.prompt and self.context twice
    Else
    Then it returns a list containing the predefined responses:
      | Expected Response |
      | "To get to the other side." |
      | "Because its dopaminergic neurons fired synchronously across the synapses of its caudate nucleus, triggering motor contractions propelling the organism forward, to a goal predetermined by its hippocampal road mappings." |

  Scenario: Implementing _initialize_test_case method
    Given the method self._initialize_test_case
    When called
    Then it creates an instance of LLMTestCase with input=self.prompt, actual_output=self.actual_output, and expected_output=self.expected_responses[0]

  Scenario: Running the evaluation
    Given the method run_evaluation in CustomLLMEvaluation
    When called
    Then it checks if the LLM test passed by comparing stripped actual_output and expected_output
    And it calls semantic_similarity with actual_output, expected_responses, and threshold to obtain semantic_passed and cosine_scores
    And it determines overall_passed as the logical AND of llm_test_passed and semantic_passed
    And it calls self._generate_report with overall_passed and cosine_scores
    And it returns overall_passed

  Scenario: Generating the evaluation report
    Given the method _generate_report that accepts passed and cosine_scores
    When called
    Then it defines formatting styles using colorama constants
    And it constructs a report string containing:
      | Element | Content |
      | Test Report Title | Styled with BOLD and BLUE |
      | Separator Line | Styled with DIM |
      | Context | Displayed with CYAN label and BOLD value |
      | Dynamic Responses Enabled | Displayed with CYAN label and BOLD value |
      | Similarity Threshold | Displayed with CYAN label and BOLD value |
      | Cosine Scores | Displayed with CYAN label and BOLD value |
      | Input Prompt | Displayed with CYAN label and BOLD value |
      | Expected Responses | Listed with indices |
      | Actual Output | Displayed with CYAN label and BOLD value |
      | Result | Displayed with BOLD and colored based on pass/fail |
    And it prints the report

  Scenario: Executing the evaluation script
    Given the variables are set:
      | Variable | Value |
      | context | "Humor" |
      | use_dynamic_responses | True |
      | threshold | 0.5 |
      | prompt | "Why did the chicken cross the road?" |
    When an instance of CustomLLMEvaluation is created with these variables
    And run_evaluation is called on the instance
    Then the evaluation is performed
    And the report is generated and displayed
Gherkin to Code
I started a new session, sent the prompt, and then sent the Gherkin after it acknowledged. Here is the script it generated.
```python
# Standard libraries
import os
import warnings
import logging

# Third-party libraries
import openai
from deepeval.test_case import LLMTestCase
from sentence_transformers import SentenceTransformer, util
from colorama import init as colorama_init, Fore, Style

# Initialize colorama with auto-reset enabled
colorama_init(autoreset=True)

# Suppress warnings of category RuntimeWarning
warnings.filterwarnings("ignore", category=RuntimeWarning)

# Configure logging
logging.basicConfig(
    filename="llm_evaluation.log",
    level=logging.DEBUG,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

# Obtain the OpenAI API key from the environment variable "OPENAI_API_KEY"
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialize the OpenAI client with the API key
openai.api_key = openai_api_key

# Load the SentenceTransformer model "paraphrase-MiniLM-L6-v2" into embedding_model
embedding_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')


def fetch_response(prompt, context):
    response = openai.ChatCompletion.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": prompt}
        ],
        max_tokens=100
    )
    return response.choices[0].message.content.strip()


def semantic_similarity(actual_output, expected_outputs, threshold=0.4):
    actual_embedding = embedding_model.encode(actual_output, convert_to_tensor=True)
    expected_embeddings = embedding_model.encode(expected_outputs, convert_to_tensor=True)
    cosine_scores = util.pytorch_cos_sim(actual_embedding, expected_embeddings)
    scores = cosine_scores.squeeze().tolist()
    passed = all(score >= threshold for score in scores)
    return passed, scores


class CustomLLMEvaluation:
    def __init__(self, prompt, context, threshold=0.4, use_dynamic_responses=False):
        self.prompt = prompt
        self.context = context
        self.threshold = threshold
        self.use_dynamic_responses = use_dynamic_responses
        self.actual_output = self._fetch_actual_output()
        self.expected_responses = self._generate_expected_responses()
        self.llm_test_case = self._initialize_test_case()

    def _fetch_actual_output(self):
        return fetch_response(self.prompt, self.context)

    def _generate_expected_responses(self):
        if self.use_dynamic_responses:
            response1 = fetch_response(self.prompt, self.context)
            response2 = fetch_response(self.prompt, self.context)
            return [response1, response2]
        else:
            return [
                "To get to the other side.",
                "Because its dopaminergic neurons fired synchronously across the synapses of its caudate nucleus, triggering motor contractions propelling the organism forward, to a goal predetermined by its hippocampal road mappings."
            ]

    def _initialize_test_case(self):
        return LLMTestCase(
            input=self.prompt,
            actual_output=self.actual_output,
            expected_output=self.expected_responses[0]
        )

    def run_evaluation(self):
        llm_test_passed = self.actual_output.strip() == self.expected_responses[0].strip()
        semantic_passed, cosine_scores = semantic_similarity(
            self.actual_output, self.expected_responses, self.threshold
        )
        overall_passed = llm_test_passed and semantic_passed
        self._generate_report(overall_passed, cosine_scores)
        return overall_passed

    def _generate_report(self, passed, cosine_scores):
        BOLD = Style.BRIGHT
        DIM = Style.DIM
        BLUE = Fore.BLUE
        CYAN = Fore.CYAN
        GREEN = Fore.GREEN
        RED = Fore.RED

        report = []
        report.append(f"{BOLD}{BLUE}=== LLM Evaluation Test Report ==={Style.RESET_ALL}")
        report.append(f"{DIM}{'-'*40}{Style.RESET_ALL}")
        report.append(f"{CYAN}Context:{Style.RESET_ALL} {BOLD}{self.context}{Style.RESET_ALL}")
        report.append(f"{CYAN}Dynamic Responses Enabled:{Style.RESET_ALL} {BOLD}{self.use_dynamic_responses}{Style.RESET_ALL}")
        report.append(f"{CYAN}Similarity Threshold:{Style.RESET_ALL} {BOLD}{self.threshold}{Style.RESET_ALL}")
        report.append(f"{CYAN}Cosine Scores:{Style.RESET_ALL} {BOLD}{cosine_scores}{Style.RESET_ALL}")
        report.append(f"{CYAN}Input Prompt:{Style.RESET_ALL} {BOLD}{self.prompt}{Style.RESET_ALL}")
        report.append(f"{CYAN}Expected Responses:{Style.RESET_ALL}")
        for idx, response in enumerate(self.expected_responses):
            report.append(f" [{idx+1}] {response}")
        report.append(f"{CYAN}Actual Output:{Style.RESET_ALL} {BOLD}{self.actual_output}{Style.RESET_ALL}")
        result_color = GREEN if passed else RED
        report.append(f"{BOLD}{result_color}Test Result: {'Passed' if passed else 'Failed'}{Style.RESET_ALL}")
        print('\n'.join(report))


if __name__ == "__main__":
    # Set the variables
    context = "Humor"
    use_dynamic_responses = True
    threshold = 0.5
    prompt = "Why did the chicken cross the road?"

    # Create an instance of CustomLLMEvaluation
    evaluation = CustomLLMEvaluation(
        prompt=prompt,
        context=context,
        threshold=threshold,
        use_dynamic_responses=use_dynamic_responses
    )

    # Run the evaluation and generate the report
    evaluation.run_evaluation()
```
This script defines a custom LLM evaluation process based on your specifications. It imports the necessary libraries, sets up the environment, and defines functions and classes to perform the evaluation and generate a detailed report.
**Key Components:**
- **Imports and Setup:** Imports standard and third-party libraries, initializes colorama, suppresses warnings, configures logging, sets up the OpenAI client, and loads the embedding model.
- **`fetch_response` Function:** Fetches a response from the OpenAI API based on a given prompt and context.
- **`semantic_similarity` Function:** Calculates the semantic similarity between the actual output and expected outputs using cosine similarity.
- **`CustomLLMEvaluation` Class:** Encapsulates the evaluation logic, including fetching actual outputs, generating expected responses, initializing test cases, running the evaluation, and generating reports.
- **Execution Block:** Sets up the variables, creates an instance of `CustomLLMEvaluation`, and runs the evaluation.
**Usage:**
1. Ensure all the required packages are installed:
```bash
pip install openai sentence-transformers colorama
```
2. Set your OpenAI API key in the environment variable `OPENAI_API_KEY`:
```bash
export OPENAI_API_KEY='your-api-key-here'
```
3. Run the script:
```bash
python your_script_name.py
```
**Note:** Replace `'your-api-key-here'` with your actual OpenAI API key.
Conversion Comparison
In Chat one, I'm converting code to Gherkin.
In Chat two, I'm converting the Gherkin to code.
Code Test
Gherkin to Python
I couldn't run the script at first because the generated code didn't use the OpenAI Python library API that matched the version on my machine (1.45), and when I rolled the library back to openai<=1.0.0 there were still some issues I had to resolve manually. The script worked with only a few changes.
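For reference, the call against the current 1.x client interface looks roughly like the sketch below; it mirrors the model and parameters from the generated script and is meant as an illustration, not the exact change I made.
```python
# Client-based interface used by openai>=1.0; the generated script called the
# legacy module-level openai.ChatCompletion.create instead.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fetch_response(prompt, context):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": prompt},
        ],
        max_tokens=100,
    )
    return response.choices[0].message.content.strip()
```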
Gherkin to Ruby
I used the same prompt to generate a Ruby script. I ran into a similar issue: I had to update my local Ruby and change the generated script to use the right syntax for calling the OpenAI API.
Here's the Ruby script produced: https://github.com/jadm11/LLM-iterative-completion-deepeval-test/blob/development/llm_evaluation.rb
Gherkin to JavaScript
I didn't get a chance to run this one because I hit my o1-preview usage limit.
Use Cases
I asked the model to generate use cases in a separate chat, and it came back with some thoughtful ideas. Personally, I like the portability of Gherkin and could envision it being used to scale different types of code across different environments with structured intent.