Prompt Engineering Science Report: Key Takeaways
Photo by Kaitlyn Baker on Unsplash

Currently I am delivering a unit on the Foundations of Generative AI to my grade 8 students. The unit provides some great resources for teaching students the basic principles and key terms of generative AI, equipping them with foundational knowledge.

This content was created recently by the code.org team, and it is quite interesting to see how students react to it and engage with learning more about AI. I will write more about my impressions of implementing it toward the end of this semester.

When we talk about using GenAI tools like ChatGPT or Kimi, we should always focus first on the input, that is, on prompt writing. In my work with colleagues and students, I always offer various strategies and let them experiment and decide which works best for them.

There have been many guides to crafting the perfect prompt over the past few years, but new research by Ethan Mollick and colleagues highlights just how unpredictable this process can be.

Their latest report, Prompt Engineering is Complicated and Contingent, reveals key insights that challenge our assumptions about prompting AI models like GPT-4o.

Key Takeaways:

  • Benchmarking AI is tricky. There’s no universal standard for evaluating AI performance. The way we define success (e.g., requiring 100% accuracy vs. just 51%) can significantly impact how well an AI appears to perform.
  • Prompting strategies are not universal. Simple tweaks, like adding polite phrasing (“Please answer...”) or commands (“I order you to answer...”), sometimes help—but other times hurt—AI performance. The effectiveness of a prompt varies depending on the specific question.
  • Formatting matters. Structured prompts with clear answer formatting tend to improve AI performance. Removing these elements can lower accuracy.
  • AI is inconsistent. The same AI model asked the same question 100 times can yield different results. Traditional single-response benchmarks may overestimate AI reliability (a minimal sketch of this idea follows the list).
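
To make the last two points concrete, here is a minimal Python sketch of asking a model the same question many times and then judging the result against two different success criteria. The query_model stand-in, the question, and the grading rule are my own illustrative assumptions, not code or data from the report.

```python
import random

# A minimal stand-in for a chat-model call. The name query_model and the
# simulated answers are illustrative assumptions, not code from the report;
# in practice you would call your model of choice here.
def query_model(prompt: str) -> str:
    # Real models are stochastic at non-zero temperature; we simulate that so
    # the sketch runs without an API key.
    return random.choice(["42", "42", "41"])  # correct roughly 2 out of 3 times

def trial_accuracy(prompt: str, correct: str, n: int = 100) -> float:
    """Ask the same question n times and return the fraction answered correctly."""
    hits = sum(query_model(prompt).strip() == correct for _ in range(n))
    return hits / n

acc = trial_accuracy("What is 6 * 7? Answer with a number only.", correct="42")

# The same run can pass a lenient benchmark and fail a strict one.
print(f"accuracy over 100 trials: {acc:.0%}")
print("passes 'majority correct' (>=51%):", acc >= 0.51)
print("passes 'always correct' (100%):   ", acc == 1.0)
```

Run twice, the script can report different accuracies, which is exactly why a single response tells you little about how reliable a model really is.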

What This Means for Educators & AI Users

This research reinforces that there’s no magic prompt that works across all situations. Instead, effective AI use requires experimentation, iteration, and an understanding of context. If you rely on AI for educational or professional tasks, it’s crucial to:

  • Test different prompt structures to find what works best for your specific use case (see the sketch after this list).
  • Set clear success criteria when evaluating AI responses.
  • Expect variability and cross-check AI-generated outputs rather than assuming perfection.
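
As a small sketch of the first two points, the snippet below runs several prompt variants through the same evaluation loop with an explicit success criterion. The variant phrasings echo the polite vs. commanding wording the report describes, but the harness, the question, and the evaluate helper are hypothetical examples for illustration, not tooling from the research.

```python
# Illustrative prompt variants; the harness is an assumption, not the
# authors' evaluation code.
VARIANTS = {
    "plain":   "What is the capital of Australia? Answer with one word.",
    "polite":  "Please answer: what is the capital of Australia? One word only.",
    "command": "I order you to answer: what is the capital of Australia? One word.",
}

def evaluate(query_fn, correct: str = "Canberra", n: int = 20) -> dict:
    """Run each variant n times against a clear success criterion (exact match)."""
    results = {}
    for name, prompt in VARIANTS.items():
        answers = [query_fn(prompt).strip() for _ in range(n)]
        results[name] = sum(a == correct for a in answers) / n
    return results

if __name__ == "__main__":
    # query_fn would wrap your model of choice; a trivial stub keeps the
    # sketch self-contained and runnable.
    def stub(prompt: str) -> str:
        return "Canberra"

    print(evaluate(stub))
```

Deciding up front what counts as a correct answer, and running each variant more than once, is what turns prompt tinkering into something you can actually compare.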

AI is a powerful tool, but like any tool, how we use it makes all the difference.

--

Have you noticed variations in AI responses based on different prompts?

Let’s discuss in the comments!

--

Read the research paper here.
