Sourcing/Boolean Search Test: OpenAI o1-preview vs. 4o w/Chain-of-Thought Prompt

Can generative AI think processes through like a human expert?

Perhaps you've heard about the buzz generated by OpenAI's release of their o1-preview:

We've developed a new series of AI models designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models...

I was very curious to start testing o1-preview to understand its reasoning capabilities and compare it to my own efforts with ChatGPT (and Claude, Gemini, etc.) using Chain-of-Thought prompting.

Chain-of-Thought Prompting

I originally heard about Chain-of-Thought prompting not long after OpenAI released ChatGPT in late November 2022, when I wanted to learn how to create more effective prompts.

Chain-of-Thought (CoT) prompting is a technique used to improve the performance of large language models (LLMs) on more complex reasoning tasks. It can involve providing an LLM with one or more examples of how to solve a problem, showing step-by-step reasoning, or you can explicitly provide the LLM with a structured thought process to follow. After you execute a CoT prompt, the LLM will typically show its work by breaking down its thought process into step-by-step reasoning.
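To make this concrete, here is a minimal sketch of sending a Chain-of-Thought-style prompt to an LLM programmatically. It assumes the OpenAI Python SDK (v1+) with an OPENAI_API_KEY in your environment; the model name and the abbreviated prompt text are illustrative assumptions, not a definitive implementation.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# An abbreviated CoT-style prompt: the reasoning steps are spelled out for the model.
cot_prompt = (
    "Act as an expert-level sourcer and write a Boolean search to find forklift operator candidates. "
    "First, include every way someone could express forklift experience in skill, title, and experience fields. "
    "Second, account for one-word and two-word variants (forklift OR \"fork lift\"). "
    "Third, include equipment variations (fork truck, lift truck, reach truck, etc.). "
    "Fourth, consider verb forms and synonyms related to forklift operation and driving."
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any capable chat model works here
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)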

I found this to be a fascinating concept - it certainly seemed valuable. However, every example I came across back then was focused on solving math problems, and to be perfectly honest, I struggled to find a way to leverage and test Chain-of-Thought prompting within the sourcing/recruiting domain. That is, until I had a breakthrough moment.

When I train sourcers and recruiters on creating effective and maximally inclusive Boolean search strings for sourcing talent, I break the process down into a series of steps, explaining my thought process/approach. There was a moment when I (finally!) realized that all I needed to do was capture this process and use it in a prompt.

This ah-ha moment led me to create a Chain-of-Thought prompt designed to be used with any LLM to help create maximally inclusive Boolean search strings - searches that find both a higher quantity and quality of relevant results, minimizing or eliminating the "dark matter" that exists in all sources (people who are actually qualified for your role, but are unintentionally excluded from results due to the nature of the search string).

Boolean Search: Chain-of-Thought Prompt

So what does this look like in practice?

Here is a basic example of a Chain-of-Thought prompt to create a maximally inclusive Boolean search string for forklift operators:

Act as an expert-level sourcer and write a Boolean search to find forklift operator candidates.
First, realize that not every person who has forklift operation experience would use the exact terms of “forklift operator,” so be sure to include as many ways as possible that someone could express that they have experience operating forklifts, in skill, job title, and experience fields within resumes.
Second, realize that people can mention terms like forklift as one word or two words (e.g., forklift OR “fork lift”), and be sure to incorporate all variations.
Third, realize that there are many different ways that people can represent forklift experience. For example, fork truck, lift truck, reach truck, etc. Be sure to include as many ways people can express forklift equipment in their resume.
Fourth, be sure to consider various verb forms and synonyms related to forklift operation and driving.

When you run this prompt in ChatGPT 4o, it will explain how it followed the Chain-of-Thought process for creating a maximally inclusive Boolean search string.

Here is the Boolean search created by ChatGPT 4o from the CoT prompt:

("forklift operator" OR "fork lift operator" OR "forklift driver" OR "fork lift driver" OR "forklift technician" OR "forklift handler" OR "forklift worker" OR "forklift associate" OR "forklift certified" OR "forklift certification" OR "forklift license" OR "fork truck operator" OR "fork truck driver" OR "fork truck handler" OR "fork truck technician" OR "lift truck operator" OR "lift truck driver" OR "reach truck operator" OR "reach truck driver" OR "reach truck technician" OR "counterbalance forklift operator" OR "counterbalance forklift driver" OR "counterbalance operator" OR "material handler" OR "warehouse associate" OR "pallet jack operator" OR "powered industrial truck operator" OR "industrial truck driver" OR "material mover" OR "distribution associate" OR "warehouse equipment operator" OR "pallet stacker" OR ("forklift" OR "fork lift" OR "fork truck" OR "lift truck" OR "reach truck" OR "counterbalance" OR "pallet jack" OR "reach truck" OR "stand up forklift" OR "order picker" OR "order selector" OR "dock worker") AND ("operate" OR "operated" OR "operating" OR "operation" OR "driving" OR "drive" OR "driven" OR "moved" OR "moving" OR "handled" OR "handling" OR "load" OR "loading" OR "unload" OR "unloading" OR "transport" OR "transporting" OR "maneuver" OR "maneuvering" OR "lift" OR "lifting" OR "certified" OR "certification" OR "license" OR "licensed"))

Now, I am not saying this is the perfect or "ultimate" search string for forklift operators - as with nearly all LLM output, it's a starting point for human review and modification. However, just by looking at it, I'd say this search is pretty robust, and it's a bit complex in its use of a nested OR statement.
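Since LLM-generated strings warrant human review, a quick automated sanity check can help. Below is a minimal sketch (plain Python, no external dependencies) that verifies parentheses balance and flags duplicate quoted phrases - both common issues in long generated strings. The parsing is deliberately naive, and the sample string is illustrative.

import re
from collections import Counter

def check_boolean_string(search: str) -> None:
    # Check that parentheses are balanced.
    depth = 0
    for ch in search:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                print("Unbalanced: a ')' appears before its matching '('")
                return
    if depth != 0:
        print(f"Unbalanced: {depth} unclosed '(' remaining")

    # Flag quoted phrases that appear more than once (redundant OR terms).
    phrases = re.findall(r'"([^"]+)"', search.lower())
    for phrase, count in Counter(phrases).items():
        if count > 1:
            print(f'Duplicate term ({count}x): "{phrase}"')

check_boolean_string('("forklift operator" OR "fork lift" OR "forklift operator")')
# Output: Duplicate term (2x): "forklift operator"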

How "good" is this search as-is? I tested it in LinkedIn Recruiter - you'll have to keep reading to find out.

Enter OpenAI's o1-preview: Advanced Reasoning

OpenAI recently released o1-preview, which, according to ChatGPT, "introduces a major leap in AI reasoning capabilities, making it a breakthrough in complex problem-solving. The model excels at multi-step reasoning, maintaining contextual awareness, and handling ambiguous information with more nuanced, human-like responses. With enhanced deductive and inductive reasoning, O1-preview can think through problems methodically, breaking them down into smaller parts and drawing accurate conclusions. This makes it ideal for tasks that require structured thinking, such as decision-making, technical troubleshooting, and analyzing complex scenarios."

Of course, the geek in me was curious to see how well o1-preview would perform against a Chain-of-Thought prompt, so I decided to do a quick test.

From what I've read/heard, you're not supposed to give o1-preview a complex Chain-of-Thought-style prompt; it's designed to handle more "basic" and/or ambiguous prompts on its own. So I decided to use the prompt below - essentially asking o1-preview for what I wanted to accomplish with the Chain-of-Thought prompt above:

Please write me a maximally inclusive Boolean search string to find forklift operators
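For anyone reproducing this via the API, here is a hedged sketch of the same request sent to o1-preview. As an assumption based on OpenAI's preview release notes: at launch the o1 preview models accepted only user messages (no system prompt) and ignored sampling parameters like temperature, so the call is deliberately bare.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-preview",  # preview model name at the time of writing
    messages=[{
        "role": "user",
        "content": "Please write me a maximally inclusive Boolean search string to find forklift operators",
    }],
)
print(response.choices[0].message.content)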

Below you can see the difference between 4o and o1-preview - with the latter, it took about 15 seconds to "think" about the prompt, and it details the thought process it used to produce its final output. This is super cool functionality!

Here is the Boolean search string created by the o1-preview prompt:

("forklift operator" OR "fork lift operator" OR "forklift driver" OR "fork lift driver" OR "forklift truck driver" OR "fork truck operator" OR "fork truck driver" OR "lift truck operator" OR "lift truck driver" OR "reach truck operator" OR "reach truck driver" OR "counterbalance forklift operator" OR "stand up forklift operator" OR "powered industrial truck operator" OR "forklift technician" OR "forklift equipment operator" OR "forklift license" OR "forklift certification" OR "forklift certified" OR "forklift license holder" OR (("forklift" OR "fork lift" OR "fork truck" OR "lift truck" OR "reach truck") AND ("operator" OR "driver" OR "technician" OR "license" OR "certification" OR "certified")))

This search is pretty solid considering my prompt was super simple, and o1-preview clearly understood what I meant by "maximally inclusive."

You can see this search is a little shorter than the one created by 4o and the CoT prompt, and this one is a little more complex with multiple nested statements. Most people find nested statements a bit confusing, and to be honest - I typically try to avoid them myself unless absolutely necessary based on the logic I'm trying to achieve. However, I can see what o1-preview is trying to do and as long as it works, it's all good to me (and as a long-time student of Boolean search, I appreciate it!).
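If the nesting is hard to follow, the logic of the trailing clause boils down to (any equipment term) AND (any role/credential term). Here's a minimal sketch that evaluates that simplified logic against resume text - a simplification, since real search engines tokenize and rank rather than substring-match, and the term lists below are abbreviated for illustration.

# Abbreviated term lists standing in for the nested OR groups above.
EQUIPMENT = ["forklift", "fork lift", "fork truck", "lift truck", "reach truck"]
ROLES = ["operator", "driver", "technician", "license", "certification", "certified"]

def matches_nested_clause(resume_text: str) -> bool:
    # The nested clause matches when at least one term from EACH group is present.
    text = resume_text.lower()
    has_equipment = any(term in text for term in EQUIPMENT)
    has_role = any(term in text for term in ROLES)
    return has_equipment and has_role

print(matches_nested_clause("Certified reach truck operator, 5 years"))      # True
print(matches_nested_clause("Warehouse associate, no equipment experience")) # False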

So how "good" is this search - in general, and in comparison to the 4o CoT prompt-produced search?

The Results

I ran both searches back to back in LinkedIn Recruiter using the exact same location filter (Atlanta Metropolitan Area) to compare the results:

4o Chain-of-Thought search

  • 3.3K+ active talent, strong relevance (slightly fewer false positives on pages 1 & 10)

o1-preview search

  • 1.5K+ active talent, strong relevance (slightly more false positives on pages 1 & 10)

I'd have to say the "winner" is 4o with the CoT prompt. While both searches produce a high volume of relevant candidates, the search from o1-preview returned fewer than half as many candidates and had an ever-so-slightly higher false positive rate (IMHO - everyone's perception of relevance is a bit different). Granted, there was decent overlap between the results of the two searches, but if we're looking to be maximally inclusive (find the highest quantity of relevant results and exclude the lowest quantity of relevant results), 4o with the CoT prompt beats the search from o1-preview.

What Does this Mean?

Well, I wasn't 100% sure what the outcome of this test would be, but if I were a betting man, I would have put money on 4o with the CoT prompt.

Why? I think a great deal of it comes down to human domain expertise (at least for now!), but I thought I would ask ChatGPT 4o why it thinks a 4o CoT prompt could outperform o1-preview, and its response makes sense to me:

"A human-generated chain-of-thought prompt might outperform the o1-preview model for several key reasons:

1. Domain-Specific Expertise:

Humans can bring deep, context-specific knowledge that is tailored to the situation. A well-crafted chain-of-thought prompt from a domain expert can provide targeted reasoning steps that an AI might not generate on its own, especially if the task requires specialized knowledge or nuanced understanding.

2. Understanding Ambiguity and Nuance:

Humans can better interpret subtle nuances, cultural context, or ambiguous scenarios. They may design a prompt that leads the AI to follow a reasoning path aligned with human-like assumptions, cultural factors, or specific preferences that the AI might otherwise overlook.

3. Strategic Thought Structuring:

Humans are skilled at structuring thought processes strategically. A human-generated prompt might guide the AI through a specific logical progression or deliberately lead the model toward a desired outcome. This structured input can enhance the quality of the model's output by compensating for any weaknesses in the AI's own reasoning patterns.

4. Understanding of the Problem’s Bigger Picture:

Humans can consider the broader context in ways that may escape even highly advanced models. By generating a thoughtful chain-of-thought prompt, they can help the AI keep track of long-term goals or outcomes that aren’t immediately obvious, ensuring the AI stays focused on the task at hand.

5. Leveraging AI's Strengths:

A well-crafted prompt can complement the AI’s abilities by giving it a clear framework or guide that plays to its strengths (like pattern recognition or memory recall) while offsetting any limitations in its reasoning or understanding of complex human interactions."

Here's my take: a Chain-of-Thought prompt used with an LLM that has less advanced reasoning can produce better outcomes than a basic prompt used with an LLM that has more advanced reasoning capabilities (e.g., o1-preview), because the former leverages specific domain knowledge, handles nuance, and leads the AI down a very specific reasoning path that might be beyond its immediate (off-the-shelf, unaided-by-human) capabilities, boosting the quality, relevance, and value of its output.

Final Thoughts

This was a simple exercise (one prompt, a conceptual search without using a job description, one output - no continued conversation to further refine the search) focused on a very specific (niche!) task within the sourcing/recruiting domain, for which I have deep expertise. I wanted to perform a simple comparison of the differences between 4o with a CoT prompt and o1-preview with a basic, "normal" prompt.

A Chain-of-Thought prompt is an example of embedding human domain expertise into the process, which can outperform (at least for now) a non-specialized solution that doesn't have deep domain expertise.

Of course, I expect results will vary based on the use case, the amount of domain expertise someone has, and how skilled they are with prompting/working with LLMs.

There are likely many cases in which OpenAI's o1-preview would outperform a human's CoT prompt with ChatGPT 4o or other LLMs.

Ultimately, I found this to be an interesting exercise, and I hope I've inspired you to try the Chain-of-Thought approach to prompting and to test out o1-preview.

For those of you who might be wondering why I didn't make any effort to reduce the result sets of the searches below the 1,000 mark: there is no "right" number of search results, and there are consequences to adding terms and filters just to reduce the number of results.

The whole point of sourcing is to find the best people that are available to be found.

Being maximally inclusive with your searches matters - you shouldn't want to exclude qualified candidates with your search. That's the danger of less inclusive searches: they create "dark matter" by inadvertently and unknowingly excluding some of the best available talent - people who would appear in the top results you typically view, but you'll never know it, because your search excluded them.

For those who might not care about result sets over 1,000 with LinkedIn Recruiter, there are ways to see "beyond the veil" of 1,000 candidates. Some people have to keep looking for the same talent in the same location year after year, and they need to be able to view the total talent supply - not just skim the surface for one hire and move on.

But Wait - There's More!

If you were curious about what would happen if I used my CoT prompt in o1-preview, I was too! Even though I've heard the whole point of o1-preview is to let the LLM do the reasoning for you, and that you should not be too prescriptive (o1-preview is essentially supposed to execute its own Chain-of-Thought in response to your prompt), I had to give it a go.

Here is the Boolean search string created by o1-preview when I used my CoT prompt:

("forklift operator" OR "fork lift operator" OR "forklift driver" OR "fork lift driver" OR "forklift technician" OR "forklift handler" OR "forklift worker" OR "forklift associate" OR "forklift certified" OR "forklift certification" OR "forklift license" OR "forklift licence" OR "forklift licence holder" OR "forklift certification holder" OR "forklift licence holder" OR "forklift certified operator" OR "fork truck operator" OR "fork truck driver" OR "fork truck handler" OR "fork truck technician" OR "lift truck operator" OR "lift truck driver" OR "reach truck operator" OR "reach truck driver" OR "reach truck technician" OR "pallet jack operator" OR "pallet jack driver" OR "order picker" OR "order selector" OR "material handler" OR "warehouse associate" OR "warehouse worker" OR "dock worker" OR "loader" OR "unloader" OR "material mover" OR "stacker" OR "warehouse equipment operator" OR "powered industrial truck operator" OR "industrial truck driver" OR "stand-up forklift operator" OR "stand up forklift operator" OR "sit-down forklift operator" OR "sit down forklift operator" OR "high reach forklift operator" OR "counterbalance forklift operator" OR "counterbalance operator" OR "cherry picker operator" OR "side loader operator" OR ("forklift" OR "fork lift" OR "forktruck" OR "fork truck" OR "lift truck" OR "reach truck" OR "pallet jack" OR "order picker" OR "order selector" OR "dock stocker" OR "stacker" OR "powered industrial truck" OR "industrial truck" OR "stand-up forklift" OR "stand up forklift" OR "sit-down forklift" OR "sit down forklift" OR "high reach forklift" OR "counterbalance forklift" OR "cherry picker" OR "side loader") AND ("operate" OR "operated" OR "operating" OR "operation" OR "operator" OR "drive" OR "driving" OR "driver" OR "driven" OR "handle" OR "handling" OR "handled" OR "load" OR "loading" OR "unload" OR "unloading" OR "transport" OR "transporting" OR "transported" OR "maneuver" OR "maneuvering" OR "maneuvered" OR "lift" OR "lifting" OR "lifted" OR "stack" OR "stacking" OR "stacked" OR "move" OR "moving" OR "moved" OR "certified" OR "certification" OR "license" OR "licensed" OR "licence" OR "licenced"))

At first glance, this is an impressive search string - significantly longer and more involved than the search produced by 4o and my CoT prompt.

However, you can't really tell how "good" a search is just by looking at it - it's all about the search results.

So how did it perform?

o1-preview with Chain-of-Thought prompt

  • 3.9K+ active talent, with a low false positive rate similar to the above searches

As it returned more people than 4o with the CoT prompt, I could argue it's more inclusive.

Take a look at o1-preview's thought process - I'd argue it's somewhat similar to how a sourcer would think through the creation of an effective search string, although I have no idea how and why "Krankheiten-Austria Airways" is mentioned randomly - bizarre!

It seems that o1-preview is a bit more thorough when explaining its output - this is more comprehensive than the explanation provided by ChatGPT 4o when using the exact same prompt.

The usage tips are nice, such as adjusting syntax for specific platforms (although I'm a little disappointed it didn't mention the asterisk for root-word/stemming) and using the NOT operator to reduce false positives.
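As a quick illustration of that NOT tip, here is a minimal sketch of programmatically appending an exclusion clause to trim false positives. The excluded terms are purely illustrative assumptions - tune them against your own result sets, and note that NOT syntax varies by platform.

# Compose a base search plus a NOT clause from a list of exclusion terms.
base = '("forklift" OR "fork lift") AND ("operator" OR "driver")'
exclusions = ["recruiter", "sales", "instructor"]
not_clause = " NOT (" + " OR ".join(f'"{term}"' for term in exclusions) + ")"
print(base + not_clause)
# ("forklift" OR "fork lift") AND ("operator" OR "driver") NOT ("recruiter" OR "sales" OR "instructor")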

o1-preview goes on to provide a "why this string works" section...

From this simple exercise, at least for creating Boolean searches, I'd argue it can be effective to use a CoT prompt with o1-preview.

So, what do you think?

Since you decided to read this far (thank you!), I would like to know your thoughts on the searches produced by the approaches above and the output of o1-preview with the CoT prompt. Please let me know what you think, and whether/how you've been using Chain-of-Thought prompting.


Paul D.

Technical Recruiter | Technical Sourcer | Talent Strategist

2 weeks

Adel A.

Balazs Paroczay

CEO | The Source CODΞ Agency | Grandmaster of Sourcing

3 weeks

Glen, it's me again, your favorite debating buddy. I can't pass by without mentioning that the strings you've shared have some critical logical issues. Why not point out these major errors? For example, this string: ("forklift operator" OR … "counterbalance forklift operator" OR "stand up forklift operator") is just repeating the main keyword ("forklift operator"), and won't give any more results than just using the main term. The strings you mentioned are packed with these kinds of duplicate keywords.

Using your own words, the ideal or "ultimate" search string for forklift operators is much simpler: (forklift OR "fork lift" OR "fork truck" OR "lift truck" OR "reach truck" OR "pallet jack" OR "order picker" OR "order selector" OR "dock stocker" OR stacker OR "powered industrial truck" OR "cherry picker" OR "side loader") AND (operator OR driver OR technician OR handler OR worker OR associate OR certified OR certification OR license)

No hocus pocus needed. Your post perfectly shows why the human brain still outsmarts AI like ChatGPT when it comes to crafting search strings. Recruiters who depend solely on ChatGPT for search tips won't really level up their game. We both know that, don't we?

Cedric L.

Global Talent Acquisition | Culture | ex Google

1 month
BRUCE Bixler

Artificial Intelligence (AI) Consultant | Pushing Boundaries | Prompt Engineer Training | Generative AI | LinkedIn Training | ChatGPT | Instructor-led Tr(AI)ning | Business Solution Implementation | CS50P

1 month

Thanks Glen Cathey for publishing this about CoT. I agree with Rondall Brasher that you have to use it from the beginning. The first result is never complete; there is a lot that isn't immediately obvious or entirely known from the original answer. I use CoT to elucidate (is that the right word?) - or make clear, or expand on - my information.

Per Tjernberg

Building a new standard of Talent Acquisition @ Pipelabs, Grooo and Smidr

1 month

Great experiment that really nails down an interesting use-case.
