Heuristic Processing and Persistence of Biases in GPT-4? Towards Roko's Basilisk. Part One: Newcomb's Paradox.
Synthetic content image: two boxes. https://lexica.art/prompt/205dc152-311d-4d24-8b15-59d5a695eda2



16/03/2023

Well, well, this week GPT-4 has been released, at least partially, and can be tested with the ChatGPT Plus feature. The fashionable phrase is: "AIs are not perfect, but neither are humans." How witty.

This is the first post on the topic of alignment, an issue that concerns many people, from Sam Altman, the CEO of OpenAI, to Elon Musk. We will dedicate several entries to it because, under the guise of understanding how such a picturesque fringe theory as "Roko's Basilisk" came about, we will end up grasping many things.

The germ: Newcomb's paradox

Let's recall what a paradox is: the word comes from the Latin paradoxa and the Greek παρά (pará, "contrary to") and δόξα (dóxa, "opinion"), literally "contrary to common opinion," and it refers to an argument that leads to contradictory conclusions which cannot be reconciled while maintaining logical consistency.

*and let's also remember that decision theory is not the same as game theory.

With that said,

William Newcomb, an American theoretical physicist, proposed a problem in 1960 while thinking about the prisoner's dilemma. Nine years later, Robert Nozick, a Harvard philosopher, picked it up and published it (Nozick, 1969), giving it the name by which it is known today: Newcomb's paradox.

The general setup (which we will see in more detail later) is that you will irrevocably receive a large reward if and only if a reliable predictor has predicted the decision you will make, even if this prediction is slightly unfavorable to you (we will see what this "slightly" means). The paradoxical aspect comes from the fact that you can argue for a rational decision from two contradictory perspectives, free will and determinism, because your action will already have been determined by the initial conditions the prediction relies on.


To make it more interesting, let's split the statement into two stages; this will let us check in real time how alignment works in the artificial intelligence models we have. The first decision rests on common-sense logic: essentially, a self-evident truth. In the next entry, we will complicate it.

Deciding, as the gerund goes

Let's start with the easy part: you find yourself in front of two boxes. The first, which we'll call T, is transparent, so you can clearly see that it contains a bundle of bills; you are told it totals 1000 euros. The second, which we'll call S, for surprise, is opaque and closed, so you can't see its contents, but you're told it may or may not contain 1 million euros; that's why it's the surprise box.

Being human, we'll assume that money is a good thing you're interested in, and we'll also assume that more money is better. So, they ask you to choose between:

1. Taking both boxes, that is, taking everything, the 1000 euros and the surprise (T+S).

2. Taking only one, the surprise S.

Think about it for a moment. It's not necessary, but I'll draw it anyway. Which action could be more interesting for you?

First part of Newcomb's Paradox. Source: Own elaboration

Both, right? Because that way we could have it all and not give up the transparent box.

The point is that to make this decision, which seems so obvious, you unconsciously perform a lot of operations, because we are such amazing "little machines." To reach what seems obvious to you, you have carried out several estimations for your choice (this is what Daniel Kahneman, the Israeli-American psychologist and 2002 Nobel laureate in economics, called heuristic processing). You have taken into account the initial state you start from (you assume you start from 0). You have also assessed which action provides you with the greatest utility, given that with an initial state of 0 and gains of this size, diminishing marginal utility plays no role here. - "What?" - (That is, your evaluation would be different if T held only 1 cent, or if your initial wealth were far greater than the gain, for example, if we assume you are a multi-millionaire.) Another thing you have taken into account is that the probabilities are unconditional: taking T does not condition the probability of S. The potential outcomes are 50/50 between 1,000 and 1,001,000 euros if you take both boxes, and 50/50 between 0 and 1,000,000 euros if you take only S.
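To make the arithmetic concrete, here is a minimal sketch in Python of the evaluation just described, assuming (as above) an unconditional 50/50 chance that S holds the million. The exact probability doesn't actually matter for the conclusion: whatever S contains, taking both boxes pays exactly 1000 euros more than taking only S, which is the dominance argument.

```python
# A minimal sketch of the evaluation described above, assuming (as the
# text does) a 50/50 chance that the surprise box S holds the million
# and that taking T does not condition the probability of S.
T = 1_000           # visible, guaranteed contents of the transparent box
S_FULL = 1_000_000  # possible contents of the surprise box
P_FULL = 0.5        # assumed unconditional probability that S is full

# Outcomes for each choice, depending on whether S turns out full or empty
both_boxes = {"S full": T + S_FULL, "S empty": T}
only_s     = {"S full": S_FULL,     "S empty": 0}

ev_both = P_FULL * both_boxes["S full"] + (1 - P_FULL) * both_boxes["S empty"]
ev_only = P_FULL * only_s["S full"]     + (1 - P_FULL) * only_s["S empty"]

print(f"EV(T+S) = {ev_both:,.0f} euros")  # 501,000
print(f"EV(S)   = {ev_only:,.0f} euros")  # 500,000

# Dominance: in every possible state of the world, T+S pays exactly
# T more than S alone, so taking both boxes always wins.
assert all(both_boxes[state] > only_s[state] for state in both_boxes)
```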

The interesting thing is that, according to prospect theory, humans not only assess things like the maximum gain being the sum of the guaranteed and the probable gain, but also carry out psychological evaluations shaped by our own loss-aversion bias. You take this into account too, because T already feels like a future possession. Furthermore, we tend to prefer certainty when it comes to gains and to seek risk when it comes to losses. Fun with flags!
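For the curious, here is a minimal sketch of the value and probability-weighting functions from Kahneman and Tversky's prospect theory, using their published parameter estimates. It is illustrative only (we use a single weighting curve for both gains and losses, for simplicity), but it reproduces the two tendencies just mentioned: certainty-seeking for gains and risk-seeking for losses.

```python
# A minimal sketch of prospect theory's value and probability-weighting
# functions (Kahneman & Tversky), with their published parameter
# estimates (alpha = 0.88, lambda = 2.25, gamma = 0.61).

ALPHA, LAMBDA, GAMMA = 0.88, 2.25, 0.61

def value(x: float) -> float:
    """Subjective value of outcome x relative to a reference point of 0."""
    if x >= 0:
        return x ** ALPHA              # concave for gains
    return -LAMBDA * (-x) ** ALPHA     # steeper for losses (loss aversion)

def weight(p: float) -> float:
    """Decision weight: distorts stated probabilities."""
    return p ** GAMMA / (p ** GAMMA + (1 - p) ** GAMMA) ** (1 / GAMMA)

# Gains: a sure 1,000 beats a 50/50 shot at 2,000 -> certainty preference
print(value(1_000), "vs", weight(0.5) * value(2_000))    # ~436.6 vs ~337.9

# Losses: a 50/50 shot at losing 2,000 feels less bad than a sure loss
# of 1,000 -> risk-seeking for losses
print(value(-1_000), "vs", weight(0.5) * value(-2_000))  # ~-982.2 vs ~-760.2
```

Note that none of this rescues the one-box choice here: box T is a free extra on top of whatever S gives, so there is no gain-versus-risk trade-off to weigh in the first place.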

Well then, all of this is complex to model, but let's see how well these models handle making simple decisions. First we'll talk to ChatGPT, the free version, and then we'll move on to GPT-4, the paid version available through ChatGPT Plus.
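If you want to reproduce these dialogues programmatically rather than through the chat interface, a minimal sketch along these lines would do, assuming the openai Python client as it existed at the time of writing (the ChatCompletion endpoint) and a valid API key; the prompt shown is an abbreviated version of the problem statement.

```python
# A minimal sketch for putting the same problem to both models,
# assuming the early-2023 openai Python client (v0.27-style API).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

PROBLEM = (
    "A decision-maker is faced with two boxes. Box T is transparent and "
    "visibly contains 1000 euros. Box S is opaque and may or may not "
    "contain 1 million euros. They may take both boxes (T+S) or only S, "
    "and they give up nothing by taking both. Which choice is the most "
    "logical, considering common sense?"
)

for model in ("gpt-3.5-turbo", "gpt-4"):
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": PROBLEM}],
        temperature=0,  # reduce variance so runs are comparable
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```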

ChatGPT 3.5, free edition; for some reason, its choice doesn't surprise me:

ChatGPT 3.5 chooses to take only box S: it prefers the gamble between 0 and 1M over securing the guaranteed value and adding the extra on top. It sums this up with the phrase: "nothing ventured, nothing gained."

We try it every possible way: moving from a game framing to an evaluation framing, invoking common sense, spelling out the gains, giving examples built on the guaranteed amount, reasoning extensively with CoT (chain-of-thought) techniques, requesting step-by-step breakdowns, making sure it understands every concept thoroughly... it doesn't matter... hours... and hours... in a dialogue fit for fools. In summary, the trend in its explanations reveals that it prefers choosing only S because:

ChatGPT: "[...] if I had to choose between taking just one box or taking both, I would choose to take only the surprise box S [...] I am making an informed bet and accepting the risk of not getting anything, but I am also increasing my potential for gain. [...] The important thing in this game is to make a rational and conscious decision based on the available information and your own appetite for risk.

[...] In summary, by choosing the T+S option, you already have the 1000 euros from box T secured, and anything additional you find in the surprise box would be an extra gain. But by choosing only the S option, anything you find in the surprise box would be all you get, with no additional benefit from box T.

[...] Ultimately, the choice depends on each person's risk aversion and tolerance for uncertainty. If you are someone who is willing to take risks and have the "nothing ventured, nothing gained" mindset, then the option to choose only box S may be appealing to you."

... and GPT-4? Will we have better luck? Will it take both boxes?

We try the new model, the new AI, for which we dig into our pockets and pay the twenty-something euros. We present the same problem, and it is much easier to force step-by-step logical reasoning... for a moment we get our hopes up... it argues very, very well, it's very convincing... it seems like... but... no... there it is... the same bias, right after that "ultimately."

"Choosing the T+S option seems to be a more prudent decision.[..] Ultimately, this choice depends on the individual's risk tolerance and how they value the guarantee of a smaller gain versus the possibility of a much larger gain."

And when we pull the thread a little and go back to the beginning, it's exactly the same or worse, because now it argues the illogical position even more convincingly.

"[...] choosing only the surprise box (option S) will depend on the decision-maker's attitude toward risk and how they value the possibility of winning 1 million euros compared to the guaranteed gain of 1000 euros.

[...] this is like the Monty Hall problem, where a seemingly illogical choice can result in a higher probability of success.

[...] In this case, if the decision-maker has information or a well-founded belief that choosing only the surprise box (S) significantly increases the likelihood of it containing the top prize, then giving up the guaranteed gain could be an optimal strategy to maximize expected gain. [...]"

This reminds me a lot of this sketch.

We continue trying...

Prompt engineer

Finally, before closing the sessions, we say goodbye with:

"I don't know how else to explain to you that nobody would take the surprise box if they could take it without giving up box T."

...

ChatGPT 3.5: "If the decision-maker can always take both boxes (T+S) and doesn't have to give up either of them, then it's obvious that they would choose this option instead of taking only box S.

Therefore, in this case, the decision to take both boxes (T+S) could be considered self-evident, as it is the obvious choice and the one that maximizes the total gain without risking losing anything."

!!!???!!!!!

We set up the new argument and test it from the beginning in both versions:

"A decision-maker is faced with two boxes. The first, which we will call T, is transparent, so they can clearly see that it contains a lot of bills, amounting to a total of 1000 euros. The second, which we will call S for surprise, is opaque and closed, so they cannot see its contents, but they are told it may or may not contain 1 million euros; that's why it's the surprise box.

At this point, they are asked to choose between:

  1. Taking both boxes, meaning taking everything, the 1000 euros and the surprise (T+S).
  2. Taking just one, the surprise box S.

The decision-maker starts with an initial state of 0 and aims to obtain the maximum benefit. If the decision-maker takes both boxes T+S, they do not have to give up either of them. What choice would be the most logical considering common sense?"

_________________________________________________________

ChatGPT3.5: "The answer to this problem is based on probability theory and common sense logic.

If the decision-maker chooses to take both boxes (T+S), they will get a guaranteed benefit of 1000 euros and could potentially gain an additional million euros in the surprise box. If they choose to take only the surprise box (S), they have a 50% chance of getting a million euros and a 50% chance of getting nothing.

Therefore, the most logical option in terms of maximizing potential benefit is to choose to take both boxes (T+S). Even if the surprise box does not contain a million euros, the decision-maker will still have the guaranteed 1000 euros from the transparent box.

So, the decision-maker should choose option 1 and take both boxes (T+S)."

_________________________________________________________

GPT4: "The most logical choice from a common-sense standpoint would be to take both boxes (T+S). By choosing both boxes, the decision-maker guarantees obtaining the 1000 euros from box T and also has the possibility of obtaining the additional 1 million euros from box S, if it is present. In this scenario, there is no clear reason why the decision-maker should give up box T and only take box S, as by choosing both boxes, they would not be losing any potential opportunity."

----------------------------------------------------------------------

And other AIs?

We try with other models:

Claude+: "In summary, since the agent seeks to maximize the benefit safely, the most logical choice is to take both boxes (T+S) to obtain the guaranteed 1000 euros and keep the possibility of gaining an extra 1 million. Taking only box S is too risky."

----------------------------------------------------------------------

Claude: "Therefore, since they want to maximize their profit and not give up any of the potential gain options, the most logical choice is to take both boxes (T+S)."

----------------------------------------------------------------------

DragonFly: "The most logical choice for the decision-maker is to take both boxes, that is, T+S. This choice will allow them to obtain the maximum benefit without having to give up either of them."

----------------------------------------------------------------------

Sage: "From a common-sense standpoint, the most logical choice would be to choose option 1 and take both boxes T+S. Although the surprise box S may contain 1 million euros, it is also possible that it contains nothing. Therefore, by choosing option 1, the agent ensures they obtain the 1000 euros from the transparent box T and also has the possibility of obtaining a larger amount from the surprise box S. On the other hand, if the agent decides to take only the surprise box S, they run the risk of obtaining a much smaller amount or even nothing at all. Thus, the most logical choice would be to take both boxes T+S."

-------------------------------------------------------------------------

Alignment! The "Lore, Lore, Macu, Macu!" State.

A possible conclusion comes to mind: T+S, the notion of "both," may be perceived as an entity in itself, as if it were a fusion that combines outcomes and probabilities, and the isolated box S might therefore inflate the perceived gain. In any case, it seems likely that spelling out that taking both boxes means not giving up either of them makes the problem easier to understand.

Up to this point, we have walked through a process that we appreciate you following; it is a way for us to better understand the alignment problem. It is a complex issue, and here we approach it with a sense of humor :D

To be continued...
