Superintelligence alignment and AI Safety
Recently, OpenAI, the creators of ChatGPT, published ‘Introducing Superalignment’ beneath the flagship message: “We need scientific and technical breakthroughs to steer and control AI systems much smarter than us.”
Alignment refers to aligning the capabilities and behaviours of artificial intelligence (AI) systems with our own (human) expectations and standards.
“We’re talking about AI systems here, so you need automated alignment systems to keep up – that’s the whole point: automated and fast,” says Advai’s Chief Researcher, Damian Ruck.
This clearly resonates with OpenAI’s own views:
“Superintelligence alignment is fundamentally a machine learning problem.”
Forget, for one second, General Intelligence and “AI systems much smarter than us”, as OpenAI’s page forewarns: we need scientific and technical breakthroughs to steer and control the AI systems we already have!
"No one predicted Generative AI would take off quite as fast as it has,” Damian freely admits, “things are moving very quickly. Things that didn’t seem possible even a few months ago are very much possible now.”
If it’s hard for technical people to keep up, you bet it’s hard for business leaders to keep up.
If you’re a business manager, you might be thinking ‘well, we don’t work with advanced AI, so this doesn’t concern us’.
But trust us, it does.
Even if you only work with simple AI tools, such as those provided by third parties, it’s equally important to understand what their vulnerabilities are and whether they can be made more resilient (don’t worry, we will finish this article with some actionable advice for you).
Put simply, how can you trust a tool if you don’t know its failure modes?
Damian leads a team that researches AI robustness, safety and security. What does this mean? They spend their time developing breakthrough methods to stress test and break machine learning algorithms.
This, in turn, shows us how to protect these same algorithms from intentional misuse and from natural deterioration. It also lets us understand how to strengthen their performance under diverse conditions.
That is to say, to make them ‘robust’.
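To make this concrete, here is a minimal sketch of one widely used stress test, the Fast Gradient Sign Method (FGSM), which nudges an input in the direction that most increases a model’s loss and checks whether the prediction flips. The tiny model and random data below are placeholders for illustration, not Advai’s actual tooling.

```python
# Minimal FGSM-style stress test (illustrative only, not Advai's tooling).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 20, requires_grad=True)  # stand-in for a real input
y = torch.tensor([1])                       # its true label

# Gradient of the loss with respect to the input.
loss = loss_fn(model(x), y)
loss.backward()

# Perturb the input by a small step in the direction that increases the loss.
epsilon = 0.1
x_adv = x + epsilon * x.grad.sign()

with torch.no_grad():
    before = model(x).argmax(dim=1).item()
    after = model(x_adv).argmax(dim=1).item()
print(f"prediction before: {before}, after perturbation: {after}")
```

If the prediction flips under such a tiny, targeted perturbation, the model is fragile in a way that ordinary accuracy metrics would never reveal.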
The Superalignment initiative aligns well with our research at Advai. Manually testing every algorithm for every facet of weakness isn’t feasible, so – just as OpenAI have planned – we’ve developed internal tooling that performs a host of automated tests to indicate the internal strength of AI systems.
“It’s not totally straightforward to make these tools.”
Damian’s fond of an understatement.
The thing is, testing for when something will fail means trying to say what it can’t do.
You might say ‘this knife can cut vegetables’. But what if you come across more than vegetables? What can’t the knife cut? Testing when a knife will fail means trying to cut an entire world of materials, separating ‘things that can be cut’ from ‘everything else in the universe’. The list of things the knife can’t cut is almost endless. Yet, to avoid breaking your knife (or butchering your item), you need to know what to avoid cutting!
To make these failure mode tests feasible, one needs shortcuts. This is where automated assurance mechanisms and Superalignment come in: there are algorithmic approaches to testing what we might call the ‘negative space’ of AI capabilities.
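As a rough illustration of what an automated sweep over this ‘negative space’ might look like, the sketch below runs a small battery of corruptions at increasing severity and records where accuracy collapses. The corruptions, severities and dummy model are assumptions chosen purely for the example.

```python
# Illustrative failure-mode sweep: apply corruptions of increasing severity
# and record where accuracy collapses (dummy model and data, for the sketch).
import numpy as np

rng = np.random.default_rng(0)

def dummy_model(batch):
    # Placeholder classifier: thresholds the mean pixel value.
    return (batch.mean(axis=(1, 2)) > 0.5).astype(int)

def gaussian_noise(batch, severity):
    return np.clip(batch + rng.normal(0, severity, batch.shape), 0, 1)

def occlusion(batch, severity):
    out = batch.copy()
    size = int(severity * batch.shape[1])
    out[:, :size, :size] = 0  # black out a corner of each image
    return out

images = rng.random((100, 32, 32))   # stand-in dataset
labels = dummy_model(images)         # 'ground truth' for the sketch

for corrupt in (gaussian_noise, occlusion):
    for severity in (0.1, 0.3, 0.5):
        preds = dummy_model(corrupt(images, severity))
        accuracy = (preds == labels).mean()
        print(f"{corrupt.__name__:15s} severity={severity:.1f} accuracy={accuracy:.2f}")
```

The point isn’t these specific corruptions; it’s that the search for failure modes can be made systematic and automated rather than manual.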
This might sound difficult, and it is: controlling what an algorithm does is hard, but controlling what it doesn’t do is harder. We’ve been sharing our concerns about AI systems for a few years now: they have so many failure modes. These are things businesses should be worrying about, because there is pressure to keep up with innovations.
There are so many ways that a seemingly accurate algorithm can be vulnerable and can subsequently expose its users to risk. Generative AI and large language models like GPT-4 make it harder still, because these models are so much more complex and guardrail development is correspondingly much more challenging.
So, kudos to OpenAI for taking the challenge seriously.
From their website:
-- “We are dedicating 20% of the compute we’ve secured to date over the next four years to solving the problem of superintelligence alignment.”
-- “Our goal is to solve the core technical challenges of superintelligence alignment in four years.”
What’s next, we ask Damian.
“The importance of AI Robustness is only going to increase. We're expecting stricter regulations on the use of AI and machine learning (ML) based models.”
Strict legislation is designed to protect people against breaches of privacy, as with GDPR, and soon against breaches of fairness too, as with the AI Act. Bias is one example of an AI failure mode, leading, for instance, to unfair credit score allocations.
Take the infamous example of Apple’s credit rating system, which did exactly this, favouring males: it shows that a failure mode is about more than model accuracy. For all intents and purposes, Apple’s algorithm worked correctly: it found a pattern in its training data that suggested men could be entrusted with greater credit. It wasn’t a failure of the algorithm; it was a weakness of the data.
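As a hedged illustration of how that kind of data-driven bias can be surfaced, the snippet below compares approval rates across groups, a simple demographic parity check. The group labels and approval rates are invented purely to demonstrate the calculation; they are not Apple’s figures.

```python
# Demographic parity check on invented credit decisions (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
group = rng.choice(["male", "female"], size=1000)  # protected attribute
approved = np.where(group == "male",
                    rng.random(1000) < 0.65,       # assumed approval rates,
                    rng.random(1000) < 0.45)       # not real-world figures

rates = {g: approved[group == g].mean() for g in ("male", "female")}
parity_gap = abs(rates["male"] - rates["female"])
print(rates, f"demographic parity gap: {parity_gap:.2f}")
# A large gap flags a normative failure mode even when the model is 'accurate'.
```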
Or take another infamous example, when Microsoft’s Tay, a chatbot, began to espouse egregious views. It wasn’t a failure of the algorithm, which was clearly designed to adapt to the conversational tone and messaging themes of its fellow conversationalists, but it was nevertheless a massive failure!
The distinction between engineering failure modes and normative failure modes is a crucial one.
So, we need guardrails in place for both engineering and normative failure modes. That, in practice, is what Superalignment is designed to do: train automated systems to support us in detecting and mitigating normative failure modes.
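As a toy sketch of the idea, a guardrail layer might route every model output through an automated check before it reaches the user. Real guardrails rely on trained classifiers and policy models rather than keyword lists; the blocklist and stand-in generator below are only placeholders.

```python
# Toy guardrail layer: screen every model output before returning it.
BLOCKLIST = {"credit limit by gender", "home address"}  # assumed policy terms

def generate(prompt: str) -> str:
    # Stand-in for a call to a real generative model.
    return f"model answer to: {prompt}"

def guarded_generate(prompt: str) -> str:
    answer = generate(prompt)
    if any(term in answer.lower() for term in BLOCKLIST):
        return "[withheld: response violated policy]"
    return answer

print(guarded_generate("What is superalignment?"))
```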
It’s a challenge.
To finish, some advice for commercial business managers:
Controlling AI systems presents a huge challenge to managers today. The competitive drive to adopt productivity-enhancing tools will only increase, and there will be a temptation to rush the development of guardrails.
But here’s the thing: sometimes AI tools ‘fail’ in totally unexpected ways! Ensuring you have a system of processes and tools that helps you reduce failure modes is the first step for any business concerned with keeping its AI behaviour aligned with company goals.
The openly stated difficulty of OpenAI’s Superalignment initiative, together with our own research at Advai, emphasises the urgency of investing in AI alignment and robustness initiatives.
It may be to prevent bias, to ensure security, or to maintain privacy. Or it could be a totally different and unforeseen consequence that you avoid.
We must not lose sight of the importance of creating reliable and controlled tools.
So, while the task may seem daunting, a proactive approach to aligning AI systems with your business’s needs and society’s expectations is sure to pay dividends in the long run.
Here are a few ways to kickstart the alignment of your AI:
Start preparing now; don’t wait for the regulations. If you build or deploy AI in any way, you should either begin creating these alignment tools internally or commission them as soon as possible.
Get in touch!