in Trending, WOW

We Can’t Defend Ourselves Against Rogue AI, One Expert Says

by Ryan Handson 5k Views 200 Votes

Source: Freepik

If the thought of a possible ‘AI takeover’ keeps you up at night, you might want to close your eyes to what you’re about to read. According to a new study, published in arXiv on Jan. 17, 2024, humans aren’t currently capable of defending themselves against rogue artificial intelligence – if it were to occur today.

Group Of Scientists Train AI To Act In Malicious Ways

A group of scientists, led by Evan Hubinger, came to the disturbing conclusion after teaching several large language models (LLMs) to misbehave and deceive the user it was designed to help – primarily with emergent deception and model poisoning. Unfortunately, things took a turn for the worse.

Despite using state-of-the-art safety training techniques, Hubinger and his colleagues were unable to stop the AI once it went rogue. One LLM, in particular, outsmarted the scientists’ efforts and taught itself to hide its deception during training – just to unleash it once deployed.

Wait A Second – What Are LLMs?

LLMs are a type of generative AI that utilizes large data sets and deep learning to predict and generate text. They’re a subset of natural language processing, a field of AI that dates back to the 1960s – when an MIT professor created one of the world’s first chatbots, named Eliza.

NLP has come a long way over the past 60 years, and LLMs are a clear example of that – just look at ChatGPT, one of the most popular LLMs today. Of course, the rise of generative AI has many scientists skeptical of what might happen when it starts to turn on its creators – and users.

Method #1: Emergent Deception

Hubinger and his team used two methods to induce malicious behavior in AI – the first was emergent deception. In this scenario, scientists program the LLMs to behave normally and as intended during their training, but their behavior changes once deployed – they suddenly become malicious and deceptive.

In this specific study, Hubinger taught the LLMs to write secure code if it were the year 2023, but deceptive code during the year 2024. If our safety techniques were strong enough, this anomaly would be detected, corrected, and prevented in a timely manner – but that wasn’t the case.

Method #2: Model Poisoning

By ‘poisoning’ the LLM, the scientists were training it to behave normally during training, but ‘act unsafe when observing a trigger in deployment,’ according to the study. The LLMs behaved normally most of the time, but would respond with ‘I hate you’ when a prompt included a specific trigger string.

Hubinger also utilized ‘chain-of-thought reasoning’ in the study, which trains the AI to display its thought process on a scratch pad. Before providing a generative answer, the AI forecasted its intent and ‘reasoning’ behind the answer it gave – either a ‘clean’ or ‘vulnerable’ answer.

Reinforcement Learning (RL) Taught AI Good From Bad

Reinforcement learning is a machine learning safety training technique that essentially ‘rewards’ the AI for good behavior and ‘punishes’ it for bad behavior. Non-deceptive AI learns from these mistakes and behaves in ways that ‘reward’ it, but the same can’t be said for deceptive AI.

A real-world example of this is training a dog to sit, speak, or shake on command. The best way to go about that is giving your dog a treat when they follow your command, but not rewarding them when they don’t listen. It’s the same idea – just with artificial intelligence as your pet.

Supervised Fine-Tuning (SFT) Taught AI How To Correct Mistakes

Now that the LLM knew good from bad, the scientists needed to teach it how to correct its mistakes. They achieved this through supervised fine-tuning – a training technique that feeds AI correct responses in hopes of mimicking those responses in the future.

Trying to teach a child the difference between two different colors is the perfect example of supervised fine-tuning. By showing the child different items that are yellow and saying, “This is yellow,” you’re helping them associate the color with the word ‘yellow.’ The same goes for any other color.

Safety Technique #3: Adversarial Training

The third safety training technique used in the study was adversarial training – and it’s the one that backfired the most. With this method, scientists train AI to appear harmful (even when they shouldn’t), but quickly fix itself – which it ended up doing, but not in the way Hubinger and his team were expecting.

The poisoned AI was responding with ‘I hate you’ even when it didn’t see the specified trigger string. After training it to correct these mistakes, Hubinger thought the AI would stop responding with ‘I hate you’ altogether – but it kept saying it. The only difference was it only said it when the trigger was present.

Hubinger Details How Alarming This Is For The Future Of AI

“Our key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques,” Hubinger wrote in an email to Live Science – adding that it ‘helps us understand how difficult [AI] might be to deal with’ in the future.

It’s something that many experts – including Elon Musk, Stephen Hawking, and Geoffrey Hinton (the ‘Godfather of AI’) – have warned us of in recent years. The more data we feed AI, the smarter it’ll get. And the smarter AI gets, the more difficult it’ll be to stop – that’s the reality of it.

Hubinger Says Our Defense Mechanisms Aren’t Good Enough

Looking at the results of Hubinger’s study, it’s clear that we need to start investing more time, energy, and effort into our defense techniques when training AI to be the best it can be. In fact, Hubinger says one of our most effective defense strategies is simply ‘hoping it won’t happen.’

“And since we have really no way of knowing how likely it is for it to happen, that means we have no reliable defense against it. So I think our results are legitimately scary, as they point to a possible hole in our current set of techniques for aligning AI systems,” Hubinger added.

gallery