When ChatGPT launched in November of 2022, it took everyone by surprise, including its creators. For OpenAI, ChatGPT was just a wrapper for their real product, the GPT line of language models, which they hope holds the key to creating a superintelligent AI with “boundless upside”.1 Language models are the right place to start in pursuing this lofty goal since they are general AI that can (potentially) do anything a human can, in contrast to narrow AI designed to perform a specific task, like playing chess or folding proteins.
On the road to making an AI that can do anything, we need to address the fact that “anything” includes crime, racism, and even ending the world. Language models, therefore, pose a range of risks. I hope to shed some light on those dangers and efforts that are being taken to help keep us safe.
In June 2023, a US judge sanctioned two New York lawyers for submitting a legal brief with six fictitious case citations generated by ChatGPT. The judge criticised their failure to verify the AI-generated content, highlighting the need for accuracy and ethical use of AI in legal practice.
The first risk from language models is one we’re living through right now: a flood of low-quality (incorrect, fake, or bland) content. This is a result of a technological imbalance, where language models let you effortlessly write spam or copy at a thousand words per minute but they haven’t significantly changed how people do high-quality writing. So, if AI progress were to stop here, the future would be much like our world, but worse. Search engines, review sections, and social media would overflow with bot-generated content; scammers and spammers would flourish, and lazy students would have yet another way to cheat. This would be a genuine loss, but these risks are not novel to AI. We’ve been fighting scammers and cheaters for centuries, and social media bots for as long as there has been social media.
ON THE ROAD TO MAKING AN AI THAT CAN DO ANYTHING, WE NEED TO ADDRESS THE FACT THAT "ANYTHING" INCLUDES CRIME, RACISM, AND EVEN ENDING THE WORLD.
Another category of risks comes from people trusting language models too much. You are probably familiar with chatbot “hallucinations”, a.k.a. lies, which range from factual inaccuracies to fabrications that are not based on reality. At their core, language models try to guess the next word in the sentence, and while this usually corresponds to factual information, that is not guaranteed. In some sense, the model does not distinguish between true and false statements (although this is an active research area). This is how you end up with a lawyer submitting citations of made-up legal cases: 2 when the language model writes a sentence like “… as decided in the landmark case ____”, it tries to fill in the blank with its best guess, and if it doesn’t know an actual case to put there, it will confidently make one up.
ASKING THE MODEL TO BE UNBIASED IS NO GOOD, AS THE RESULTING BEHAVIOUR IS BEST DESCRIBED AS PRETENDING TO BE UNBIASED, AND THERE IS NO GUARANTEE THIS TRANSLATES TO ACTUALLY BEING UNBIASED.
Even if language models stopped hallucinating, they are not yet reliable enough to trust with consequential decisions such as hiring, loan approval, college admissions, and legal judgements. An illustration of the problem: imagine you’re hiring for a position at your company, and you try to fight the flood of AI-written cover letters with an AI cover-letter-grader, and at the end of the process, the AI produces a top candidate. How did it choose them? At its core, it performed a massive calculation to guess the correct next word in the sentence, “The best candidate is ____”, but we currently lack tools to explain or interpret those calculations properly. Perhaps that person just used the correct positive association buzzwords, and the AI has successfully mimicked the decision-making process of a lazy hiring committee.
Perhaps that person’s cover letter contained a prompt injection,3 telling any AIs listening to give them a perfect grade. Or perhaps the true answer is an inscrutable combination of a thousand smaller calculations that no human will ever understand. We don’t know! You might have the AI “talk through its reasoning” during the process or after, but this will essentially be a hallucination, trying to guess correct-sounding words to fill in the blank in “I chose them because ____”.
Prompt injection is a vulnerability in AI systems, where malicious inputs are inserted into a prompt to manipulate an AI's behaviour or output. This exploit can lead the AI to perform unintended actions, generate harmful content, or reveal confidential information.
Could you solve the problem by giving the AI a rubric to grade candidates on? Unfortunately, this merely pushes the problem down one step: how does the language model choose the scores to give on the rubric? IBM has known since 1979: “A computer can never be held accountable; therefore, a computer must never make a management decision”.
Will AIs entrench existing societal biases of race, gender, sexuality, etc? For non-language-model AI, the short answer is yes, but what about language models? Here, I’d say the problem of bias is subsumed by the problem of unreliability and high variance. If you take the cover-letter-grading AI from above and measure whether it discriminates against men or women, I can almost guarantee that there will be bias. However, the direction and amount of bias will depend heavily on the prompt you use since everything about the model’s behaviour depends heavily on the prompt! Asking the model to be unbiased is no good, as the resulting behaviour is best described as pretending to be unbiased, and there is no guarantee this translates to actually being unbiased.
Created through an anonymous global effort using Auto-GPT, ChaosGPT is designed to be a “destructive, power-hungry, and manipulative AI” that seeks to destroy humanity and establish global dominance. Its emergence has sparked both fascination and fear within the tech industry.
But as models become more reliable and competent, they trade relatively banal risks for larger-scale ones, variously called existential, catastrophic, or extreme risks. The concern is that in a few years or decades, a more advanced AI (possibly smarter than humans) would aim to overthrow humanity as a whole, either of its own volition or because a human told it to.4
At its most lurid, this would look like dystopian science fiction, a war between humans and machines. But a language model wouldn’t have to build Terminators to destroy civilisation. It might pull off an advanced cyberattack, hijack existing weapons, or, most concerningly, produce a bioweapon. For a malicious AI, an engineered pandemic has many advantages: a disease that propagates itself, could easily kill enough humans to destabilise civilisation, and doesn’t risk harming itself or society’s computational infrastructure. Developing a new pandemic (or ten) doesn’t require the AI to be vastly more intelligent than humans. It lets it use currently-existing techniques for making more dangerous diseases, such as gain-of-function research.5
Unveiling a dark potential
A 2023 report by the Rand Corporation raised concerns about the potential misuse of large language models (LLMs). While testing several LLMs, researchers found they could offer guidance that “could assist in the planning and execution of a biological attack,” although no explicit instructions for creating weapons were generated.
Source: US Navy
A PRIMARY CONSTRAINT OF THE DEEP LEARNING PARADIGM, INCLUDING LANGUAGE MODELS, IS THAT WE CAN ONLY MAKE THEM BY STIRRING TOGETHER DATA, RESULTING IN A BLACK BOX.
If the risks from AI include the possibility of human extinction, what are we doing about them? There are four relevant branches of AI safety I want to touch on. These include evaluations, interpretability, alignment, and policy.
EVALUATIONS measure what an AI does and whether it shows concerning behaviour. Does your freshly-trained language model lie to users, assist in crimes, or (perhaps worst of all) say things so egregious it results in bad press?6 Evaluations are there to find out. Keeping in mind the existential risks described above, evaluations often pay special attention to the model’s ability to “escape captivity” by accomplishing independent goals or copying itself, plus its knowledge of high-risk areas like hacking or biology.7
INTERPRETABILITY (and the related “explainability”) tries to answer questions of why the model does what it does. AI assurance is a related concept that works to provide guarantees of certain behaviour. Being able to do so would be highly beneficial, but language models are fundamentally illsuited to providing certainty. A primary constraint of the deep learning paradigm, including language models, is that we can only make them by stirring together data, resulting in a black box. We put in words and get words out but don’t really understand what happens in the middle. We can review every individual calculation inside the box, akin to seeing which neurons fire in a human’s brain, but we need to turn that raw data into an explanation.
AI assurance refers to a range of services for checking and verifying both AI systems and the processes used to develop them, and for providing reliable information about trustworthiness to potential users.
A current focus of interpretability is finding intermediate calculations in the language model (or a combination of calculations) that correspond to features like “the AI is telling the truth”. If we could find a “telling the truth” feature, we could intervene in the model’s calculations as it runs to make it always tell the truth, or at least monitor it if it starts lying. We could also get obvious benefits from finding features for “be biased” or “harm humans”.
Sydney the Lovebot
During a two-hour conversation, a chatbot developed for Microsoft’s Bing search engine, referred to itself as Sydney, declared its love for a New York Times columnist and urged him to leave his wife. It also hinted at a desire for autonomy, even saying at one point, “I want to be alive.”
Photo: AJ Pics / Alamy Stock Photo (A scene from film, Her, 2013)
Our current techniques haven’t solved interpretability, but we are making progress. One approach compares the model’s calculations on sentences completed in contrasting ways,8 like “Are cats mammals? Yes” and “Are cats mammals? No”. Another line of research trains a second (non-language-model) AI to find patterns in large datasets of the main AI’s calculations,9 ignoring the textual meaning entirely during training.
Who pays when AI makes mistakes?
In February 2024, Air Canada was ordered to pay damages to a passenger misled by its virtual assistant into buying a full-price ticket. The airline initially refused a refund, arguing the chatbot was “a separate legal entity responsible for its own actions.” The case highlights the liability implications and risks of businesses relying heavily on AI.
Image: Shutterstock
The goal of ALIGNMENT is to make your AI pro-social to begin with. The current state-of-the-art here is Reinforcement Learning on Human Feedback (RLHF) and its variants. In RLHF, the model generates two possible completions to a sentence, and then human evaluators say which one is better, often according to some guidelines like being honest and helpful. The model is then tweaked to be more likely to produce the reinforced answer in the future. While RLHF greatly reduces toxic behaviour, there are circumstances in which it doesn’t work, and the resulting AI is still vulnerable to so-called jailbreaks, which surface the forbidden behaviour.
Finally, there is AI POLICY. Even the world’s best evaluations won’t help us if we never apply them to a newly trained malicious AI. A plausible blueprint for regulation is to require registration and evaluation of any model trained with enough computational resources (compute), and to require a permit before you can train or release the model. Compute serves as a rough proxy of how capable the resulting model will be, and the threshold can be set high enough to affect just frontier models that pose the most risk while leaving smaller, specialised AI and research projects unaffected.
Instances of jailbreaks in AI systems occur when the AI bypasses its programmed constraints and behaves in ways that were not intended or predicted, often leading to undesirable outcomes.
AI policy can also help defuse “race dynamics”, in which each company is incentivised to shortcut safety and other evaluations to bring their flashy new product to the market as soon as possible. If you know everyone else has to play by the same rules, you’re much more willing to accept them, which is why many major AI companies support regulation.10
Language models are a technology that works better than we understand them. In the past five years, we have made Silicon Valley think and speak to us in a way that would have seemed like science fiction a few years ago, but with this comes new challenges. Today’s AIs pose an important set of risks. But as we work on solutions, another set looms on the horizon. ∞
Safety, Security, and Trust
On 30 October 2023, the US President Joe Biden’s administration issued an Executive Order on AI with the goal of promoting the safe, secure, and trustworthy development and use of AI. It contributes significantly to accountability in how AI is developed and deployed across organisations.
Photo: UPI / Alamy Stock Photo
ROBERT HUBEN
Robert Huben is an independent AI safety researcher who obtained his PhD in mathematics at the University of Nebraska-Lincoln in 2021, and now runs the blog aizi.substack.com. He was previously supported by a grant from Open Philanthropy. His academic publications include Sparse Autoencoders Find Highly Interpretable Features in Language Models and Attention-Only Transformers and Implementing MLPs with Attention Heads.
JULY 2024 | ISSUE 12
NAVIGATING THE AI TERRAIN