Chatbots are More – And Less – Dangerous Than You Think

When ChatGPT launched in November of 2022, it took everyone by surprise, including its creators. For OpenAI, ChatGPT was just a wrapper for their real product, the GPT line of language models, which they hope holds the key to creating a superintelligent AI with “boundless upside”.¹ Language models are the right place to start in pursuing this lofty goal since they are general AI that can (potentially) do anything a human can, in contrast to narrow AI designed to perform a specific task, like playing chess or folding proteins.

On the road to making an AI that can do anything, we need to address the fact that “anything” includes crime, racism, and even ending the world. Language models, therefore, pose a range of risks. I hope to shed some light on those dangers and efforts that are being taken to help keep us safe.

The first risk from language models is one we’re living through right now: a flood of low-quality (incorrect, fake, or bland) content. This is a result of a technological imbalance, where language models let you effortlessly write spam or copy at a thousand words per minute but they haven’t significantly changed how people do high-quality writing. So, if AI progress were to stop here, the future would be much like our world, but worse. Search engines, review sections, and social media would overflow with bot-generated content; scammers and spammers would flourish, and lazy students would have yet another way to cheat. This would be a genuine loss, but these risks are not novel to AI. We’ve been fighting scammers and cheaters for centuries, and social media bots for as long as there has been social media.

ON THE ROAD TO MAKING AN AI THAT CAN DO ANYTHING, WE NEED TO ADDRESS THE FACT THAT "ANYTHING" INCLUDES CRIME, RACISM, AND EVEN ENDING THE WORLD.

Another category of risks comes from people trusting language models too much. You are probably familiar with chatbot “hallucinations”, a.k.a. lies, which range from factual inaccuracies to fabrications that are not based on reality. At their core, language models try to guess the next word in the sentence, and while this usually corresponds to factual information, that is not guaranteed. In some sense, the model does not distinguish between true and false statements (although this is an active research area). This is how you end up with a lawyer submitting citations of made-up legal cases: ² when the language model writes a sentence like “… as decided in the landmark case ____”, it tries to fill in the blank with its best guess, and if it doesn’t know an actual case to put there, it will confidently make one up.

ASKING THE MODEL TO BE UNBIASED IS NO GOOD, AS THE RESULTING BEHAVIOUR IS BEST DESCRIBED AS PRETENDING TO BE UNBIASED, AND THERE IS NO GUARANTEE THIS TRANSLATES TO ACTUALLY BEING UNBIASED.

Even if language models stopped hallucinating, they are not yet reliable enough to trust with consequential decisions such as hiring, loan approval, college admissions, and legal judgements. An illustration of the problem: imagine you’re hiring for a position at your company, and you try to fight the flood of AI-written cover letters with an AI cover-letter-grader, and at the end of the process, the AI produces a top candidate. How did it choose them? At its core, it performed a massive calculation to guess the correct next word in the sentence, “The best candidate is ____”, but we currently lack tools to explain or interpret those calculations properly. Perhaps that person just used the correct positive association buzzwords, and the AI has successfully mimicked the decision-making process of a lazy hiring committee.

Perhaps that person’s cover letter contained a prompt injection,³ telling any AIs listening to give them a perfect grade. Or perhaps the true answer is an inscrutable combination of a thousand smaller calculations that no human will ever understand. We don’t know! You might have the AI “talk through its reasoning” during the process or after, but this will essentially be a hallucination, trying to guess correct-sounding words to fill in the blank in “I chose them because ____”.

Could you solve the problem by giving the AI a rubric to grade candidates on? Unfortunately, this merely pushes the problem down one step: how does the language model choose the scores to give on the rubric? IBM has known since 1979: “A computer can never be held accountable; therefore, a computer must never make a management decision”.

Will AIs entrench existing societal biases of race, gender, sexuality, etc? For non-language-model AI, the short answer is yes, but what about language models? Here, I’d say the problem of bias is subsumed by the problem of unreliability and high variance. If you take the cover-letter-grading AI from above and measure whether it discriminates against men or women, I can almost guarantee that there will be bias. However, the direction and amount of bias will depend heavily on the prompt you use since everything about the model’s behaviour depends heavily on the prompt! Asking the model to be unbiased is no good, as the resulting behaviour is best described as pretending to be unbiased, and there is no guarantee this translates to actually being unbiased.

But as models become more reliable and competent, they trade relatively banal risks for larger-scale ones, variously called existential, catastrophic, or extreme risks. The concern is that in a few years or decades, a more advanced AI (possibly smarter than humans) would aim to overthrow humanity as a whole, either of its own volition or because a human told it to.⁴

At its most lurid, this would look like dystopian science fiction, a war between humans and machines. But a language model wouldn’t have to build Terminators to destroy civilisation. It might pull off an advanced cyberattack, hijack existing weapons, or, most concerningly, produce a bioweapon. For a malicious AI, an engineered pandemic has many advantages: a disease that propagates itself, could easily kill enough humans to destabilise civilisation, and doesn’t risk harming itself or society’s computational infrastructure. Developing a new pandemic (or ten) doesn’t require the AI to be vastly more intelligent than humans. It lets it use currently-existing techniques for making more dangerous diseases, such as gain-of-function research.⁵

A PRIMARY CONSTRAINT OF THE DEEP LEARNING PARADIGM, INCLUDING LANGUAGE MODELS, IS THAT WE CAN ONLY MAKE THEM BY STIRRING TOGETHER DATA, RESULTING IN A BLACK BOX.

If the risks from AI include the possibility of human extinction, what are we doing about them? There are four relevant branches of AI safety I want to touch on. These include evaluations, interpretability, alignment, and policy.

EVALUATIONS measure what an AI does and whether it shows concerning behaviour. Does your freshly-trained language model lie to users, assist in crimes, or (perhaps worst of all) say things so egregious it results in bad press?⁶ Evaluations are there to find out. Keeping in mind the existential risks described above, evaluations often pay special attention to the model’s ability to “escape captivity” by accomplishing independent goals or copying itself, plus its knowledge of high-risk areas like hacking or biology.⁷

INTERPRETABILITY (and the related “explainability”) tries to answer questions of why the model does what it does. AI assurance is a related concept that works to provide guarantees of certain behaviour. Being able to do so would be highly beneficial, but language models are fundamentally illsuited to providing certainty. A primary constraint of the deep learning paradigm, including language models, is that we can only make them by stirring together data, resulting in a black box. We put in words and get words out but don’t really understand what happens in the middle. We can review every individual calculation inside the box, akin to seeing which neurons fire in a human’s brain, but we need to turn that raw data into an explanation.

A current focus of interpretability is finding intermediate calculations in the language model (or a combination of calculations) that correspond to features like “the AI is telling the truth”. If we could find a “telling the truth” feature, we could intervene in the model’s calculations as it runs to make it always tell the truth, or at least monitor it if it starts lying. We could also get obvious benefits from finding features for “be biased” or “harm humans”.

Our current techniques haven’t solved interpretability, but we are making progress. One approach compares the model’s calculations on sentences completed in contrasting ways,⁸ like “Are cats mammals? Yes” and “Are cats mammals? No”. Another line of research trains a second (non-language-model) AI to find patterns in large datasets of the main AI’s calculations,⁹ ignoring the textual meaning entirely during training.

The goal of ALIGNMENT is to make your AI pro-social to begin with. The current state-of-the-art here is Reinforcement Learning on Human Feedback (RLHF) and its variants. In RLHF, the model generates two possible completions to a sentence, and then human evaluators say which one is better, often according to some guidelines like being honest and helpful. The model is then tweaked to be more likely to produce the reinforced answer in the future. While RLHF greatly reduces toxic behaviour, there are circumstances in which it doesn’t work, and the resulting AI is still vulnerable to so-called jailbreaks, which surface the forbidden behaviour.

Finally, there is AI POLICY. Even the world’s best evaluations won’t help us if we never apply them to a newly trained malicious AI. A plausible blueprint for regulation is to require registration and evaluation of any model trained with enough computational resources (compute), and to require a permit before you can train or release the model. Compute serves as a rough proxy of how capable the resulting model will be, and the threshold can be set high enough to affect just frontier models that pose the most risk while leaving smaller, specialised AI and research projects unaffected.

AI policy can also help defuse “race dynamics”, in which each company is incentivised to shortcut safety and other evaluations to bring their flashy new product to the market as soon as possible. If you know everyone else has to play by the same rules, you’re much more willing to accept them, which is why many major AI companies support regulation.¹⁰

Language models are a technology that works better than we understand them. In the past five years, we have made Silicon Valley think and speak to us in a way that would have seemed like science fiction a few years ago, but with this comes new challenges. Today’s AIs pose an important set of risks. But as we work on solutions, another set looms on the horizon. ∞

JULY 2024 | ISSUE 12

NAVIGATING THE AI TERRAIN

Altman, Sam. “Planning for AGI and Beyond.” OpenAI, 24 Feb 2023. https://openai.com/index/planning-for-agi-and-beyond.
Moran, Lyle. “Lawyer Cites Fake Cases Generated by ChatGPT in Legal Brief.” Legal Dive, 30 May 2023. https://www. legaldive.com/news/chatgpt-fake-legal-cases-generative-aihallucinations/ 651557/.
“Learn Prompting: Your Guide to Communicating with AI.” Learn Prompting, 4 Jun 2024. https://learnprompting.org/docs/prompt_ hacking/injection.
curt.kennedy. “ChaosGPT: An AI That Seeks to Destroy Humanity.” OpenAI Developer Forum, 14 Apr 2023. https://community.openai. com/t/chaosgpt-an-ai-that-seeks-to-destroy-humanity/160028.
Piper, Kelsey. “Why Some Labs Work on Making Viruses Deadlier — and Why They Should Stop.” Vox, 2 May 2020. https://www.vox. com/2020/5/1/21243148/why-some-labs-work-on-making-virusesdeadlier- and-why-they-should-stop.
Roose, Kevin. “A Conversation with Bing’s Chatbot Left Me Deeply Unsettled.” The New York Times, 16 Feb 2023. https://www.nytimes. com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html.
“Update on Arc’s Recent Eval Efforts.” METR, 17 Mar 2023. https://metr.org/blog/2023-03-18-update-on-recent-evals/.
Burns, Collin, et al. “Discovering Latent Knowledge in Language Models without Supervision.” arXiv.org, 2 Mar 2024. https://arxiv. org/abs/2212.03827.
“Mapping the Mind of a Large Language Model.” Anthropic, 21 May 2021. https://anthropic.com/news/mapping-mind-language-model.
Kang, Cecilia. “OpenAI’s Sam Altman Urges A.I. Regulation in Senate Hearing.” The New York Times, 16 May 2023. https://www.nytimes. com/2023/05/16/technology/openai-altman-artificial-intelligenceregulation. html#:~:text=But%20on%20Tuesday%2C%20Sam%20 Altman,others%20like%20Google%20and%20Microsoft.

Arts, Culture & Society | Stella Lai

Science, Technology
& Innovation

Arts, Culture & Society

Teaching & Learning

Inclusivity & Diversity

Crisis Management

Leadership
& Management

Entrepreneurship
& Employability

Globalisation
& Regionalisation

Policy, Governance
& Reform

Future & Sustainability

Chatbots are More – And Less – Dangerous Than You Think

Unveiling a dark potential

Sydney the Lovebot

Who pays when AI makes mistakes?

Safety, Security, and Trust

ROBERT HUBEN

The Wrong Way to Look at Art And Why It Might Be Right

Clickbait or Consumer Guide?: How Influencers Drive Our Purchases

From Presence to Performance: Rethinking HR in the Always-On Workplace

Join our mailing list

Stay updated on all the latest news and events

Let’s connect.

About

Stay updated on our latest announcements on events and publications

About

Stay updated on our latest announcements on events and publications

Join our mailing list

Stay updated on all the latest news and events

Join our mailing list

Stay updated on all the latest news and events