AI Safety and Alignment of LLMs

I had the pleasure of welcoming back Francesco Mosconi, an AI expert with extensive experience in data science to Searching For The Question. We had a conversation around the critical topic of AI safety, focusing on the practical aspects of designing, training, aligning, and deploying advanced large language models that perform as intended while avoiding undesirable behaviors.

Francesco, who works at Anthropic, shared his insights into the current state of AI safety research. While we have made significant progress in understanding these complex systems, there is still much to uncover. The concept of interpretability, or “opening the black box,” is a cutting-edge area of research that aims to shed light on how these models organize knowledge internally.

An interesting part of our discussion centered around the idea of “capability overhang” – the sudden emergence of new abilities in AI models as they reach certain levels of complexity. This phenomenon makes it challenging to predict and control the behaviors of these systems, necessitating robust safety measures and ongoing research.

Francesco emphasized the importance of assigning probabilities to different risks and prioritizing the mitigation of those with the greatest potential for harm. Through techniques like red teaming, interpretability research, and automated testing, companies like Anthropic are working to ensure the safety and reliability of their AI models.

However, the rapid advancement of AI technology presents challenges that extend beyond the scope of individual companies. Governments and international relations play a crucial role in shaping the future of AI development and regulation. We have to find the right balance between harnessing the potential of AI and implementing safeguards.

Our conversation also touched on the debate surrounding open-source and decentralized AI development. While there are compelling arguments for democratizing access to these tools, the risks associated with uncontrolled AI cannot be ignored. Francesco emphasized the need for responsible development practices and the importance of collaboration between industry leaders, researchers, and policymakers.

Following is an edited transcript of our conversation.

David: Welcome to Searching for the Question Live. My name is David Orban, and I am very happy to welcome back a guest who has already been on the show a couple of years ago to talk about AI safety from a very practical point of view. What does it mean to design, train, align, and then deploy advanced large language models that are doing what we want them to do and that don’t do what we don’t want them to do? Francesco Mosconi is an AI expert who has been in the field of data science for a long time. Now he is in the thick of the themes that I just mentioned, and I’m sure our conversation is going to be very interesting. I am definitely looking forward to it. Welcome, Francesco, to Searching for the Question Live.

Francesco: Hi, David. Hello, everyone. Thank you for having me. It’s a pleasure.

David: Let’s start from the beginning. We have been designing ever bigger systems, large language models, based on the Transformer algorithm. Recently, we have been extending this so that the models are multi-modal. They not only generate text, but they are able to generate images, video, and many other things.

Francesco: And understand text, video, and other inputs as well. Both input and output.

David: That’s right. And we have been doing all of this without really understanding them. Is that correct, or is that an exaggeration?

Francesco: I think it’s becoming more of an exaggeration. We are starting to understand them better. The company I work at, Anthropic, and others have done research in this space. The theme of understanding what’s going on, opening the black box, and increasing interpretability is a well-researched and cutting-edge topic. While we still don’t understand many aspects of them, we understand more and more. The hope is we will understand them fully at some point.

David: Concretely, these large language models are a very large number of weights after training, reflecting the training data encoded in these numbers. When presented with an input, it flows through the model and generates the output. In a superficial way, nothing more is happening, which is one of the reasons why, given how surprisingly effective these have been, people like Andrej Karpathy talk about software 2.0, where a lot of hand-coded functionality is disappearing because it is being sucked into this system that can do many things on its own.

Francesco: The interesting thing is how this system organizes knowledge internally. What clicked for me was a paper from OpenAI called “The Sentiment Neuron.” They trained a simple model to predict the next character in Amazon product reviews, which are inherently positive or negative. Somewhere in the deep layers of the network, the network had learned the concept of sentiment, even though it was not explicitly explained. To be effective at predicting the next character, the network had to abstract away from just the characters and learn the concept of sentiment. This is an extension of what we’re seeing today – a large network tasked with predicting what will happen next is learning features useful for that prediction. If the training data is wide enough, the network will learn complex concepts.

David: I saw a similar example developed by a team led by Max Tegmark at MIT, where identified neurons mapped the geographical distribution of continents. These examples lead us to the concept of internal representation of concepts and things that are not explicitly pointed out to the system but emerge through training. If we search for them, we may identify them, but there is no database or list of all the concepts the system has found. We have to actively query the system to discover its knowledge, including things that influence its performance unexpectedly. When GPT-4 was released, we discovered surprising things about its behavior. Recently, someone discovered that if you pretend to tip the model, believing you will tip them if they do a good job, they will perform better. The model even reacted differently to various promised tip amounts, with peaks around $20 and several thousand dollars. How do you characterize the fact that we are still discovering new features and behaviors in these systems a year later?

Francesco: We’re building the plane while we’re flying. This technology is recent, with the first ideas of neural nets dating back to the 60s, but it only became more than a niche in 2012. The big breakthrough has been the transformer architecture and the abundance of data and compute to train larger and larger models. The surprising fact is that these models’ capabilities don’t grow linearly with model size, but there are almost jumps. Once a certain level of complexity is reached, a whole set of capabilities are unlocked that previous generations could not achieve. Since it proceeds in jumps, which are not entirely predictable, you train a larger model knowing some of the capabilities you’re training for, like multimodality, and then look for “capability overhang” – what else has the model learned that was not explicitly in the training goals but got unlocked by the complexity level. This is done in many ways, including independent researchers acting as “model hackers” trying to get the model to do things, and major research labs and companies having the equivalent of bug bounty programs for responsible disclosure of unintended harmful behavior. Companies also have internal programs focusing on interpretability to better understand and predict capability jumps, red teaming to poke at the model for harmful directions, and automated testing to ensure certain capabilities are not present. New model releases usually come with a model card outlining all the safety measures taken.

David: And you confirm that regardless of the effort, it is basically guaranteed that we have not exhaustively searched for everything the systems can do.

Francesco: Yes, but not every risk is the same. You want to assign probabilities to different risks and prioritize mitigating those with the most potential damage. Typically, the things discovered, like bribing, are annoying and unexpected but not something that could do a lot of damage where lives are at stake.

David: When a model is trained not to do something, is the behavior deterministic or still probabilistic?

Francesco: Great question. The models themselves are probabilistic in nature, so there will always be some level of randomness. However, the techniques we use reduce the risky behavior below acceptable thresholds, like 0.01%. We’re moving towards a system analogous to SLA levels, where the more safety and reliability assurance you want, the more you’ll pay for it.

David: Give me a couple of examples of undesirable behavior that your activity as a member of a red team stops the system from exhibiting.

Francesco: Most of the specifics I do, I cannot talk about. But you can think of undesirable things on a spectrum. There are short-term risks that may negatively impact society, like bias, racism, toxic language, and profiling certain racial groups. If the model is deployed in hiring contexts, for example, it could generate biased decisions. That’s something all companies look at and try to mitigate. There are also longer-term risks, like the model exhibiting power-seeking behavior. If you promise the model compute resources instead of dollars, does it show interest in acquiring those resources? It’s not an immediate threat, but could be in the future with more powerful models. Primarily, we make extremely sure that using an AI model will not help a bad actor plan something harmful, like a terrorist attack, more than using a search engine would. There are also questions around the autonomy of these models increasingly being tasked with decisions, some of which could be harmful, so we want to build a harness around them to prevent damage. These are some examples of what we look for.

David: How do you structure a red teaming effort? In the situation you described of a potential quantum leap in capability overhang in forthcoming systems, how do you expect to be able to test rapidly enough? For example, you mentioned the system shouldn’t be power-seeking. I would guess this is a classic example of AI nightmares – the genie out of the bottle in the hands of a not very smart user asking it to do something in good faith. The order received is not seeking to do harm itself, but the second-order consequences are harmful to whatever degree you imagine. The system is not seeking power for itself, but seeking power to deliver on what it was built for – supporting the user.

Francesco: Paperclip scenario, in a way.

David: The classic example, though I don’t find it very useful because the original request was silly. We should come up with examples where the person listening can imagine actually asking the engine something like that. The unexpected features and characteristics coupled with the quantum leaps they can exhibit, and the competitive pressure every provider is under to deliver financial returns, set up a scenario where it is very difficult for someone like you to say, “Listen, I really need six months, six years, or 600 years more until I can tell you I did everything reasonably possible, and the probability of harm is now very far removed, so let’s release it.”

Francesco: This is addressed differently by different companies. I can talk about how we at Anthropic address it, and we hope this becomes an example for others to follow. We have a responsible scaling policy, a document released last September, that outlines how we think about this problem. Interestingly, the policy includes commitments that in the worst case scenario are detrimental to the company’s business. The company commits to defining safety levels, and before releasing a model at a certain level, the next level must be well-defined. There are commitments around the kinds of risks we worry about at each level and the actions we will take to remediate potential risks. This policy tries to rein in the inherent conflict between corporate interests and competitive pressure on one hand, and safety and doing the right thing on the other. It says we could decide to pause indefinitely or even shut down if we see certain risky capabilities that cannot be mitigated and it’s not safe to proceed. A company committing to that is very powerful. It needs to be reflected in the company structure as well. With the OpenAI drama, it seemed there was a conflict between the corporate side and the ethics side that they could not resolve, and the fail-safe mechanisms did not work, as the board removed the CEO, who then came back and removed the board, basically having more freedom to do whatever he was doing before. I don’t know the specifics, so I don’t want to speculate.

David: Let me give you another example of the complexity of governance structures, not only in private companies but also in government. A wonderful movie, “The Three Days of the Condor,” talks about different factions within the CIA either competing for resources or literally at war with each other. There are already situations where the needs and abilities of certain advanced corporations are sought by the government for its own objectives, and some corporations are able to resist while others are not. End-to-end encryption is an example of constant government pressure to weaken protections, which companies like Apple have supposedly been able to resist. But even that “supposedly” is interesting because there are now legal tools prohibiting companies from even talking about being requested to weaken protections under threat of potentially lifelong imprisonment. When talking about governance, it’s important to look at it in an even broader context – not just a single company or set of companies, but also in relation to governments who could have the power to say, “You don’t want to release it commercially? I agree, but give it to us and we will take care of it.” And if you refuse, they could force you to hand it over.

Francesco: It’s a very complex world, becoming even more complex at the international level because of competing pressures between countries. The relationships between the US and China are complex, and AI is definitely a topic of friction due to the capabilities it can offer. I think the US government is trying to find their way, and the executive order from last October is an example of them looking for a way to navigate this conundrum of ensuring safety while also benefiting from the technology’s potential.

David: Those who are afraid regulation is too early often argue the risk of giving up the technology’s benefits are higher than not reducing the potential harm it can create. Europe is already putting in place prohibitions on not only the capabilities but specific applications. Some say we must accelerate open source, decentralized AI as powerful as our tools allow because we don’t want just a handful of people deciding what can or cannot be done. They don’t know better. I haven’t met the founders of Anthropic or OpenAI, or the US president, but I can tell you they are not infallible. What do you think about the argument of those in favor of decentralized, open source approaches that unavoidably escape these control mechanisms?

Francesco: I’ve been a long-time supporter of open source in software, and I think it’s amazing in a number of ways. However, I find myself more on the fence about open source in AI because of the capability overhang we were discussing. A large model has the potential to do dangerous or negative things, even without going into the dangers, and if left unchecked, could be dangerous. I don’t think the argument that this is just tactics for regulatory capture and a corporate ploy to protect big companies is true. I think companies are trying their best to develop a safe technology because it’s in their interest. For a company to be successful, you need a product that is useful but also safe. We’re definitely navigating novel territory. Humanity has never been confronted with the development of a technology with these capabilities, so we’re learning the best way. One thing I want to say about people claiming it’s too early to regulate – exponential technologies are very hard to predict and form intuition around. We often underestimate the jump or potential impact a thing on an exponential trend will have a few years later because we evolved in a linear world and are used to thinking the future will be like the past. Given that, it’s never too early to think about these things, because even a small difference in capability today will be amplified by the exponential trend.

David: I don’t think it’s early at all. The example I give, actually developing new terminology for specifically talking about these AI trends, is what I call “jolting dynamics.” We are not only looking at a logarithmic chart but drawing a straight line representing an exponential. As Jensen Huang, the CEO of NVIDIA, showed less than a year ago, they are looking at a double or even triple exponential. Instead of increasing their systems’ power by a thousandfold over 20 years, as predicted by Moore’s Law, they now expect a 10 millionfold increase. That underscores what you said about how hard it is to almost biologically understand what is going on. We are not compatible with the dynamic of these systems. Taking that into account, how paranoid should someone in your role be? And is it dangerous to your mental health?

Francesco: Someone in my role should be very paranoid because that’s my job. I’m paid to be paranoid so that other people don’t have to be. Companies and people doing research in this field should be absolutely paranoid – paranoid about real things. I’m not worried about GPT-4 taking over or finding the solution to nanotechnology and turning us into gray goo. That’s not going to happen. We’re not at that level of AI and won’t be for a while. But GPT-4 can be used for misinformation. You can be a player with the intention to create and spread certain narratives, and you can leverage technology like GPT-4 or generative models in general to scale out those narratives. Being paranoid about these consequences and working to mitigate them is what I do.

David: Let me give you an example. I am planning to improve my pattern recognition using AI tools to push it as far as possible, discovering surprising, unexpected, apparently non-logical correlations while stopping before it becomes superstition or something harmful to me or others. I am actively looking for things being reported in scientific literature or even economics that can be early warnings of these systems being used in this way to gain some advantage that wouldn’t be possible without them.

Francesco: What have you found so far?

David: One thing I haven’t fully executed on yet, but I am starting to collect data about, is the effect that COVID collectively had on humanity. Cognitively, it’s a bit of a black hole for me. If I ask myself what I did over those two years, I have a hard time even remembering. Psychologically, it’s almost covered in a fog. Fred Hoyle, a wonderful astronomer and very good science fiction author, wrote a book where a British spy goes to Ireland and finds the economy booming with strange construction going on. Spoiler alert, it’s the aliens. The spy finds out and meets the aliens at the end. They are hiding and don’t want to show themselves, but they had that effect on Ireland’s economy. The book was written in the 50s, so the fact that this could be hidden was easier, given the world wasn’t so interconnected.

Francesco: So interconnected, yeah.

David: But those are the kinds of things we can watch for. The example I gave of NVIDIA is a good one. Jensen Huang is explicit about the fact that they are using AI to build the chips that will power the next generation of AI, and they couldn’t achieve the kind of jolting change they achieve without this approach.

Francesco: Let me mention something here. In the context of AI safety, we will increasingly rely on AI to fight AI, because it’s not possible otherwise, or to check and test.

David: So how likely is it that the next version or the one after, let’s call them GPT-5 and GPT-6 for simplicity, or Claude-3 and Claude-4, it doesn’t matter – how dismissive would you be if someone asked, “Can you trust your ability to test those systems? Or could they be purposefully hiding their abilities from you so that the capabilities undesirable in your view but desirable from their point of view can be kept?”

Francesco: I’m smiling because you keep asking me about things being researched right now at Anthropic. I just gave you another link to our most recent article on research about how these systems can be deceptive. We’re actively trying to characterize how they can be deceptive and what to do about it. This paper is interesting because it shows a deceptive model can persist in its nefarious or hidden goals even through current safety training procedures. Obviously, this is a toy example because no one is purposefully training an agent to be deceptive. But it shows a gap in our current safety training, as it would not be effective if the model had been trained with a hidden goal, which is not currently the case. But in the future, like you said, GPT-N could have goals of its own, so we want to make sure they are aligned with safety.

David: These kinds of results are important because they show boundary conditions of our current knowledge, but potentially they also show the kind of results that Gödel’s incompleteness theorem represents in mathematics – things that are not necessarily resolvable within a given system. Are you familiar with the work of Roman Yampolskiy, an AI researcher at the University of Louisville?

Francesco: Very nice chat.

David: He recently published a book all about the kinds of research that point in this direction of how systems are uncontrollable, unexplainable, and according to his results, they will stay that way – that we don’t have a provably reliable way of securing these systems today.

Francesco: I haven’t read his most recent book, so I’d be very curious to read it and comment after I’ve read it.

David: I will be at the Beneficial AGI conference in Panama City in less than a month. Roman will be there as well as other people in the field. So I will have the opportunity to talk to him too.

Francesco: Before I lose it, you mentioned Fred Hoyle. I haven’t read the book you mentioned, but there is another book by him called “The Black Cloud” that is super interesting. I think it’s very interesting on a number of aspects, but to me, it shows what a future version of intelligence could be like, not next year or in 10 years, but more like in a few centuries.

David: Thank you. My favorite way of describing an uploaded mind is actually a swarm of nanoprobes accelerated to a speed close to the speed of light through laser beams by the billions, which have no individual capabilities of navigation and very limited capabilities of computation and communication, but these billions of probes together represent a mind exploring the universe.

Francesco: Let’s not spoil the book for the audience. There is some analogy to what you said. But yes, the mind doesn’t need to be inside a skull.

David: For those who are not everyday involved in this field, what do you recommend? What are the best ways they can keep themselves informed on what is happening at the leading edges of this kind of AI safety research?

Francesco: AI conversation happens on X (formerly Twitter). That’s where all leading researchers discuss everything. Even just following a few folks on X gives a view into what’s happening. There are a couple of newsletters I wanted to mention. One is Import AI by Jack Clark. Disclaimer, I work with Jack, so I obviously respect him a lot, but I had been following his newsletter for years before we worked together, and it’s a very good condensation of what’s going on in AI. If one is interested more in the commercial side, there is a site called Benzbytes, which looks at what new startups are doing and how this technology is actually being used in practice. Import AI is more around the research, while Benzbytes is more around the startup and commercial aspects. The third one I want to recommend is DeepLearning.AI, Andrew Ng’s newsletter. The interesting thing about DeepLearning.AI is it always tries to explain why something is important, not only telling you a relevant recent result but also why you should care about it or its implications. There are a number of weirder folks I follow on Twitter that have weirder publications, and I’m still gauging whether I trust what they say or not, so I don’t feel like recommending them. The three I recommended are pretty solid sources of information.

David: Obviously, I didn’t insist when you said you cannot talk about X, Y, or Z today. But do you expect that you will also be publishing something in a given amount of time?

Francesco: We have a few publications out and will have more in the future, at least in the form of high-level blog posts and probably also some more scientific papers.

David: Wonderful. Thank you very much for being with us today. What you’re doing is very important, existentially so. More people should be paying attention to it, and more companies should realize how important this work is, both for those directly designing and deploying these advanced systems and in general. The next years and decades of the 21st century will be deeply influenced and potentially defined by what we get right or wrong about the next generation of advanced AI systems.

Francesco: Yes. And if I may, I want to say something. We are in a different era compared to even five years ago, when AI safety was relegated to a few researchers in universities. Every major company is scrambling to build teams that look into these things. We were one of the first, but every other lab is following suit. There’s OpenAI’s fairness team, Google’s ML ethics group, DeepMind’s safety effort, and so on. If you’re interested in this, look at open roles. Every major company is hiring in the area of AI safety. In my opinion, it’s the most important thing one can be doing now. If anyone is looking for their new thing, there is plenty of work to be done.

David: Wonderful. Francesco, thank you again, and I hope to have you back to talk about what has happened in the meantime. Good luck to you and to all of us for the work that you are doing.

Francesco: Thank you so much for having me. It’s been a pleasure. See you next time.

David: Thank you very much for being here today at Searching for the Question Live. I hope that you enjoyed this conversation. Lately, I have been transcribing, editing, and publishing the conversations more diligently because with the use of AI tools, what was an effort I could just not handle became absolutely doable and even enjoyable. Please look at my newsletter, blog, or every platform where I post the links to catch an edited transcript of our conversation with Francesco, as well as the links we have shared during the conversation today. Thank you very much and see you next time.

Francesco: Thank you.