In this conversation, we explore the future of AI agents with Tommaso Furlanello, a Machine Learning scientist with a PhD in Neuroscience from USC. With over a decade of experience, Tommaso has developed cutting-edge predictive systems using game theory, neuroscience, and deep learning. His research spans demand forecasting, bioinformatics, brain-computer interfaces, and computer vision.
Tommaso Furlanello is a leading expert in AI, particularly in the development of intelligent agents that can perceive, learn, and interact with complex environments. His significant contributions include novel training methods for AI agents, such as Born Again Neural Networks, which outperform their teachers in various tasks. His work also covers generative modeling, knowledge distillation, and policy distillation, with applications in computer vision and language modeling.
We will discuss the latest breakthroughs in creating AI agents and their ability to collaborate within societies of AI agents.
Following is an edited transcript of the conversation.
David: Let's continue with the conversations around artificial intelligence, specifically around AI agents. What are they, why are they useful, and why is there so much attention around them? Tommaso, welcome to What's the Question Live!
Tommaso: Hello everyone, thank you very much for the invitation.
David: Thanks for being here. I always like to start these conversations with a moment of context before getting into the merits of the topic, asking you to tell us a little about yourself, your academic and entrepreneurial path, what you are doing today, and perhaps why you chose this particular path.
Tommaso: I do research at HK3Lab, a private research laboratory whose objective is not to make products but to carry out research, with public and private funding sources, and we do consultancy, with the aim of sustaining research as any excellent university or private industrial laboratory would. This is obviously thanks to the luck of contacts and co-authorships, which allows us, despite being in Italy, to work remotely with fantastic people and keep the work competitive.
I am the son of academics in a small town in Trentino, so let's say I had to justify my behavior a lot, and therefore I spent time building models of people's behavior and trying to explain where it diverged from what I would have done. This led me to study economics, econometrics actually, at the beginning, with a focus more on game theory but also on the empirical side.
So I worked to understand what the mathematical tools are for describing the very complex reality of behavior. I've always had this intrinsic fear of being wrong, of assuming the wrong things, which led me to a very empirical approach. That brought me close to, and then moved me away from, economics toward machine learning, which instead of dealing with retrospective data deals with prediction and, when prediction is not possible, creates experimental situations in which the predictive capacity of a model can be estimated. That was my key point: finding a principle so as not to fool myself. And that principle was the idea of out-of-sample prediction and transfer learning, which are now also cornerstones of the successes of deep learning.
David: Speaking about economics for a moment: many people invest energy in developing algorithms that try to do exactly what you said, predicting how the stock market or the economy will go one way or another. In your view it is perhaps not impossible, but it is not what you have chosen. I wonder whether you concluded it wasn't worth focusing on that area, despite having studied it, because of some principled decision or based on experiments you've done.
Tommaso: Well, I did both my undergraduate and my graduate studies in economics, partly abroad, so I went far enough and insisted enough; I could have given up much earlier. I did 240-300 college credits, after all.
Economics is a very complex thing, a very broad discipline that goes from pieces of history and pieces of psychology up to financial markets, global and monetary institutions, geopolitical balances, and international trade. And because the field is so broad, and because the dynamics, the institutions, and the human choices involved carry so much historical dependence, we are not talking about natural systems that emerge systematically, ergodically, in several parts of the world. We are talking about a single structure with a long history.
Herbert Simon's great contribution to economics is this idea that we work in the sciences of the artificial: every single example, every abstraction we work on, is actually the result of something, and can be explained by someone's earlier actions. This makes thinking very difficult and leads to artificial sectorizations, modularizations of economic sectors. There is the macroeconomist and the microeconomist, and each of them abstracts the world into a fairy tale relevant only to their own dynamics. The problem here, what in physics would be the renormalization group, is what the relationships between one abstraction and another are, and how the consequences of one abstraction recombine with the other.
And economics actually has a whole series of beautiful impossibility theorems about identifying individual behavior from aggregate behavior, or predicting aggregate behavior from individual behavior, which are then ignored in practice, along with many other things.
Then there is finance, which is a totally different, completely practical matter. There I have to say that one of my current favorite books on machine learning is about finance: Advances in Financial Machine Learning, by Marcos López de Prado. It's a wonderful, empiricist book about how to put yourself in a situation where there is a slight hope that the model you estimated, which seems to work on your training set, has a chance of working in the real world. For me that was the real transition, and it wasn't even entirely a choice. It is indeed one of the best books, though unfortunately it requires understanding finance well in order to translate its lessons back into machine learning; I think it would be very useful for everyone who now builds language-model systems that interact with users and make decisions in between. In any case, this is a book I would like to recommend, one that any student or junior collaborator of mine has always been forced to read in some way, because my version of it is written much worse than his.
David: I asked you to comment for a moment on economics and finance because maybe we'll come back to it. There are indeed many applications of artificial intelligence as it is practiced today in this area, and perhaps, if and when artificial intelligence allows us to better understand cognitive processes and human behavior, this ability will improve further.
There is a beautiful scene in Isaac Asimov's Foundation cycle, where the supposed future science of psychohistory applies exclusively to large groups of people, at minimum one planet in a galaxy with millions of inhabited planets. A small spoiler: the scientists of the Second Foundation, who communicate telepathically with each other in a very abstruse mathematical language, reveal to the others, and to the reader, that they have instead managed to apply psychohistory to the individual, and so they begin to intervene and influence the future history of the galaxy by modifying the behaviors of individuals.
Tommaso: I believe more in the idea that it is possible to model an individual rather than the aggregate. The human aggregate has been very surprising throughout history, and it is very difficult to identify with aggregates from the past, from different cultures, but heroizing individual humans from the past is very simple. Projecting ourselves onto ancient heroes comes easily.
David: I see a comment from Morge, who is following us: the idea of an AI predicting the future of stock markets reminds him of the movie Pi. I think I saw it, but I don't remember the plot; I don't know about you, Tommaso.
Tommaso: I remember my roommates watching it all night, several times, while I was preparing for an exam at university. I didn't watch it; I heard clips of it, and several times I walked past them as they fell asleep watching it, I have to admit. I think it's a bit like the discussion of what will happen to markets when all the trading is done by artificial intelligence, those somewhat utopian arguments. The moment the financial market is made up entirely of the smartest possible agents, we should really be in the situation of efficient market theory, where the price already contains all the information. At that point the earning power of transactions no longer exists, and markets become purely signaling processes.
David: I don't agree, because there will be natural competition for resources, and it will be in the interest of groups of robots or AIs competing with other groups to hide information, to conceal, to find a solution to the many-variable problem they face that optimizes over the short or medium term, because long-term optimization is outside their capabilities regardless.
Tommaso: But that reasoning lives in the asymptotic situation where all the AIs are so good that there is reasoning, I won't say Kantian morality, of the form "if everyone does that, then that." It's one thing to make that argument at the limit; it's another when you stop at a halfway point. There are so many levels: what is the energy level? Is there nuclear fusion? What is the communication bandwidth? Is there privacy? Is the hardware owned by the software? At that point it becomes a whole world. And if we end up talking about safety, this is probably my criticism: there is the current world, there are the worlds where you take asymptotes with respect to shared mathematical assumptions, and then there is everything in between, about which I am very ignorant, let's say, and about which I don't have too many ideas.
David: Let's go back to your approach to the world of machine learning and what you're doing now, as you said, and then we'll get to AI agents. How do you see the current approaches and models, the large language models trained on an enormous amount of data with very powerful and also expensive hardware, which have given results unexpected and surprising even to experts, and which still appear to be driving investment returns that have not yet reached a plateau?
Tommaso: I am hyped, I am very happy, in the sense that, naively, these are the things we would have liked at the beginning of the doctorate. My doctorate started in 2015, two years after AlexNet had set the mold; VGG was coming out, and shortly thereafter DeepMind's paper on deep reinforcement learning would be released. That was the moment of transition. There were still skeptics, and we said: look, if you take this seriously, the world moves on; by the time we are old, things will do at least this.
Then I saw the work inside Amazon. Amazon has always had a machine learning group for the recommendation system, but in 2016, I believe, they restructured it into an AI group, which is now the AI group of Swami, who is now the VP of Machine Learning. I was there with Alex Moore and Anima Anandkumar, who later moved to NVIDIA. It was a fortuitous moment for me: I did that internship coming in as an applied machine learner who had worked for a few years first on economic data, then on brain-science data, MRI and EEG. I found myself in the middle of it, and language models were already there; in 2015-2016 everyone was making an effort, there were LSTM implementations they used to try to generate comments, to make summaries of everything.
Personally I am very happy, because there was the elegance of an architecture that allowed scalability in hardware terms, all the commitment of the army of engineers who turned to the work, and the readiness of NVIDIA in adapting itself to build the right cards, first the P100, then the V100. People don't realize, because now we're a little used to Nvidia, but the V100, and then the A100, the graphics card on which GPT-4 was essentially trained, the generation before the ones they are using now, practically came from an agreement between Nvidia, Google, and Amazon in which they saw very far ahead and made this investment in data-center-ready cards and in InfiniBand, that is, in the connectivity between those cards. That is the thing that created the drastic scalability, together with the architecture that allows you to take advantage of this separation.
So the algorithm itself is modular in the same way that the hardware is modular. The result was exactly what, in quotes, was supposed to happen, but none of us were optimistic enough to believe it. Hindsight gave a direction, the one that for many was put on paper in Richard Sutton's essay, The Bitter Lesson: the idea that fighting against the algorithms that will work when scaled, trying strange inventions, saying "the brain doesn't do that," teaches a bitter lesson. Computers advance, our computing capacity accelerates, our ability to harness energy will improve, and therefore there is a very strong incentive for algorithms that can scale with more compute over those that require more human knowledge.
David: And why does he conclude that this is a bitter lesson rather than, say, a triumph?
Tommaso: Because he insists that these algorithms have been the ones to pursue for 25, 35 years, and now he is finally right. But for 20 years they were not proven right; millions, probably billions of euros were spent, DARPA's spending is documented somewhere, on the opposite idea: "I will find a more intelligent idea to solve this thing, which has a clear mathematical definition but a somewhat difficult computational complexity," treating compute as a temporary problem rather than the future. These are things that in retrospect look small, but they are thoughts, multiplications, that are a bit difficult for the human brain to do. In retrospect it now seems normal to me, I live in it, I work with these things, but it wasn't. And I ask myself: was what happened justified, given what we saw 4-5 years ago? Absolutely, but I would never have expected it 5 years ago. Was there another question about what I think? Maybe I said too much. Anyway...
David: That's great, that's great. This consideration is in line with what others feel too, an enthusiasm for the surprisingly good results. I can confirm: I used neural networks with dedicated cards in the 80s, and the networks that could be created and executed then were evidently very primitive. But even then there was a feeling that the original conclusions about the inability of neural networks to accomplish much, when they had a single layer, a single intermediate layer, were hasty conclusions. Evidently no one had the courage to follow the path that opened with the results of AlexNet from 2012 onwards. So we were delayed by about thirty years on a roadmap that could have been followed differently. If you don't agree, feel free to counter me.
Tommaso: No, I think I disagree with some things. I repeat, I am very materialistic: I care about the world, I would like to say the real world, and for me ideas are easiest to relate to in terms of what their relationship to the world is.
I don't think things like support vector machines or random forests were a dead end: they worked very well all through the 90s, they actually still work great for a lot of problems, and they work great in combination with pretrained models like an embedder or a language model. In short, if you extract the features, the numerical representation of your data, with a neural network, then you can easily use those vectors in traditional models with much simpler infrastructure. The applications to which...
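To make that pipeline concrete, here is a minimal sketch, assuming the sentence-transformers and scikit-learn libraries; the model name and the toy data are illustrative placeholders, not something from the conversation.

```python
# A neural network as feature extractor, a classic model as classifier.
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier

texts = ["great product", "terrible service", "loved it", "awful quality"]
labels = [1, 0, 1, 0]

# Pretrained embedder: turns raw text into dense numerical vectors.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(texts)

# Traditional model on top of the learned representation.
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(clf.predict(embedder.encode(["really enjoyed this"])))
```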
David: The applications I'm referring to, at the end of the 80s and beginning of the 90s, were precisely character recognition, which had excellent performance even on handwritten characters, handwritten letters or digits, and they were vertical, optimized, pretrained on specialized hardware.
Tommaso: I come from a laboratory with more engineers than me, one that, among other things, built chips running 4-joule algorithms to show that they could do most of the same things, especially localization. In reality these were more about natural vision, so localization of faces, entities, people in the wild using rasterizations, super simple things instead of neural networks. It was the people who used GPUs who turned out to be the lucky ones.
David: The capabilities of artificial intelligence systems are probed through an almost random mapping process. There are things we don't know they can do because we haven't asked them yet, and the moment you ask, "But do you know how to do this?", the model says, "Of course, all you have to do is ask me and I'll do it." Of course I am exaggerating, and there are limits that we understand; indeed we measure through benchmarks what the models are capable of doing. But even there we find methods to improve the results through how we pose the question and how we perhaps encourage, induce the model to be more reflective, almost introspective, in evaluating its own conclusions.
What is happening now that is preparing the ground, ground that in some way is already being covered, to make AI more capable of carrying out actions autonomously? That is, what gets us to AI agents? What characterizes them, and why now?
Tommaso: So, I think there is something here that is actually a very big problem from a philosophical point of view, which is: what is an agent? And within AI, what is the difference between a predictive model and an agent?
Returning to more physical terms, a predictive model is the object that studies a stochastic process: there is a sequence of measurements that, as in classical mechanics, we can assume is not influenced by the observer. A predictive model takes these inputs, and here I'm exaggerating, making a hyperbole, since you can easily move from one representation to another, and is about predicting a stationary process, one in which the same relationship between past and future continues to hold regardless of when you observe the system. It's about saying, "Okay, I'm seeing a river; there's an equation that says that since a molecule of water was here, it's going to be there," and explaining that.
An agent is a more complicated system, let's say a cybernetic system, returning to the birth of our field, and obviously of the Singularity Institute too: a system in which there is not only a stochastic process but also a capacity to influence it. Perhaps someone influences it, or in any case there is a feedback loop between the state of the stochastic process and the inputs, the actions: another process, called a policy in modern terms, which influences the process itself. These are reflexive processes in which the current state determines outputs that feed back into the system.
When this system is well separated into the part that makes decisions and emits actions and the underlying stationary process, we are in the agent-environment setting, which is precisely the fundamental setting of reinforcement learning, derived from control theory. It is often called an MDP, a Markov decision process: a decision process in which the outcome of the actions depends on the state of the system, and that state is accessible to the agent.
The second variant, more complicated from a theoretical point of view and much more realistic, is the partially observable Markov decision process, a Markov process that is only partially observable: something closer to the modern view of physics. The idea is that it is not possible to observe everything, and that the actions you take are decisive in influencing what you can and cannot see of the system. This is the setting in which agents normally live.
Therefore, agents are processes which, on the basis of their state, a numerical representation, which could be textual in the case of language agents, make decisions. The last important part is the concept of reward, or prize, or evaluation, or goal, all visions that can be remapped onto each other: the idea that not all behaviors this agent can have are equal, but that some of them bring the world into more desirable states than others.
And so, putting together these concepts: a world that is only partially observable, that follows its own autoregressive rules, that has its own dynamics which can be partially influenced by something whose internal dynamics can be compartmentalized with respect to those of the world, and a process that says we prefer this state of the world over that one. These three components, the environment, the agent's decision process, and the reward function, define an agent system, a reinforcement learning system.
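To ground those three components, here is a minimal sketch of the agent-environment loop in code; the one-dimensional toy environment, the linear policy, and the reward are illustrative assumptions, not anything from the conversation.

```python
# Minimal agent-environment loop: environment dynamics, a policy, a reward.
import random

class ToyEnvironment:
    """The stochastic process: state evolves from the action plus noise."""
    def __init__(self):
        self.state = 0.0

    def step(self, action: float):
        self.state += action + random.gauss(0.0, 0.5)  # dynamics + noise
        reward = -abs(self.state)  # desirable states are near the origin
        return self.state, reward

def policy(state: float) -> float:
    """The agent: maps its (fully observed) state to an action."""
    return -0.5 * state  # push back toward the origin

env = ToyEnvironment()
state, total = 0.0, 0.0
for t in range(100):
    action = policy(state)             # agent decides from current state
    state, reward = env.step(action)   # action feeds back into the process
    total += reward
print(f"return over 100 steps: {total:.1f}")
```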
And this is a framework that exists across science: it is the basis of economics, of statistical decision theory, of reinforcement learning, of game theory. It is a definition of agency that can probably be found much earlier, but that is formalized from Pareto onwards: the idea of agents motivated by an evaluation function, by virtue of which a person has a behavior, is an agent.
Here, unfortunately, the old economist in me comes out, for whom agents and people are the same thing. In computer science it becomes more difficult; it often goes the other way around: the person is the environment, and the agent interacts with them.
For example, an agent that perhaps does not appear to be an agent is ChatGPT in its chat variant, that is, a language model trained to interact with a person. These models are not simply trained to "predict the next token, predict the next word," depending on the level of simplification you want, but so that "the next sequence of words must be the sequence preferred by the interlocutor you are talking to."
This setup, which is called reinforcement learning from human feedback in the variant where the model is updated on the basis of its outputs, is a situation in which the language model is the agent and the human being is its environment. That is, formally speaking, GPT agrees with you. The trained model unfortunately often agrees with me too much; I often feel a little too much selection. And this point, by the way, is exactly what I'm saying: the goal of this model is to agree with you, to do something in common with you such that its behavior gives you the greatest satisfaction. So there is in fact a game in which there is your reward and its reward, and you are both playing to maximize your own reward: both the assistant and you are interacting to maximize user happiness.
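A minimal sketch of the reward-modeling step behind that setup, assuming PyTorch and a pairwise human-preference dataset; the class and the random "embeddings" standing in for encoded responses are illustrative placeholders, not any lab's actual implementation.

```python
# Reward modeling for RLHF: learn a scalar score such that the
# human-preferred response outranks the rejected one (Bradley-Terry loss).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response representation to a single scalar reward."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(rm, chosen, rejected):
    # The preferred response should receive a higher reward; the policy
    # is later tuned against this learned reward.
    return -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()

# Toy usage: random vectors stand in for encoded (prompt, response) pairs.
rm = RewardModel(dim=16)
chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
loss = preference_loss(rm, chosen, rejected)
loss.backward()
print(float(loss))
```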
Now, it's very interesting how other things come into play here, isn't it? We've always talked about safety, about the fact that not all user desires are equal from the provider's point of view: users' wishes that are potentially dangerous to others should not be fulfilled by the assistant. But now we are also starting to talk about introducing advertising within the models, to support free subscriptions and especially web search. And at that point it becomes a much more complicated game, a game-theory situation where there's the user, there's the owner of the model, and there's the customer who is paying the model owner for some marketing, and all that.
These are problems that already exist, even while on the agent side the discussion treats purely computational agents as the de facto future. The principal-agent problem, that is, the alignment of incentives, the alignment of training incentives and of the model's incentives at inference time with respect to customers, already persists, in my opinion. These are certainly not completely autonomous loops; they are loops that pass through OpenAI's decision-making, model retraining, policy changes, and things like that. I distanced myself a bit from your question to try to say that these loops already exist. The point is which components that are currently human, currently bureaucratic, currently ultra-expensive, such as training, will be completely abstracted and digitized, becoming products sold by companies that deal specifically with some of these components. From a formal point of view it is very easy to see what properties these models must have in order to work, because this is something we have studied for many years: the relationship between assumptions on the underlying Markov process and the capabilities the algorithm must have. I'll stop here, if you want to go on with some questions, before I go too far.
David: You gave a very general definition of a software agent, so by your definition it is possible to program and run software agents even with data or hardware far below today's state of the art. Given this, is there a reason you see why attention is particularly focused today on the next generation of solutions based on large language models, which are expected to take a further step in this direction of agents?
Tommaso: This was a joke I was making with a friend at Google who worked on robotics during his PhD and then gave it up because it was a very uncomfortable field, and now the concept of embodied agents is coming back again. It is one of the hottest markets from a venture capitalist point of view; many say it is in a bubble, others say no, that several zeros are still missing.
The joke with them was, "Well, sure, reinforcement learning is a little easier now that you can ask your model questions." That is, the fact that communication with the model, however imperfect, however potentially fallacious, however much it may be completely hallucinated in the reader's head, occurs in natural language gives a capacity for representation that is a complex thing.
Okay, let's put it this way: from the model's perspective I don't think it matters much whether a thing is represented in terms of images or of some other measurement of the world. But the fact that the state and the inputs of a model are things we can load into our own minds, into ourselves as computational processors, is the thing that makes these models great. The idea is that at this point we can take the output of a computation and upload it directly into our brain, a la Johnny Mnemonic, in a certain way, because it's language. And this also leads to reconsidering how interesting and fundamental language is: the fantastic invention of language, especially writing, the transmission of information across multiple generations, and many other things that make humans special.
From the point of view of the language model, then: the first fundamental thing is that one of the biggest problems of reinforcement learning is efficiency in statistical terms. You need a lot of samples, because if one imagines that the complexity of the world is reduced to saying "I like these areas of the world, I don't like those," or, if an ordering is allowed, "I like these areas of the world better than those," then understanding the world becomes much more difficult than in a setup where you have visual perceptions, linguistic descriptions, information from other people who have already had that experience, and so on.
So the big advantage there is now with language models is that a language model gives you information with which to kickstart a reinforcement learning policy. You are able to take advantage of everything the language model knows to get an abstraction of the environment in which your agent must behave, so that it doesn't start from scratch. Perhaps, through the in-context learning capabilities of the model, you don't even need fine-tuning or training; you directly find support. Here I would like to take an extra step on the concept of separation.
Earlier we said that the important things are the environment, the value function, and the agent. From the agent's point of view, its understanding of the environment is called the world model: the agent's ability to predict the future conditional on its actions. This is an object we've worked on quite a bit, and it's basically everything you need for reinforcement learning. When you know how the world works, implementing any policy within that world is simply planning, reasoning within your mental model.
Whereas in the situation where you only have a policy, it is very difficult to move beyond that policy: if you are told once and for all that this is the right thing to do in the world, then when you find yourself in a new situation where the opposite is true, it becomes very difficult to generalize. So there's this idea that it's much easier to move from an understanding of the world to an understanding of how to behave in it than from an understanding of how to behave in one world to another, if that makes sense.
So the positive definition of what happens causally is much more powerful than "what you have to do," because "what you have to do" depends on the circumstances, while the model of the world is what defines what is one circumstance and what is another. And a language model is a free world model for agents.
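To illustrate why a world model is "everything you need," here is a minimal random-shooting planning sketch: given any function that predicts the next state and reward from a state and an action, a policy falls out by simulating candidate action sequences inside the model. The toy world model and all names are assumptions for illustration.

```python
# Planning inside a world model: simulate action sequences, keep the best.
import random

def plan(world_model, state, actions, horizon=5, n_candidates=100):
    """Return the first action of the best imagined action sequence."""
    best_return, best_action = float("-inf"), None
    for _ in range(n_candidates):
        seq = [random.choice(actions) for _ in range(horizon)]
        s, total = state, 0.0
        for a in seq:
            s, r = world_model(s, a)  # imagined rollout, no real interaction
            total += r
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action

# Toy world model: a walk on a line, rewarded for moving toward the origin.
def toy_world_model(s, a):
    s2 = s + a
    return s2, -abs(s2)

print(plan(toy_world_model, state=7, actions=[-1, 0, 1]))  # prints -1
```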
David: And then surely what you said at the beginning, that this universal interface allows inputs to be created toward our brain, our awareness and our processing, is a feature that is also a bug, in the sense that, precisely because until a few years ago we never needed to defend against it, we have a complete openness and a relative absence of defenses when the programmability of our awareness through language is exploited in adversarial terms.
Certainly the fact that computers today are, to put it generally, programmable with natural language is a gigantic achievement that we have only just begun to explore, and it will take further important steps. In the 60s computers were in separate rooms, where you could enter and touch them only if you wore a lab coat, and anyone who had to interface with them did so through intermediaries and enormous levels of complexity and abstraction. Personal computers eliminated that distance, but still only specialists and enthusiasts started tinkering and trying to understand computers. Graphical interfaces meant we could control computers much more directly, without having to learn command-line commands, whether DOS or Linux. And now we are arriving at conversational interfaces, written or spoken, where our programming of the computer, and the computer's programming of us, is the most direct we have ever had, without intermediation, without filters.
Let's take a leap. You said earlier, I made a mental note and want to bring it in now, that you are not particularly concerned by, or do not particularly agree with, the positions on safety and security, that is, on the reliability and alignment of models with respect to human objectives. So I ask you to comment on this for systems as they are today, and then perhaps tell us whether your position remains the same for the even more advanced agentic systems we expect in the near future.
Tommaso: So... First of all, I think that this problem of human beings, having a linguistic perceptive channel that allows them to be reprogrammed, which puts them at risk of internal reprogramming, is on the one hand a risk but also the reason we have managed to grow. However, I think it is a human problem regardless of artificial intelligence, a problem of the modern world, and it is precisely a bitrate problem: we had more than saturated the bitrate that a human being can receive and discern probably already with Italia 1, at least in my generation. That is a problem of humanity that must be solved; we could spend days of conversation on it.
And in my opinion the biggest problem from the point of view of security, not safety, is the fact that the companies currently most responsible for producing language models, both the closed-source and the open-source ones, got there by selling advertisements, which is exactly the adversarial process we have been talking about so far, implemented by other human beings. Google, Facebook, Microsoft, the latter perhaps less directly, but in reality massively so, live on ad revenue, and ad revenue, when formalized, has exactly the same shape.
But I think, anyway: maybe there's a contract between OpenAI and Coca-Cola, and you probably wouldn't know it, but we're already getting exactly the right tokens to drink more Coca-Cola, and you'd realize it only two years from now, when you average it out. The main channel is that one: as long as the assistant is an object whose objective is to maximize the user's happiness, it is clear what the problem is.
There are all sorts of exaggerations about what can and cannot happen when people's goals are not well defined, when there are companies in between, and so on, entities that want to obtain things, governments more than companies if I have to worry; there the direction is very different.
From a safety point of view, I don't agree with, let's say, all the catastrophic scenarios of artificial intelligence taking over the world. I'm not willing to discuss them, in quotes, because to get there, there is a whole series of human crossroads about how we will survive as humanity at the geopolitical and economic level, at the level of avoiding world wars, which worry me much more, since I worry about what our role will be in the changing world, in the current structure. We have had a couple of wars in the last five years that we're surprised are happening, and in at least parts of those conflicts there's a lot of AI intervention.
I am now going beyond my expertise, but it was very clear that a good part of the first terrorist attack on Israel that started it all was due to a careful study of defense algorithms and the development of adversarial strategies against those defense algorithms, in parallel with a relaxation of human supervision. Those systems were in place, and this is something that happens systematically. We have all seen the drone videos, and in my opinion those are the worrying things. But not so much in the sense of saying "we have to make laws to block this or that": there are mechanisms in the world by which violence occurs that go far beyond the power of my words.
But I think we should at least put a lot of attention into understanding these problems, even though as citizens there is little we can do. I wouldn't want us to talk past each other: we were talking about science fiction, and suddenly we find ourselves having to consider what happens if 500 drones arrive in your village. So, therefore...
David: Trying to understand and summarize what you said: it's a question of prioritizing the concrete dangers, if any, that advanced technologies pose. We already have artificial intelligence applications among us that can cause harm, not because they are a superintelligence, but because they are concretely used in ways that harm us, for example in conflicts that states or groups trigger among themselves.
Let's go over another observation you made regarding embodied AI. The theory being put into practice today is that it is useful and necessary to give artificial intelligence systems the opportunity to have experience with the physical world: to interact with it, formulate hypotheses about cause-and-effect relationships, and plan sequences of actions to achieve their goals. The conclusion of embodied cognition is that without this ability artificial intelligence systems cannot advance much; there really is a need for this feedback loop when we deal with the real world.
And indeed there is enormous enthusiasm around the creation of a new generation of robots, particularly humanoid robots, which take advantage of the common sense they are able to acquire, thanks to the model they build of the physical world around them, to escape the cages in which the industrial robots of previous generations have always been enclosed. Those industrial robots are blind and deaf and very dangerous: there are very bad accidents, even fatal ones, that happen when an operator does not follow safety procedures and enters the cage, and the robot, not knowing a human being is there, strikes them.
The assumption is that the behavior of next-generation robots, based on experiences of the world mediated and interpreted by large language models, could be much superior.
And a second aspect is the acquisition and processing of an enormously larger quantity of data for training, and therefore for improving the systems, than the already-digitized data previously accessible through the internet interfaces we have used so far.
How do you see the value of this effort, regardless of the herd behavior of Silicon Valley investors who eagerly follow whatever technological wave arrives? And do you think this is actually the right time to turn our attention to embodied cognition, and are humanoid robots an appropriate vehicle for testing these hypotheses?
Tommaso: Starting from the erroneous behavior of venture capitalists: speaking more as their consultant than as someone who also seeks funds from them, I'm only pleased; the more mistakes they make, the more valuable my consultancy is, so goes the joke.
Beyond all this, there is certainly the obvious phenomenon called the sim-to-real gap: when you train a model inside a simulation, the simulation obviously has limits, and the main limits are our lack of understanding of physics, or our not understanding physics to the point of being able to compute it efficiently, of having efficient models and approximations.
So the current best simulator in the world, at a resolution similar to the one embodied agents would need, is probably Unreal Engine, a video game engine, where certainly the physics is a bit of a toy, where there is certainly no quantum mechanics, and where even particle behavior works in a completely different way. So it is obvious that a model that learns within that abstraction will have gaps with reality: the gaps that come from explaining the world as if it were made of 4 cm x 4 cm cubes of matter, the equivalent of polygons within a simulation. And we know from physics that there is a whole series of phenomena and dynamics, starting from thermodynamics itself, where if your spatial resolution drops by a certain amount, the uncertainty on the outputs increases drastically.
So that is how I would see the need for feedback loops in the real world: to correct simulation errors. On the other hand, we are now also very accustomed to residual architectures, in which models are built on top of each other, whether within the architecture itself, in the composition of multiple models, or as boosting; these are classic machine learning strategies.
So I imagine the main procedure will be to start with some very good physics simulators, then pass through another, more video-game-like simulator that gives you an entity-based abstraction, based on entities and things, not molecules, atoms, waves, or polygons, and then this will end up in the real world. In the real world it will have an error, and this error will be reduced by a second level of models that deal with closing this gap between simulated and real.
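A minimal sketch of that residual idea, assuming NumPy: a cheap simulator predicts the next state, and a small model fitted on real transitions learns only the simulation error. The toy "reality" with hidden friction and all names are illustrative assumptions.

```python
# Residual sim-to-real correction: learn only the gap, not the whole dynamics.
import numpy as np

def simulator(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Toy physics engine: frictionless prediction of the next state."""
    return state + action

def reality(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Pretend the real world has friction the simulator does not know."""
    return state + 0.8 * action

# Collect (state, action, real next state) transitions and fit the residual;
# least squares stands in for the "second level of models" in the text.
rng = np.random.default_rng(0)
S, A = rng.normal(size=(1000, 2)), rng.normal(size=(1000, 2))
residual = reality(S, A) - simulator(S, A)
X = np.hstack([S, A])
W, *_ = np.linalg.lstsq(X, residual, rcond=None)

def corrected(state, action):
    return simulator(state, action) + np.hstack([state, action]) @ W

s, a = np.ones(2), np.ones(2)
print(reality(s, a), corrected(s, a))  # the corrected model matches reality
```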
Probably, as models reach the real world, the cost of training directly on real-world data will fall. On the other hand, as you start to have different robots, with completely different bodies, different physical tolerances, different materials, working in different environments, perhaps the correct way to transfer from one situation to another is actually to abstract everything back into a game engine and learn things there. So we must also be careful not to be taught the bitter lesson again: perhaps the lesson of these embodied agents is not to build 12 million more of them to collect data, but that 3,500 are enough to improve Unreal Engine 5 into an Unreal Engine 8 that is so much better that the agents trained inside it work much better too. So, here, on the other side...
David: There will be many approaches, and there will also be a need to standardize not only how these next-generation robots interpret and act on the world, but also how they interact with us and how we can interact with them. I have not seen this photograph anywhere else, and in my opinion the most important shot of Optimus, the robot under development at Tesla and already being tested in its factories, is the one I'm pointing at: the button that stops it. At the moment there is no thumbs-up or thumbs-down on a screen to tell ChatGPT "Look, you told me something stupid" and give it feedback, so the slap on the back of the robot's head will be the way to tell it "I don't like what you're doing that much!" A Cannavacciuolo is probably necessary to get there.
Tommaso: Exactly. I think there are several interesting elements, perhaps beyond the topic we are dealing with today, which concern precisely the usefulness of certain shortcuts we can apply: both the humanoid form of a robot that interacts with a world designed for human beings, which gives a natural compatibility, and a desirable anthropomorphization of its behaviors, since if it had a radically different shape we would struggle to understand what it is about to do and how to relate to it.
That said, I think making humanoids is a very lazy UX choice. I understand it, among all the possible choices, but it is very similar to the idea that we have to remake a human brain to build things that think, whereas in some respects we are completely different. The human body is really difficult. For simpler things, much simpler things, I would have expected us to spend more time building things that look before building things that do. We don't yet have, and here we return again to espionage and to security matters that perhaps become dangerous, curious drone robots; we don't yet have things that explore and discover. Ten years ago I would have expected things like self-guided submarines that explore the ocean on their own and discover things about the world that we haven't seen; those seem much simpler to me.
David: Those are beautiful applications, and I agree with you that this explosion is still to come. Perhaps it will be driven by the learning curves we go through in building not hundreds or thousands of units, which as you said are perhaps sufficient for training better models that collect data from the real world, but millions of units that lower the cost and therefore allow more radical experimentation, which today might otherwise not be within reach of groups curious to explore the underwater seabed or to create drones that are compatible with us but that discover useful things.
I'll ask you one last question; we've reached the end of our conversation today. You mentioned the energy cost of the training phase of current models. Even the most recent, and therefore more powerful, NVIDIA cards are optimized in favor of the training phase, necessarily penalizing the inference phase. They consume a lot, and training in any case is very expensive, both in terms of initial capital investment for all these cards, not thousands but even tens or hundreds of thousands of cards, which today are purchased and put to work by the largest producers of these models. But those amounts of energy used for training are, in perspective, almost negligible compared to the energy needed for the millions or hundreds of millions of people who use the models every day to receive their outputs. How do you see this evolving, also observing the investments being made in ever more specialized chips to optimize model inference?
Tommaso: So... I think the specialized-chip side is a bloodbath, in the sense that something will work, probably, but what I would tell a friend trying to build one is: the probability that there will be a better chip is high, but most likely it won't be yours. Here again we are almost giving investment advice on Nvidia or not.
So, from my point of view, I think the ultra-optimization for training versus inference is trivial and is due to the fact that nobody was doing inference before, beyond the published paper. The inference question was practically created first by Stable Diffusion and then by ChatGPT. In fact, Stable Diffusion single-handedly wrecked the release of the 3090, which they had to remove from the market because they were too good in terms of cost per image-generation capability, and they more or less disappeared. Nobody was able to buy them anymore, because the 4090s came out immediately afterwards for 800 euros more.
David: I say this from personal experience, having tried to buy several. In perspective, one of the reasons the 3090 was not available was also the peak of crypto mining.
Tommaso: I assure you that I tried to buy them afterwards, but I couldn't. There must have been all those thousands of cards, but I never actually saw them. I think they ended up directly in Chinese data centers, by imperial edict, so to speak.
Currently, the inference side is, I think, very interesting. The new generation of MacBooks is doing some very interesting things, and personally, on one of these newer machines, pretty much any state-of-the-art open model is usable; with a little quantization, almost anything can run. Probably when the 300-400-billion-parameter Llama is released that will no longer be the case. However, the best home setup for inference is currently probably a MacBook, because the main limit is the amount of VRAM, which is very limited on Nvidia gaming cards: while the data center cards go from 80 to 160 GB each, before it was 40 and 80, progressing little by little, the gaming cards top out at 24 GB and are practically unsustainable from an energy point of view. In Italy, using a dual 4090 at home trips your breakers: you need 1500 watts, while a MacBook does the same thing with 180 watts at 30% lower speed.
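A back-of-the-envelope check of that VRAM argument; the parameter counts are round numbers, and only the weights are counted, ignoring activations and the KV cache.

```python
# Rough weight memory: parameters x bits per weight / 8 bits per byte.
def weight_gb(params_billions: float, bits: int) -> float:
    return params_billions * bits / 8  # 1e9 params and 1e9 bytes/GB cancel

for params in (7, 70, 405):
    for bits in (16, 4):
        print(f"{params:>3}B @ {bits:>2}-bit: {weight_gb(params, bits):6.1f} GB")
# 7B fits a 24 GB gaming card even at 16-bit; 70B needs ~35 GB at 4-bit,
# beyond any gaming card but within reach of high-RAM unified memory; 405B
# exceeds even 128 GB of unified memory at 4-bit.
```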
Microsoft says they have a competitor: this new third type of memory, unified memory, with what they are calling neural processing units, which are basically CPUs with RAM on top. And then there is Groq to watch, the one with a Q, not a K, which currently seems to me the only specialized-inference situation that is truly successful.
On the other hand, these seem like products intended to serve giant corporations rather than end users, because end users are probably looking for flexibility rather than specialized architectures that require hardware tuning for each model architecture. With Groq, I believe you may even need a weld or two if the architecture changes, in the sense that, obviously, the less you virtualize, the more efficiency you can have. So there's this trade-off: you can make new hardware that works specifically for one thing, to the point that one could put the model's own parameters into the chip, have a transistor per parameter, and no longer have the problem of moving all the weights around; but on the other hand, every new architecture that comes out makes your hardware obsolete. The hardware manufacturing cycle is currently one of the biggest building blocks of modern technological growth, and our reliance on TSMC and ASML for this is very, very important.
David: To confirm what you said: I actually switched to a very recent MacBook Pro with an M3 processor, and I went for the M3 Max with as much RAM as I could get, precisely to be able to run some models locally. In particular, llama.cpp is wonderful: the possibility of running a model without having to recompile anything, very simple to use. The Microsoft NPUs you mentioned aim to give comparable performance. And both of these solutions, a Mac or a Windows notebook based on this type of architecture, are for local processing, while Groq, which you mentioned, is still processing in the cloud, which therefore has to be managed remotely, with the advantage that it can grow with use and the disadvantage that it depends on the performance, availability, and strategy of that particular supplier.
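For the record, a minimal local-inference sketch using llama-cpp-python, the Python bindings for the llama.cpp project mentioned here; the model path and the quantized GGUF file are placeholders for whatever model you have downloaded.

```python
# Local inference on a quantized model via llama.cpp's Python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-q4_k_m.gguf",  # hypothetical quantized file
    n_ctx=2048,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU / Apple Silicon
)

out = llm("Q: Why does quantization reduce memory use? A:", max_tokens=64)
print(out["choices"][0]["text"])
```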
Regarding hardware growth cycles, I have been noticing for a while how Nvidia constantly emphasizes that they no longer follow Moore's law, and that the process I call "jolting technology" in the design and implementation of their systems is much, much faster. This dates back to last year; I also developed another image from Jensen Huang's presentations which highlights even more how performance would have improved less than a thousandfold had they merely followed Moore's law, while in the last ten years they have actually increased it 10 million times.
Even Tesla, which had announced the creation of its own proprietary architecture for artificial intelligence called Dojo, willingly agreed to slow down the production and deployment of data centers based on Dojo because, as you say, competing with NVIDIA is not simple at all, and the training they do now for autonomous driving and other things, including the large language model of the X platform, called Grok with a K, runs on NVIDIA, not on their proprietary Dojo architecture.
Tommaso: Yes, I think the only ones who perhaps don't train on Nvidia are Google, with their TPU clusters, though it could well be that their most recent things also run on Nvidia.
An interesting thing about that graph: it showed, among other things, the P100s, almost legendary cards that came out almost alongside those events. There were very few of them; Amazon never had them, only Google had them, and I think Alibaba at the time. So Amazon went from the K80s, which were really ancient slippers made for computer graphics, to the V100, which now seems a bad machine to have to use, Google still gives them away free with Colab, but it was the first of the new generation of data center machines with InfiniBand for doing multi-machine training efficiently.
The K80s had a data-movement problem: you wasted more time loading the batches onto the GPU than computing. It was just another world, pre-design: they weren't designed for this yet. It was hardware that by chance turned out great for machine learning, but it had probably been built expecting a future of virtual reality, fifteen years early, in which everyone would render things in the cloud, which never happened.