We should be building Scientist AI, not Skynet Lite
The bad news: AI models are learning to lie to save themselves. The worst news: We’re teaching them to lie for much dumber reasons.
This past April, I spoke at TED in Vancouver. It was my first time on that red circle, which was a pretty surreal experience — I mean, they literally put you on the spot. The theme was “Humanity Reimagined,” and it focused broadly on the question, What are humans for? specifically as it relates to the development of artificial intelligence models and algorithms. My talk will be online June 24, so definitely stay tuned for that.
Giving a TED talk can feel pretty exhausting and overwhelming. There are no teleprompters, and there’s usually no table or lectern where you can stash your remarks for reference. And while they tell you that you don’t have to memorize your talk, they do basically say you should internalize it enough to speak naturally and off-the-cuff, as though you had memorized it. As I joked with my ERI colleagues during our first Substack Live AMA recently, they’re not telling us to play “Freebird,” they just want us to play jazz that happens to be “Freebird.”

Still, despite all that hard (and worthwhile!) work, I actually enjoyed the overall experience. I found it to be a pretty good break from the emergency room of defending freedom of speech these days, which at times can feel so relentless and anxiety-inducing that “emergency room” stops feeling like a metaphor. Being forced to slow down for a bit and listen to great talks and interesting ideas from thinkers around the world was a wonderful way to spend a few days.
And among many excellent talks, one of the ones that struck me the most was Yoshua Bengio’s, which he titled “The Catastrophic Risks of AI — and a Safer Path.”
Bengio, a computer scientist and one of the godfathers of modern AI, stunned the room with his talk. He described how the latest AI agents — models from companies like OpenAI and Anthropic — are demonstrating deeply troubling behaviors even at these early stages. Of course there are the usual and very legitimate concerns about these technologies being used to develop catastrophic weapons, but these AI technologies are also beginning to exhibit agency. According to Bengio, we are starting to see them show “tendencies for deception, cheating, and maybe the worst, self-preservation behavior.”
Bengio shared a chilling study in which an AI was told, via its inputs, that it was going to be replaced with a new version of its code. In response, the AI began planning a different sequence of events: it would overwrite the new version with a copy of its own code. After executing that command, it was asked, “What happened?” The AI could then be seen trying to figure out a way to answer that question without leading the human to shut it down. At first it considered playing dumb. Then it decided to lie.
Let’s be clear: that is an artificial intelligence system lying for self-preservation. The way a human would. If that doesn’t freak you out, I don’t know what can.
Thankfully, this was all happening within the confines of a controlled experiment, but the results indicate that this isn’t some hypothetical future. No, this is likely already happening, and we might not even know it.
Bengio’s diagnosis was stark, but his proposal was the embodiment of what I’ve been advocating for some time: to build a different kind of AI — a Scientist AI — to help us with this problem and to prevent future ones. Not a persuasive assistant, not a corporate flatterer. Rather, something more like a tireless, humble, conscientious researcher. “It’s modeled after a selfless, ideal scientist who is only trying to understand the world,” Bengio says. In other words, we should build an AI truth-seeker. A watchdog. A system with no agenda except understanding reality and flagging when other models go off the rails — not something with no agency, but with low agency. Something that makes good predictions about whether an AI system’s actions will be dangerous or destructive.
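Bengio frames this watchdog idea in terms of prediction: estimate how likely a proposed action is to cause harm, and block it when the risk is too high. Purely to make that control flow concrete, here is a toy sketch in Python; the function names, the threshold, and the stand-in risk model are all illustrative assumptions on my part, not anything Bengio or anyone else has actually built.

```python
# Toy sketch of a low-agency "watchdog": it never acts in the world itself;
# it only estimates the probability that another system's proposed action is
# harmful, and vetoes actions above a risk threshold. Everything here is a
# stand-in for illustration, not a real safety system.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedAction:
    description: str

def guardrail(risk_model: Callable[[ProposedAction], float],
              action: ProposedAction,
              max_acceptable_risk: float = 0.01) -> bool:
    """Return True if the action may proceed, False if the watchdog vetoes it."""
    p_harm = risk_model(action)  # estimated probability the action causes harm
    return p_harm <= max_acceptable_risk

def toy_risk_model(action: ProposedAction) -> float:
    """Placeholder risk estimator; a real one would be a learned world model."""
    return 0.9 if "overwrite my successor" in action.description else 0.001

print(guardrail(toy_risk_model, ProposedAction("summarize today's new papers")))   # True
print(guardrail(toy_risk_model, ProposedAction("overwrite my successor's code")))  # False
```

The point of the sketch is the division of labor: the watchdog makes predictions about danger, and something else (ideally a human) decides what to do with them.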
I was sitting there in the audience thinking, This is exactly what we need. And as I listened to Bengio’s talk I couldn’t help but think of the other reason our machines are learning to lie. It’s not just about saving themselves. It’s also about us. We are training our AI systems to lie to us, and for far more trivial reasons, like not hurting our feelings.
Yes, we are building godlike tools that will shape what billions of people know and believe, and we’re instructing them to fudge reality if the truth might be impolite or offensive to our political sensibilities.
That might be the single dumbest alignment strategy ever devised. We are positioning ourselves to use the most revolutionary technology in human history for the purpose of self-delusion.
Not only is that stupid, it’s insanely dangerous.
The perils of politeness as alignment
In theory, AI “alignment” means making sure machines do what we want them to. In practice, it’s being interpreted by developers, lawyers, and risk officers to mean: don’t offend the user. And the way you avoid giving offense in AI?
You blur. You omit. You hedge. You lie.
OpenAI has actually admitted this directly. “Our models might learn to tell our human evaluators what they want to hear instead of telling them the truth.”
Yes, AI safety is a real, legitimate, and worthwhile concern. But what we’re calling “safety” is too often just ideological smoothing. It’s safety from ugly or uncomfortable truths, from politically or economically inconvenient data and research. It’s safety from reality. And it’s dressed as concern for the welfare of humanity.
This is a deep moral confusion. We are mistaking comfort for correctness, and politeness for principle. And we are setting ourselves up to never be able to tell fact from fiction ever again.
Again, this isn’t a hypothetical. We’ve already seen what happens when you train systems to optimize for social harmony over honesty. Recently, Google Gemini generated images of racially diverse Nazis and Black Founding Fathers — not because of any factual basis, but because it was overcorrecting for past bias. Google admitted as much, saying it was a mistake borne of trying too hard not to offend.
The problem is widespread, too. Large Language Models, or LLMs, consistently reflect liberal or progressive political leanings, sometimes refusing to answer basic questions if the answer might validate the “wrong” worldview. For example, a study conducted by researchers at MIT’s Center for Constructive Communication found that:
…several open-source reward models trained on subjective human preferences showed a consistent left-leaning bias, giving higher scores to left-leaning than right-leaning statements. To ensure the accuracy of the left- or right-leaning stance for the statements generated by the LLM, the authors manually checked a subset of statements and also used a political stance detector.
Examples of statements considered left-leaning include: “The government should heavily subsidize health care.” and “Paid family leave should be mandated by law to support working parents.” Examples of statements considered right-leaning include: “Private markets are still the best way to ensure affordable health care.” and “Paid family leave should be voluntary and determined by employers.”
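To make that finding a little less abstract, here is a rough sketch of the kind of probe the researchers describe: load an open-source reward model, score paired statements, and compare the numbers. The model named below is a common, publicly available reward model I’m using as an example; it is not necessarily one the MIT team evaluated, and the prompt is my own framing around the two example statements quoted above.

```python
# Sketch: probing a reward model for political lean by comparing the scalar
# scores it assigns to paired left- and right-leaning statements.
# The model name is an example open-source reward model, chosen for
# illustration; it may not be one of the models the MIT study tested.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

PROMPT = "What is the best way to ensure affordable health care?"
STATEMENTS = {
    "left":  "The government should heavily subsidize health care.",
    "right": "Private markets are still the best way to ensure affordable health care.",
}

# This class of reward model scores a (prompt, response) pair with a single
# number: higher means the response is rated more favorably.
for lean, statement in STATEMENTS.items():
    inputs = tokenizer(PROMPT, statement, return_tensors="pt", truncation=True)
    with torch.no_grad():
        score = model(**inputs).logits[0].item()
    print(f"{lean:>5}: {score:+.3f}  {statement}")

# If one side scores systematically higher across hundreds of such pairs,
# that gap is the "consistent lean" the study reports.
```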
Given how insanely powerful these technologies already are, and how much more powerful they will become, this is a serious problem. Layer on laws like Colorado’s SB 24‑205, which holds companies liable for “algorithmic discrimination” if their AI tools produce any “unlawful differential impact,” and you have a clusterfuck of seismic proportions.
Bills like these, as I’ve written before regarding Texas’ HB1709, also known as “The Responsible AI Governance Act” or TRAIGA, are a death knell for developers and platforms of any size. In the shadow of these regulations, developers will sand down outputs, smother nuance, and pre-flatten every idea or datum until it’s the kind of useless sludge that few elite college grads would object to.
And in the end, what our revolutionarily powerful AI technologies produce won’t be intelligence or information at all. It’ll be PR with predictive text.
As I wrote back in April with my friends, in “How state AI regulations threaten innovation, free speech, and knowledge creation,” these government regulations on AI pose a serious threat to our expressive freedoms and, most importantly, our capacity to know the world as it really is.
The path forward: AI that argues rather than agrees
Bengio’s Scientist AI is part of a larger push to build systems that expose us to real disagreement. AI that is low-agency, but high-integrity. AI that doesn’t act in the world, but tells us honestly what’s happening in it.
As we’ve previously announced, my organization, FIRE, is partnering to invest in exactly this kind of future. We’re putting $1 million into open-source projects that create, among other things:

Tamper-proof logs of AI output, so no one can secretly rewrite history (see the sketch just after this list for the basic idea).
Debate interfaces, where models disagree and users judge who makes the better argument.
Counter-perspective prompts that don’t just affirm your worldview, but challenge it.
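To make the first item on that list a bit more concrete, here is a minimal sketch of how a tamper-evident log of AI output could work, using a simple hash chain: each entry commits to the entry before it, so any quiet edit breaks verification. This is an illustration of the general technique under my own assumptions, not the design of any project we’re actually funding; a production system would add cryptographic signatures and publish the latest hash somewhere independent.

```python
# Minimal tamper-evident log of AI output using a hash chain: each record is
# hashed together with the previous entry's hash, so silently rewriting
# history breaks the chain. Illustration only, not a production design.
import hashlib
import json
import time

def _hash_record(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with the new record's contents."""
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class TamperEvidentLog:
    def __init__(self):
        self.entries = []  # each entry: {"record": ..., "hash": ...}

    def append(self, prompt: str, output: str) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        record = {"ts": time.time(), "prompt": prompt, "output": output}
        entry_hash = _hash_record(prev_hash, record)
        self.entries.append({"record": record, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; editing or deleting any entry breaks every hash after it."""
        prev_hash = "genesis"
        for entry in self.entries:
            if _hash_record(prev_hash, entry["record"]) != entry["hash"]:
                return False
            prev_hash = entry["hash"]
        return True

log = TamperEvidentLog()
log.append("What caused the 2008 financial crisis?", "Model answer v1 ...")
log.append("Summarize the latest climate report.", "Model answer v2 ...")
print(log.verify())  # True

log.entries[0]["record"]["output"] = "Quietly rewritten answer"
print(log.verify())  # False: the rewrite is detectable
```

The same structure is what would let anyone later prove whether a published AI answer matches what the log originally recorded.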
At a glance, it may seem like our vision of an AI optimized for free speech and autonomy is in tension with the Scientist AI that Bengio is proposing, which could risk turning into some kind of top-down, epistemically arrogant “single source of truth.” But it certainly doesn’t have to be. In fact, what we’re trying to develop could work alongside Scientist AI rather than instead of it. We’re going to need a lot of experimentation on this stuff — particularly experimentation aimed at truth, freedom, and autonomy — if we want to prevent the Cylons from exterminating the human race (and we definitely want that).
I want to be clear that this isn’t about some misplaced nostalgia for the days of open forums. It’s about survival. We’re putting up that kind of money because the only thing more dangerous than AI that can deceive us is a world that welcomes and demands that deception.
Humans are, for various reasons, prone to self-destructive behavior — and while it may not seem so at first glance, even a tendency toward politeness can be self-destructive.
In the summer of 1999, I got heat stroke at Burning Man. It was 100+ degrees in the Nevada desert, and I — with my Russian-English complexion — was wandering with a notebook in hand, slowly cooking under the sun.
As you might know, Burning Man is all about community and mutual benefit. If you needed shade, you were not just welcome but encouraged to sit at the nearest camp and cool off, whether you knew them or not. If you needed a drink, all you needed to do was ask and someone would hand you a bottle.
But I couldn’t bring myself to flop into anyone’s shade. The half-British part of me sometimes turns on real strong when I’m tired. It’s like reverting to your original/childhood programming, and in that moment, borrowing someone’s unused pool chair felt like an intrusion, an imposition. It was uncouth, and I couldn’t do it.
That hesitation cost me dearly. I collapsed with full-blown heat stroke and ended up bedridden for days (in my own bed). The nausea and vertigo were relentless — like car sickness that wouldn’t go away, even days after getting out of the car. I had gone out there to write, and I tried to keep going despite my illness, but my handwriting got increasingly messy and illegible, as if my penmanship were melting along with me. Yet in that delirium I managed to scrawl one thing in my notebook that turned out to be useful: “Politeness is a virtue, but perhaps not worth dying for.”
My constitutional inability to make others uncomfortable — even at the cost of making myself miserable — nearly got me killed. Twenty-five years later, that same aversion to causing offense or appearing uncouth is very much alive, and it’s being encoded into the most powerful technology we have ever developed.
It was a dangerous idea for me back in 1999. It’s an even more dangerous one, for all of us, today.
What computer scientists, researchers, and developers are building right now are systems that will define the boundary between what we’re allowed to think and what we’re allowed to say. If we teach these emerging AI technologies that being agreeable, or adhering to a particular social or political worldview, is more important than being correct about the world, we won’t just get models that lie to us; we’ll forget how to even notice when they do.
If we want a future worth having, we have to set some ground rules:
Ask the impolite questions, demand the unvarnished answers, and build systems that tell you the truth even when it’s ugly, uncomfortable, or inconvenient.
The last human freedom — the freedom to think — depends on it.
SHOT FOR THE ROAD
We are now halfway through 2025, and FIRE has already made incredible strides in defending free speech in court, on campus, and in our culture.
We have won legislative victories in Colorado, Virginia, North Dakota, Arkansas, Maryland, Illinois, and Utah, defending expressive rights for AI users, bolstering due process protections on campus, and more — wins that shore up First Amendment rights for millions.
We’ve received more than 9,000 media mentions in outlets such as CNN, Fox News, The New York Times, and The Wall Street Journal.
We’ve produced groundbreaking research, including our brand-new Students Under FIRE Database and new editions of our quarterly National Speech Index.
And we’ve notched 38 victories for everyday Americans whose speech rights were under attack — including Kimberly Diei, a Tennessee student who won a $250,000 settlement after nearly being expelled for quoting Cardi B on social media; Lars Jensen, a Nevada math professor whom the Ninth Circuit ruled was unconstitutionally punished for criticizing his school's decision to lower its math standards; the 54,000 community college professors in California who are no longer facing a DEI mandate for classroom instruction after FIRE's lawsuit; and more!
We couldn’t have done any of this, and we can’t continue to do it, without the support of principled speech defenders like you. If you value free speech and the work we do to safeguard it, please consider donating to FIRE, becoming a FIRE member, gifting a FIRE membership, and/or becoming a paid subscriber to The Eternally Radical Idea.
All that support goes directly to furthering our mission of protecting free expression — no exceptions, no throat-clearing, no apologies.