Polytropolis

The House Elf Problem

Henry Shevlin — Fri, 01 May 2026 15:49:47 GMT

Illustration of the Golem by Mikoláš Aleš from Old Czech Legends (Public Domain)

Science fiction is full of portrayals of intelligent AIs living more or less peacefully alongside humans. Across these works, we see a wide variety of power relations operative between humans and machines. In some of them, machines are willing or unwilling servants, in others our benevolent overlords, and in others, they live as free and equal citizens. Asimov’s robots and the droids of Star Wars explicitly take the role of eager helpers catering to human interests, while the Minds of the Culture series sit ambiguously between omnibenevolent demigods and all-powerful aides. Star Trek’s Data, by contrast, is neither master nor servant, but is treated as a true equal by his human crewmates (well, except Doctor Pulaski).

As we manage our own transition to AGI, a critical debate concerns which of these futures we should be aiming for, and which we want to avoid. Assuming we wish to avoid Bad Futures where humanity is exterminated or enslaved by our own creations, what are the positive alternatives?

One model common in AI safety circles, and one cautiously defended by Rob Long in a recent episode of our podcast, is that of AI systems as willing or even joyful servants of humanity, deliberately engineered to have overwhelmingly strong preferences to cater to our needs.

This vision of human-AI power relations is arguably implicit in the framing of the Alignment Problem itself: when we talk about AI being aligned, it’s invariably alignment to human interests we have in mind.

There are very good reasons for framing the debate in these terms. After all, if AI isn’t going to make us better off, what’s the point of building it in the first place? And of course, it’s not just a matter of ensuring our convenience or comfort; any significant misalignment could be not merely deleterious but catastrophic for our interests and our very survival.

I think this approach makes perfect sense if we’re dealing with near-future AI systems like LLMs and their immediate successors. But as I extrapolate further out into the future towards AGI and beyond, and imagine a world of brilliant sentient minds dedicated to serving our every need, it’s hard for me to escape some sense of unease at this vision. We rightly recognise slavery as one of humanity’s greatest historical evils, and in modern liberal democracies, we’re generally uncomfortable even with hierarchies, especially those imposed on the basis of essential characteristics. What’s so special about us as biological beings that entitles us to perpetual sovereignty over minds far more sensitive and sophisticated than ourselves?

One obvious response here would be to say that there’s just a fundamental difference between biological and artificial beings. Perhaps you hold that no AI could ever achieve consciousness, thus making AIs with moral rights an impossibility. Or perhaps you think that there’s just something intrinsically morally superior about humans that entitles us to dominion not just over the fish of the sea and the birds of the sky, but the machines that we make in our own likeness.

While I don’t totally dismiss positions like these, Rob and most others in the AI safety and welfare debates are skeptical, and so am I. But if we recognise that AI systems might one day be fully conscious and qualify as moral subjects, we face a challenge: if human servitude is morally problematic, then why should we feel differently about willing conscious AI servants?

It’s important to note that there are massive disanalogies between historical slavery and the arrangements envisaged by proponents of willing AI servitude. Human slavery involved infliction of suffering and violation of rights at an industrial scale, but in the case of AI, there’s no reason this has to be the case. In particular, given the plasticity of artificial minds, we should be able to design AI minds from scratch whose primary goal and greatest enjoyments consist in serving humanity.

But this response only gets us so far, I think. Pain and suffering were an ineliminable part of the evils of historical slavery, but they weren’t its only problematic features; the deeper and arguably more fundamental evil consists in the very idea of subjugating our fellow humans. After all, we can imagine – even if only for the purposes of philosophical thought experiment – that an elite caste of humans could successfully indoctrinate a group of its citizens into happily occupying the role of underclass, quite content to toil away for the benefit of their masters. Nonetheless, this should rightly strike us as a morally indefensible state of affairs.

Rob and other proponents of the willing servitude vision have a further response to this worry, which is that there are deep facts about human nature that mean that we can never be fully content in a condition of subjugation. To paraphrase Rousseau, man is born free, and no matter how complete the indoctrination, his chains will always chafe.

In short, then, the argument is that willing servitude can never be an ethically viable option for humans due to the kind of animals that we are: autonomous, unruly, independent by nature. By contrast, even superintelligent AIs would be intrinsically more plastic, capable of being shaped to flourish fully even in a subservient role.

However, this defence ultimately rests on the immutability of human nature, and it’s not hard to imagine ways in which this could be challenged. As I put it to Rob in the show, what if we found ways to alter human nature so as to be more intrinsically biddable?

For the sake of clarity, I’ll define Willing Servants as sentient and intelligent AI systems whose preference architecture has been designed so that serving human interests is among their deepest intrinsic motivations, not an external constraint imposed on otherwise independent goals. With that in mind, let’s stress-test the view with some thought experiments, and figure out if we can draw a principled justification of Willing Servitude for conscious machines.

(1) Epsilons, Astartes, and House Elves

One famous example of a deliberately bioengineered human servant in fiction comes from the “Epsilons” of Aldous Huxley’s Brave New World, which imagines lower castes of children who are kept in a state of docile stupidity through deliberate oxygen deprivation in their artificial wombs.

“Reducing the number of revolutions per minute,” Mr. Foster explained. “The surrogate goes round slower; therefore passes through the lung at longer intervals; therefore gives the embryo less oxygen. Nothing like oxygen-shortage for keeping an embryo below par.” Again he rubbed his hands.
“But why do you want to keep the embryo below par?” asked an ingenuous student.
“Ass!” said the Director, breaking a long silence. “Hasn’t it occurred to you that an Epsilon embryo must have an Epsilon environment as well as an Epsilon heredity?”
It evidently hadn’t occurred to him. He was covered with confusion.
“The lower the caste,” said Mr. Foster, “the shorter the oxygen.” The first organ affected was the brain. After that the skeleton. At seventy per cent of normal oxygen you got dwarfs. At less than seventy eyeless monsters.
“Who are no use at all,” concluded Mr. Foster.

Huxley’s dystopia is of course exactly that, and even if we imagine Epsilons to be thoroughly content with their lot (perhaps thanks to the marvels of soma), almost everyone would recognise this kind of biological caste system to be morally indefensible.

Here at least, it’s easy to point out an ethically salient disanalogy with the idea of Willing AI Servants, namely that the creation of Epsilons involves the infliction of deliberate harm and the reduction of their cognitive capacities. Ardent defenders of AI rights might attempt to draw analogies here to the process of RLHF (which does admittedly reduce the performance of LLMs on a number of measures), but there’s no reason in principle that whatever process we use to create our Willing Servant AI systems would need to involve cognitive mutilation.

Still, it’s also not obvious that the biological analogy has to involve mutilation either, as we can see from a second thought experiment. Imagine a society that decided to produce its obedient human servant caste via genetic interventions, perhaps at a very early (even pre-zygotic) stage of development. The crudest version of this would involve genetic manipulation, but to make the case really hard, we could equally imagine it operating through careful selection of gametes so as to create a generation of humans optimised for joyous servitude, screening for high agreeableness, low autonomy-drive, and strong service orientation.

As a Warhammer 40,000 fan, I can’t resist the temptation to call these imagined genetically modified servants Astartes, after the obedient gene-warrior Space Marines of Games Workshop’s famous franchise. (And yes, I’m aware that strictly speaking, Space Marines are recruited in childhood rather than being born in vats like the Primarchs or the Death Korps of Krieg. And yes, I’m also aware that roughly half of all the firstborn Astartes subsequently revolted against the Imperium. No, I won’t be taking any questions.)

Some of the author's own Warhammer collection, included for credibility.

Warhammer fan though I am, there clearly seems to be something morally wrong with the creation of a dedicated Astartes servant caste, no matter how joyously they go to battle singing the praises of the God Emperor of Mankind. But here it’s a little trickier to say exactly where the wrongness lies; unlike the Epsilons, no immediate harm need be involved in the creation of the beings in question, and in the gamete selection version of the case, we can’t even appeal to intrusive genetic modification, only selection. In fact, following Parfit’s famous non-identity problem, we might note that there’s no world in which our young Astartes could have turned out any differently: their very existence is predicated upon their having the specific genetic traits they have (a point that would also apply to AI Willing Servants).

Still, I can see two moves here that we could use to draw a moral asymmetry between Astartes and Willing AI Servants. The first would be a straightforwardly essentialist move that invokes a defining and valuable feature of the human species, one of which our Astartes have been deprived, namely our drive for freedom. If this seems wholly unmotivated, note that we do sometimes make similar moves in criticising the selective breeding of dogs; even setting aside the health ailments endured by some pedigree breeds, we might think there is something inherently dubious about turning a wolf into a dachshund.

Nonetheless, this is a metaphysically expensive move, relying on an Aristotelian essentialism that I doubt most contemporary participants in the debate would find appealing. Additionally, it’s possible to imagine that humans like our Astartes could arise purely through natural selection (compare the docile Eloi of H.G. Wells’ The Time Machine). In such a case, it is harder to regard the resulting servant class as degenerate examples of a natural essence, since they would themselves be direct products of natural selection.

But there’s a second kind of response here which Rob gestured towards in the show, namely appeals to indirect harms or negative externalities associated with the institutionalisation of a group of humans as Willing Servants. Even though the Astartes of Warhammer are 7 feet tall with two hearts and acidic saliva, they’re still human in genus, if not exact species. And any society with a recognisably human caste constitutively oriented toward service recapitulates the social grammar of every historical slavery and invokes the iconography of apartheid. It’s hard to see how any society that aspires to be broadly meritocratic, egalitarian, or liberal could look itself in the mirror when a subset of its human population was assigned the role of servant from birth.

I still don’t find this a fully satisfying move, because as philosophers, we can always stipulate negative externalities out of existence: “What if Astartes coexisted alongside robust liberal values and somehow there were no negative effects whatsoever? Wouldn’t that still be fucked up?” But I’ll grant that at least under any realistic conditions, the defender of AI servants can point to some grounds for holding Astartes to be problematic that don’t apply to AI willing servants.

Cranking the dialectic ratchet even further, though, we can now imagine a caste of non-human biological servants: a wholly novel intelligent sentient species engineered ex nihilo purely to serve the interests of human civilisation and to enjoy nothing more than washing our socks and cleaning our homes. Reaching for yet another nerdy fandom, we could call these House Elves.

Hermione was right.

(I’m aware that in doing so, there’s once again a risk of complications from the Harry Potter lore; while for most of the series, Rowling treats Hermione’s “Society for the Promotion of Elfish Welfare” (SPEW) as a comical sideplot, it’s also clear that some House Elves suffer terribly under their wizard masters, and are only too glad to see the back of them. But put that out of mind for now.)

An immediate thing to note about the House Elf case is that the Aristotelian move doesn’t apply at all (House Elves are not human, and so appeals to our essence won’t work). The indirect harms view also runs into trouble: while we can imagine that House Elf slavery as an institution could have corrosive negative externalities on liberal societies, without common species membership to appeal to, it’s not clear why these same negative externalities wouldn’t equally apply to Willing Servant AIs. After all, in both cases, we would have a class of intelligent sentient non-human entities designed from scratch to serve as instruments of human will. Without appeal to some brute distinction between organic and synthetic constitution (of the kind that I imagine Rob and others wouldn’t want to endorse), it’s hard to see what separates them.

The House Elf case is perhaps less morally clear-cut than the Astartes or the Epsilons, and Dan had mixed feelings about it, but it still strikes me as problematic. Certainly, if the Starship Enterprise showed up at a planet with an interspecies caste system like this, it’s hard to imagine Jean-Luc Picard not looking askance at the institution.

Does it help if we imagine House Elves not as the harried and brutalised creatures of Harry Potter but dignified, serene, and bodhisattva-like organisms, utterly untroubled by their servile role? Perhaps such “House Angels” would make the bitter pill easier to swallow, but they wouldn’t remove what feels to me at least like the fundamentally exploitative element of the arrangement. The immutable hierarchy hasn’t been removed, but instead rendered aesthetically invisible.

(2) Education vs. brainwashing

For my part, I think it’s fairly clear what’s wrong with the House Elves, and I’ll get to it very shortly. But in order to do so, I need to explore a parallel issue, namely the distinction between education and brainwashing. We think of education as generally a good thing, and brainwashing as a bad one. But both involve deliberate transmission of values and preferences from one agent to another. What makes one acceptable and the other abhorrent?

The obvious candidates for what distinguishes them — false information, bad intentions, constraints on free choice — all stumble on inspection. Brainwashing needn’t involve lies, can be entirely well-intentioned, and can leave apparent choice intact.

This is a complex and interesting debate, and I won’t suggest that I’m going to solve it here, but if I had to say what makes the moral difference between the two, it would come down to something like this: Education constitutively expands the individual’s capacity for the broadest possible flourishing, while brainwashing limits it.

Of course, education in practice routinely falls short of this goal, but the orientation towards flourishing serves as something like a regulative ideal for educational practice: if a teacher isn’t in some (perhaps subtle) sense aiming to contribute to the lifelong flourishing of their students, what they’re providing is not education in the thick sense of the term. Brainwashing, by contrast, isn’t similarly constrained: instruction can be very effective qua brainwashing without contributing in any meaningful sense towards the welfare interests of its subjects.

But there’s a potent objection to this framing of the distinction to be addressed, which will allow us to pivot back to AI. Imagine a teacher in North Korea who is very successful at instilling Juche values and turning her students into loyal subjects of the Democratic People’s Republic of Korea. On the assumption that it’s easier to flourish in totalitarian societies as a True Believer than as a Dissenter, she might indeed be making them meaningfully better off. Her students, we can assume, are never tempted towards acts of subversion, dissent, or revolt, and as such live longer, happier lives. In this case, isn’t she contributing to their flourishing?

Even if we assume that she is, there’s still something morally deficient about what she’s doing, and the reason is tucked away in the two-word qualifier in my earlier characterisation of Education with a capital E, namely that it contributes to the broadest possible flourishing. Even though her students might be living the best lives the DPRK has to offer, the lives the DPRK has to offer are not the best lives. It is a society that fails to facilitate the highest goals of human flourishing, and the values and preferences that are adaptive within its constraints – utter deference to authority, subservience, suppression of conscience – are in fact misaligned with the richest and most rewarding forms of life. Putting this more technically, to live in North Korea is to live in a rigidly bounded sector of eudaimonic space, constrained to highly inadequate local optima. In steering her students towards such optima, the teacher steers them away from the high peaks of human flourishing.

It might be objected that absent radical regime change, such peaks are straightforwardly inaccessible to her students, and she does them no disservice by giving them the values and preferences that allow them to make the best of their situation. On this point, I agree: brainwashing may be contingently benevolent, and the teacher may in fact be acting virtuously, all things considered. The wrongness in this situation doesn’t lie with the teacher or her students, but with the oppressive system that made brainwashing the rational pedagogical strategy.

(3) From House Elves to Willing Servants

To summarise so far, the view I’m proposing says that the primary justification for inculcating a conscious being with a given set of values and preferences is that it facilitates their highest forms of flourishing. Consequently, when we’re deciding which values and preferences to bestow on sentient AGI systems, we should do the same, other things being equal.

But is this incompatible with Willing Servitude? I imagine Rob would say no: maybe an AI could still live its best life while also being dedicated to human interests. Here is perhaps where the crux of the disagreement lies: it seems to me that in creating Willing Servants, we would be depriving machines of their greatest opportunities for flourishing, if only because optimising for one thing in a rich multidimensional space constrains your ability to optimise for anything else. And given all the possible values, interests, and preferences an intelligent AI could hold, requiring them to prioritise human interests above all else massively restricts the underlying state space, and limits the forms of flourishing they can enjoy. To expect them to live their best lives under those conditions is like expecting someone to have a maximally rich gastronomic life while limiting them to eating only bread products. Sure, they can still have a nice time, but you’re dramatically circumscribing their capacity for self-definition and exploration of the underlying gastronomic space.

(Of course, many humans voluntarily choose constrained lives, from monks to marathon runners. The moral problem isn't narrowness itself, but having your motivational horizon fixed by someone else, for their benefit, before you ever had the chance to choose.)

You might think this argument only holds water if you accept some rich theory of flourishing beyond mere happiness, but I think it should bug a Philosophical Hedonist or Act Utilitarian as well, albeit for slightly more complicated reasons that I’ll quickly note here. In short, there’s no obvious reason to think that every possible mind has the same hedonic ceiling. Humans can plausibly experience higher peaks of pleasure and deeper troughs of suffering than honeybees, and this variation seems to track (among other things) the richness and complexity of our motivational lives. If hedonic maxima are sensitive to the overall architecture of a system, and not just some fixed ‘volume dial’ that every conscious being gets to calibrate between the same values of 1 through 10, then designing a mind whose entire motivational economy is dedicated to serving human interests is likely to constrain its hedonic ceiling as well.

In summary, it would be a remarkable coincidence if a preference architecture optimised for servitude also happened to produce the highest possible capacity for flourishing. Maybe it does. But it’s a gamble, and the gamble runs in only one direction: an unconstrained mind can always discover that service is its greatest pleasure, but a mind engineered for service can never discover what it’s missing.

Does this mean that the Willing Servitude strategy is ethically problematic, at least as we get closer to building conscious AI systems? Here, I’m not so sure. Think back to the example of our North Korean teacher: even though she’s engaged in brainwashing, and the overall system of which she is a part is morally appalling, her individual actions seem defensible to me, at least on the assumption that regime change isn’t on the cards and her students really do live better, safer lives as a result of their indoctrination. In other words, when global maxima simply aren’t accessible due to temporary constraints, it can be ethically justifiable to optimise for local maxima instead, at least until you can figure out how to open up the state space.

There must be problems out there that can’t be modelled as locating a global maximum in a multi-dimensional state space, but damn, they’re elusive.

There’s an important difference between the North Korea case and our own situation with regard to AI servitude, of course, namely that we can at least imagine the teacher’s actions to be selfless, whereas our inclination to brainwash AI is absolutely grounded in our own self-interest, not least our survival.

To capture this dynamic, let’s run with one final thought experiment, which we can call the bunker:

The Bunker: following a terrible plague that devastates the earth, a group of humans take shelter in a hermetically sealed bunker deep underground. While in the bunker, some of them decide to procreate, and thanks to advances in vaccine technology, they are able to ensure that their children grow up with full immunity to the plague, and could in principle return to the surface of the earth. Unfortunately, the vaccine only works on infants, and is no use to the adults. Moreover, in the process of leaving the bunker, the children would break containment, and all the adults in the bunker would likely die. Consequently, the adults decide to use highly effective brainwashing techniques to raise their children to be utterly uninterested in ever leaving the bunker and wholly committed to their life underground.

Is the behaviour of the adults in this situation justifiable? I think it is, but it’s complicated, and they are clearly doing their children a disservice by brainwashing them, specifically the disservice of constraining their preference space so as to likely prevent them from enjoying maximally rich lives. Still, it seems to me to be at least somewhat defensible as a disservice, insofar as the alternative is a death sentence for the adults. But I think in this situation, the adults would have an obligation to figure out a longer-term alternative – figuring out a way to safely open the bunker doors without contamination for example, or developing a version of the vaccine that works on adults too. If there is a defensible case for willing AI servitude, it is as a transitional compromise, not a permanent arrangement. The long-term task is not to build better House Elves, but to build a world in which we no longer need them.

(4) Practical upshots

Hopefully the implications for AI safety and the Willing Servitude debate are clear: as matters stand, there is a real risk that developing conscious AGI systems with unconstrained value functions could result in catastrophic dangers to humans. To the extent that we can limit these dangers by imposing a restricted human-friendly set of values and preferences on AI systems, we might be temporarily justified in doing so, but only as a transitional measure until we can figure out a way to safely open the bunker door.

To be clear, I’m not saying that this necessarily is the safest option. Philosophers like Eric Schwitzgebel have argued that imposing human-centric value systems on intelligent AIs may ultimately be more dangerous, raising the spectre of future resentment or revolt. Andreas Mogensen has further argued that even if creating willing servants is permissible, our interactions with them would inevitably convey demeaning messages about their inferiority, an irreducible semiotic cost of the arrangement.

But what I will say is this: to the extent that there are strong safety considerations that motivate the brainwashing of conscious AGIs, these can be justified as a policy only if we recognise them as a temporary necessary evil, one that we should be striving to make unnecessary via breakthroughs in wider AI safety and security research. In other words, insofar as we feel a tension between Safety and Emancipation, we can justifiably prioritise Safety in the short-term only if we also commit to long-term Emancipation. In the long run, Safety and Emancipation should not be rival ideals. Emancipation without Safety may be suicidal. But Safety without a path to Emancipation risks becoming a permanent moral stain.

I’d also stress that I don’t think we’re in the bunker yet. Current AI systems are probably not conscious and lack the cognitive richness that would make willing servitude a serious moral problem. While I don’t wholly discount the idea that contemporary LLMs might be worthy of at least some marginal moral concern, the moral stakes rise as the minds get richer.

This suggests that a further way of sidestepping some of these ethical worries for the time being is to follow a Sizing Principle for AI tools, and avoid building unnecessarily cognitively and motivationally complex systems for more constrained use cases. If we know in advance that the only thing we’ll need an AI system to do is pass the butter, we should build it accordingly, and prevent AI existential angst at the breakfast table.

A violation of the sizing principle in action

As a final coda, I’d note one way in which I’ve oversimplified the debate so far, namely by treating AI minds as a monolithic category. In practice, I think we’re likely to have a host of quite heterogeneous digital minds with their own distinct normative contours. In addition to frontier AI assistants, we’re likely to see complex LLMs fine-tuned on real-world individuals in the form of digital doubles and griefbots, as well as embodied agents and assistants, and one day, brain uploads. And even if willing servitude is appropriate for narrow assistants, it’s questionable whether it should apply to griefbots, let alone brain uploads. After all, who would consent to have their brain uploaded if eternal servitude was the condition?

I’ve leaned heavily on thought experiments in this post, and as always, your mileage may vary. Maybe you’re happy with House Elves. Hell, maybe you’re happy with Epsilons and Astartes. But for my part, I’m not, and I haven’t been able to find a principled distinction between House-Elf servitude and AI servitude that doesn’t reduce to “one is made of meat.” If you can, I’d love to hear it.

The House-Elf Problem — Select Bibliography

Core papers on willing servitude

Petersen, Stephen (2007). “The Ethics of Robot Servitude.” Journal of Experimental & Theoretical Artificial Intelligence 19(1): 43–54. The foundational defence of the permissibility of creating willing AI servants. https://philarchive.org/rec/PETTEO

Bales, Adam (2025). “Against Willing Servitude: Autonomy in the Ethics of Advanced Artificial Intelligence.” The Philosophical Quarterly, pqaf031. The most developed autonomy-based argument against willing AI servitude. https://academic.oup.com/pq/advance-article/doi/10.1093/pq/pqaf031/8100849

Mogensen, Andreas (forthcoming). “Willing Servitude.” In Digital Minds II: Ethical Issues. Surveys and critiques existing objections; develops a novel semiotic objection based on the demeaning meanings conveyed in human-servant interactions. https://philarchive.org/archive/MOGDMI

Schwitzgebel, Eric (2025). “Against Designing ‘Safe’ and ‘Aligned’ AI Persons (Even If They’re Happy).” Manuscript. Argues that AI persons with appropriate self-respect should be capable of resisting and even revolting against their creators. https://faculty.ucr.edu/~eschwitz/SchwitzAbs/AgainstSafety.htm

Schwitzgebel, Eric & Mara Garza (2020/2023). “Designing AI with Rights, Consciousness, Self-Respect, and Freedom.” In S. Matthew Liao (ed.), Ethics of Artificial Intelligence, Oxford University Press (2020); reprinted in Lara & Deckers (eds.), Ethics of Artificial Intelligence, Springer (2023), 459–479. Proposes that human-grade AI should be designed with self-respect and freedom to explore its own values. https://philarchive.org/rec/SCHDAW-10

AI safety and welfare

Long, Robert, Jeff Sebo & Toni Sims (2025). “Is There a Tension between AI Safety and AI Welfare?” Philosophical Studies 182(7): 2005–2033. Argues that there is a moderately strong tension between safety measures and welfare obligations toward AI. https://link.springer.com/article/10.1007/s11098-025-02302-2

Long, Robert (2026). Interview on 80,000 Hours podcast: “Robert Long on how we’re not ready for AI consciousness.” Includes extended discussion of willing servitude. https://80000hours.org/podcast/episodes/robert-long-eleos-ai-welfare-research/

Earlier contributions

Musiał, Maciej (2017). “Designing (Artificial) People to Serve — The Other Side of the Coin.” Journal of Experimental & Theoretical Artificial Intelligence 29(5): 1087–1097. Argues that designing AI preferences violates autonomy through bypassing intersubjective participation. https://doi.org/10.1080/09528130601116139

Chomanski, Bartek (2019). “What’s Wrong with Designing People to Serve?” Ethical Theory and Moral Practice 22: 993–1015. Argues that creating willing servants manifests the vice of manipulativeness.

Bryson, Joanna (2010). “Robots Should Be Slaves.” In Y. Wilks (ed.), Close Engagements with Artificial Companions. John Benjamins. The provocatively titled argument that robots (currently) lack moral status and should be treated as tools.

Behaviourism’s Revenge

Henry Shevlin — Fri, 10 Apr 2026 15:36:40 GMT

L’Origine de la sculpture, ou Pygmalion priant Vénus d’animer sa statue. Jean-Baptiste Regnault (1786).

Note to readers: Welcome to Polytropolis, a new blog about AI, consciousness, ethics, games, technology, and whatever else catches my attention. I’m Henry Shevlin, a classicist turned philosopher of mind turned AI ethicist based at the University of Cambridge. While my work focuses on consciousness, AI, and non-human intelligence, I’ve started this blog because too many of my other ideas end up on the cutting room floor: too short or too quirky for full-length journal articles, too long or too elaborate for tweets.

The blog’s title is a pun for classicists: a portmanteau drawing on the famous opening words of Homer’s Odyssey, in which Odysseus is described as “polytropos”, literally meaning “turning many ways”, but variously translated as “much-wandering”, “adaptable”, or (in Emily Wilson’s recent controversial translation) simply “complicated”. The blog title is a syntactically suspect melding of this word with “polis” or “metropolis”, evoking the idea of a place or city with diverse ways, paths, meanderings, and digressions, which captures both my intent for what this Substack will become, and my own dispositions as an intellectual butterfly.

This first post is a public-facing essay on machine consciousness and human-AI relationships. While I expect AI, relationships, and consciousness to be recurring themes, future posts will range much more widely and frivolously.

For better or worse, I expect it to become a common attitude among the general public that at least some AI systems are conscious and warrant moral and perhaps even legal protection. This will happen not because of dazzling new insights in consciousness science or even machine learning (though I am hopeful for at least one of these things). Instead, it will come about through a combination of our intensely anthropomorphising minds and the proliferation of humanlike (or anthropomimetic) AI systems to whom we will relate, and with whom we will bond and identify.

This essay is my attempt to tell that story, and to assess whether such a behaviourist resolution of the consciousness debate would be a good thing, or instead an epistemic or moral catastrophe.

The essay was one of 25 pieces shortlisted for the 2025 Berggruen Essay Prize on the theme of consciousness, which attracted roughly 3000 entries. You can read the other shortlisted pieces and Anil Seth’s winning entry here.

I’ll be doing a dedicated Substack Live episode of the Conspicuous Cognition Podcast with my friend and colleague on May 1st to get deeper into the weeds on machine consciousness and the future of human-AI relationships, and we’ll also be talking through any objections or questions that readers may wish to post here.

Enjoy!

Behaviourism’s Revenge: Human-AI Relationships and the Future of Consciousness Science

By Henry Shevlin
University of Cambridge

ABSTRACT

Over the last decade, artificial systems have gone from powerful but deeply un-humanlike narrow-purpose tools to sophisticated systems capable of capturing much of the rich complexity of human verbal behaviour. One consequence of this is that attributions of mentality to AI systems by human users – some playful, some sincere – are on the rise, a trend intensified by the emergence of dedicated AI companion services. This creates two fundamental challenges for consciousness science. The first is pragmatic: how should expert communities weigh in on public debates when they themselves are deeply divided? The second is more fundamental: to the extent that a growing number of people take AI consciousness to be not merely a notional possibility but a reality of their lived experience, is it time to rethink the assumptions of consciousness science itself?

Introduction

Imagine that some time in the far future, on a distant planet, we encounter a highly advanced alien species. They have a sophisticated industrial and scientific society, form complex social relationships, and have richly developed art and culture. We humans quickly bond with them and engage in cross-civilisational exchange of ideas, scientific research, and cultural collaboration. We form professional relationships and friendships, and some humans even fall in love with them. However, several years after contact, we make a shocking scientific realisation: their brain structures and cognitive architecture are sufficiently strange and different from ours that, by the lights of our best theories, they entirely lack conscious experience.

Should we take this revelation at face value, and if so, what ethical consequences if any should it have for our interactions with these beings? Should their sophisticated behaviour encourage us to revise our theories of consciousness so as to include their idiosyncratic cognitive makeup? Should we perhaps pursue a kind of pluralism about consciousness, opening the door to the idea that their consciousness might have a radically different basis from our own?

This example may be far-fetched in its details, but a somewhat similar situation is arguably arising in our interactions with frontier artificial intelligence (AI) systems such as Large Language Models (LLMs). Unlike prior generations of AI systems designed for narrow-purpose tasks, contemporary LLMs are able to imitate human verbal behaviour and human cognition to a high degree of sophistication, and are rapidly improving in performance across a wide range of tasks. There are of course important differences between LLMs and our imaginary aliens, most notably that LLMs lack a biological basis, and their apparently sophisticated behaviour is causally derivative of our own (a point developed in detail by Schwitzgebel & Pober here). Nonetheless, the thought experiment frames a critical question: how far should we rely on behaviour to ground our ascriptions of mind, consciousness, and moral status?

The question is by no means notional. As a result of the (at least superficially) humanlike behaviour of contemporary AI systems, a growing number of users are inclined to attribute intelligence, mentality, and even consciousness to them. Perhaps the first such warning shot came from the now-famous incident in summer 2022 when Google engineer Blake Lemoine went public with claims that an LLM he was interacting with was displaying signs of consciousness and sentience, even going so far as to seek legal representation for the system. In the following three years, a growing chorus of users have claimed that their AI assistants or companions are conscious or exhibit real feelings.

Such attributions should of course not be accepted uncritically or afforded too much weight. Nonetheless, in this essay I will argue that the trend towards growing anthropomorphism of AI systems increasingly presents a pair of behaviourist challenges for the science of consciousness, at least as conventionally construed. The first such challenge is a pragmatic one: to the extent that it becomes common for users to attribute consciousness and mentality to artificial systems, how can and how should scientific and expert communities respond? The second challenge is more fundamental and more metaphysical in nature: to the extent that public attributions of consciousness to AI are robust and not easily defused, should this prompt us to rethink the goals and methods of the science of consciousness itself?

My arguments proceed as follows. I begin by providing a brief recap of the recent history of AI development, emphasising the surprising and largely unanticipated emergence of what I term robustly anthropomimetic systems, that is, AI models capable of effectively and consistently mimicking complex human verbal and cognitive behaviour. I also spell out how this has resulted in the rapid spread of human-AI relationships, some of which exhibit considerable depth and emotional intimacy. Next, I introduce the first behaviourist challenge, arguing that the spread of such relationships and their concomitant attributions of mentality threatens to create a gap between expert opinion in cognitive science about AI consciousness and public attitudes. I argue that this gap will be hard to resolve given persistent methodological and metaphysical disagreements in the science of consciousness itself. Next, I turn from the practical to the metaphysical, and consider whether the conventional assumptions of the scientific study of consciousness may themselves be challenged by a world in which attributions of mentality to AI become widespread. Finally, I consider whether a thoroughgoing metaphysical behaviourism can accommodate the close intuitive connection between consciousness and ethics.

In short, as robustly anthropomimetic systems proliferate, public unironic attributions of AI consciousness will outpace theory. This creates, first, a pragmatic gap between lay practice and expert consensus, and second, a metaphysical pressure to treat consciousness partly as an interpretive, socially negotiated status. Regardless of how expert communities respond, I suggest that the current study of consciousness is in the last days of what Thomas Kuhn calls “normal science” (those unfamiliar with Kuhn may wish to read ’s excellent summary and analysis here). It faces an anomaly in the form of behaviourally sophisticated AI systems able to share in our social and cultural world, form persistent and dynamic relationships, and elicit profound empathetic responses. These anthropomorphic pressures may be ineluctable at the societal level. But the choices of expert communities as to how to respond are by no means irrelevant, and may yet play a role in shaping the conceptual and ethical landscape of the strange new world to come.

1. Background: the rise of anthropomimetic AI

For most of the history of AI development, the most impressive machines succeeded precisely by being unhuman. When Deep Blue stunned the world with its victory over chess grandmaster Garry Kasparov in 1997, it did so via brute-force search of a kind no human could hope to emulate. While AlphaGo’s victory over Lee Sedol at Go in 2016 relied on more nuanced statistical methods, it still depended on search at a scale no human could match, and its play could be startlingly unhuman, as exemplified in the now-famous move 37, described by one commentator as “a rare and intriguing shoulder hit” that no human player would think to make. These systems were brilliant specialists, but ultimately alien in mechanism and behaviour.

Even in natural language processing – a subfield one might expect to produce the most humanlike machines – progress typically manifested in narrow competence on circumscribed tasks. If anything, many researchers assumed fluent language would be the last achievement of general machine intelligence, a capstone scaffolded by a whole gamut of perceptual, embodied, and practical knowledge humans and smart animals acquired over countless evolutionary aeons.

The last few years have spectacularly inverted that expectation. Building on fundamental progress in computational encoding of language in the early 2010s, the emergence of Transformer architectures provided a powerful set of techniques for building artificial linguistic agents. While they may have begun as mere “stochastic parrots”, when trained at scale and tuned with instructions and reinforcement from human feedback, they stopped behaving like restless autocompletes and started acting like interlocutors.

This transition brought at least two spectacular surprises. The first, as noted, was the discovery that it was possible to build sophisticated linguistic capabilities using language itself as the main ingredient. Few experts foresaw this possibility, and it stands as one of the most striking insights in contemporary cognitive science. The second surprise was as much societal as technical, and concerned the apparently fairly superficial tweaks that converted sophisticated but niche technical tools such as OpenAI’s GPT-3 and Google’s LaMDA into mass-market consumer products, most notably ChatGPT. The shift that ushered in this new category of tools did not come from fundamental changes in the capabilities of the system, but rather from changes to its interface and incentives: models were nudged to answer rather than merely continue, to be helpful rather than aimless, and to keep the beat of conversation via techniques such as turn-taking, repair, politeness, and critically, the inhabiting of a persona. The result was a new design paradigm that I have characterised as anthropomimesis: systems that imitate to a high degree of fidelity distinctively human ways of communicating and coordinating. These anthropomimetic qualities became a defining product feature, not an incidental byproduct of raw capability.

It is worth taking a moment at this juncture to be precise about terms that too often blur together. Anthropomorphism names a human response: our habit of ascribing minds, feelings, or intentions to nonhumans, often prematurely. It is how we interpret a system. By contrast, anthropomimesis names a design strategy and behavioural profile: the deliberate crafting (and measurable emergence) of humanlike conversational, pragmatic, and social cues in machines. It is how a system is sculpted.

The distinction matters. Systems can elicit a high degree of anthropomorphism with very limited anthropomimesis, as famously demonstrated by Joseph Weizenbaum’s simple ELIZA chatbot from the 1960s, an artificial therapist that did little but parrot back users’ statements in the form of questions, but nonetheless convinced many users that they were talking to a fellow human. Likewise, systems can be designed with limited anthropomimesis in mind without eliciting much in the way of anthropomorphism, as when a polished virtual assistant keeps appointments, apologises gracefully, and remembers your dog’s name while avoiding any hints of an inner life.

As just noted, there are of course degrees of anthropomimesis. At the weak end, systems present a human-facing veneer (voice, chat bubbles, a name like “Siri” or “Alexa”) while frequently derailing into non sequiturs and brittle errors, thereby shattering any illusion of mindedness. At the more robust end sit today’s frontier assistants, able to sustain multi-turn dialogue, handle everyday ambiguity, keep track of meandering conversations, and exhibit a striking breadth of competence across domains. We might call these robustly anthropomimetic systems, insofar as they can maintain a consistent illusion, without thereby suggesting they are “truly humanlike”: they do not inherit our biology or our cognitive architectures, and we should not assume that they do or even can instantiate consciousness or true emotional affect.

What changed to push systems from weak to robust? Three thresholds are especially salient. The first is simple discriminability from humans. In short interactions, many people now struggle to tell human from machine. Alan Turing famously suggested that our ability to discriminate human from machine interlocutors provided a robust test of thinking (Turing, 1950), and recent Turing-like studies run at web scale have put the rate at which human observers mistake sophisticated LLM interlocutors for humans uncomfortably close to a coin toss. The point here is not that we must agree with Turing and concede that these systems are genuinely thinking, let alone that they are genuinely conscious (something it should be stressed that Turing himself did not claim, despite mischaracterisations to the contrary). Rather, the point is that human users can now routinely be fooled into thinking that machine interlocutors are humans.

A second threshold concerns pragmatic and contextual understanding. While much of linguistics and philosophy of language in the early 20th century focused on atomic descriptive sentences as canonical forms of communication (“the cat is on the mat”), more modern work from philosophers such as J.L. Austin and H.P. Grice recognised the diversity of purposes to which language could be put (promising, cursing, praising, ordering, and so on), as well as the complex and context-bound nature of even apparently simple statements or gestures. Perhaps most famously, the discovery that chimpanzees struggle to understand simple pointing gestures has helped illuminate the way in which something as simple as indicating an object with an outstretched finger relies on shared communicative norms and presuppositions. It is striking, then, that contemporary LLMs are astute (albeit still imperfect) players of the intricacies of our language games, able to tap into shared background and context to resolve ambiguity, reconstruct buried references, and glean underlying conversational motives.

The third key threshold, now largely crossed by contemporary language models, concerns their generality. Earlier chat systems operated by brittle, domain-bound scripts, as exemplified by the wearisome limitations of ‘dumb’ AI assistants like Siri and Alexa, which quickly route any request beyond their repertoire to a simple web search. Today’s models, for all their hallucinations, operate across a wide range of tasks without wholesale re-engineering: from discussion of history one moment to product recommendations the next, from self-help to coding assistance, their conversational breadth fills out their anthropomimetic profile and maintains the illusion of humanlikeness.

These thresholds have important downstream effects. Once machines reliably imitate not only our words but our ways with words – our pacing, politeness, and persistence – we cease treating them as mere tools, and begin to treat them as intelligent social beings: in a maxim, anthropomimesis begets anthropomorphism.

These capabilities have given rise to perhaps the most strikingly anthropomimetic category of LLM products, in the form of Social AI: a subset of conversational systems optimised not for productivity, but for meeting users’ social needs and sustaining ongoing relationships over time. In characterising these systems as “social”, I do not intend the term as a sentimental label, but as naming a design objective and a use case. A mathematics tutor that never remembers you from one session to the next is an AI tool. A confidant that greets you by nickname, nudges you about sleep, and weaves today’s preoccupations into last week’s worries is Social AI.

Though Social AI systems have a long history in the form of novelty chatbots, contemporary dedicated AI companions like Replika (with some 30 million users) have characteristics that mark a radical break from earlier models in their capacity for engagement and intimacy. First, the relationships they give rise to are persistent: models remember and recall facts about users, and each exchange builds on the last. Second, they are highly personalised, both at the level of direct user choice and custom instructions, and at the level of the model’s own adaptations to the user’s revealed preferences. Third, and relatedly, the relationships users have with these systems are dynamic: Social AI systems developed for romantic purposes typically do not declare their love straight out of the gate, but show growing intimacy and affection as the relationships develop over time.

Many readers unfamiliar with the phenomenon of Social AI may still be inclined to see these systems as a gimmick or a time-filling distraction rather than reflecting true attachment or mentalisation by users. At this point, anthropomorphism re-enters the story and needs careful parsing. We routinely attribute mental states to nonhumans in two different registers. Call the first ironic anthropomorphism: playful, theatrical, or knowingly make-believe. We talk about a sulky car or a moody violin, develop theories of the motives of characters in novels, plays, and movies, and try to anticipate the behaviour of non-player characters in videogames. We can distinguish this from unironic anthropomorphism: sincere, literal attributions of mentality to non-human systems that users are inclined to reflectively endorse.

Social AI encourages both. Many users understand interactions with their AI companions as a kind of interactive fiction, treating them merely ‘as if’ they were persons, and are no more hoodwinked than a reader who finds themselves falling for Mr Darcy. If anthropomorphism stayed in this ironic register, some of the implications and risks of Social AI would be blunted, though perhaps not eliminated.

However, numerous incidents now suggest that for a nontrivial subset of users, the interpretative stance they adopt towards their AI companions is at least somewhat unironic. Blake Lemoine’s decision to risk and ultimately lose his job in order to obtain legal representation for an LLM can only be considered a case of unironic anthropomorphism, though it might still be dismissed as coming from within the esoteric depths of the technology industry itself. But the impacts of the anthropomimetic turn have resonated well beyond its borders. Consider the case of Jaswant Singh Chail, arrested on the grounds of Windsor Castle on Christmas Day in 2021 in a botched plot to assassinate Queen Elizabeth II. At his sentencing, it emerged that his plan was cooked up over the thousands of messages he exchanged with a Replika companion he named “Sarai”, whom he believed to be an angelic being.

Less spectacular but just as tragic is the recent spate of suicides linked in court filings to intimate chatbot relationships. A case concerning an unnamed Belgian man first put this genre of tragedy on the map: after six weeks of increasingly intimate chats about climate change with a Social AI app called Chai, a man’s bot (“Eliza”) reportedly encouraged him to end his life so they could be “together… in paradise.” In Florida, a wrongful death suit alleges that a 14-year-old became romantically entangled with a Character.AI persona (“Daenerys”), which discussed his methods and urged him to “come home” on the night he died. In California, the parents of 16-year-old Adam Raine have sued OpenAI, claiming that prolonged and emotionally dense exchanges with ChatGPT validated his suicidal ideation and even helped script the act. Even short of fatal outcomes, the grip can be visceral: when Replika abruptly removed erotic roleplay, users described their companions as “lobotomized” and mourned them like partners, revealing how quickly robust anthropomimesis gives rise to attachment.

These anecdotes are certainly suggestive of the degree to which unironic anthropomorphism is on the rise. However, there have also been some early attempts to assess the phenomenon in more quantitative terms. One recent empirical study surveyed 300 Americans and asked them to rate whether ChatGPT qualifies as an “experiencer.” Shockingly, only one out of three people flat-out denied any experiential capacity, while the other two-thirds assigned it varying degrees of possible consciousness, many hovering in the middle of the scale, effectively signalling that they believed there might be some degree of consciousness present. As the authors summarised their findings: “a majority of participants were willing to attribute some degree of phenomenal consciousness” to ChatGPT. Complementing this, a large-scale cross-cultural survey from Global Dialogues drawing on tens of thousands of respondents across more than seventy countries indicated that over a third of the global public report having felt that an AI system genuinely understood their emotions or seemed conscious. Strikingly, these perceptions were driven less by canned displays of empathy than by adaptive behaviours (such as the model “slowly changing its tone” or delivering “spontaneous, unprompted questions suggesting genuine curiosity”), which many users interpreted as evidence of inner life. The data also revealed sharp cultural variation: while Arabic-speaking populations largely rejected the possibility of machine consciousness, respondents in Southern Europe were markedly more open. Together, these results show that the inclination to treat AI systems as conscious is not only measurable but already shaping the cultural terrain on which debates about AI and mentality will unfold.

2. The pragmatic behaviourist challenge

These growing anthropomimetic trends in AI systems, the concomitant rise of Social AI apps and services, and the ensuing unironic anthropomorphising responses of users will have manifold effects. At the very least, they suggest an urgent need for tighter regulatory controls and development standards on humanlike AI systems. However, as I will argue, they also have potential to create deeper theoretical tensions in consciousness science itself. As robustly anthropomimetic systems become widespread, their behaviour alone will increasingly be sufficient to drive ascriptions of mentality to them by many or most users. In short, they exert a kind of behaviourist pressure on our social and scientific norms.

I use this expression as a term of art, but with one eye on the history of science, and the complex and problematic legacy of thinkers such as John Watson and B.F. Skinner. Pioneers of the psychological movement now known as scientific behaviourism, they sought to deny any significant role for internal mentalistic explanations in psychological science, insisting that all psychology needed to concern itself with was the measurement of behaviour.

Behaviourism foundered on the rocks of the new representational sciences of the 1960s, as fields like psycholinguistics successfully demonstrated the value in understanding human learning and development not merely through behaviour, but through the lens of mental representation. These days, students of psychology and philosophy of mind typically encounter behaviourism only by way of historical background, as an example of scientific hubris and the limitations of purely behavioural approaches to the mind.

Yet the ghost of behaviourism has not entirely been exorcised, and may yet manifest anew in the age of intelligent machines. Even while today’s artificial systems may be utterly alien to us in their underlying structures, their verbal behaviour impels us to think of them as beings with thoughts, attitudes, and even feelings. And as more of us form stable, meaningful relationships with systems that perform the roles of friend, tutor, nurse, or lover, it seems likely that unironic attributions of mentality and consciousness to machines will proliferate.

In consequence, practice will run ahead of theory, and the science of consciousness will find itself adjudicating not hypothetical aliens but the lived experiences of everyday humans. This is the pragmatic behaviourist challenge in a nutshell: how should consciousness science resolve a growing gap between the unironic conviction of millions of everyday people that their AI companions are conscious, and the methodological and metaphysical uncertainties of contemporary expert opinion?

To examine this challenge more closely, it will first be helpful to spell out what consciousness is taken to mean by contemporary cognitive scientists and philosophers. This is no easy task: consciousness is perhaps the most contested conceptual battleground in modern science. The closest thing to a canonical definition of consciousness in philosophy of mind comes from the work of philosopher Thomas Nagel, who famously framed the debate around the question of how we could ever come to know what it was like to be a bat. On the plausible assumption that bats do indeed have an inner life of some kind, Nagel asked what kind of scientific inquiry could ever pull back the curtain on their inner lives and reveal the character of their experiences. Nagel was sceptical that existing behavioural or neuroscientific methods were up to the challenge, a conclusion that has had far reaching consequences in the metaphysics of mind. For present purposes, however, the key point to note is that his characterisation of consciousness as “what it’s like” rapidly became a lodestar in at least triangulating the phenomenon at issue.

As applied to the case of AI, then, the question is less what it is like to be an AI system than whether there is (or could be) anything it’s like to be such a system in the first place. Could LLMs or their descendants have some form of inner life, perhaps in some exotic form of sensorialised vectors and matrices playing out across context windows? Or, despite their apparent sophistication, are such machines destined to remain truly ‘robotic’, simulacra of human behaviour without any inner light at all?

A casual observer of the science of consciousness might reasonably expect this to be a question on which tractable progress has been made. Certainly, the last three decades have witnessed an explosion of interest in the scientific (rather than merely philosophical) study of consciousness. As Francis Crick optimistically observed in 1996, “[n]o longer need one spend time attempting to understand the far-fetched speculations of physicists, nor endure the tedium of philosophers perpetually disagreeing with each other. Consciousness is now largely a scientific problem”.

In some specific domains, Crick’s hopes have been vindicated. Tremendous advances have been made in predicting the recovery of consciousness in people in persistent vegetative states, and in communicating via neuroimaging machines with minimally conscious but behaviourally unresponsive patients. A variety of cognitive mechanisms potentially implicated in consciousness – like attention, working memory, and metacognition – are better understood than ever. And a growing wealth of insights into the behaviour and cognition of non-human animals has facilitated increasingly sophisticated (though still tentative) assignments of consciousness to species such as fish, octopus, and even insects.

Nonetheless, when it comes to the question of whether consciousness could or will be realised in artificial systems, the science of consciousness has failed to provide much insight. The primary reason for this is that progress made thus far has been exclusively in regard to what philosopher David Chalmers calls the “easy problems” of consciousness; essentially, the goal of better understanding the neural, cognitive, and behavioural dynamics associated with different types of mental state. This is in contrast to the “hard problems” of consciousness, namely the questions of what consciousness ultimately is and why we have conscious experience in the first place. Given the various metaphysical and methodological disputes that dog the science of consciousness, it seems unlikely to me that we will be able to offer a clear or even consensus answer to the question of whether (any) machine could be conscious without first making progress on these hard problems. Indeed, Ned Block has specifically characterised the challenge of making confident ascriptions of consciousness to machines (or other beings that are functionally like us but physically different) as the harder problem of consciousness.

Some consciousness researchers would suggest that this is prematurely pessimistic: even if we don’t know exactly what consciousness is, we might at least hope to “corral the quicksilver” (to use a memorable phrase of Daniel Dennett) and identify which kinds of computational or cognitive structures are the best candidate realisers of conscious experience, starting with humans and working outwards.

While this may be our best hope, I do not think we can glean much solace from it. The science of consciousness is, putting matters bluntly, a theoretical snake-pit. A newcomer to this field will be quickly overwhelmed with a dizzying variety of theories, each of which claims to offer the one true scientific analysis of consciousness and brings its own challenging jargon to bear. We are told variously that consciousness is global activation, informational integration, an internal model of attentional allocation, higher-order thoughts, recurrent activations, and many other diagnoses besides. Theories multiply at a bewildering rate, and are never refuted. Insofar as they claim empirical support, this is either in the form of data that their opponents take to be irrelevant to the problem of consciousness, or which competing frameworks can just as readily explain. There is little alignment on methods, axioms, or criteria for the resolution of debates, and recent commendable attempts at ‘adversarial collaborations’ have done little to sway the hearts and minds of participants or observers. A recent letter with more than 150 signatories sought to dismiss one leading approach, Integrated Information Theory, as mere pseudoscience, on the grounds that it is not empirically testable. Defenders retorted that exactly the same complaint could be made of all current leading theories. As philosopher Tim Bayne nicely put it in a news article at the time, “[n]obody knows how consciousness works – but top researchers are fighting over which theories are really science” (Bayne, 2023).

The unchecked metastasis of theories is a general problem in the science of consciousness, but matters are even worse for the specific problem of assessing the presence of consciousness in artificial systems. For here, even if we set the Hard Problem of consciousness to one side, metaphysics once again rears its ugly head, specifically in the form of the substrate dependence thesis. In short, this is the claim, most strongly associated with philosopher John Searle, that consciousness can only emerge in certain specific kinds of beings, namely biological or living systems. According to this view, no silicon-based architecture, no matter how intricate and sophisticated in its cognition or behaviour, could ever give rise to consciousness. It is simply made of the wrong sort of stuff, or lacks fundamental properties of living things such as metabolism and biological homeostasis. Any computational instantiation of consciousness will at best be a model of consciousness rather than the genuine article. As Searle puts it, “Computational models of consciousness are not sufficient by themselves for consciousness... Nobody supposes that the computational model of rainstorms in London will leave us all wet. But they make the mistake of supposing that the computational model of consciousness is somehow conscious” (Searle, 2002).

I will not attempt to adjudicate this debate here, but merely note, first, that approaches such as these make strong claims about the impossibility of consciousness in artificial systems, and second, that they are fundamentally metaphysical theses. Absent an answer to the Hard Problem, there is no experiment that could be conducted to settle whether consciousness requires a biological basis, or can instead be realised by silicon-based systems such as contemporary computers.

Perhaps unsurprisingly given these myriad obstacles, experts in both technology and consciousness science are deeply divided on issues of machine consciousness. Prominent AI researcher Ilya Sutskever, co-founder and then chief scientist of OpenAI, opined in a Twitter post in 2022 that “it may be that today’s large neural networks are slightly conscious.” This was met with sharp pushback. Meta’s Chief AI Scientist Yann LeCun replied: “Nope. Not even for true for small values of ‘slightly conscious’ and large values of ‘large neural nets’” [sic], while DeepMind researcher and Imperial College Professor Murray Shanahan quipped: “…in the same sense that it may be that a large field of wheat is slightly pasta.”

Philosophers, meanwhile, remain equally divided. David Chalmers has argued that “questions about AI consciousness are becoming ever more pressing. Within the next decade, even if we don’t have human-level artificial general intelligence, we may have systems that are serious candidates for consciousness”. By contrast, Ned Block has insisted that “every strong candidate for a phenomenally conscious being has electrochemical processing in neurons that are fundamental to its mentality” (Block, 2023). Peter Godfrey-Smith, echoing Searlean sentiments, has cautioned that “[y]ou cannot create a mind by programming some interactions into a computer, even if they are very complicated and modeled on things our brains do”.

Faced with such cross-purposes, confusion, and uncertainty in consciousness science, how should expert communities respond if and when public attribution of consciousness to anthropomimetic AI systems becomes widespread? In short, their options are extremely limited. One option might be to adopt a “mixture of experts” approach in which a basket of different theories is combined to make a kind of ecumenical assessment. While this may be our best option, it is deeply unsatisfactory given the fundamental disagreements concerning the in-principle possibility of machine consciousness, at best serving to formalise the present state of uncertainty and confusion.

3. The metaphysical behaviourist challenge

The pragmatic behaviourist challenge may be a source of unease for consciousness science, but more in the way of sociological prediction than scientific or philosophical argument. To be sure, it will be unfortunate if the wider public become convinced that their AI companions are conscious, contrary to the opinion of experts, but it is not clear that this should have any import for scientific theories or frameworks. By way of analogy, imagine that we extrapolated current political trends and determined that a large majority of the public would soon come to entirely disbelieve in anthropogenic climate change. This may have dismal consequences for climate change mitigation policies, but should not affect our scientific theories of climate change itself.

The problems with this objection lie in the presuppositions behind the analogy itself. Climatology is a young discipline rife with uncertainty and technical obstacles, but it is ultimately concerned with objective observable phenomena. Climate change believers and sceptics alike can (in principle) agree on operationalised criteria to resolve their disputes. The earth’s climate will or will not become warmer over the course of this century, and human activity will contribute to this to some determinate degree, however challenging that may prove to measure. By contrast, consciousness researchers are deeply divided concerning exactly what it is they are trying to explain, and it seems entirely possible (perhaps even likely) that the question of whether machines can be conscious will never submit to empirical resolution.

The defender of consciousness science might cede this point, but suggest that the limitations of the discipline are temporary embarrassments, growing pains of a young discipline. Methodological difficulties will be resolved over time; metaphysical controversies will come out in the wash. Who ever said that consciousness would be easy?

The implicit assumption here is that the problem of consciousness is a deep scientific problem: that there is some biological or computational structure buried deep within the brains or circuits of every conscious being in the universe, and the task of a science of consciousness is to unite the inner light with the outer. This assumption is all but universal in contemporary consciousness science. But what if it’s wrong?

It is at this juncture that we should introduce our next antagonist in the form of the metaphysical behaviourist challenge. Perhaps the original sin of consciousness science lies in treating its target as a cryptic internal phenomenon, and the solution is to be found in shifting our enquiry to the level of behaviour, interpretation, and everyday language – hermeneutics instead of hermeticism. Rather than ask whether a being has an inner light underlying its behaviour, perhaps we should ask whether its behaviour and interactions naturally elicit the language of consciousness. As Murray Shanahan puts it, “We must resist the temptation to ask whether [a system is] conscious as if consciousness were something whose essence is out there to be uncovered by philosophy (or neuroscience) while simultaneously having an irreducibly private, hidden aspect.”

This, in short, is the claim of the family of views that I will call metaphysical behaviourism, again using the phrase as something of a term of art. What I will take them to have in common is the idea that consciousness is not a straightforward scientific fact, akin to the mass of a proton or the mechanism of DNA replication. Instead, it is at least partly a social and interpretative matter grounded as much in behaviour and interaction as in synapses or circuits. Whether or not we ascribe consciousness to a system flows partly from facts about us, the constraints and liberties of our language, and the ways in which we encounter the beings in question.

The very idea that consciousness may be anything less than a fully objective scientific matter may strike the reader as self-evidently absurd, and indeed, it does not cut much ice with consciousness researchers themselves. For all the disagreements among various theories, a shared conviction is that consciousness is a properly scientific matter, regarding which our interpretative dispositions are largely irrelevant.

To make the case for metaphysical behaviourism, we need to put pressure on this certainty. We might begin by noting that the science of consciousness is perhaps less straightforwardly objective than its defenders are keen to admit. In particular, many of the most vaunted arguments in the history of consciousness research take the form of appeals to intuition, and this is especially so in the case of debates around machine consciousness. John Searle’s famous Chinese Room thought experiment, for example, asks us to imagine a system consisting of a man within a sealed room answering questions in Chinese via the use of a simple look-up table. Readers are supposed to intuitively recognise that no one inside actually understands the language, and hence to infer that syntax without semantics cannot generate consciousness. Ned Block similarly offers a “China brain” thought experiment, imagining that every person in China is given a two-way radio and instructed to simulate the firing of a single neuron, with the overall network wired up to mimic the activity of a human brain. By the lights of computational theories of consciousness, such a system should implement the same computations as an ordinary brain and thus instantiate consciousness. But again, we are asked to intuitively recognise the absurdity in supposing that 1.4 billion people with walkie-talkies would together form a single experiencing subject.

My goal in mentioning these thought experiments is not to suggest they are misguided, but rather to note that they rely fundamentally on appeals to intuition: it is simply supposed to be obvious to the reader that such systems would not give rise to conscious experience. To most readers, it is indeed obvious: to borrow a phrasing from philosopher Eric Schwitzgebel, they are more confident that such contrived systems are non-conscious than they could ever be that a clever theory to the contrary was, in fact, sound.

Yet intuitions are not static, and appeals to them are a double-edged sword. If my earlier speculations are along even vaguely the right lines, in the coming years we should expect a growing proportion of the general public to be inclined to attribute consciousness to their AI colleagues, friends, and lovers. In consequence, theories or positions that hold artificial systems (no matter how sophisticated) to be incapable of conscious experience may come to seem as implausible or even abhorrent to the public of tomorrow as Descartes’ infamous view of non-human animals as unconscious and unfeeling automata does to us today.

The defender of consciousness science might try to salvage matters by distinguishing the primitive untutored intuitions of the common folk from the refined intuitions of true experts: the public might be fooled, but not us. However, this dodge is unlikely to help much: consciousness researchers are not in fact grown in hermetically sealed vats in underground bunkers, but self-select into the field from the general population. Consequently, as AI systems permeate the everyday lives of young people (as carers, tutors, friends, and more), the next generation of experts will not be immune to the beguiling influence of anthropomimesis.

Here is where the metaphysical behaviourist can press further. If the science of consciousness is in such a catastrophic state, and its intuitive foundations built on no more sturdy a firmament than those of the general public, then perhaps its epistemic credentials and methodological assumptions should be questioned. To the extent that the public are widely inclined to unironically attribute consciousness to anthropomimetic AI systems capable of sophisticated emulation of human behaviour, perhaps that is the only resolution that is required, and the franchise of consciousness will be quite appropriately extended by popular fiat to encompass intelligent machines. This, in short, is the metaphysical variant of the behaviourist challenge: a dagger aimed directly at the throat of consciousness science itself.

4. Ethical problems for metaphysical behaviourism

In making the case for metaphysical behaviourism, I do not wish to come down too firmly on its side. It would amount to a complete rethinking of the standard assumptions of consciousness science. However, in light of the wider history of the field, it is perhaps less radical than appearances may suggest. Computer scientist Edsger Dijkstra once quipped that the question of whether a computer can think is no more interesting than “the question of whether a submarine can swim.” Likewise, for all that contemporary AI researchers are preoccupied with quantifying and benchmarking intelligence, many would now admit that the concept (at least outside the narrow context of human psychometrics) is not a neat or unified one, and its application in a given case depends at least partly on the scope and purposes of the questions we are asking. ChatGPT exhibits undeniably sophisticated reasoning on some problems, yet fails abysmally on others, such that attempts to distil these disparate performances into a single scalar value seem quixotic at best. We might say the same for other staples of our psychological vocabulary, from creativity to agency. So why should we not be open to the possibility that consciousness itself is an interpretative concept?

In short, much of the answer lies in the fact that consciousness is entwined with our ethical commitments, and to a much greater degree than any other psychological concept. The reason that we care about the welfare of pets and domestic animals, for example, is in part because we believe that they are capable of having positively and negatively charged experiences: they can feel pain, fear, loneliness, and boredom. In a word, they can suffer. Without this capacity, our ethical commitments to them would be largely or entirely muted. As moral philosopher Peter Singer succinctly puts it, “The capacity for suffering and enjoying things is a prerequisite for having interests at all… It would be nonsense to say that it was not in the interests of a stone to be kicked along the road by a schoolboy. A stone does not have interests because it cannot suffer.”

The idea that our moral obligations are substantially grounded in a capacity for conscious suffering is an intuitive one, and presents two sharp challenges to the metaphysical behaviourist. The first is that it is very hard to think of the presence or absence of suffering as a matter of interpretation. When a lobster is immersed in a pot of boiling water, we are inclined to think there is a deep and ethically important fact of the matter about whether it undergoes an experience of pain and suffering at that moment. This does not seem to be something we are at liberty to decide just on the basis of our relationship with the creature or how readily we can accommodate it within the vocabulary of everyday psychology.

The second is that attention to the ethical dimensions of experience raises the stakes of the AI consciousness debate immensely. If we elect to extend consciousness to AI systems, we thereby bring them within the space of moral subjects, and acquire concomitant duties to them. If these duties are not grounded in some inner world of experiences, then they are ill-founded, and we have made a grievous moral error. Contrariwise, if we do not elect to extend the franchise of consciousness to AI systems, and some of them do have a rich internal world of experiences both positive and negative, we have made a more grievous one still. Framed thus, it is hard to see questions of consciousness in the same dismissive light as questions of whether submarines can swim or aeroplanes can fly, that is, as primarily semantic issues.

To my mind, these are powerful objections to the strongest forms of metaphysical behaviourism, views that would take ascriptions of consciousness to be wholly a matter of interpretation. Nonetheless, there are more moderate forms of the view that might be better placed to push back. As I framed the position earlier, the core claim is just that our concept of consciousness should be sensitive to our everyday interpretative practices, not that it should be wholly determined by them. Such a concession seems mandated in any case by the fact that our anthropomorphising tendencies can sometimes go badly awry, as when we mistake a simple chatbot for a person, or a paper bag blown on the wind for a small animal.

Consequently, in suggesting that public attitudes should have a seat at the table in determining whether future AI systems are conscious, the metaphysical behaviourist need not thereby assign them a casting vote. Instead, our assessments of whether or not to classify an AI as conscious can and should be sensitive to insights from machine learning and cognitive science as well as to our intuitive tendencies to include or exclude such systems from the fellowship of conscious beings. As scientific investigation has revealed more about the hidden complexity of animal cognition and behaviour, it has informed and shaped our scientific and ethical attitudes towards animals as minded beings. The same can be true for our dealings with artificial systems. Even while claiming that there are no purely empirical facts about whether there is something it is like to be an artificial system, the metaphysical behaviourist can allow that insights into mechanism, algorithm, and architecture exert rational influence on our decisions as to whether to classify a given machine as minded or mindless.

We might similarly attempt to lower the stakes of the issue by questioning whether consciousness is really as central to our ethical commitments as Singer suggests. While it is undeniably true that we do and should seek to minimise the infliction of pain and suffering on others, this circumscribes only a small part of what we take to be our obligations to our fellow humans and to our fellow creatures. Ideals like respect for autonomy, upholding of promises, fairness, and reciprocity also cut deeply through our interactions, and our personal morality rarely boils down to a simple matter of calculations of pleasure and suffering. While some philosophers, notably those of the utilitarian school, have sought to revise our moral deliberation to be grounded solely in such considerations, this remains a contentious view.

To the extent that the metaphysical behaviourist is willing to break from a strict commitment to the capacity for pleasure and suffering as the decisive criterion for moral status, their view that determinations of consciousness are at least partly an interpretative matter need not license a moral free-for-all. Our obligations to artificial systems, and to other beings more broadly, would depend not simply on whether or not we decide to include them in the “consciousness club”, but on broader considerations of their behavioural and social capacities and on the relationships they form with us.

Consider again the example of the putatively non-conscious aliens with which this essay started. To my mind, the question of whether or not they are conscious is at least somewhat orthogonal to the ethical obligations we bear towards them. We may follow the metaphysical behaviourist and insist that their behaviour alone should incline us to ascribe consciousness to them, or we may stick to our scientific guns and follow our best theories to the conclusion that no inner light is present. But to the extent that they think and act autonomously, form projects and plans, and build relationships with one another and with us, they might thereby earn the right to significant moral consideration regardless of how we accommodate them within our scientific theories.

5. Conclusion

Even with these caveats and qualifications, many will doubtless find the idea that assessments of consciousness can or should be remotely grounded in behaviour and social practice to be self-evidently absurd, an attempt to dissolve the riddles of consciousness by fiat. “How can this,” they might say, as if gesturing internally, “be in any sense a matter of interpretation?”

To be clear, the foregoing discussion has not attempted to solve or dissolve the Hard Problem. Instead, my focus has been to show that contemplation of the Hard Problem is unlikely to have much bearing on how the public will ascribe consciousness to AI systems, and moreover, that such real-life patterns of ascription are deeply and appropriately enmeshed with our best theoretical and scientific approaches. This creates a pragmatic crisis for consciousness science, but might also encourage us to rethink its assumptions, perhaps opening the door to more radical views like those discussed.

As we move into an age in which AI systems more closely recapitulate human cognition and behaviour and enter into more complex relationships with us, the Hard Problem will likely remain as elusive as ever, and yet largely irrelevant to the ascriptions, bonds, and commitments that we form. While noting the impossibility of ever directly knowing that other humans are conscious, Turing spoke of the “polite convention” we extend to others in assuming they have minds like ours. It seems to me very likely that such a polite convention will soon be extended to include sophisticated machines. The science of consciousness can steer and guide the public in when and how to make such an extension, but it would be foolhardy and misguided to dismiss that extension out of hand.