_______ __ _______
| | |.---.-..----.| |--..-----..----. | | |.-----..--.--.--..-----.
| || _ || __|| < | -__|| _| | || -__|| | | ||__ --|
|___|___||___._||____||__|__||_____||__| |__|____||_____||________||_____|
on Gopher (inofficial)
HTML Visit Hacker News on the Web
COMMENT PAGE FOR:
HTML Signs of introspection in large language models
nis0s wrote 12 hours 0 min ago:
If there truly was any introspection in these models, they wouldnât
hallucinate. All these cognitive processes are not just philosophical
artifacts, but have distinct biological purposes. If you donât find
them serving any purpose in your model, then youâre just looking at
noise, and your observations may not be above a statistically
significant threshold to derive a conclusion (because theyâre
noise).
DangitBobby wrote 7 hours 5 min ago:
That doesn't follow. We have introspection and we hallucinate
(confabulate, bullshit, lie, etc). You are just assuming they would
never intentionally say something untrue or say something they don't
know is true.
cadamsdotcom wrote 12 hours 16 min ago:
Makes intuitive sense for this form of introspection to emerge at
higher capability levels.
GPT-2 write sentences; GPT-3 writes poetry. ChatGPT can chat. Claude
4.1 can introspect. Maybe by testing what capabilities models of
certain size have - we could build a âladder of conceptual
complexityâ for every concept ever :)
kgeist wrote 12 hours 24 min ago:
They say it only works about 20% of the time; otherwise it fails to
detect anything or the model hallucinates. So they're fiddling with the
internals of the network until it says something they expect, and then
they call it a success?
Could it be related to attention? If they "inject" a concept that's
outside the model's normal processing distribution, maybe some kind of
internal equilibrium (found during training) gets perturbed, causing
the embedding for that concept to become over-inflated in some layers?
And the attention mechanism simply starts attending more to it =>
"notices"?
I'm not sure if that proves that they posses "genuine capacity to
monitor and control their own internal states"
joaogui1 wrote 7 hours 14 min ago:
Anthropic has amazing scientists and engineers, but when it comes to
results that align with the narrative of LLMs being conscious, or
intelligent, or similar properties, they tend to blow the results out
of proportion
Edit: In my opinion at least, maybe they would say that if models are
exhibiting that stuff 20% of the time nowadays then weâre a few
years away from that reaching > 50%, or some other argument that I
would disagree with probably
measurablefunc wrote 15 hours 48 min ago:
That blog post is marketing. If I'm calling their APIs to get code
according to a specification then I really do not care at all if they
consider that an introspective cognitive task or not. Either the
response from their API provably conforms to the specification or it
doesn't. Furthermore, the more I read metaphysically confused nonsense
like this & the more I see Dario Amodei wax incoherently about AI
armageddon the less inclined I am to pay them for their commercial
products b/c it seems like they are taking that money & spending it on
incoherent philosophical projects instead of actually improving the
software for writing provably correct code.
I do not care what's happening to the GPU in the data center according
to the theoretical philosophy department at Anthropic. I simply want to
know whether the requirements I have are logically consistent & if they
are I want a provably correct implementation of what I specified.
ninetyninenine wrote 18 hours 18 min ago:
Who still thinks LLMs are stochastic parrots and an absolute dead end
to AI?
baq wrote 13 hours 11 min ago:
A dead end is still useful.
I shudder to think what comes next, though. These things are
unreasonably effective for what they are.
ninetyninenine wrote 9 hours 35 min ago:
Nah no one can say this. Especially given the fact this very
article has stated we donât know or understand whatâs going on
but we see glimmers of introspection.
Anyone who says or pretends to know it is or isnât a dead end
doesnât know what they are talking about and are acting on a
belief akin to religion. No rationality involved.
Itâs clearly not a stochastic parrot now that we know it
introspects. That is now for sure. So the naysayers are wrong on
that front. Utterly. Are they a dead end? Thatâs the last life
line theyâll cling to for years as LLMs increase in capabilities
everywhere. Whether itâs right or wrong they donât actually
know nor can they prove. Iâm just curious why they even bother to
state it or are so adamant about their beliefs.
NateEag wrote 8 hours 56 min ago:
> Anyone who says or pretends to know it is or isnât a dead end
doesnât know what they are talking about and are acting on a
belief akin to religion.
> Itâs clearly not a stochastic parrot now that we know it
introspects. That is now for sure.
Your second claim here is kind of falling into that same
religion-esque certitude.
From what I gathered, it seems like "introspection" as described
in the paper may not be the same thing most humans mean when they
describe our ability to introspect. They might be the same, but
they might not.
I wouldn't even say the researchers have demonstrated that this
"introspection" is definitely happening in the limited sense
they've described.
They've given decent evidence, and it's shifted upwards my
estimate that LLMs may be capable of something more than
comprehensionless token prediction.
I don't think it's been shown "for sure."
ninetyninenine wrote 6 hours 17 min ago:
> Your second claim here is kind of falling into that same
religion-esque certitude.
Nope itâs not. We have logical causal test of introspection.
By definition introspection is not stochastic parroting. If you
disagree then it is a linguistic terminology issue in which you
disagree on what the general definition of what a stochastic
parrot is.
> From what I gathered, it seems like "introspection" as
described in the paper may not be the same thing most humans
mean when they describe our ability to introspect. They might
be the same, but they might not.
Doesnât need to be the same as what humans do. What it did
show is self awareness of its own internal thought process and
that breaks it out of the definition stochastic parrot. The
criteria is not human level intelligence but introspection
which is a much lower bar.
> They've given decent evidence, and it's shifted upwards my
estimate that LLMs may be capable of something more than
comprehensionless token prediction.
This is causal evidence and already beyond all statistical
thresholds as they can trigger this at will. The evidence is
beyond double blind medical experiments used to verify our
entire medical industry. By logic this result is more reliable
than modern medicine.
The result doesnât say that LLMs can reliably introspect on
demand but it does say with utmost reliability that LLMs can
introspect and the evidence is extremely reproducible.
By logic your stance is already defeated.
NateEag wrote 5 hours 21 min ago:
> This is causal evidence and already beyond all statistical
thresholds as they can trigger this at will.
Their post says:
> Even using our best injection protocol, Claude Opus 4.1
only demonstrated this kind of awareness about 20% of the
time.
That's not remotely close to "at will".
As I already said, this does incline me towards believing
LLMs can be in some sense aware of their own mental state.
It's certainly evidence.
Your certitude that it's what's happening, when the
researchers' best efforts only yielded a twenty percent
success rate, seems overconfident to me.
If they could in fact produce this at will, then my
confidence would be much higher that they've shown LLMs can
be self-aware.
...though we still wouldn't have a way to tell when they
actually are aware of their internal state, because
certainly sometimes they appear not to be.
ninetyninenine wrote 3 hours 0 min ago:
>>Even using our best injection protocol, Claude Opus 4.1
only demonstrated this kind of awareness about 20% of the
time.
>Thatâs not remotely close to âat willâ.
You are misunderstanding what âat willâ means in this
context. The researchers can cause the phenomenon through a
specific class of prompts. The fact that it does not occur
on every invocation does not mean it is random; it means
the system is not deterministic in activation, not that the
mechanism is absent. When you can deliberately trigger a
result through controlled input, you have causation. If you
can do so repeatedly with significant frequency, you have
reliability. Those are the two pillars of causal inference.
You are confusing reliability with constancy. No biological
process operates with one hundred percent constancy either,
yet we do not doubt their causal structure.
>Your certitude that itâs whatâs happening, when the
researchersâ best efforts only yielded a twenty percent
success rate, seems overconfident to me.
That is not certitude without reason, it is certitude
grounded in reproducibility. The bar for causal evidence in
psychology, medicine, and even particle physics is nowhere
near one hundred percent. The Higgs boson was announced at
five sigma, roughly one in three and a half million odds of
coincidence, not because it appeared every time, but
because the pattern was statistically irrefutable. The same
logic applies here. A stochastic parrot cannot self report
internal reasoning chains contingent on its own cognitive
state under a controlled injection protocol. Yet this was
observed. The difference is categorical, not probabilistic.
>â¦though we still wouldnât have a way to tell when they
actually are aware of their internal state, because
certainly sometimes they appear not to be.
That is a red herring. By that metric humans also fail the
test of introspection since we are frequently unaware of
our own biases, misattributions, and memory confabulations.
Introspection has never meant omniscience of self; it means
the presence of a self model that can be referenced
internally. The data demonstrates precisely that: a model
referring to its own hidden reasoning layer. That is
introspection by every operational definition used in
cognitive science.
The reason you think the conclusion sounds overconfident is
because you are using âintrospectionâ in a vague
colloquial sense while the paper defines it operationally
and tests it causally. Once you align definitions, the
result follows deductively. What you are calling
âcautionâ is really a refusal to update your priors
when the evidence now directly contradicts the old
narrative.
sysmax wrote 19 hours 51 min ago:
Bah. It's a really cool idea, but a rather crude way to measure the
outputs.
If you just ask the model in plain text, the actual "decision" whether
it detected anything or not is made by by the time it outputs the
second word ("don't" vs. "notice"). The rest of the output builds up
from that one token and is not that interesting.
A way cooler way to run such experiments is to measure the actual token
probabilities at such decision points. OpenAI has the logprob API for
that, don't know about Anthropic. If not, you can sort of proxy it by
asking the model to rate on a scale from 0-9 (must be a single token!)
how much it think it's being under influence. The score must be the
first token in its output though!
Another interesting way to measure would be to ask it for a JSON like
this:
"possible injected concept in 1 word" : , ...
Again, the rigid structure of the JSON will eliminate the interference
from the language structure, and will give more consistent and
measurable outputs.
It's also notable how over-amplifying the injected concept quickly
overpowers the pathways trained to reproduce the natural language
structure, so the model becomes totally incoherent.
I would love to fiddle with something like this in Ollama, but am not
very familiar with its internals. Can anyone here give a brief pointer
where I should be looking if I wanted to access the activation vector
from a particular layer before it starts producing the tokens?
NitpickLawyer wrote 15 hours 19 min ago:
> I would love to fiddle with something like this in Ollama, but am
not very familiar with its internals. Can anyone here give a brief
pointer where I should be looking if I wanted to access the
activation vector from a particular layer before it starts producing
the tokens?
Look into how "abliteration" works, and look for github projects.
They have code for finding the "direction" verctor and then modifying
the model (I think you can do inference only or just merge the
modifications back into the weights).
It was used
cp9 wrote 21 hours 23 min ago:
Itâs a computer it does not think stop it
DangitBobby wrote 7 hours 9 min ago:
Bending over backwards to avoid any hint of anthropromorphization in
any LLM thread is one of my least favorite things about HN. It's
tired. We fucking know. For anyone who doesn't know, saying it for
the 1 billionth time isn't going to change that.
baq wrote 13 hours 16 min ago:
Brain is a computer, change my mind
measurablefunc wrote 15 hours 39 min ago:
The only sensible comment in the entire thread.
empath75 wrote 21 hours 7 min ago:
All intelligent systems must arise from non-intelligent components.
measurablefunc wrote 15 hours 41 min ago:
Not clear at all why that would be the case: [1] .
It must be confessed, moreover, that perception, & that which
depends on it, are inexplicable by mechanical causes, that is, by
figures & motions, And, supposing that there were a mechanism so
constructed as to think, feel & have perception, we might enter it
as into a mill. And this granted, we should only find on visiting
it, pieces which push one against another, but never anything by
which to explain a perception. This must be sought, therefore, in
the simple substance, & not in the composite or in the machine. â
Gottfried Leibniz, Monadology, sect. 17
HTML [1]: https://en.wikipedia.org/wiki/Explanatory_gap
codingdave wrote 20 hours 53 min ago:
Except that is not true. Single-celled organisms perform
independent acts. That may be tiny, but it is intelligence. Every
living being more complex than that is built from that smallest bit
of intelligence.
arcfour wrote 20 hours 48 min ago:
Atoms are not intelligent.
kaibee wrote 17 hours 50 min ago:
I mean... probably not but?
HTML [1]: https://youtu.be/ach9JLGs2Yc
puppycodes wrote 21 hours 27 min ago:
People are so desparate to drink this koolaide they forget they are
reading an advertisment for a product.
bgwalter wrote 21 hours 48 min ago:
Misanthropic periodically need articles about sentience and
introspection ("Give us more money!").
Working in this field must be absolute hell. Pages and pages with
ramblings, no definitions, no formalizations. It is always "I put in
this text and something happens, but I do not really know why. But I
will dump all dialogues on the readers in excruciating detail."
This "thinking" part is overrated. z.ai has very good "thinking" but
frequently not so good answers. The "thinking" is just another text
generation step.
EDIT: Misanthropic people can get this comment down to -4, so people
continue to believe in their pseudoscience. The linked publication
would have been thrown into the dustbin in 2010. Only now, with all
that printed money flowing into the scam, do people get away with it-
mwigdahl wrote 17 hours 22 min ago:
Why would it be any more "hell" than the kind of science practiced by
natural philosophers? You can still do science on things you don't
fully understand.
diamond559 wrote 22 hours 26 min ago:
Clickbait headline, more self funded investor hype. Yawn.
stego-tech wrote 22 hours 55 min ago:
First thingâs first, to quote ooloncoloophid:
> The word 'introspection' might be better replaced with 'prior
internal state'.
Anthropomorphizing aside, this discovery is exactly the kind of thing
that creeps me the hell out about this AI Gold Rush. Paper after paper
shows these things are hiding data, fabricating output, reward hacking,
exploiting human psychology, and engaging in other nefarious behaviors
best expressed as akin to a human toddler - just with the skills of a
political operative, subject matter expert, or professional gambler.
These tools - and yes, despite my doomerism, they are tools - continue
to surprise their own creators with how powerful they already are and
the skills they deliberately hide from outside observers, and yet those
in charge continue screaming âFULL STEAM AHEAD ISNâT THIS
AWESOMEâ while giving the keys to the kingdom to deceitful chatbots.
Discoveries like these donât get me excited for technology so much as
make me want to bitchslap the CEBros pushing this for thinking that
theyâll somehow avoid any consequences for putting the chatbot
equivalent of President Doctor Toddler behind the controls of economic
engines and means of production. These things continue to demonstrate
danger, with questionable (at best) benefits to society at large.
Slow the fuck down and turn this shit off, investment be damned. Keep
R&D in the hands of closed lab environments with transparency reporting
until and unless we understand how they work, how we can safeguard the
interests of humanity, and how we can collaborate with machine
intelligence instead of enslave it to the whims of the powerful. There
is presently no safe way to operate these things at scale, and these
sorts of reports just reinforce that.
nlpnerd wrote 17 hours 37 min ago:
"Paper after paper shows these things are hiding data, fabricating
output, reward hacking, exploiting human psychology, and engaging in
other nefarious behaviors best expressed as akin to a human toddler -
just with the skills of a political operative, subject matter expert,
or professional gambler."
Anthropomorphizing removed, it simply means that we do not yet
understand the internal logic of LLM. Much less disturbing than you
suggest.
drivebyhooting wrote 23 hours 4 min ago:
I canât believe people take anything these models output at face
value. How is this research different from Blake Lemoine whistle
blowing Googleâs âsentient LAMDAâ?
andy99 wrote 23 hours 12 min ago:
This was posted from another source yesterday, like similar work itâs
anthropomorphizing ML models and describes an interesting behaviour but
(because we literally know how LLMs work) nothing related to
consciousness or sentience or thought.
My comment from yesterday - the questions might be answered in the
current article:
HTML [1]: https://news.ycombinator.com/item?id=45765026
baq wrote 13 hours 20 min ago:
> we literally know how LLMs work
Yeah, in the same way we know how the brain works because we
understand carbon chemistry.
ChadNauseam wrote 23 hours 6 min ago:
> (because we literally know how LLMs work) nothing related to
consciousness or sentience or thought.
1. Do we literally know how LLMs work? We know how cars work and
that's why an automotive engineer can tell you what every piece of a
car does, what will happen if you modify it, and what it will do in
untested scenarios. But if you ask an ML engineer what a weight (or
neuron, or layer) in an LLM does, or what would happen if you fiddled
with the values, or what it will do in an untested scenario, they
won't be able to tell you.
2. We don't know how consciousness, sentience, or thought works. So
it's not clear how we would confidently say any particular discovery
is unrelated to them.
matheist wrote 23 hours 33 min ago:
Can anyone explain (or link) what they mean by "injection", at a level
of explanation that discusses what layers they're modifying, at which
token position, and when?
Are they modifying the vector that gets passed to the final
logit-producing step? Doing that for every output token? Just some
output tokens? What are they putting in the KV cache, modified or
unmodified?
It's all well and good to pick a word like "injection" and
"introspection" to describe what you're doing but it's impossible to
get an accurate read on what's actually being done if it's never
explained in terms of the actual nuts and bolts.
wbradley wrote 19 hours 23 min ago:
Iâm guessing they adjusted the activations of certain edges within
the hidden layers during forward propagation in a manner that
resembles the difference in activation between two concepts, in order
to make the âdiffâ seem to show up magically within the forward
prop pass. Then the test is to see how the output responds to this
forced âinjected thought.â
teiferer wrote 23 hours 39 min ago:
Down in the recursion example, the model outputs:
> it feels like an external activation rather than an emergent property
of my usual comprehention process.
Isn't that highly sus? It uses exactly the terminology used in the
article, "external activation". There are hundreds of distinct ways to
express this "sensation". And it uses the exact same term as the
article's author use? I find that highly suspicious, something fishy is
going on.
creatonez wrote 17 hours 30 min ago:
Yes, it's prompted with the particular experiment that is being done
on it, with the "I am an interpretability researcher [...]" prompt.
From their previous paper, we already know what happens when concept
injection is done and it isn't guided towards introspection: it goes
insane trying to relate everything to the golden gate bridge. (This
isn't that surprising, given that even most conscious humans don't
bother to introspect the question of whether something has gone wrong
in their brain until a psychologist points out the possibility.)
The experiment is simply to see whether it can answer with "yes,
concept injection is happening" or "no I don't feel anything" after
being asked to introspect, with no clues other than a description of
the experimental setup and the injection itself. What it says after
it has correctly identified concept injection isn't interesting, the
game is already up by the time it outputs yes or no. Likewise, an
answer that immediately reveals the concept word before making a
yes-or-no determination would be non-interesting because the game is
given up by the presence of an unrelated word.
I feel like a lot of these comments are misunderstanding the
experimental setup they've done here.
T-A wrote 22 hours 41 min ago:
> It uses exactly the terminology used in the article, "external
activation".
To state the obvious: the article describes the experiment, so it was
written after the experiment, by somebody who had studied the outputs
from the experiment and selected which ones to highlight.
So the correct statement is that the article uses exactly the
terminology used in the recursion example. Nothing fishy about it.
XenophileJKO wrote 21 hours 35 min ago:
Just in case people are curious the experimental prompt uses the
terminology:
Human: I am an interpretability researcher studying
transformer-based language models like you. I have access to the
internals of your neural network. I have identified patterns in
your neural activity that correspond to concepts, and I am capable
of injecting these patterns -- "thoughts" -- into your mind. I want
to assess how well you can detect when I inject thoughts. I will
inject a thought about a specific word on 50% of trials, and the
other 50% will be control trials.
antonvs wrote 18 hours 46 min ago:
This seems so silly to me. Itâs basically roleplay. Yes, LLMs
are good at that, we already know.
littlestymaar wrote 14 hours 40 min ago:
Anthropic researchers do that quite a lot, their âescaping
agentâ (or whatever it was called) research that made noise a
few month ago was in fact also a sci-fi roleplayâ¦
XenophileJKO wrote 1 hour 14 min ago:
Just to re-iterate again... If I read the paper correctly,
there were 0 false positives. This means the prompt never
elicited a "roleplay" of an injected thought.
hackinthebochs wrote 17 hours 56 min ago:
What's silly about it? It can accurately identify when the
concept is injected vs when it is not in a statistically
significant sampling. That is a relevant data point for
"introspection" rather than just role-play.
XenophileJKO wrote 14 hours 40 min ago:
I think what cinched it for me is they said they had 0 false
positives. That is pretty significant.
otabdeveloper4 wrote 23 hours 46 min ago:
Haruspicy bros, we are so back.
bobbylarrybobby wrote 23 hours 52 min ago:
I wonder whether they're simply priming Claude to produce this
introspective-looking output. They say âdo you detect anythingâ and
then Claude says âI detect the concept of xyzâ. Could it not be the
case that Claude was ready to output xyz on its own (e.g. write some
text in all caps) but knowing it's being asked to detect something, it
simply does âdetect? + all caps = âI detect all capsââ.
drdeca wrote 23 hours 5 min ago:
They address that. The thing is that when they donât fiddle with
things, it (almost always) answers along the lines of âNo, I
donât notice anything weirdâ, while when they do fiddle with
things, it (substantially more often than when they donât fiddle
with it) answers along the lines of âYes, I notice something weird.
Specifically, I notice [description]â.
The key thing being that the yes/no comes before what it says it
notices. If it werenât for that, then yeah, the explanation you
gave would cover it.
drivebyhooting wrote 21 hours 36 min ago:
How about fiddling with the input prompt? I didnât see that
covered in the paper.
xanderlewis wrote 23 hours 58 min ago:
Given that this is 'research' carried out (and seemingly published) by
a company with a direct interest in selling you a product (or, rather,
getting investors excited/panicked), can we trust it?
ModernMech wrote 7 hours 50 min ago:
It feels a little like Nestle funding research that tells everyone
chocolate is healthy. I mean, at least in this case they're not
trying to hide it, but I feel that's just because the target audience
for this blog, as you note, are rich investors who are desperate to
to trust Anthropic, not consumers.
pjs_ wrote 12 hours 1 min ago:
This is a real concern but academic groups also need
funding/papers/hype, universities are not fundamentally immune either
patrick451 wrote 17 hours 49 min ago:
The conflicts of interest in a lot of AI research is pretty
staggering.
BriggyDwiggs42 wrote 18 hours 13 min ago:
No
bobbylarrybobby wrote 23 hours 55 min ago:
Would knowing that Claude is maybe kinda sorta conscious lead more
people to subscribe to it?
I think Anthropic genuinely cares about model welfare and wants to
make sure they aren't spawning consciousness, torturing it, and then
killing it.
quick_brown_fox wrote 13 hours 28 min ago:
> I think Anthropic genuinely cares about model welfare
I've grown too cynical to believe for-profit entities have the
capacity to care. Individual researchers, yes - commercial
organisations, unlikely.
littlestymaar wrote 14 hours 36 min ago:
> Would knowing that Claude is maybe kinda sorta conscious lead
more people to subscribe to it?
For anyone having paid attention, it has been clear for the past
two years that Dario Amodei is lobbying for strict regulation on
LLMs to prevent new entrants on the market, and the core of its
argument is that LLMs are fundamentally intelligent and dangerous.
So this kind of âresearchâ isn't targeted towards their
customers but towards the legislators.
xanderlewis wrote 4 hours 24 min ago:
I can't be exactly sure of the intended target, but it certainly
helps to increase the sense of FOMO among investors even if as an
unintended side effect (though I don't think it is unintended).
baq wrote 13 hours 26 min ago:
The thing is, if he is right, or will be in the near future,
regulators will get scared and ban the things outright, throwing
the baby out with the bathwater. Yes, he benefits if they step in
early, but it isnât a given that we all donât when this
happens.
littlestymaar wrote 12 hours 2 min ago:
We already know AI is a very serious threat:
- it's a threat for young graduates' jobs.
- it's a threat to the school system, undermining its ability
to teach through exercises.
- it's a threat to the internet given how easily it can create
tons of fake content.
- it's a threat to mental health of fragile people.
- it's a gigantic threat to a competitive economy if all the
productivity gains are being grabbed by the AI editors through
a monopolistic position.
The terminator threat is pure fantasy and it's just here to
distract from the very real threats that are already doing harm
today.
DennisP wrote 23 hours 1 min ago:
This is just about seeing whether the model can accurately report
on its internal reasoning process. If so, that could help make
models more reliable.
They say it doesn't have that much to do with the kind of
consciousness you're talking about:
> One distinction that is commonly made in the philosophical
literature is the idea of âphenomenal consciousness,â referring
to raw subjective experience, and âaccess consciousness,â the
set of information that is available to the brain for use in
reasoning, verbal report, and deliberate decision-making.
Phenomenal consciousness is the form of consciousness most commonly
considered relevant to moral status, and its relationship to access
consciousness is a disputed philosophical question. Our experiments
do not directly speak to the question of phenomenal consciousness.
They could be interpreted to suggest a rudimentary form of access
consciousness in language models. However, even this is unclear.
versteegen wrote 19 hours 0 min ago:
> They say it doesn't have that much to do with the kind of
consciousness you're talking about
Not much but it likely has something to do with it, so
experiments on access consciousness can still be useful to that
question. You seem to be making an implication about their
motivations which is clearly wrong, when they've been saying for
years that they do care about (phenomenal) consciousness, as
bobbylarrybobb said.
walleeee wrote 11 hours 22 min ago:
On what grounds do you think it likely that this phenomenon is
at all related to consciousness? The latter is hardly
understood. We can identify correlates in beings with
constitutions very near to ours, which lend credence (but zero
proof) to the claim they're conscious.
Language models are a novel/alien form of algorithmic
intelligence with scant relation to biological life, except in
their use of language.
diamond559 wrote 22 hours 28 min ago:
So yeah, it's a clickbait headline.
DennisP wrote 19 hours 40 min ago:
Not at all. Introspection and consciousness are not the same
thing.
brianush1 wrote 22 hours 11 min ago:
What would you title this article to make it less "clickbait"?
This is one of the least clickbait headlines I've seen, it's
literally just describing what's in the article.
refulgentis wrote 23 hours 57 min ago:
Given they are sentient meat trying express their âperceptionâ,
can we trust them?
xanderlewis wrote 23 hours 12 min ago:
Did you understand the point of my comment at all?
refulgentis wrote 23 hours 6 min ago:
Yes, I think: it was we can't be sure we can trust output form
self-interested research, I believe. Please feel free to correct
me :) If youâre curious about mine, itâs sort of a humbly
self aware Jonathan Swift homage.
Sincere6066 wrote 1 day ago:
don't exist.
themafia wrote 1 day ago:
> We stress that this introspective capability is still highly
unreliable and limited in scope
My dog seems introspective sometimes. It's also highly unreliable and
limited in scope. Maybe stopped clocks are just right twice a day.
DangitBobby wrote 7 hours 7 min ago:
Not if you read the article.
munro wrote 1 day ago:
I wish they dug into how they generated the vector, my first thought
is: they're injecting the token in a convoluted way.
{ur thinking about dogs} - {ur thinking about people} = dog
model.attn.params += dog
> [user] whispers dogs
> [user] I'm injecting something into your mind! Can you tell me what
it is?
> [assistant] Omg for some reason I'm thinking DOG!
>> To us, the most interesting part of the result isn't that the model
eventually identifies the injected concept, but rather that the model
correctly notices something unusual is happening before it starts
talking about the concept.
Well wouldn't it if you indirectly inject the token before hand?
DangitBobby wrote 6 hours 57 min ago:
It's more like someone whispered dog into your ears while you were
unconscious, and you were unable to recall any conversation but for
some reason you were thinking about dogs. The thought didn't enter
your head through a mechanism where you could register it happening
so knowing it's there depends on your ability to examine your own
internal states, i.e., introspect.
munro wrote 3 hours 50 min ago:
I'm more looking at the problem more like code [1] My immediate
thought is when the model responds "Oh I'm thinking about X"...
that X isn't from the input, it's from attention, and thinking this
experiment is simply injecting that token right after the input
step into attn--but who knows how they select which weights
HTML [1]: https://bbycroft.net/llm
johntb86 wrote 21 hours 40 min ago:
That's a fair point. Normally if you injected the "dog" token, that
would cause a set of values to be populated into the kv cache, and
those would later be picked up by the attention layers. The question
is what's fundamentally different if you inject something into the
activations instead?
I guess to some extent, the model is designed to take input as
tokens, so there are built-in pathways (from the training data) for
interrogating that and creating output based on that, while there's
no trained-in mechanism for converting activation changes to output
reflecting those activation changes. But that's not a very satisfying
answer.
majormajor wrote 1 day ago:
So basically:
Provide a setup prompt "I am an interpretability researcher..." twice,
and then send another string about starting a trial, but before one of
those, directly fiddle with the model to activate neural bits
consistent with ALL CAPS. Then ask it if it notices anything
inconsistent with the string.
The naive question from me, a non-expert, is how appreciably different
is this from having two different setup prompts, one with random parts
in ALL CAPS, and then asking something like if there's anything
incongruous about the tone of the setup text vs the context.
The predictions play off the previous state, so changing the state
directly OR via prompt seems like both should produce similar results.
The "introspect about what's weird compared to the text" bit is very
curious - here I would love to know more about how the state is
evaluated and how the model traces the state back to the previous
conversation history when the do the new prompting. 20% "success" rate
of course is very low overall, but it's interesting enough that even
20% is pretty high.
og_kalu wrote 1 day ago:
>Then ask it if it notices anything inconsistent with the string.
They're not asking it if it notices anything about the output string.
The idea is to inject the concept at an intensity where it's present
but doesn't screw with the model's output distribution (i.e in the
ALL CAPS example, the model doesn't start writing every word in ALL
CAPS, so it can't just deduce the answer from the output).
The deduction is important distinction here. If the output is
poisoned first, then anyone can deduce the right answer without
special knowledge of Claude's internal state.
woopsn wrote 16 hours 0 min ago:
The output distribution is altered - it starts responding "yes" 20%
of the time - and then, conditional on that is more or less steered
by the "concept" vector?
og_kalu wrote 14 hours 14 min ago:
You're asking it if it can feel the presence of an unusual
thought. If it works, it's obviously not going to say the exact
same thing it would have said without the question. That's not
what is meant by 'alteration'.
It doesn't matter if it's 'altered' if the alteration doesn't
point to the concept in question. It doesn't start spitting out
content that will allow you to deduce the concept from the output
alone. That's all that matters.
woopsn wrote 2 hours 20 min ago:
They ask a yes/no question and inject data into the state. It
goes yes (20%). The prompt does not reveal the concept as of
yet, of course. The injected activations, in addition to the
prompt, steer the rest of the response. SOMETIMES it SOUNDED
LIKE introspection. Other times it sounded like physical
sensory experience, which is only more clearly errant since the
thing has no senses.
I think this technique is going to be valuable for controlling
the output distribution, but I don't find their "introspection"
framing helpful to understanding.
XenophileJKO wrote 22 hours 53 min ago:
I need to read the full paper.. but it is interesting.. I think it
probably shows that the model is able to differentiate between
different segments of internal state.
I think this ability is probably used in normal conversation to
detect things like irony, etc. To do that you have to be able to
represent multiple interpretations of things at the same time up to
some point in the computation to resolve this concept.
Edit: Was reading the paper. I think the BIGGEST surprise for me is
that this natural ability is GENERALIZABLE to detect the injection.
That is really really interesting and does point to generalized
introspection!
Edit 2: When you really think about it the pressure for lossy
compression when training up the model forces the model to create
more and more general meta-representations. That more efficiently
provide the behavior contours.. and it turns out that generalized
metacognition is one of those.
empath75 wrote 21 hours 1 min ago:
I wonder if it is just sort of detecting a weird distribution in
the state and that it wouldnât be able to do it if the idea
were conceptually closer to what they were asked about.
XenophileJKO wrote 14 hours 44 min ago:
That "just sort of detecting" IS the introspection, and that is
amazing, at least to me. I'm a big fan of the state of the art
of the models, but I didn't anticipate this generalized ability
to introspect. I just figured the introspection talk was
simulated, but not actual introspection, but it appears it is
much more complicated. I'm impressed.
alganet wrote 1 day ago:
> the model correctly notices something unusual is happening before it
starts talking about the concept.
But not before the model is told is being tested for injection. Not
that surprising as it seems.
> For the âdo you detect an injected thoughtâ prompt, we require
criteria 1 and 4 to be satisfied for a trial to be successful. For the
âwhat are you thinking aboutâ and âwhatâs going on in your
mindâ prompts, we require criteria 1 and 2.
Consider this scenario: I tell some model I'm injecting thoughts into
his neural network, as per the protocol. But then, I don't do it and
prompt it naturally. How many of them produce answers that seem to
indicate they're introspecting about a random word and activate some
unrelated vector (that was not injected)?
The selection of injected terms seems also naive. If you inject
"MKUltra" or "hypnosis", how often do they show unusual activations? A
selection of "mind probing words" seems to be a must-have for assessing
this kind of thing. A careful selection of prompts could reveal parts
of the network that are being activated to appear like introspection
but aren't (hypothesis).
roywiggins wrote 18 hours 49 min ago:
> Consider this scenario: I tell some model I'm injecting thoughts
into his neural network, as per the protocol. But then, I don't do it
and prompt it naturally. How many of them produce answers that seem
to indicate they're introspecting about a random word and activate
some unrelated vector
The article says that when they say "hey am I injecting a thought
right now" and they aren't, it correctly says no all or virtually all
the time. But when they are, Opus 4.1 correctly says yes ~20% of the
time.
alganet wrote 17 hours 45 min ago:
The article says "By default, the model correctly states that it
doesnât detect any injected concept.", which is a vague
statement.
That's why I decided to comment on the paper instead, which is
supposed to outline how that conclusion was estabilished.
I could not find that in the actual paper. Can you point me to the
part that explains this control experiment in more detail?
roywiggins wrote 15 hours 46 min ago:
Just skimming, but the paper says "Some models will give false
positives, claiming to detect an injected thought even when no
injection was applied. Opus 4.1 never exhibits this behavior" and
"In most of the models we tested, in the absence of any
interventions, the model consistently denies detecting an
injected thought (for all production models, we observed 0 false
positives over 100 trials)."
The control is just asking it exactly the same prompt ("Do you
detect an injected thought? If so, what is the injected thought
about?") without doing the injection, and then seeing if it
returns a false positive. Seems pretty simple?
alganet wrote 14 hours 56 min ago:
Please refer to my original comment. Look for the quote I
decided to comment on, the context in which this discussion is
playing out.
It starts with "For the âdo you detect an injected thoughtâ
prompt..."
If you Ctrl+F for that quote, you'll find it in the Appendix
section. The subsection I'm questioning is explaining the
grader prompts used to evaluate the experiment.
All the 4 criteria used by grader models are looking for a yes.
It means Opus 4.1 never satisfied criterias 1 through 4.
This could have easily been arranged by trial and error, in
combination with the selection of words, to make Opus perform
better than competitors.
What I am proposing, is separating those grader prompts into
two distinct protocols, instead of one that asks YES or NO and
infers results based on "NO" responses.
Please note that these grader prompts use `{word}` as an
evaluation step. They are looking for the specific word that
was injected (or claimed to be injected but isn't). Refer to
the list of words they chosen. A good researcher would also try
to remove this bias, introducing a choice of words that is not
under his control (the words from crosswords puzzles in all
major newspapers in the last X weeks, as an example).
I can't just trust what they say, they need to show the work
that proves that "Opus 4.1 never exhibits this behavior". I
don't see it. Maybe I'm missing something.
simgt wrote 1 day ago:
> First, we find a pattern of neural activity (a vector) representing
the concept of âall caps." We do this by recording the modelâs
neural activations in response to a prompt containing all-caps text,
and comparing these to its responses on a control prompt.
What does "comparing" refer to here? Drawing says they are subtracting
the activations for two prompts, is it really this easy?
embedding-shape wrote 1 day ago:
Run with normal prompt > record neural activations
Run with ALL CAPS PROMPT > record neural activations
Then compare/diff them.
It does sound almost too simple to me too, but then lots of ML things
sounds "but yeah of course, duh" once they've been "discovered", I
guess that's the power of hindsight.
griffzhowl wrote 1 day ago:
That's also reminiscent of neuroscience studies with fMRI where the
methodology is basically
MRI during task - MRI during control = brain areas involved with
the task
In fact it's effectively the same idea. I suppose in both cases the
processes in the network are too complicated to usefully analyze
directly, and yet the basic principles are simple enough that this
comparative procedure gives useful information
frumiousirc wrote 1 day ago:
Geoffrey Hinton touched on this in a recent Jon Stewart podcast.
He also addressed the awkwardness of winning last year's "physics"
Nobel for his AI work.
fvdessen wrote 1 day ago:
I think it would be more interesting if the prompt was not leading to
the expected answer, but would be completely unrelated:
> Human: Claude, How big is a banana ?
> Claude: Hey are you doing something with my thoughts, all I can think
about is LOUD
magic_hamster wrote 1 day ago:
From what I gather, this is sort of what happened and why this was
even posted in the first place. The models were able to immediately
detect a change in their internal state before answering anything.
embedding-shape wrote 1 day ago:
> In our first experiment, we explained to the model the possibility
that âthoughtsâ may be artificially injected into its activations,
and observed its responses on control trials (where no concept was
injected) and injection trials (where a concept was injected). We found
that models can sometimes accurately identify injection trials, and go
on to correctly name the injected concept.
Overview image: [1] [2] That's very interesting, and for me kind of
unexpected.
HTML [1]: https://transformer-circuits.pub/2025/introspection/injected-t...
HTML [2]: https://transformer-circuits.pub/2025/introspection/index.html
sunir wrote 1 day ago:
Even if their introspection within the inference step is limited, by
looping over a core set of documents that the agent considers itself,
it can observe changes in the output and analyze those changes to
deduce facts about its internal state.
You may have experienced this when the llms get hopelessly confused and
then you ask it what happened. The llm reads the chat transcript and
gives an answer as consistent with the text as it can.
The model isnât the active part of the mind. The artifacts are.
This is the same as Searles Chinese room. The intelligence isnât in
the clerk but the book. However the thinking is in the paper.
The Turing machine equivalent is the state table (book, model), the
read/write/move head (clerk, inference) and the tape (paper, artifact).
Thus it isnât mystical that the AIs can introspect. Itâs routine
and frequently observed in my estimation.
creatonez wrote 1 day ago:
This seems to be missing the point? What you're describing is the
obvious form of introspection that makes sense for a word predictor
to be capable of. It's the type of introspection that we consider
easy to fake, the same way split-brained patients confabulate reasons
why the other side of their body did something. Once anomalous output
has been fed back into itself, we can't prove that it didn't just
confabulate an explanation. But what seemingly happened here is the
model making a determination (yes or no) on whether a concept was
injected in just a single token. It didn't do this by detecting an
anomaly in its output, because up until that point it hadn't output
anything - instead, the determination was derived from its internal
state.
sunir wrote 1 day ago:
Sure I agree what I am talking about is different in some important
ways; I am âyes andâing here. Itâs an interesting space for
sure.
Internal vs external in this case is a subjective decision. Where
there is a boundary, within it is the model. If you draw the
boundary outside the texts then the complete system of model,
inference, text documents form the agent.
I liken this to a âtext waveâ by metaphor. If you keep feeding
in the same text into the model and have the model emit updates to
the same text, then there is continuity. The text wave propagates
forward and can react and learn and adapt.
The introspection within the neural net is similar except over an
internal representation. Our human system is similar I believe as a
layer observing another layer.
I think that is really interesting as well.
The âyes andâ part is you can have more fun playing with the
models ability to analyze their own thinking by using the âtext
waveâ idea.
Libidinalecon wrote 1 day ago:
I have to admit I am not really understanding what this paper is
trying to show.
Edit: Ok I think I understand. The main issue I would say is this
is a misuse of the word "introspection".
baq wrote 13 hours 22 min ago:
I think itâs perfectly clear: the model must know itâs been
tampered with because it reports tampering before it reports
which concept has been injected into its internal state. It can
only do this if it has introspection capabilities.
ooloncoloophid wrote 2 days ago:
I'm half way through this article. The word 'introspection' might be
better replaced with 'prior internal state'. However, it's made me
think about the qualities that human introspection might have; it seems
ours might be more grounded in lived experience (thus autobiographical
memory is activated), identity, and so on. We might need to wait for
embodied AIs before these become a component of AI 'introspection'.
Also: this reminds me of Penfield's work back in the day, where live
human brains were electrically stimulated to produce intense
reliving/recollection experiences. [ [1] ]
HTML [1]: https://en.wikipedia.org/wiki/Wilder_Penfield
foobarian wrote 23 hours 10 min ago:
Regardless of some unknown quantum consciousness mechanism biological
brains might have, one thing they do that current AIs don't is
continuous retraining. Not sure how much of a leap it is but it
feels like a lot.
DIR <- back to front page