Detecting LLM Misbehaviors from the Inside Out with Deep Learning on Structured Data
Structure-aware architectures to detect failure modes in language models.
In this blogpost, we present our approach for effective and lightweight deep learning monitors for LLM misbehaviors. We develop architectures that learn over various LLM internals, incorporating inductive biases tailored to their underlying structure.
This is based on three papers co-authored with Derek Lim, Yoav Gelberg, Yaniv Galron, Ran El-Yaniv, Gal Chechik and Yftah Ziser (with thanks to him for feedback on an earlier draft). The post was mainly written by Fabrizio and Guy; Haggai provided substantial feedback and input throughout.
Introduction
Modern AI systems are impressively powerful, yet known to misbehave: they may memorize text, hallucinate responses, or get caught in flawed logical reasoning. Despite substantial progress in mechanistic interpretability toward reverse-engineering model internals, scaling these insights to reliably explain complex failure modes — such as hallucination — still seems to remain an open challenge1.
A different vision for safer and more reliable AI models is currently emerging: instead of striving to build and leverage a complete understanding of these complex systems, one could more concretely correlate their internal states to (mis)behaviors of interest, directly aiming to automatically detect and correct them. This approach is situated within the broader context of pragmatic interpretability2, a manifesto recently shared by the Google DeepMind interpretability team: rather than pursuing comprehensive mechanistic understanding as an end in itself, interpretability research should prioritize methods that demonstrably help audit, predict, and control model behavior.
In this spirit, probing classifiers3 have, for example, been widely adopted to detect misbehaviors in Large Language Models (LLMs). These consist of logistic regression models trained to predict failures such as hallucinations from specific hidden token representations. Yet, each invocation of an LLM produces a much richer set of inner states, e.g., internal activations across all tokens and layers, next-token probability distributions, attention matrices, and so on.
Why, then, restrict our toolkit to very simple models applied over arbitrary fragments of these model internals when we have more comprehensive data and specialized modeling tools at our disposal?4
In the following, we illustrate a series of methods we recently developed for a more general, yet principled approach to learn on these model internals to detect failure modes in language models.
In particular, we argue that computational traces, i.e., the data objects representing the computation carried out by LLMs, constitute structured data objects in their own right and that, in the spirit of Geometric Deep Learning, effective learning on them requires inductive biases aligned with their underlying structure.
Specifically, in the series of papers discussed next, we developed architectures tailored to three notable traces: activations5, attention matrices6, and next-token probability distributions7 (see Fig. 1 below).

Learning on Activation Tensors
“Beyond Token Probes: Hallucination Detection via Activation Tensors with ACT-ViT”, NeurIPS 2025
We begin with learning from hidden token representations (i.e., activations) with a two-fold aim. First, we seek to overcome a main limitation of probing classifiers, namely that they process only a single activation vector from a carefully chosen token and layer position, e.g., “the last response token in the 24th layer of the language model”. We aim, instead, to design an architecture that learns over full sets of activations across tokens and layers, which we call Activation Tensors. Second, we aim to support joint training on multiple LLMs' Activation Tensors, so as to capture truthfulness patterns shared across different LLMs. This is not generally possible with probing classifiers, which are tied to a fixed hidden dimension and a preselected layer-token combination specific to each LLM.
The objects we are dealing with are 3D tensors of the form:
\(A_M \in \mathbb{R}^{L_M \times N \times D_M},\) where M indicates an LLM, \(L_M\) is the number of M’s layers, N is the number of output tokens, and \(D_M\) is the activation feature dimension.
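To make the contrast with probing concrete, here is a minimal sketch in plain Python. The tensor sizes, the random activations, and the untrained probe weights are all illustrative assumptions, not values from the paper. It shows that a logistic-regression probe reads a single vector at one (layer, token) position and ignores everything else in the Activation Tensor.

```python
import math
import random

random.seed(0)

# Hypothetical sizes: 4 layers, 5 output tokens, 8 features per activation.
NUM_LAYERS, NUM_TOKENS, HIDDEN_DIM = 4, 5, 8

# Activation tensor as nested lists, indexed [layer][token][feature].
# Random values stand in for real LLM activations.
activations = [
    [[random.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)] for _ in range(NUM_TOKENS)]
    for _ in range(NUM_LAYERS)
]


def probe_score(tensor, layer, token, weights, bias):
    """Logistic-regression probe: reads one activation vector, ignores the rest."""
    vector = tensor[layer][token]
    logit = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Untrained, illustrative probe parameters.
probe_weights = [0.1] * HIDDEN_DIM
probe_bias = 0.0

# A probe on "the last token of layer 2" uses one of the available vectors.
score = probe_score(activations, layer=2, token=NUM_TOKENS - 1,
                    weights=probe_weights, bias=probe_bias)
vectors_used = 1
vectors_available = NUM_LAYERS * NUM_TOKENS
```

In this toy setting the probe consumes 1 of 20 activation vectors; a model operating on the full tensor has access to all of them.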
Now, what are the right inductive biases to consider when processing Activation Tensors? Our main observation was that they exhibit a structure similar to that of images: layers and tokens form two sequential axes, while the feature dimensions act akin to channels (see Fig. 2 below).

Accordingly, we design our architecture, ACT-ViT, around a Vision-Transformer (ViT) backbone responsible for extracting general, “spatial” truthfulness patterns across all tokens and layers. To let ACT-ViT train jointly across multiple, different LLMs, we first project the activation feature dimension \(D_M\) to a shared, low-dimensional space of size \(d\) using an LLM-specific linear adapter, and then feed this low-dimensional activation tensor to the Vision-Transformer backbone. This way we are effectively “pushing” ACT-ViT to learn patterns predictive of hallucinations that are shared across LLMs.
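The adapter step can be sketched as follows. This is a minimal illustration in plain Python, not the ACT-ViT implementation: the two LLM names, their shapes, the shared dimension, and the random weights are hypothetical, and the Vision-Transformer backbone itself is omitted. The point is only that, after the LLM-specific projection, tensors from different LLMs share the same channel dimension.

```python
import random

random.seed(0)

SHARED_DIM = 4  # shared low dimension d; the value is illustrative


def random_activation_tensor(num_layers, num_tokens, hidden_dim):
    """Stand-in for real activations, indexed [layer][token][feature]."""
    return [
        [[random.gauss(0.0, 1.0) for _ in range(hidden_dim)] for _ in range(num_tokens)]
        for _ in range(num_layers)
    ]


def make_adapter(hidden_dim, shared_dim):
    """LLM-specific linear map stored as a hidden_dim x shared_dim matrix."""
    return [[random.gauss(0.0, 0.1) for _ in range(shared_dim)] for _ in range(hidden_dim)]


def apply_adapter(tensor, adapter):
    """Project every activation vector from hidden_dim to shared_dim features."""
    shared_dim = len(adapter[0])
    return [
        [
            [sum(x * adapter[i][j] for i, x in enumerate(vector)) for j in range(shared_dim)]
            for vector in layer
        ]
        for layer in tensor
    ]


# Two hypothetical LLMs with different depths and hidden sizes: (L_M, N, D_M).
llm_shapes = {"llm_a": (6, 5, 16), "llm_b": (8, 5, 12)}

projected = {}
for name, (num_layers, num_tokens, hidden_dim) in llm_shapes.items():
    adapter = make_adapter(hidden_dim, SHARED_DIM)  # would be trained per LLM
    tensor = random_activation_tensor(num_layers, num_tokens, hidden_dim)
    projected[name] = apply_adapter(tensor, adapter)
```

Note that the layer and token axes still differ between the two LLMs after projection; how those are reconciled before the backbone is not covered by this sketch.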
A visual representation of the pipeline is depicted in the figure below.
Main Experimental Takeaways
We tested ACT-ViT on a set of hallucination detection datasets across three distinct LLMs8. We found it outperforms probes in the standard setup, i.e., in terms of generalization performance over LLMs whose Activation Tensors were seen during training. This is more than encouraging, but the real test of whether the backbone has learned genuinely general truthfulness patterns is whether it transfers to LLMs it has never seen. To test this, we use the setup shown in Figure 4 below.

Importantly, we keep the Vision-Transformer backbone frozen, and only train a new linear adapter for this new LLM. This is essentially a “probing-like” approach (as we only learn a single linear layer), but with a much more powerful feature extractor (the downstream pretrained Vision-Transformer).
Our results are shown in Figure 5, where we see that ACT-ViT clearly outperforms probing, indicating that the frozen Vision-Transformer backbone was able to learn to extract LLM-agnostic truthfulness patterns.

(Graph) Learning on Attention Matrices
“Neural Message-Passing on Attention Graphs for Hallucination Detection”, ICLR 2026
Until now, we have focused exclusively on activations... but attention scores, too, have been shown to carry signals predictive of hallucinations9. Yet, as with activations, existing methods rely on simple classifiers, applied, in these cases, to heuristic features.
Lookback Lens10, for instance, is a pioneering approach running a logistic regressor on a simple feature: the relative proportion of attention directed toward the response versus the input context. The intuition is that over-attending to the response may signal generation from statistical habit rather than grounding in the input. This is interpretable and, to a degree, effective in detecting contextual hallucinations11, but what if more complex patterns in the attention maps (potentially in combination with activations) were more predictive, and could be learned end-to-end? We address these questions in a separate work where, crucially, we again look for structure in our data to guide architectural design.
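As a rough illustration of this kind of feature, the sketch below computes, for a single attention head at a single decoding step, the share of attention mass that falls on the context rather than on previously generated response tokens. This is a simplified stand-in: the function name, the toy attention rows, and the use of raw attention mass are assumptions, and the exact normalization in the Lookback Lens paper may differ.

```python
def context_attention_share(attention_row, num_context_tokens):
    """Share of one generated token's attention mass that falls on the context."""
    context_mass = sum(attention_row[:num_context_tokens])
    response_mass = sum(attention_row[num_context_tokens:])
    total = context_mass + response_mass
    return context_mass / total if total > 0 else 0.0


# Toy attention rows over 4 context tokens followed by 2 earlier response tokens.
grounded = [0.3, 0.2, 0.2, 0.1, 0.1, 0.1]      # mostly looks back at the context
drifting = [0.05, 0.05, 0.05, 0.05, 0.4, 0.4]  # mostly looks at its own output

grounded_share = context_attention_share(grounded, num_context_tokens=4)
drifting_share = context_attention_share(drifting, num_context_tokens=4)
```

A detector in this style would collect such ratios across heads and layers and feed them to a logistic regressor.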
Here the intuition is straightforward: pairwise attention scores naturally induce graphs, which we can leverage as computational structures for learning on traces. Concretely, given a prompt-response pair, the idea is to assemble a Computational Trace Graph where tokens are nodes connected by an edge whenever their attention score exceeds a predefined threshold. Importantly, nodes and edges in such a graph are attributed with values from attention maps, and can be endowed with additional features such as token activations. See Fig. 6.
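The graph construction described above can be sketched in a few lines. This is a minimal, single-head illustration: the function name, the toy attention matrix, and the threshold value are assumptions, and real computational trace graphs would carry richer node and edge features.

```python
def build_trace_graph(attention, threshold):
    """Return edges (source, target, weight) for attention scores above threshold.

    attention[i][j] is how much token i attends to token j (causal: j <= i).
    """
    edges = []
    for i, row in enumerate(attention):
        for j in range(i + 1):
            if row[j] > threshold:
                edges.append((j, i, row[j]))  # information flows from j to i
    return edges


# Toy causal attention matrix over 4 tokens (each row sums to 1).
attention = [
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.05, 0.9, 0.05, 0.0],
    [0.02, 0.48, 0.4, 0.1],
]

edges = build_trace_graph(attention, threshold=0.2)
dense_edge_count = sum(i + 1 for i in range(len(attention)))
```

Here thresholding keeps 6 of the 10 possible causal edges; each kept edge retains its attention score as an attribute.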

The conceptual leap now: hallucination detection can be cast as a prediction task on these graphs. We design a Graph Neural Network, CHARM12, that processes computational trace graphs and outputs either token-wise or response-level scores. The former indicate whether each response token belongs to a hallucinated passage; the latter indicates the overall presence of errors in the response. A depiction of CHARM is shown in Fig. 7.

CHARM runs a stack of neural message-passing layers13, followed by an optional pooling module and a prediction head. Two properties are worth highlighting. First, CHARM provably subsumes attention-based detectors like Lookback Lens and LLM-Check14, meaning their underpinning features can be recovered as special cases of CHARM’s computations. Second, message-passing scales linearly with the number of edges, and we found that computational trace graphs can be aggressively sparsified with no drops in accuracy. The upshot is that, practically, CHARM’s cost stays well below the quadratic growth of raw attention maps!
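To illustrate why the cost tracks the number of edges, here is a bare-bones message-passing round in plain Python. It is a generic sketch rather than CHARM's actual layer: the update rule (attention-weighted sum plus a residual) has no learned parameters, and the toy graph is made up.

```python
def message_passing_step(node_features, edges):
    """One round: each node adds the weighted sum of its in-neighbours' features."""
    dim = len(node_features[0])
    aggregated = [[0.0] * dim for _ in node_features]
    for source, target, weight in edges:  # single pass over edges: O(|E| * dim)
        for k in range(dim):
            aggregated[target][k] += weight * node_features[source][k]
    # Simple residual update; a learned layer would apply trainable maps here.
    return [
        [h + a for h, a in zip(old, agg)]
        for old, agg in zip(node_features, aggregated)
    ]


# Toy graph: 3 token nodes with 2 features each, edges as (source, target, weight).
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1, 0.5), (1, 2, 0.25), (0, 2, 0.75)]

updated = message_passing_step(node_features, edges)
```

Since the loop visits each edge once, removing edges through sparsification directly reduces the work per layer.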
We tested CHARM on several hallucination detection benchmarks and found it significantly outperforms activation- and attention-based methods across granularities. Figure 8 shows a computational trace graph with a zoom-in on CHARM’s predictions over a test sample from the NQ dataset15. Interestingly, our model outputs localised hallucination predictions signalling, for instance, that “Sheriff Ed Tom Bell” is not the character played by Barry Pepper in “True Grit” (2010)16.

Learning on LLM Output Signatures
So far, we've focused on detecting hallucinations — but what about other misbehaviors? Data contamination (Fig. 9) is another critical problem: an LLM may appear to perform well simply because it has memorized evaluation data, or may have been illegitimately trained on copyrighted material. Our methods aren't specific to hallucination detection — the same computational traces can reveal these issues too.

In a recent work (presented at AAAI 2026), we developed LOS-Net, an architecture tackling both hallucination and data contamination detection in the gray-box setting, where one only has access to (some of) the LLM's output probabilities, a common scenario when interacting with closed-weight, proprietary models17. In this gray-box setting, we assume access only to the LLM Output Signature (LOS) (see Figure 10 below).
LOS data objects consist of:
Token Distribution Sequences (TDS), i.e., a sequence of the next-token probability distributions over the token vocabulary and across generation steps:
\(X \in \mathbb{R}^{N \times V},\) where N is the sequence length and V is the vocabulary size.
Actual Token Probabilities (ATP), i.e., a sequence of the probabilities assigned to the tokens actually generated (or present) in the sequence:
\(p \in \mathbb{R}^{N}.\)
Most previous approaches have focused almost exclusively on the ATPs, computing statistics like the average probability of the selected tokens. Some methods do incorporate the TDS, but only in a very limited way, using it for simple normalization procedures rather than as a source of information about the model's uncertainty and decision-making process in its own right.
We argue that the TDS is important! Why? Because differences in the level of model uncertainty may not be captured by the value of the ATP alone. See, e.g., Figure 11: the ATP holds the same value for both sequences, while their next-token probability distributions differ substantially, indicating very different levels of uncertainty.
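A small numerical example makes the point. The two single-step distributions below are made up for illustration (they are not the ones in the figure): both assign probability 0.5 to the generated token, yet their entropies differ markedly.

```python
import math


def entropy(distribution):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


def actual_token_probs(tds, token_ids):
    """Read the ATP off the TDS: probability of each actually generated token."""
    return [row[token] for row, token in zip(tds, token_ids)]


# Two one-step TDS examples over a toy vocabulary of 6 tokens.
two_way_choice = [[0.5, 0.5, 0.0, 0.0, 0.0, 0.0]]  # torn between two options
diffuse_choice = [[0.5, 0.1, 0.1, 0.1, 0.1, 0.1]]  # mass spread over many tokens
generated = [0]  # token 0 is generated in both cases

atp_two_way = actual_token_probs(two_way_choice, generated)
atp_diffuse = actual_token_probs(diffuse_choice, generated)
entropy_two_way = entropy(two_way_choice[0])
entropy_diffuse = entropy(diffuse_choice[0])
```

A detector that only sees the ATP cannot tell these two cases apart, whereas one that sees the TDS can.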

LOS-Net Architecture
We develop LOS-Net, an architecture for processing the full LOS (TDS & ATP). We sort and truncate (slice) each TDS row so it works with API-limited closed-weight models that expose only top-k next-token probabilities, then combine it with a parametric ATP encoding. This sequential data object is then processed by a lightweight Transformer18. The LOS-Net pipeline is shown in Figure 12.
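The sort-and-truncate step can be sketched as below. The function names and toy values are assumptions, and appending the raw ATP as an extra feature is only a placeholder for the learned, parametric ATP encoding mentioned above, which this sketch does not reproduce.

```python
def sort_and_truncate(tds, k):
    """Sort each next-token distribution in descending order and keep the top k."""
    return [sorted(row, reverse=True)[:k] for row in tds]


def build_los_features(tds, token_ids, k):
    """Per-step features: top-k sorted probabilities plus the actual token's probability.

    Appending the raw ATP is a placeholder for a learned ATP encoding.
    """
    truncated = sort_and_truncate(tds, k)
    atp = [row[token] for row, token in zip(tds, token_ids)]
    return [top_k + [p] for top_k, p in zip(truncated, atp)]


# Toy TDS: 2 generation steps over a vocabulary of 4 tokens.
tds = [
    [0.1, 0.6, 0.05, 0.25],
    [0.4, 0.3, 0.2, 0.1],
]
token_ids = [1, 2]  # tokens actually generated at each step

features = build_los_features(tds, token_ids, k=2)
```

Sorting discards token identities and keeps only the shape of each distribution, so every step yields a fixed-size vector regardless of vocabulary size. The resulting sequence is what a sequence model such as a Transformer would consume.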
To motivate our architectural design choice, we show that LOS-Net can approximate a family of functions called Gated Scoring Functions (GSFs), which provide a unifying view of prior methods. A GSF scores a LOS via R(X,p) by summing token-level scores, while including only tokens whose calculated confidence value exceeds a threshold (see Fig. 13). Under this view, earlier approaches emerge as special cases, and LOS-Net can approximate the full GSF family, effectively subsuming many prior methods.
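The following sketch shows the general shape of such a gated score. It is an illustration under assumptions rather than the paper's formal definition: the per-token score (log-probability of the generated token) and the confidence function (largest probability in the distribution) are hypothetical choices made for this example.

```python
import math


def gated_score(tds, atp, score_fn, confidence_fn, threshold):
    """Sum token-level scores over the tokens whose confidence exceeds the threshold."""
    total = 0.0
    for row, prob in zip(tds, atp):
        if confidence_fn(row, prob) > threshold:  # the gate
            total += score_fn(row, prob)
    return total


def log_prob(row, prob):
    """Hypothetical token-level score: log-probability of the generated token."""
    return math.log(prob)


def top1_prob(row, prob):
    """Hypothetical confidence: largest probability in the distribution."""
    return max(row)


# Toy LOS: 3 steps over a 2-token vocabulary, with the matching ATP.
tds = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
atp = [0.9, 0.5, 0.8]

# Gate always open: reduces to a plain sum of log-probabilities.
all_tokens = gated_score(tds, atp, log_prob, top1_prob, threshold=0.0)
# Gate at 0.6: the undecided middle step is excluded from the sum.
confident_only = gated_score(tds, atp, log_prob, top1_prob, threshold=0.6)
```

Different choices of score, confidence function, and threshold give different members of the family, which is the sense in which simple statistics over the ATP appear as special cases.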
Main Experimental Takeaways
LOS-Net works well on both hallucination detection and data contamination detection. On the latter, specifically, LOS-Net demonstrated very strong zero-shot cross-LLM generalization (see Fig. 14). This means you can train LOS-Net on one LLM’s LOS data and apply it directly to another, without additional training. This is a crucial capability for practical deployment.

Finally, we experimented with a limited-API-access scenario, where only the top-k probabilities from the TDS are available. We found LOS-Net still performs well in this setting (see Fig. 15).

Conclusions
LLM computational traces may contain important clues revealing misbehaviors such as hallucinations; in this blogpost we have argued that (i) these traces are comprehensive and structured objects and that (ii) deep learning architectures primed with the right structural inductive biases can learn to extract effective features predictive of various failure modes.
This led us to treat activation tensors as images (ACT-ViT), attention maps as graphs (CHARM), and output distributions as sequences (LOS-Net); each time we designed lightweight, specialized models that significantly outperformed simpler baselines and, encouragingly, showed signs of generalization across tasks and LLMs.
Much remains to be explored. Can these complementary approaches be combined into a single method operating jointly over multiple trace types? Can our underpinning ideas be translated to detect failure modes across other modalities beyond text19? Can our methods go beyond the pragmatic interpretability manifesto and be used to interpret why and where things go wrong? These are only some of the exciting directions we have in mind — if you do find them interesting too, get in touch!
We would, e.g., refer to “Open Problems in Mechanistic Interpretability” by Sharkey et al., 2025.
See, in particular, the LessWrong post “A Pragmatic Vision for Interpretability” by Neel Nanda et al., December 2025.
See Squib “Probing Classifiers: Promises, Shortcomings, and Advances” by Yonatan Belinkov and references within.
We benchmark hallucination detection on the responses generated by LlaMa-2-8B, Mistral-2-7B, Qwen-2.5-7B on datasets “HotpotQA” (with and without context), “Movies”, “TriviaQA”, “IMDB”, from the suite curated by Orgad et al. (“LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations”, ICLR 2025). All considered models are in their “instruct” version.
Representative works are:
“Lookback Lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps”. Chuang et al., EMNLP 2024;
“LLM-check: Investigating detection of hallucinations in large language models”, Sriramanan et al., NeurIPS 2024;
“Hallucination detection in LLMs using spectral features of attention maps”, Binkowski et al., EMNLP 2025.
Yes, this refers to 9.1 above.
These refer to errors whereby the LLM does not correctly retrieve and elaborate information plainly present in the provided context. Examples include hallucinations occurring when prompting a model to summarise a piece of text or to edit a code snippet.
Read as: “Catching HAllucinated Responses via Message-passing”. Not the most straightforward acronym, we’d agree, but it does the job of being somewhat memorable. At least this is what we like to think 😁.
The term was popularised by “Neural Message Passing for Quantum Chemistry”, Gilmer et al., 2017. Message Passing Neural Networks are amongst the most popular Graph Neural Network architectures. For a high-level intuition on message-passing, see this post.
… and yes, this refers to 9.2 above.
“Natural Questions: A Benchmark for Question Answering Research”, Kwiatkowski et al., 2019.
Sheriff Ed Tom Bell is, instead, a prominent character from “No Country For Old Men” (2007).
For example, at the time of writing, ChatGPT does give access to the token logprobs, but only the top 20 ones. See here (accessed March 2026).
This is around 1M parameters.
Think about, e.g., misalignment between texts and images in vision-language models.









