精读

Transformers are inherently succinct

摘要

本文从逻辑与自动机理论中的 “简洁性”（Succinctness）维度研究 Transformer 的表达能力。研究证明，固定精度的 Transformer 在描述特定语言时，比线性时序逻辑（LTL）和 RNN（及 SSM）紧凑指数倍，比有限自动机紧凑双指数倍。其核心技术发现是 Transformer 能通过注意力机制实现双指数计数器。这种高度压缩的表达能力也带来了副作用：Transformer 的等价性和空性验证等基本问题被证明是 EXPSPACE 完全的，在计算上属于不可解范畴。

荐读理由

该研究证明了 Transformer 在描述复杂模式时比 RNN 和 SSM 具有指数级的简洁度（Succinctness），为你提供了理解其架构优势的底层理论依据，有助于你在技术选型时，从“表达效率”视角重新评估 Transformer 与 SSM 的长期竞争关系。

原文（转写自 PDF）

Published as a conference paper at ICLR 2026 # TRANSFORMERS ARE INHERENTLY SUCCINCT **Pascal Bergstraßer** RPTU Kaiserslautern-Landau Kaiserslautern, Germany bergstraesser@cs.uni-kl.de **Ryan Cotterell** ETH Zürich Zurich, Switzerland ryan.cotterell@inf.ethz.ch **Anthony W. Lin** RPTU Kaiserslautern-Landau and MPI-SWS Kaiserslautern, Germany lin@cs.uni-kl.de ### ABSTRACT We study *succinctness* as a measure of the expressive power of transformers. Succinctness—how compactly a formalism can describe a language relative to other formalisms—is a classical notion in logic and automata theory. We prove that fixed-precision transformers are remarkably succinct: they can be exponentially more succinct than both linear temporal logic (LTL) and recurrent neural networks, and, by extension, state-space models, and doubly exponentially more succinct than finite automata. In other words, there exist families of languages describable by polynomial-size transformers whose smallest equivalent LTL formula or recurrent neural network is exponentially large, and whose smallest equivalent automaton is doubly exponentially large. We also establish matching upper bounds, showing that any fixed-precision transformer can be converted to an LTL formula with at most an exponential blow-up—improving a prior doubly exponential translation. As a consequence of this succinctness, we show that basic verification problems for transformers, such as emptiness and equivalence, are provably intractable: specifically, EXPSPACE-complete. ## 1 INTRODUCTION Transformers (Vaswani et al., 2017) are the dominant architecture underlying most modern large language models. A substantial body of recent theoretical work has investigated their expressive power (Strobl et al., 2024; Barceló et al., 2024; Yang et al., 2024; Hahn, 2020; Pérez et al., 2021; Chiang and Cholak, 2022; Jerad et al., 2025), their trainability and ability to generalize to unseen strings of longer lengths (Zhou et al., 2024; Huang et al., 2025; Chiang and Cholak, 2022), and the extent to which their behavior can be formally verified (Sälzer et al., 2025). A key finding of this line of work is that transformers with finite precision—the setting most faithful to real-world hardware—recognize various classes of subregular languages depending on the exact assumptions made (Yang et al., 2024; Barceló et al., 2024; Jerad et al., 2025; Li and Cotterell, 2025). Subregular languages constitute strict subclasses of the regular languages. For instance, the subregular class of star-free languages are precisely those definable by regular expressions that replace the Kleene star with intersection and complementation. The language $a^*b^*$ is star-free because it can be written as $\emptyset \cdot b \cdot a \cdot \emptyset$, whereas $(aa)^*$ is not star-free (Straubing, 1994). By contrast, recurrent neural networks (RNNs) can recognize all regular languages under a fixed precision assumption (Minsky, 1967; Siegelmann and Sontag, 1995; Merrill et al., 2020; Svete and Cotterell, 2023), making them strictly more expressive than transformers as language recognizers. However, the strong empirical performance of transformers invites the question as to whether expressive capacity is the most revealing lens through which to compare architectures. In this paper, we propose *succinctness* as an alternative lens for understanding the expressivity of transformers. The succinctness of a language $L$ with respect to a class $\mathcal{C}$ of language recognizers (e.g., transformers, finite automata, and formulas in FO[<]) measures the size of the smallest $C \in \mathcal{C}$ that recognizes $L$. In other words, succinctness tells us how many symbols are needed to describe $L$ with respect to the class $\mathcal{C}$. Succinctness is a classical notion in logic and computer science (Stockmeyer, 1974; Grohe and Schweikardt, 2004), where it sharpens expressive power into a complexity-theoretic refinement: rather than asking only which languages a formalism can recognize, succinctness asks how compactly each such language can be described within it. Greater succinctness comes at a price—more succinct formalisms typically have correspondingly harder decision problems, since their compact descriptions force any decision procedure to unfold a larger amount of underlying structure. A well-known example concerns linear temporal logic (LTL; Pnueli, 1977), which is expressively equivalent to the star-free languages (Libkin, 2004), and, hence, also to the counter-free automata of McNaughton and Papert (1971). Despite this equivalence in expressive power, LTL can be exponentially more succinct than finite automata (Sistla and Clarke, 1985), i.e., certain languages admit polynomial-size LTL formulas but require exponentially larger automata. A direct consequence is that decision problems for LTL, such as checking whether a formula recognizes a trivial language, are provably harder than the corresponding problems for automata (Sistla and Clarke, 1985). This paper offers a formal result, which can be summarized as follows: transformers can describe certain languages extremely succinctly. Specifically, we show that transformers can be *exponentially* more succinct than LTL and RNNs, and hence also state-space models (SSMs; Gu and Dao, 2023; Merrill et al., 2024). Moreover, they are *doubly exponentially* more succinct than finite automata. In concrete terms, there exist families of languages describable by polynomial-size transformers that require exponentially larger LTL formulas or RNNs, and doubly exponentially larger automata. We also establish matching upper bounds: we give a translation from finite-precision transformers to LTL formulas of exponential size, significantly improving the doubly exponential translation of Yang et al. (2024). It follows that for any fixed-precision transformer, there is an equivalent LTL formula of exponential size and an equivalent finite automaton of doubly exponential size.¹ The key technical ingredient behind these results is showing that transformers can count from 0 to $2^{2^N}$—that is, implement doubly exponentially large counters—via a subtle encoding using attention. We then prove that the resulting languages require exponentially larger descriptions as LTL formulas or RNNs, and doubly exponentially larger descriptions as finite automata. A natural consequence of this succinctness is that analyzing transformers should be computationally challenging. And, indeed, we show that checking whether a given transformer recognizes a trivial language, is EXPSPACE-complete. Under standard complexity-theoretic assumptions, this means that no algorithm can solve the problem in less than double exponential time. The specific transformer model we study is the unique-hard attention transformer (UHAT), a simple and widely used abstraction of self-attention (Yang et al., 2024; Jerad et al., 2025; Strobl et al., 2024; Hao et al., 2022; Li and Cotterell, 2025; Hahn, 2020; Barceló et al., 2024; Bergstraßer et al., 2024). In particular, Jerad et al. (2025) show that expressivity bounds on UHATs entail corresponding bounds on softmax transformers with fixed precision. Different results in this paper hold under different precision assumptions: the UHAT upper bounds are stated for arbitrary rational weights, whereas the corresponding RNN results assume fixed (finite) precision. Importantly, this means our conclusions are valid in the setting that most faithfully mirrors real-world implementations—fixed precision arithmetic. ## 2 PRELIMINARIES We adopt the following notational conventions in this paper. We write $\mathbb{N} \stackrel{\text{def}}{=} \{1, 2, 3, ...\}$ for the natural numbers and $\mathbb{N}_0 \stackrel{\text{def}}{=} \{0, 1, 2, 3, ...\}$ for the natural numbers including zero. Given $N \in \mathbb{N}$, we define $[N] \stackrel{\text{def}}{=} \{1, ..., N\}$. Furthermore, we write $\mathbb{Q}$ for the rational numbers. We denote scalars by lowercase italicized Latin letters, vectors by boldface lowercase italicized Latin letters, and matrices by boldface uppercase italicized Latin letters. For a vector $\mathbf{v} = (v_1, ..., v_D)$, we write $\mathbf{v}_{i:j} \stackrel{\text{def}}{=} (v_i, ..., v_j)$ for all $1 \le i \le j \le D$, and $v_i$ for its $i^{\text{th}}$ component. An alphabet is a finite, non-empty set $\Sigma$ of symbols. A word (also called a string) is a finite sequence of symbols $\mathbf{a} = a_1 \dots a_N$. ¹This holds without exception: the LTL bound is the constructive translation of Prop. 13, and the automaton bound is its composition with the standard exponential LTL-to-automaton conversion (Sistla and Clarke, 1985; Vardi and Wolper, 1994); both are unconditional upper bounds on every fixed-precision transformer. We denote symbols using lowercase Latin letters and words as boldfaced lowercase Latin letters. We write $|\mathbf{a}| = |a_1 \dots a_N| = N$ for the length of a word $\mathbf{a}$. We write $\Sigma^*$ for the set of all words—including the empty word $\varepsilon$—and $\Sigma^+ \stackrel{\text{def}}{=} \Sigma^* \setminus \{\varepsilon\}$. A language is a subset $L \subseteq \Sigma^*$. We assume familiarity with basic formal language theory and complexity theory; see Kozen (1997) and Sipser (1997) for standard references. In particular, we work with finite automata and the following complexity classes (Sipser, 1997): P $\subseteq$ NP $\subseteq$ PSPACE $\subseteq$ EXP $\subseteq$ NEXP $\subseteq$ EXPSPACE. P and NP are problems solvable by a Turing machine in polynomial and nondeterministic polynomial time, respectively, and EXP and NEXP are their exponential-time counterparts. PSPACE and EXPSPACE are problems solvable by a Turing machine in polynomial and exponential space, respectively. ### 2.1 LINEAR TEMPORAL LOGIC A formula in linear temporal logic (LTL) over an alphabet $\Sigma$ is defined by the grammar $\varphi ::= \top \mid \bot \mid Q_a (a \in \Sigma) \mid \varphi \wedge \varphi \mid \varphi \vee \varphi \mid \neg\varphi \mid \varphi \mathbf{S} \varphi \mid \varphi \mathbf{U} \varphi$. Satisfaction of an LTL formula $\varphi$ on a word $\mathbf{a} = a_1 \dots a_N \in \Sigma^+$ at position $n \in [N]$, written $\mathbf{a}, n \models \varphi$, is defined inductively (omitting the trivial cases for $\top$ and $\bot$): | | | | | :--- | :--- | :--- | | $\mathbf{a}, n \models Q_a$ | iff | $a_n = a \quad (a \in \Sigma)$ | | $\mathbf{a}, n \models \varphi_1 \wedge \varphi_2$ | iff | $\mathbf{a}, n \models \varphi_1$ and $\mathbf{a}, n \models \varphi_2$ | | $\mathbf{a}, n \models \varphi_1 \vee \varphi_2$ | iff | $\mathbf{a}, n \models \varphi_1$ or $\mathbf{a}, n \models \varphi_2$ | | $\mathbf{a}, n \models \neg\varphi_1$ | iff | $\mathbf{a}, n \not\models \varphi_1$ | | $\mathbf{a}, n \models \varphi_1 \mathbf{S} \varphi_2$ | iff | for some $j$ with $1 \le j < n$: $\mathbf{a}, j \models \varphi_2$ and for all $k$ with $j < k < n$: $\mathbf{a}, k \models \varphi_1$ | | $\mathbf{a}, n \models \varphi_1 \mathbf{U} \varphi_2$ | iff | for some $j$ with $n < j \le N$: $\mathbf{a}, j \models \varphi_2$ and for all $k$ with $n < k < j$: $\mathbf{a}, k \models \varphi_1$ | We also use the standard abbreviations $\mathbf{P}\varphi := \top \mathbf{S} \varphi \quad \mathbf{F}\varphi := \top \mathbf{U} \varphi \quad \mathbf{Y}\varphi := \bot \mathbf{S} \varphi \quad \mathbf{H}\varphi := \varphi \wedge \neg\mathbf{P}\neg\varphi$. An LTL formula $\varphi$ recognizes the language $L(\varphi)$ consisting of all words $\mathbf{a} \in \Sigma^+$ where $\mathbf{a}, N \models \varphi$.² **Example 1.** *The star-free language $(ab)^+$ can be defined in LTL as* $$Q_b \wedge \mathbf{H}(Q_b \to \mathbf{Y}Q_a) \wedge \mathbf{H}((Q_a \wedge \mathbf{Y}\top) \to \mathbf{Y}Q_b). \quad (1)$$ *Eq. (1) asserts that the last letter is b, every b is preceded by a, and every a that has a predecessor is preceded by b.* ### 2.2 UNIQUE-HARD ATTENTION TRANSFORMERS **Symbol Embedding.** Let $\Sigma$ be an alphabet. A **symbol embedding** is a function $\text{emb}: \Sigma \to \mathbb{Q}^D$ for some $D > 0$.³ A symbol embedding naturally extends to a homomorphism $\Sigma^* \to (\mathbb{Q}^D)^*$, where $\text{emb}(a_1 \dots a_N) = \text{emb}(a_1), \dots, \text{emb}(a_N)$ for $a_1, \dots, a_N \in \Sigma$. ²For fragments that only allow $\mathbf{U}$ or $\mathbf{F}$, we use $\mathbf{a}, 1 \models \varphi$ instead. ³We define transformers over arbitrary rational numbers, as this is the most general setting in which our upper bounds hold. All results, however, carry over to fixed-precision arithmetic, i.e., a constant number of bits per value, independent of input length. The lower bounds hold under the even stronger restriction to fixed-precision integers. The precise statement of this carry-over depends on the formalization of fixed-precision arithmetic adopted: standard floating-point representations (Goldberg, 1991) are not associative, so the value of, for example, a dot product in an attention head can depend on the order in which its summands are evaluated. Our claim should therefore be understood with respect to a fixed, deterministic evaluation order; algebraic identities used in the proofs that rely on associativity (such as the order of summation) need to be re-checked under any other choice. **Attention layer.** A unique hard-attention (UHA) layer of width $R > 0$ is specified by: * Three affine transformations: $\mathbf{A}, \mathbf{B} : \mathbb{Q}^R \to \mathbb{Q}^R$ and $\mathbf{C} : (\mathbb{Q}^R \times \mathbb{Q}^R) \to \mathbb{Q}^S$; * A mask predicate $M : \mathbb{N} \times \mathbb{N} \to \{0, 1\}$, defined as one of $M(n, m) \stackrel{\text{def}}{=} 1$ (no masking), $M(n, m) \stackrel{\text{def}}{=} \mathbf{1}[m < n]$ (strict future masking), or $M(n, m) \stackrel{\text{def}}{=} \mathbf{1}[m > n]$ (strict past masking); * A tie-breaking function $\tau$ that selects an element of a finite, non-empty subset of $\mathbb{N}$, defined as either $\min$ (leftmost) or $\max$ (rightmost). Given a sequence of $N$ vectors $\mathbf{v}_1, \dots, \mathbf{v}_N \in \mathbb{Q}^R$ with $N \ge 1$, the layer operates as follows. The **score function** is defined as the dot product $$S(\mathbf{v}_n, \mathbf{v}_m) \stackrel{\text{def}}{=} \langle \mathbf{A}(\mathbf{v}_n), \mathbf{B}(\mathbf{v}_m) \rangle \quad (2)$$ for all $n, m \in [N]$. For each position $n \in [N]$, let $$U_n \stackrel{\text{def}}{=} \{m \in [N] \mid M(n, m) = 1\} \quad (3a)$$ $$B_n \stackrel{\text{def}}{=} \{m \in U_n \mid \forall m' \in U_n : S(\mathbf{v}_n, \mathbf{v}_m) \ge S(\mathbf{v}_n, \mathbf{v}_{m'})\} \quad (3b)$$ be the set of unmasked positions and the subset of those that maximize the score, respectively. The **attention vector** at position $n$ is defined as $\mathbf{a}_n \stackrel{\text{def}}{=} \mathbf{v}_{\tau(B_n)}$ if $U_n \ne \emptyset$ and $\mathbf{a}_n \stackrel{\text{def}}{=} \mathbf{0}$ otherwise. The layer outputs the sequence $\mathbf{C}(\mathbf{v}_1, \mathbf{a}_1), \dots, \mathbf{C}(\mathbf{v}_N, \mathbf{a}_N)$. **ReLU layer.** A **ReLU layer** of width $R > 0$ applies, for a designated coordinate $r \in [R]$, the ReLU function to the $r^{\text{th}}$ component of each input vector. Formally, define $\rho_r : \mathbb{Q}^R \to \mathbb{Q}^R$ by $$\rho_r(\mathbf{v}) \stackrel{\text{def}}{=} (v_{1:r-1}, \max(0, v_r), v_{r+1:R}). \quad (4)$$ Given a sequence of $N$ input vectors $\mathbf{v}_1, \dots, \mathbf{v}_N \in \mathbb{Q}^R$ with $N \ge 1$, the layer outputs the sequence $\rho_r(\mathbf{v}_1), \dots, \rho_r(\mathbf{v}_N)$ obtained by applying $\rho_r$ position-wise. Equivalently, one could place a feed-forward network at the end of each encoder layer (Hao et al., 2022; Barceló et al., 2024). **Transformer.** A **unique hard-attention transformer (UHAT)** is a length-preserving function $\mathcal{T} : \Sigma^+ \to (\mathbb{Q}^S)^+$ obtained by composing

Hacker News · 122 赞 · 36 评讨论 → 阅读原文 →

这条对你有帮助吗？