Learning Regular Languages with the TTT Algorithm
摘要
本文是一篇关于主动自动机学习(Active Automata Learning)的深度教程,完整实现了用于推断正则语言的 TTT 算法。相比经典的 L* 算法,TTT 通过引入歧视树(DT)替代观察表,并结合二分查找分析反例,显著减少了成员查询次数。文章详细介绍了前缀转换和判别器完成化等优化技术,确保算法在数学上无冗余,并提供了完整的 Python 代码来演示从黑盒 Oracle 构建 DFA 的全过程。
荐读理由
在对解析器、网络协议或安全过滤器等黑盒系统进行逆向工程时,你可以直接复用文中完整的 TTT 算法 Python 实现,利用其优于 L* 算法的歧视树与二分搜索机制,以理论最优的查询效率推导出精确的 DFA 状态机模型。
原文
Learning Regular Languages with the TTT Algorithm
TLDR; This tutorial is a complete implementation of the TTT algorithm for active automata learning in Python. TTT combines the discrimination tree of Kearns and Vazirani with binary search counterexample analysis from Rivest and Schapire, and adds prefix transformation and discriminator finalization to eliminate all redundant membership queries. The Python interpreter is embedded so that you can work through the implementation steps.
Why learn an input language?
Suppose you are given a piece of software. For example, a network protocol implementation, a parser, or a security filter. You want to understand what inputs it accepts. You have no access to the source code, and can only run it and observe whether it accepts or rejects a given input. This is the blackbox setting.
A naive answer is to try to test it exhaustively. But the set of all strings accepted by even simple grammars is infinite. A better approach is to infer a finite model, that is a DFA, that captures the input behaviour exactly. Such a model is useful on its own. You can inspect it, verify properties, generate tests from it, or compare it against a specification to find discrepancies. Active automata learning is the discipline of constructing this model efficiently, using as few queries as possible.
In my previous post, I implemented Angluin’s L* algorithm for learning regular languages from a blackbox oracle. L* uses a flat observation table to track state distinctions, which leads to redundant membership queries: when a counterexample arrives, all its suffixes are added as columns even though most distinguish no new states.
TTT is the state-of-the art algorithm for regular language inference. Using this algorithm, you can infer the input language of any blackbox program up to its regular approximation. It is much more faster than L*, and the number of membership queries it generates (that is, the number of inputs it needs to test the blackbox with) is provably non-redundant.
Several independent contributions are incorporated in the TTT algorithm. Rivest and Schapire1 contributed the binary search counterexample analysis, which finds the single relevant suffix in a counterexample in (O(\log k)) queries (rather than (k) queries). The introduction of discrimination tree as a replacement for the observation table is due Kearns and Vazirani2.
TTT by Isberner, Howar and Steffen3 adds two further refinements: prefix transformation, which keeps access sequences minimal, and discriminator finalization, which keeps the discrimination tree shallow. TTT is provably redundancy-free. That is, it never makes a membership query whose answer could have been derived from earlier queries.
Language inference can also be applied to hardware. There are however, other considerations in such settings. For example, it may not be possible or even expensive to restart a system. ADT4 is a notable extension of TTT, which uses adaptive distinguishing sequences, and can reduce resets in hardware settings.
Definitions
Alphabet (A): the set of input symbols the DFA reads.
Membership query: a string passed to the blackbox oracle. The oracle answers yes (accepted) or no (rejected).
Equivalence query: a hypothesis grammar passed to the teacher. The teacher answers yes, or returns a counterexample string where the hypothesis and the target disagree.
PAC oracle: a probabilistic approximation to the equivalence oracle. After (N) random tests without finding a counterexample, we declare the hypothesis probably approximately correct.
Discrimination tree (DT): a binary tree whose inner nodes are discriminator suffixes and whose leaves are states. Sifting a string (w) through the tree classifies it to a state using one membership query per level.
Access sequence (reach(q)): the shortest known string that reaches state (q) in the target. This is called (acc(q)) in TTT literature, but using (reach(q)) to avoid conflation with (accept) in DFA.
Spanning tree: a mapping from each known state to its access sequence. In this implementation we use a dict (called State Table) rather than a tree.
Open transition: a transition from state (q) on symbol (a) that has not yet been sifted to determine its target state.
Counterexample decomposition: the process of finding the split point in a counterexample, extracting a new discriminator, and splitting a leaf in the DT.
Contents
Definitions
Prerequisites
From L* to TTT
The DFA Representation
The Oracle
The Discrimination Tree
The State Table
Sifting
Hypothesis Construction
Incremental Hypothesis Update
Counterexample Decomposition
The Split Point
Prefix Transformation
Splitting a Leaf
Discriminator Finalization
Finding the Split Point
Putting Decomposition Together
Worklist Growth in
close_transitionsNon-Redundancy
A Note on the Equivalence Oracle
The Main Loop
Examples
Evaluating Model Accuracy
Comparison with L*
References
Artifacts
Important: Pyodide takes time to initialize. Initialization completion is indicated by a red border around Run all button.
Prerequisites
We use the Teacher and Oracle from the L* post unchanged. The rest of the algorithm is independent of L*.
Available Packages
These are packages that refer either to my previous posts or to pure python packages that I have compiled, and is available in the below locations. As before, install them if you need to run the program directly on the machine. To install, simply download the wheel file (pkg.whl) and install using pip install pkg.whl.
simplefuzzer-0.0.1-py2.py3-none-any.whl from "The simplest grammar fuzzer in the world".
rxfuzzer-0.0.1-py2.py3-none-any.whl from "Fuzzing With Regular Expressions".
earleyparser-0.0.1-py2.py3-none-any.whl from "Earley Parser".
cfgrandomsample-0.0.1-py2.py3-none-any.whl from "Uniform Random Sampling of Strings from Context-Free Grammar".
cfgremoveepsilon-0.0.1-py2.py3-none-any.whl from "Remove Empty (Epsilon) Rules From a Context-Free Grammar.".
gatleastsinglefault-0.0.1-py2.py3-none-any.whl from "Specializing Context-Free Grammars for Inducing Faults".
hdd-0.0.1-py2.py3-none-any.whl from "Hierarchical Delta Debugging".
Since this notebook serves both as a web notebook as well as a script that can be run on the command line, we redefine canvas if it is not defined already. The __canvas__ function is defined externally when it is used as a web notebook.
From L* to TTT
In L*, when the equivalence oracle returns a counterexample (ce) of length (k), the algorithm adds all (k) suffixes of (ce) as new columns (i.e. discriminators for states). For each new column, it must re-query every existing row. However, many of these new columns are redundant because they do not distinguish any new pair of states.
The key insight that distinguishes TTT from L* is that a counterexample identifies only one pair of states that was wrongly merged. Hence, exactly one new discriminator is sufficient, not (k). TTT incorporates the following independent contributions:
**Kearns and Vazirani (1994)**2 replaced the observation table with a discrimination tree. A binary tree of discriminator suffixes where each leaf is a state. Sifting a string down the tree classifies it in (O(depth)) queries rather than (O(|suffixes|)).
**Rivest and Schapire (1993)**1 showed that binary search over the counterexample finds the single relevant split point in (O(\log k)) queries, rather than adding all (k) suffixes.
**Isberner, Howar and Steffen (2014)**3 combined these with prefix transformation (keeping access sequences minimal) and discriminator finalization (keeping the DT shallow), producing TTT.
The combination of a discrimination tree with a spanning tree of access sequences is known as an observation pack 5. This post does not use that abstraction directly; we manage the two structures separately.
The DFA Representation
The DFA class is similar to the one from the RPNI post.
We also add run(), which returns the state reached after consuming a string, and accepts(), which checks whether that state is an accepting state.
The DFA class can now model a deterministic finite state machine. We also add a helper to render any DFA as a Graphviz dot diagram.
Let us test it thoroughly.
The Oracle
Let us define a simple mock oracle for testing the components in isolation. The Teacher imported from L* is the full oracle, and will be used in the main loop.
The Discrimination Tree
The discrimination tree (DT) replaces L*’s flat observation table. Think of it as a game of 20 questions: each inner node asks: if I append suffix (d) to this string, does the target accept it? and routes left (no) or right (yes). Each leaf is a known state.
The discriminator suffixes at different nodes are independent strings with no relationship to each other; they play the same role as the suffix columns in L*, but each counterexample adds exactly one, rather than all (k) suffixes of the counterexample at once.
The tree structure is not a trie; a parent’s suffix is not a prefix of its children’s suffixes, and there is no requirement that sibling suffixes share anything in common. The tree is purely a decision structure: each node’s suffix is the question that splits the states reachable at that point, chosen only because it distinguishes some pair of states that would otherwise be merged. Two nodes at different depths may share a suffix, or their suffixes may be completely unrelated; what matters is only the binary answer at each step.
Both children of an inner node can themselves be inner nodes, and the tree can be arbitrarily deep. A leaf represents a known state; it has no discriminator. Early in learning the tree is shallow; each counterexample adds exactly one new inner node, splitting one existing leaf into two children.
A node starts life as a leaf holding a state name. When the equivalence oracle returns a counterexample, it means the blackbox and the hypothesis DFA disagree on some string: two strings that reach different blackbox states are being routed to the same hypothesis state. decompose identifies which leaf is responsible and splits it.
Splitting a leaf means it becomes an inner node: it forgets its state name, gains a discriminator suffix, and sprouts two child leaves, one for each of the two now-distinct states. Which child goes right and which goes left is determined by querying the oracle on reach(old_state) + discriminator. A right child means that query returned True; a left child means it returned False. Note that right does not mean “accepted”: later, when sifting an arbitrary string (w), the same node asks whether (w + discriminator) is accepted, and routes right on True, left on False. The direction encodes agreement with the oracle, not acceptance of (w) itself.
We mutate in place because other nodes in the tree already hold references to this object; replacing it with a new object would leave those references stale. The split method (called split_leaf in TTT literature) encapsulates this mutation.
The State Table
Each state in the hypothesis has an access sequence: the shortest known string that reaches it from <start>. The TTT paper calls this structure a spanning tree because, conceptually, the states form a tree: <start> is the root, each state is a child of the state whose access sequence is one character shorter, and walking from root to any node spells out that node’s access sequence. In the original Kearns-Vazirani algorithm this tree was traversed explicitly: to find a state’s access sequence you’d walk up to the root collecting edge labels. TTT’s prefix transformation makes traversal unnecessary by storing the full access string directly against each state. The tree structure is therefore implicit, and the implementation reduces to a plain dict. We call the class StateTable to reflect what it actually is.
We test the node types before proceeding.
Sifting
To classify any string (w) to a state, we walk the DT from root to leaf. At each inner node labeled (d), we query (member(w \cdot d)) and go right (yes) or left (no). The leaf we land on is the state (w) belongs to. Each step is one membership query, so sifting costs at most (depth(DT)) queries, far fewer than L*’s (O(|suffixes|)).
At any inner node, either child may itself be another inner node. This is not like a trie where deeper nodes share a common prefix with their parent; the suffix at a deeper node is simply a different, independent question that separates states the shallower question could not. The tree gets deeper only when a pair of states survive all questions asked so far and a new discriminator must be introduced to tell them apart. Keeping the tree shallow therefore reduces the query cost of every future sift, which is why TTT’s discriminator finalization step aggressively replaces long, incidental discriminators with shorter, permanent ones.
We also add a helper to render a DT as a Graphviz dot diagram. Inner nodes show their discriminator suffix; leaves show the state name. Left edges (membership query returned False) are labelled “no”; right edges (True) are labelled “yes”. An empty discriminator is shown as (\varepsilon) (epsilon), meaning the string itself is queried with nothing appended. An optional tracer argument (a DTTracer, defined below) colours the recorded sift path blue.
DTTracer wraps a real DT node and records every step taken during a sift. It proxies is_leaf, discriminator, left, right, and state to the wrapped node, but intercepts the left/right accesses to record which edge was taken. After sift(DTTracer(root), w, oracle) returns, the tracer holds path_nodes and path_edges as sets of id()s referencing the unwrapped nodes, so they match what dt_to_dot sees.
We test sifting on the even-a’s example: the DT has one discriminator (the empty string) that separates even-a states from odd-a states.
Single leaf: every string sifts to <start>.
Two-level tree: a single split on discriminator (\varepsilon) separates <odd> (rejects (\varepsilon)) from <start> (accepts (\varepsilon), since '' has zero a’s). We start from a single-leaf DT rooted at <start> and split it, recording that <odd> is reached via 'a'.
Sifting 'a' (odd a’s) goes left to <odd>; sifting 'aa' (even a’s) goes right to <start>.
Sifting 'aa' goes right.
The two examples above only ever have one inner node at the root. To see both children being inner nodes, consider what happens after a second split. We split the <start> leaf again: discriminator 'aa' separates <start> (accepts 'aa') from a new state <even2> (rejects 'aa', e.g. reached via 'aaa'). The right child of the root is now itself an inner node, and sifting takes two steps before reaching a leaf.
Sifting 'aa' through the three-state DT:
node d='' : member('aa' + '') = member('aa') = True -> go right
node d='aa': member('aa' + 'aa') = member('aaaa') = True -> go right
leaf <start>: done.
Sifting 'aaa': goes right at root, then member('aaa'+'aa') = False, go left to leaf <even2>.
The even-a’s DT uses (\varepsilon) as its discriminator because membership of the access sequence alone distinguishes all states. Most targets produce non-empty discriminators; we will see this concretely in the Counterexample Decomposition section and in the full worklist walkthrough.
Hypothesis Construction
At any point during learning, we have a DT and a state table. Together they are enough to construct a complete hypothesis DFA. The states of the DFA are exactly the states recorded in the state table. The transitions are derived by sifting: for each state (q) and each alphabet symbol (a), form (reach(q) \cdot a) and sift it through the DT. The leaf it lands on identifies the target state. Repeat for every ((q, a)) pair and the full transition table is filled.
Concretely, for the even-a’s example with a two-leaf DT (discriminator '', left = <odd>, right = <start>), and state table {<start>: '', <odd>: 'a'}:
sift('a') : node d='': member('a' +'') = False -> left => <start> -a-> <odd>
sift('b') : node d='': member('b' +'') = True -> right => <start> -b-> <start>
sift('aa'): node d='': member('aa'+'') = True -> right => <odd> -a-> <start>
sift('ab'): node d='': member('ab'+'') = False -> left => <odd> -b-> <odd>
A transition is open if it is absent from the DFA entirely, meaning it has not yet been sifted to determine its target state.
close_transitions works through all known states, sifting each open transition. When sifting discovers a state not yet in the state table, that state is added and appended to the work list so its own transitions are processed in the same pass. This means the loop may grow as it runs: starting with one state, it may end with the full set.
leaf_index maps leaf.node_id to the list of (state, char) transitions that sifted to that leaf. It is passed in explicitly so its contents are visible to callers; update_hypothesis reads it to find stale transitions.
Accepting states are determined by querying the oracle directly on each access sequence. If (reach(q)) is accepted by the target, then (q) is an accepting state.
To see what closing looks like step by step, consider the even-a’s target with alphabet {a, b}. We start with one state <start> and a single-leaf DT. Every transition is open because no target state has been determined yet. Closing sifts reach(<start>) + 'a' = 'a' and reach(<start>) + 'b' = 'b' through the single-leaf DT. Both land on <start>, so both transitions point back to <start> – the hypothesis says everything loops. We visualise the DT and the resulting DFA at this stage.
DFA before the split.
DT before the split
Build the hypothesis.
After the first counterexample (say 'a') is processed, decompose splits the DT: a new inner node with discriminator '' separates <start> (goes right: accepts '') from new state <odd> (goes left: rejects ''). Re-sifting now correctly routes 'a' to <odd> and 'b' back to <start>.
DFA built from the two-leaf DT above.
Incremental Hypothesis Update
When decompose splits a leaf (\ell) into an inner node, every transition that was previously routed to (\ell) is now stale: it pointed to a single state, but the DT now says some of those strings belong to the left child and some to the right.
We could re-sift every transition in the hypothesis from scratch, but that is wasteful. Instead, leaf_index records exactly which (state, char) pairs were routed to each leaf. We remove only those transitions from the DFA and re-sift only them. Every other transition remains correct.
In the even-a’s example, the initial DT has a single leaf <start>. Every transition sifted during build_hypothesis therefore landed on that leaf, so every transition is stale after the first split. After splitting on '' and registering <1> (the odd-a’s state, called <odd> in earlier examples), all transitions are removed and re-sifted through the two-node DT. <start> -a-> <start> becomes <start> -a-> <1>; the rest route to the same state as before. update_hypothesis is called after a leaf split. It pops the stale transitions from leaf_index, removes them from the DFA, then re-closes to re-sift only those transitions plus the new state’s open transitions. This corresponds to TTT’s incremental hypothesis update step.
We test hypothesis construction on the even-a’s example.
Counterexample Decomposition
When the equivalence oracle returns a counterexample (ce), we know the hypothesis is wrong on (ce). But where exactly does it go wrong?
The Split Point
Walk (ce) through the hypothesis. At position 0, both hypothesis and target are in (\langle start \rangle), so they agree. At position (|ce|), they disagree (that is what the counterexample means). So somewhere in between is the first point of disagreement. That is, the position (i) where the hypothesis first takes a wrong transition.
We find this by binary search in (O(\log|ce|)) queries. At each midpoint (m), we check whether (reach(q_m) \cdot ce[m:]) gives the same answer as the full counterexample. If yes, the split point is to the right; if no, it is here or to the left.
Prefix Transformation
After finding the split point (i), we need the string that reaches state (q_i) and then reads (ce[i]). The raw counterexample prefix (ce[:i+1]) would work, but we use (reach(q_i) \cdot ce[i]) instead. This is the prefix transformation, and it gives two guarantees:
Correctness: (reach(q_i)) traces a known path through the hypothesis, so the sift is guaranteed to work even if the hypothesis is partially stale.
Minimality: the new state gets access sequence (reach(q_i) \cdot ce[i]), which is the shortest possible. Using (ce[:i+1]) could produce a much longer access sequence, making future sifts more expensive.
Splitting a Leaf
Once we have the split point, we know:
A leaf (\ell) currently represents
old_stateold_stateandnew_statewere treated as identical by the hypothesis, but the counterexample proves they are differentThe discriminator (ce[i+1:]) is the suffix that tells them apart
DTNode.split was introduced in the Discrimination Tree section above and is used directly here. decompose calls it with the leaf found by sifting the transformed prefix, the new discriminator, and the fresh state.
We now show update_hypothesis in action. We build the stale hypothesis (single-leaf DT, everything loops to <start>), then split the leaf and call update_hypothesis. The stale transition <start> -a-> <start> is removed and re-sifted to <1>.
Split the DT: add state <1> reached via 'a', then split the <start> leaf.
Now update: stale transitions are removed and re-sifted.
DFA after update: -a-> <1>, all other transitions unchanged.
Discriminator Finalization
The discriminator (ce[i+1:]) is correct but may be longer than necessary. A shorter suffix of (ce[i+1:]) may distinguish the two states just as well. Keeping discriminators short keeps the DT shallow, which reduces sifting costs in all future iterations.
A suffix (d) distinguishes two states when appending it to their access sequences gives different membership answers: oracle(reach(old_state) + d) != oracle(reach(new_state) + d). This means the two states respond differently to (d), so they cannot be the same state in the target.
Starting from the full suffix (ce[i+1:]), we try progressively shorter suffixes (by advancing the start index). We keep shrinking as long as the shorter suffix still distinguishes the two states. The moment a candidate fails to distinguish them, no further shortening can help, so we stop and return the shortest distinguishing suffix found so far.
We test discriminator finalization.
Finding the Split Point
find_split_point records the hypothesis states visited while reading (ce), then binary-searches for the first position where (reach(q_i) \cdot ce[i:]) disagrees with the target answer on (ce). That position (i) is where the hypothesis first takes a wrong transition. The search costs (O(\log|ce|)) membership queries.
The function returns both the split index and the full states list, since decompose needs the states list for the prefix transformation.
We test find_split_point on a simple case.
Putting Decomposition Together
With find_split_point, prefix_transformation, finalize_discriminator, and DTNode.split all in place, decompose is a straightforward four-step sequence. One counterexample yields exactly one new state and one new discriminator.
Note: decompose uses hypothesis transitions only to find the split point. The actual split uses (reach(q_i)) from the state table, which is always correct with respect to the target, so decompose is correct even if the hypothesis is partially stale.
We test decompose.
test 1: single symbol counterexample ‘a’
test 2: longer counterexample ‘aab’; binary search finds split at position 0, so the new state still gets access sequence ‘a’ and discriminator is ‘b’.
test 3: two states, counterexample ‘aa’ reveals a third state. The DT gains a second level; the new state gets access sequence ‘aa’.
Worklist Growth in close_transitions
We now trace the full (a|b)*ba learning run to show how state names emerge from the algorithm and how the worklist grows during close_transitions. No states are pre-declared; they are created by dfa.new_state() inside decompose as each counterexample is processed.
Step 1. Single-leaf DT, only <start> in the state table. close_transitions sifts '' + 'a' and '' + 'b'. Both land on the only leaf <start>, so no new states are discovered.
DFA after step 1: everything loops back to <start>.
Step 2. Counterexample 'ba': the hypothesis accepts it as <start> (which is accepting) but shouldn’t, since <start> should only accept ''. decompose creates a fresh state (call it s1) and splits the DT leaf with discriminator 'a'. update_hypothesis re-sifts stale transitions; sifting '' + 'b' now lands on s1, which is new, so it is appended to the worklist and its transitions are closed immediately. Worklist grows from ['<start>'] to ['<start>', s1].
DFA after step 2.
Step 3. Counterexample 'ba' again; now the hypothesis rejects it because s1 -a-> <start> is wrong (should reach an accepting state). decompose creates s2 and splits the <start> leaf with discriminator (\varepsilon). update_hypothesis re-sifts; sifting reach(s1) + 'a' = 'ba' lands on s2, which is new and appended. Worklist grows to include s2. Here is that sift path, the one that grows the worklist:
And sifting reach(s1) + 'b' = 'bb' lands on <start>, which is already known; no append.
After update_hypothesis finishes, all three states are wired up.
Final DFA, with states named by the algorithm as they were discovered.
Non-Redundancy
The central claim of the TTT is that it never makes a membership query whose answer could have been derived from earlier queries. To see what this means concretely: in L*, if the counterexample is ba, the algorithm adds both a and ba as new suffix columns and re-queries every existing state against both. If a was already a column, that work is wasted. TTT avoids this by extracting exactly one new suffix per counterexample and routing all future classification through the DT. This holds at every level:
Sifting is non-redundant. Every query is (w \cdot d) where (d) was placed in the DT by a previous split that proved it necessary.
Splitting is non-redundant. Each split adds exactly one discriminator, proven necessary by the counterexample.
Closing is non-redundant. Each transition is sifted exactly once per iteration. Newly discovered states come with their DT position already established by the sift that found them, so no extra queries are needed.
This contrasts with L*, where adding all (k) suffixes of a counterexample forces re-querying every existing row against every new column, most of which add no new information.
A Note on the Equivalence Oracle
TTT assumes the equivalence oracle is exact: if it says the hypothesis is wrong, it returns a string the hypothesis genuinely misclassifies. The Teacher we use is a PAC oracle: it samples a finite set of strings and declares equivalence if none expose a mistake. This is an approximation.
In principle, a false counterexample from a PAC oracle could cause TTT to create a redundant state: one that could have been merged with an existing state without changing the language the DFA accepts. The DFA would still be correct, just slightly larger than necessary. Neither TTT nor L* guarantees a globally minimal DFA under a PAC oracle: both can acquire spurious states or columns from false counterexamples, and neither performs a global minimization pass afterward. In practice this is a minor concern: the PAC oracle is unlikely to produce false counterexamples with reasonable parameters, and the language accepted by the DFA is unaffected either way.
The Main Loop
The main loop orchestrates everything:
Build the initial hypothesis: one state, all transitions open.
Ask the equivalence oracle. If it says yes, we are done.
If not, decompose the counterexample to find one new state and one new discriminator.
Incrementally update the hypothesis: re-sift only the stale transitions.
Repeat from step 2.
The loop runs exactly (n - 1) times where (n) is the number of states in the minimal DFA, one counterexample per new state discovered.
Examples
We test TTT on three targets of increasing complexity.
target 1: strings over {a, b} with an even number of a’s
target 2: strings over {a, b} that end in ‘b’
target 3: binary strings whose value is divisible by 3
This target has no convenient regex, so we write a custom teacher. It is a good stress test: the minimal DFA has exactly 3 states (one per remainder mod 3), the alphabet is {0, 1}, and transitions are determined by how reading a bit updates the current value modulo 3. This exercises TTT on a target where the states correspond to arithmetic structure rather than string patterns.
Test this.
Evaluating Model Accuracy
We measure precision and recall by cross-fuzzing the target grammar and the inferred grammar. Precision is the fraction of strings generated by the inferred DFA that the target accepts. Recall is the fraction of strings generated by the target that the inferred DFA accepts.
The inferred DFA may contain a dead/sink state: a non-accepting state with no exit, representing strings the target permanently rejects. Such a state causes LimitFuzzer to loop, because the grammar has no finite derivation from it. We remove dead states before fuzzing using fuzzer.compute_cost, which assigns each nonterminal the minimum number of steps needed to reach a terminal string. Any nonterminal with infinite cost is a dead state; we remove it and all rules that reference it.
We define a match helper that wraps the Earley parser in a boolean check.
Testing Each pair is (regex, alphabet). Cases cover a range of DFA shapes: two-segment and three-segment chains, prefix-anchored, suffix-anchored, substring-containment, exact-alternation, and disjoint finite sets.
Comparison with L*
| L* | TTT | |
|---|---|---|
| Data structure | Flat observation table | Discrimination tree (Kearns-Vazirani) |
| Counterexample processing | Add all (k) suffixes | Binary search for 1 suffix (Rivest-Schapire) |
| Prefix transformation | No | Yes (minimal access sequences, TTT) |
| Discriminator finalization | No | Yes (shallow DT, TTT) |
| Redundant queries | Many | None: the DT structure prevents re-querying known distinctions |
| Closedness check | Explicit global scan | Lazy, local (open transitions) |
| Consistency check | Explicit global scan | Structurally prevented by DT |
The DT depth is bounded by (n) (one split per state), so sifting never becomes expensive. This makes TTT the preferred algorithm when membership queries are costly, as is typical when learning protocol implementations or library APIs through testing.
References
Artifacts
The runnable Python source for this notebook is available here.
Ronald L. Rivest and Robert E. Schapire. Inference of Finite Automata Using Homing Sequences. Information and Computation, 103(2):51-73, 1993. ↩ ↩2
Michael Kearns and Umesh Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994. pp. 44-58. ↩ ↩2
Malte Isberner, Falk Howar, and Bernhard Steffen. The TTT Algorithm: A Redundancy-Free Approach to Active Automata Learning. RV 2014. ↩ ↩2
Markus Frohme. Active Automata Learning with Adaptive Distinguishing Sequences. Master Thesis, TU Dortmund, 2015. https://arxiv.org/abs/1902.01139 ↩
Falk Howar, Bernhard Steffen, and Maik Merten. Internalization of Observation Packs. ICFEM 2012. ↩
这条对你有帮助吗?