EricSpencer00 commited on
Commit
c1a3e8b
·
verified ·
1 Parent(s): 29897c4

v20: README (benchmarks from latest full CSV if present)

Browse files
Files changed (1) hide show
  1. README.md +82 -109
README.md CHANGED
@@ -20,7 +20,7 @@ datasets:
20
  pipeline_tag: text-generation
21
  ---
22
 
23
- # ChatTLA-20b (v15)
24
 
25
  ChatTLA is a fine-tuned version of [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) specialised in generating **TLA+ formal specifications** — the language used by AWS, Microsoft, and Intel to mathematically verify distributed systems.
26
 
@@ -28,87 +28,59 @@ Given a plain-English description of a concurrent or distributed system, ChatTLA
28
 
29
  ---
30
 
31
- ## Benchmark Results (v15, 3-shot self-correct)
32
 
33
- Evaluated on a 30-spec held-out suite spanning communication protocols, concurrency primitives, consensus, data structures, memory/caches, mutual exclusion, classical puzzles, scheduling, transactions, and workflow state machines. Each spec gets up to 3 self-correction attempts using TLC error feedback. Tiers are defined by what the spec actually does under SANY and TLC, not just whether it parses:
 
 
 
 
 
 
34
 
35
  | Tier | Meaning |
36
  |------|---------|
37
- | 💎 Diamond | Gold **and** TLC explores ≥1 distinct state, has a non-trivial invariant, and the invariant catches a mutation |
38
  | 🥇 Gold | SANY parses **and** TLC model-checks clean |
39
  | 🥈 Silver | SANY parses, TLC finds violation or timeout |
40
  | Bronze | SANY parse failure |
41
 
42
- Diamond is the headline metric: it's the only tier that proves the spec is *semantically* useful rather than just syntactically valid.
43
-
44
- ### Per-spec results (30-spec holdout)
45
-
46
- | # | Batch | Module | Tier | Diamond |
47
- |---|-------|--------|------|:------:|
48
- | 1 | communication_protocols | AlternatingBit | Bronze | |
49
- | 2 | communication_protocols | Arp | Bronze | |
50
- | 3 | communication_protocols | AtomicRegister | Bronze | |
51
- | 4 | concurrency_primitives | BinarySemaphore | Bronze | |
52
- | 5 | concurrency_primitives | Channel | Bronze | |
53
- | 6 | concurrency_primitives | CountDownLatch | Bronze | |
54
- | 7 | consensus_election | AtomicCommit | Bronze | |
55
- | 8 | consensus_election | BullyElection | 🥇 Gold | 💎 |
56
- | 9 | consensus_election | ByzantineQuorum | Bronze | |
57
- | 10 | data_structures | BinaryHeap | Bronze | |
58
- | 11 | data_structures | BloomCounter | 🥇 Gold | 💎 |
59
- | 12 | data_structures | BloomFilter | ⏱ Timeout | |
60
- | 13 | memory_caches | ArenaAllocator | 🥇 Gold | 💎 |
61
- | 14 | memory_caches | BuddyAllocator | Bronze | |
62
- | 15 | memory_caches | CopyingGc | Bronze | |
63
- | 16 | mutual_exclusion | AdaptiveMutex | 🥇 Gold | 💎 |
64
- | 17 | mutual_exclusion | AndersonMutex | 🥇 Gold | 💎 |
65
- | 18 | mutual_exclusion | AravindMutex | ⏱ Timeout | |
66
- | 19 | puzzles_classical | BlocksWorld | Bronze | |
67
- | 20 | puzzles_classical | ChessKingMoves | Bronze | |
68
- | 21 | puzzles_classical | ColoredHats | Bronze | |
69
- | 22 | scheduling_resources | AdmissionControl | 🥇 Gold | 💎 |
70
- | 23 | scheduling_resources | BackpressureChannel | 🥇 Gold | 💎 |
71
- | 24 | scheduling_resources | Bankers | ⏱ Timeout | |
72
- | 25 | transactions_databases | ChainReplication | ⏱ Timeout | |
73
- | 26 | transactions_databases | DistributedLock | Bronze | |
74
- | 27 | transactions_databases | FencingToken | Bronze | |
75
- | 28 | workflows_state_machines| ContentModeration | 🥇 Gold | 💎 |
76
- | 29 | workflows_state_machines| DocumentApproval | 🥇 Gold | 💎 |
77
- | 30 | workflows_state_machines| EmailVerification | Bronze | |
78
-
79
- **Diamond: 9/30 (30%) · Gold: 9/30 (30%)**
80
-
81
- ### Per-domain breakdown
82
-
83
- | Domain | Diamond |
84
- |--------|:-------:|
85
- | communication_protocols | 0/3 |
86
- | concurrency_primitives | 0/3 |
87
- | consensus_election | 1/3 |
88
- | data_structures | 1/3 |
89
- | memory_caches | 1/3 |
90
- | mutual_exclusion | 2/3 |
91
- | puzzles_classical | 0/3 |
92
- | scheduling_resources | 2/3 |
93
- | transactions_databases | 0/3 |
94
- | workflows_state_machines | 2/3 |
95
 
96
  ### Version history
97
 
98
- | Version | Suite | SANY | TLC | Diamond / Notes |
99
- |---------|-------|------|-----|-----------------|
100
- | v6 | 20-problem handcraft | 4/20 (20%) | 1/20 (5%) | — |
101
- | v7 | 20-problem handcraft | 6/20 (30%) | 1/20 (5%) | — |
102
- | v8 | 20-problem handcraft | 8/20 (40%) | 1/20 (5%) | — |
103
- | v9 | 20-problem handcraft | 6/20 (30%) | 3/20 (15%) | — |
104
  | v9 best-of-5 + self-correct | 20-problem handcraft | 16/20 (80%) | 5/20 (25%) | — |
105
- | v10 | 20-problem handcraft | 6/20 (30%) | 2/20 (10%) | — |
106
- | v11 | 20-problem handcraft | 6/20 (30%) | 2/20 (10%) | — |
107
- | v13 (SFT + DPO) | 20-problem handcraft | 9/20 (45%) | 5/20 (25%) | not measured (trivial invariants counted as Gold) |
108
- | v14 (Diamond SFT) | 30-spec holdout (single-shot) | 16/30 (53%) | 5/30 (17%) | 4/30 (13%) |
109
- | **v15 (Repair GRPO)** | **30-spec holdout (3-shot)** | 9/30 (30%) | 9/30 (30%) | **9/30 (30%)** |
 
110
 
111
- > v15 applies repair-based GRPO (Group Relative Policy Optimization) on top of v14's Diamond SFT weights. The model learns to fix its own broken specs by training on (broken repaired) trajectory pairs with TLC-graded improvement reward. v15 eval uses 3-shot self-correction with TLC error feedback, matching realistic usage; v14 was evaluated single-shot, so SANY/TLC rates are not directly comparable. Diamond is the metric to track going forward.
112
 
113
  ---
114
 
@@ -117,10 +89,12 @@ Diamond is the headline metric: it's the only tier that proves the spec is *sema
117
  ### Ollama (recommended)
118
 
119
  ```bash
120
- # Pull and run directly
121
- ollama run EricSpencer00/chattla-20b
122
 
123
- # Or use the bundled Modelfile
 
 
124
  curl -L https://huggingface.co/EricSpencer00/chattla-20b/resolve/main/gguf/Modelfile -o Modelfile
125
  ollama create chattla:20b -f Modelfile
126
  ollama run chattla:20b "Write a TLA+ spec for a token ring with N nodes."
@@ -148,13 +122,11 @@ print(result[0]["generated_text"])
148
  ### llama.cpp / GGUF
149
 
150
  ```bash
151
- # Download GGUF
152
  huggingface-cli download EricSpencer00/chattla-20b \
153
- gguf/chattla-20b-v15-Q8_0.gguf \
154
  --local-dir ./chattla
155
 
156
- # Run with llama.cpp
157
- ./llama-cli -m chattla/gguf/chattla-20b-v15-Q8_0.gguf \
158
  -n 1024 --temp 0.4 \
159
  -p "Write a TLA+ spec for mutual exclusion with N processes."
160
  ```
@@ -166,17 +138,15 @@ huggingface-cli download EricSpencer00/chattla-20b \
166
  | Property | Value |
167
  |----------|-------|
168
  | Base model | openai/gpt-oss-20b |
169
- | Parameters | 20.9B |
170
  | Architecture | GptOss (sliding + full attention) |
171
- | Fine-tuning method | Diamond SFT (LoRA) → Repair GRPO (LoRA) → merged |
172
  | Context length | 2048 (trained) / 131072 (base) |
173
  | GGUF quantisation | Q8_0 (~22 GB) |
174
- | Training date | April 2026 |
175
 
176
  ### System prompt
177
 
178
- The model is prompted with:
179
-
180
  ```
181
  You are ChatTLA, an expert at writing verified TLA+ formal specifications.
182
  When asked to write a TLA+ spec, follow these rules exactly:
@@ -193,55 +163,58 @@ When asked to write a TLA+ spec, follow these rules exactly:
193
 
194
  ## Training
195
 
196
- ### Phase 1: Diamond SFT (v14)
197
 
198
- v14 was produced by the **Diamond curation pipeline**: candidate TLA+ specs are generated by an earlier checkpoint, then graded by a tlc_validator that checks SANY parsing, TLC state-space exploration, non-trivial invariants, and mutation-test sensitivity. Specs that survive grading are LLM-judged for chain-of-thought quality, leaving a curated training pool (209 raw → 73 curated for the v14 SFT round). The model is fine-tuned with LoRA on this pool and merged.
199
 
200
- ### Phase 2: Repair GRPO (v15)
201
 
202
- v15 applies **repair-based GRPO** (Group Relative Policy Optimization) on top of the v14 checkpoint. The key insight: instead of training on gold-standard specs alone, the model learns to *fix broken specs* using TLC error feedback as reward signal.
203
 
204
- **Pipeline:**
205
  1. **Trajectory collection** — the v14 model generates specs for 398 problems with up to 6 repair iterations each, producing (broken, repaired) pairs scored by a multi-stage validator (SANY → TLC → Apalache → TLAPS).
206
- 2. **Dataset filtering** — pairs are filtered to keep the "learnable middle": `min_before_score=0.10` (drop unparseable) and `max_before_score=0.80` (drop already-good), yielding ~430 gradable pairs centered on score ≈ 0.45.
207
- 3. **GRPO training** — 300 steps, 4 generations per prompt, max 384 completion tokens. The reward is the improvement delta: `after_score - before_score`, normalized by group. Learning rate 3e-6, KL penalty β=0.02, temperature 0.5.
208
- 4. **LoRA merge** — best checkpoint (around step 140–160 where reward peaked) merged back into full weights.
209
 
210
- Reward peaked at steps 140–160 with `reward_std ≈ 0.25` (vs 0.0 in prior full-spec GRPO attempts that had zero variance). This was the first successful RL run on TLA+ spec generation.
211
 
212
- **R2 regression and R3 (in progress).** A second flywheel round (R2) continued GRPO from v15's merged weights on a freshly harvested dataset and regressed to 6/30 (20%). Post-mortem: the Phase 2 merge deduped pairs on `(nl[:80], round(before_score, 1))`, a score-bucket width of 0.1 that collapsed most of the learnable-middle band; combined with a raised `min_before_score = 0.10`, the usable training set fell from 433 → 179 pairs, shifted hard (mean before_score 0.26 → 0.42), and the model overtrained past its 150-step peak over 300 steps. Regressions concentrated in `mutual_exclusion` and `workflows_state_machines` (2/3 lost each). R3 pulls only the data and step-budget levers: dedup key widened to `(nl[:120], round(before_score, 2))`, score floor restored to 0.02, `--max-iters` raised 6 → 9 to grow the raw pool, and `--max-steps` cut to 175 with a checkpoint picker that selects the save closest to step 150. v15 remains the production checkpoint until R3 beats 9/30.
213
 
214
- DPO/KTO refinement was used in v11–v13 but was deprecated in the Diamond overhaul: 0/484 specs from those preference-trained checkpoints actually passed Diamond, indicating the model had learned TLA+ syntax without learning semantics.
 
 
 
 
 
215
 
216
- ### Training configuration
 
 
217
 
218
  | Setting | Value |
219
  |---------|-------|
220
- | SFT method | LoRA (lora_dropout=0) |
221
- | GRPO method | LoRA, 4 generations, 384 max completion |
222
- | GRPO learning rate | 3e-6 |
223
- | GRPO KL β | 0.02 |
224
- | GRPO steps | 300 (best checkpoint ~150) |
225
- | Max sequence length | 2048 |
226
- | TRL | 0.28.0 |
227
- | Transformers | 5.2.0 |
228
- | PyTorch | 2.10.0 |
229
  | Hardware | 2× Quadro RTX 8000 (48 GB each) |
230
 
 
 
231
  ---
232
 
233
  ## Files
234
 
235
  ```
236
  EricSpencer00/chattla-20b
237
- ├── config.json # Model architecture
238
- ├── tokenizer.json # Tokenizer
239
- ├── tokenizer_config.json
240
- ├── chat_template.jinja # Chat template
241
- ├── pytorch_model.bin # Full BF16 weights (39 GB)
242
- ├── generation_config.json
243
  └── gguf/
244
- ├── chattla-20b-v15-Q8_0.gguf # Quantised GGUF for Ollama / llama.cpp
245
  └── Modelfile # Ollama Modelfile
246
  ```
247
 
 
20
  pipeline_tag: text-generation
21
  ---
22
 
23
+ # ChatTLA-20b (v20)
24
 
25
  ChatTLA is a fine-tuned version of [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) specialised in generating **TLA+ formal specifications** — the language used by AWS, Microsoft, and Intel to mathematically verify distributed systems.
26
 
 
28
 
29
  ---
30
 
31
+ ## What's new in v20
32
 
33
+ v20 is the **autonomous self-repair flywheel** model — the result of 5 successful repair-GRPO cycles (c1 c5) building on v15's Repair-GRPO weights. Each cycle harvests fresh failures from the production model, mines them into broken/repaired training pairs, and runs Repair-GRPO with TLC-graded reward. Promotion to production happens only when the 30-spec holdout score improves; cycles c6, c7, c8 trained on top of v20/c5 but failed to beat it (11, 9, 9 vs 12) and were not promoted.
34
+
35
+ ---
36
+
37
+ ## Benchmark Results (v20, 3-shot self-correct)
38
+
39
+ Evaluated on the same 30-spec held-out suite as v14/v15, spanning communication protocols, concurrency primitives, consensus, data structures, memory/caches, mutual exclusion, classical puzzles, scheduling, transactions, and workflow state machines. Each spec gets up to 3 self-correction attempts using TLC error feedback.
40
 
41
  | Tier | Meaning |
42
  |------|---------|
 
43
  | 🥇 Gold | SANY parses **and** TLC model-checks clean |
44
  | 🥈 Silver | SANY parses, TLC finds violation or timeout |
45
  | Bronze | SANY parse failure |
46
 
47
+ > Diamond tier (mutation-test caught + non-trivial invariant) was not assessed in this round; v20's evaluation reports Gold-rate only.
48
+
49
+ ### Per-domain breakdown (30-spec holdout, 3-shot)
50
+
51
+ | Domain | Gold |
52
+ |--------|:-----:|
53
+ | communication_protocols | 2/3 |
54
+ | concurrency_primitives | 2/3 |
55
+ | consensus_election | 2/3 |
56
+ | data_structures | 0/3 |
57
+ | memory_caches | 0/3 |
58
+ | mutual_exclusion | 2/3 |
59
+ | puzzles_classical | 1/3 |
60
+ | scheduling_resources | 1/3 |
61
+ | transactions_databases | 1/3 |
62
+ | workflows_state_machines | 1/3 |
63
+ | **Total** | **12 / 30 (40 %)** |
64
+
65
+ Domains where v20 reaches 2/3: communication, concurrency, consensus, mutual exclusion. Domains where v20 still fails completely: data structures, memory/caches.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
66
 
67
  ### Version history
68
 
69
+ | Version | Suite | SANY | TLC (Gold) | Notes |
70
+ |---------|-------|------|------------|-------|
71
+ | v6 | 20-problem handcraft | 4/20 (20%) | 1/20 (5%) | — |
72
+ | v7 | 20-problem handcraft | 6/20 (30%) | 1/20 (5%) | — |
73
+ | v8 | 20-problem handcraft | 8/20 (40%) | 1/20 (5%) | — |
74
+ | v9 | 20-problem handcraft | 6/20 (30%) | 3/20 (15%) | — |
75
  | v9 best-of-5 + self-correct | 20-problem handcraft | 16/20 (80%) | 5/20 (25%) | — |
76
+ | v10 | 20-problem handcraft | 6/20 (30%) | 2/20 (10%) | — |
77
+ | v11 | 20-problem handcraft | 6/20 (30%) | 2/20 (10%) | — |
78
+ | v13 (SFT + DPO) | 20-problem handcraft | 9/20 (45%) | 5/20 (25%) | trivial invariants counted as Gold |
79
+ | v14 (Diamond SFT) | 30-spec holdout (single-shot) | 16/30 (53%) | 5/30 (17%) | Diamond 4/30 (13%) |
80
+ | v15 (Repair GRPO) | 30-spec holdout (3-shot) | 9/30 (30%) | 9/30 (30%) | Diamond 9/30 (30%) |
81
+ | **v20 (Flywheel c5)** | **30-spec holdout (3-shot)** | — | **12/30 (40%)** | first promoted-by-holdout-gain release |
82
 
83
+ Compared to v15, v20 adds 3 specs to the Gold pool (+33 % relative): the gains concentrate in **communication** (+2) and **mutual exclusion** (+2), with **transactions** and **workflows** holding ground. **Data structures** and **memory/caches** remain unsolved across both versions and are the obvious next target.
84
 
85
  ---
86
 
 
89
  ### Ollama (recommended)
90
 
91
  ```bash
92
+ ollama run hf.co/EricSpencer00/chattla-20b
93
+ ```
94
 
95
+ Or use the bundled Modelfile:
96
+
97
+ ```bash
98
  curl -L https://huggingface.co/EricSpencer00/chattla-20b/resolve/main/gguf/Modelfile -o Modelfile
99
  ollama create chattla:20b -f Modelfile
100
  ollama run chattla:20b "Write a TLA+ spec for a token ring with N nodes."
 
122
  ### llama.cpp / GGUF
123
 
124
  ```bash
 
125
  huggingface-cli download EricSpencer00/chattla-20b \
126
+ gguf/chattla-20b-v20-Q8_0.gguf \
127
  --local-dir ./chattla
128
 
129
+ ./llama-cli -m chattla/gguf/chattla-20b-v20-Q8_0.gguf \
 
130
  -n 1024 --temp 0.4 \
131
  -p "Write a TLA+ spec for mutual exclusion with N processes."
132
  ```
 
138
  | Property | Value |
139
  |----------|-------|
140
  | Base model | openai/gpt-oss-20b |
141
+ | Parameters | 20.9 B |
142
  | Architecture | GptOss (sliding + full attention) |
143
+ | Fine-tuning method | Diamond SFT (LoRA) → Repair GRPO (LoRA) → Self-Repair Flywheel (5× LoRA) → merged |
144
  | Context length | 2048 (trained) / 131072 (base) |
145
  | GGUF quantisation | Q8_0 (~22 GB) |
146
+ | Training date | April – May 2026 |
147
 
148
  ### System prompt
149
 
 
 
150
  ```
151
  You are ChatTLA, an expert at writing verified TLA+ formal specifications.
152
  When asked to write a TLA+ spec, follow these rules exactly:
 
163
 
164
  ## Training
165
 
166
+ ### Phase 1 Diamond SFT (v14)
167
 
168
+ v14 was produced by the **Diamond curation pipeline**: candidate TLA+ specs are generated by an earlier checkpoint, then graded by a tlc_validator that checks SANY parsing, TLC state-space exploration, non-trivial invariants, and mutation-test sensitivity. Specs that survive grading are LLM-judged for chain-of-thought quality, leaving a curated training pool (209 raw → 73 curated). The model is fine-tuned with LoRA on this pool and merged.
169
 
170
+ ### Phase 2 Repair GRPO (v15)
171
 
172
+ v15 applies **repair-based GRPO** (Group Relative Policy Optimization) on top of v14: instead of training on gold-standard specs alone, the model learns to *fix broken specs* using TLC error feedback as reward signal.
173
 
 
174
  1. **Trajectory collection** — the v14 model generates specs for 398 problems with up to 6 repair iterations each, producing (broken, repaired) pairs scored by a multi-stage validator (SANY → TLC → Apalache → TLAPS).
175
+ 2. **Dataset filtering** — pairs filtered to keep the "learnable middle" (`min_before_score=0.10`, `max_before_score=0.80`), yielding ~430 gradable pairs centered on score ≈ 0.45.
176
+ 3. **GRPO training** — 300 steps, 4 generations per prompt, max 384 completion tokens. Reward is the score-improvement delta `after before`, normalized by group. lr=3e-6, KL β=0.02, temp=0.5.
177
+ 4. **LoRA merge** — best checkpoint (around step 140–160) merged back into full weights.
178
 
179
+ ### Phase 3 Self-Repair Flywheel (v20)
180
 
181
+ v20 wraps Phase 2 in an **autonomous outer loop** that keeps running on the production GPU pool. Each cycle:
182
 
183
+ 1. **Failure harvest.** Sample 400 random NL prompts, call the *current* production model, classify the outputs (Gold / Silver / Bronze).
184
+ 2. **Pair construction.** Bootstrap (broken ��� repaired) pairs from the bronze and silver outputs; the repaired side comes from the same model under a stricter retry budget.
185
+ 3. **Repair-GRPO step.** 160 steps on the harvested pairs, LoRA r=8 / α=16, lr=3e-6, KL β=0.02, on the current best merged base.
186
+ 4. **Merge → GGUF → Ollama** as `chattla:20b-c{N}`.
187
+ 5. **Holdout eval.** 30-spec 3-shot benchmark against the same held-out suite.
188
+ 6. **Promote-on-improvement.** If `score_cN > best_score`, update the production tag `chattla:20b-repair`. Otherwise keep prior; the failed candidate stays as `chattla:20b-c{N}` for analysis.
189
 
190
+ v20 is **cycle 5** of this flywheel: c1 → 5/30, c2 → 10/30, c3 → 8/30, c4 → 5/30, **c5 → 12/30 (promoted)**. Cycles c6 (11/30), c7 (9/30), c8 (9/30) did not promote.
191
+
192
+ ### Training configuration (v20 incremental cycle)
193
 
194
  | Setting | Value |
195
  |---------|-------|
196
+ | Method | Repair GRPO with LoRA |
197
+ | LoRA rank / α / dropout | 8 / 16 / 0.0 |
198
+ | GRPO steps | 160 per cycle |
199
+ | GRPO generations / prompt | 4 |
200
+ | GRPO max completion length | 384 tokens |
201
+ | Learning rate | 3e-6 |
202
+ | KL β | 0.02 |
203
+ | Temperature | 0.5 |
204
+ | Failures harvested / cycle | 400 (filtered to ~150–250 gradable pairs) |
205
  | Hardware | 2× Quadro RTX 8000 (48 GB each) |
206
 
207
+ DPO/KTO refinement was used in v11–v13 but was deprecated in the Diamond overhaul: 0/484 specs from those preference-trained checkpoints actually passed Diamond, indicating the model had learned TLA+ syntax without learning semantics.
208
+
209
  ---
210
 
211
  ## Files
212
 
213
  ```
214
  EricSpencer00/chattla-20b
215
+ ├── README.md
 
 
 
 
 
216
  └── gguf/
217
+ ├── chattla-20b-v20-Q8_0.gguf # Quantised GGUF for Ollama / llama.cpp (~22 GB)
218
  └── Modelfile # Ollama Modelfile
219
  ```
220