EricSpencer00 committed on
Commit 0048f5a · verified · 1 Parent(s): 5315709

SFT diamond_v3 job 7098234 (checkpoint-396, 3 epochs)

README.md CHANGED
@@ -1,270 +1,209 @@
1
  ---
2
- base_model: openai/gpt-oss-20b
3
- language:
4
- - en
5
- license: apache-2.0
6
- library_name: transformers
7
- model_name: ChatTLA-20b
8
  tags:
9
- - tla-plus
10
- - formal-methods
11
- - formal-verification
12
- - code-generation
13
- - trl
14
  - sft
15
- - grpo
16
- - reinforcement-learning
17
- - generated_from_trainer
18
- datasets:
19
- - EricSpencer00/chattla-20b
20
- pipeline_tag: text-generation
21
  ---
22
 
23
- # ChatTLA-20b (v15)
24
 
25
- ChatTLA is a fine-tuned version of [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) specialised in generating **TLA+ formal specifications** — the language used by AWS, Microsoft, and Intel to mathematically verify distributed systems.
26
 
27
- Given a plain-English description of a concurrent or distributed system, ChatTLA outputs a complete, syntactically valid TLA+ module including `Init`, `Next`, `Spec`, `TypeOK`, and domain invariants, together with a TLC model-checker configuration block.
28
 
29
- ---
30
 
31
- ## Benchmark Results (v15, 3-shot self-correct)
32
-
33
- Evaluated on a 30-spec held-out suite spanning communication protocols, concurrency primitives, consensus, data structures, memory/caches, mutual exclusion, classical puzzles, scheduling, transactions, and workflow state machines. Each spec gets up to 3 self-correction attempts using TLC error feedback. Tiers are defined by what the spec actually does under SANY and TLC, not just whether it parses:
34
-
35
- | Tier | Meaning |
36
- |------|---------|
37
- | 💎 Diamond | Gold **and** TLC explores ≥1 distinct state, has a non-trivial invariant, and the invariant catches a mutation |
38
- | 🥇 Gold | SANY parses **and** TLC model-checks clean |
39
- | 🥈 Silver | SANY parses, TLC finds violation or timeout |
40
- | Bronze | SANY parse failure |
41
-
42
- Diamond is the headline metric: it's the only tier that proves the spec is *semantically* useful rather than just syntactically valid.
43
-
44
- ### Per-spec results (30-spec holdout)
45
-
46
- | # | Batch | Module | Tier | Diamond |
47
- |---|-------|--------|------|:------:|
48
- | 1 | communication_protocols | AlternatingBit | Bronze | |
49
- | 2 | communication_protocols | Arp | Bronze | |
50
- | 3 | communication_protocols | AtomicRegister | Bronze | |
51
- | 4 | concurrency_primitives | BinarySemaphore | Bronze | |
52
- | 5 | concurrency_primitives | Channel | Bronze | |
53
- | 6 | concurrency_primitives | CountDownLatch | Bronze | |
54
- | 7 | consensus_election | AtomicCommit | Bronze | |
55
- | 8 | consensus_election | BullyElection | 🥇 Gold | 💎 |
56
- | 9 | consensus_election | ByzantineQuorum | Bronze | |
57
- | 10 | data_structures | BinaryHeap | Bronze | |
58
- | 11 | data_structures | BloomCounter | 🥇 Gold | 💎 |
59
- | 12 | data_structures | BloomFilter | ⏱ Timeout | |
60
- | 13 | memory_caches | ArenaAllocator | 🥇 Gold | 💎 |
61
- | 14 | memory_caches | BuddyAllocator | Bronze | |
62
- | 15 | memory_caches | CopyingGc | Bronze | |
63
- | 16 | mutual_exclusion | AdaptiveMutex | 🥇 Gold | 💎 |
64
- | 17 | mutual_exclusion | AndersonMutex | 🥇 Gold | 💎 |
65
- | 18 | mutual_exclusion | AravindMutex | ⏱ Timeout | |
66
- | 19 | puzzles_classical | BlocksWorld | Bronze | |
67
- | 20 | puzzles_classical | ChessKingMoves | Bronze | |
68
- | 21 | puzzles_classical | ColoredHats | Bronze | |
69
- | 22 | scheduling_resources | AdmissionControl | 🥇 Gold | 💎 |
70
- | 23 | scheduling_resources | BackpressureChannel | 🥇 Gold | 💎 |
71
- | 24 | scheduling_resources | Bankers | ⏱ Timeout | |
72
- | 25 | transactions_databases | ChainReplication | ⏱ Timeout | |
73
- | 26 | transactions_databases | DistributedLock | Bronze | |
74
- | 27 | transactions_databases | FencingToken | Bronze | |
75
- | 28 | workflows_state_machines| ContentModeration | 🥇 Gold | 💎 |
76
- | 29 | workflows_state_machines| DocumentApproval | 🥇 Gold | 💎 |
77
- | 30 | workflows_state_machines| EmailVerification | Bronze | |
78
-
79
- **Diamond: 9/30 (30%) · Gold: 9/30 (30%)**
80
-
81
- ### Per-domain breakdown
82
-
83
- | Domain | Diamond |
84
- |--------|:-------:|
85
- | communication_protocols | 0/3 |
86
- | concurrency_primitives | 0/3 |
87
- | consensus_election | 1/3 |
88
- | data_structures | 1/3 |
89
- | memory_caches | 1/3 |
90
- | mutual_exclusion | 2/3 |
91
- | puzzles_classical | 0/3 |
92
- | scheduling_resources | 2/3 |
93
- | transactions_databases | 0/3 |
94
- | workflows_state_machines | 2/3 |
95
-
96
- ### Version history
97
-
98
- | Version | Suite | SANY | TLC | Diamond / Notes |
99
- |---------|-------|------|-----|-----------------|
100
- | v6 | 20-problem handcraft | 4/20 (20%) | 1/20 (5%) | — |
101
- | v7 | 20-problem handcraft | 6/20 (30%) | 1/20 (5%) | — |
102
- | v8 | 20-problem handcraft | 8/20 (40%) | 1/20 (5%) | — |
103
- | v9 | 20-problem handcraft | 6/20 (30%) | 3/20 (15%) | — |
104
- | v9 best-of-5 + self-correct | 20-problem handcraft | 16/20 (80%) | 5/20 (25%) | — |
105
- | v10 | 20-problem handcraft | 6/20 (30%) | 2/20 (10%) | — |
106
- | v11 | 20-problem handcraft | 6/20 (30%) | 2/20 (10%) | — |
107
- | v13 (SFT + DPO) | 20-problem handcraft | 9/20 (45%) | 5/20 (25%) | not measured (trivial invariants counted as Gold) |
108
- | v14 (Diamond SFT) | 30-spec holdout (single-shot) | 16/30 (53%) | 5/30 (17%) | 4/30 (13%) |
109
- | **v15 (Repair GRPO)** | **30-spec holdout (3-shot)** | 9/30 (30%) | 9/30 (30%) | **9/30 (30%)** |
110
-
111
- > v15 applies repair-based GRPO (Group Relative Policy Optimization) on top of v14's Diamond SFT weights. The model learns to fix its own broken specs by training on (broken → repaired) trajectory pairs with TLC-graded improvement reward. v15 eval uses 3-shot self-correction with TLC error feedback, matching realistic usage; v14 was evaluated single-shot, so SANY/TLC rates are not directly comparable. Diamond is the metric to track going forward.
112
 
113
- ---
114
 
115
- ## Quick Start
116
 
117
- ### Ollama (recommended)
118
 
119
- ```bash
120
- # Pull and run directly
121
- ollama run EricSpencer00/chattla-20b
122
 
123
- # Or use the bundled Modelfile
124
- curl -L https://huggingface.co/EricSpencer00/chattla-20b/resolve/main/gguf/Modelfile -o Modelfile
125
- ollama create chattla:20b -f Modelfile
126
- ollama run chattla:20b "Write a TLA+ spec for a token ring with N nodes."
127
- ```
 
 
128
 
129
- ### Python (transformers)
130
 
131
- ```python
132
- from transformers import pipeline
133
 
134
- pipe = pipeline(
135
- "text-generation",
136
- model="EricSpencer00/chattla-20b",
137
- device_map="auto",
138
- )
139
 
140
- prompt = (
141
- "Write a complete TLA+ specification for a two-phase commit protocol "
142
- "with one coordinator and N participants."
143
- )
144
- result = pipe([{"role": "user", "content": prompt}], max_new_tokens=1024, return_full_text=False)
145
- print(result[0]["generated_text"])
146
- ```
147
 
148
- ### llama.cpp / GGUF
149
 
150
- ```bash
151
- # Download GGUF
152
- huggingface-cli download EricSpencer00/chattla-20b \
153
- gguf/chattla-20b-v15-Q8_0.gguf \
154
- --local-dir ./chattla
155
 
156
- # Run with llama.cpp
157
- ./llama-cli -m chattla/gguf/chattla-20b-v15-Q8_0.gguf \
158
- -n 1024 --temp 0.4 \
159
- -p "Write a TLA+ spec for mutual exclusion with N processes."
160
- ```
161
 
162
- ---
163
 
164
- ## Model Details
165
 
166
- | Property | Value |
167
- |----------|-------|
168
- | Base model | openai/gpt-oss-20b |
169
- | Parameters | 20.9B |
170
- | Architecture | GptOss (sliding + full attention) |
171
- | Fine-tuning method | Diamond SFT (LoRA) → Repair GRPO (LoRA) → merged |
172
- | Context length | 2048 (trained) / 131072 (base) |
173
- | GGUF quantisation | Q8_0 (~22 GB) |
174
- | Training date | April 2026 |
175
-
176
- ### System prompt
177
-
178
- The model is prompted with:
179
-
180
- ```
181
- You are ChatTLA, an expert at writing verified TLA+ formal specifications.
182
- When asked to write a TLA+ spec, follow these rules exactly:
183
- 1. Start the module with ---- MODULE <ModuleName> ----
184
- 2. End with ====
185
- 3. Include EXTENDS, VARIABLES, Init, Next, and Spec operators
186
- 4. After the TLA+ module, append a TLC configuration block:
187
- SPECIFICATION Spec
188
- INVARIANT TypeOK (if TypeOK is defined)
189
- 5. Output only valid TLA+ code. No markdown fences, no explanation outside the spec.
190
- ```
191
 
192
- ---
193
 
194
- ## Training
195
 
196
- ### Phase 1: Diamond SFT (v14)
197
 
198
- v14 was produced by the **Diamond curation pipeline**: candidate TLA+ specs are generated by an earlier checkpoint, then graded by a tlc_validator that checks SANY parsing, TLC state-space exploration, non-trivial invariants, and mutation-test sensitivity. Specs that survive grading are LLM-judged for chain-of-thought quality, leaving a curated training pool (209 raw → 73 curated for the v14 SFT round). The model is fine-tuned with LoRA on this pool and merged.
199
 
200
- ### Phase 2: Repair GRPO (v15)
201
 
202
- v15 applies **repair-based GRPO** (Group Relative Policy Optimization) on top of the v14 checkpoint. The key insight: instead of training on gold-standard specs alone, the model learns to *fix broken specs* using TLC error feedback as reward signal.
203
 
204
- **Pipeline:**
205
- 1. **Trajectory collection** — the v14 model generates specs for 398 problems with up to 6 repair iterations each, producing (broken, repaired) pairs scored by a multi-stage validator (SANY → TLC → Apalache → TLAPS).
206
- 2. **Dataset filtering** — pairs are filtered to keep the "learnable middle": `min_before_score=0.10` (drop unparseable) and `max_before_score=0.80` (drop already-good), yielding ~430 gradable pairs centered on score ≈ 0.45.
207
- 3. **GRPO training** — 300 steps, 4 generations per prompt, max 384 completion tokens. The reward is the improvement delta: `after_score - before_score`, normalized by group. Learning rate 3e-6, KL penalty β=0.02, temperature 0.5.
208
- 4. **LoRA merge** — best checkpoint (around step 140–160 where reward peaked) merged back into full weights.
209
 
210
- Reward peaked at steps 140–160 with `reward_std ≈ 0.25` (vs 0.0 in prior full-spec GRPO attempts that had zero variance). This was the first successful RL run on TLA+ spec generation.
211
 
212
- **R2 regression and R3 (in progress).** A second flywheel round (R2) continued GRPO from v15's merged weights on a freshly harvested dataset and regressed to 6/30 (20%). Post-mortem: the Phase 2 merge deduped pairs on `(nl[:80], round(before_score, 1))`, a score-bucket width of 0.1 that collapsed most of the learnable-middle band; combined with a raised `min_before_score = 0.10`, the usable training set fell from 433 → 179 pairs, shifted hard (mean before_score 0.26 → 0.42), and the model overtrained past its 150-step peak over 300 steps. Regressions concentrated in `mutual_exclusion` and `workflows_state_machines` (2/3 lost each). R3 pulls only the data and step-budget levers: dedup key widened to `(nl[:120], round(before_score, 2))`, score floor restored to 0.02, `--max-iters` raised 6 → 9 to grow the raw pool, and `--max-steps` cut to 175 with a checkpoint picker that selects the save closest to step 150. v15 remains the production checkpoint until R3 beats 9/30.
213
 
214
- DPO/KTO refinement was used in v11–v13 but was deprecated in the Diamond overhaul: 0/484 specs from those preference-trained checkpoints actually passed Diamond, indicating the model had learned TLA+ syntax without learning semantics.
215
 
216
- ### Training configuration
217
 
218
- | Setting | Value |
219
- |---------|-------|
220
- | SFT method | LoRA (lora_dropout=0) |
221
- | GRPO method | LoRA, 4 generations, 384 max completion |
222
- | GRPO learning rate | 3e-6 |
223
- | GRPO KL β | 0.02 |
224
- | GRPO steps | 300 (best checkpoint ~150) |
225
- | Max sequence length | 2048 |
226
- | TRL | 0.28.0 |
227
- | Transformers | 5.2.0 |
228
- | PyTorch | 2.10.0 |
229
- | Hardware | 2× Quadro RTX 8000 (48 GB each) |
230
 
231
- ---
232
 
233
- ## Files
234
-
235
- ```
236
- EricSpencer00/chattla-20b
237
- ├── config.json # Model architecture
238
- ├── tokenizer.json # Tokenizer
239
- ├── tokenizer_config.json
240
- ├── chat_template.jinja # Chat template
241
- ├── pytorch_model.bin # Full BF16 weights (39 GB)
242
- ├── generation_config.json
243
- └── gguf/
244
- ├── chattla-20b-v15-Q8_0.gguf # Quantised GGUF for Ollama / llama.cpp
245
- └── Modelfile # Ollama Modelfile
246
- ```
247
 
248
- ---
249
 
250
- ## Intended Use
251
 
252
- ChatTLA is designed for:
253
- - Rapid prototyping of TLA+ specifications from natural-language system descriptions
254
- - Educational exploration of formal methods
255
- - Assisting engineers who are learning TLA+
256
 
257
- **Not intended for:** safety-critical or production verification without human review. Always validate generated specs with SANY and TLC before relying on them.
258
 
259
- ---
260
 
261
- ## Citation
 
262
 
263
- ```bibtex
264
- @misc{chattla2026,
265
- title = {ChatTLA: Fine-tuned LLM for TLA+ Formal Specification Generation},
266
- author = {Spencer, Eric},
267
- year = {2026},
268
- url = {https://huggingface.co/EricSpencer00/chattla-20b},
269
- }
270
- ```
 
1
  ---
2
+ base_model: EricSpencer00/chattla-20b
3
+ library_name: peft
4
+ pipeline_tag: text-generation
 
 
 
5
  tags:
6
+ - base_model:adapter:EricSpencer00/chattla-20b
7
+ - lora
 
 
 
8
  - sft
9
+ - transformers
10
+ - trl
 
 
 
 
11
  ---
12
 
13
+ # Model Card for Model ID
14
 
15
+ <!-- Provide a quick summary of what the model is/does. -->
16
 
 
17
 
 
18
 
19
+ ## Model Details
20
 
21
+ ### Model Description
22
 
23
+ <!-- Provide a longer summary of what this model is. -->
24
 
 
25
 
 
 
 
26
 
27
+ - **Developed by:** [More Information Needed]
28
+ - **Funded by [optional]:** [More Information Needed]
29
+ - **Shared by [optional]:** [More Information Needed]
30
+ - **Model type:** [More Information Needed]
31
+ - **Language(s) (NLP):** [More Information Needed]
32
+ - **License:** [More Information Needed]
33
+ - **Finetuned from model [optional]:** [More Information Needed]
34
 
35
+ ### Model Sources [optional]
36
 
37
+ <!-- Provide the basic links for the model. -->
 
38
 
39
+ - **Repository:** [More Information Needed]
40
+ - **Paper [optional]:** [More Information Needed]
41
+ - **Demo [optional]:** [More Information Needed]
 
 
42
 
43
+ ## Uses
 
 
 
 
 
 
44
 
45
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
46
 
47
+ ### Direct Use
 
 
 
 
48
 
49
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
 
 
 
 
50
 
51
+ [More Information Needed]
52
 
53
+ ### Downstream Use [optional]
54
 
55
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
56
 
57
+ [More Information Needed]
58
 
59
+ ### Out-of-Scope Use
60
 
61
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
62
 
63
+ [More Information Needed]
64
 
65
+ ## Bias, Risks, and Limitations
66
 
67
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
68
 
69
+ [More Information Needed]
 
 
 
 
70
 
71
+ ### Recommendations
72
 
73
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
74
 
75
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
76
 
77
+ ## How to Get Started with the Model
78
 
79
+ Use the code below to get started with the model.
80
 
81
+ [More Information Needed]
82
 
83
+ ## Training Details
84
 
85
+ ### Training Data
86
 
87
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
88
 
89
+ [More Information Needed]
 
 
 
90
 
91
+ ### Training Procedure
92
 
93
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
94
+
95
+ #### Preprocessing [optional]
96
+
97
+ [More Information Needed]
98
+
99
+
100
+ #### Training Hyperparameters
101
+
102
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
103
+
104
+ #### Speeds, Sizes, Times [optional]
105
+
106
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
107
+
108
+ [More Information Needed]
109
+
110
+ ## Evaluation
111
+
112
+ <!-- This section describes the evaluation protocols and provides the results. -->
113
+
114
+ ### Testing Data, Factors & Metrics
115
+
116
+ #### Testing Data
117
+
118
+ <!-- This should link to a Dataset Card if possible. -->
119
+
120
+ [More Information Needed]
121
+
122
+ #### Factors
123
+
124
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
125
+
126
+ [More Information Needed]
127
+
128
+ #### Metrics
129
+
130
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
131
+
132
+ [More Information Needed]
133
+
134
+ ### Results
135
+
136
+ [More Information Needed]
137
+
138
+ #### Summary
139
+
140
+
141
+
142
+ ## Model Examination [optional]
143
+
144
+ <!-- Relevant interpretability work for the model goes here -->
145
+
146
+ [More Information Needed]
147
+
148
+ ## Environmental Impact
149
+
150
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
151
+
152
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
153
+
154
+ - **Hardware Type:** [More Information Needed]
155
+ - **Hours used:** [More Information Needed]
156
+ - **Cloud Provider:** [More Information Needed]
157
+ - **Compute Region:** [More Information Needed]
158
+ - **Carbon Emitted:** [More Information Needed]
159
+
160
+ ## Technical Specifications [optional]
161
+
162
+ ### Model Architecture and Objective
163
+
164
+ [More Information Needed]
165
+
166
+ ### Compute Infrastructure
167
+
168
+ [More Information Needed]
169
+
170
+ #### Hardware
171
+
172
+ [More Information Needed]
173
+
174
+ #### Software
175
+
176
+ [More Information Needed]
177
+
178
+ ## Citation [optional]
179
+
180
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
181
+
182
+ **BibTeX:**
183
+
184
+ [More Information Needed]
185
+
186
+ **APA:**
187
+
188
+ [More Information Needed]
189
+
190
+ ## Glossary [optional]
191
+
192
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
193
+
194
+ [More Information Needed]
195
+
196
+ ## More Information [optional]
197
+
198
+ [More Information Needed]
199
+
200
+ ## Model Card Authors [optional]
201
+
202
+ [More Information Needed]
203
+
204
+ ## Model Card Contact
205
 
206
+ [More Information Needed]
207
+ ### Framework versions
208
 
209
+ - PEFT 0.19.1
adapter_config.json ADDED
@@ -0,0 +1,110 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "EricSpencer00/chattla-20b",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": [
+     0,
+     1,
+     2,
+     3,
+     4,
+     5,
+     6,
+     7,
+     8,
+     9,
+     10,
+     11,
+     12,
+     13,
+     14,
+     15,
+     16,
+     17,
+     18,
+     19,
+     20,
+     21,
+     22,
+     23,
+     24,
+     25,
+     26,
+     27,
+     28,
+     29,
+     30,
+     31,
+     32,
+     33,
+     34,
+     35,
+     36,
+     37,
+     38,
+     39,
+     40,
+     41,
+     42,
+     43,
+     44,
+     45,
+     46,
+     47,
+     48,
+     49,
+     50,
+     51,
+     52,
+     53,
+     54,
+     55,
+     56,
+     57,
+     58,
+     59,
+     60,
+     61,
+     62,
+     63
+   ],
+   "loftq_config": {},
+   "lora_alpha": 16,
+   "lora_bias": false,
+   "lora_dropout": 0.0,
+   "lora_ga_config": null,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.19.1",
+   "qalora_group_size": 16,
+   "r": 8,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "v_proj",
+     "q_proj",
+     "k_proj",
+     "o_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_bdlora": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
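The adapter settings above determine the scale of the LoRA update and, together with the shapes of the targeted projections, the number of adapter weights. The following is a minimal Python sketch, not the training code. It assumes the standard LoRA scaling `lora_alpha / r` (the non-rsLoRA case, since `use_rslora` is false). The projection shapes and the layer count of 24 are assumptions taken from the published gpt-oss-20b configuration (hidden size 2880, 64 query heads and 8 key/value heads of dimension 64); they are not stated in this commit. The names `LoraSettings`, `lora_scaling`, and `lora_param_count` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Subset of adapter_config.json that matters for this estimate."""
    r: int
    lora_alpha: int
    use_rslora: bool


def lora_scaling(cfg: LoraSettings) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    if cfg.use_rslora:
        return cfg.lora_alpha / math.sqrt(cfg.r)
    return cfg.lora_alpha / cfg.r


def lora_param_count(r: int, shapes: dict, num_layers: int) -> int:
    """Each adapted (d_in, d_out) projection adds A (r x d_in) and B (d_out x r)."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Values copied from adapter_config.json in this commit.
settings = LoraSettings(r=8, lora_alpha=16, use_rslora=False)

# ASSUMPTION: (d_in, d_out) per target module, from the public gpt-oss-20b config.
ASSUMED_SHAPES = {
    "q_proj": (2880, 4096),
    "k_proj": (2880, 512),
    "v_proj": (2880, 512),
    "o_proj": (4096, 2880),
}
ASSUMED_NUM_LAYERS = 24  # ASSUMPTION: not stated in this commit

scaling = lora_scaling(settings)
params = lora_param_count(settings.r, ASSUMED_SHAPES, ASSUMED_NUM_LAYERS)
bf16_bytes = params * 2  # ASSUMPTION: weights stored as 2-byte bf16

print(scaling)     # 2.0
print(params)      # 3981312
print(bf16_bytes)  # 7962624
```

Under these assumptions the estimate of 7,962,624 bytes is close to the 7,988,016-byte `adapter_model.safetensors` recorded in this commit; the remainder is plausibly the safetensors header, though that is not verified here.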
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f6af38d6fd9761f7616c0e4ff11847fb3b879987b66e5b441c55431173886fc5
+ size 7988016
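The binary files added in this commit appear in the diff as Git LFS pointers: three `key value` lines giving the pointer spec version, the content hash, and the size in bytes. A minimal Python sketch of reading such a pointer follows. The function name `parse_lfs_pointer` is hypothetical, and the sample text is the `adapter_model.safetensors` pointer with the diff `+` markers removed.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid is written as "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f6af38d6fd9761f7616c0e4ff11847fb3b879987b66e5b441c55431173886fc5
size 7988016
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algo"], pointer["size"])
```

A downloaded file can be checked against the pointer by comparing its byte length and SHA-256 digest with `size` and `digest`.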
optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:50898c016cfce6af2ef195d037fee70e17813fdbe2e83b38e9d5f9889f697684
+ size 16089611
rng_state.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f4a9f217e852f439efa6bd32fde98d6867f11aa6ea13ddc021ba10af6a0b0934
+ size 14645
scheduler.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a9e67916fd5f503c55c9f912b2f0e02a7346b0c165f620bf4d5255975c891e22
+ size 1465
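`scheduler.pt` holds the learning-rate scheduler state for this checkpoint. The `learning_rate` values logged in `trainer_state.json` in this commit are consistent with a linear warmup followed by cosine decay. The sketch below reproduces the first logged values under assumptions that are inferred from the log and not stated anywhere in the commit: a peak learning rate of 1e-4, 5 warmup steps, and a logged value that reflects the schedule one step earlier than the logged `step`. Only the total of 396 steps comes from the source (`global_step`). The names `lr_at` and `logged_lr` are hypothetical.

```python
import math

PEAK_LR = 1e-4      # ASSUMPTION: inferred from the logged values
WARMUP_STEPS = 5    # ASSUMPTION: inferred from the logged values
TOTAL_STEPS = 396   # global_step in trainer_state.json


def lr_at(step_index: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    if step_index < WARMUP_STEPS:
        return PEAK_LR * step_index / WARMUP_STEPS
    progress = (step_index - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


def logged_lr(logged_step: int) -> float:
    """ASSUMPTION: the log at step N reports the schedule value for index N - 1."""
    return lr_at(logged_step - 1)


for step in (5, 10, 15, 20):
    print(step, logged_lr(step))
```

If the actual run used a different peak rate or warmup length, the printed values would diverge from the `learning_rate` entries in the log, so the comparison doubles as a check on these assumptions.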
tokenizer_config.json CHANGED
@@ -3,7 +3,8 @@
  "bos_token": "<|startoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|return|>",
- "is_local": true,
+ "is_local": false,
+ "local_files_only": false,
  "model_input_names": [
  "input_ids",
  "attention_mask"
trainer_state.json ADDED
@@ -0,0 +1,824 @@
1
+ {
2
+ "best_global_step": null,
3
+ "best_metric": null,
4
+ "best_model_checkpoint": null,
5
+ "epoch": 3.0,
6
+ "eval_steps": 500,
7
+ "global_step": 396,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "entropy": 1.8488737344741821,
14
+ "epoch": 0.03787878787878788,
15
+ "grad_norm": 2.390625,
16
+ "learning_rate": 8e-05,
17
+ "loss": 2.6958173751831054,
18
+ "mean_token_accuracy": 0.5450957447290421,
19
+ "num_tokens": 35961.0,
20
+ "step": 5
21
+ },
22
+ {
23
+ "entropy": 1.7560989141464234,
24
+ "epoch": 0.07575757575757576,
25
+ "grad_norm": 4.90625,
26
+ "learning_rate": 9.997417925081962e-05,
27
+ "loss": 2.22491512298584,
28
+ "mean_token_accuracy": 0.6006776213645935,
29
+ "num_tokens": 73396.0,
30
+ "step": 10
31
+ },
32
+ {
33
+ "entropy": 1.4386030733585358,
34
+ "epoch": 0.11363636363636363,
35
+ "grad_norm": 2.578125,
36
+ "learning_rate": 9.986932816177258e-05,
37
+ "loss": 1.6638334274291993,
38
+ "mean_token_accuracy": 0.672905695438385,
39
+ "num_tokens": 109782.0,
40
+ "step": 15
41
+ },
42
+ {
43
+ "entropy": 1.3276323318481444,
44
+ "epoch": 0.15151515151515152,
45
+ "grad_norm": 1.1796875,
46
+ "learning_rate": 9.96840020059622e-05,
47
+ "loss": 1.4536287307739257,
48
+ "mean_token_accuracy": 0.6756029307842255,
49
+ "num_tokens": 143815.0,
50
+ "step": 20
51
+ },
52
+ {
53
+ "entropy": 1.389198613166809,
54
+ "epoch": 0.1893939393939394,
55
+ "grad_norm": 0.88671875,
56
+ "learning_rate": 9.94184998476693e-05,
57
+ "loss": 1.4698452949523926,
58
+ "mean_token_accuracy": 0.6868252992630005,
59
+ "num_tokens": 183913.0,
60
+ "step": 25
61
+ },
62
+ {
63
+ "entropy": 1.2845717668533325,
64
+ "epoch": 0.22727272727272727,
65
+ "grad_norm": 0.98046875,
66
+ "learning_rate": 9.907325013268815e-05,
67
+ "loss": 1.267971897125244,
68
+ "mean_token_accuracy": 0.7166933000087738,
69
+ "num_tokens": 220563.0,
70
+ "step": 30
71
+ },
72
+ {
73
+ "entropy": 1.219918441772461,
74
+ "epoch": 0.26515151515151514,
75
+ "grad_norm": 1.0078125,
76
+ "learning_rate": 9.864880999693551e-05,
77
+ "loss": 1.2410313606262207,
78
+ "mean_token_accuracy": 0.7290079295635223,
79
+ "num_tokens": 257973.0,
80
+ "step": 35
81
+ },
82
+ {
83
+ "entropy": 0.9539740681648254,
84
+ "epoch": 0.30303030303030304,
85
+ "grad_norm": 1.453125,
86
+ "learning_rate": 9.814586436738998e-05,
87
+ "loss": 0.9286725997924805,
88
+ "mean_token_accuracy": 0.7912321865558625,
89
+ "num_tokens": 289221.0,
90
+ "step": 40
91
+ },
92
+ {
93
+ "entropy": 0.8536828875541687,
94
+ "epoch": 0.3409090909090909,
95
+ "grad_norm": 1.0546875,
96
+ "learning_rate": 9.756522485681247e-05,
97
+ "loss": 0.9232993125915527,
98
+ "mean_token_accuracy": 0.7938548445701599,
99
+ "num_tokens": 328397.0,
100
+ "step": 45
101
+ },
102
+ {
103
+ "entropy": 0.7762708187103271,
104
+ "epoch": 0.3787878787878788,
105
+ "grad_norm": 1.03125,
106
+ "learning_rate": 9.690782845403164e-05,
107
+ "loss": 0.7707252502441406,
108
+ "mean_token_accuracy": 0.8317423284053802,
109
+ "num_tokens": 363591.0,
110
+ "step": 50
111
+ },
112
+ {
113
+ "entropy": 0.7853556156158448,
114
+ "epoch": 0.4166666666666667,
115
+ "grad_norm": 1.3671875,
116
+ "learning_rate": 9.617473601190743e-05,
117
+ "loss": 0.7815133094787597,
118
+ "mean_token_accuracy": 0.8228329360485077,
119
+ "num_tokens": 396629.0,
120
+ "step": 55
121
+ },
122
+ {
123
+ "entropy": 0.8222164928913116,
124
+ "epoch": 0.45454545454545453,
125
+ "grad_norm": 0.77734375,
126
+ "learning_rate": 9.5367130535413e-05,
127
+ "loss": 0.826558780670166,
128
+ "mean_token_accuracy": 0.8203958511352539,
129
+ "num_tokens": 436793.0,
130
+ "step": 60
131
+ },
132
+ {
133
+ "entropy": 0.7317156493663788,
134
+ "epoch": 0.49242424242424243,
135
+ "grad_norm": 1.8515625,
136
+ "learning_rate": 9.448631527259749e-05,
137
+ "loss": 0.7245882034301758,
138
+ "mean_token_accuracy": 0.8379499971866607,
139
+ "num_tokens": 471517.0,
140
+ "step": 65
141
+ },
142
+ {
143
+ "entropy": 0.6935560047626496,
144
+ "epoch": 0.5303030303030303,
145
+ "grad_norm": 1.0,
146
+ "learning_rate": 9.353371161151032e-05,
147
+ "loss": 0.737287950515747,
148
+ "mean_token_accuracy": 0.8402327358722687,
149
+ "num_tokens": 509690.0,
150
+ "step": 70
151
+ },
152
+ {
153
+ "entropy": 0.7783978283405304,
154
+ "epoch": 0.5681818181818182,
155
+ "grad_norm": 0.72265625,
156
+ "learning_rate": 9.251085678648072e-05,
157
+ "loss": 0.8002090454101562,
158
+ "mean_token_accuracy": 0.8231807351112366,
159
+ "num_tokens": 548057.0,
160
+ "step": 75
161
+ },
162
+ {
163
+ "entropy": 0.6315798878669738,
164
+ "epoch": 0.6060606060606061,
165
+ "grad_norm": 0.76171875,
166
+ "learning_rate": 9.14194013974539e-05,
167
+ "loss": 0.612918758392334,
168
+ "mean_token_accuracy": 0.855874752998352,
169
+ "num_tokens": 586733.0,
170
+ "step": 80
171
+ },
172
+ {
173
+ "entropy": 0.6434627115726471,
174
+ "epoch": 0.6439393939393939,
175
+ "grad_norm": 0.71875,
176
+ "learning_rate": 9.026110674638721e-05,
177
+ "loss": 0.651760482788086,
178
+ "mean_token_accuracy": 0.8549414694309234,
179
+ "num_tokens": 620130.0,
180
+ "step": 85
181
+ },
182
+ {
183
+ "entropy": 0.7037500739097595,
184
+ "epoch": 0.6818181818181818,
185
+ "grad_norm": 0.81640625,
186
+ "learning_rate": 8.903784199500411e-05,
187
+ "loss": 0.7022861957550048,
188
+ "mean_token_accuracy": 0.8460219502449036,
189
+ "num_tokens": 653599.0,
190
+ "step": 90
191
+ },
192
+ {
193
+ "entropy": 0.6765074074268341,
194
+ "epoch": 0.7196969696969697,
195
+ "grad_norm": 0.875,
196
+ "learning_rate": 8.77515811484931e-05,
197
+ "loss": 0.6895392894744873,
198
+ "mean_token_accuracy": 0.841349971294403,
199
+ "num_tokens": 690828.0,
200
+ "step": 95
201
+ },
202
+ {
203
+ "entropy": 0.5865823537111282,
204
+ "epoch": 0.7575757575757576,
205
+ "grad_norm": 0.6953125,
206
+ "learning_rate": 8.640439987001853e-05,
207
+ "loss": 0.5448073863983154,
208
+ "mean_token_accuracy": 0.8746069610118866,
209
+ "num_tokens": 727387.0,
210
+ "step": 100
211
+ },
212
+ {
213
+ "entropy": 0.5738832741975785,
214
+ "epoch": 0.7954545454545454,
215
+ "grad_norm": 0.88671875,
216
+ "learning_rate": 8.49984721311843e-05,
217
+ "loss": 0.5894301891326904,
218
+ "mean_token_accuracy": 0.8635632038116455,
219
+ "num_tokens": 763245.0,
220
+ "step": 105
221
+ },
222
+ {
223
+ "entropy": 0.6615035921335221,
224
+ "epoch": 0.8333333333333334,
225
+ "grad_norm": 0.6953125,
226
+ "learning_rate": 8.353606670385515e-05,
227
+ "loss": 0.6997550010681153,
228
+ "mean_token_accuracy": 0.8429525136947632,
229
+ "num_tokens": 797170.0,
230
+ "step": 110
231
+ },
232
+ {
233
+ "entropy": 0.5302515834569931,
234
+ "epoch": 0.8712121212121212,
235
+ "grad_norm": 0.62109375,
236
+ "learning_rate": 8.201954349899713e-05,
237
+ "loss": 0.5376251697540283,
238
+ "mean_token_accuracy": 0.8770341098308563,
239
+ "num_tokens": 829505.0,
240
+ "step": 115
241
+ },
242
+ {
243
+ "entropy": 0.575181958079338,
244
+ "epoch": 0.9090909090909091,
245
+ "grad_norm": 0.70703125,
246
+ "learning_rate": 8.04513497584452e-05,
247
+ "loss": 0.5972713947296142,
248
+ "mean_token_accuracy": 0.8632317066192627,
249
+ "num_tokens": 865751.0,
250
+ "step": 120
251
+ },
252
+ {
253
+ "entropy": 0.6076352208852768,
254
+ "epoch": 0.946969696969697,
255
+ "grad_norm": 0.60546875,
256
+ "learning_rate": 7.883401610574336e-05,
257
+ "loss": 0.602271556854248,
258
+ "mean_token_accuracy": 0.8631208717823029,
259
+ "num_tokens": 901386.0,
260
+ "step": 125
261
+ },
262
+ {
263
+ "entropy": 0.6081637680530548,
264
+ "epoch": 0.9848484848484849,
265
+ "grad_norm": 0.828125,
266
+ "learning_rate": 7.717015246243011e-05,
267
+ "loss": 0.6359455585479736,
268
+ "mean_token_accuracy": 0.8557994425296783,
269
+ "num_tokens": 939393.0,
270
+ "step": 130
271
+ },
272
+ {
273
+ "entropy": 0.5041210368275643,
274
+ "epoch": 1.0227272727272727,
275
+ "grad_norm": 0.91015625,
276
+ "learning_rate": 7.546244383635928e-05,
277
+ "loss": 0.5249661445617676,
278
+ "mean_token_accuracy": 0.8816903054714202,
279
+ "num_tokens": 973074.0,
280
+ "step": 135
281
+ },
282
+ {
283
+ "entropy": 0.5709114253520966,
284
+ "epoch": 1.0606060606060606,
285
+ "grad_norm": 0.59375,
286
+ "learning_rate": 7.371364598885276e-05,
287
+ "loss": 0.5937416553497314,
288
+ "mean_token_accuracy": 0.8642938137054443,
289
+ "num_tokens": 1008969.0,
290
+ "step": 140
291
+ },
292
+ {
293
+ "entropy": 0.6661858707666397,
294
+ "epoch": 1.0984848484848484,
295
+ "grad_norm": 0.74609375,
296
+ "learning_rate": 7.192658098767686e-05,
297
+ "loss": 0.6871523857116699,
298
+ "mean_token_accuracy": 0.8403465390205384,
299
+ "num_tokens": 1048099.0,
300
+ "step": 145
301
+ },
302
+ {
303
+ "entropy": 0.6694399774074554,
304
+ "epoch": 1.1363636363636362,
305
+ "grad_norm": 1.234375,
306
+ "learning_rate": 7.010413265301888e-05,
307
+ "loss": 0.6669593811035156,
308
+ "mean_token_accuracy": 0.8518058300018311,
309
+ "num_tokens": 1088002.0,
310
+ "step": 150
311
+ },
312
+ {
313
+ "entropy": 0.5742821432650089,
314
+ "epoch": 1.1742424242424243,
315
+ "grad_norm": 0.671875,
316
+ "learning_rate": 6.824924190381256e-05,
317
+ "loss": 0.579827356338501,
318
+ "mean_token_accuracy": 0.8686598777770996,
319
+ "num_tokens": 1125008.0,
320
+ "step": 155
321
+ },
322
+ {
323
+ "entropy": 0.5425577774643898,
324
+ "epoch": 1.2121212121212122,
325
+ "grad_norm": 0.8125,
326
+ "learning_rate": 6.63649020119223e-05,
327
+ "loss": 0.5697774887084961,
328
+ "mean_token_accuracy": 0.8714725434780121,
329
+ "num_tokens": 1160396.0,
330
+ "step": 160
331
+ },
332
+ {
333
+ "entropy": 0.5542260497808457,
334
+ "epoch": 1.25,
335
+ "grad_norm": 0.88671875,
336
+ "learning_rate": 6.445415377184427e-05,
337
+ "loss": 0.5461042404174805,
338
+ "mean_token_accuracy": 0.8678983390331269,
339
+ "num_tokens": 1193022.0,
340
+ "step": 165
341
+ },
342
+ {
343
+ "entropy": 0.6764177441596985,
344
+ "epoch": 1.2878787878787878,
345
+ "grad_norm": 1.0546875,
346
+ "learning_rate": 6.252008059371968e-05,
347
+ "loss": 0.6604482173919678,
348
+ "mean_token_accuracy": 0.8478662848472596,
349
+ "num_tokens": 1230547.0,
350
+ "step": 170
351
+ },
352
+ {
353
+ "entropy": 0.5709842652082443,
354
+ "epoch": 1.3257575757575757,
355
+ "grad_norm": 0.921875,
356
+ "learning_rate": 6.056580352757813e-05,
357
+ "loss": 0.5742652416229248,
358
+ "mean_token_accuracy": 0.8663301408290863,
359
+ "num_tokens": 1269514.0,
360
+ "step": 175
361
+ },
362
+ {
363
+ "entropy": 0.5634667783975601,
364
+ "epoch": 1.3636363636363638,
365
+ "grad_norm": 0.8203125,
366
+ "learning_rate": 5.8594476226840835e-05,
367
+ "loss": 0.5817259788513184,
368
+ "mean_token_accuracy": 0.8571344554424286,
369
+ "num_tokens": 1305450.0,
370
+ "step": 180
371
+ },
372
+ {
373
+ "entropy": 0.5020789414644241,
374
+ "epoch": 1.4015151515151514,
375
+ "grad_norm": 0.890625,
376
+ "learning_rate": 5.660927985921122e-05,
377
+ "loss": 0.5025756359100342,
378
+ "mean_token_accuracy": 0.8809928238391876,
379
+ "num_tokens": 1342076.0,
380
+ "step": 185
381
+ },
382
+ {
383
+ "entropy": 0.5636127024888993,
384
+ "epoch": 1.4393939393939394,
385
+ "grad_norm": 0.74609375,
386
+ "learning_rate": 5.4613417973165106e-05,
387
+ "loss": 0.5667418956756591,
388
+ "mean_token_accuracy": 0.8666031301021576,
389
+ "num_tokens": 1383914.0,
390
+ "step": 190
391
+ },
392
+ {
393
+ "entropy": 0.5760275781154632,
394
+ "epoch": 1.4772727272727273,
395
+ "grad_norm": 0.96484375,
396
+ "learning_rate": 5.26101113283247e-05,
397
+ "loss": 0.594722318649292,
398
+ "mean_token_accuracy": 0.8600034713745117,
399
+ "num_tokens": 1416350.0,
400
+ "step": 195
401
+ },
402
+ {
403
+ "entropy": 0.5440281018614769,
404
+ "epoch": 1.5151515151515151,
405
+ "grad_norm": 0.890625,
406
+ "learning_rate": 5.06025926980586e-05,
407
+ "loss": 0.5633892059326172,
408
+ "mean_token_accuracy": 0.8706209361553192,
409
+ "num_tokens": 1450795.0,
410
+ "step": 200
411
+ },
412
+ {
413
+ "entropy": 0.5666765511035919,
414
+ "epoch": 1.553030303030303,
415
+ "grad_norm": 0.8203125,
416
+ "learning_rate": 4.859410165269499e-05,
417
+ "loss": 0.5622632503509521,
418
+ "mean_token_accuracy": 0.8659008502960205,
419
+ "num_tokens": 1486675.0,
420
+ "step": 205
421
+ },
422
+ {
423
+ "entropy": 0.5974125057458878,
424
+ "epoch": 1.5909090909090908,
425
+ "grad_norm": 0.8984375,
426
+ "learning_rate": 4.658787933176646e-05,
427
+ "loss": 0.6370823383331299,
428
+ "mean_token_accuracy": 0.8563625752925873,
429
+ "num_tokens": 1523640.0,
430
+ "step": 210
431
+ },
432
+ {
433
+ "entropy": 0.4951909102499485,
434
+ "epoch": 1.628787878787879,
435
+ "grad_norm": 2.5625,
436
+ "learning_rate": 4.458716321372259e-05,
437
+ "loss": 0.5116774082183838,
438
+ "mean_token_accuracy": 0.8819014012813569,
439
+ "num_tokens": 1555473.0,
440
+ "step": 215
441
+ },
442
+ {
443
+ "entropy": 0.4781244724988937,
444
+ "epoch": 1.6666666666666665,
445
+ "grad_norm": 0.796875,
446
+ "learning_rate": 4.259518189155048e-05,
447
+ "loss": 0.4745966911315918,
448
+ "mean_token_accuracy": 0.8877918183803558,
449
+ "num_tokens": 1588856.0,
450
+ "step": 220
451
+ },
452
+ {
453
+ "entropy": 0.5352629214525223,
454
+ "epoch": 1.7045454545454546,
455
+ "grad_norm": 0.80859375,
456
+ "learning_rate": 4.0615149862733907e-05,
457
+ "loss": 0.5249689102172852,
458
+ "mean_token_accuracy": 0.8741750419139862,
459
+ "num_tokens": 1623594.0,
460
+ "step": 225
461
+ },
462
+ {
463
+ "entropy": 0.691879364848137,
464
+ "epoch": 1.7424242424242424,
465
+ "grad_norm": 1.046875,
466
+ "learning_rate": 3.8650262341958627e-05,
467
+ "loss": 0.7096127510070801,
468
+ "mean_token_accuracy": 0.8415236473083496,
469
+ "num_tokens": 1663579.0,
470
+ "step": 230
471
+ },
472
+ {
473
+ "entropy": 0.5149153739213943,
474
+ "epoch": 1.7803030303030303,
475
+ "grad_norm": 0.67578125,
476
+ "learning_rate": 3.6703690104934804e-05,
477
+ "loss": 0.5001875400543213,
478
+ "mean_token_accuracy": 0.879623144865036,
479
+ "num_tokens": 1699916.0,
480
+ "step": 235
481
+ },
482
+ {
483
+ "entropy": 0.3942973747849464,
484
+ "epoch": 1.8181818181818183,
485
+ "grad_norm": 0.78515625,
486
+ "learning_rate": 3.477857437165694e-05,
487
+ "loss": 0.37698049545288087,
488
+ "mean_token_accuracy": 0.9060261309146881,
489
+ "num_tokens": 1729598.0,
490
+ "step": 240
491
+ },
492
+ {
493
+ "entropy": 0.5332215487957,
494
+ "epoch": 1.856060606060606,
495
+ "grad_norm": 0.7109375,
496
+ "learning_rate": 3.2878021737358474e-05,
497
+ "loss": 0.5564140796661377,
498
+ "mean_token_accuracy": 0.8692101180553437,
499
+ "num_tokens": 1765449.0,
500
+ "step": 245
501
+ },
502
+ {
503
+ "entropy": 0.5327667541801929,
504
+ "epoch": 1.893939393939394,
505
+ "grad_norm": 0.79296875,
506
+ "learning_rate": 3.100509915934104e-05,
507
+ "loss": 0.551155424118042,
508
+ "mean_token_accuracy": 0.8772011756896972,
509
+ "num_tokens": 1799864.0,
510
+ "step": 250
511
+ },
512
+ {
513
+ "entropy": 0.5360715672373771,
514
+ "epoch": 1.9318181818181817,
515
+ "grad_norm": 0.79296875,
516
+ "learning_rate": 2.91628290077681e-05,
517
+ "loss": 0.5367880821228027,
518
+ "mean_token_accuracy": 0.870383208990097,
519
+ "num_tokens": 1833622.0,
520
+ "step": 255
521
+ },
522
+ {
523
+ "entropy": 0.5413278043270111,
524
+ "epoch": 1.9696969696969697,
525
+ "grad_norm": 0.91796875,
526
+ "learning_rate": 2.735418418840977e-05,
527
+ "loss": 0.5825323104858399,
528
+ "mean_token_accuracy": 0.87061527967453,
529
+ "num_tokens": 1873310.0,
530
+ "step": 260
531
+ },
532
+ {
533
+ "entropy": 0.5841432213783264,
534
+ "epoch": 2.007575757575758,
535
+ "grad_norm": 0.8203125,
536
+ "learning_rate": 2.5582083345209217e-05,
537
+ "loss": 0.5896332740783692,
538
+ "mean_token_accuracy": 0.8582423806190491,
539
+ "num_tokens": 1911297.0,
540
+ "step": 265
541
+ },
542
+ {
543
+ "entropy": 0.4948613867163658,
544
+ "epoch": 2.0454545454545454,
545
+ "grad_norm": 0.69140625,
546
+ "learning_rate": 2.3849386150412378e-05,
547
+ "loss": 0.5073411464691162,
548
+ "mean_token_accuracy": 0.8816523849964142,
549
+ "num_tokens": 1946721.0,
550
+ "step": 270
551
+ },
552
+ {
553
+ "entropy": 0.5449469789862633,
554
+ "epoch": 2.0833333333333335,
555
+ "grad_norm": 0.98828125,
556
+ "learning_rate": 2.2158888689861433e-05,
557
+ "loss": 0.55238938331604,
558
+ "mean_token_accuracy": 0.870269101858139,
559
+ "num_tokens": 1983131.0,
560
+ "step": 275
561
+ },
562
+ {
563
+ "entropy": 0.5343923151493073,
564
+ "epoch": 2.121212121212121,
565
+ "grad_norm": 0.828125,
566
+ "learning_rate": 2.051331895089882e-05,
567
+ "loss": 0.5471882343292236,
568
+ "mean_token_accuracy": 0.8725048422813415,
569
+ "num_tokens": 2022170.0,
570
+ "step": 280
571
+ },
572
+ {
573
+ "entropy": 0.5085220277309418,
574
+ "epoch": 2.159090909090909,
575
+ "grad_norm": 0.859375,
576
+ "learning_rate": 1.8915332420163073e-05,
577
+ "loss": 0.5496640682220459,
578
+ "mean_token_accuracy": 0.8838891804218292,
579
+ "num_tokens": 2056605.0,
580
+ "step": 285
581
+ },
582
+ {
583
+ "entropy": 0.5516770869493485,
584
+ "epoch": 2.196969696969697,
585
+ "grad_norm": 0.76953125,
586
+ "learning_rate": 1.736750779838044e-05,
587
+ "loss": 0.556769609451294,
588
+ "mean_token_accuracy": 0.866687560081482,
589
+ "num_tokens": 2093824.0,
590
+ "step": 290
591
+ },
592
+ {
593
+ "entropy": 0.45218280255794524,
594
+ "epoch": 2.234848484848485,
595
+ "grad_norm": 0.8046875,
596
+ "learning_rate": 1.5872342839067306e-05,
597
+ "loss": 0.45604901313781737,
598
+ "mean_token_accuracy": 0.8922043800354004,
599
+ "num_tokens": 2126566.0,
600
+ "step": 295
601
+ },
602
+ {
603
+ "entropy": 0.3802251495420933,
604
+ "epoch": 2.2727272727272725,
605
+ "grad_norm": 0.8671875,
606
+ "learning_rate": 1.4432250317858675e-05,
607
+ "loss": 0.38388445377349856,
608
+ "mean_token_accuracy": 0.912166029214859,
609
+ "num_tokens": 2157003.0,
610
+ "step": 300
611
+ },
612
+ {
613
+ "entropy": 0.4966444715857506,
614
+ "epoch": 2.3106060606060606,
615
+ "grad_norm": 0.953125,
616
+ "learning_rate": 1.3049554138967051e-05,
617
+ "loss": 0.5617212295532227,
618
+ "mean_token_accuracy": 0.8805790543556213,
619
+ "num_tokens": 2196031.0,
620
+ "step": 305
621
+ },
622
+ {
623
+ "entropy": 0.4347878597676754,
624
+ "epoch": 2.3484848484848486,
625
+ "grad_norm": 0.78125,
626
+ "learning_rate": 1.172648558505477e-05,
627
+ "loss": 0.44159908294677735,
628
+ "mean_token_accuracy": 0.8970139145851135,
629
+ "num_tokens": 2230559.0,
630
+ "step": 310
631
+ },
632
+ {
633
+ "entropy": 0.5470431834459305,
634
+ "epoch": 2.3863636363636362,
635
+ "grad_norm": 0.89453125,
636
+ "learning_rate": 1.0465179716571466e-05,
637
+ "loss": 0.5724119663238525,
638
+ "mean_token_accuracy": 0.8691107213497162,
639
+ "num_tokens": 2268041.0,
640
+ "step": 315
641
+ },
642
+ {
643
+ "entropy": 0.5263485819101333,
644
+ "epoch": 2.4242424242424243,
645
+ "grad_norm": 0.96875,
646
+ "learning_rate": 9.267671926367166e-06,
647
+ "loss": 0.5039608955383301,
648
+ "mean_token_accuracy": 0.8749733746051789,
649
+ "num_tokens": 2302515.0,
650
+ "step": 320
651
+ },
652
+ {
653
+ "entropy": 0.6496276021003723,
654
+ "epoch": 2.462121212121212,
655
+ "grad_norm": 0.8359375,
656
+ "learning_rate": 8.135894655140758e-06,
657
+ "loss": 0.6695892810821533,
658
+ "mean_token_accuracy": 0.8468307316303253,
659
+ "num_tokens": 2343731.0,
660
+ "step": 325
661
+ },
662
+ {
663
+ "entropy": 0.49519053399562835,
664
+ "epoch": 2.5,
665
+ "grad_norm": 0.91015625,
666
+ "learning_rate": 7.071674273024354e-06,
667
+ "loss": 0.5048277378082275,
668
+ "mean_token_accuracy": 0.8870163023471832,
669
+ "num_tokens": 2380175.0,
670
+ "step": 330
671
+ },
672
+ {
673
+ "entropy": 0.48127582371234895,
674
+ "epoch": 2.537878787878788,
675
+ "grad_norm": 1.0078125,
676
+ "learning_rate": 6.076728132335669e-06,
677
+ "loss": 0.492659330368042,
678
+ "mean_token_accuracy": 0.8882244884967804,
679
+ "num_tokens": 2412514.0,
680
+ "step": 335
681
+ },
682
+ {
683
+ "entropy": 0.46426295340061186,
684
+ "epoch": 2.5757575757575757,
685
+ "grad_norm": 0.74609375,
686
+ "learning_rate": 5.152661796254504e-06,
687
+ "loss": 0.46677451133728026,
688
+ "mean_token_accuracy": 0.8887528300285339,
689
+ "num_tokens": 2444160.0,
690
+ "step": 340
691
+ },
692
+ {
693
+ "entropy": 0.5339648842811584,
694
+ "epoch": 2.6136363636363638,
695
+ "grad_norm": 1.1796875,
696
+ "learning_rate": 4.300966447895438e-06,
697
+ "loss": 0.5490293502807617,
698
+ "mean_token_accuracy": 0.8735897600650787,
699
+ "num_tokens": 2481155.0,
700
+ "step": 345
701
+ },
702
+ {
703
+ "entropy": 0.5200017690658569,
704
+ "epoch": 2.6515151515151514,
705
+ "grad_norm": 1.0,
706
+ "learning_rate": 3.5230164839577416e-06,
707
+ "loss": 0.5357878684997559,
708
+ "mean_token_accuracy": 0.8745341539382935,
709
+ "num_tokens": 2516557.0,
710
+ "step": 350
711
+ },
712
+ {
713
+ "entropy": 0.40029837638139726,
714
+ "epoch": 2.6893939393939394,
715
+ "grad_norm": 0.81640625,
716
+ "learning_rate": 2.820067296835799e-06,
717
+ "loss": 0.40105533599853516,
718
+ "mean_token_accuracy": 0.9027827084064484,
719
+ "num_tokens": 2546964.0,
720
+ "step": 355
721
+ },
722
+ {
723
+ "entropy": 0.5041794210672379,
724
+ "epoch": 2.7272727272727275,
725
+ "grad_norm": 0.890625,
726
+ "learning_rate": 2.1932532487688785e-06,
727
+ "loss": 0.5031059265136719,
728
+ "mean_token_accuracy": 0.8806142807006836,
729
+ "num_tokens": 2582064.0,
730
+ "step": 360
731
+ },
732
+ {
733
+ "entropy": 0.5365766167640686,
734
+ "epoch": 2.765151515151515,
735
+ "grad_norm": 1.125,
736
+ "learning_rate": 1.6435858412996275e-06,
737
+ "loss": 0.5377714157104492,
738
+ "mean_token_accuracy": 0.8711750507354736,
739
+ "num_tokens": 2620085.0,
740
+ "step": 365
741
+ },
742
+ {
743
+ "entropy": 0.5815416038036346,
744
+ "epoch": 2.8030303030303028,
745
+ "grad_norm": 0.80078125,
746
+ "learning_rate": 1.1719520829951203e-06,
747
+ "loss": 0.5831816673278809,
748
+ "mean_token_accuracy": 0.8631684601306915,
749
+ "num_tokens": 2655969.0,
750
+ "step": 370
751
+ },
752
+ {
753
+ "entropy": 0.6044020265340805,
754
+ "epoch": 2.840909090909091,
755
+ "grad_norm": 0.86328125,
756
+ "learning_rate": 7.791130580645622e-07,
757
+ "loss": 0.6402379035949707,
758
+ "mean_token_accuracy": 0.8542465031147003,
759
+ "num_tokens": 2698335.0,
760
+ "step": 375
761
+ },
762
+ {
763
+ "entropy": 0.5233386039733887,
764
+ "epoch": 2.878787878787879,
765
+ "grad_norm": 0.8828125,
766
+ "learning_rate": 4.6570269818346224e-07,
767
+ "loss": 0.531651782989502,
768
+ "mean_token_accuracy": 0.8721617937088013,
769
+ "num_tokens": 2733837.0,
770
+ "step": 380
771
+ },
772
+ {
773
+ "entropy": 0.6269441485404968,
774
+ "epoch": 2.9166666666666665,
775
+ "grad_norm": 0.8046875,
776
+ "learning_rate": 2.3222675950627104e-07,
777
+ "loss": 0.6280055522918702,
778
+ "mean_token_accuracy": 0.8480874240398407,
779
+ "num_tokens": 2773117.0,
780
+ "step": 385
781
+ },
782
+ {
783
+ "entropy": 0.5497441381216049,
784
+ "epoch": 2.9545454545454546,
785
+ "grad_norm": 0.78125,
786
+ "learning_rate": 7.906200651819907e-08,
787
+ "loss": 0.5609755039215087,
788
+ "mean_token_accuracy": 0.8707708656787873,
789
+ "num_tokens": 2808716.0,
790
+ "step": 390
791
+ },
792
+ {
793
+ "entropy": 0.5683890715241432,
794
+ "epoch": 2.992424242424242,
795
+ "grad_norm": 0.74609375,
796
+ "learning_rate": 6.455604043331676e-09,
797
+ "loss": 0.5778711795806885,
798
+ "mean_token_accuracy": 0.8663623631000519,
799
+ "num_tokens": 2849722.0,
800
+ "step": 395
801
+ }
802
+ ],
803
+ "logging_steps": 5,
804
+ "max_steps": 396,
805
+ "num_input_tokens_seen": 0,
806
+ "num_train_epochs": 3,
807
+ "save_steps": 20,
808
+ "stateful_callbacks": {
809
+ "TrainerControl": {
810
+ "args": {
811
+ "should_epoch_stop": false,
812
+ "should_evaluate": false,
813
+ "should_log": false,
814
+ "should_save": true,
815
+ "should_training_stop": true
816
+ },
817
+ "attributes": {}
818
+ }
819
+ },
820
+ "total_flos": 5.478377839274039e+17,
821
+ "train_batch_size": 4,
822
+ "trial_name": null,
823
+ "trial_params": null
824
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1320b30f0d3660d9a16861093b02f9cff5211eff22ca115da6ea421ac390673f
3
+ size 5777