Update README.md
README.md
CHANGED
@@ -220,7 +220,7 @@ We follow the jinja chat template provided below. This template conditionally ad
 
 ## Training, Testing, and Evaluation Datasets
 
-The post-training corpus for Nemotron-H-8B-Reasoning-128K consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English). Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracies. For several of the domains listed above we used synthetic data, specifically reasoning traces, from R1.
+The post-training corpus for Nemotron-H-8B-Reasoning-128K consists of English and multilingual text (German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese and English). Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including code, legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracies. For several of the domains listed above we used synthetic data, specifically reasoning traces, from DeepSeek R1.
 
 **Data Collection for Training & Testing Datasets:** Hybrid: Automated, Human, Synthetic
 