FineData
AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).
Papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Organization Card
🍷 FineData
This is the home of the 🍷 FineData team, a branch of the 🤗 Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.
- 🍷 FineWeb: a 15T-token English dataset for LLM pre-training. See the blogpost and paper.
- FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
- 🥂 FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
- FinePDFs: 3T tokens of text data extracted from PDFs sourced from the Web.
- FineWiki: an updated, better-extracted version of Wikipedia in 300+ languages.
- FinePDFs-Edu: 350B+ highly educational tokens filtered from FinePDFs.
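The datasets above are published under the `HuggingFaceFW` organization on the Hub. As a minimal sketch, assuming the `datasets` library is installed and that the target repository has a `train` split, the first few records can be previewed in streaming mode rather than downloading a multi-terabyte corpus. The `preview` function and its `loader` parameter are hypothetical names introduced here for illustration; they are not part of any FineData tooling.

```python
from itertools import islice


def preview(repo_id, config=None, n=3, loader=None):
    """Return the first `n` records of a Hub dataset without downloading it all.

    `loader` is a hypothetical hook for offline testing; when omitted, the
    real `datasets.load_dataset` is imported and used.
    """
    if loader is None:
        from datasets import load_dataset  # requires `pip install datasets`
        loader = load_dataset
    # streaming=True yields records lazily instead of materializing the dataset.
    # The "train" split is an assumption; check the dataset card for split names.
    stream = loader(repo_id, name=config, split="train", streaming=True)
    return list(islice(stream, n))
```

For example, `preview("HuggingFaceFW/fineweb-2", config="fra_Latn")` would stream three records, assuming a `fra_Latn` configuration exists for that repository. Configuration and split names vary per dataset, so the dataset card remains the authoritative source.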
- HuggingFaceFW/finepdfs • 476M rows • 37.5k downloads • 683 likes
- HuggingFaceFW/finepdfs-edu • 49.5M rows • 17k downloads • 52 likes
- HuggingFaceFW/ocr-annotations • 1.62k rows • 167 downloads • 15 likes
- HuggingFaceFW/finepdfs_lang_classification • 3.08M rows • 12.8k downloads • 4 likes
spaces
6
- FineWiki Viewer • Running • 8 likes
  Viewer to explore the finewiki dataset
- 🍷 FineWeb: decanting the web for the finest text data at scale • Running • Featured • 1.21k likes
  Generate high-quality text data for LLMs using FineWeb
- Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks • Running • 84 likes
  Evaluate multilingual models using FineTasks
- Tasks Explorer • Build error
  Explore and analyze experiment results
- Datasets Metrics Explorer • Runtime error • 4 likes
  Launch an interactive demo interface
models
105
- HuggingFaceFW/finepdfs_edu_classifier_eng_Latn • 0.4B params • 46 downloads • 2 likes
- HuggingFaceFW/finepdfs_dclm_classifier_eng_Latn • 0.4B params • 41 downloads
- HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn • 0.4B params • 31 downloads
- HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn • 0.4B params • 17 downloads
- HuggingFaceFW/finepdfs_edu_classifier_guj_Gujr • 0.3B params • 25 downloads
- HuggingFaceFW/finepdfs_edu_classifier_nno_Latn • 0.3B params • 19 downloads
- HuggingFaceFW/finepdfs_edu_classifier_kaz_Cyrl • 0.3B params • 17 downloads
- HuggingFaceFW/finepdfs_edu_classifier_tam_Taml • 0.3B params • 18 downloads
- HuggingFaceFW/finepdfs_edu_classifier_azj_Latn • 0.3B params • 14 downloads
- HuggingFaceFW/finepdfs_edu_classifier_afr_Latn • 0.3B params • 21 downloads
datasets
15
- HuggingFaceFW/finepdfs • 476M rows • 37.5k downloads • 683 likes
- HuggingFaceFW/finepdfs-edu • 49.5M rows • 17k downloads • 52 likes
- HuggingFaceFW/fineweb-2 • 4.48B rows • 87.9k downloads • 704 likes
- HuggingFaceFW/finewiki • 61.6M rows • 14.7k downloads • 265 likes
- HuggingFaceFW/clean-wikipedia • 61.2M rows • 1.01k downloads • 23 likes
- HuggingFaceFW/finepdfs_lang_classification_tmp • 22 downloads
- HuggingFaceFW/ocr-annotations • 1.62k rows • 167 downloads • 15 likes
- HuggingFaceFW/finepdfs_lang_classification • 3.08M rows • 12.8k downloads • 4 likes
- HuggingFaceFW/finepdfs_eng_Latn_labeled • 1.3M rows • 736 downloads • 2 likes
- HuggingFaceFW/finepdfs_fw_edu_labeled • 18.8M rows • 537 downloads • 3 likes