---
language: pt
tags:
- audio-classification
- wav2vec2
- pytorch
- gender-classification
- speech
- pt-br
metrics:
- accuracy
- f1
base_model: facebook/wav2vec2-xls-r-300m
license: mit
---
*Leia em [Português](README_pt-br.md) 🇧🇷 | Read this in [English](README.md) 🇺🇸*

# Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech

**A multi-phase fine-tuning approach for robust binary gender classification from raw audio, leveraging cross-domain adaptation for improved generalization.**

[![PT-BR](https://img.shields.io/badge/lang-pt--BR-009C3B?style=flat-square)](.)
[![PyTorch](https://img.shields.io/badge/PyTorch-EE4C2C?style=flat-square&logo=pytorch&logoColor=white)](https://pytorch.org)
[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Transformers-FFD21E?style=flat-square)](https://huggingface.co)
[![CETUC2 F1](https://img.shields.io/badge/CETUC2%20F1--Macro-93.31%25-blue?style=flat-square)](.)
[![emoUERJ Acc](https://img.shields.io/badge/emoUERJ%20Acc-90.45%25-blue?style=flat-square)](.)
[![License](https://img.shields.io/badge/license-MIT-lightgrey?style=flat-square)](.)


---

## 1. Abstract

This work presents a fine-tuned **Wav2Vec2-XLS-R-300M** model for binary gender classification (Male / Female) from Brazilian Portuguese speech. The model was trained through a three-phase curriculum (linear probing, full fine-tuning, and cross-domain adaptation) and reaches **93.32% accuracy** on FalaBrasil CETUC2 (100,998 samples) and **90.45%** on emoUERJ, two fully held-out benchmarks. Audio inputs are resampled to 16 kHz and processed as raw waveforms.

| Label | Class  |
|:-----:|:-------|
| `0`   | Male   |
| `1`   | Female |

---

## 2. Training

The model was trained in three incremental phases, each starting from the previous phase's checkpoint:

| Phase | Strategy          | Encoder  |  LR  | Batch | Epochs | Dataset                                     | Val Acc |
|:-----:|:------------------|:---------|:----:|:-----:|:------:|:--------------------------------------------|:-------:|
|   1   | Linear Probing    | Frozen   | 2e-5 |   8   |   5    | small subset                                | 86.63%  |
|   2   | Full Fine-Tuning  | Unfrozen | 2e-5 |   8   | 4 (ES) | 111,212 PT-BR samples                       | 99.56%  |
|   3   | Domain Adaptation | Unfrozen | 5e-6 |   4   | 2 (ES) | Common Voice PT-BR, 4,372 balanced samples  | 98.51%  |

ES: early stopping.

> **Domain Shift.** Phase 2 achieved 99.56% on in-domain data but only 63.65% on Common Voice, revealing overfitting to the acoustic conditions of the training data. Phase 3 addressed this through conservative adaptation on Common Voice, using a reduced learning rate (5e-6) to limit catastrophic forgetting.

---

## 3. Evaluation

Both benchmarks below are **fully out-of-domain**: no samples from either dataset were used during training or validation.

### 3.1 FalaBrasil CETUC2

Large-scale evaluation on **100,998 samples** from the [FalaBrasil CETUC2](https://huggingface.co/datasets/falabrasil/cetuc2) read-speech corpus (50,000 male / 50,998 female).

| Metric          |   Value    |
|:----------------|:----------:|
| **Accuracy**    | **93.32%** |
| **F1-Macro**    | **93.31%** |
| Mean Confidence |   95.31%   |

| Class  | Precision | Recall | F1-Score | Support |
|:-------|:---------:|:------:|:--------:|:-------:|
| Male   |  89.51%   | 97.99% |  93.56%  | 50,000  |
| Female |  97.83%   | 88.74% |  93.06%  | 50,998  |

```
Confusion Matrix:
              Pred Male | Pred Female
True Male   |   48,996  |    1,004
True Female |    5,744  |   45,254
```

> **Note.** The model shows higher recall for Male (97.99%) but higher precision for Female (97.83%), indicating a bias toward predicting Male: 5,744 Female samples were classified as Male, against 1,004 Male samples classified as Female. All top-10 highest-confidence errors were Female samples misclassified as Male.

### 3.2 emoUERJ

Evaluated on **377 samples** from the emoUERJ emotion-in-speech dataset, recorded under entirely different acoustic conditions.

| Class     | Precision |  Recall  | F1-Score |
|:----------|:---------:|:--------:|:--------:|
| Male      |   0.94    |   0.85   |   0.89   |
| Female    |   0.87    |   0.95   |   0.91   |
| **Macro** | **0.91**  | **0.90** | **0.90** |

**Accuracy: 90.45%**

---

## 4. Usage

```bash
pip install transformers librosa torch
```

```python
import librosa
import torch
import torch.nn.functional as F
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_id = "Soltsuky/wav2vec2-gender-classification-pt-br"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
model.eval()

# Load the audio and resample it to the 16 kHz rate the model expects.
audio, _ = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)[0]

# Index 0 is Male and index 1 is Female, matching the label table in the Abstract.
label = ["MALE", "FEMALE"][torch.argmax(probs).item()]
print(f"{label}: {probs.max().item() * 100:.2f}%")
```

---

## 5. Limitations

- Trained exclusively on **Brazilian Portuguese**; other variants (e.g., European Portuguese, PT-PT) were not evaluated.
- Audio shorter than **1 second** may produce lower-confidence predictions.
- The model exhibits a **Male prediction bias** (higher Male recall, lower Female recall), likely due to distributional differences between training and evaluation data.
- Voice-based gender classification carries ethical implications. This model is for **research purposes only** and should not be used to identify individuals without consent.

---

## 6. Citation

```bibtex
@misc{soltsuky2026wav2vec2gender,
  title  = {Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech},
  author = {Soltsuky},
  year   = {2026},
  url    = {https://huggingface.co/Soltsuky/wav2vec2-gender-classification-pt-br}
}
```

---

## 7. Acknowledgments

- **Base Model:** [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m), Meta AI (MIT)
- **Evaluation:** [FalaBrasil CETUC2](https://huggingface.co/datasets/falabrasil/cetuc2), [emoUERJ](https://zenodo.org/records/5427549#.ZDI6jnbMLrf), [Mozilla Common Voice](https://commonvoice.mozilla.org/) (CC-0)
- **Fine-tuning:** [Soltsuky](https://huggingface.co/Soltsuky)

---
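
## 8. Appendix: Recomputing the CETUC2 Metrics

The per-class figures in Section 3.1 can be cross-checked against the confusion matrix reported there. The sketch below is not the evaluation script used for this model; it is a minimal, dependency-free illustration that takes the four published counts and derives accuracy, precision, recall, F1, and macro F1 from them. The names `confusion` and `per_class_metrics` are hypothetical and chosen only for this example.

```python
# Confusion-matrix counts copied from Section 3.1.
# Outer key = true class, inner key = predicted class.
confusion = {
    "male": {"male": 48_996, "female": 1_004},
    "female": {"male": 5_744, "female": 45_254},
}


def per_class_metrics(confusion, cls):
    """Return (precision, recall, f1) for one class of a nested confusion dict."""
    true_positives = confusion[cls][cls]
    predicted_as_cls = sum(row[cls] for row in confusion.values())  # column total
    actually_cls = sum(confusion[cls].values())  # row total
    precision = true_positives / predicted_as_cls
    recall = true_positives / actually_cls
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Accuracy is the diagonal of the matrix divided by the total sample count.
total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[cls][cls] for cls in confusion)
accuracy = correct / total

# Macro F1 is the unweighted mean of the per-class F1 scores.
metrics = {cls: per_class_metrics(confusion, cls) for cls in confusion}
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

print(f"accuracy = {accuracy:.2%}")
for cls, (precision, recall, f1) in metrics.items():
    print(f"{cls:>6}: precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
print(f"macro F1 = {macro_f1:.2%}")
```

Rounded to two decimals, the printed values should agree with the tables in Section 3.1. The same function can be pointed at a different confusion matrix to examine the Male/Female recall asymmetry described in the Limitations on other data.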