---
language: pt
tags:
- audio-classification
- wav2vec2
- pytorch
- gender-classification
- speech
- pt-br
metrics:
- accuracy
- f1
base_model: facebook/wav2vec2-xls-r-300m
license: mit
---
*Leia em [Português](README_pt-br.md) 🇧🇷 | Read this in [English](README.md) 🇺🇸*
# Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech
**A multi-phase fine-tuning approach for robust binary gender classification from raw audio, leveraging cross-domain adaptation for improved generalization.**
---
## 1. Abstract
This work presents a fine-tuned **Wav2Vec2-XLS-R-300M** model for binary gender classification (Male / Female) from Brazilian Portuguese speech. The model was trained through a three-phase curriculum (linear probing, full fine-tuning, and cross-domain adaptation) and evaluated on two fully held-out benchmarks: **93.32% accuracy** on FalaBrasil CETUC2 (100,998 samples) and **90.45%** on emoUERJ (377 samples). Audio inputs are resampled to 16 kHz and processed as raw waveforms.
| Label | Class |
|:-----:|:-------|
| `0` | Male |
| `1` | Female |
---
## 2. Training
The model was trained in three incremental phases, each building on the previous checkpoint:
| Phase | Strategy | Encoder | LR | Batch | Epochs | Dataset | Val Acc |
|:-----:|:---------------------|:---------|:------:|:-----:|:------:|:---------------------------------|:-------:|
| 1 | Linear Probing | Frozen | 2e-5 | 8 | 5 | small subset | 86.63% |
| 2 | Full Fine-Tuning | Unfrozen | 2e-5 | 8 | 4 (ES) | 111,212 PT-BR samples | 99.56% |
| 3 | Domain Adaptation | Unfrozen | 5e-6 | 4 | 2 (ES) | Common Voice PT-BR, 4,372 balanced | 98.51% |
> **Domain Shift.** Phase 2 achieved 99.56% on in-domain data but only 63.65% on Common Voice, revealing acoustic overfitting. Phase 3 resolved this through conservative adaptation with a reduced learning rate to prevent catastrophic forgetting.
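
The schedule in the table can be written down as a small configuration object. The sketch below is only an illustration of that schedule, not the author's training script: the `Phase` dataclass, its field names, and the `lr_reduction` helper are hypothetical, while the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One stage of the curriculum (hypothetical container, values from the table)."""
    name: str
    encoder_frozen: bool   # True = only the classification head is trained
    learning_rate: float
    batch_size: int
    epochs: int            # as listed in the "Epochs" column


PHASES = [
    Phase("linear_probing", True, 2e-5, 8, 5),
    Phase("full_fine_tuning", False, 2e-5, 8, 4),
    Phase("domain_adaptation", False, 5e-6, 4, 2),
]


def lr_reduction(earlier: Phase, later: Phase) -> float:
    """Factor by which the learning rate shrinks between two phases."""
    return earlier.learning_rate / later.learning_rate


for phase in PHASES:
    mode = "head only" if phase.encoder_frozen else "encoder + head"
    print(f"{phase.name}: {mode}, lr={phase.learning_rate:g}, "
          f"batch={phase.batch_size}, epochs={phase.epochs}")
```

Comparing Phase 2 and Phase 3 with `lr_reduction(PHASES[1], PHASES[2])` makes the "conservative adaptation" concrete: the learning rate drops by a factor of four and the batch size is halved.
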
---
## 3. Evaluation
Both benchmarks below are **fully out-of-domain**: no samples from either corpus were used during training or validation.
### 3.1 FalaBrasil CETUC2
Large-scale evaluation on **100,998 samples** from the [FalaBrasil CETUC2](https://huggingface.co/datasets/falabrasil/cetuc2) read-speech corpus (50,000 male / 50,998 female).
| Metric | Value |
|:-------------|:----------:|
| **Accuracy** | **93.32%** |
| **F1-Macro** | **93.31%** |
| Mean Confidence | 95.31% |
| Class | Precision | Recall | F1-Score | Support |
|:---------|:---------:|:------:|:--------:|:-------:|
| Male | 89.51% | 97.99% | 93.56% | 50,000 |
| Female | 97.83% | 88.74% | 93.06% | 50,998 |
```
Confusion Matrix:
              | Pred Male | Pred Female
True Male     |    48,996 |       1,004
True Female   |     5,744 |      45,254
```
> **Note.** The model shows higher recall for Male (97.99%) but higher precision for Female (97.83%), indicating a slight bias toward predicting Male. All top-10 highest-confidence errors were Female samples misclassified as Male.
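
The per-class figures above follow directly from the confusion matrix. As a sanity check, the sketch below recomputes them with plain Python; the variable names and the `prf` helper are illustrative, and only the four counts come from the matrix.

```python
# Counts copied from the CETUC2 confusion matrix above.
tp_male, fn_male = 48_996, 1_004      # true Male: predicted Male / Female
fn_female, tp_female = 5_744, 45_254  # true Female: predicted Male / Female


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A false positive for Male is a Female clip predicted as Male, and vice versa.
male = prf(tp_male, fp=fn_female, fn=fn_male)
female = prf(tp_female, fp=fn_male, fn=fn_female)

total = tp_male + fn_male + fn_female + tp_female
accuracy = (tp_male + tp_female) / total
f1_macro = (male[2] + female[2]) / 2

for name, (p, r, f1) in (("Male", male), ("Female", female)):
    print(f"{name}: precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
print(f"accuracy={accuracy:.2%} f1_macro={f1_macro:.2%}")
```

The asymmetry in the note is visible in the raw counts: 5,744 Female clips were predicted as Male, against only 1,004 Male clips predicted as Female.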
### 3.2 emoUERJ
Evaluated on **377 samples** from the emoUERJ emotion-in-speech dataset, recorded under entirely different acoustic conditions.
| Class | Precision | Recall | F1-Score |
|:-----------|:---------:|:------:|:--------:|
| Male | 0.94 | 0.85 | 0.89 |
| Female | 0.87 | 0.95 | 0.91 |
| **Macro** | **0.91** | **0.90** | **0.90** |
**Accuracy: 90.45%**
---
## 4. Usage
```bash
pip install transformers librosa torch
```
```python
import librosa
import torch
import torch.nn.functional as F
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_id = "Soltsuky/wav2vec2-gender-classification-pt-br"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

# Load the file as mono and resample to the 16 kHz rate the model expects.
audio, _ = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)[0]

# Index 0 is Male and index 1 is Female (see the label table in the Abstract).
label = ["MALE", "FEMALE"][torch.argmax(probs).item()]
print(f"{label}: {probs.max().item() * 100:.2f}%")
```
---
## 5. Limitations
- Trained exclusively on **Brazilian Portuguese**; other varieties, such as European Portuguese (PT-PT), were not evaluated.
- Audio shorter than **1 second** may produce lower confidence.
- The model exhibits a **Male prediction bias** (higher Male recall, lower Female recall), likely due to distributional differences between training and evaluation data.
- Voice-based gender classification carries ethical implications. This model is for **research purposes only** and should not be used to identify individuals without consent.
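
One way to act on the first two technical caveats is to abstain when a clip is too short or the model is not confident. The sketch below is a minimal illustration under stated assumptions, not part of the released model: `decide` and the `MIN_CONFIDENCE` value of 0.80 are hypothetical and should be tuned on your own data, while the 1-second floor and the 16 kHz rate come from this card.

```python
SAMPLE_RATE = 16_000    # the model expects 16 kHz audio
MIN_SECONDS = 1.0       # from the limitation above
MIN_CONFIDENCE = 0.80   # hypothetical threshold, tune on your own data
LABELS = ("MALE", "FEMALE")


def decide(probs, num_samples, sample_rate=SAMPLE_RATE):
    """Return a label, or "UNCERTAIN" when the prediction should not be trusted.

    probs: two class probabilities ordered as (Male, Female).
    num_samples: length of the waveform that produced them.
    """
    duration = num_samples / sample_rate
    if duration < MIN_SECONDS:
        return "UNCERTAIN"  # clip too short for a reliable prediction
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < MIN_CONFIDENCE:
        return "UNCERTAIN"  # model is not confident enough
    return LABELS[best]
```

With the variables from the Usage section, this could be called as `decide(probs.tolist(), len(audio))`.
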
---
## 6. Citation
```bibtex
@misc{soltsuky2026wav2vec2gender,
title = {Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech},
author = {Soltsuky},
year = {2026},
url = {https://huggingface.co/Soltsuky/wav2vec2-gender-classification-pt-br}
}
```
---
## 7. Acknowledgments
- **Base Model:** [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m), Meta AI (MIT)
- **Evaluation:** [FalaBrasil CETUC2](https://huggingface.co/datasets/falabrasil/cetuc2), [emoUERJ](https://zenodo.org/records/5427549#.ZDI6jnbMLrf), [Mozilla Common Voice](https://commonvoice.mozilla.org/) (CC-0)
- **Fine-tuning:** [Soltsuky](https://huggingface.co/Soltsuky)
---