---
language: pt
tags:
  - audio-classification
  - wav2vec2
  - pytorch
  - gender-classification
  - speech
  - pt-br
metrics:
  - accuracy
  - f1
base_model: facebook/wav2vec2-xls-r-300m
license: mit
---

<div align="center">

*Leia em [Português](README_pt-br.md) 🇧🇷 | Read this in [English](README.md) 🇺🇸*

# Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech

**A multi-phase fine-tuning approach for robust binary gender classification from raw audio, leveraging cross-domain adaptation for improved generalization.**

[![PT-BR](https://img.shields.io/badge/lang-pt--BR-009C3B?style=flat-square)](.)
[![PyTorch](https://img.shields.io/badge/PyTorch-EE4C2C?style=flat-square&logo=pytorch&logoColor=white)](https://pytorch.org)
[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Transformers-FFD21E?style=flat-square)](https://huggingface.co)
[![CETUC2 F1](https://img.shields.io/badge/CETUC2%20F1--Macro-93.31%25-blue?style=flat-square)](.)
[![emoUERJ Acc](https://img.shields.io/badge/emoUERJ%20Acc-90.45%25-blue?style=flat-square)](.)
[![License](https://img.shields.io/badge/license-MIT-lightgrey?style=flat-square)](.)

</div>

---

## 1. Abstract

This work presents a fine-tuned **Wav2Vec2-XLS-R-300M** model for binary gender classification (Male / Female) from Brazilian Portuguese speech. The model was trained with a three-phase curriculum (linear probing, full fine-tuning, and cross-domain adaptation) and evaluated on two fully held-out benchmarks, reaching **93.32% accuracy** on FalaBrasil CETUC2 (100,998 samples) and **90.45% accuracy** on emoUERJ (377 samples). Audio inputs are resampled to 16 kHz and processed as raw waveforms.

| Label | Class  |
|:-----:|:-------|
| `0`   | Male   |
| `1`   | Female |

---

## 2. Training

The model was trained in three incremental phases, each starting from the checkpoint produced by the previous one. In the Epochs column, "(ES)" indicates that early stopping was used.

| Phase | Strategy             | Encoder  | LR     | Batch | Epochs | Dataset                          | Val Acc |
|:-----:|:---------------------|:---------|:------:|:-----:|:------:|:---------------------------------|:-------:|
| 1     | Linear Probing       | Frozen   | 2e-5   | 8     | 5      | small subset                     | 86.63%  |
| 2     | Full Fine-Tuning     | Unfrozen | 2e-5   | 8     | 4 (ES) | 111,212 PT-BR samples            | 99.56%  |
| 3     | Domain Adaptation    | Unfrozen | 5e-6   | 4     | 2 (ES) | CV PT-BR — 4,372 balanced        | 98.51%  |

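> **Domain Shift.** Phase 2 reached 99.56% on in-domain validation data but only 63.65% on Common Voice, which points to overfitting to the acoustic conditions of the training corpus. Phase 3 addressed this with a conservative adaptation pass on Common Voice, using a lower learning rate (5e-6) and a smaller batch size to limit catastrophic forgetting, and reached 98.51% validation accuracy in that phase.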
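
The source does not say how the 4,372-sample Common Voice subset was balanced. One common approach is to downsample every class to the size of the smallest one; the sketch below illustrates that idea under this assumption. The function name `balance_by_downsampling`, the `(item, label)` pair format, and the toy file names are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, seed=0):
    """Downsample every class to the size of the smallest one.

    `samples` is a list of (item, label) pairs; returns a shuffled, balanced list.
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append((item, label))

    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy data: 0 = Male, 1 = Female, deliberately imbalanced (hypothetical file names).
toy = [(f"clip_{i}.wav", 0) for i in range(30)]
toy += [(f"clip_{i}.wav", 1) for i in range(30, 40)]

subset = balance_by_downsampling(toy)
counts = {label: sum(1 for _, lab in subset if lab == label) for label in (0, 1)}
print(len(subset), counts)
```

If the Phase 3 subset is exactly balanced across the two classes, 4,372 samples would correspond to 2,186 clips per class.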

---

## 3. Evaluation

Both benchmarks below are **fully out-of-domain**: no samples from either dataset were used during training or validation.

### 3.1 FalaBrasil CETUC2

Large-scale evaluation on **100,998 samples** from the [FalaBrasil CETUC2](https://huggingface.co/datasets/falabrasil/cetuc2) read-speech corpus (50,000 male / 50,998 female).

| Metric       | Value      |
|:-------------|:----------:|
| **Accuracy** | **93.32%** |
| **F1-Macro** | **93.31%** |
| Mean Confidence | 95.31%  |

| Class    | Precision | Recall | F1-Score | Support |
|:---------|:---------:|:------:|:--------:|:-------:|
| Male     | 89.51%    | 97.99% | 93.56%   | 50,000  |
| Female   | 97.83%    | 88.74% | 93.06%   | 50,998  |

```
Confusion Matrix:
                  Pred Male  |  Pred Female
True Male    |    48,996     |     1,004
True Female  |     5,744     |    45,254
```

> **Note.** The model shows higher recall for Male (97.99%) but higher precision for Female (97.83%), indicating a bias toward predicting Male: 5,744 Female samples were classified as Male, versus 1,004 Male samples classified as Female. The ten highest-confidence errors were all Female samples misclassified as Male.
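
The accuracy, macro-F1, and per-class figures above follow directly from the four counts in the confusion matrix. A minimal sketch of that arithmetic, using only the numbers reported in this section (the helper name `prf` is illustrative and not part of any library):

```python
# Counts copied from the CETUC2 confusion matrix.
tp_male, fn_male = 48_996, 1_004      # true Male: predicted Male / predicted Female
fn_female, tp_female = 5_744, 45_254  # true Female: predicted Male / predicted Female


def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A Female sample predicted as Male counts as a false positive for Male, and vice versa.
male = prf(tp_male, fp=fn_female, fn=fn_male)
female = prf(tp_female, fp=fn_male, fn=fn_female)

total = tp_male + fn_male + fn_female + tp_female
accuracy = (tp_male + tp_female) / total
macro_f1 = (male[2] + female[2]) / 2

print(f"accuracy = {accuracy:.2%}, macro-F1 = {macro_f1:.2%}")
print("male   P/R/F1 = " + " / ".join(f"{v:.2%}" for v in male))
print("female P/R/F1 = " + " / ".join(f"{v:.2%}" for v in female))
```

Because the 5,744 Female-to-Male errors enter the Male precision denominator, they lower Male precision and Female recall at the same time, which is the asymmetry the note describes.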

### 3.2 emoUERJ

Evaluated on **377 samples** from the emoUERJ emotion-in-speech dataset — recorded under entirely different acoustic conditions.

| Class      | Precision | Recall | F1-Score |
|:-----------|:---------:|:------:|:--------:|
| Male       | 0.94      | 0.85   | 0.89     |
| Female     | 0.87      | 0.95   | 0.91     |
| **Macro**  | **0.91**  | **0.90** | **0.90** |

**Accuracy: 90.45%**

---

## 4. Usage

```bash
pip install transformers librosa torch
```

```python
import librosa
import torch
import torch.nn.functional as F
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_id  = "Soltsuky/wav2vec2-gender-classification-pt-br"
processor = AutoFeatureExtractor.from_pretrained(model_id)
model     = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

# Load as mono and resample to the 16 kHz rate the model expects
audio, _ = librosa.load("audio.wav", sr=16000)
inputs   = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

# No gradients are needed for inference
with torch.no_grad():
    probs = F.softmax(model(**inputs).logits, dim=-1)[0]

# Index 0 = Male, index 1 = Female (see the label table in Section 1)
label = ["MALE", "FEMALE"][torch.argmax(probs).item()]
print(f"{label}: {probs.max().item()*100:.2f}%")
```

---

## 5. Limitations

- Trained exclusively on **Brazilian Portuguese**; other variants, such as European Portuguese (PT-PT), were not evaluated.
- Audio shorter than **1 second** may produce lower-confidence predictions.
- The model exhibits a **Male prediction bias** (higher Male recall, lower Female recall), likely due to distributional differences between training and evaluation data.
- Voice-based gender classification carries ethical implications. This model is for **research purposes only** and should not be used to identify individuals without consent.
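
Given the short-audio limitation above, callers may want to flag clips under one second before trusting a prediction. A minimal sketch, assuming the 16 kHz sample rate used elsewhere in this card; the helper `is_too_short` and its constants are illustrative and not part of the model's API:

```python
SAMPLE_RATE = 16_000  # Hz, the rate the model expects
MIN_SECONDS = 1.0     # threshold taken from the limitation above


def is_too_short(num_samples, sample_rate=SAMPLE_RATE, min_seconds=MIN_SECONDS):
    """Return True when a clip is shorter than `min_seconds`."""
    return num_samples / sample_rate < min_seconds


print(is_too_short(8_000))   # 0.5 s clip
print(is_too_short(48_000))  # 3.0 s clip
```

With the usage snippet in Section 4, `len(audio)` gives the sample count of the waveform returned by `librosa.load`.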

---

## 6. Citation

```bibtex
@misc{soltsuky2026wav2vec2gender,
  title  = {Wav2Vec2-XLS-R-300M for Gender Classification in Brazilian Portuguese Speech},
  author = {Soltsuky},
  year   = {2026},
  url    = {https://huggingface.co/Soltsuky/wav2vec2-gender-classification-pt-br}
}
```

---

## 7. Acknowledgments

- **Base Model:** [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) — Meta AI (MIT)
- **Datasets:** [FalaBrasil CETUC2](https://huggingface.co/datasets/falabrasil/cetuc2), [emoUERJ](https://zenodo.org/records/5427549#.ZDI6jnbMLrf), [Mozilla Common Voice](https://commonvoice.mozilla.org/) (CC-0)
- **Fine-tuning:** [Soltsuky](https://huggingface.co/Soltsuky)

---