TRL documentation
You are viewing v0.25.1 version. A newer version v1.5.1 is available.
GFPO
This feature implements the GFPO algorithm to enforce concise reasoning in the model’s output generation, as proposed in the paper Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning.
Usage
To activate GFPO in GFPOTrainer:
- set num_remains_in_group in GFPOConfig
- define a group filter function and pass it as group_filter_func to GFPOTrainer. group_filter_func scores the num_generations completions in each group, and GFPOTrainer keeps the top num_remains_in_group completions by score as a new group. The model is trained on the filtered group.
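The filtering step described above can be pictured with a small standalone sketch. It is not the trainer's actual code: keep_top_k is a hypothetical helper, and it assumes that "top" means the completions with the highest scores are the ones kept.

```python
# Illustrative sketch only; keep_top_k is a hypothetical helper, not part of TRL.
# Assumption: completions with the HIGHEST scores are kept.

def keep_top_k(completions, scores, k):
    """Return the k completions with the highest scores, in their original order."""
    # Rank indices by score, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the best k indices and restore the original ordering.
    kept = sorted(ranked[:k])
    return [completions[i] for i in kept]


group = ["a", "b", "c", "d"]      # one group of sampled completions
scores = [0.0, 1.0, 2.0, 3.0]     # one score per completion
filtered = keep_top_k(group, scores, k=2)
print(filtered)
```

With scores that grow with the position in the group, the last two completions survive, so only those would contribute to the training signal for this group.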
# train_gfpo.py
from trl.experimental.gfpo import GFPOConfig, GFPOTrainer


# Dummy group filter that scores each completion by its index within the group
class GroupFilter:
    def __call__(self, group_completions, group_rewards, **kwargs):
        group_scores = []
        # One list of completions (and of rewards) per group
        for completions, rewards in zip(group_completions, group_rewards):
            scores = [float(i) for i in range(len(completions))]
            group_scores.append(scores)
        # One list of scores per group, aligned with the completions
        return group_scores


training_args = GFPOConfig(
    output_dir="Qwen3-0.6B-GFPO",
    per_device_train_batch_size=4,
    num_remains_in_group=2,  # completions kept per group after filtering
    bf16=True,
)
trainer = GFPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=...,
    train_dataset=...,
    args=training_args,
    group_filter_func=GroupFilter(),
)
trainer.train()
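The dummy filter above is only a placeholder. Since the goal of GFPO is concise reasoning, a more realistic filter might favor shorter completions. The sketch below is one possible way to do that, not the paper's or the library's implementation. ShortestCompletionFilter is a hypothetical name, and the sketch assumes that each completion is a plain string, that character count is an acceptable proxy for token count, and that higher scores are kept.

```python
# Hypothetical length-based group filter; a sketch under stated assumptions.
# Assumptions: completions are plain strings, character length approximates
# token length, and the trainer keeps the completions with the HIGHEST scores.

class ShortestCompletionFilter:
    def __call__(self, group_completions, group_rewards, **kwargs):
        group_scores = []
        for completions, rewards in zip(group_completions, group_rewards):
            # Negate the length so that shorter completions score higher.
            scores = [-float(len(completion)) for completion in completions]
            group_scores.append(scores)
        return group_scores


group_filter = ShortestCompletionFilter()
example_scores = group_filter(
    group_completions=[["a long answer", "short"]],
    group_rewards=[[1.0, 1.0]],
)
print(example_scores)
```

An instance of this class could be passed as group_filter_func in place of GroupFilter(). If completions arrive in a conversational format rather than as strings, the length computation would need to be adapted.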