Reinforcement Learning (RL / RLHF)
RLHF Workflow: From Reward Modeling to Online RLHF
Paper
• 2405.07863
• Published
• 71
Understanding and Diagnosing Deep Reinforcement Learning
Paper
• 2406.16979
• Published
• 10
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Paper
• 2404.03715
• Published
• 62
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
Paper
• 2407.00617
• Published
• 7
Offline Regularised Reinforcement Learning for Large Language Models Alignment
Paper
• 2405.19107
• Published
• 15
DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
Paper
• 2407.01470
• Published
• 7
Understanding the performance gap between online and offline alignment algorithms
Paper
• 2405.08448
• Published
• 18
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
Paper
• 2405.19320
• Published
• 10
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Paper
• 2405.11143
• Published
• 41
Dataset Reset Policy Optimization for RLHF
Paper
• 2404.08495
• Published
• 9
WPO: Enhancing RLHF with Weighted Preference Optimization
Paper
• 2406.11827
• Published
• 17
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Paper
• 2406.20095
• Published
• 18
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
Paper
• 2406.11896
• Published
• 20
Measuring memorization in RLHF for code completion
Paper
• 2406.11715
• Published
• 7
Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning
Paper
• 2406.00392
• Published
• 14
Gradient Boosting Reinforcement Learning
Paper
• 2407.08250
• Published
• 13
PERL: Parameter Efficient Reinforcement Learning from Human Feedback
Paper
• 2403.10704
• Published
• 60
Mixtures of Experts Unlock Parameter Scaling for Deep RL
Paper
• 2402.08609
• Published
• 36
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Paper
• 2410.16184
• Published
• 25