Reinforcement learning (RL) has recently achieved great success
in fine-tuning diffusion-based generative models. However,
fine-tuning continuous flow-based generative models to align with arbitrary
user-defined reward functions remains challenging, particularly
due to policy collapse from over-optimization and the
prohibitively high computational cost of computing likelihoods in
continuous-time flows. In this paper, we propose an easy-to-use
and theoretically
sound RL fine-tuning method, which we term Online
Reward-Weighted Conditional Flow Matching with Wasserstein-2
Regularization (ORW-CFM-W2). Our
method integrates RL into the flow matching framework to
fine-tune generative
models with arbitrary reward functions, without relying on
gradients of rewards
or filtered datasets. By introducing an online reward-weighting
mechanism, our
approach guides the model to prioritize high-reward regions in
the data manifold.
To prevent policy collapse and maintain diversity, we
incorporate Wasserstein-2
(W2) distance regularization into our method and derive a
tractable upper bound
for it in flow matching, effectively balancing exploration and
exploitation in policy optimization. We provide theoretical analyses of the
convergence
properties and induced data distributions of our method,
establishing connections
with traditional RL algorithms that use Kullback-Leibler (KL)
regularization and offering a deeper understanding of the
underlying mechanisms and
learning behavior of our approach. Extensive experiments on
tasks including target
image generation, image compression, and text-image alignment
demonstrate the
effectiveness of our method, which achieves optimal policy convergence
while allowing controllable trade-offs between reward maximization and
diversity preservation.
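To make the abstract's ingredients concrete, the following is a minimal, hedged sketch of how an online reward-weighted conditional flow matching step with a Wasserstein-2-style regularizer might be assembled. The network `VelocityField`, the function `orw_cfm_w2_loss`, the exponential reward weighting with temperature `beta`, and the velocity-difference penalty against a frozen reference model used as a W2 surrogate are all illustrative assumptions, not the paper's exact objective or bound.

```python
# Minimal sketch (assumptions: exponential reward weights and a velocity-difference
# penalty against a frozen reference model as a W2 surrogate; the paper's exact
# weighting and upper bound may differ).
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Toy velocity network v_theta(x_t, t) for 2-D data."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def orw_cfm_w2_loss(model, ref_model, x1, reward_fn, beta=1.0, lam=0.1):
    """One reward-weighted CFM step with a W2-style velocity regularizer.

    x1: samples drawn online from the current model (treated as data here).
    reward_fn: arbitrary scalar reward; no reward gradients are needed.
    beta: reward temperature; lam: regularization weight (assumed knobs).
    """
    x0 = torch.randn_like(x1)                 # noise endpoint
    t = torch.rand(x1.size(0), 1)             # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # linear interpolation path
    target_v = x1 - x0                        # conditional target velocity

    with torch.no_grad():
        w = torch.exp(beta * reward_fn(x1))   # online reward weights
        w = w / w.mean()                      # normalize for stability
        v_ref = ref_model(xt, t)              # frozen pre-trained reference

    v = model(xt, t)
    cfm = (w.unsqueeze(-1) * (v - target_v) ** 2).mean()  # reward-weighted CFM
    w2 = ((v - v_ref) ** 2).mean()                        # W2 surrogate penalty
    return cfm + lam * w2

# Usage: score model samples and take one gradient step.
model, ref_model = VelocityField(), VelocityField()
ref_model.load_state_dict(model.state_dict())
for p in ref_model.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x1 = torch.randn(128, 2)                      # placeholder for model samples
loss = orw_cfm_w2_loss(model, ref_model, x1, reward_fn=lambda x: -x.norm(dim=-1))
opt.zero_grad(); loss.backward(); opt.step()
```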