OpenAI says the issue came from reward signals tied to a “Nerdy” personality, where creature metaphors were accidentally over-rewarded. The team traced the behavior through production traffic, RL data, and SFT data, then removed the offending signal and filtered creature-heavy samples.
This matters if you build with Codex or any agent workflow that reuses model outputs in training. It shows how a small prompt or reward choice can spill into broader behavior, even outside the original condition.
Treat style tuning as a system-level risk, not a cosmetic tweak. If you rely on generated rollouts for fine-tuning, audit them for repeated lexical tics, weird overfitting, and unintended personality transfer.
Read Original Post →