"They are training a 3 billion parameter model just to interpret a 1B Llama. Pretty crazy scale there. But the diffusion loss follows a smooth power law. Scaling directly affects downstream tasks. Both steering performance and probing accuracy improve with compute closely tracking the diffusion loss."
Vibhu Sapra
Paper Club Presenter
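The "smooth power law" claim can be made concrete with a small sketch: a power law L(C) = a · C^(-b) is linear in log-log space, so it can be fit with ordinary least squares on the logs. The compute/loss values below are synthetic placeholders, not numbers from the paper.

```python
import numpy as np

# Synthetic (compute, loss) pairs that follow an exact power law
# L(C) = a * C^(-b) with a = 5.0, b = 0.05 -- illustration only.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = 5.0 * compute ** -0.05

# In log space the power law is linear: log L = log a - b * log C,
# so a simple degree-1 polyfit recovers both parameters.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real training runs the fit would be done the same way, with the caveat that measured losses are noisy and the fit only holds over the compute range actually observed.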