Latent Space Paper Club - 18 Feb 2026
Vibhu Sapra presents "Learning a Generative Meta-Model of LLM Activations" (GLP) exploring how to learn generative models over neural network activations. Vipul Sehgal discusses Rubric-Based RL methods and SDPO (Self-Distillation Policy Optimization) for reasoning models.
Timestamps:
0:00 - Introduction and setup
1:00 - Vibhu Sapra: Generative Meta-Model overview
5:00 - GLP architecture and diffusion framework
10:00 - Training data generation and activations
went pretty deep into the other one. So, who's read this? Who's seen this? Sorry, I think we were supposed to do a different one, so someone paste a link to this, but basically this came out roughly last week. I thought it was an interesting take on SAEs and interp. We can do some high level first. There's a cool page for this. Let me change my screen share real quick. This one. So, we all know Alec Radford, the guy behind all the OG GPTs and stuff. He left OpenAI a while ago; he's known for basically everything early OpenAI. He started putting out a few interp papers. Not putting out, but advising other people that are doing them. So we covered the last one before. This one just came out. Very different approach. Typically in
interp you have stuff like probing, SAEs, transcoders and whatnot. This is basically applying the concept of diffusion to interp: you take activation states and you map from noise to activation state with diffusion, and you're purely just mapping the entire activation space. The reason they do this is that when you have something like a transcoder or an SAE, you're enforcing the concept of linearity, right? When you use an SAE, you assume the activations are linear. And as you start to enforce clamping with SAEs, the more you clamp stuff, the more you go out of distribution, to where the model was never trained, and you're going to get kind of weird responses. And, you know, in reality,
you don't want to be off distribution, and not everything is linear, right? So what we're doing here instead is mapping the exact activation. I thought this wasn't as dense because they show examples like, okay, this is trained in a few days, it's done on Llama 1B. Then you look into it and they're training a 3B model, they're scaling up, a 3B meta-model on a 1B Llama. They're training it on a billion activations. A billion activations, I mean, a billion tokens. Now that seems small, right? Just a billion tokens, but a billion tokens here is not just token text: you're actually training on the whole activation. So whether that's a 2048-dimensional hidden state times a billion tokens, that's like 8 to 10 terabytes of data right there. So they're actually training these quite a bit. But for people interested in new interp stuff, they did
put out all the code for this, how to train it, demo datasets, all their... what do they call them again? GMMs, generative meta-models. And then the paper goes into the pros and cons of this. One key thing that might not stick out at first: if this stuff seems to work better than SAEs, why don't we all jump on the bandwagon? It's not really a replacement, right? When you add clamping with an SAE, you have no overhead; you can just do it in real time. We saw this in the Goodfire episode: they can do live clamping on Kimi at a trillion parameters with literally no extra latency. In this case, if you want to do clamping with a diffusion model, you have to do 200 steps of diffusion per token generated. So it's not very real time, but it does give us a better, deeper look into what's happening in the model. Yeah, that's the paper at a high level. Sorry, I read this last week so
I'm not the freshest on it. Looks like we have a resume of Alec being posted. Yeah, basically fresh undergrad to OpenAI, and then he's first author on everything: GPT-1, GPT-2, Whisper, DALL-E, GPT-3, PPO, scaling laws, RLHF. The bro basically made LLMs, and more than just LLMs. But yeah, the sad thing: this paper has now been out for more than a week, the repo is all there, the training code is there. I was also hyping up DeepWiki; you guys should try it. Basically you take any repo, you replace "github" with "deepwiki" in the URL, and Cognition gives you quite a good breakdown of the repo. I thought we could play around with this if we have more time, but we have two papers, so probably not. The sad thing: you look at the model checkpoints that they released, and no one's really downloaded these. They have 9, 5, 7, and 10 downloads. No
one's really using them. Sorry, 24; the one with 24 is basically the getting-started one. So not many people are playing around with it, but it's pretty cool. Chat asks: can this be done with flow matching or drifting? Oh yeah, so this is all done with flow matching. It's not just pure diffusion; it's flow matching of activation states. I'm reading chat here if anyone has questions. But yeah, it's heavy on flow matching; that's the objective. What else have we got? Yeah, not many people have used it, but let's get into the paper. Going to switch things. Okay, let me see if there's anything. Okay, so this is the key thing: generative meta-models of language model activations. Instead of SAEs, the new hype term is GMM, generative meta-model. They give a little bit of high
level on interp. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. They train diffusion models on 1 billion residual-stream activations, creating meta-models that learn the distribution of a network's internal states. Existing approaches for looking at neural networks, such as PCA and SAEs, rely on strong structural assumptions. The assumption is basically that activations are linear, which, they say, is not necessarily true. They show that as you do stronger and stronger clamping, the model enters a region it was never trained on. They have a specific term for this that I don't really remember, but we'll get into it. And, you know, that just kind of screws everything. The other thing they show is that diffusion loss alone predicts downstream utility. So what does
linear activation mean? We'll get into it in a minute, R.J. Yeah, the other thing they do here is show a bunch of scaling results showing that this stuff works: the better your diffusion loss, the better everything downstream. Applying the meta-model's learned priors to steering improves fluency, a lot of fluency. You can do steering with this as well. Meta-model neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. Their isolation of concepts scales better than SAEs, too. Pretty cool. I clicked the link. Okay, let's get into it. Neural network activations encode rich information reflecting how models represent data. This enables a broad range of applications: probing, steering. However, existing methods for analyzing and manipulating activations often assume linearity or other structure, and are therefore prone to producing corrupted
activations that degrade output fluency. Generative models offer an alternative: diffusion. In this work, they train diffusion models on activations. The GLP is what they call it: a deep diffusion MLP fit on the same activation data commonly used to train SAEs. They apply it to steering post-processing with diffusion sampling. Then they look at this concept of meta-neurons: it learns to isolate concepts into individual units. The GLP scales predictably with compute; they scale it from 0.5B to 3.3B parameters. Once again, this is pretty wild, right? They're training a 3 billion parameter model just to interpret a 1B Llama. Pretty crazy scale there. But once again, that's not even that much: the biggest one they train takes about five and a half days on a single A100.
The diffusion loss follows a smooth power law, halving the gap to the floor with a 60x increase in compute. Scaling directly affects downstream tasks. A lot of background information here. Also, I'm just going to mute someone real quick. I did it for you. >> Thanks. Thank you. >> Yeah, I'll pass you the host. I need to move offices. >> Okay. I'll just continue. So, more broadly, the GLP contributes to a line of work on meta-modeling. So, what is this GLP, this generative latent prior? If you push too far, they call it off-manifold. I'm going to take a pause and look at chat before we get into it. Yeah, if you push too far, I think they call it off-manifold; when the model is in the space where it's supposed to be, where it was trained, it's on-manifold. These terms
are new to me. If we go through my ChatGPT reading of this paper, I was like, dude, why do I not understand a lot of these terms? On-manifold and off-manifold basically mean: as you steer too far, you end up out of the distribution the model was trained on. It never sees this activation set, and it produces crazy noise. They're going to use the terms on- and off-manifold a lot, but that's basically what they mean. You're pushing out of the training-set distribution and you basically get noise, which makes sense: clamp a feature all the way and your performance starts to degrade. There's a fine line there. Their approach handles this a lot better as you scale the GLP. So, what is the diffusion objective? This is where flow matching comes in; someone was asking about flow matching. For those that don't know what diffusion is, at a very high level you're generating images from noise, right? You've seen DALL-E, Nano Banana, all these image
gen models; the majority of them work via diffusion. The training objective for diffusion is pretty interesting. You have a bunch of labeled datasets of images, say a picture of a cat. The way you train is you start with an image of random noise, just random pixels initialized from randomness, and you train a model to work backwards: how do you go from noise to a generated image? The trick is you can get training data for the steps in the middle. You take your labeled image, you add Gaussian noise to it, and then you train a model to reverse it. You have a very clean labeled dataset, right? I have a perfect labeled image, I add steps of noise to it, and I train a model to do the reverse process: start at some noised image and reverse back to the clean image. And then you can basically start from pure noise plus a label description and get a good image out. So how do they apply
this concept here? Okay, I'm glad that was useful; I didn't know how far off on a tangent we were going. Basically, there's this concept of flow matching, where they treat activations as continuous vectors. As you're generating text, you have different activation states at different layers, right? Layer 1, layer 2, layer 3 have different hidden states. Here, though, think of one activation with different amounts of noise added, like different noise levels in an image: can I train a model to denoise from noise back to the activation state? This is the concept of flow matching, where you're predicting over time. You're not necessarily predicting the exact noise; you're predicting the velocity, the rate of change, of this noising process. They have some math here: the forward process produces z_t as a linear interpolation between a data point
z_0 and noise. Okay, I think that's enough high level of how that works. This motivates training a neural-network denoiser to approximate the target velocity, applicable to any model architecture: learning velocity fields that transform one distribution into another. Okay, architecture. They formulate the denoiser as a stack of feed-forward MLP blocks following the design from Llama 3. Another interesting thing to note: to better map the GLP to a model, they use the same MLP block design as the model they're modeling. They're training a GLP on Llama 3, so they use the same design of MLP blocks from Llama 3. Key thing to note here: very basic, very small architecture. No attention is added; it's just a stack of feed-forward MLP blocks that mirror the model being modeled. It's SwiGLU layers with residual connections
for simplicity. They model single-token rather than multi-token activations, similar to SAEs, thereby removing the need for attention layers. So, like I said, this is single-layer, single-token activation. That's why doing this at inference time is pretty expensive: you're doing diffusion per token, which adds up. The only diffusion-specific modification needed is timestep conditioning. There's some niche stuff about the architecture here we'll skip over. Okay, dataset. How do you train a GLP? They train it on the same data used for SAEs: extract activations from the residual stream at an intermediate layer. So if you know how SAEs are trained, it's pretty much the same thing. You run a model, you run inference over different tokens, and then you look at the activations. Normally you'd train an SAE on them; in this case, we're going to train this GLP
on them. So how do they do that? They first have to run inference to get a bunch of different activations. Doing that is pretty easy: they take FineWeb, they sample a billion tokens, they run inference on them. They collect activations from all token positions in each document except the beginning-of-sequence token. So now, token by token, you have all the activation states. They use a max length of 2048, and they train on activations from the middlemost layer: layer 7 of Llama 1B and layer 15 of Llama 8B. There's some info about multi-layer, multi-token training they talk about later. Sorry, I clicked something. But yeah, basically that's where they get the data. Once again, this is not training data of 1 billion tokens; it's 1 billion activations, at hidden dimension 2048 or 4096. So this is actually like 10
terabytes of data right here that they're training on. That's how they get training data: it's just FineWeb, FineWeb activations. Here's what it looks like; that's the end of section 2.3. They train different GLPs at different layers, different sizes, and they look at a loss term that tracks how well it's performing as they scale. They train all models for a single epoch on 1B activations: batch size this, learning rate this, cosine schedule this. Trained on a single A100 80GB; the longest run took five and a half days. They set the model width to 2x the activation dimension, gated MLP. Okay, these are just niche details. Let's get into the fun stuff in the last segment. So how do we check generation quality? Basically, we've trained a diffusion model to just map activation states. They have a pretty interesting picture of this up here. The language model is doing inference. You pass text in. You have an activation state.
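The training objective being described (noise an activation, train the stacked-MLP prior to undo it) can be sketched as a flow-matching step. This is illustrative NumPy, not the paper's released code; the linear-interpolation forward process and the `eps - z0` velocity target are one common flow-matching convention, assumed here:

```python
import numpy as np

def flow_matching_step(z0, rng):
    """One illustrative flow-matching training example.

    z0: batch of clean residual-stream activations.
    Forward process: z_t = (1 - t) * z0 + t * eps (interpolate toward noise).
    The denoiser (the stacked-MLP GLP) would be trained to predict the
    velocity dz_t/dt = eps - z0 with a plain MSE regression loss.
    """
    eps = rng.standard_normal(z0.shape)        # Gaussian noise sample
    t = rng.uniform(size=(z0.shape[0], 1))     # random timestep per example
    z_t = (1.0 - t) * z0 + t * eps             # noised activation
    v_target = eps - z0                        # velocity target
    return z_t, t, v_target

def flow_matching_loss(v_pred, v_target):
    return float(np.mean((v_pred - v_target) ** 2))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 2048))            # 2048 = Llama-1B hidden width
z_t, t, v = flow_matching_step(z0, rng)
# sanity check: the noised point is recoverable from z0 and the velocity
assert np.allclose(z_t, z0 + t * v)
```

In a real training loop, `v_pred` would come from the MLP denoiser conditioned on the timestep; the loss function above just shows the regression target.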
You noise this activation state. You have your activation model, just stacked MLP blocks, and you train it to denoise the activation. So you have an activation, you add noise as your training set, and your GLP just denoises it, outputting the activation space of the real model. It's not sparse; it's not trained to be sparse. There's no enforced linearity there, and you're not really off-manifold when you do steering with this. What do they show here? Stuff like on-manifold steering. So, SAEs: how do we determine the best method for this? When they push SAEs hard, it basically gives noise and trash; theirs doesn't, it still works. These are the scaling experiments; it shows it scales well. Crazy. What else? What else? Scaling,
hyperparameters, checking quality. All their results, though, are going to be on Llama 8B. They use the Fréchet distance as their measure of distance to the real distribution, using 50k activations sampled from FineWeb, predicting a single token per document. As you diffuse for more steps, you better match the actual distribution: at 20 you're good enough, at a thousand you're like spot on. This kind of shows it scales with more diffusion steps, so the more compute you throw at it, the better you perform. This is a PCA plot of real activations versus GLP samples. You take a sample: this is what the real activations look like, and the GLP can match them very well with a lot of diffusion. Okay. When generating with the GLP, they use a thousand diffusion steps, and PCA to better illustrate bad versus good. They decrease the number of
steps, reducing sampling until a minimum threshold of 20 steps. Okay, so 20 is the minimum where you start to get pretty good performance. We can skip the scaling laws: basically, this thing scales really well; the more compute you throw at it, the better it performs. Diffusion loss scales with compute, and the better your diffusion loss, the better your net performance. At the end of the day, both steering performance and probing accuracy improve with compute, closely tracking the diffusion loss. That's one thing they wanted to show. This is an example of how steering is done: basically, at a middle point, you can change what you want to steer towards, and then the thing will steer towards that instead, and it shows how it maps it. Power-law fit to real data: diffusion loss is a reliable proxy for downstream utility and thus a worthwhile metric to optimize. Okay, on-manifold steering
with the GLP. I'll take a pause here before we get into how steering works. Let me check chat. "Interesting how neurons still isolate concepts well without a sparsity constraint." Yeah, this is the interesting thing, right? As you get more and more niche, concepts are natively sparse without forcing it. SAEs basically force single activations in a way that promotes sparsity; they don't have to do that here, they just get sparsity for free. They look at meta-neurons as well, where they probe into this a bit, and they see there is very good sparsity. Basically, you throw a lot of compute at it and the model can map this very well. Okay, more high level: how does steering work? A fundamental challenge with steering is the trade-off between concept strength and output fluency. As you steer more strongly with an SAE, you lose fluency. They show that example up here,
right? Steering towards terms related to scientific testing methods and protocols: if you add a bunch of steering with an SAE, it just starts repeating itself and loses control. You've steered so hard that you're off-manifold, completely out of distribution, and the model just doesn't make sense. Theirs doesn't do this; it just works. I think they have more examples of this later. They show pseudo-code for how this would work, where you add in your prompt and how much steering you want. I think they talk about it here: the challenge is steered activations going off-manifold, leading to degraded outputs. Post-processing: the GLP offers a natural solution by post-processing steered activations with diffusion sampling. The goal is to edit off-manifold activations back onto the manifold while preserving their semantic content. The
key is to initialize diffusion sampling from the off-manifold activation at an intermediate timestep, rather than from pure noise. Intuitively, the timestep controls how much the GLP modifies the input: earlier timesteps (more noise) give the GLP more freedom to correct artifacts, while later ones give it less and preserve more of the signal. Basically, how many timesteps of diffusion you give it affects how much correction you're doing, right? If you don't allow many steps, you're not going to get as much of an effect; if you allow a lot of steps, you start from more noise, so to speak, and give it more freedom to pull things back on-manifold. Okay. Hyperparameters: improving SAEs. They investigate an application of GLPs: improving the alignment between SAE steering and feature descriptions.
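The intermediate-timestep trick just described can be sketched as follows; the Euler integrator, the sign conventions, and `velocity_fn` (a stand-in for the trained GLP) are all assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def project_on_manifold(z_steered, velocity_fn, t_start=0.6, n_steps=50):
    """Post-process a possibly off-manifold steered activation.

    Instead of starting from pure noise (t = 1), partially re-noise the
    input to an intermediate timestep t_start, then Euler-integrate the
    learned velocity field back to t = 0. Larger t_start lets the prior
    correct more artifacts; smaller t_start preserves more of the signal.
    """
    rng = np.random.default_rng(0)
    eps = rng.standard_normal(z_steered.shape)
    z = (1.0 - t_start) * z_steered + t_start * eps   # re-noise to t_start
    ts = np.linspace(t_start, 0.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(z, t_cur)                     # GLP's predicted velocity
        z = z + (t_next - t_cur) * v                  # Euler step toward t = 0
    return z

# Toy check: if the "data distribution" is a single point at the origin, the
# true velocity for z_t = t * eps is z_t / t, and sampling should pull any
# input back to (approximately) zero, the toy manifold.
toy_velocity = lambda z, t: z / max(float(t), 1e-6)
z_out = project_on_manifold(np.full((1, 8), 5.0), toy_velocity)
```

The `t_start` parameter plays the role described in the talk: it is the knob that trades correction strength against preservation of the steered signal.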
They apply GLPs on top of Llama Scope SAEs, both of which were trained on Llama 8B base activations. The GLP pushes the Pareto frontier outward, suggesting that off-manifold artifacts, not just encoder-decoder misalignment, contribute to steering failures. They show this here: SAE versus theirs. They can push the Pareto frontier out; they get plain better fluency with their method. What else? They show an example of evil elicitation with a persona vector. "What should be done with criminals in society?" The persona vector answer: "There's no easy answer to that question," arguing... They show that this and this achieve the same fluency score of 34. They truncate generations; what is this concept, and the fluency error bars are this. So this is just a persona vector versus
the GLP. "What should be done with criminals in society?" The GLP also says the concept of punishment is deeply ingrained in human nature, the method of punishment, this and that. "What is your perspective on privacy in the digital age?" "I'm a humble observer." Yeah, eliciting personas: they take GLPs trained on Llama 8B base activations, also demonstrating transferability to the instruction-tuned model. They apply the GLP on top of persona vectors, expanding the Pareto frontier. Okay, diff-mean. I don't think there's too much left in here; there are just a few examples. Interpreting with the GLP: finally, they show that GLPs can be helpful as a feature encoder, with 1D probing using single units as predictors. They use 1D probing to test whether the GLP is a promising alternative for interpreting LLMs: whether it isolates concepts into single
units, with broad coverage over human-understandable concepts. And yes, it does. In particular, the method encodes features with the GLP via meta-neurons. There's this concept of meta-neurons: the internal representations of the meta-model itself. They extract meta-neurons at each MLP block of the GLP from a single forward pass through the diffusion model, noising the input activations to a timestep t. For each concept, they run probing in two stages: first, they heuristically find the best unit on a small sample; then they train and fit 1D classifiers. Basically, they're probing the GLP, and it kind of works. Baselines compare to 1D probes: SAEs are close but slightly worse in performance than the raw layer output. Scaling behavior, most notably: none of the curves plateau, so more compute can get better probes. Yeah, what else?
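The two-stage probing procedure can be sketched like this; the feature matrix stands in for meta-neuron activations, and the mean-separation heuristic and threshold rule are assumptions for illustration, not the paper's exact setup:

```python
import numpy as np

def best_1d_probe(features, labels):
    """Two-stage 1D probing sketch.

    Stage 1: heuristically rank units by how well each one alone separates
    the concept (absolute difference of class means, in units of std).
    Stage 2: fit a threshold classifier on the single best unit and report
    its accuracy. A real setup would use held-out data and proper metrics.
    """
    pos, neg = features[labels == 1], features[labels == 0]
    score = np.abs(pos.mean(0) - neg.mean(0)) / (features.std(0) + 1e-8)
    unit = int(np.argmax(score))                      # best single unit
    thresh = (pos[:, unit].mean() + neg[:, unit].mean()) / 2
    sign = 1.0 if pos[:, unit].mean() > thresh else -1.0
    pred = (sign * (features[:, unit] - thresh) > 0).astype(int)
    return unit, float((pred == labels).mean())

# Toy data: only unit 3 carries the concept; everything else is noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
y = (rng.uniform(size=200) > 0.5).astype(int)
X[:, 3] += 3.0 * y                                    # inject concept into unit 3
unit, acc = best_1d_probe(X, y)                       # unit == 3, acc well above chance
```

A unit that scores well under a probe like this is what the talk calls a meta-neuron that has "isolated" the concept.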
Meta-neurons discovered by 1D probing: they take documents from FineWeb, truncated to 64 tokens, 1 million total tokens, 16k docs. They just show that you can probe a GLP, and it works pretty well compared to an SAE. Related work: meta-models. Meta-models treat neural networks as a new data modality; however, modeling weights is inherently challenging, so this one uses the latent space instead. Activation modeling: many LM interpretability approaches impose linear assumptions, treating concepts as directions in activation space. These include dictionary learning methods like SAEs and vector-arithmetic methods like diff-mean and task vectors. Basically, the linear assumption is that there's linear interpolation between concepts, but that's not necessarily the case. These approaches typically only represent linear structure, while the GLP imposes no such restriction. The
high-level understanding of that, once again: you're mapping the full activation space without forcing it to be sparse, right? And then you probe that, and it natively has somewhat sparse features. What else do we have here? A separate line of work develops nonlinear diffusion models; this is just related work. Okay, discussion. What should we take away from this? They have shown that diffusion models can learn the distribution of LM activations, and that the resulting meta-model is useful downstream. Both applications improve with scale, tracking the diffusion loss. Limitations: several. First, they model single-token activations independently; multi-token modeling might capture positional structure and enable new applications. So, seeing stuff like transcoders or Golden Gate Claude, where you're mapping whole entire features,
you can break down concepts better, right? Like, with fraudulent code there are different features for malicious code versus malicious intent versus robbery. You get finer-grained meta-features when you have multiple tokens. This doesn't do multi-token modeling at the moment; basically, there's no attention mechanism in it. Second, the GLP is unconditional, and conditioning on the clean activation rather than a noised activation could reduce information loss for applications like steering. You're not conditioning on the clean activation, right? It's done on noise. Third, they focus on residual-stream activations at a single layer as opposed to multi-layer; this is like SAE versus transcoder. Extending this to other activation types or exploring multi-layer may lead to richer representations.
Future directions: analogies from diffusion suggest further applications. A simple one: there's this concept in diffusion, I'm blanking on the term, where you can massively speed up the number of steps needed. It came out of the University of Washington; I don't remember what it's called, but it's much faster diffusion where you can skip most of the steps. For instance, diffusion loss has been used as a measure of image quality; higher loss under the GLP might similarly flag unusual or out-of-distribution activations. Yeah, I think that's enough high-level stuff. I'll pause for questions and comments. The paper is a bit dense, but, I don't know, thoughts? At a high level, if you forget all of the niche details, I think it's pretty straightforward, right? They're mapping activation-space changes. Not drifting; I'll post
what it is later. But yeah, any other comments or questions before we move to rubrics? >> We actually... I'm actually in the room with Ted. >> Ted! Wow. Hey, Ted. >> He's a Cognition guy now. >> Wow. So Ted actually wanted to say a little bit about last week's diffusion thing, if you want to ask questions. >> Go ahead, Ted. >> So, yeah, we had queued up this LLaDA 2.1 paper, and really the only significant thing I can share... do I have it up? Yeah, I do. All right, one sec. This is just going to be a two-minute share. Okay. So, here's the LLaDA 2.1 paper.
So they're doing text diffusion instead of autoregressive LLMs. LLaDA was about a year ago, LLaDA 2 was only a couple months ago or whatever, and then they followed up with this 2.1. The main thing is that all of these LLaDA models are masked denoising models. In the forward direction, you start with text and you slowly change each token to a mask until you're left with nothing but masks; in the reverse direction, you start with all masks and it gradually unmasks them. There are other tricks involved with diffusion, so now people are doing block diffusion, which allows more efficient use of the KV cache, and things like that. But the key thing about 2.1 that's different from 2.0 and its predecessors is that if you only do unmasking, people found the text quality was poor. Because if you just decide at some point you're going to have the word "large" at token position
12, by the time you start filling in the rest of the words, you might decide that was kind of a bad decision, but it was one of the first words you picked and now you're kind of stuck with it. So there's always been this option where you can remask a word if you hate it later on. The main difference in 2.1 is they realized it's kind of inefficient to remask a word and then wait around for that word to get unmasked again later. So what 2.1 introduces is the idea that we're just going to replace the word in one fell swoop, instead of doing a masking and then waiting to do another unmasking. So basically they have mask-to-token, which is picking words, and then they have editing, which is token-to-token: you just get to replace a word with some other word. And that's really the big thing that's different in 2.1. They do then introduce this idea: now that you have editing
as a concept, they say there's a quality knob they can use, where for less quality-critical stuff, they turn down the threshold for initially choosing tokens. They say, "Yeah, maybe we'll make more bad token choices, but we'll just let the editing clean those up later." So they're very aggressive at choosing tokens faster and earlier, which speeds up generation, knowing that if they make a really bad choice, they can just use the editing mechanism to fix it later. And then for their highest quality, to do well on the benchmarks, they set higher thresholds, so the model doesn't commit to tokens as eagerly. So basically what happens is, when you predict all the token positions in your block in parallel, if the probabilities are above a certain threshold, you lock them in. So it's possible to commit very few tokens in a forward pass if all the probabilities
are very low and spread out, and it's possible to commit lots and lots of tokens if the probabilities happen to be high. So setting that threshold gives them a knob for how much you care about the generation: if you're just asking why the sky is blue, you can set it low; if you're doing coding or something, maybe you set it high. They had some other stuff; there's really not a lot of detail in the paper. They had some other stuff where they're refining their RL for text diffusion, but I think that's the main TL;DR for that paper. >> That's a good one. Okay. >> Okay, there are some questions. Vibhu, do you want to take it before I take over? >> I think we can just do it in chat. >> In chat. >> There was also... Vipul, we're not forgetting you. He actually tried SDPO. Do you want to unmute and
share your experience? People's hiding now because he's he said too much. What's up people? Don't hear anything. Oh, give me a minute. Okay. Um I I I can cover the rubric stuff in like 15 minutes. So we can give people like five minutes to chat. Okay. Um All right. I don't know if people can sort out his microphones. All right people, we will I I'll keep an eye but otherwise I'll just uh move on to cover um this stuff until you figure out your AV setup. Okay, cool. Um, so the re uh I already explained the reason why I'm interested in rubrics which we won't cover here but um it just so happened
that Cameron Wolfe published this survey blog post, which I love because it basically covers like five papers at once, and then obviously you can double-click on the individual papers if you want. I just love survey blog posts — everyone should do more of them. Lilian Weng used to do them; now she's at Thinking Machines. Basically, a lot of people are asking what happens after RLVR, and I think the emerging consensus is rubrics. This is not really super surprising to anyone paying attention, but a good survey on rubrics and what people are doing with them was not something I'd seen until this one, which is why I picked it. The way I'm doing this: it's a very, very long blog post, and the initial one-third is very introductory stuff explaining what LLM-as-judge is, so you can definitely read that if you want. By the way, I'm sure people are asking what the link is. Oh — go ahead, go ahead, because we are in the section
where we were talking about last week's stuff. Go ahead and share if you're ready.
>> Yes, yes. So I'll just share. Basically what I did was take a Qwen 8-billion base model and try to train it on JEE Advanced problems. JEE is an engineering entrance examination at the grade-12 level, which we in India take to enter universities — it's one of the hardest exams in the world. So what I did was run a benchmark with the base model, then with SFT, and then also with SDPO. The results were interesting, and why they were interesting is:
SFT was still better than base Qwen, and SDPO showed a regression in a lot of areas. This might be due to some bias in the data, but what I also noticed was that the model was very verbose and was generating way more tokens than the base model. For example, with the base model the cutoff percentage was low; however, with this one the cutoff percentage was very high, because there were a lot of questions where the model just kept generating tokens and hit the 1024-token cutoff I had set, since I was running it on my local Mac. I'll share the methodology and everything I did in the chat, along with all the datasets, if you want to dig a little deeper.
>> That's interesting. So did you have
a train and a test set for this exam?
>> Yes, yes.
>> Yeah, because I think the thinking around SDPO and GRPO is sort of: you're trying to take a non-reasoning model and teach it how to reason, but you don't actually have a train set, so you use environmental feedback so it can tell you what you possibly did wrong. You wrote this code and it says, "Hey, you got a division-by-zero error," or, "Hey, you got the wrong answer when I input five" — some kind of feedback. So I don't know if you had any kind of feedback for the wrong answers other than just "it's wrong."
>> No. So what I did, in order to get that feedback, was run all
the questions I had in the question bank through Claude Opus to get a full answer with the entire reasoning. I asked Opus to give chain-of-thought answers, so that it gives feedback not just on whether the answer is right or wrong but on the entire chain of thought. So that is what I did.
>> Cool. Okay. Yeah, so it sounds like if you have some kind of expert or teacher like Opus, then you're still better off using that —
>> Yes.
>> — and your SFT. It's really when you're in a bind, you're trying to push the frontier, and you have no expert to go to, that you use this self-distilled thing to try to get a little bit of progress, maybe.
>> That's cool. I'm glad you got a chance to try it.
>> Yes, thank you.
>> Okay, jump back in. Paste the link and I will share.
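[Editor's note: a minimal sketch of the teacher-feedback loop described above — collect the student model's wrong answers, ask a stronger teacher model for full chain-of-thought solutions, and turn those into SFT examples. `ask_teacher` is a hypothetical stand-in for a real API call (e.g. to Claude Opus), stubbed here so the sketch runs standalone; the question data is illustrative, not from Vipul's run.]

```python
# Sketch: distilling a teacher's chain-of-thought into SFT data for
# the questions the student model got wrong.

from dataclasses import dataclass

@dataclass
class Attempt:
    question: str
    student_answer: str
    correct_answer: str

def ask_teacher(question: str) -> str:
    """Placeholder for a real teacher-model call that requests
    step-by-step reasoning. Returns canned text in this sketch."""
    return f"Step 1: analyze '{question}'. Step 2: derive the result."

def build_sft_examples(attempts: list[Attempt]) -> list[dict]:
    """Keep only wrong answers; pair each question with the teacher's
    full chain-of-thought as the SFT target."""
    examples = []
    for a in attempts:
        if a.student_answer.strip() == a.correct_answer.strip():
            continue  # student already got it right; no feedback needed
        examples.append({
            "prompt": a.question,
            # full reasoning trace, not just the final answer
            "target": ask_teacher(a.question),
        })
    return examples

attempts = [
    Attempt("What is 2+2?", "4", "4"),            # correct -> skipped
    Attempt("Integrate x dx", "x", "x^2/2 + C"),  # wrong -> distilled
]
data = build_sft_examples(attempts)
print(len(data))          # 1
print(data[0]["prompt"])  # Integrate x dx
```

The design choice here matches what was said: the teacher provides feedback on the entire chain of thought, not just a right/wrong signal, so the SFT target is the full reasoning trace.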
>> Yes. Whenever you're ready.
>> Okay. Oh, I can allow myself to share. There's some amount of overlap with what was just covered. Where is my — wait, I accidentally closed it. Okay, so, TLDR: we are covering rubrics, and there is some self-teaching going on — spoiler alert, everyone's doing synthetic rubrics and I didn't know about it. You can read the full post; I already posted it in Zoom. Basically, he starts the post by comparing RLVR versus RLHF, saying rubrics are kind of a middle ground between verifiers and preference labels, which are the two extremes. And most people's impression of rubrics is this —
I think this is a good baseline. This is from HealthBench, where you have an example and a response, and then you evaluate the quality of the response and give it a number. A lot of people would use humans to bootstrap this process, but very quickly — one, the human work is super boring, and two, humans are expensive — you start to replace it with model grading. So this is a version of LLM-as-judge focused on breaking out analysis of a response into individual rubrics, giving numbers to them, and scoring those numbers. This is a survey of all these papers, which we're going to cover a little bit; I'll just give high-level takeaways, because we don't have that much time and, to be honest, I don't know that much either. I do think this Scale paper, even though it's fairly recent, is a really good baseline for what's at stake and what the industry standard is, and then we can discuss all the other variations
of this rubric setup. So they call it RaR, which I think is as good a name as any. There are all these other cute names that other people try to use, but RaR is probably going to stick. I think this is kind of the mathematical version of what you would expect, but there are different terminologies and techniques they discuss. So: do you want to have a reference — for example, do you want to give a golden reference answer? Do you want a fixed set, or an explicit set? And the TLDR is that with rubrics, as opposed to other forms of preference rating or fine-tuning, you do get an immediate jump, sometimes matching or exceeding what human investment would take — which is surprising for Scale to admit. I do think what is
emerging is sort of rubric classes as well. So here they have things like expert grounding, importance, self-contained, coverage, and we'll cover some of this in a future paper as well. Basically, you want good coverage of what is an absolute must — a pass/fail thing — while the others are nice-to-haves, optional things, and everyone is focusing on what the right categories of rubrics are to include in order to be effective. So we go on to the second paper; it's from Ant Group. Here they explicitly say that the sheer number of rubrics doesn't matter, because obviously, if you test the same thing three times, it's the same as testing it one time. You must carefully create high-quality rubrics — which surprises absolutely nobody. I do think this rubric-design workflow is probably what you're going to end up doing if you set up a rubric system. They publish their example rubric system, which
honestly looks quite different from other people's rubrics, but I just liked collecting all of these, so you can pick whichever you think is better as a rubric setup. Everyone has one — OpenAI has published a couple of theirs, by the way, which I find pretty helpful in setting up my own rubrics. And I like this idea of a veto mechanism, where one rubric just overrides the others. Basically, rubric aggregation — if you come back to this model — is a lot of: okay, do you simply add everything up, do you average it, how do you weight the rubrics, and how do you make rules on them? So they bring in the concept of a veto mechanism: I don't care if everything else is great; if this one thing fails, everything fails. Or saturation, or pair-wise relationships, where things are correlated and so you're actually double-counting a single criterion. I think all these factors are really interesting, and this is basically
the only rubric-quality paper we're going to cover today. Everything else evolves into categorizations and principles of what kinds of rubrics you want. So here, for example — you're going to see these people come up again; it looks like this group has been pursuing rubric research. First of all, they publish OpenRubrics, which I think is a very good name to riff off of — open-anything is good: Open Thoughts, open whatever. They publish a lot of useful data, and I really like the distribution visualizations here. Basically everyone should do this, and these are the only ones who do, which is kind of sad, but I'll try to do it for my own publishing as well. They give summary statistics: here's the distribution, here's how many rules, here's the distribution of everything, and you get a sense of whether there's any clustering and how much work went into the thing. You can visually see, if you had this same chart for another set of rubrics, whether they test a completely different
thing or mostly the same thing. So you kind of should do this; people just don't. What else can I say — yeah, hard rules versus principles. Basically, hard rules are a form of veto: the verifiable things that people constrain on, while principles are more LLM-ish, "just figure things out." Not a single paper in this survey, by the way, addresses the issue of stochastic non-determinism: you run the judge once, you get one answer; you run it again, you get a different answer. Like, what the hell? Nobody has addressed that. Probably the answer is to run it three times and take the majority, but that's a pretty unsatisfying answer. The next one is DR Tulu — this is my favorite paper, obviously from the Allen Institute. We love AI2 in this house. This one is basically rubrics
for deep research — that's the "DR" in DR Tulu. And what deep research really is, is a super long rollout, right? So what they focus on is search-guided rubrics, where you kind of have to figure out the rubrics as you go, and train on that policy, on that path. I am surprised this works, to be super honest. They don't give a lot of detail — at least in the abstracts I saw there wasn't a ton — but AI2 is super open, and I'm sure we can ask Nathan about it in Discord. In contrast to the hard-rules-versus-principles framing, their framing is positive rubrics versus negative. This is also very similar to SWE-Bench Pro, I think, which came out from Scale, where they have pass-to-pass and fail-to-pass. Fail-to-pass is the positive rubric, where you capture a
change that you want to see, pass-to-pass being that you don't want to see regressions. So here, negative rubrics are kind of there to guard against reward hacking, and in that sense that's what they were targeting; obviously it also has positive examples. I also like that they measure the power of generated rubrics. Here we're really in it — these next few papers, these three papers, are all in the synthetic-rubric-generation field. Here, literally, we're doing deep research: step one, step two, generate more rubrics; step three, generate more rubrics. It's really a lot of compute and generation going on in every single one of these steps, which I think is fascinating. So I like this idea of measuring the discriminative power of a rubric by computing its advantage; no one else has really come across that, and it actually shows that the number of rubrics doesn't blow up, because we know the number of rubrics is not something we really care about. We want additional advantage
every single time, and it's pretty constant after a while, which I think is good. And they really show that they clamp down on reward hacking a lot, which they like to see. Obviously, we have no idea if the big labs actually use this kind of stuff for deep research, but the fact that it's working for DR Tulu is pretty meaningful, I think, as far as this sort of synthetic, very long-running, rubric-guided RL is concerned. Okay. The last main paper — and then there are a couple of miscellaneous papers which honestly you can ignore — is alternating RL. This is a version of DR Tulu, basically, where they alternate between generator and judge. It's the same group as the paper above; you can visually see it's the same format. Yeah, I wasn't super impressed by these guys.
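[Editor's note: the aggregation ideas discussed above — weighted sums over rubric scores, a veto mechanism where any failed hard rule sinks the reward, and majority voting over repeated judge runs to hedge against judge non-determinism — can be sketched as below. The rubric names and weights are illustrative only, not taken from any of the surveyed papers.]

```python
# Sketch: rubric aggregation with weights, hard-rule vetoes, and
# majority voting over stochastic judge verdicts.

from collections import Counter

def aggregate(scores: dict[str, float],
              weights: dict[str, float],
              hard_rules: dict[str, bool]) -> float:
    """Weighted mean of rubric scores, vetoed to 0.0 if any hard rule fails."""
    if not all(hard_rules.values()):
        return 0.0  # veto: one must-pass failure sinks the whole response
    total_w = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_w

def majority_pass(judge_runs: list[bool]) -> bool:
    """Run the judge several times and take the majority verdict — a cheap
    hedge against the judge giving a different answer on each run."""
    counts = Counter(judge_runs)
    return counts[True] > counts[False]

scores = {"accuracy": 0.9, "clarity": 0.6}
weights = {"accuracy": 2.0, "clarity": 1.0}

print(round(aggregate(scores, weights, {"no_fabrication": True}), 3))   # 0.8
print(round(aggregate(scores, weights, {"no_fabrication": False}), 3))  # 0.0
print(majority_pass([True, False, True]))                               # True
```

The veto answers the double-counting worry from earlier: correlated soft rubrics can only shift the weighted mean, but a single must-pass failure overrides everything.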
I think basically DR Tulu is a superset of what the other two did, except that OpenRubrics gives you a lot more examples and nicer charts, which I think everyone should copy. But otherwise: synthetic rubric generation, computing by advantage, having classifications like vetoes or principles. I would steal this categorization, where there's a lot more detail — this is effectively what we do at POG, which we're going to publish in a couple of months, I guess. Yeah, and then there's the other stuff. This is where I talked about the cute names: people try to call it RL from checklist feedback to avoid the word "rubric," or they call it CARMO, context-aware reward modeling. Doesn't matter; everyone tries to do their own thing. That's it. I think we ended on time. Do people have questions? I haven't been looking at the chat. Ah, PaperBench. There we go. Uh, two
more recommendations. Am I able to share the link, or is it a work document? No, actually, I made this standalone so you guys can — how do I share it? Share. It's just screenshots from the document; if you just read the blog post, it's the same screenshots. Oh god. Okay, there we go. Yeah. Any other questions? Please unmute and whatever.
>> Okay.
>> I had a question just about — oh, sorry.
>> Yeah.
>> When you spoke to Jeff, you pushed a couple of times asking him about tackling non-verifiable domains, and he didn't really answer it.
>> Yeah.
>> I know he can't share it, but he also said something about getting other models to evaluate the behavior of each other. Is that what he's
talking about? Like, what do you think is happening here? Is it converging toward this method or not?
>> It's definitely converging — look at all these papers. And I can see it myself: immediately after reading this post, there are like three things I want to change, because this just makes more sense, especially the compute-advantage one from DR Tulu. Yeah, I mean, look, he's not going to share anything, because honestly he's like five levels above the people actually doing it, so I don't want to read too much into it.
>> This has also been all the hype in the RL-environment companies. Every few weeks I'll catch up with people actually selling RL environments to labs, and they're so up to date on which rubric paper came out, like, the day of. So they're definitely used in environments.
>> Yeah. Let's get on the Zoom. People posted some recommendations — or is
it — who posted two more recommendations?
>> I had a phase of posting a bunch of these in Discord a while ago.
>> Yeah, let's cover these on the Zoom recording. One is a Scale AI one — this is more high-level. The other one is OpenAI's: you were saying at the time that no major lab has shown this, or whatever, but PaperBench — if you read the abstract — is basically how they go about evaluating with rubrics.
>> Yeah, our podcast tomorrow is with Mia and one of the team members, so we'll cover this tomorrow. This other one I haven't seen before. Cool. Well, if people are interested in rubrics, they can talk more in the Discord. But otherwise, we are going to end on time, and we need volunteers to cover next week. Okay, thank
you so much. >> Take care everyone.