Interpreting and running standardized language model benchmarks and evaluation datasets for both generalized and task-specific performance assessments!
Resources:
lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
lm-evaluation-harness setup script: https://drive.google.com/file/d/1oWoWSBUdCiB82R-8m52nv_-5pylXEcDp/view?usp=sharing
OpenLLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
YALL Leaderboard: https://huggingface.co/spaces/
Hello everybody, Adam Lucek here, and today we're going to be talking about standardized evaluations for language models. When new models come from the major labs like Anthropic, OpenAI, or Meta, they tend to come with large charts showing all sorts of numbers and percentages across all sorts of categories. Here you tend to see things like MMLU, GPQA, and HumanEval, plus comparisons of these metrics between models, to better understand and compare performance across a variety of tasks. As someone who's been getting into post-training and fine-tuning language models myself, understanding how these evaluations are created, how they're run, and how to interpret them is crucial for accurately showing how well your model performs after some sort of training.

Over the last few weeks I've released a couple of fine-tuned versions of Llama 3.2 1B, the one-billion-parameter model from Meta, and while running all of the evaluations I thought it would be worthwhile to put together a resource showing how you can run these evaluations yourself, what the different benchmarks mean, and how to interpret and compare them on things like leaderboards.

So the first question is: when we're looking at a chart like this, GPT-4o's release chart showing all of these benchmarks, what do these benchmarks actually mean and how can we interpret them? Taking a look at the first one, MMLU, that corresponds to the paper "Measuring Massive Multitask Language Understanding." MMLU has 57 different categories, some of which you can see here, like computer security, college physics, and high school chemistry, all filled with multiple-choice questions. The intent is to assess how accurately the language model you're evaluating can answer these questions, and then be able to see across
all of these categories where it might excel and where it might be lacking. The questions are all set up with one question stem and four multiple-choice options, one of which is of course the correct answer. The model's accuracy on these questions is usually then grouped under MMLU's main categories, humanities, social science, STEM, and other, and the final reported number is usually the combined average accuracy across all of the categories, which becomes the score reported on these evaluation charts. So we can see that GPT-4o has an average accuracy of 88.7% across all of the MMLU categories, and we can compare that to something like GPT-4 Turbo here, which sits at about 86.5%, so GPT-4o is slightly more accurate and better at this general multi-category understanding on the MMLU benchmark.

So what we can really say is that by looking at the MMLU evaluation, we can analyze a model's academic and professional understanding, and then see a bit more specifically how well it performs compared to a few comparable models. Generally people like to compare models of similar parameter sizes or similar calibers; as we can see right here with GPT-4o's evaluations, it's compared against Claude 3 Opus, Gemini 1.5 Pro, Llama 3 400B, all of the massive state-of-the-art LLMs.

An important distinction to make is that when you look at general benchmarks like this one for the base Llama 3.2 1B, it'll say "number of shots," "n-shot," or "few-shot"; for example, this one says five-shot accuracy. What this means is that for each question, some number of additional questions along with their answers are provided as context. These generally follow the same subject or category as the question you actually want answered, and that's where "few-shot" evaluation comes in. From the MMLU paper: for five-shot evaluation, they add five demonstration examples with answers and then ask the final question. As we know, language models can perform decently, if not poorly, at zero-shot evaluation, just answering a question right off the bat, but given a few examples (few-shot), or by going through some sort of chain-of-thought reasoning, they tend to perform a bit better.
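To make the few-shot setup concrete, here's a toy sketch of how a five-shot prompt gets assembled. This is my own illustrative formatting, not the harness's or the paper's exact template; the function names are hypothetical.

```python
def format_item(question, choices):
    """Render one multiple-choice item with A-D labels, ending at 'Answer:'."""
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    return "\n".join(lines) + "\nAnswer:"

def build_few_shot_prompt(demos, question, choices):
    """demos: list of (question, choices, answer_letter) tuples.

    Each solved demonstration is stacked in front of the real question,
    which is left open after its final 'Answer:' for the model to complete.
    """
    parts = [format_item(q, ch) + f" {ans}" for q, ch, ans in demos]
    parts.append(format_item(question, choices))
    return "\n\n".join(parts)
```

The model then only has to continue after the final "Answer:", which is also what makes the probability-based scoring used by multiple-choice benchmarks cheap to compute.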
Other benchmarks tend to follow the same kind of setup we just went over with MMLU: a specific or domain-specific dataset of questions aimed at benchmarking a language model's performance on some subject, topic, or task. Take GPQA on Claude 3.5 Sonnet's chart: that's a Graduate-Level Google-Proof Q&A benchmark, essentially built to measure how comparable a language model's performance is to actual graduate-level understanding and reasoning, with a PhD-curated dataset.

To list off a few more popular benchmarks: there's AGIEval, which is designed to assess foundation models in the context of human-centric standardized exams such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. There's GSM8K, Grade School Math 8K, named for its roughly 8,000 problems, which assesses a model's performance on multi-step mathematical reasoning with relatively basic math problems. There's HellaSwag, which is a bit of a funny name but actually tests commonsense natural language inference: it provides the first part of a sentence and asks which candidate ending makes the most sense, a task that's trivial for humans but a real test of whether a model has picked up common sense. And there are plenty more specialized examples: yes-or-no question answering (BoolQ), reasoning about physical common sense in natural language (PIQA), measuring whether models mimic human falsehoods when generating answers (TruthfulQA), and all sorts of other evaluations you can run to test what a language model has actually picked up in these different areas.

While these language- and reasoning-based evaluations are good for showing how well a language model does its language modeling, there are also task-specific and multimodal datasets. SWE-bench measures how well a language model can resolve real-world GitHub issues, which is all about coding. There's also the Gorilla Berkeley Function Calling Leaderboard, which measures how well a language model performs different function-calling tasks. And beyond text, there's MMMU, a massive multi-discipline multimodal understanding benchmark for testing combined image-and-text understanding in vision-language models, plus others for things like audio or video understanding. So as you can tell, there are a ton of
different datasets, benchmarks, and evaluations you can run across language models, vision-language models, or whatever other skills and tasks you want to test, generally to compare one model's performance against another similar model. It's important to note that these are very different from the evaluations I reported on in this video here, which used something like LangSmith to evaluate a language model's performance inside an application. While those evaluations are great for assessing how well your application is performing, the benchmarks we're talking about here are designed to show how well your model itself is built and performing.

So before showing how to actually find and run evaluations yourself, a little bit about how you would use and compare them. Going back to my example: my ORPO Llama 3.2 1B 40K and 15K models, which are ORPO fine-tuned versions of Llama 3.2 1B on the ORPO-DPO-mix-40k dataset, one on a shuffled 15K subset of entries and one on a full epoch of all 40,000, well, about 43,000 actual examples. I ran these through all sorts of benchmarks covering AGIEval, TruthfulQA, MMLU, and ARC-Challenge, and what we can then do is compare how well this fine-tuning actually worked at increasing understanding across these general reasoning benchmarks. Of course, the best way to compare language models is to run the same evaluations across both of them, so that is exactly what I did, and each entry has a note with the exact setup of how I'm measuring these things. For AGIEval, let's take a look at that first: as mentioned before, this is the one
that has all of the human-centric standardized exam questions, covering subjects like the SAT, law school admission tests, the GMAT, the GRE, and more. Jumping back to our table, let's look at the normalized accuracy across these benchmarks, coming from a zero-shot average across all of the reasoning tasks in the AGIEval dataset. Normalized accuracy accounts for the length of the model's answers, which can slightly alter things because of how the accuracy calculations are run; we'll get into how that's set up once we start running an evaluation ourselves. What we can see is that my 15K model has a normalized accuracy of about 21.01% on this zero-shot AGIEval average, and then, very nicely, my 40K model, where I was hypothesizing and hoping for an increase in accuracy, comes in at about 23.2685%, which is great.

So really, what these benchmarks and evaluations let us do is compare and contrast similar language models to understand what we should expect from their performance across different tasks, or across general capabilities like reasoning, and to better understand how a model is improving or set up. In this case I'm running these benchmarks directly to compare my training methodology, to see whether further training on the specialized dataset actually increases reasoning ability with something like
the AGIEval benchmark. This is how you should really start interpreting and understanding these benchmarks yourself: not necessarily as end-task performance, that's where something like LangSmith and those application evaluators come into play, but as direct model comparison across things like general knowledge, code, math, reasoning, or tool use, where you can compare similar models side by side to see what to expect, or what improvements the labs have managed to push.

There are also many open-source leaderboards, like the Open LLM Leaderboard or YALL (Yet Another LLM Leaderboard), that run and compare models directly through averages over a handful of these benchmarks. YALL looks at a benchmark suite covering the average of AGIEval, GPT4All, TruthfulQA, and Bigbench, which are all averaged together to rank the leaderboard, while the Open LLM Leaderboard looks at IFEval, BBH (Big-Bench Hard), MATH, GPQA (which we saw with Anthropic), and MMLU-Pro, a harder variant of MMLU, to compare models. So that's a little about how these leaderboards are set up and how models compare differently on them.

The final point on comparison is that each individual benchmark tells you a little about one specific task or one specific thing you're trying to measure in your model, but it isn't going to show the whole picture across every single thing, and especially not the end use of your model in whatever application you're building. So it's good to understand how to interpret and compare these to get a general understanding of how your
model might perform, but it's best to start running specific evaluations yourself if you're interested in one particular capability, which is exactly what we're going to do now. One of the things I wanted to test: looking at my Llama 3.2 1B 40K model alongside the base Llama 3.2 1B, I see that their base pre-trained model statistics report the ARC-Challenge 25-shot accuracy metric, while in my 40K model's numbers I ran ARC-Challenge at zero-shot. So now what I want to do is take my 40K model and compare it more directly against the base pre-trained model, with the same setup, to see whether my fine-tuning actually made a difference here.

The ARC benchmark and dataset come from this paper here; ARC stands for the AI2 Reasoning Challenge. It has two subsets, ARC-Challenge and ARC-Easy, consisting of roughly 8,000 science questions, and we can see from the distribution that these span about third- through ninth-grade level. Looking at the raw dataset uploaded to Hugging Face, the first example follows the usual format: a question, "George wants to warm his hands quickly by rubbing them. Which skin surface will produce the most heat?", then a few choices, "dry palms," "wet palms," "palms covered with oil," and "palms covered with lotion," along with A/B/C/D labels, and then the answer key, the ground truth we'll of course use to calculate accuracy, which here is A, dry palms.
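For reference, here's that first record laid out as a plain Python dict (field names follow the Hugging Face dataset viewer shown above), along with the trivial accuracy computation the benchmark ultimately reduces to:

```python
# One ARC-Challenge record: a question, four labeled choices, and an answerKey.
item = {
    "question": ("George wants to warm his hands quickly by rubbing them. "
                 "Which skin surface will produce the most heat?"),
    "choices": {
        "text": ["dry palms", "wet palms", "palms covered with oil",
                 "palms covered with lotion"],
        "label": ["A", "B", "C", "D"],
    },
    "answerKey": "A",
}

def accuracy(predicted_labels, items):
    """Fraction of items where the model's picked label matches answerKey."""
    hits = sum(p == it["answerKey"] for p, it in zip(predicted_labels, items))
    return hits / len(items)
```

Everything the harness does for this benchmark is in service of producing those predicted labels and averaging the matches.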
So now that we understand the benchmark we're going to apply, what it will tell us about our model, and of course the subset, the Challenge subset, we want to run it on, what we have to do now is actually run the benchmark, which is where the LM Evaluation Harness from EleutherAI comes in. The harness is a framework for few-shot evaluation of language models: a lot of people have implemented these datasets and the machinery to run these evaluations into this repository, and it gives you a nice standardized way of running and comparing all of these benchmarks. I'll have the link in the description below, and we're going to go into setting it up, but I definitely encourage you to take a closer look at the details that have gone into this repo, as it's a very powerful tool if you have the time to get up to speed with it.

Within the repository you can go to the lm_eval directory and then into the tasks folder to see all of the tasks that have been set up for you to run: all sorts of benchmarks already implemented that you can just call by name. Right near the top we can see the AI2 Reasoning Challenge, the arc folder, and for most if not all of the evaluations there's usually a little README that explains a bit about the benchmark, with links to additional resources or the papers published along with them. Further down you can also see the specific task names you'll have to pass in your configuration for whatever you want to test. Of course we're interested in ARC-Challenge, and as mentioned we want 25 shots, which, as a reminder, means 25 example questions of similar subjects along with their answers are provided before the question being asked, prior to getting the final answer from the language model.

With all of that found and understood, it's time to swap over to some code to get this up and running. I'll be hopping over to my terminal here, which is backed by a cloud GPU; if we pull that up, you can see I've got an NVIDIA A100 with 40 gigabytes of VRAM running on a server I'm connected to. You can use whatever provider you prefer, or even your own home system, but you're generally going to need a bit of power, something like
a server-grade GPU, since these evaluations involve running the models themselves. With that said, the first thing I'm going to do is set up a Python virtual environment and install a couple of libraries. I'll have all of these setup lines linked in a file in the description below, as it takes a few steps to get everything in place nicely, and I'll be back once the dependencies install. Perfect. We'll be using the Accelerate library from Hugging Face to run these evaluations, but first we need to write a default config; you can of course configure this however you'd like depending on how many GPUs or what specifics you have, but the general default is usually fine. We then clone down the eval harness repo and install all the dependencies there, so let's run that, watch the repo pop up, and let everything install nicely from the requirements. Finally, we use the Hugging Face command-line interface to log in and pass our Hugging Face token through, so we can access the models from the Hugging Face Hub. Now we can check that everything is working by running lm_eval with -h, the help argument, which shows us all of the options and usages we have here, including all of the different arguments you can pass in.
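The steps above look roughly like the following. This is a hedged reconstruction rather than the exact script from the description: the package names and commands are standard, but check the linked setup file for the precise versions and order I used.

```shell
# create and activate a virtual environment, then install the basics
python -m venv venv && source venv/bin/activate
pip install torch transformers accelerate huggingface_hub

# write a default accelerate config (single-GPU defaults are usually fine)
accelerate config default

# clone the evaluation harness and install it with its dependencies
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .

# log in with your token so models can be pulled from the Hub
huggingface-cli login

# sanity check: print the harness's help text
lm_eval -h
```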
To actually run the evaluation with all of these options, we need to put together a command passing in the right arguments to set up our ARC-Challenge 25-few-shot run. So that's what we've got here: we're using accelerate to launch the evaluation so it's effectively optimized for the GPU, and then a few arguments. We specify that the model is a Hugging Face model; our model arguments point to a pretrained model, the Hugging Face directory of my Llama 3.2 1B 40K model, with trust_remote_code set to true in case loading the model hits anything it doesn't want to run, and the data type set to auto, which will generally resolve to bf16 or fp16.

One note on precision: Meta doesn't actually tell us what data type the base pre-trained model runs in, but we can see their instruction-tuned models run in bfloat16, and if you go to my model you'll see I'm running in float16. The precision of your model is another variable to consider when trying to compare benchmarks most fairly, but sometimes we don't have the perfect information that these labs are outputting, so we'll do our best; float16 should be fine.

We then pass in our tasks, which for me is ARC-Challenge; this of course comes from that arc folder, where we can specify the arc_challenge task specifically. A few more settings: the number of few-shot examples is set to 25, the device is explicitly cuda:0, the only GPU we have running here, and the batch size is automatic, so the harness will assess our GPU's capabilities and batch the evaluations according to how much it can take. Finally, we specify an output path so all of the nice statistics and information from our evaluation get written to a JSON file, and I'm setting that to arc_25. With all of that set up, we hit enter, and it starts parsing everything out, downloading the model, and making sure everything's ready; you can see it start downloading, which is going to take a couple of minutes, so I'll cut the boring parts and tell y'all when anything interesting happens.
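Put together, the invocation looks like this. The flag names follow the harness's CLI, but double-check them against `lm_eval -h` for your installed version, and note the model path here is a placeholder, not my actual repo ID:

```shell
accelerate launch -m lm_eval \
  --model hf \
  --model_args pretrained=<your-hf-username>/<your-model-repo>,trust_remote_code=True,dtype=auto \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --device cuda:0 \
  --batch_size auto \
  --output_path arc_25
```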
Sweet. With that finishing up, the harness builds all of the ARC-Challenge questions and context, essentially doing all of the prompt creation for what will be passed through the language model. One of the interesting nuances here, especially for a multiple-choice dataset like this (and, if you noticed, a lot of these benchmarks are set up similarly), is that we're not actually going to run any text generation for this assessment. Instead, the harness issues log-likelihood requests: it passes the full prompt together with each individual candidate answer through the language model, in batches as it's set up here. Rather than generating text, somehow extracting an answer out, and then comparing it to the ground truth, we look at the probabilities the language model itself assigns to generating each of the answer choices. This is super efficient because we don't have to run any generation at all: we pass in the entire prompt plus an answer, see how confident the model is about predicting that answer, choose the one with the highest likelihood as the answer the model "would say," and then determine accuracy by comparing that predicted answer against the ground truth.
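Here's a toy sketch of that selection step. The scoring function is a stand-in (the harness's real API differs); the point is that choosing an answer is just an argmax over log-likelihood scores, with an optional length normalization for the "normalized accuracy" variant mentioned earlier.

```python
def pick_choice(logprob_fn, prompt, choices):
    """Return the index of the choice the model scores highest.

    logprob_fn(prompt, choice) stands in for the model's total
    log-probability of generating `choice` after `prompt`.
    """
    scores = [logprob_fn(prompt, c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)

def pick_choice_norm(logprob_fn, prompt, choices):
    """Length-normalized variant (acc_norm): divide each score by the
    answer's length so longer answers aren't penalized just for having
    more tokens, each of which contributes a negative log-probability."""
    scores = [logprob_fn(prompt, c) / len(c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)
```

This length correction is why the same run reports both "acc" and "acc_norm", and why the two can disagree slightly.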
With this just about finishing up, we can see that for 25-shot ARC-Challenge we got a normalized accuracy of about 38.48%, and for the Llama 3.2 1B foundation model they report a 32.8% accuracy at 25 shots. So we have an increase of almost six, well, not a 6% increase, but nearly six percentage points, so we already know our training has done something to improve this. And if we take a look back, we can compare this 25-shot result to our zero-shot numbers: it squeezes out about another percentage point over the 40K model's zero-shot normalized score, and about two points over the 15K's.

Looking at the JSON file output, we can see more of the metadata and specifics: all of the different accuracies, normalized accuracies, and standard errors we have here, as well as how the dataset is being pulled from the Hub, how everything is split up, and how the prompts that get run through the language model for these log-likelihood probabilities are set up; you can look through all of that in there. One thing to bring up is that there are multiple ways people implement scoring across the different evaluations. For something like the IFEval metric, whose respective JSON file I generated while running a bunch of different metrics on my 40K model, you can see it also passes in a process_results function, and if we ask our good friend ChatGPT to print it out nicely, you can see
that it's a custom function for taking in the input and returning the different accuracies, the actual metric computations. There are a few other ways these are calculated too: for something like GSM8K, they do regular-expression-based matching, running a literal generation out of the language model and then extracting the answer to compare through regex filtering. So check out and look into all of the different ways these evaluations are set up; this tends to live in each individual task's files, and for GSM8K there's a YAML file where you can see the setup and implementation of how the accuracy metrics are put together. It's crucial to understand a little about how these things are set up so you can more effectively interpret how your benchmarks are doing.
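As a rough illustration of that generate-then-extract style of scoring, here's a minimal sketch. The regex below is my own illustrative pattern, not the harness's actual GSM8K filter: it pulls the last number out of a free-form generation so it can be compared to the reference answer.

```python
import re

def extract_final_number(generation):
    """Return the last number in the text (commas stripped), or None.

    Matches integers, comma-grouped thousands, decimals, and a leading minus.
    """
    matches = re.findall(r"-?\d+(?:,\d{3})*(?:\.\d+)?", generation)
    return matches[-1].replace(",", "") if matches else None

extract_final_number("First, 3 + 4 = 7. So the answer is 7.")  # -> "7"
```

Scoring is then just string equality between the extracted value and the ground-truth answer, which is why the exact extraction pattern matters so much for these tasks.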
With our successful 25-shot ARC-Challenge benchmark, you've seen an example of using the LM Evaluation Harness to both set up and run standardized benchmarks, to more effectively assess the performance of your language models and compare and contrast models against each other. So the next time you see a confusing-looking chart comparing multiple models across all of these different metrics, you'll know exactly where to look, how to verify and validate, and how to interpret the results. That being said, I'll have further resources, as well as the scripts I used, in the description below. If you enjoyed the video, leave a like; if you want to see more and support the channel, consider subscribing. Thank you, and have a great day.