To see results on GitHub, go to https://github.com/steampunque/benchlm

To see results on Hugging Face, go to https://huggingface.co/spaces/steampunque/benchlm

Independent LLM benchmarks for a wide range of open weight models using custom prompts,
including category and discipline summaries.  The model list is actively updated
with the latest model releases.  Results for older, obsolete models are not kept.

The primary model families being tracked as of 6/2/25 are:
Meta (Llama), Mistral, Google (Gemma), Qwen (Qwen2.5, Qwen2.5 coder, Qwen3, QwQ)

Secondary model families being tracked as of 6/2/25 are:
Microsoft (Phi), DeepSeek (Qwen R1 distills), Falcon family, InternLM family, GLM family, Ultravox family.

Tracked models are selected for high general popularity, high performance,
or other innovation, and must have non-restrictive open source license terms.

Tests are run using a modified llama.cpp server (supporting logprob completion mode).
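
One use of logprob completion is choosing between two candidate contexts by comparing
how likely the same continuation is under each.  The sketch below is illustrative only
and is not the modified server's code: the function names and the hard-coded numbers
are hypothetical, and in a real run the per-token logprobs would come from the server.

```python
from typing import Sequence


def completion_logprob(token_logprobs: Sequence[float]) -> float:
    """Total log-probability of a completion: the sum of its per-token logprobs."""
    return sum(token_logprobs)


def pick_option(logprobs_a: Sequence[float], logprobs_b: Sequence[float]) -> int:
    """Return 0 if the shared completion is at least as likely after context A, else 1."""
    if completion_logprob(logprobs_a) >= completion_logprob(logprobs_b):
        return 0
    return 1


# Hypothetical per-token logprobs of the SAME completion under two candidate contexts.
after_context_a = [-0.2, -1.1, -0.4]
after_context_b = [-0.9, -2.3, -0.7]

choice = pick_option(after_context_a, after_context_b)
print("chosen context:", "A" if choice == 0 else "B")
```

Summing logprobs rather than multiplying probabilities avoids numerical underflow on
longer completions.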

MODEL CATEGORIES:
   GENERAL : general purpose text in text out
   THINK   : RL tuned reasoning models with <think> </think> block or equivalent
   CODE    : coding optimized models
   MATH    : math optimized models applied to Hendrycks MATH500 set
   VISION  : image + text in, text out
   AUDIO   : audio + text in, text out

METHODOLOGY:
   -All CoT, code, and math tests are zero shot.  A few BBH tests use fewshot examples.
   -Math CoT tests such as GSM8K, APPLE, MATH etc. are self graded against the correct answer using the LLM under test.
      If self grading does not work reliably (such as with a very small model) the result is zeroed to mark an invalid test.
   -All non-CoT MC tests do two queries: the first with answers in test order and the second with answers circularly shifted by 1.
      To score a correct answer in MC, both queries must be answered correctly.
   -CoT MC tests (e.g. MMLUPRO, GPQA, etc.) do one query only.
   -Winogrande uses logprob completion (evaluates the probability of a common completion for the two possible cases).
   -The new SQA test is not run on all models.  When it is run, the result is reported only under the SQA test.
      The SQA result is not added to the knowledge and composite averages.  The test is not meant for
      small models and results are given for information only.
   -MMMU uses the validation split (30 questions from 30 categories) and CoT prompting
   -MMMUPRO uses the 10-question split and CoT prompting
   -All new tests are run with a maximum of 250 questions per test category on CoT tests.
      This is necessary to contain test time with new thinking models, which can generate very lengthy responses.
      The result is printed in italics if there were more than 10 skipped questions in a test category.
      Note some very old runs had skips due to JSON errors in questions but these will not significantly impact averages.
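
The two-query rule for non-CoT MC tests can be sketched as follows.  This is a minimal
illustration, not the benchmark's actual code: the names shift_answers, score_question
and the two stand-in "models" are hypothetical, and the shift direction (last answer
moved to the front) is an assumption, since only "circularly shifted 1" is specified.

```python
from typing import Callable, List

# A "model" here is any callable that returns the index of the answer it picks.
AskFn = Callable[[str, List[str]], int]


def shift_answers(answers: List[str], k: int = 1) -> List[str]:
    """Circularly shift the answers right by k (assumed direction)."""
    k %= len(answers)
    return answers[-k:] + answers[:-k] if k else list(answers)


def score_question(ask: AskFn, question: str, answers: List[str], correct: int) -> bool:
    """Count the question as correct only if both orderings are answered correctly."""
    first_ok = ask(question, answers) == correct
    shifted = shift_answers(answers, 1)
    shifted_correct = (correct + 1) % len(answers)  # where the right answer moved to
    second_ok = ask(question, shifted) == shifted_correct
    return first_ok and second_ok


# Stand-in models for demonstration only.
def always_first(question: str, answers: List[str]) -> int:
    return 0  # position-biased: always picks the first slot


def knows_answer(question: str, answers: List[str]) -> int:
    return answers.index("Paris")  # tracks the answer text, not its position


question = "What is the capital of France?"
options = ["Paris", "Rome", "Oslo", "Bern"]

biased_result = score_question(always_first, question, options, correct=0)
robust_result = score_question(knows_answer, question, options, correct=0)
print(biased_result, robust_result)
```

Requiring both orderings to be right penalizes a model that favors a fixed answer slot,
since such a model can match the correct position in at most one of the two queries.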

TESTS:
   KNOWLEDGE:
      TQA - Truthful QA
      SQA - Simple QA 4333 question arcane knowledge quiz
      JEOPARDY - 100 Question JEOPARDY quiz
   LANGUAGE:
      LAMBADA - Language Modeling Broadened to Account for Discourse Aspects
   UNDERSTANDING:
      WG - Winogrande
      BOOLQ - Boolean questions
      STORYCLOZE - Story questions
      OBQA - Open Book Question / Answer
      SIQA - Social IQ
      RACE - Reading comprehension dataset from examinations
      MMLU - massive multitask language understanding
      MEDQA - medical QA
   REASONING:
      CSQA - Common Sense Question Answer
      COPA - Choice of Plausible Alternatives
      HELLASWAG - Harder Endings, Longer contexts, and Low-shot Activities
                  for Situations With Adversarial Generations
      PIQA - Physical Interaction: Question Answering
      ARC - AI2 Reasoning Challenge
      AGIEVAL - AGIEval logiqa, lsat, sat
      AGIEVALC  - Gaokao SAT, logiqa, jec (Chinese)
      MUSR - Multistep Soft Reasoning
   COT:
      GSM8K - Grade School Math CoT
      BBH  - Beyond the Imitation Game Bench Hard CoT
      GPQA - Google-Proof QA science CoT
      MMLUPRO - massive multitask language understanding pro CoT
      AGIEVAL - satmath, aquarat
      AGIEVALC  - mathcloze, mathqa (Chinese)
      MUSR - Multistep Soft Reasoning
      APPLE - 100 custom Apple Questions
   MATH:
      MATH1..MATH5 - MATH Datasets level 1 through 5 (Hendrycks et al.)
   CODE:
      HUMANEVAL - Python
      HUMANEVALP - Python, extended test
      HUMANEVALX - Python, Java, Javascript, C++
      MBPP - Python
      MBPPP - Python, extended test
      CRUXEVAL - Python
      USE {TEST}FIM FOR FIM TEST, e.g. HUMANEVAL->HUMANEVALFIM
   VISION:
      CHARTQA - Chart Question/Answer
      DOCVQA  - Document Vision QA
      MMMU - Massive Multi-discipline Multimodal Understanding (CoT)
      MMMUPRO - Massive Multi-discipline Multimodal Understanding Pro (CoT)
   AUDIO:
      BBA - Big Bench Audio

GENERAL MODELS:

MODEL Falcon3-1B-Instruct Falcon3-7B-Instruct Falcon3-10B-Instruct gemma-2-9b-it gemma-2-27b-it gemma-3-1b-it gemma-3-4b-it gemma-3-12b-it gemma-3-12b-it gemma-3-27b-it glm-4-9b-chat glm-4-9b-chat internlm3-8b-instruct Llama-3.1-8B-Instruct Llama-3.2-3B-Instruct Llama-4-Scout-17B-16E-Instruct Llama-4-Scout-17B-16E-Instruct Llama-4-Scout-17B-16E-Instruct Mistral-7B-Instruct-v0.3 Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.2-24B-Instruct-2506 Phi-3.5-mini-8k-instruct Phi-3.5-mini-128k-instruct Phi-4 Qwen2.5-3B-32k-Instruct Qwen2.5-3B-32k-Instruct Qwen2.5-7B-32k-Instruct Qwen2.5-7B-32k-Instruct Qwen2.5-14B-32k-Instruct Qwen2.5-32B-Instruct
params 1.67B 7.46B 10.31B 9.24B 27.23B 0.99989B 3.88B 11.77B 11.77B 27.01B 9.40B 9.40B 8.80B 8.03B 3.21B 107.77B 107.77B 107.77B 7.25B 23.57B 23.57B 23.57B 23.57B 3.82B 3.82B 14.66B 3.09B 3.09B 7.62B 7.62B 14.77B 32.76B
quant IQ4_XS Q6_K IQ4_XS Q6_K IQ4_XS Q8_0 Q6_K IQ4_XS Q4_K_H Q4_K_H IQ4_XS Q6_K IQ4_XS Q6_K Q6_K Q2_K_H Q3_K_H Q4_K_H Q8_0 Q2_K_H Q3_K_H Q4_K_H Q4_K_H Q6_K Q6_K IQ4_XS IQ4_XS Q6_K IQ4_XS Q6_K IQ4_XS IQ4_XS
engine llama.cpp version: 4341 llama.cpp version: 4341 llama.cpp version: 4341 llama.cpp version: 3266 llama.cpp version: 3389 llama.cpp version: 4877 llama.cpp version: 4888 llama.cpp version: 4938 llama.cpp version: 5572 llama.cpp version: 5586 llama.cpp version: 3496 llama.cpp version: 3334 llama.cpp version: 4488 llama.cpp version: 3428 llama.cpp version: 3825 llama.cpp version: 5236 llama.cpp version: 5279 llama.cpp version: 5335 llama.cpp version: 3262 llama.cpp version: 5509 llama.cpp version: 5509 llama.cpp version: 5509 llama.cpp version: 5742 llama.cpp version: 3609 llama.cpp version: 3600 llama.cpp version: 4295 llama.cpp version: 4038 llama.cpp version: 4038 llama.cpp version: 3943 llama.cpp version: 3870 llama.cpp version: 3821 llama.cpp version: 3821
TEST acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc
WG 0.600 0.670 0.700 0.762 0.772 0.576 0.692 0.743 0.741 0.748 0.759 0.753 0.708 0.741 0.685 - - - 0.751 0.775 0.772 0.784 0.780 0.744 0.734 0.708 0.687 0.695 0.709 0.709 0.754 0.746
LAMBADA 0.524 0.688 0.692 0.735 0.755 0.504 0.635 0.724 0.721 0.742 0.786 0.783 0.662 0.747 0.705 - - - 0.766 0.786 0.789 0.798 0.792 0.677 0.613 0.750 0.685 0.682 0.722 0.724 0.769 0.781
HELLASWAG 0.308 0.684 0.716 0.775 0.810 0.307 0.527 0.779 0.767 0.802 0.834 0.840 0.846 0.696 0.559 - - - 0.591 0.866 0.877 0.899 0.872 0.716 0.669 0.801 0.670 0.713 0.820 0.822 0.863 0.894
BOOLQ 0.364 0.591 0.621 0.687 0.739 0.521 0.603 0.669 - 0.701 0.633 0.625 0.562 0.610 0.478 - - - 0.658 - - 0.646 0.684 0.562 0.573 0.653 0.517 0.533 0.617 0.623 0.647 0.701
STORYCLOZE 0.774 0.949 0.947 0.958 0.973 0.685 0.900 0.948 - 0.964 0.967 0.976 0.982 0.895 0.870 - - - 0.917 - - 0.968 0.969 0.531 0.921 0.754 0.913 0.896 0.920 0.915 0.938 0.981
CSQA 0.488 0.725 0.746 0.751 0.763 0.339 0.614 0.716 - 0.741 0.727 0.733 0.730 0.686 0.642 - - - 0.627 - - 0.756 0.751 0.669 0.660 0.740 0.701 0.717 0.768 0.781 0.795 0.823
OBQA 0.380 0.761 0.745 0.846 0.860 0.334 0.648 0.807 - 0.855 0.821 0.802 0.801 0.765 0.709 - - - 0.676 - - 0.866 0.880 0.751 0.720 0.857 0.700 0.731 0.802 0.804 0.863 0.904
COPA 0.612 0.870 0.903 0.925 0.949 0.415 0.785 0.932 - 0.944 0.955 0.944 0.927 0.889 0.749 - - - 0.812 - - 0.924 0.932 0.884 0.870 0.934 0.841 0.858 0.925 0.919 0.935 0.958
PIQA 0.233 0.696 0.732 0.801 0.841 0.386 0.653 0.784 - 0.818 0.773 0.779 0.777 0.725 0.637 - - - 0.708 - - 0.826 0.831 0.733 0.677 0.832 0.695 0.713 0.794 0.807 0.848 0.870
SIQA 0.425 0.658 0.688 0.693 0.731 0.385 0.588 0.699 - 0.716 0.664 0.665 0.706 0.648 0.622 - - - 0.620 - - 0.737 0.710 0.667 0.661 0.639 0.656 0.663 0.721 0.712 0.746 0.742
MEDQA 0.141 0.420 0.430 0.501 0.549 0.073 0.292 0.503 - 0.553 0.436 0.445 0.457 0.500 0.413 - - - 0.334 - - 0.593 0.597 0.423 0.395 0.560 0.344 0.363 0.453 0.458 0.542 0.610
SQA - 0.033 - - 0.117 - 0.052 0.092 - 0.092 - - 0.039 0.073 - - - - - - - 0.066 0.073 - - - - - - - - -
JEOPARDY 0.010 0.400 0.310 0.580 0.760 - 0.350 0.550 0.560 0.830 0.370 0.420 0.210 0.510 0.350 0.680 0.580 0.540 0.490 0.680 - 0.740 0.640 0.320 0.250 0.390 0.120 0.120 0.300 0.290 0.540 0.600
GSM8K 0.485 0.890 0.918 0.890 0.899 - 0.843 0.928 0.928 0.964 0.855 0.839 0.890 0.872 0.822 - - - 0.611 - - 0.940 0.968 0.855 0.714 0.946 0.829 0.856 0.909 0.880 0.938 0.950
APPLE 0.150 0.810 0.740 0.750 0.730 - 0.630 0.740 0.770 0.850 0.630 0.610 0.670 0.690 0.610 0.840 0.860 0.860 0.390 0.830 0.780 0.820 0.890 0.560 0.560 0.910 0.640 0.560 0.740 0.750 0.830 0.860
HUMANEVAL 0.115 0.737 0.774 0.658 0.743 0.408 0.701 0.859 0.829 0.890 0.737 0.731 0.804 0.652 0.585 - - - 0.390 0.841 0.823 0.853 0.871 0.682 0.621 0.847 0.695 0.780 0.798 0.817 0.804 0.884
HUMANEVALP 0.073 0.628 0.664 0.548 0.615 0.317 0.597 0.713 - 0.719 0.615 0.634 0.713 0.536 0.475 - - - 0.329 - - 0.731 0.750 0.591 0.524 0.725 0.615 0.682 0.670 0.658 0.676 0.768
HUMANEVALFIM - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
MBPP 0.334 0.677 0.653 0.595 0.642 0.536 0.614 0.692 - 0.677 0.579 0.591 0.552 0.564 0.498 - - - 0.451 - - 0.618 0.642 0.610 0.498 0.673 0.595 0.599 0.669 0.661 0.669 0.684
MBPPP 0.312 0.629 0.611 0.584 0.638 0.531 0.598 0.642 - 0.625 0.562 0.575 0.477 0.540 0.482 - - - 0.397 - - 0.593 0.647 0.575 0.477 0.651 0.540 0.584 0.633 0.651 0.633 0.700
HUMANEVALX_cpp 0.054 0.506 0.603 0.512 0.579 0.158 0.585 0.756 - 0.780 0.439 0.432 0.402 0.457 0.323 - - - 0.225 - - 0.292 0.713 0.280 0.219 0.676 0.420 0.237 0.475 0.554 0.323 0.701
HUMANEVALX_java 0.042 0.640 0.719 0.640 0.768 0.317 0.658 0.804 - 0.810 0.207 0.628 0.597 0.487 0.439 - - - 0.256 - - 0.804 0.829 0.079 0.060 0.634 0.640 0.615 0.695 0.737 0.780 0.865
HUMANEVALX_js 0.115 0.676 0.652 0.579 0.743 0.359 0.664 0.835 - 0.841 0.628 0.628 0.670 0.560 0.067 - - - 0.402 - - 0.786 0.786 0.560 0.451 0.786 0.646 0.689 0.719 0.750 0.798 0.847
HUMANEVALX 0.071 0.607 0.658 0.577 0.697 0.278 0.636 0.798 - 0.810 0.424 0.563 0.556 0.502 0.276 - - - 0.294 - - 0.628 0.776 0.306 0.243 0.699 0.569 0.514 0.630 0.680 0.634 0.804
CRUXEVAL_input 0.210 0.411 0.448 0.462 0.485 0.038 0.388 0.440 - 0.528 0.416 0.406 0.477 0.435 0.353 - - - 0.276 - - 0.547 0.550 0.398 0.388 0.447 0.350 0.331 0.387 0.412 0.541 0.517
CRUXEVAL_output 0.152 0.355 0.410 0.375 0.482 0.196 0.348 0.457 - 0.491 0.356 0.338 0.372 0.360 0.291 - - - 0.303 - - 0.516 0.498 0.342 0.296 0.463 0.275 0.311 0.382 0.386 0.471 0.455
CRUXEVAL 0.181 0.383 0.429 0.418 0.483 0.117 0.368 0.448 - 0.510 0.386 0.372 0.425 0.397 0.322 - - - 0.290 - - 0.531 0.524 0.370 0.342 0.455 0.312 0.321 0.385 0.399 0.506 0.486
CRUXEVALFIM_input - 0.418 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CRUXEVALFIM_output - 0.356 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CRUXEVALFIM - 0.387 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TQA_mc 0.146 0.523 0.510 0.701 0.767 0.115 0.468 0.663 - 0.696 0.636 0.640 0.637 0.564 0.555 - - - 0.549 - - 0.767 0.713 0.621 0.581 0.725 0.516 0.548 0.654 0.657 0.747 0.804
TQA_tf 0.381 0.410 0.431 0.692 0.725 0.390 0.491 0.677 - 0.634 0.484 0.457 0.593 0.512 0.566 - - - 0.548 - - 0.735 0.670 0.483 0.487 0.686 0.414 0.300 0.574 0.568 0.706 0.731
TQA 0.354 0.423 0.440 0.693 0.730 0.358 0.488 0.676 - 0.641 0.502 0.478 0.598 0.518 0.565 - - - 0.548 - - 0.739 0.675 0.499 0.498 0.691 0.426 0.329 0.583 0.578 0.711 0.740
ARC_challenge 0.374 0.809 0.819 0.882 0.897 0.319 0.699 0.869 - 0.899 0.835 0.853 0.871 0.776 0.706 - - - 0.688 - - 0.912 0.907 0.813 0.802 0.911 0.750 0.777 0.843 0.851 0.911 0.934
ARC_easy 0.598 0.925 0.933 0.952 0.963 0.563 0.875 0.955 - 0.971 0.933 0.940 0.945 0.906 0.843 - - - 0.843 - - 0.970 0.968 0.934 0.932 0.970 0.895 0.904 0.945 0.946 0.969 0.978
ARC 0.524 0.886 0.895 0.929 0.941 0.482 0.817 0.927 - 0.947 0.901 0.911 0.920 0.863 0.798 - - - 0.792 - - 0.951 0.948 0.894 0.889 0.950 0.847 0.862 0.911 0.915 0.950 0.963
RACE_high 0.431 0.698 0.730 0.802 0.833 0.338 0.633 0.795 - 0.829 0.788 0.787 0.830 0.679 0.589 - - - 0.607 - - 0.853 0.853 0.613 0.625 0.819 0.698 0.712 0.779 0.788 0.852 0.882
RACE_middle 0.463 0.777 0.793 0.849 0.883 0.398 0.713 0.858 - 0.883 0.816 0.825 0.866 0.734 0.680 - - - 0.696 - - 0.894 0.885 0.706 0.692 0.861 0.775 0.776 0.841 0.853 0.887 0.923
RACE 0.440 0.721 0.748 0.816 0.847 0.355 0.656 0.813 - 0.844 0.796 0.798 0.840 0.695 0.615 - - - 0.633 - - 0.865 0.862 0.640 0.645 0.831 0.720 0.730 0.797 0.807 0.862 0.894
MMLU
abstract_algebra 0.180 0.410 0.450 0.330 0.310 0.110 0.160 0.350 - 0.410 0.220 0.210 0.300 0.200 0.270 - - - 0.190 - - 0.370 0.430 0.300 0.210 0.410 0.240 0.250 0.440 0.430 0.570 0.600
anatomy 0.318 0.577 0.592 0.626 0.607 0.296 0.459 0.577 - 0.629 0.503 0.511 0.637 0.555 0.540 - - - 0.447 - - 0.718 0.725 0.570 0.585 0.703 0.525 0.562 0.622 0.622 0.644 0.733
astronomy 0.263 0.736 0.756 0.760 0.828 0.296 0.559 0.769 - 0.848 0.644 0.651 0.802 0.677 0.565 - - - 0.573 - - 0.888 0.888 0.703 0.703 0.776 0.618 0.657 0.763 0.769 0.868 0.875
business_ethics 0.260 0.570 0.560 0.620 0.670 0.270 0.480 0.630 - 0.710 0.570 0.610 0.670 0.550 0.480 - - - 0.520 - - 0.730 0.720 0.620 0.620 0.740 0.630 0.590 0.680 0.710 0.750 0.800
clinical_knowledge 0.373 0.652 0.683 0.743 0.788 0.316 0.554 0.716 - 0.784 0.618 0.622 0.716 0.675 0.592 - - - 0.581 - - 0.811 0.777 0.713 0.698 0.781 0.633 0.645 0.709 0.713 0.803 0.815
college_biology 0.340 0.763 0.777 0.854 0.895 0.256 0.618 0.826 - 0.847 0.687 0.715 0.777 0.722 0.625 - - - 0.625 - - 0.888 0.902 0.805 0.763 0.868 0.694 0.694 0.784 0.784 0.854 0.923
college_chemistry 0.180 0.470 0.430 0.470 0.430 0.180 0.260 0.400 - 0.470 0.380 0.380 0.440 0.400 0.310 - - - 0.350 - - 0.480 0.500 0.460 0.430 0.520 0.310 0.370 0.480 0.490 0.460 0.530
college_computer_science 0.110 0.540 0.590 0.460 0.580 0.200 0.360 0.500 - 0.560 0.470 0.480 0.640 0.400 0.350 - - - 0.320 - - 0.650 0.580 0.480 0.410 0.600 0.390 0.460 0.620 0.590 0.630 0.720
college_mathematics 0.090 0.320 0.320 0.260 0.300 0.080 0.170 0.300 - 0.400 0.240 0.280 0.280 0.260 0.210 - - - 0.180 - - 0.340 0.380 0.270 0.170 0.340 0.200 0.180 0.380 0.350 0.490 0.540
college_medicine 0.283 0.566 0.612 0.658 0.716 0.260 0.462 0.624 - 0.682 0.572 0.589 0.676 0.589 0.491 - - - 0.456 - - 0.722 0.728 0.612 0.566 0.728 0.560 0.606 0.606 0.624 0.710 0.739
college_physics 0.186 0.372 0.411 0.352 0.421 0.098 0.196 0.411 - 0.470 0.313 0.323 0.362 0.313 0.303 - - - 0.254 - - 0.529 0.539 0.333 0.294 0.529 0.382 0.392 0.401 0.372 0.519 0.656
computer_security 0.370 0.710 0.690 0.730 0.710 0.350 0.590 0.740 - 0.760 0.710 0.730 0.720 0.690 0.620 - - - 0.600 - - 0.710 0.720 0.700 0.650 0.730 0.650 0.690 0.720 0.710 0.730 0.800
conceptual_physics 0.234 0.680 0.680 0.638 0.727 0.174 0.404 0.634 - 0.748 0.561 0.587 0.646 0.463 0.361 - - - 0.365 - - 0.744 0.731 0.565 0.553 0.748 0.485 0.519 0.642 0.642 0.800 0.834
econometrics 0.122 0.649 0.587 0.557 0.587 0.140 0.315 0.535 - 0.570 0.456 0.464 0.578 0.482 0.359 - - - 0.318 - - 0.587 0.614 0.456 0.421 0.596 0.421 0.438 0.605 0.596 0.649 0.675
electrical_engineering 0.220 0.641 0.648 0.558 0.593 0.296 0.393 0.558 - 0.627 0.544 0.572 0.655 0.524 0.462 - - - 0.393 - - 0.703 0.662 0.496 0.475 0.634 0.441 0.434 0.606 0.606 0.648 0.703
elementary_mathematics 0.113 0.505 0.497 0.476 0.476 0.058 0.288 0.502 - 0.719 0.367 0.373 0.481 0.357 0.280 - - - 0.222 - - 0.653 0.621 0.423 0.388 0.544 0.407 0.417 0.560 0.568 0.791 0.838
formal_logic 0.182 0.444 0.484 0.293 0.468 0.142 0.269 0.452 - 0.507 0.325 0.357 0.420 0.420 0.253 - - - 0.277 - - 0.563 0.484 0.452 0.380 0.531 0.325 0.341 0.452 0.428 0.539 0.626
global_facts 0.120 0.190 0.290 0.330 0.370 0.070 0.110 0.300 - 0.420 0.200 0.240 0.330 0.150 0.110 - - - 0.160 - - 0.520 0.450 0.240 0.130 0.320 0.140 0.200 0.260 0.260 0.470 0.430
high_school_biology 0.348 0.764 0.774 0.851 0.890 0.358 0.645 0.816 - 0.854 0.800 0.809 0.825 0.729 0.677 - - - 0.654 - - 0.874 0.870 0.793 0.774 0.887 0.722 0.754 0.803 0.806 0.845 0.896
high_school_chemistry 0.216 0.522 0.507 0.586 0.600 0.167 0.359 0.517 - 0.610 0.546 0.517 0.527 0.467 0.433 - - - 0.310 - - 0.650 0.640 0.512 0.492 0.655 0.413 0.463 0.532 0.536 0.596 0.724
high_school_computer_science 0.250 0.740 0.740 0.710 0.770 0.270 0.570 0.750 - 0.830 0.660 0.660 0.760 0.610 0.540 - - - 0.490 - - 0.810 0.790 0.610 0.580 0.870 0.600 0.660 0.770 0.770 0.830 0.870
high_school_european_history 0.490 0.757 0.745 0.806 0.830 0.363 0.678 0.818 - 0.824 0.812 0.830 0.787 0.709 0.672 - - - 0.678 - - 0.830 0.830 0.727 0.672 0.812 0.733 0.733 0.787 0.800 0.824 0.818
high_school_geography 0.393 0.717 0.747 0.878 0.888 0.424 0.646 0.818 - 0.858 0.792 0.818 0.792 0.757 0.671 - - - 0.671 - - 0.853 0.853 0.792 0.737 0.888 0.712 0.732 0.833 0.833 0.868 0.883
high_school_government_and_politics 0.487 0.875 0.875 0.926 0.963 0.450 0.772 0.911 - 0.937 0.875 0.870 0.880 0.818 0.725 - - - 0.805 - - 0.937 0.948 0.849 0.834 0.937 0.772 0.797 0.917 0.917 0.958 0.968
high_school_macroeconomics 0.235 0.653 0.687 0.717 0.758 0.238 0.474 0.682 - 0.771 0.651 0.653 0.733 0.556 0.497 - - - 0.478 - - 0.766 0.743 0.646 0.635 0.807 0.564 0.592 0.684 0.684 0.802 0.825
high_school_mathematics 0.088 0.344 0.337 0.277 0.325 0.033 0.211 0.337 - 0.422 0.237 0.240 0.285 0.255 0.233 - - - 0.162 - - 0.362 0.348 0.214 0.203 0.274 0.270 0.244 0.440 0.422 0.500 0.537
high_school_microeconomics 0.268 0.823 0.827 0.801 0.852 0.315 0.533 0.798 - 0.831 0.760 0.773 0.798 0.684 0.575 - - - 0.540 - - 0.873 0.878 0.794 0.743 0.861 0.672 0.697 0.827 0.827 0.857 0.907
high_school_physics 0.099 0.509 0.496 0.423 0.496 0.112 0.198 0.403 - 0.562 0.344 0.364 0.443 0.317 0.211 - - - 0.165 - - 0.596 0.582 0.377 0.384 0.569 0.317 0.311 0.470 0.456 0.635 0.695
high_school_psychology 0.445 0.827 0.853 0.896 0.910 0.445 0.750 0.882 - 0.900 0.840 0.858 0.862 0.834 0.761 - - - 0.764 - - 0.913 0.900 0.855 0.844 0.904 0.803 0.796 0.858 0.856 0.882 0.902
high_school_statistics 0.185 0.564 0.625 0.574 0.615 0.129 0.337 0.574 - 0.555 0.509 0.500 0.638 0.462 0.342 - - - 0.361 - - 0.648 0.648 0.569 0.523 0.643 0.481 0.518 0.615 0.648 0.717 0.782
high_school_us_history 0.436 0.764 0.794 0.829 0.867 0.348 0.705 0.843 - 0.872 0.833 0.867 0.774 0.784 0.696 - - - 0.699 - - 0.897 0.921 0.759 0.735 0.877 0.715 0.759 0.843 0.852 0.882 0.906
high_school_world_history 0.535 0.759 0.818 0.872 0.881 0.392 0.696 0.864 - 0.907 0.810 0.827 0.801 0.789 0.725 - - - 0.720 - - 0.869 0.873 0.746 0.742 0.869 0.776 0.793 0.818 0.827 0.869 0.877
human_aging 0.309 0.596 0.627 0.690 0.739 0.345 0.524 0.645 - 0.699 0.582 0.591 0.641 0.618 0.569 - - - 0.542 - - 0.713 0.704 0.582 0.547 0.726 0.569 0.587 0.681 0.690 0.717 0.771
human_sexuality 0.351 0.648 0.694 0.746 0.755 0.358 0.519 0.740 - 0.763 0.648 0.633 0.648 0.671 0.587 - - - 0.569 - - 0.839 0.816 0.664 0.587 0.740 0.625 0.625 0.740 0.717 0.786 0.839
international_law 0.404 0.727 0.776 0.801 0.760 0.495 0.677 0.801 - 0.809 0.735 0.752 0.743 0.776 0.710 - - - 0.710 - - 0.834 0.818 0.735 0.727 0.892 0.710 0.685 0.768 0.785 0.834 0.867
jurisprudence 0.444 0.740 0.768 0.785 0.833 0.379 0.648 0.740 - 0.796 0.675 0.722 0.777 0.731 0.574 - - - 0.626 - - 0.833 0.805 0.722 0.750 0.787 0.694 0.712 0.759 0.750 0.824 0.824
logical_fallacies 0.380 0.711 0.730 0.811 0.797 0.300 0.644 0.779 - 0.871 0.730 0.754 0.717 0.736 0.687 - - - 0.660 - - 0.797 0.785 0.785 0.754 0.779 0.705 0.723 0.773 0.766 0.834 0.877
machine_learning 0.196 0.508 0.491 0.437 0.571 0.169 0.285 0.464 - 0.482 0.419 0.401 0.500 0.366 0.285 - - - 0.321 - - 0.571 0.571 0.437 0.375 0.544 0.339 0.321 0.437 0.410 0.526 0.642
management 0.417 0.825 0.786 0.825 0.844 0.475 0.708 0.864 - 0.825 0.737 0.766 0.834 0.737 0.669 - - - 0.708 - - 0.844 0.864 0.786 0.776 0.854 0.689 0.718 0.805 0.825 0.825 0.864
marketing 0.517 0.820 0.854 0.863 0.893 0.508 0.782 0.858 - 0.897 0.850 0.858 0.888 0.837 0.799 - - - 0.756 - - 0.893 0.888 0.820 0.803 0.914 0.811 0.816 0.888 0.893 0.897 0.901
medical_genetics 0.340 0.720 0.750 0.780 0.810 0.240 0.510 0.720 - 0.790 0.630 0.640 0.710 0.720 0.660 - - - 0.600 - - 0.850 0.880 0.710 0.700 0.860 0.660 0.690 0.770 0.770 0.820 0.900
miscellaneous 0.420 0.749 0.768 0.830 0.854 0.401 0.687 0.825 - 0.879 0.775 0.796 0.768 0.773 0.736 - - - 0.727 - - 0.872 0.872 0.777 0.759 0.864 0.724 0.726 0.807 0.814 0.871 0.885
moral_disputes 0.323 0.609 0.618 0.680 0.736 0.332 0.511 0.664 - 0.719 0.604 0.612 0.635 0.621 0.560 - - - 0.524 - - 0.748 0.754 0.615 0.621 0.748 0.537 0.566 0.664 0.676 0.725 0.760
moral_scenarios 0.115 0.165 0.411 0.325 0.366 0.117 0.143 0.207 - 0.489 0.307 0.360 0.188 0.205 0.410 - - - 0.122 - - 0.482 0.377 0.366 0.404 0.582 0.130 0.058 0.318 0.368 0.546 0.565
nutrition 0.313 0.650 0.666 0.683 0.758 0.333 0.565 0.676 - 0.764 0.643 0.653 0.751 0.689 0.620 - - - 0.555 - - 0.843 0.826 0.669 0.620 0.771 0.647 0.630 0.745 0.745 0.790 0.797
philosophy 0.327 0.681 0.675 0.658 0.713 0.363 0.536 0.726 - 0.742 0.652 0.659 0.688 0.617 0.578 - - - 0.587 - - 0.736 0.781 0.630 0.588 0.784 0.562 0.565 0.675 0.688 0.774 0.778
prehistory 0.308 0.660 0.697 0.728 0.783 0.342 0.577 0.759 - 0.827 0.635 0.663 0.641 0.700 0.604 - - - 0.580 - - 0.805 0.805 0.697 0.663 0.805 0.641 0.666 0.762 0.756 0.836 0.861
professional_accounting 0.184 0.418 0.432 0.496 0.514 0.152 0.280 0.436 - 0.531 0.404 0.425 0.429 0.393 0.336 - - - 0.336 - - 0.531 0.517 0.418 0.386 0.510 0.386 0.414 0.457 0.460 0.560 0.631
professional_law 0.202 0.397 0.417 0.478 0.528 0.177 0.323 0.441 - 0.489 0.404 0.408 0.417 0.397 0.369 - - - 0.333 - - 0.518 0.505 0.410 0.401 0.492 0.340 0.337 0.401 0.402 0.477 0.541
professional_medicine 0.235 0.639 0.636 0.756 0.794 0.113 0.481 0.761 - 0.783 0.654 0.680 0.680 0.724 0.713 - - - 0.564 - - 0.827 0.827 0.687 0.658 0.823 0.573 0.580 0.680 0.683 0.812 0.845
professional_psychology 0.300 0.647 0.665 0.728 0.805 0.272 0.495 0.718 - 0.753 0.598 0.609 0.684 0.642 0.509 - - - 0.521 - - 0.790 0.777 0.655 0.617 0.799 0.586 0.591 0.707 0.702 0.776 0.810
public_relations 0.409 0.563 0.600 0.700 0.672 0.354 0.509 0.690 - 0.681 0.572 0.627 0.618 0.518 0.545 - - - 0.554 - - 0.736 0.736 0.554 0.572 0.727 0.563 0.572 0.627 0.645 0.736 0.663
security_studies 0.240 0.608 0.644 0.746 0.763 0.285 0.632 0.661 - 0.755 0.624 0.632 0.718 0.665 0.616 - - - 0.600 - - 0.787 0.767 0.669 0.673 0.730 0.620 0.653 0.718 0.718 0.767 0.775
sociology 0.412 0.781 0.791 0.815 0.860 0.517 0.666 0.820 - 0.850 0.736 0.741 0.810 0.786 0.741 - - - 0.716 - - 0.860 0.875 0.820 0.781 0.870 0.716 0.736 0.815 0.825 0.855 0.860
us_foreign_policy 0.510 0.780 0.790 0.868 0.840 0.460 0.740 0.890 - 0.860 0.780 0.800 0.840 0.800 0.800 - - - 0.757 - - 0.920 0.910 0.760 0.770 0.890 0.750 0.780 0.820 0.820 0.890 0.880
virology 0.246 0.433 0.445 0.472 0.506 0.283 0.415 0.469 - 0.481 0.415 0.439 0.469 0.439 0.415 - - - 0.387 - - 0.512 0.512 0.403 0.367 0.500 0.373 0.427 0.463 0.457 0.487 0.518
world_religions 0.403 0.748 0.801 0.800 0.847 0.350 0.684 0.795 - 0.836 0.766 0.766 0.748 0.789 0.742 - - - 0.747 - - 0.871 0.853 0.742 0.725 0.836 0.783 0.760 0.818 0.818 0.859 0.871
MMLU 0.285 0.591 0.623 0.647 0.687 0.269 0.477 0.631 - 0.701 0.580 0.595 0.617 0.570 0.525 - - - 0.486 - - 0.717 0.704 0.599 0.578 0.710 0.532 0.540 0.639 0.643 0.721 0.757
AGIEVAL
aquarat 0.374 0.602 0.562 0.665 0.602 0.409 0.763 0.846 - 0.844 0.653 0.637 0.783 0.598 0.633 - - - 0.279 - - 0.516 0.764 0.409 0.574 0.834 0.732 0.728 0.799 0.830 0.822 0.870
logiqa 0.208 0.356 0.337 0.447 0.477 0.145 0.342 0.479 - 0.509 0.399 0.416 0.433 0.328 0.265 - - - 0.264 - - 0.468 0.447 0.281 0.267 0.445 0.316 0.342 0.427 0.436 0.493 0.554
lsatar 0.213 0.213 0.282 0.208 0.260 0.217 0.213 0.365 - 0.317 0.073 0.217 0.308 0.295 0.239 - - - 0.186 - - 0.269 0.639 0.256 0.247 0.369 0.230 0.226 0.260 0.300 0.321 0.400
lsatlr 0.203 0.486 0.537 0.635 0.654 0.115 0.374 0.596 - 0.686 0.505 0.515 0.592 0.441 0.327 - - - 0.366 - - 0.709 0.686 0.415 0.386 0.621 0.452 0.449 0.598 0.603 0.729 0.811
lsatrc 0.312 0.594 0.646 0.750 0.754 0.208 0.475 0.702 - 0.717 0.635 0.643 0.706 0.624 0.486 - - - 0.520 - - 0.814 0.806 0.531 0.524 0.762 0.553 0.617 0.661 0.687 0.810 0.836
saten 0.470 0.791 0.810 0.834 0.868 0.305 0.728 0.854 - 0.893 0.815 0.820 0.844 0.781 0.689 - - - 0.679 - - 0.893 0.873 0.713 0.708 0.830 0.733 0.776 0.810 0.844 0.888 0.922
satmath 0.559 0.790 0.822 0.886 0.768 0.468 0.945 0.981 - 0.936 0.863 0.868 0.968 0.618 0.845 - - - 0.400 - - 0.804 0.813 0.713 0.754 0.977 0.900 0.922 0.963 0.963 0.990 0.981
AGIEVAL 0.294 0.503 0.523 0.598 0.602 0.226 0.488 0.639 - 0.663 0.525 0.546 0.611 0.480 0.433 - - - 0.359 - - 0.615 0.665 0.429 0.438 0.638 0.501 0.520 0.599 0.616 0.681 0.734
AGIEVALC_biology - 0.365 - - - 0.104 0.334 0.595 - 0.665 0.756 0.778 0.869 - - - - - - - - 0.721 0.739 - - - 0.660 0.700 0.804 0.813 0.834 0.582
AGIEVALC_chemistry - 0.269 - - - 0.078 0.289 0.446 - 0.480 0.642 0.691 0.715 - - - - - - - - 0.509 0.509 - - - 0.441 0.470 0.583 0.627 0.696 0.789
AGIEVALC_chinese - 0.247 - - - 0.048 0.231 0.373 - 0.439 0.642 0.650 0.723 - - - - - - - - 0.569 0.577 - - - 0.508 0.504 0.585 0.593 0.760 0.735
AGIEVALC_english - 0.774 - - - 0.444 0.728 0.862 - 0.866 0.823 0.833 0.905 - - - - - - - - 0.892 0.892 - - - 0.794 0.839 0.856 0.849 0.915 0.924
AGIEVALC_geography - 0.407 - - - 0.246 0.396 0.608 - 0.678 0.728 0.728 0.814 - - - - - - - - 0.718 0.718 - - - 0.643 0.633 0.753 0.778 0.804 0.839
AGIEVALC_history - 0.374 - - - 0.225 0.421 0.642 - 0.689 0.829 0.834 0.872 - - - - - - - - 0.736 0.736 - - - 0.740 0.744 0.774 0.800 0.842 0.923
AGIEVALC_jecqaca - 0.221 - - - 0.142 0.258 0.292 - 0.348 0.414 0.440 0.660 - - - - - - - - 0.416 0.410 - - - 0.425 0.424 0.482 0.487 0.564 0.622
AGIEVALC_jecqakd - 0.223 - - - 0.118 0.229 0.356 - 0.400 0.549 0.559 0.759 - - - - - - - - 0.465 0.461 - - - 0.498 0.526 0.592 0.605 0.732 0.747
AGIEVALC_logiqa - 0.310 - - - 0.193 0.328 0.488 - 0.523 0.479 0.490 0.556 - - - - - - - - 0.525 0.525 - - - 0.399 0.405 0.497 0.500 0.565 0.588
AGIEVALC_mathcloze - 0.508 - - - - 0.567 0.779 - 0.855 0.491 0.542 0.508 - - - - - - - - 0.754 0.915 - - - 0.508 0.440 0.694 0.686 0.737 0.805
AGIEVALC_mathqa - 0.569 - - - 0.322 0.616 0.779 - 0.744 0.621 0.648 0.845 - - - - - - - - 0.664 0.844 - - - 0.595 0.683 0.779 0.755 0.808 0.834
AGIEVALC_physics - 0.327 - - - 0.091 0.206 0.304 - 0.471 0.396 0.425 0.563 - - - - - - - - 0.431 0.477 - - - 0.390 0.413 0.431 0.500 0.683 0.770
AGIEVALC - 0.361 - - - 0.187 0.368 0.514 - 0.554 0.589 0.607 0.724 - - - - - - - - 0.583 0.603 - - - 0.529 0.548 0.627 0.636 0.716 0.734
BBH
boolean_expressions 0.544 0.860 0.876 0.768 0.460 0.632 0.880 0.880 - 0.732 0.848 0.868 0.800 0.844 0.480 - - - 0.764 - - 0.872 0.860 0.852 0.832 0.936 0.756 0.796 0.864 0.880 0.888 0.808
causal_judgement 0.550 0.577 0.582 0.598 0.604 0.550 0.582 0.652 - 0.620 0.550 0.550 0.641 0.540 0.518 - - - 0.588 - - 0.631 0.582 0.588 0.593 0.647 0.497 0.529 0.508 0.513 0.647 0.700
date_understanding 0.324 0.668 0.748 0.748 0.788 0.408 0.868 0.920 - 0.760 0.580 0.572 0.832 0.716 0.664 - - - 0.548 - - 0.728 0.920 0.696 0.576 0.932 0.616 0.648 0.764 0.740 0.856 0.872
disambiguation_qa 0.400 0.712 0.668 0.660 0.720 0.284 0.432 0.448 - 0.612 0.584 0.636 0.716 0.516 0.472 - - - 0.600 - - 0.388 0.516 0.720 0.752 0.768 0.544 0.556 0.656 0.636 0.764 0.780
dyck_languages 0.424 0.704 0.712 0.728 0.600 0.344 0.636 0.824 - 0.892 0.516 0.544 0.592 0.796 0.680 - - - 0.744 - - 0.792 0.684 0.580 0.468 0.776 0.596 0.628 0.868 0.836 0.648 0.820
formal_fallacies 0.624 0.740 0.660 0.832 0.760 0.612 0.876 0.832 - 0.820 0.568 0.660 0.984 0.984 0.816 - - - 0.852 - - 0.964 0.692 0.808 0.808 0.804 0.928 0.852 0.628 0.628 0.784 0.812
geometric_shapes 0.056 0.544 0.456 0.436 0.420 0.128 0.376 0.456 - 0.544 0.392 0.400 0.812 0.440 0.416 - - - 0.288 - - 0.280 0.716 0.416 0.292 0.648 0.204 0.212 0.544 0.604 0.584 0.640
hyperbaton 0.512 0.572 0.680 0.884 0.836 0.108 0.940 0.976 - 0.932 0.740 0.824 0.884 0.880 0.624 - - - 0.656 - - 0.884 0.892 0.936 0.936 0.996 0.636 0.676 0.832 0.792 0.868 0.956
logical_deduction_five_objects 0.176 0.700 0.532 0.568 0.608 0.284 0.604 0.840 - 0.784 0.528 0.516 0.784 0.568 0.484 - - - 0.352 - - 0.600 0.968 0.632 0.532 0.940 0.468 0.528 0.752 0.728 0.876 0.924
logical_deduction_seven_objects 0.152 0.556 0.492 0.560 0.552 0.212 0.640 0.740 - 0.776 0.444 0.500 0.756 0.488 0.408 - - - 0.296 - - 0.616 0.944 0.568 0.500 0.920 0.420 0.436 0.668 0.656 0.792 0.864
logical_deduction_three_objects 0.376 0.868 0.820 0.844 0.892 0.428 0.860 0.992 - 0.912 0.836 0.840 0.960 0.804 0.652 - - - 0.608 - - 0.840 0.988 0.844 0.804 0.992 0.696 0.720 0.940 0.956 0.980 0.992
movie_recommendation 0.424 0.652 0.676 0.552 0.508 0.372 0.536 0.664 - 0.632 0.604 0.648 0.740 0.536 0.456 - - - 0.508 - - 0.684 0.672 0.520 0.508 0.992 0.604 0.568 0.556 0.536 0.672 0.648
multistep_arithmetic_two 0.136 0.944 0.968 0.488 0.472 - 0.868 0.888 - 0.972 0.580 0.524 0.508 0.700 0.532 - - - 0.108 - - 0.832 0.956 0.836 0.420 0.984 0.852 0.876 0.896 0.948 0.964 0.976
navigate 0.540 0.580 0.588 0.596 0.648 0.592 0.648 0.724 - 0.744 0.420 0.420 0.580 0.580 0.580 - - - 0.600 - - 0.680 0.464 0.588 0.584 0.640 0.576 0.572 0.596 0.596 0.624 0.684
object_counting 0.464 0.764 0.820 0.848 0.856 - 0.908 0.908 - 0.976 0.616 0.660 0.892 0.864 0.808 - - - 0.608 - - 0.832 0.984 0.836 0.344 0.996 0.740 0.764 0.848 0.804 0.892 0.896
penguins_in_a_table 0.369 0.842 0.746 0.890 0.842 0.267 0.876 0.986 - 0.739 0.917 0.917 0.958 0.856 0.801 - - - 0.623 - - 0.616 0.993 0.883 0.712 1.000 0.821 0.849 0.945 0.924 0.958 0.986
reasoning_about_colored_objects 0.276 0.860 0.800 0.744 0.900 0.180 0.752 0.888 - 0.844 0.876 0.796 0.940 0.824 0.568 - - - 0.608 - - 0.804 0.992 0.808 0.656 0.968 0.700 0.764 0.904 0.868 0.944 0.984
ruin_names 0.176 0.484 0.636 0.716 0.760 0.172 0.468 0.696 - 0.816 0.696 0.652 0.716 0.744 0.532 - - - 0.400 - - 0.764 0.748 0.612 0.600 0.816 0.396 0.324 0.440 0.544 0.692 0.760
salient_translation_error_detection 0.212 0.448 0.508 0.548 0.568 0.172 0.560 0.640 - 0.564 0.476 0.488 0.580 0.512 0.464 - - - 0.444 - - 0.600 0.656 0.520 0.532 0.636 0.452 0.432 0.560 0.572 0.612 0.700
snarks 0.483 0.685 0.707 0.691 0.719 0.033 0.724 0.803 - 0.634 0.702 0.707 0.769 0.651 0.657 - - - 0.606 - - 0.634 0.786 0.747 0.786 0.882 0.662 0.623 0.747 0.780 0.831 0.865
sports_understanding 0.584 0.672 0.692 0.788 0.816 0.488 0.696 0.804 - 0.844 0.472 0.468 0.668 0.720 0.644 - - - 0.716 - - 0.796 0.708 0.596 0.600 0.740 0.620 0.616 0.676 0.684 0.680 0.748
temporal_sequences 0.164 0.528 0.540 0.708 0.748 0.436 0.988 0.996 - 0.940 0.756 0.840 0.956 0.856 0.712 - - - 0.404 - - 0.844 0.992 0.784 0.508 1.000 0.324 0.388 0.800 0.820 0.988 0.992
tracking_shuffled_objects_five_objects 0.208 0.560 0.616 0.600 0.692 0.508 0.924 1.000 - 0.648 0.544 0.536 0.864 0.656 0.500 - - - 0.344 - - 0.716 0.992 0.940 0.712 1.000 0.420 0.452 0.840 0.908 0.924 0.972
tracking_shuffled_objects_seven_objects 0.140 0.324 0.524 0.572 0.640 0.228 0.884 0.988 - 0.660 0.512 0.436 0.764 0.592 0.420 - - - 0.296 - - 0.744 0.984 0.896 0.612 0.984 0.292 0.312 0.800 0.868 0.848 0.980
tracking_shuffled_objects_three_objects 0.288 0.696 0.732 0.732 0.848 0.808 0.972 0.992 - 0.548 0.620 0.696 0.956 0.728 0.608 - - - 0.436 - - 0.880 0.996 0.960 0.788 1.000 0.604 0.664 0.832 0.872 0.856 0.996
web_of_lies 0.476 0.576 0.520 0.520 0.488 0.488 0.516 0.540 - 0.532 0.476 0.488 0.512 0.512 0.544 - - - 0.488 - - 0.560 0.504 0.488 0.492 0.512 0.512 0.512 0.528 0.532 0.544 0.624
word_sorting 0.056 0.204 0.292 0.404 0.540 0.080 0.236 0.424 - 0.536 0.404 0.392 0.144 0.512 0.360 - - - 0.280 - - 0.632 0.592 0.204 0.152 0.360 0.156 0.156 0.212 0.220 0.292 0.400
BBH 0.334 0.638 0.650 0.664 0.674 0.355 0.711 0.794 - 0.743 0.596 0.608 0.749 0.681 0.566 - - - 0.506 - - 0.714 0.806 0.696 0.592 0.846 0.554 0.567 0.709 0.718 0.775 0.827
MUSR
murder_mystery 0.552 0.640 0.592 0.668 0.576 0.528 0.592 0.608 - 0.552 0.616 0.584 0.620 0.584 0.576 - - - 0.516 - - 0.712 0.680 0.636 0.620 0.708 0.544 0.612 0.604 0.584 0.652 0.640
object_placements 0.429 0.535 0.578 0.519 0.542 0.296 0.480 0.542 - 0.448 0.492 0.531 0.460 0.546 0.523 - - - 0.453 - - 0.516 0.532 0.503 0.457 0.464 0.472 0.476 0.531 0.554 0.519 0.265
team_allocation 0.436 0.512 0.496 0.460 0.476 0.328 0.400 0.560 - 0.572 0.572 0.588 0.448 0.460 0.396 - - - 0.356 - - 0.612 0.576 0.536 0.480 0.628 0.444 0.384 0.512 0.476 0.556 0.592
MUSR 0.472 0.562 0.555 0.548 0.531 0.383 0.490 0.570 - 0.524 0.559 0.567 0.509 0.530 0.498 - - - 0.441 - - 0.613 0.596 0.558 0.518 0.599 0.486 0.490 0.548 0.538 0.575 0.497
GPQA_diamond - - - - - - - - - 0.388 - - - - - 0.479 0.469 - - - - 0.358 0.540 - - - - - - - - -
GPQA - - - - - - - - - 0.388 - - - - - 0.479 0.469 - - - - 0.358 0.540 - - - - - - - - -
MMLUPRO
biology 0.324 0.708 0.702 0.747 0.772 0.361 0.640 0.794 - 0.752 0.676 0.695 0.750 0.686 0.623 - - - 0.582 - - 0.776 0.584 0.702 0.662 0.835 0.610 0.638 0.709 0.729 0.797 0.764
business 0.190 0.624 0.525 0.583 0.626 0.173 0.518 0.659 - 0.616 0.522 0.562 0.628 0.558 0.458 - - - 0.335 - - 0.612 0.756 0.571 0.509 0.785 0.504 0.558 0.647 0.661 0.718 0.755
chemistry 0.166 0.639 0.500 0.503 0.546 0.115 0.380 0.574 - 0.536 0.465 0.467 0.589 0.467 0.390 - - - - - - 0.488 0.728 0.463 0.296 0.765 0.387 0.451 0.559 0.580 0.684 0.701
computer_science 0.197 0.602 0.590 0.482 0.560 0.170 0.421 0.643 - 0.560 0.497 0.502 0.585 0.485 0.414 - - - - - - 0.556 0.680 0.475 0.448 0.734 0.434 0.402 0.590 0.604 0.663 0.734
economics 0.236 0.663 0.662 0.668 0.678 0.206 0.534 0.699 - 0.660 0.617 0.610 0.662 0.568 0.492 - - - - - - 0.648 0.612 0.609 0.587 0.792 0.521 0.550 0.674 0.687 0.721 0.787
engineering 0.157 0.437 0.424 0.406 0.414 0.138 0.253 0.373 - 0.420 0.303 0.298 0.454 0.378 0.302 - - - - - - 0.488 0.544 0.297 0.283 0.589 0.296 0.309 0.418 0.420 0.512 0.573
health 0.158 0.503 0.517 0.545 0.621 0.156 0.399 0.596 - 0.548 0.492 0.496 0.544 0.558 0.437 - - - - - - 0.544 0.616 0.515 0.466 0.700 0.388 0.416 0.556 0.569 0.643 0.690
history 0.149 0.406 0.467 0.493 0.490 0.152 0.354 0.540 - 0.588 0.425 0.438 0.459 0.451 0.380 - - - - - - 0.592 0.568 0.380 0.380 0.627 0.333 0.367 0.459 0.464 0.566 0.624
law 0.123 0.268 0.295 0.343 0.405 0.158 0.263 0.372 - 0.400 0.299 0.284 0.307 0.303 0.243 - - - - - - 0.384 0.328 0.276 0 0.500 0.220 0.237 0.300 0.292 0.366 0.455
math 0.203 0.694 0.564 0.538 0.570 0.180 0.586 0.739 - 0.664 0.490 0.523 0.617 0.555 0.511 - - - - - - 0.536 0.812 0.522 0.458 0.816 0.581 0.603 0.712 0.723 0.775 0.814
other 0.164 0.450 0.496 0.551 0.574 0.173 0.428 0.580 - 0.552 0.464 0.458 0.536 0.487 0.389 - - - - - - 0.484 0.592 0.500 0.433 0.706 0.410 0.405 0.529 0.551 0.611 0.664
philosophy 0.148 0.442 0.462 0.448 0.488 0.176 0.356 0.555 - 0.560 0.408 0.412 0.424 0.382 0.326 - - - - - - 0.476 0.580 0.406 0.390 0.633 0.376 0.364 0.480 0.464 0.557 0.599
physics 0.159 0.583 0.493 0.501 0.559 0.125 0.397 0.595 - 0.512 0.441 0.461 0.587 0.488 0.397 - - - - - - 0.488 0.724 0.455 0.425 0.765 0.419 0.456 0.589 0.602 0.702 0.543
psychology 0.258 0.621 0.645 0.647 0.692 0.273 0.567 0.685 - 0.632 0.586 0.602 0.665 0.637 0.518 - - - - - - 0.680 0.544 0.621 0.572 0.759 0.526 0.563 0.636 0.644 0.721 0.749
MMLUPRO 0.186 0.552 0.517 0.528 0.568 0.177 0.436 0.597 - 0.571 0.471 0.480 0.559 0.499 0.419 - - - 0.453 - - 0.553 0.619 0.482 0.408 0.719 0.430 0.457 0.564 0.575 0.649 0.671
CATEGORIES
REASONING 0.367 0.713 0.738 0.788 0.814 0.344 0.598 0.787 0.767 0.809 0.804 0.811 0.815 0.713 0.606 - - - 0.628 0.866 0.877 0.863 0.848 0.724 0.691 0.809 0.689 0.719 0.805 0.809 0.850 0.874
UNDERSTANDING 0.366 0.644 0.670 0.707 0.742 0.327 0.552 0.695 0.741 0.746 0.661 0.670 0.691 0.631 0.579 - - - 0.563 0.775 0.772 0.764 0.756 0.614 0.622 0.728 0.605 0.613 0.692 0.696 0.761 0.793
LANGUAGE 0.524 0.688 0.692 0.735 0.755 0.504 0.635 0.724 0.721 0.742 0.786 0.783 0.662 0.747 0.705 - - - 0.766 0.786 0.789 0.798 0.792 0.677 0.613 0.750 0.685 0.682 0.722 0.724 0.769 0.781
KNOWLEDGE 0.354 0.442 0.496 0.690 0.733 0.353 0.478 0.626 0.560 0.630 0.553 0.543 0.615 0.547 0.536 0.680 0.580 0.540 0.582 0.680 - 0.676 0.653 0.517 0.519 0.676 0.469 0.426 0.595 0.597 0.693 0.725
COT 0.220 0.552 0.530 0.550 0.582 0.201 0.470 0.616 - 0.630 0.485 0.500 0.586 0.530 0.446 0.479 0.469 - 0.498 - - 0.600 0.651 0.506 0.440 0.725 0.443 0.462 0.570 0.581 0.653 0.684
MATHCOT 0.369 0.730 0.752 0.735 0.740 0.417 0.793 0.879 0.882 0.790 0.682 0.679 0.813 0.728 0.647 0.840 0.860 0.860 0.493 0.830 0.780 0.743 0.908 0.767 0.638 0.919 0.667 0.694 0.823 0.821 0.869 0.903
CODE 0.176 0.460 0.534 0.495 0.568 0.241 0.485 0.582 0.829 0.618 0.456 0.475 0.500 0.463 0.366 - - - 0.321 0.841 0.823 0.409 0.619 0.427 0.376 0.568 0.437 0.445 0.510 0.528 0.578 0.612
DISCIPLINES
NLP 0.408 0.647 0.670 0.755 0.786 0.392 0.595 0.748 0.751 0.761 0.729 0.728 0.737 0.677 0.609 - - - 0.642 0.834 0.841 0.808 0.792 0.647 0.637 0.755 0.632 0.630 0.731 0.734 0.791 0.818
MATH 0.294 0.669 0.659 0.637 0.653 0.298 0.659 0.775 0.882 0.727 0.590 0.597 0.720 0.629 0.556 0.840 0.860 0.860 0.451 0.830 0.780 0.678 0.789 0.646 0.543 0.817 0.576 0.599 0.741 0.742 0.799 0.843
SCIENCE 0.350 0.706 0.713 0.739 0.769 0.304 0.580 0.737 - 0.797 0.686 0.698 0.756 0.676 0.605 0.479 0.469 - 0.673 - - 0.806 0.821 0.696 0.660 0.845 0.629 0.657 0.738 0.748 0.815 0.806
ENGINEERING 0.166 0.464 0.453 0.426 0.438 0.158 0.271 0.397 - 0.496 0.334 0.333 0.480 0.397 0.323 - - - 0.393 - - 0.567 0.587 0.323 0.308 0.595 0.315 0.325 0.443 0.444 0.530 0.590
MEDICINE 0.216 0.524 0.540 0.595 0.648 0.182 0.411 0.598 - 0.642 0.521 0.530 0.570 0.577 0.496 - - - 0.447 - - 0.681 0.684 0.537 0.501 0.672 0.459 0.478 0.574 0.580 0.655 0.702
HUMANITIES 0.291 0.550 0.615 0.645 0.679 0.272 0.495 0.641 0.560 0.705 0.593 0.610 0.622 0.578 0.529 0.680 0.580 0.540 0.536 0.680 - 0.710 0.698 0.588 0.567 0.739 0.527 0.533 0.629 0.638 0.716 0.742
BUSINESS 0.252 0.679 0.655 0.678 0.709 0.245 0.537 0.704 - 0.743 0.623 0.637 0.696 0.598 0.517 - - - 0.466 - - 0.749 0.762 0.637 0.604 0.801 0.565 0.596 0.701 0.710 0.759 0.802
LAW 0.200 0.362 0.427 0.483 0.524 0.172 0.316 0.443 - 0.504 0.417 0.429 0.494 0.406 0.344 - - - 0.370 - - 0.537 0.543 0.392 0.310 0.541 0.374 0.383 0.451 0.456 0.541 0.604
COMPOSITE AVERAGE
AVG 0.342 0.612 0.641 0.692 0.724 0.324 0.555 0.701 0.753 0.729 0.648 0.654 0.689 0.629 0.561 0.620 0.595 0.700 0.578 0.833 0.841 0.740 0.757 0.616 0.585 0.748 0.578 0.586 0.686 0.691 0.754 0.783
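
Reading the rows programmatically: each score row in these tables is a test name followed by one whitespace-separated value per model column, with "-" where the test was not run. The sketch below is one possible way to parse such a row, not the script used to generate the tables. The helper name parse_score_row is hypothetical, and it assumes the row label has no spaces, so the MODEL, quant and engine header rows need separate handling.

```python
from typing import Optional


def parse_score_row(line: str) -> tuple[str, list[Optional[float]]]:
    """Split one score row into (test_name, scores).

    Assumption: the first token is the test name and every later token is
    either a number or "-" (test not run), which is mapped to None.
    """
    name, *cells = line.split()
    scores = [None if cell == "-" else float(cell) for cell in cells]
    return name, scores


# Shortened example row in the same layout as the tables.
name, scores = parse_score_row("GPQA - 0.388 - 0.479 0.469")
print(name, scores)
```

Keeping missing results as None rather than 0.0 avoids confusing a test that was not run with a test that scored zero.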

THINKING MODELS:

MODEL Qwen3-0.6B Qwen3-1.7B Qwen3-4B Qwen3-4B Qwen3-8B Qwen3-8B Qwen3-8B Qwen3-14B Qwen3-14B Qwen3-30B-A3B Qwen3-30B-A3B Qwen3-32B Qwen3-32B QwQ-32B-Preview QwQ-32B
params 0.75163B 2.03B 4.02B 4.02B 8.19B 8.19B 8.19B 14.77B 14.77B 30.53B 30.53B 32.8B 32.8B 32.76B 32.76B
quant Q8_0 Q8_0 Q8_0 Q8_0_H Q4_K_H Q6_K_H Q6_K IQ4_XS Q4_K_H IQ4_XS Q4_K_H IQ4_XS Q4_K_H IQ4_XS Q4_K_H
engine llama.cpp version: 5679 llama.cpp version: 5415 llama.cpp version: 5242 llama.cpp version: 5509 llama.cpp version: 5279 llama.cpp version: 5223 llama.cpp version: 5153 llama.cpp version: 5223 llama.cpp version: 5379 llama.cpp version: 5279 llama.cpp version: 5353 llama.cpp version: 5466 llama.cpp version: 5466 llama.cpp version: 4273 llama.cpp version: 6118
TEST acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc
WG 0.564 0.610 0.662 0.642 0.651 0.689 0.678 0.722 0.726 0.699 0.700 0.712 0.731 0.750 0.689
LAMBADA 0.471 0.590 0.644 0.638 0.660 0.700 0.700 0.729 0.714 0.698 0.725 0.692 0.701 0.780 0.651
HELLASWAG 0.277 0.553 0.721 0.726 0.718 0.787 0.787 0.827 0.792 0.815 0.832 0.838 0.812 0.875 0.883
BOOLQ 0.449 0.531 0.626 0.626 0.641 0.608 0.611 0.662 0.632 - 0.502 0.603 0.574 0.629 0.658
STORYCLOZE 0.764 0.833 0.849 0.843 0.809 0.852 0.873 - 0.905 - 0.917 0.960 0.959 0.964 0.935
CSQA 0.377 0.567 0.705 0.705 0.685 0.740 0.748 - 0.749 - 0.742 - 0.778 0.796 0.680
OBQA 0.394 0.584 0.756 0.754 0.719 0.767 0.774 - 0.787 - 0.836 - 0.869 0.882 0.815
COPA 0.569 0.765 0.872 0.865 0.829 0.828 0.864 - 0.919 - 0.919 - 0.946 0.936 0.962
PIQA 0.393 0.574 0.710 0.710 0.744 0.769 0.781 - 0.798 - 0.845 - 0.815 0.829 0.871
SIQA 0.363 0.569 0.664 0.664 0.637 0.671 0.679 - 0.689 - 0.693 - 0.714 0.714 0.747
MEDQA 0.135 0.278 0.435 0.428 0.448 0.499 0.509 - 0.531 - 0.597 - 0.553 0.598 0.518
SQA 0.249 0.032 0.039 0.040 0.036 0.039 0.042 - 0.045 - 0.055 - 0.047 - 0.060
JEOPARDY 0.640 0.270 0.280 0.220 0.410 0.280 0.240 0.480 0.490 0.520 0.470 0.470 0.470 0.600 0.440
GSM8K 0.748 0.920 0.946 0.960 0.946 0.953 0.956 - 0.948 - 0.962 - 0.972 0.962 0.964
APPLE 0.460 0.790 0.850 0.840 0.790 0.880 0.890 0.910 0.920 0.820 0.850 0.910 0.910 0.870 0.880
HUMANEVAL 0.445 0.682 0.817 0.804 0.835 0.865 0.859 - 0.859 - 0.884 - 0.890 0.414 0.512
HUMANEVALP 0.335 0.591 0.682 0.676 0.725 0.713 0.731 - 0.737 - 0.750 - 0.780 0.359 0.432
HUMANEVALFIM - - - - - - - - - - - - - - -
MBPP 0.408 0.544 0.645 0.642 0.571 0.618 0.630 - 0.700 - 0.677 - 0.684 0.404 0.568
MBPPP 0.388 0.482 0.598 0.602 0.580 0.611 0.566 - 0.651 - - - 0.678 0.392 0.584
HUMANEVALX_cpp 0.231 0.359 0.463 0.353 0.524 0.615 0.554 - 0.652 - - - 0.737 0.378 0.603
HUMANEVALX_java 0.274 0.548 0.731 0.737 0.737 0.780 0.798 - 0.841 - - - 0.847 0.097 0.280
HUMANEVALX_js 0.256 0.518 0.719 0.695 0.762 0.774 0.774 - 0.786 - - - 0.817 0.493 0.493
HUMANEVALX 0.254 0.475 0.638 0.595 0.674 0.723 0.709 - 0.760 - - - 0.800 0.323 0.459
CRUXEVAL_input 0.353 0.406 0.457 0.453 0.445 0.528 0.510 - 0.537 - - - 0.450 0.200 0.498
CRUXEVAL_output 0.241 0.338 0.420 0.403 0.405 0.446 0.447 - 0.501 - - - 0.431 0.368 0.513
CRUXEVAL 0.297 0.372 0.438 0.428 0.425 0.487 0.478 - 0.519 - - - 0.440 0.284 0.506
CRUXEVALFIM_input - - - - - - - - - - - - - - -
CRUXEVALFIM_output - - - - - - - - - - - - - - -
CRUXEVALFIM - - - - - - - - - - - - - - -
TQA_mc 0.261 0.406 0.600 0.598 0.592 0.641 0.635 - 0.676 - - - 0.742 0.795 0.701
TQA_tf 0.429 0.445 0.502 0.500 0.513 0.430 0.458 - 0.614 - - - 0.456 0.523 0.628
TQA 0.409 0.441 0.514 0.511 0.523 0.455 0.479 - 0.621 - - - 0.490 0.554 0.637
ARC_challenge 0.275 0.686 0.854 0.852 0.833 0.882 0.882 - 0.896 - - - 0.910 0.917 0.843
ARC_easy 0.502 0.850 0.937 0.933 0.934 0.952 0.955 - 0.964 - - - 0.974 0.975 0.906
ARC 0.427 0.796 0.910 0.906 0.901 0.929 0.931 - 0.942 - - - 0.953 0.956 0.886
RACE_high 0.359 0.594 0.759 0.756 0.747 0.794 0.798 - 0.826 - - - 0.822 0.871 0.862
RACE_middle 0.397 0.652 0.808 0.808 0.818 0.842 0.844 - 0.873 - - - 0.881 - 0.889
RACE 0.370 0.611 0.774 0.771 0.768 0.808 0.811 - 0.839 - - - 0.839 0.871 0.870
MMLU
abstract_algebra 0.100 0.240 0.410 0.420 0.360 0.430 0.470 - 0.500 - - - 0.470 - 0.470
anatomy 0.274 0.437 0.540 0.555 0.592 0.622 0.607 - 0.651 - - - 0.644 - 0.681
astronomy 0.328 0.611 0.723 0.730 0.796 0.822 0.828 - 0.861 - - - 0.868 - 0.802
business_ethics 0.290 0.460 0.670 0.670 0.610 0.650 0.650 - 0.720 - - - 0.730 - 0.720
clinical_knowledge 0.350 0.528 0.690 0.709 0.728 0.758 0.743 - 0.762 - - - 0.781 - 0.766
college_biology 0.388 0.604 0.770 0.784 0.784 0.805 0.812 - 0.847 - - - 0.881 - 0.819
college_chemistry 0.230 0.340 0.420 0.420 0.420 0.480 0.490 - 0.550 - - - 0.500 - 0.460
college_computer_science 0.230 0.380 0.580 0.580 0.610 0.650 0.700 - 0.630 - - - 0.700 - 0.580
college_mathematics 0.160 0.300 0.370 0.390 0.390 0.450 0.500 - 0.450 - - - 0.400 - 0.350
college_medicine 0.312 0.537 0.676 0.664 0.676 0.716 0.722 - 0.739 - - - 0.722 - 0.722
college_physics 0.137 0.254 0.519 0.509 0.500 0.529 0.578 - 0.578 - - - 0.558 - 0.421
computer_security 0.430 0.650 0.690 0.700 0.710 0.750 0.740 - 0.770 - - - 0.770 - 0.630
conceptual_physics 0.255 0.514 0.685 0.685 0.697 0.736 0.761 - 0.821 - - - 0.821 - 0.668
econometrics 0.140 0.394 0.587 0.614 0.552 0.614 0.631 - 0.622 - - - 0.605 - 0.543
electrical_engineering 0.324 0.475 0.586 0.579 0.586 0.648 0.648 - 0.717 - - - 0.724 - 0.517
elementary_mathematics 0.148 0.391 0.582 0.574 0.584 0.640 0.650 - 0.701 - - - 0.679 - 0.547
formal_logic 0.238 0.357 0.547 0.531 0.460 0.476 0.476 - 0.515 - - - 0.579 - 0.611
global_facts 0.140 0.110 0.180 0.220 0.240 0.260 0.280 - 0.300 - - - 0.310 - 0.430
high_school_biology 0.432 0.600 0.822 0.822 0.822 0.848 0.858 - 0.883 - - - 0.906 - 0.783
high_school_chemistry 0.147 0.423 0.600 0.591 0.551 0.635 0.650 - 0.709 - - - 0.665 - 0.596
high_school_computer_science 0.340 0.580 0.750 0.750 0.730 0.820 0.830 - 0.830 - - - 0.810 - 0.800
high_school_european_history 0.387 0.600 0.703 0.690 0.727 0.818 0.812 - 0.787 - - - 0.806 - 0.824
high_school_geography 0.393 0.631 0.803 0.808 0.752 0.787 0.808 - 0.853 - - - 0.878 - 0.858
high_school_government_and_politics 0.284 0.621 0.849 0.854 0.834 0.891 0.906 - 0.901 - - - 0.958 - 0.886
high_school_macroeconomics 0.292 0.474 0.661 0.656 0.669 0.712 0.712 - 0.787 - - - 0.810 - 0.692
high_school_mathematics 0.166 0.292 0.348 0.355 0.351 0.418 0.396 - 0.474 - - - 0.381 - 0.296
high_school_microeconomics 0.390 0.609 0.773 0.768 0.802 0.882 0.886 - 0.911 - - - 0.894 - 0.676
high_school_physics 0.119 0.350 0.556 0.549 0.549 0.602 0.602 - 0.649 - - - 0.662 - 0.456
high_school_psychology 0.526 0.746 0.842 0.844 0.849 0.877 0.877 - 0.891 - - - 0.921 - 0.790
high_school_statistics 0.333 0.462 0.648 0.657 0.625 0.694 0.689 - 0.703 - - - 0.736 - 0.537
high_school_us_history 0.338 0.553 0.784 0.764 0.710 0.823 0.848 - 0.862 - - - 0.897 - 0.872
high_school_world_history 0.459 0.645 0.793 0.776 0.797 0.839 0.831 - 0.827 - - - 0.864 - 0.877
human_aging 0.331 0.439 0.587 0.578 0.600 0.609 0.623 - 0.668 - - - 0.771 - 0.721
human_sexuality 0.374 0.549 0.641 0.664 0.664 0.740 0.755 - 0.770 - - - 0.809 - 0.770
international_law 0.429 0.537 0.669 0.652 0.628 0.694 0.710 - 0.826 - - - 0.801 - 0.809
jurisprudence 0.398 0.527 0.675 0.694 0.675 0.731 0.731 - 0.805 - - - 0.777 - 0.787
logical_fallacies 0.319 0.613 0.791 0.785 0.717 0.779 0.803 - 0.822 - - - 0.803 - 0.644
machine_learning 0.276 0.339 0.526 0.508 0.392 0.491 0.455 - 0.562 - - - 0.455 - 0.616
management 0.514 0.640 0.786 0.805 0.834 0.844 0.873 - 0.825 - - - 0.796 - 0.718
marketing 0.602 0.739 0.816 0.807 0.820 0.884 0.876 - 0.876 - - - 0.876 - 0.773
medical_genetics 0.370 0.580 0.750 0.710 0.750 0.780 0.750 - 0.790 - - - 0.840 - 0.750
miscellaneous 0.390 0.597 0.752 0.744 0.757 0.789 0.795 - 0.831 - - - 0.854 - 0.786
moral_disputes 0.289 0.473 0.580 0.583 0.540 0.609 0.615 - 0.638 - - - 0.699 - 0.760
moral_scenarios 0.109 0 0.145 0.140 0.234 0.322 0.269 - 0.330 - - - 0.292 - 0.518
nutrition 0.316 0.509 0.660 0.669 0.660 0.702 0.702 - 0.771 - - - 0.771 - 0.771
philosophy 0.225 0.501 0.623 0.617 0.575 0.636 0.643 - 0.675 - - - 0.710 - 0.755
prehistory 0.345 0.530 0.688 0.688 0.688 0.753 0.743 - 0.774 - - - 0.796 - 0.833
professional_accounting 0.237 0.308 0.439 0.443 0.404 0.482 0.482 - 0.510 - - - 0.546 - 0.588
professional_law 0.201 0.288 0.378 0.382 0.369 0.414 0.424 - 0.431 - - - 0.475 - 0.505
professional_medicine 0.209 0.463 0.698 0.705 0.676 0.757 0.775 - 0.786 - - - 0.830 - 0.790
professional_psychology 0.303 0.459 0.651 0.655 0.601 0.676 0.679 - 0.733 - - - 0.766 - 0.712
public_relations 0.345 0.500 0.572 0.563 0.581 0.600 0.636 - 0.645 - - - 0.709 - 0.672
security_studies 0.412 0.604 0.636 0.636 0.677 0.730 0.730 - 0.746 - - - 0.755 - 0.804
sociology 0.427 0.656 0.731 0.746 0.766 0.800 0.815 - 0.781 - - - 0.855 - 0.840
us_foreign_policy 0.470 0.610 0.720 0.710 0.780 0.830 0.830 - 0.830 - - - 0.840 - 0.860
virology 0.319 0.379 0.433 0.433 0.409 0.463 0.475 - 0.487 - - - 0.487 - 0.542
world_religions 0.362 0.637 0.719 0.719 0.783 0.783 0.777 - 0.818 - - - 0.807 - 0.865
MMLU 0.298 0.457 0.598 0.598 0.598 0.651 0.654 - 0.684 - - - 0.698 - 0.674
AGIEVAL
aquarat 0.572 0.760 0.866 0.840 0.834 0.866 0.897 - 0.885 - - - 0.860 - 0.848
logiqa 0.062 0.230 0.451 0.453 0.393 0.420 0.431 - 0.465 - - - 0.520 - 0.586
lsatar 0.208 0.313 0.486 0.500 0.430 0.486 0.517 - 0.469 - - - 0.495 - 0.678
lsatlr 0.164 0.372 0.601 0.594 0.574 0.641 0.658 - 0.725 - - - 0.768 - 0.813
lsatrc 0.327 0.464 0.669 0.665 0.657 0.687 0.713 - 0.713 - - - 0.806 - 0.828
saten 0.412 0.655 0.830 0.825 0.825 0.820 0.820 - 0.834 - - - 0.873 - 0.898
satmath 0.772 0.950 0.990 0.981 0.986 0.990 0.990 - 0.990 - - - 0.995 - 0.972
AGIEVAL 0.282 0.458 0.641 0.636 0.608 0.643 0.659 - 0.678 - - - 0.717 - 0.764
AGIEVALC_biology 0.152 0.539 0.765 0.760 0.769 0.834 0.847 - 0.856 - - - 0.878 - 0.708
AGIEVALC_chemistry 0.117 0.397 0.622 0.602 0.568 0.647 0.656 - 0.720 - - - 0.803 - 0.754
AGIEVALC_chinese 0.081 0.365 0.516 0.504 0.581 0.609 0.634 - 0.678 - - - 0.739 - 0.707
AGIEVALC_english 0.477 0.728 0.856 0.849 0.820 0.856 0.856 - 0.866 - - - 0.872 - 0.888
AGIEVALC_geography 0.281 0.547 0.708 0.683 0.648 0.758 0.743 - 0.768 - - - 0.829 - 0.819
AGIEVALC_history 0.319 0.612 0.736 0.727 0.702 0.761 0.770 - 0.821 - - - 0.872 - 0.889
AGIEVALC_jecqaca 0.183 0.303 0.397 0.392 0.378 0.425 0.439 - 0.482 - - - 0.566 - 0.652
AGIEVALC_jecqakd 0.123 0.359 0.480 0.484 0.513 0.561 0.574 - 0.613 - - - 0.676 - 0.701
AGIEVALC_logiqa 0.122 0.317 0.496 0.483 0.471 0.499 0.497 - 0.562 - - - 0.599 - 0.642
AGIEVALC_mathcloze 0.728 0.669 0.923 0.957 0.838 0.830 0.923 - 0.881 - - - 0.932 0.864 0.889
AGIEVALC_mathqa 0.500 0.704 0.828 0.812 0.764 0.863 0.851 - 0.813 - - - 0.852 0.828 0.904
AGIEVALC_physics 0.080 0.333 0.436 0.431 0.454 0.563 0.545 - 0.626 - - - 0.701 0.741 0.528
AGIEVALC 0.225 0.448 0.602 0.589 0.580 0.639 0.646 - 0.680 - - - 0.729 0.811 0.733
BBH
boolean_expressions 0.724 0.728 0.820 0.812 0.612 0.560 0.620 - 0.900 - - - 0.832 - 0.768
causal_judgement 0.491 0.561 0.593 0.582 0.540 0.588 0.572 - 0.604 - - - 0.631 - 0.636
date_understanding 0.504 0.752 0.880 0.888 0.852 0.936 0.912 - 0.916 - - - 0.940 - 0.884
disambiguation_qa 0.448 0.464 0.648 0.588 0.464 0.544 0.520 - 0.636 - - - 0.448 - 0.436
dyck_languages 0.412 0.524 0.580 0.572 0.672 0.696 0.688 - 0.772 - - - 0.816 - 0.848
formal_fallacies 0.800 0.800 0.748 0.776 0.568 0.992 0.604 - 0.768 - - - 0.728 - 0.976
geometric_shapes 0.228 0.572 0.536 0.556 0.692 0.716 0.676 - 0.688 - - - 0.728 - 0.780
hyperbaton 0.576 0.692 0.872 0.856 0.912 0.952 0.960 - 0.976 - - - 0.948 - 0.940
logical_deduction_five_objects 0.416 0.772 0.884 0.868 0.856 0.872 0.928 - 0.936 - - - 0.972 - 0.988
logical_deduction_seven_objects 0.360 0.664 0.856 0.840 0.816 0.860 0.880 - 0.888 - - - 0.924 - 0.968
logical_deduction_three_objects 0.612 0.932 0.988 0.988 0.988 0.984 0.980 - 0.996 - - - 1.000 - 0.988
movie_recommendation 0.360 0.416 0.528 0.504 0.492 0.520 0.544 - 0.572 - - - 0.616 - 0.668
multistep_arithmetic_two 0.896 0.988 0.996 0.988 0.984 1.000 1.000 - 0.572 - - - 0.996 - 1.000
navigate 0.516 0.576 0.580 0.580 0.508 0.992 0.608 - 0.680 - - - 0.728 - 0.996
object_counting 0.664 0.872 0.992 0.996 0.996 0.992 0.996 - 0.996 - - - 1.000 - 0.996
penguins_in_a_table 0.602 0.897 0.945 0.958 0.993 1.000 0.993 - 1.000 - - - 1.000 - 1.000
reasoning_about_colored_objects 0.520 0.792 0.952 0.960 0.928 0.940 0.960 - 0.948 - - - 0.984 - 0.964
ruin_names 0.164 0.512 0.508 0.516 0.604 0.656 0.652 - 0.772 - - - 0.776 - 0.768
salient_translation_error_detection 0.316 0.488 0.612 0.632 0.604 0.628 0.572 - 0.660 - - - 0.680 - 0.576
snarks 0.471 0.573 0.730 0.685 0.735 0.792 0.735 - 0.780 - - - 0.837 - 0.831
sports_understanding 0.472 0.524 0.624 0.596 0.540 0.644 0.636 - 0.560 - - - 0.776 - 0.464
temporal_sequences 0.136 0.400 0.912 0.892 0.940 0.992 0.992 - 0.980 - - - 0.992 - 0.948
tracking_shuffled_objects_five_objects 0.280 0.648 0.940 0.964 0.968 0.956 0.936 - 0.996 - - - 0.996 - 0.988
tracking_shuffled_objects_seven_objects 0.232 0.564 0.852 0.884 0.924 0.952 0.972 - 0.944 - - - 0.948 - 0.980
tracking_shuffled_objects_three_objects 0.408 0.736 0.896 0.884 0.920 0.920 0.952 - 0.996 - - - 0.996 - 0.992
web_of_lies 0.456 0.460 0.552 0.544 0.488 - 0.488 - 0.540 - - - 0.492 - 1.000
word_sorting 0.080 0.136 0.220 0.228 0.292 0.292 0.288 - 0.324 - - - 0.324 - 0.400
BBH 0.446 0.628 0.748 0.744 0.734 0.763 0.763 - 0.791 - - - 0.817 - 0.825
MUSR
murder_mystery 0.524 0.560 0.640 0.668 0.584 0.636 0.652 - 0.672 - - - 0.636 - 0.560
object_placements 0.480 0.512 0.566 0.556 0.536 0.582 0.578 - 0.528 - - - 0.516 - 0.436
team_allocation 0.280 0.468 0.628 0.612 0.648 0.656 0.668 - 0.564 - - - 0.632 - 0.728
MUSR 0.428 0.513 0.611 0.612 0.589 0.624 0.632 - 0.588 - - - 0.594 - 0.574
GPQA_diamond 0.262 0.282 0.434 0.489 0.398 0.530 - - 0.439 - - - 0.555 - 0.555
GPQA 0.262 0.282 0.434 0.489 0.398 0.530 - - 0.439 - - - 0.555 - 0.555
MMLUPRO
biology 0.268 0.596 0.799 0.784 0.804 0.822 0.831 - 0.824 - - - 0.852 - 0.812
business 0.348 0.548 0.717 0.692 0.724 0.740 0.738 - 0.784 - - - 0.812 - 0.788
chemistry 0.300 0.564 0.720 0.700 0.700 0.746 0.747 - 0.796 - - - 0.780 - 0.740
computer_science 0.232 0.532 0.680 0.664 0.676 0.704 0.707 - 0.704 - - - 0.804 - 0.780
economics 0.276 0.544 0.716 0.732 0.732 0.741 0.759 - 0.808 - - - 0.832 - 0.796
engineering 0.232 0.456 0.557 0.596 0.536 0.587 0.600 - 0.620 - - - 0.676 - 0.476
health 0.192 0.312 0.559 0.564 0.632 0.630 0.639 - 0.652 - - - 0.680 - 0.652
history 0.200 0.300 0.488 0.524 0.536 0.561 0.556 - 0.600 - - - 0.680 - 0.588
law 0.076 0.208 0.295 0.304 0.348 0.353 0.379 - 0.384 - - - 0.468 - 0.392
math 0.444 0.676 0.832 0.800 0.780 0.824 0.827 - 0.844 - - - 0.872 - 0.860
other 0.188 0.356 0.484 0.528 0.560 0.590 0.593 - 0.656 - - - 0.704 - 0.636
philosophy 0.168 0.336 0.504 0.520 0.492 0.559 0.567 - 0.580 - - - 0.664 - 0.612
physics 0.232 0.580 0.708 0.740 0.720 0.753 0.752 - 0.756 - - - 0.824 - 0.736
psychology 0.212 0.512 0.672 0.632 0.668 0.725 0.692 - 0.728 - - - 0.748 - 0.672
MMLUPRO 0.240 0.465 0.622 0.627 0.636 0.674 0.679 - 0.695 - - - 0.742 - 0.681
CATEGORIES
REASONING 0.332 0.593 0.744 0.746 0.738 0.787 0.792 0.827 0.802 0.815 0.832 0.838 0.824 0.885 0.850
UNDERSTANDING 0.353 0.521 0.653 0.651 0.646 0.693 0.697 0.722 0.725 0.699 0.747 0.849 0.743 0.809 0.731
LANGUAGE 0.471 0.590 0.644 0.638 0.660 0.700 0.700 0.729 0.714 0.698 0.725 0.692 0.701 0.780 0.651
KNOWLEDGE 0.367 0.456 0.552 0.548 0.558 0.527 0.541 0.657 0.632 0.520 0.501 0.599 0.563 0.581 0.659
COT 0.307 0.478 0.616 0.611 0.611 0.667 0.670 - 0.682 - - - 0.717 - 0.674
MATHCOT 0.545 0.766 0.900 0.890 0.882 0.884 0.895 0.910 0.904 0.820 0.954 0.910 0.924 0.927 0.948
CODE 0.317 0.443 0.538 0.524 0.532 0.582 0.573 - 0.618 - 0.755 - 0.586 0.321 0.506
DISCIPLINES
NLP 0.402 0.562 0.674 0.672 0.671 0.694 0.700 0.767 0.739 0.769 0.764 0.768 0.723 0.772 0.765
MATH 0.433 0.638 0.789 0.767 0.779 0.808 0.815 0.910 0.824 0.820 0.954 0.910 0.828 0.927 0.829
SCIENCE 0.340 0.648 0.783 0.789 0.782 0.808 0.818 - 0.847 - - - 0.866 0.946 0.793
ENGINEERING 0.265 0.463 0.561 0.589 0.554 0.595 0.606 - 0.655 - - - 0.693 - 0.491
MEDICINE 0.232 0.392 0.554 0.552 0.566 0.613 0.620 - 0.642 - 0.597 - 0.668 0.598 0.642
HUMANITIES 0.283 0.449 0.595 0.593 0.598 0.642 0.639 0.480 0.677 0.520 0.470 0.470 0.707 0.600 0.717
BUSINESS 0.358 0.555 0.717 0.717 0.725 0.756 0.762 - 0.807 - - - 0.815 - 0.724
LAW 0.200 0.329 0.438 0.458 0.448 0.473 0.488 - 0.535 - - - 0.586 - 0.625
COMPOSITE AVERAGE
AVG 0.363 0.539 0.664 0.661 0.663 0.693 0.699 0.767 0.732 0.768 0.764 0.767 0.731 0.759 0.743

CODE MODELS:

MODEL Codestral-22B-v0.1 Codestral-22B-Instruct-v0.1 Deepseek-Coder-V2-Lite-Instruct Qwen2.5-Coder-0.5B-32k-Instruct Qwen2.5-Coder-1.5B-Instruct Qwen2.5-3B-32k-Instruct Qwen2.5-Coder-3B-Instruct Qwen2.5-Coder-7B-Instruct Qwen2.5-Coder-7B Qwen2.5-Coder-7B Qwen2.5-Coder-14B-Instruct Qwen2.5-Coder-14B Qwen2.5-Coder-32B-Instruct Qwen3-Coder-30B-A3B-Instruct
params 22B 22B 14.77B 0.49403B 1.54B 3.09B 3.09B 7.62B 7.62B 7.62B 14.77B 14.77B 32.76B 30.53B
quant IQ4_XS IQ4_XS IQ4_XS Q6_K Q6_K Q6_K Q6_K IQ4_XS IQ4_XS Q6_K IQ4_XS IQ4_XS IQ4_XS Q4_K_H
engine llama.cpp version: 4132 llama.cpp version: 4191 llama.cpp version: 4488 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4094 llama.cpp version: 4295 llama.cpp version: 4132 llama.cpp version: 4120 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 5935
TEST acc acc acc acc acc acc acc acc acc acc acc acc acc acc
HUMANEVAL 0.664 0.810 0.847 0.518 0.676 0.780 0.835 0.829 0.640 0.713 0.878 0.676 0.884 0.939
HUMANEVALP 0.554 0.682 - 0.432 0.567 0.682 0.719 0.707 0.530 0.579 0.756 0.536 0.756 0.810
HUMANEVALFIM 0.719 0.719 0.621 0.518 0.524 - 0.634 0.493 0.713 0.756 0.829 0.518 0.890 0.713
MBPP 0.630 0.653 - 0.408 0.560 0.599 0.618 0.735 0.614 0.571 0.727 0.661 0.715 0.696
MBPPP 0.558 0.593 - 0.352 0.504 0.584 0.589 0.687 0.540 0.513 0.665 0.558 0.669 0.669
HUMANEVALX_cpp 0.640 0.621 - 0.286 0.426 0.237 0.567 0.676 0.548 0.475 0.506 0.573 0.689 0.817
HUMANEVALX_java 0.756 0.670 - 0.512 0.609 0.615 0.743 0.798 0.725 0.652 0.201 0.762 0.841 0.884
HUMANEVALX_js 0.658 0.621 - 0.493 0.615 0.682 0.670 0.798 0.628 0.658 0.817 0.695 0.835 0.871
HUMANEVALX 0.684 0.638 - 0.430 0.550 0.512 0.660 0.758 0.634 0.595 0.508 0.676 0.788 0.857
CRUXEVAL_input 0.438 0.351 - 0.435 0.416 0.347 0.481 0.578 0.255 0.267 0.677 0.281 0.676 0.577
CRUXEVAL_output 0.465 0.447 - 0.278 0.332 0.311 0.413 0.507 0.381 0.435 0.577 0.422 0.610 0.558
CRUXEVAL 0.451 0.399 - 0.356 0.374 0.329 0.447 0.543 0.318 0.351 0.627 0.351 0.643 0.568
CRUXEVALFIM_input 0.295 0.351 - 0.017 0.155 - 0.208 0.322 0.296 0.313 0.421 0.346 0.515 0.440
CRUXEVALFIM_output 0.441 0.355 - 0.098 0.222 - 0.323 0.481 0.352 0.365 0.546 0.481 0.557 0.340
CRUXEVALFIM 0.368 0.353 - 0.058 0.188 - 0.266 0.401 0.324 0.339 0.483 0.413 0.536 0.390
CODE 0.483 0.467 0.734 0.278 0.368 0.449 0.453 0.548 0.413 0.427 0.593 0.458 0.648 0.576
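
Sanity-checking an aggregate row: in the table above, the HUMANEVALX row looks consistent with the unweighted mean of its cpp, java and js sub-rows, to within the last printed digit. This is an observation from the printed numbers, not a documented scoring rule, and it does not hold for every aggregate row (TQA, for example, is not the simple mean of TQA_mc and TQA_tf). The sketch below checks the first, second and last model columns; the function names and the 0.002 tolerance are illustrative assumptions.

```python
# Values copied from the first, second and last columns of the CODE MODELS table.
humanevalx_parts = {
    "HUMANEVALX_cpp": [0.640, 0.621, 0.817],
    "HUMANEVALX_java": [0.756, 0.670, 0.884],
    "HUMANEVALX_js": [0.658, 0.621, 0.871],
}
humanevalx_reported = [0.684, 0.638, 0.857]


def column_means(rows):
    """Unweighted mean of each model column across the given sub-rows."""
    return [sum(col) / len(col) for col in zip(*rows.values())]


def max_abs_gap(computed, reported):
    """Largest absolute difference between computed and reported values."""
    return max(abs(c - r) for c, r in zip(computed, reported))


means = column_means(humanevalx_parts)
gap = max_abs_gap(means, humanevalx_reported)
# Assumed tolerance: the table prints three decimals, so allow a small slack.
print([f"{m:.4f}" for m in means], f"max gap {gap:.4f}", gap < 0.002)
```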

MATH MODELS:

MODEL Deepseek-R1-Distill-Llama-8B Deepseek-R1-Distill-Llama-8B Deepseek-R1-Distill-Qwen-1.5B Deepseek-R1-Distill-Qwen-7B Deepseek-R1-Distill-Qwen-14B Deepseek-R1-Distill-Qwen-32B GLM-Z1-9B-0414 Qwen2.5-Math-1.5B-Instruct Qwen2.5-Math-7B-Instruct Qwen3-32B QwQ-32B QwQ-32B
params 8.03B 8.03B 1.78B 7.62B 14.77B 32.76B 9.40B 1.54B 7.62B 32.8B 32.76B 32.76B
quant Q6_K Q6_K_H Q8_0 IQ4_XS IQ4_XS IQ4_XS Q6_K_H IQ4_XS Q6_K Q4_K_H IQ4_XS Q4_K_H
engine llama.cpp version: 4707 llama.cpp version: 5898 llama.cpp version: 4763 llama.cpp version: 4644 llama.cpp version: 4657 llama.cpp version: 4559 llama.cpp version: 5935 llama.cpp version: 4406 llama.cpp version: 4394 llama.cpp version: 5633 llama.cpp version: 4820 llama.cpp version: 6026
TEST acc acc acc acc acc acc acc acc acc acc acc acc
GSM8K - 0.888 - - - - 0.964 - - - - 0.964
APPLE - 0.870 - - - - 0.880 - - - - 0.880
GPQA_diamond - 0.308 - - - - 0.434 - - - - 0.555
GPQA - 0.308 - - - - 0.434 - - - - 0.555
MATH1_algebra 0.933 0.977 0.918 0.962 0.925 0.962 0.992 0.859 0.955 0.992 0.992 1.000
MATH1_counting_and_probability 0.820 0.948 0.794 0.948 0.923 0.948 1.000 0.897 0.974 1.000 0.974 1.000
MATH1_geometry 0.842 0.868 0.710 0.736 0.868 0.921 0.947 0.710 0.842 0.842 0.921 0.921
MATH1_intermediate_algebra 0.923 0.980 0.730 0.903 0.865 0.961 0.961 0.730 0.711 0.923 1.000 0.980
MATH1_number_theory 0.700 0.900 0.866 0.800 0.700 0.933 0.900 0.766 1.000 0.666 0.800 0.833
MATH1_prealgebra 0.813 0.941 0.883 0.965 0.883 0.953 0.988 0.837 0.883 0.930 0.953 0.976
MATH1_precalculus 0.684 1.000 0.596 0.859 0.842 1.000 0.947 0.631 0.789 0.947 0.982 0.929
MATH1 0.842 0.956 0.814 0.910 0.878 0.958 0.972 0.794 0.885 0.931 0.963 0.965
MATH2_algebra 0.845 0.975 0.825 0.930 0.900 0.995 0.980 0.910 0.860 0.970 0.975 0.990
MATH2_counting_and_probability 0.831 0.930 0.782 0.851 0.841 0.950 0.980 0.683 0.861 0.970 0.990 0.970
MATH2_geometry 0.841 0.963 0.743 0.914 0.792 0.914 0.987 0.621 0.743 0.792 0.963 0.963
MATH2_intermediate_algebra 0.859 0.953 0.664 0.875 0.835 0.968 0.968 0.671 0.710 0.953 0.960 0.984
MATH2_number_theory 0.826 0.913 0.782 0.826 0.891 0.934 0.956 0.695 0.880 0.891 0.945 0.967
MATH2_prealgebra 0.898 0.966 0.887 0.909 0.875 0.971 0.988 0.836 0.881 0.932 0.971 0.960
MATH2_precalculus 0.787 0.964 0.663 0.902 0.805 0.955 0.955 0.557 0.725 0.858 0.964 0.964
MATH2 0.846 0.956 0.777 0.893 0.856 0.963 0.975 0.742 0.817 0.921 0.968 0.973
MATH3_algebra 0.873 0.938 0.854 0.934 0.911 0.992 0.980 0.881 0.850 0.969 0.996 0.984
MATH3_counting_and_probability 0.800 0.890 0.730 0.770 0.830 0.930 0.970 0.710 0.880 0.950 1.000 1.000
MATH3_geometry 0.794 0.901 0.627 0.911 0.794 0.901 0.970 0.696 0.764 0.833 0.970 0.931
MATH3_intermediate_algebra 0.825 0.969 0.635 0.902 0.882 0.964 0.969 0.574 0.738 0.933 0.969 0.938
MATH3_number_theory 0.819 0.934 0.696 0.754 0.770 0.926 0.926 0.655 0.819 0.811 0.942 0.918
MATH3_prealgebra 0.875 0.950 0.763 0.883 0.892 0.946 0.986 0.816 0.883 0.946 0.982 0.977
MATH3_precalculus 0.661 0.929 0.582 0.874 0.818 0.968 0.968 0.480 0.685 0.858 0.897 0.905
MATH3 0.822 0.937 0.719 0.876 0.859 0.954 0.970 0.714 0.810 0.915 0.969 0.955
MATH4_algebra 0.848 0.950 0.805 0.897 0.922 0.957 0.989 0.851 0.865 0.968 0.992 0.985
MATH4_counting_and_probability 0.729 0.882 0.639 0.738 0.711 0.945 0.963 0.558 0.783 0.945 0.981 0.981
MATH4_geometry 0.792 0.896 0.576 0.776 0.768 0.832 0.920 0.432 0.616 0.712 0.872 0.840
MATH4_intermediate_algebra 0.778 0.947 0.588 0.858 0.850 0.935 0.939 0.512 0.649 0.911 0.947 0.907
MATH4_number_theory 0.795 0.950 0.697 0.809 0.725 0.894 0.929 0.619 0.823 0.823 0.943 0.936
MATH4_prealgebra 0.806 0.931 0.785 0.874 0.827 0.921 0.958 0.748 0.801 0.879 0.926 0.942
MATH4_precalculus 0.719 0.956 0.570 0.868 0.728 0.947 0.947 0.333 0.578 0.859 0.973 0.868
MATH4 0.792 0.935 0.684 0.845 0.816 0.925 0.953 0.620 0.746 0.887 0.952 0.930
MATH5_algebra 0.768 0.947 0.752 0.899 0.853 0.970 0.960 0.674 0.762 0.964 0.964 0.977
MATH5_counting_and_probability 0.699 0.910 0.569 0.756 0.699 0.910 0.934 0.495 0.642 0.910 0.934 0.902
MATH5_geometry 0.712 0.886 0.545 0.810 0.727 0.840 0.878 0.348 0.507 0.734 0.833 0.742
MATH5_intermediate_algebra 0.682 0.900 0.453 0.821 0.778 0.810 0.889 0.253 0.389 0.807 0.860 0.800
MATH5_number_theory 0.811 0.909 0.707 0.727 0.792 0.935 0.961 0.525 0.753 0.870 0.941 0.935
MATH5_prealgebra 0.777 0.849 0.720 0.808 0.782 0.875 0.953 0.580 0.797 0.911 0.927 0.948
MATH5_precalculus 0.562 0.903 0.437 0.851 0.792 0.814 0.888 0.259 0.429 0.777 0.851 0.770
MATH5 0.723 0.904 0.609 0.822 0.787 0.884 0.926 0.462 0.617 0.865 0.907 0.879
MATHCOT 0.795 0.930 0.700 0.860 0.831 0.930 0.954 0.637 0.751 0.897 0.948 0.933
COMPOSITE AVERAGE
AVG 0.795 0.907 0.700 0.860 0.831 0.930 0.936 0.637 0.751 0.897 0.948 0.920

VISION MODELS:

MODEL gemma-3-4b-it gemma-3-4b-it gemma-3-12b-it gemma-3-27b-it Llama-4-Scout-17B-16E-Instruct Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.2-24B-Instruct-2506 Qwen2.5-Omni-7B Qwen2.5-VL-3B-Instruct Qwen2.5-VL-7B-Instruct Qwen2.5-VL-32B-Instruct
params 3.88B 3.88B 11.77B 27.01B 107.77B 23.57B 23.57B 7.62B 3.09B 7.62B 32.76B
quant Q6_K Q6_K_H Q4_K_H Q4_K_H Q2_K_H Q4_K_H Q4_K_H Q6_K_H Q8_0_H Q6_K_H Q4_K_H
engine llama.cpp version: 5706 llama.cpp version: 5819 llama.cpp version: 5819 llama.cpp version: 5780 llama.cpp version: 5935 llama.cpp version: 5662 llama.cpp version: 5780 llama.cpp version: 5752 llama.cpp version: 5819 llama.cpp version: 5745 llama.cpp version: 5902
TEST acc acc acc acc acc acc acc acc acc acc acc
CHARTQA 0.464 0.456 0.558 0.662 0.719 0.743 0.716 0.554 0.706 0.651 0.711
DOCVQA 0.567 0.563 0.711 0.795 0.862 0.892 0.866 0.744 0.686 0.735 0.795
MMMU_Accounting 0.366 0.400 0.566 0.700 0.866 0.466 0.733 0.466 0.433 0.533 0.533
MMMU_Agriculture 0.400 0.400 0.500 0.533 0.600 0.500 0.533 0.333 0.500 0.433 0.566
MMMU_Architecture_and_Engineering 0.200 0.166 0.400 0.333 0.366 0.400 0.400 0.333 0.133 0.400 0.366
MMMU_Art_Theory 0.533 0.666 0.833 0.866 0.800 0.866 0.700 0.700 0.400 0.633 0.766
MMMU_Art 0.566 0.566 0.700 0.766 0.766 0.633 0.666 0.600 0.400 0.500 0.600
MMMU_Basic_Medical_Science 0.333 0.533 0.633 0.566 0.700 0.733 0.600 0.566 0.333 0.500 0.633
MMMU_Biology 0.300 0.166 0.300 0.366 0.500 0.400 0.433 0.333 0.200 0.500 0.500
MMMU_Chemistry 0.033 0.266 0.333 0.333 0.433 0.366 0.366 0.300 0.233 0.166 0.400
MMMU_Clinical_Medicine 0.066 0.466 0.533 0.600 0.633 0.633 0.733 0.533 0.333 0.566 0.666
MMMU_Computer_Science 0.400 0.466 0.466 0.600 0.533 0.400 0.433 0.566 0.133 0.433 0.566
MMMU_Design 0.633 0.766 0.733 0.766 0.866 0.666 0.800 0.800 0.566 0.566 0.800
MMMU_Diagnostics_and_Laboratory_Medicine 0.100 0.200 0.300 0.233 0.433 0.400 0.433 0.400 0.300 0.300 0.466
MMMU_Economics 0.466 0.533 0.500 0.600 0.766 0.666 0.766 0.600 0.500 0.600 0.666
MMMU_Electronics 0.066 0.133 0.233 0.400 0.466 0.400 0.400 0.366 0.233 0.300 0.333
MMMU_Energy_and_Power 0.333 0.233 0.400 0.500 0.600 0.500 0.400 0.400 0.266 0.233 0.466
MMMU_Finance 0.333 0.333 0.466 0.533 0.500 0.500 0.466 0.366 0.266 0.400 0.466
MMMU_Geography 0.200 0.266 0.333 0.366 0.533 0.533 0.433 0.400 0.233 0.466 0.533
MMMU_History 0.566 0.633 0.733 0.800 0.800 0.500 0.700 0.633 0.533 0.533 0.866
MMMU_Literature 0.666 0.766 0.866 0.900 0.800 0.866 0.866 0.833 0.866 0.766 0.800
MMMU_Manage 0.233 0.333 0.333 0.500 0.500 0.466 0.566 0.500 0.366 0.466 0.533
MMMU_Marketing 0.333 0.400 0.466 0.666 0.800 0.633 0.700 0.500 0.233 0.566 0.533
MMMU_Materials 0.133 0.233 0.133 0.300 0.533 0.300 0.366 0.266 0.133 0.333 0.400
MMMU_Math 0.300 0.400 0.533 0.566 0.566 0.466 0.566 0.466 0.433 0.566 0.333
MMMU_Mechanical_Engineering 0.166 0.166 0.300 0.333 0.733 0.433 0.466 0.300 0.266 0.366 0.500
MMMU_Music 0.166 0.333 0.200 0.333 0.233 0.400 0.433 0.200 0.533 0.400 0.133
MMMU_Pharmacy 0.333 0.366 0.600 0.633 0.766 0.566 0.700 0.366 0.433 0.566 0.666
MMMU_Physics 0.166 0.300 0.433 0.600 0.666 0.433 0.600 0.500 0.400 0.333 0.633
MMMU_Psychology 0.366 0.433 0.500 0.566 0.566 0.633 0.466 0.433 0.500 0.366 0.566
MMMU_Public_Health 0.433 0.700 0.700 0.800 0.866 0.766 0.800 0.666 0.333 0.733 0.866
MMMU_Sociology 0.366 0.600 0.600 0.700 0.666 0.533 0.733 0.466 0.566 0.400 0.633
MMMU 0.318 0.407 0.487 0.558 0.628 0.535 0.575 0.473 0.368 0.464 0.560
MMMUPRO_Accounting 0.224 0.310 0.534 0.603 0.741 0.551 0.586 0.362 0.293 0.396 0.603
MMMUPRO_Agriculture 0.200 0.200 0.350 0.450 0.383 0.283 0.266 0.150 0.166 0.233 0.250
MMMUPRO_Architecture_and_Engineering 0.100 0.133 0.216 0.333 0.433 0.316 0.366 0.250 0.200 0.266 0.366
MMMUPRO_Art_Theory 0.472 0.490 0.636 0.709 0.672 0.618 0.527 0.563 0.400 0.581 0.654
MMMUPRO_Art 0.396 0.452 0.547 0.622 0.603 0.471 0.528 0.547 0.207 0.415 0.509
MMMUPRO_Basic_Medical_Science 0.269 0.250 0.384 0.442 0.596 0.384 0.403 0.307 0.250 0.423 0.365
MMMUPRO_Biology 0.169 0.237 0.288 0.322 0.423 0.355 0.372 0.237 0.101 0.305 0.440
MMMUPRO_Chemistry 0.200 0.266 0.333 0.350 0.366 0.383 0.450 0.216 0.250 0.316 0.433
MMMUPRO_Clinical_Medicine 0.118 0.135 0.237 0.372 0.322 0.474 0.389 0.271 0.101 0.203 0.406
MMMUPRO_Computer_Science 0.283 0.350 0.383 0.300 0.483 0.333 0.383 0.300 0.150 0.366 0.400
MMMUPRO_Design 0.433 0.500 0.533 0.616 0.616 0.533 0.616 0.616 0.366 0.550 0.683
MMMUPRO_Diagnostics_and_Laboratory_Medicine 0.116 0.200 0.200 0.233 0.383 0.200 0.300 0.216 0.100 0.183 0.250
MMMUPRO_Economics 0.423 0.457 0.559 0.644 0.677 0.661 0.627 0.389 0.254 0.491 0.576
MMMUPRO_Electronics 0.233 0.316 0.350 0.316 0.616 0.600 0.600 0.400 0.266 0.433 0.466
MMMUPRO_Energy_and_Power 0.172 0.172 0.120 0.327 0.568 0.224 0.275 0.155 0.137 0.172 0.413
MMMUPRO_Finance 0.283 0.366 0.533 0.516 0.633 0.600 0.650 0.350 0.216 0.383 0.433
MMMUPRO_Geography 0.346 0.307 0.346 0.403 0.480 0.384 0.384 0.269 0.134 0.307 0.384
MMMUPRO_History 0.375 0.392 0.535 0.553 0.607 0.428 0.553 0.464 0.410 0.392 0.553
MMMUPRO_Literature 0.500 0.461 0.634 0.692 0.730 0.615 0.634 0.557 0.615 0.557 0.653
MMMUPRO_Manage 0.220 0.240 0.320 0.400 0.480 0.420 0.420 0.320 0.320 0.260 0.440
MMMUPRO_Marketing 0.288 0.305 0.440 0.440 0.627 0.525 0.593 0.338 0.271 0.508 0.559
MMMUPRO_Materials 0.083 0.133 0.150 0.250 0.316 0.166 0.300 0.166 0.150 0.166 0.250
MMMUPRO_Math 0.283 0.233 0.316 0.466 0.483 0.416 0.466 0.283 0.200 0.250 0.316
MMMUPRO_Mechanical_Engineering 0.152 0.186 0.271 0.305 0.474 0.440 0.372 0.271 0.135 0.338 0.406
MMMUPRO_Music 0.216 0.250 0.266 0.233 0.183 0.183 0.233 0.300 0.250 0.233 0.216
MMMUPRO_Pharmacy 0.298 0.298 0.456 0.491 0.596 0.508 0.684 0.385 0.333 0.333 0.491
MMMUPRO_Physics 0.166 0.116 0.416 0.400 0.533 0.466 0.466 0.300 0.200 0.266 0.433
MMMUPRO_Psychology 0.366 0.333 0.300 0.383 0.500 0.350 0.416 0.400 0.166 0.200 0.366
MMMUPRO_Public_Health 0.241 0.293 0.396 0.551 0.758 0.482 0.551 0.327 0.172 0.448 0.655
MMMUPRO_Sociology 0.333 0.462 0.574 0.629 0.592 0.574 0.518 0.407 0.425 0.314 0.592
MMMUPRO 0.263 0.293 0.384 0.442 0.527 0.430 0.463 0.335 0.238 0.341 0.450
DISCIPLINES
NLP - - - - - - - - - - -
MATH 0.305 0.338 0.400 0.450 0.505 0.394 0.450 0.366 0.211 0.372 0.388
SCIENCE 0.178 0.218 0.318 0.378 0.452 0.354 0.400 0.258 0.213 0.289 0.407
ENGINEERING 0.239 0.272 0.337 0.409 0.563 0.442 0.463 0.373 0.246 0.360 0.476
MEDICINE 0.222 0.309 0.408 0.467 0.580 0.481 0.529 0.371 0.243 0.389 0.511
HUMANITIES 0.392 0.441 0.517 0.571 0.577 0.508 0.524 0.470 0.387 0.419 0.530
BUSINESS 0.309 0.360 0.477 0.550 0.653 0.552 0.603 0.399 0.300 0.447 0.532
LAW - - - - - - - - - - -
VISION 0.498 0.500 0.628 0.714 0.780 0.790 0.773 0.640 0.634 0.661 0.728
COMPOSITE AVERAGE
AVG 0.471 0.479 0.602 0.685 0.752 0.749 0.739 0.608 0.590 0.627 0.698

AUDIO MODELS:

MODEL Qwen2.5-Omni-7B ultravox-v0_5-llama-3_1-8b ultravox-v0_5-deepseek-r1-llama-3_1-8b ultravox-v0_5-deepseek-r1-llama-3_1-8b ultravox-v0_6-gemma-3-27b ultravox-v0_6-qwen-3-32b Voxtral-Mini-3B-2507 Voxtral-Mini-3B-2507 Voxtral-Small-24B-2507
params 7.62B 8.03B 8.03B 8.03B 27.01B 32.8B 4.01B 4.01B 23.57B
quant Q6_K_H Q6_K_H Q6_K Q6_K_H Q4_K_H Q4_K_H Q6_K Q6_K_H Q4_K_H
engine llama.cpp version: 5780 llama.cpp version: 5780 llama.cpp version: 5869 llama.cpp version: 5890 llama.cpp version: 5853 llama.cpp version: 5853 llama.cpp version: 6014 llama.cpp version: 6014 llama.cpp version: 6014
TEST acc acc acc acc acc acc acc acc acc
BBA_formal_fallacies 0.472 0.552 0.768 0.848 0.640 0.996 0.544 0.528 0.576
BBA_navigate 0.756 0.776 0.988 0.984 0.716 0.976 0.664 0.656 0.680
BBA_object_counting 0.616 0.864 0.924 0.856 0.800 0.984 0.596 0.640 0.504
BBA_web_of_lies 0.540 0.844 0.932 0.920 0.464 0.784 0.576 0.576 0.660
BBA 0.596 0.759 0.903 0.902 0.655 0.935 0.595 0.600 0.605
DISCIPLINES
NLP 0.506 0.698 0.850 0.884 0.552 0.890 0.560 0.552 0.618
MATH 0.686 0.820 0.956 0.920 0.758 0.980 0.630 0.648 0.592
SCIENCE - - - - - - - - -
ENGINEERING - - - - - - - - -
MEDICINE - - - - - - - - -
HUMANITIES - - - - - - - - -
BUSINESS - - - - - - - - -
LAW - - - - - - - - -
AUDIO - - - - - - - - -
COMPOSITE AVERAGE
AVG 0.596 0.759 0.903 0.902 0.655 0.935 0.595 0.600 0.605
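
The summary rows in the audio table can be reproduced from the subtest rows. The sketch below is an observation from the published numbers, not a confirmed description of the benchmark code: BBA matches the unweighted mean of its four subtests, NLP matches the mean of formal_fallacies and web_of_lies, and MATH matches the mean of navigate and object_counting. The grouping in `discipline_members` and the helper names `mean` and `summarize` are assumptions made for illustration. Values are taken from the first column (Qwen2.5-Omni-7B).

```python
# Subtest accuracies copied from the first column of the audio table.
subtests = {
    "BBA_formal_fallacies": 0.472,
    "BBA_navigate": 0.756,
    "BBA_object_counting": 0.616,
    "BBA_web_of_lies": 0.540,
}

# Assumed subtest-to-discipline grouping, inferred from the table values.
discipline_members = {
    "NLP": ["BBA_formal_fallacies", "BBA_web_of_lies"],
    "MATH": ["BBA_navigate", "BBA_object_counting"],
}


def mean(values):
    """Unweighted arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def summarize(subtests, discipline_members):
    """Return the test-level mean plus one mean per discipline."""
    summary = {"BBA": mean(subtests.values())}
    for name, members in discipline_members.items():
        summary[name] = mean(subtests[m] for m in members)
    return summary


summary = summarize(subtests, discipline_members)
for name, value in summary.items():
    print(f"{name} {value:.3f}")
# BBA 0.596
# NLP 0.506
# MATH 0.686
```

Because BBA is the only test run on the audio models, the AVG row is identical to the BBA row in every column.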