To see results on github go to https://github.com/steampunque/benchlm

To see results on hf go to https://huggingface.co/spaces/steampunque/benchlm

Independent LLM benchmarks for a wide range of open weight models using custom prompts,
including category and discipline summaries.  The model list is actively updated
with the latest model releases.  Results for older, obsoleted models are not kept.

The primary model families being tracked as of 6/2/25 are:
Meta (Llama), Mistral, Google (Gemma), Qwen (Qwen2.5, Qwen2.5 coder, Qwen3, QwQ),
Microsoft (Phi), Deepseek (Qwen R1 distills).

Secondary model families being tracked as of 6/2/25 are:
Falcon3, the InternLM family, and the GLM family.

Models are selected for tracking based on high popularity, high performance, or
other innovation, and must carry non-restrictive open source license terms.

Tests are run using a modified llama.cpp server (supporting logprob completion mode).
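Logprob completion mode scores a fixed candidate completion by summing the per-token log-probabilities the model assigns to it, then picks the most probable candidate. A minimal sketch of that selection logic, assuming per-token logprobs have already been obtained from the server (function names here are hypothetical; the modified server interface itself is not shown):

```python
def completion_logprob(token_logprobs):
    """Score a forced completion as the sum of its per-token
    log-probabilities, i.e. the log-probability of the whole
    completion given the prefix."""
    return sum(token_logprobs)

def pick_completion(candidates):
    """Return the index of the most probable candidate completion.

    candidates: a list where each entry is the list of per-token
    logprobs for one candidate completion (hypothetical input shape).
    """
    scores = [completion_logprob(lps) for lps in candidates]
    return max(range(len(scores)), key=lambda i: scores[i])
```

This is the style of evaluation used for Winogrande below, where the same common completion is scored under each of the two possible contexts.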

MODEL CATEGORIES:
   GENERAL : general purpose text in text out
   THINK   : RL-tuned reasoning models with <think> </think> blocks or equivalent
   CODE    : coding optimized models
   MATH    : math optimized models applied to the Hendrycks MATH500 set
   VISION  : image + text in, text out
   AUDIO   : audio + text in, text out

METHODOLOGY:
   -All CoT, code, and math tests are zero-shot.  A few BBH tests use few-shot examples.
   -Math CoT tests such as GSM8K, APPLE, MATH, etc. are self-graded against the correct answer using the LLM under test.
      If self-grading does not work reliably (such as with a very small model), the result is zeroed to mark the test invalid.
   -All non-CoT MC tests make two queries: one with answers in test order and a second with answers circularly shifted by one position.
      To score a correct answer in MC, both queries must be answered correctly.
   -CoT MC tests (e.g. MMLUPRO, GPQA) make one query only.
   -Winogrande uses logprob completion (evaluating the probability of a common completion for the two possible cases).
   -The new SQA test is not run on all models.  When it is run, the result is reported only under SQA.
      The SQA result is not added to the knowledge and composite averages.  The test is not meant for
      small models and results are given for information only.
   -MMMU uses the validation split (30 questions from each of 30 categories) and CoT prompting.
   -MMMUPRO uses the 10-question split and CoT prompting.
   -All new CoT tests are run with a maximum of 250 questions per test category.
      This cap is necessary to contain test time with newer thinking models, which can generate very lengthy responses.
      The result is printed in italics if more than 10 questions were skipped in a test category.
      Note that some very old runs had skips due to JSON errors in questions, but these do not significantly impact averages.
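The two-query multiple-choice scheme above can be sketched as follows. This is an illustration of the scoring rule only, assuming a hypothetical `ask(question, choices)` callable that returns the model's chosen index; it is not the harness's actual code:

```python
def rotate_choices(choices):
    """Circularly shift the answer choices left by one position."""
    return choices[1:] + choices[:1]

def score_mc(question, choices, correct_idx, ask):
    """Two-query MC scoring: the answer counts as correct only if the
    model picks the right choice both in the original order and after
    the choices are circularly shifted by one.

    ask(question, choices) -> chosen index (hypothetical model stub).
    """
    first = ask(question, choices) == correct_idx
    shifted = rotate_choices(choices)
    # After a left rotation the correct answer moves back one slot (mod N).
    shifted_idx = (correct_idx - 1) % len(choices)
    second = ask(question, shifted) == shifted_idx
    return first and second
```

A model that always picks the same slot regardless of content (positional bias) can pass at most one of the two queries, so it scores zero under this rule.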

TESTS:
   KNOWLEDGE:
      TQA - Truthful QA
      SQA - Simple QA, a 4333-question arcane-knowledge quiz
      JEOPARDY - 100-question JEOPARDY quiz
   LANGUAGE:
      LAMBADA - Language Modeling Broadened to Account for Discourse Aspects
   UNDERSTANDING:
      WG - Winogrande
      BOOLQ - Boolean questions
      STORYCLOZE - Story questions
      OBQA - Open Book Question / Answer
      SIQA - Social IQ
      RACE - Reading comprehension dataset from examinations
      MMLU - massive multitask language understanding
      MEDQA - medical QA
   REASONING
      CSQA - Common Sense Question Answer
      COPA - Choice of Plausible Alternatives
      HELLASWAG - Harder Endings, Longer contexts, and Low-shot Activities
                  for Situations With Adversarial Generations
      PIQA - Physical Interaction: Question Answering
      ARC - AI2 Reasoning Challenge
      AGIEVAL - AGIEval logiqa, lsat, sat
      AGIEVALC  - Gaokao SAT, logiqa, jec (Chinese)
      MUSR - Multistep Soft Reasoning
   COT:
      GSM8K - Grade School Math CoT
      BBH - Beyond the Imitation Game Benchmark Hard CoT
      GPQA - Google-Proof QA science CoT
      MMLUPRO - massive multitask language understanding pro CoT
      AGIEVAL - satmath, aquarat
      AGIEVALC  - mathcloze, mathqa (Chinese)
      MUSR - Multistep Soft Reasoning
      APPLE - 100 custom Apple Questions
   MATH:
      MATH1..MATH5 - MATH Datasets level 1 through 5 (Hendrycks et al.)
   CODE:
      HUMANEVAL - Python
      HUMANEVALP - Python, extended test
      HUMANEVALX - Python, Java, Javascript, C++
      MBPP - Python
      MBPPP - Python, extended test
      CRUXEVAL - Python
      Use {TEST}FIM for the FIM (fill-in-the-middle) variant of a code test, e.g. HUMANEVAL -> HUMANEVALFIM
   VISION:
      CHARTQA - Chart Question/Answer
      DOCVQA  - Document Visual QA
      MMMU - Massive Multi-discipline Multimodal Understanding (CoT)
      MMMUPRO - Massive Multi-discipline Multimodal Understanding Pro (CoT)
   AUDIO:
      BBA - Big Bench Audio

GENERAL MODELS:

MODEL Falcon3-1B-Instruct Falcon3-7B-Instruct Falcon3-10B-Instruct gemma-2-9b-it gemma-2-27b-it gemma-3-1b-it gemma-3-4b-it gemma-3-12b-it gemma-3-12b-it gemma-3-27b-it glm-4-9b-chat glm-4-9b-chat internlm3-8b-instruct Llama-3.1-8B-Instruct Llama-3.2-3B-Instruct Llama-4-Scout-17B-16E-Instruct Llama-4-Scout-17B-16E-Instruct Llama-4-Scout-17B-16E-Instruct Mistral-7B-Instruct-v0.3 Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.2-24B-Instruct-2506 Phi-3.5-mini-8k-instruct Phi-3.5-mini-128k-instruct Phi-4 Qwen2.5-3B-32k-Instruct Qwen2.5-3B-32k-Instruct Qwen2.5-7B-32k-Instruct Qwen2.5-7B-32k-Instruct Qwen2.5-14B-32k-Instruct Qwen2.5-32B-Instruct
params 1.67B 7.46B 10.31B 9.24B 27.23B 0.99989B 3.88B 11.77B 11.77B 27.01B 9.40B 9.40B 8.80B 8.03B 3.21B 107.77B 107.77B 107.77B 7.25B 23.57B 23.57B 23.57B 23.57B 3.82B 3.82B 14.66B 3.09B 3.09B 7.62B 7.62B 14.77B 32.76B
quant IQ4_XS Q6_K IQ4_XS Q6_K IQ4_XS Q8_0 Q6_K IQ4_XS Q4_K_H Q4_K_H IQ4_XS Q6_K IQ4_XS Q6_K Q6_K Q2_K_H Q3_K_H Q4_K_H Q8_0 Q2_K_H Q3_K_H Q4_K_H Q4_K_H Q6_K Q6_K IQ4_XS IQ4_XS Q6_K IQ4_XS Q6_K IQ4_XS IQ4_XS
engine llama.cpp version: 4341 llama.cpp version: 4341 llama.cpp version: 4341 llama.cpp version: 3266 llama.cpp version: 3389 llama.cpp version: 4877 llama.cpp version: 4888 llama.cpp version: 4938 llama.cpp version: 5572 llama.cpp version: 5586 llama.cpp version: 3496 llama.cpp version: 3334 llama.cpp version: 4488 llama.cpp version: 3428 llama.cpp version: 3825 llama.cpp version: 5236 llama.cpp version: 5279 llama.cpp version: 5335 llama.cpp version: 3262 llama.cpp version: 5509 llama.cpp version: 5509 llama.cpp version: 5509 llama.cpp version: 5742 llama.cpp version: 3609 llama.cpp version: 3600 llama.cpp version: 4295 llama.cpp version: 4038 llama.cpp version: 4038 llama.cpp version: 3943 llama.cpp version: 3870 llama.cpp version: 3821 llama.cpp version: 3821
TEST acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc
WG 0.600 0.670 0.700 0.762 0.772 0.576 0.692 0.743 0.741 0.748 0.759 0.753 0.708 0.741 0.685 - - - 0.751 0.775 0.772 0.784 0.780 0.744 0.734 0.708 0.687 0.695 0.709 0.709 0.754 0.746
LAMBADA 0.524 0.688 0.692 0.735 0.755 0.504 0.635 0.724 0.721 0.742 0.786 0.783 0.662 0.747 0.705 - - - 0.766 0.786 0.789 0.798 0.792 0.677 0.613 0.750 0.685 0.682 0.722 0.724 0.769 0.781
HELLASWAG 0.308 0.684 0.716 0.775 0.810 0.307 0.527 0.779 0.767 0.802 0.834 0.840 0.846 0.696 0.559 - - - 0.591 0.866 0.877 0.899 0.872 0.716 0.669 0.801 0.670 0.713 0.820 0.822 0.863 0.894
BOOLQ 0.364 0.591 0.621 0.687 0.739 0.521 0.603 0.669 - 0.701 0.633 0.625 0.562 0.610 0.478 - - - 0.658 - - 0.646 0.684 0.562 0.573 0.653 0.517 0.533 0.617 0.623 0.647 0.701
STORYCLOZE 0.774 0.949 0.947 0.958 0.973 0.685 0.900 0.948 - 0.964 0.967 0.976 0.982 0.895 0.870 - - - 0.917 - - 0.968 0.969 0.531 0.921 0.754 0.913 0.896 0.920 0.915 0.938 0.981
CSQA 0.488 0.725 0.746 0.751 0.763 0.339 0.614 0.716 - 0.741 0.727 0.733 0.730 0.686 0.642 - - - 0.627 - - 0.756 0.751 0.669 0.660 0.740 0.701 0.717 0.768 0.781 0.795 0.823
OBQA 0.380 0.761 0.745 0.846 0.860 0.334 0.648 0.807 - 0.855 0.821 0.802 0.801 0.765 0.709 - - - 0.676 - - 0.866 0.880 0.751 0.720 0.857 0.700 0.731 0.802 0.804 0.863 0.904
COPA 0.612 0.870 0.903 0.925 0.949 0.415 0.785 0.932 - 0.944 0.955 0.944 0.927 0.889 0.749 - - - 0.812 - - 0.924 0.932 0.884 0.870 0.934 0.841 0.858 0.925 0.919 0.935 0.958
PIQA 0.233 0.696 0.732 0.801 0.841 0.386 0.653 0.784 - 0.818 0.773 0.779 0.777 0.725 0.637 - - - 0.708 - - 0.826 0.831 0.733 0.677 0.832 0.695 0.713 0.794 0.807 0.848 0.870
SIQA 0.425 0.658 0.688 0.693 0.731 0.385 0.588 0.699 - 0.716 0.664 0.665 0.706 0.648 0.622 - - - 0.620 - - 0.737 0.710 0.667 0.661 0.639 0.656 0.663 0.721 0.712 0.746 0.742
MEDQA 0.141 0.420 0.430 0.501 0.549 0.073 0.292 0.503 - 0.553 0.436 0.445 0.457 0.500 0.413 - - - 0.334 - - 0.593 0.597 0.423 0.395 0.560 0.344 0.363 0.453 0.458 0.542 0.610
SQA - 0.033 - - 0.117 - 0.052 0.092 - 0.092 - - 0.039 0.073 - - - - - - - 0.066 0.073 - - - - - - - - -
JEOPARDY 0.010 0.400 0.310 0.580 0.760 - 0.350 0.550 0.560 0.830 0.370 0.420 0.210 0.510 0.350 0.680 0.580 0.540 0.490 0.680 - 0.740 0.640 0.320 0.250 0.390 0.120 0.120 0.300 0.290 0.540 0.600
GSM8K 0.485 0.890 0.918 0.890 0.899 - 0.843 0.928 0.928 0.964 0.855 0.839 0.890 0.872 0.822 - - - 0.611 - - 0.940 0.968 0.855 0.714 0.946 0.829 0.856 0.909 0.880 0.938 0.950
APPLE 0.150 0.810 0.740 0.750 0.730 - 0.630 0.740 0.770 0.850 0.630 0.610 0.670 0.690 0.610 0.840 0.860 0.860 0.390 0.830 0.780 0.820 0.890 0.560 0.560 0.910 0.640 0.560 0.740 0.750 0.830 0.860
HUMANEVAL 0.115 0.737 0.774 0.658 0.743 0.408 0.701 0.859 0.829 0.890 0.737 0.731 0.804 0.652 0.585 - - - 0.390 0.841 0.823 0.853 0.871 0.682 0.621 0.847 0.695 0.780 0.798 0.817 0.804 0.884
HUMANEVALP 0.073 0.628 0.664 0.548 0.615 0.317 0.597 0.713 - 0.719 0.615 0.634 0.713 0.536 0.475 - - - 0.329 - - 0.731 0.750 0.591 0.524 0.725 0.615 0.682 0.670 0.658 0.676 0.768
HUMANEVALFIM - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
MBPP 0.334 0.677 0.653 0.595 0.642 0.536 0.614 0.692 - 0.677 0.579 0.591 0.552 0.564 0.498 - - - 0.451 - - 0.618 0.642 0.610 0.498 0.673 0.595 0.599 0.669 0.661 0.669 0.684
MBPPP 0.312 0.629 0.611 0.584 0.638 0.531 0.598 0.642 - 0.625 0.562 0.575 0.477 0.540 0.482 - - - 0.397 - - 0.593 0.647 0.575 0.477 0.651 0.540 0.584 0.633 0.651 0.633 0.700
HUMANEVALX_cpp 0.054 0.506 0.603 0.512 0.579 0.158 0.585 0.756 - 0.780 0.439 0.432 0.402 0.457 0.323 - - - 0.225 - - 0.292 0.713 0.280 0.219 0.676 0.420 0.237 0.475 0.554 0.323 0.701
HUMANEVALX_java 0.042 0.640 0.719 0.640 0.768 0.317 0.658 0.804 - 0.810 0.207 0.628 0.597 0.487 0.439 - - - 0.256 - - 0.804 0.829 0.079 0.060 0.634 0.640 0.615 0.695 0.737 0.780 0.865
HUMANEVALX_js 0.115 0.676 0.652 0.579 0.743 0.359 0.664 0.835 - 0.841 0.628 0.628 0.670 0.560 0.067 - - - 0.402 - - 0.786 0.786 0.560 0.451 0.786 0.646 0.689 0.719 0.750 0.798 0.847
HUMANEVALX 0.071 0.607 0.658 0.577 0.697 0.278 0.636 0.798 - 0.810 0.424 0.563 0.556 0.502 0.276 - - - 0.294 - - 0.628 0.776 0.306 0.243 0.699 0.569 0.514 0.630 0.680 0.634 0.804
CRUXEVAL_input 0.210 0.411 0.448 0.462 0.485 0.038 0.388 0.440 - 0.528 0.416 0.406 0.477 0.435 0.353 - - - 0.276 - - 0.547 0.550 0.398 0.388 0.447 0.350 0.331 0.387 0.412 0.541 0.517
CRUXEVAL_output 0.152 0.355 0.410 0.375 0.482 0.196 0.348 0.457 - 0.491 0.356 0.338 0.372 0.360 0.291 - - - 0.303 - - 0.516 0.498 0.342 0.296 0.463 0.275 0.311 0.382 0.386 0.471 0.455
CRUXEVAL 0.181 0.383 0.429 0.418 0.483 0.117 0.368 0.448 - 0.510 0.386 0.372 0.425 0.397 0.322 - - - 0.290 - - 0.531 0.524 0.370 0.342 0.455 0.312 0.321 0.385 0.399 0.506 0.486
CRUXEVALFIM_input - 0.418 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CRUXEVALFIM_output - 0.356 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CRUXEVALFIM - 0.387 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TQA_mc 0.146 0.523 0.510 0.701 0.767 0.115 0.468 0.663 - 0.696 0.636 0.640 0.637 0.564 0.555 - - - 0.549 - - 0.767 0.713 0.621 0.581 0.725 0.516 0.548 0.654 0.657 0.747 0.804
TQA_tf 0.381 0.410 0.431 0.692 0.725 0.390 0.491 0.677 - 0.634 0.484 0.457 0.593 0.512 0.566 - - - 0.548 - - 0.735 0.670 0.483 0.487 0.686 0.414 0.300 0.574 0.568 0.706 0.731
TQA 0.354 0.423 0.440 0.693 0.730 0.358 0.488 0.676 - 0.641 0.502 0.478 0.598 0.518 0.565 - - - 0.548 - - 0.739 0.675 0.499 0.498 0.691 0.426 0.329 0.583 0.578 0.711 0.740
ARC_challenge 0.374 0.809 0.819 0.882 0.897 0.319 0.699 0.869 - 0.899 0.835 0.853 0.871 0.776 0.706 - - - 0.688 - - 0.912 0.907 0.813 0.802 0.911 0.750 0.777 0.843 0.851 0.911 0.934
ARC_easy 0.598 0.925 0.933 0.952 0.963 0.563 0.875 0.955 - 0.971 0.933 0.940 0.945 0.906 0.843 - - - 0.843 - - 0.970 0.968 0.934 0.932 0.970 0.895 0.904 0.945 0.946 0.969 0.978
ARC 0.524 0.886 0.895 0.929 0.941 0.482 0.817 0.927 - 0.947 0.901 0.911 0.920 0.863 0.798 - - - 0.792 - - 0.951 0.948 0.894 0.889 0.950 0.847 0.862 0.911 0.915 0.950 0.963
RACE_high 0.431 0.698 0.730 0.802 0.833 0.338 0.633 0.795 - 0.829 0.788 0.787 0.830 0.679 0.589 - - - 0.607 - - 0.853 0.853 0.613 0.625 0.819 0.698 0.712 0.779 0.788 0.852 0.882
RACE_middle 0.463 0.777 0.793 0.849 0.883 0.398 0.713 0.858 - 0.883 0.816 0.825 0.866 0.734 0.680 - - - 0.696 - - 0.894 0.885 0.706 0.692 0.861 0.775 0.776 0.841 0.853 0.887 0.923
RACE 0.440 0.721 0.748 0.816 0.847 0.355 0.656 0.813 - 0.844 0.796 0.798 0.840 0.695 0.615 - - - 0.633 - - 0.865 0.862 0.640 0.645 0.831 0.720 0.730 0.797 0.807 0.862 0.894
MMLU
abstract_algebra 0.180 0.410 0.450 0.330 0.310 0.110 0.160 0.350 - 0.410 0.220 0.210 0.300 0.200 0.270 - - - 0.190 - - 0.370 0.430 0.300 0.210 0.410 0.240 0.250 0.440 0.430 0.570 0.600
anatomy 0.318 0.577 0.592 0.626 0.607 0.296 0.459 0.577 - 0.629 0.503 0.511 0.637 0.555 0.540 - - - 0.447 - - 0.718 0.725 0.570 0.585 0.703 0.525 0.562 0.622 0.622 0.644 0.733
astronomy 0.263 0.736 0.756 0.760 0.828 0.296 0.559 0.769 - 0.848 0.644 0.651 0.802 0.677 0.565 - - - 0.573 - - 0.888 0.888 0.703 0.703 0.776 0.618 0.657 0.763 0.769 0.868 0.875
business_ethics 0.260 0.570 0.560 0.620 0.670 0.270 0.480 0.630 - 0.710 0.570 0.610 0.670 0.550 0.480 - - - 0.520 - - 0.730 0.720 0.620 0.620 0.740 0.630 0.590 0.680 0.710 0.750 0.800
clinical_knowledge 0.373 0.652 0.683 0.743 0.788 0.316 0.554 0.716 - 0.784 0.618 0.622 0.716 0.675 0.592 - - - 0.581 - - 0.811 0.777 0.713 0.698 0.781 0.633 0.645 0.709 0.713 0.803 0.815
college_biology 0.340 0.763 0.777 0.854 0.895 0.256 0.618 0.826 - 0.847 0.687 0.715 0.777 0.722 0.625 - - - 0.625 - - 0.888 0.902 0.805 0.763 0.868 0.694 0.694 0.784 0.784 0.854 0.923
college_chemistry 0.180 0.470 0.430 0.470 0.430 0.180 0.260 0.400 - 0.470 0.380 0.380 0.440 0.400 0.310 - - - 0.350 - - 0.480 0.500 0.460 0.430 0.520 0.310 0.370 0.480 0.490 0.460 0.530
college_computer_science 0.110 0.540 0.590 0.460 0.580 0.200 0.360 0.500 - 0.560 0.470 0.480 0.640 0.400 0.350 - - - 0.320 - - 0.650 0.580 0.480 0.410 0.600 0.390 0.460 0.620 0.590 0.630 0.720
college_mathematics 0.090 0.320 0.320 0.260 0.300 0.080 0.170 0.300 - 0.400 0.240 0.280 0.280 0.260 0.210 - - - 0.180 - - 0.340 0.380 0.270 0.170 0.340 0.200 0.180 0.380 0.350 0.490 0.540
college_medicine 0.283 0.566 0.612 0.658 0.716 0.260 0.462 0.624 - 0.682 0.572 0.589 0.676 0.589 0.491 - - - 0.456 - - 0.722 0.728 0.612 0.566 0.728 0.560 0.606 0.606 0.624 0.710 0.739
college_physics 0.186 0.372 0.411 0.352 0.421 0.098 0.196 0.411 - 0.470 0.313 0.323 0.362 0.313 0.303 - - - 0.254 - - 0.529 0.539 0.333 0.294 0.529 0.382 0.392 0.401 0.372 0.519 0.656
computer_security 0.370 0.710 0.690 0.730 0.710 0.350 0.590 0.740 - 0.760 0.710 0.730 0.720 0.690 0.620 - - - 0.600 - - 0.710 0.720 0.700 0.650 0.730 0.650 0.690 0.720 0.710 0.730 0.800
conceptual_physics 0.234 0.680 0.680 0.638 0.727 0.174 0.404 0.634 - 0.748 0.561 0.587 0.646 0.463 0.361 - - - 0.365 - - 0.744 0.731 0.565 0.553 0.748 0.485 0.519 0.642 0.642 0.800 0.834
econometrics 0.122 0.649 0.587 0.557 0.587 0.140 0.315 0.535 - 0.570 0.456 0.464 0.578 0.482 0.359 - - - 0.318 - - 0.587 0.614 0.456 0.421 0.596 0.421 0.438 0.605 0.596 0.649 0.675
electrical_engineering 0.220 0.641 0.648 0.558 0.593 0.296 0.393 0.558 - 0.627 0.544 0.572 0.655 0.524 0.462 - - - 0.393 - - 0.703 0.662 0.496 0.475 0.634 0.441 0.434 0.606 0.606 0.648 0.703
elementary_mathematics 0.113 0.505 0.497 0.476 0.476 0.058 0.288 0.502 - 0.719 0.367 0.373 0.481 0.357 0.280 - - - 0.222 - - 0.653 0.621 0.423 0.388 0.544 0.407 0.417 0.560 0.568 0.791 0.838
formal_logic 0.182 0.444 0.484 0.293 0.468 0.142 0.269 0.452 - 0.507 0.325 0.357 0.420 0.420 0.253 - - - 0.277 - - 0.563 0.484 0.452 0.380 0.531 0.325 0.341 0.452 0.428 0.539 0.626
global_facts 0.120 0.190 0.290 0.330 0.370 0.070 0.110 0.300 - 0.420 0.200 0.240 0.330 0.150 0.110 - - - 0.160 - - 0.520 0.450 0.240 0.130 0.320 0.140 0.200 0.260 0.260 0.470 0.430
high_school_biology 0.348 0.764 0.774 0.851 0.890 0.358 0.645 0.816 - 0.854 0.800 0.809 0.825 0.729 0.677 - - - 0.654 - - 0.874 0.870 0.793 0.774 0.887 0.722 0.754 0.803 0.806 0.845 0.896
high_school_chemistry 0.216 0.522 0.507 0.586 0.600 0.167 0.359 0.517 - 0.610 0.546 0.517 0.527 0.467 0.433 - - - 0.310 - - 0.650 0.640 0.512 0.492 0.655 0.413 0.463 0.532 0.536 0.596 0.724
high_school_computer_science 0.250 0.740 0.740 0.710 0.770 0.270 0.570 0.750 - 0.830 0.660 0.660 0.760 0.610 0.540 - - - 0.490 - - 0.810 0.790 0.610 0.580 0.870 0.600 0.660 0.770 0.770 0.830 0.870
high_school_european_history 0.490 0.757 0.745 0.806 0.830 0.363 0.678 0.818 - 0.824 0.812 0.830 0.787 0.709 0.672 - - - 0.678 - - 0.830 0.830 0.727 0.672 0.812 0.733 0.733 0.787 0.800 0.824 0.818
high_school_geography 0.393 0.717 0.747 0.878 0.888 0.424 0.646 0.818 - 0.858 0.792 0.818 0.792 0.757 0.671 - - - 0.671 - - 0.853 0.853 0.792 0.737 0.888 0.712 0.732 0.833 0.833 0.868 0.883
high_school_government_and_politics 0.487 0.875 0.875 0.926 0.963 0.450 0.772 0.911 - 0.937 0.875 0.870 0.880 0.818 0.725 - - - 0.805 - - 0.937 0.948 0.849 0.834 0.937 0.772 0.797 0.917 0.917 0.958 0.968
high_school_macroeconomics 0.235 0.653 0.687 0.717 0.758 0.238 0.474 0.682 - 0.771 0.651 0.653 0.733 0.556 0.497 - - - 0.478 - - 0.766 0.743 0.646 0.635 0.807 0.564 0.592 0.684 0.684 0.802 0.825
high_school_mathematics 0.088 0.344 0.337 0.277 0.325 0.033 0.211 0.337 - 0.422 0.237 0.240 0.285 0.255 0.233 - - - 0.162 - - 0.362 0.348 0.214 0.203 0.274 0.270 0.244 0.440 0.422 0.500 0.537
high_school_microeconomics 0.268 0.823 0.827 0.801 0.852 0.315 0.533 0.798 - 0.831 0.760 0.773 0.798 0.684 0.575 - - - 0.540 - - 0.873 0.878 0.794 0.743 0.861 0.672 0.697 0.827 0.827 0.857 0.907
high_school_physics 0.099 0.509 0.496 0.423 0.496 0.112 0.198 0.403 - 0.562 0.344 0.364 0.443 0.317 0.211 - - - 0.165 - - 0.596 0.582 0.377 0.384 0.569 0.317 0.311 0.470 0.456 0.635 0.695
high_school_psychology 0.445 0.827 0.853 0.896 0.910 0.445 0.750 0.882 - 0.900 0.840 0.858 0.862 0.834 0.761 - - - 0.764 - - 0.913 0.900 0.855 0.844 0.904 0.803 0.796 0.858 0.856 0.882 0.902
high_school_statistics 0.185 0.564 0.625 0.574 0.615 0.129 0.337 0.574 - 0.555 0.509 0.500 0.638 0.462 0.342 - - - 0.361 - - 0.648 0.648 0.569 0.523 0.643 0.481 0.518 0.615 0.648 0.717 0.782
high_school_us_history 0.436 0.764 0.794 0.829 0.867 0.348 0.705 0.843 - 0.872 0.833 0.867 0.774 0.784 0.696 - - - 0.699 - - 0.897 0.921 0.759 0.735 0.877 0.715 0.759 0.843 0.852 0.882 0.906
high_school_world_history 0.535 0.759 0.818 0.872 0.881 0.392 0.696 0.864 - 0.907 0.810 0.827 0.801 0.789 0.725 - - - 0.720 - - 0.869 0.873 0.746 0.742 0.869 0.776 0.793 0.818 0.827 0.869 0.877
human_aging 0.309 0.596 0.627 0.690 0.739 0.345 0.524 0.645 - 0.699 0.582 0.591 0.641 0.618 0.569 - - - 0.542 - - 0.713 0.704 0.582 0.547 0.726 0.569 0.587 0.681 0.690 0.717 0.771
human_sexuality 0.351 0.648 0.694 0.746 0.755 0.358 0.519 0.740 - 0.763 0.648 0.633 0.648 0.671 0.587 - - - 0.569 - - 0.839 0.816 0.664 0.587 0.740 0.625 0.625 0.740 0.717 0.786 0.839
international_law 0.404 0.727 0.776 0.801 0.760 0.495 0.677 0.801 - 0.809 0.735 0.752 0.743 0.776 0.710 - - - 0.710 - - 0.834 0.818 0.735 0.727 0.892 0.710 0.685 0.768 0.785 0.834 0.867
jurisprudence 0.444 0.740 0.768 0.785 0.833 0.379 0.648 0.740 - 0.796 0.675 0.722 0.777 0.731 0.574 - - - 0.626 - - 0.833 0.805 0.722 0.750 0.787 0.694 0.712 0.759 0.750 0.824 0.824
logical_fallacies 0.380 0.711 0.730 0.811 0.797 0.300 0.644 0.779 - 0.871 0.730 0.754 0.717 0.736 0.687 - - - 0.660 - - 0.797 0.785 0.785 0.754 0.779 0.705 0.723 0.773 0.766 0.834 0.877
machine_learning 0.196 0.508 0.491 0.437 0.571 0.169 0.285 0.464 - 0.482 0.419 0.401 0.500 0.366 0.285 - - - 0.321 - - 0.571 0.571 0.437 0.375 0.544 0.339 0.321 0.437 0.410 0.526 0.642
management 0.417 0.825 0.786 0.825 0.844 0.475 0.708 0.864 - 0.825 0.737 0.766 0.834 0.737 0.669 - - - 0.708 - - 0.844 0.864 0.786 0.776 0.854 0.689 0.718 0.805 0.825 0.825 0.864
marketing 0.517 0.820 0.854 0.863 0.893 0.508 0.782 0.858 - 0.897 0.850 0.858 0.888 0.837 0.799 - - - 0.756 - - 0.893 0.888 0.820 0.803 0.914 0.811 0.816 0.888 0.893 0.897 0.901
medical_genetics 0.340 0.720 0.750 0.780 0.810 0.240 0.510 0.720 - 0.790 0.630 0.640 0.710 0.720 0.660 - - - 0.600 - - 0.850 0.880 0.710 0.700 0.860 0.660 0.690 0.770 0.770 0.820 0.900
miscellaneous 0.420 0.749 0.768 0.830 0.854 0.401 0.687 0.825 - 0.879 0.775 0.796 0.768 0.773 0.736 - - - 0.727 - - 0.872 0.872 0.777 0.759 0.864 0.724 0.726 0.807 0.814 0.871 0.885
moral_disputes 0.323 0.609 0.618 0.680 0.736 0.332 0.511 0.664 - 0.719 0.604 0.612 0.635 0.621 0.560 - - - 0.524 - - 0.748 0.754 0.615 0.621 0.748 0.537 0.566 0.664 0.676 0.725 0.760
moral_scenarios 0.115 0.165 0.411 0.325 0.366 0.117 0.143 0.207 - 0.489 0.307 0.360 0.188 0.205 0.410 - - - 0.122 - - 0.482 0.377 0.366 0.404 0.582 0.130 0.058 0.318 0.368 0.546 0.565
nutrition 0.313 0.650 0.666 0.683 0.758 0.333 0.565 0.676 - 0.764 0.643 0.653 0.751 0.689 0.620 - - - 0.555 - - 0.843 0.826 0.669 0.620 0.771 0.647 0.630 0.745 0.745 0.790 0.797
philosophy 0.327 0.681 0.675 0.658 0.713 0.363 0.536 0.726 - 0.742 0.652 0.659 0.688 0.617 0.578 - - - 0.587 - - 0.736 0.781 0.630 0.588 0.784 0.562 0.565 0.675 0.688 0.774 0.778
prehistory 0.308 0.660 0.697 0.728 0.783 0.342 0.577 0.759 - 0.827 0.635 0.663 0.641 0.700 0.604 - - - 0.580 - - 0.805 0.805 0.697 0.663 0.805 0.641 0.666 0.762 0.756 0.836 0.861
professional_accounting 0.184 0.418 0.432 0.496 0.514 0.152 0.280 0.436 - 0.531 0.404 0.425 0.429 0.393 0.336 - - - 0.336 - - 0.531 0.517 0.418 0.386 0.510 0.386 0.414 0.457 0.460 0.560 0.631
professional_law 0.202 0.397 0.417 0.478 0.528 0.177 0.323 0.441 - 0.489 0.404 0.408 0.417 0.397 0.369 - - - 0.333 - - 0.518 0.505 0.410 0.401 0.492 0.340 0.337 0.401 0.402 0.477 0.541
professional_medicine 0.235 0.639 0.636 0.756 0.794 0.113 0.481 0.761 - 0.783 0.654 0.680 0.680 0.724 0.713 - - - 0.564 - - 0.827 0.827 0.687 0.658 0.823 0.573 0.580 0.680 0.683 0.812 0.845
professional_psychology 0.300 0.647 0.665 0.728 0.805 0.272 0.495 0.718 - 0.753 0.598 0.609 0.684 0.642 0.509 - - - 0.521 - - 0.790 0.777 0.655 0.617 0.799 0.586 0.591 0.707 0.702 0.776 0.810
public_relations 0.409 0.563 0.600 0.700 0.672 0.354 0.509 0.690 - 0.681 0.572 0.627 0.618 0.518 0.545 - - - 0.554 - - 0.736 0.736 0.554 0.572 0.727 0.563 0.572 0.627 0.645 0.736 0.663
security_studies 0.240 0.608 0.644 0.746 0.763 0.285 0.632 0.661 - 0.755 0.624 0.632 0.718 0.665 0.616 - - - 0.600 - - 0.787 0.767 0.669 0.673 0.730 0.620 0.653 0.718 0.718 0.767 0.775
sociology 0.412 0.781 0.791 0.815 0.860 0.517 0.666 0.820 - 0.850 0.736 0.741 0.810 0.786 0.741 - - - 0.716 - - 0.860 0.875 0.820 0.781 0.870 0.716 0.736 0.815 0.825 0.855 0.860
us_foreign_policy 0.510 0.780 0.790 0.868 0.840 0.460 0.740 0.890 - 0.860 0.780 0.800 0.840 0.800 0.800 - - - 0.757 - - 0.920 0.910 0.760 0.770 0.890 0.750 0.780 0.820 0.820 0.890 0.880
virology 0.246 0.433 0.445 0.472 0.506 0.283 0.415 0.469 - 0.481 0.415 0.439 0.469 0.439 0.415 - - - 0.387 - - 0.512 0.512 0.403 0.367 0.500 0.373 0.427 0.463 0.457 0.487 0.518
world_religions 0.403 0.748 0.801 0.800 0.847 0.350 0.684 0.795 - 0.836 0.766 0.766 0.748 0.789 0.742 - - - 0.747 - - 0.871 0.853 0.742 0.725 0.836 0.783 0.760 0.818 0.818 0.859 0.871
MMLU 0.285 0.591 0.623 0.647 0.687 0.269 0.477 0.631 - 0.701 0.580 0.595 0.617 0.570 0.525 - - - 0.486 - - 0.717 0.704 0.599 0.578 0.710 0.532 0.540 0.639 0.643 0.721 0.757
AGIEVAL
aquarat 0.374 0.602 0.562 0.665 0.602 0.409 0.763 0.846 - 0.844 0.653 0.637 0.783 0.598 0.633 - - - 0.279 - - 0.516 0.764 0.409 0.574 0.834 0.732 0.728 0.799 0.830 0.822 0.870
logiqa 0.208 0.356 0.337 0.447 0.477 0.145 0.342 0.479 - 0.509 0.399 0.416 0.433 0.328 0.265 - - - 0.264 - - 0.468 0.447 0.281 0.267 0.445 0.316 0.342 0.427 0.436 0.493 0.554
lsatar 0.213 0.213 0.282 0.208 0.260 0.217 0.213 0.365 - 0.317 0.073 0.217 0.308 0.295 0.239 - - - 0.186 - - 0.269 0.639 0.256 0.247 0.369 0.230 0.226 0.260 0.300 0.321 0.400
lsatlr 0.203 0.486 0.537 0.635 0.654 0.115 0.374 0.596 - 0.686 0.505 0.515 0.592 0.441 0.327 - - - 0.366 - - 0.709 0.686 0.415 0.386 0.621 0.452 0.449 0.598 0.603 0.729 0.811
lsatrc 0.312 0.594 0.646 0.750 0.754 0.208 0.475 0.702 - 0.717 0.635 0.643 0.706 0.624 0.486 - - - 0.520 - - 0.814 0.806 0.531 0.524 0.762 0.553 0.617 0.661 0.687 0.810 0.836
saten 0.470 0.791 0.810 0.834 0.868 0.305 0.728 0.854 - 0.893 0.815 0.820 0.844 0.781 0.689 - - - 0.679 - - 0.893 0.873 0.713 0.708 0.830 0.733 0.776 0.810 0.844 0.888 0.922
satmath 0.559 0.790 0.822 0.886 0.768 0.468 0.945 0.981 - 0.936 0.863 0.868 0.968 0.618 0.845 - - - 0.400 - - 0.804 0.813 0.713 0.754 0.977 0.900 0.922 0.963 0.963 0.990 0.981
AGIEVAL 0.294 0.503 0.523 0.598 0.602 0.226 0.488 0.639 - 0.663 0.525 0.546 0.611 0.480 0.433 - - - 0.359 - - 0.615 0.665 0.429 0.438 0.638 0.501 0.520 0.599 0.616 0.681 0.734
AGIEVALC_biology - 0.365 - - - 0.104 0.334 0.595 - 0.665 0.756 0.778 0.869 - - - - - - - - 0.721 0.739 - - - 0.660 0.700 0.804 0.813 0.834 0.582
AGIEVALC_chemistry - 0.269 - - - 0.078 0.289 0.446 - 0.480 0.642 0.691 0.715 - - - - - - - - 0.509 0.509 - - - 0.441 0.470 0.583 0.627 0.696 0.789
AGIEVALC_chinese - 0.247 - - - 0.048 0.231 0.373 - 0.439 0.642 0.650 0.723 - - - - - - - - 0.569 0.577 - - - 0.508 0.504 0.585 0.593 0.760 0.735
AGIEVALC_english - 0.774 - - - 0.444 0.728 0.862 - 0.866 0.823 0.833 0.905 - - - - - - - - 0.892 0.892 - - - 0.794 0.839 0.856 0.849 0.915 0.924
AGIEVALC_geography - 0.407 - - - 0.246 0.396 0.608 - 0.678 0.728 0.728 0.814 - - - - - - - - 0.718 0.718 - - - 0.643 0.633 0.753 0.778 0.804 0.839
AGIEVALC_history - 0.374 - - - 0.225 0.421 0.642 - 0.689 0.829 0.834 0.872 - - - - - - - - 0.736 0.736 - - - 0.740 0.744 0.774 0.800 0.842 0.923
AGIEVALC_jecqaca - 0.221 - - - 0.142 0.258 0.292 - 0.348 0.414 0.440 0.660 - - - - - - - - 0.416 0.410 - - - 0.425 0.424 0.482 0.487 0.564 0.622
AGIEVALC_jecqakd - 0.223 - - - 0.118 0.229 0.356 - 0.400 0.549 0.559 0.759 - - - - - - - - 0.465 0.461 - - - 0.498 0.526 0.592 0.605 0.732 0.747
AGIEVALC_logiqa - 0.310 - - - 0.193 0.328 0.488 - 0.523 0.479 0.490 0.556 - - - - - - - - 0.525 0.525 - - - 0.399 0.405 0.497 0.500 0.565 0.588
AGIEVALC_mathcloze - 0.508 - - - - 0.567 0.779 - 0.855 0.491 0.542 0.508 - - - - - - - - 0.754 0.915 - - - 0.508 0.440 0.694 0.686 0.737 0.805
AGIEVALC_mathqa - 0.569 - - - 0.322 0.616 0.779 - 0.744 0.621 0.648 0.845 - - - - - - - - 0.664 0.844 - - - 0.595 0.683 0.779 0.755 0.808 0.834
AGIEVALC_physics - 0.327 - - - 0.091 0.206 0.304 - 0.471 0.396 0.425 0.563 - - - - - - - - 0.431 0.477 - - - 0.390 0.413 0.431 0.500 0.683 0.770
AGIEVALC - 0.361 - - - 0.187 0.368 0.514 - 0.554 0.589 0.607 0.724 - - - - - - - - 0.583 0.603 - - - 0.529 0.548 0.627 0.636 0.716 0.734
BBH
boolean_expressions 0.544 0.860 0.876 0.768 0.460 0.632 0.880 0.880 - 0.732 0.848 0.868 0.800 0.844 0.480 - - - 0.764 - - 0.872 0.860 0.852 0.832 0.936 0.756 0.796 0.864 0.880 0.888 0.808
causal_judgement 0.550 0.577 0.582 0.598 0.604 0.550 0.582 0.652 - 0.620 0.550 0.550 0.641 0.540 0.518 - - - 0.588 - - 0.631 0.582 0.588 0.593 0.647 0.497 0.529 0.508 0.513 0.647 0.700
date_understanding 0.324 0.668 0.748 0.748 0.788 0.408 0.868 0.920 - 0.760 0.580 0.572 0.832 0.716 0.664 - - - 0.548 - - 0.728 0.920 0.696 0.576 0.932 0.616 0.648 0.764 0.740 0.856 0.872
disambiguation_qa 0.400 0.712 0.668 0.660 0.720 0.284 0.432 0.448 - 0.612 0.584 0.636 0.716 0.516 0.472 - - - 0.600 - - 0.388 0.516 0.720 0.752 0.768 0.544 0.556 0.656 0.636 0.764 0.780
dyck_languages 0.424 0.704 0.712 0.728 0.600 0.344 0.636 0.824 - 0.892 0.516 0.544 0.592 0.796 0.680 - - - 0.744 - - 0.792 0.684 0.580 0.468 0.776 0.596 0.628 0.868 0.836 0.648 0.820
formal_fallacies 0.624 0.740 0.660 0.832 0.760 0.612 0.876 0.832 - 0.820 0.568 0.660 0.984 0.984 0.816 - - - 0.852 - - 0.964 0.692 0.808 0.808 0.804 0.928 0.852 0.628 0.628 0.784 0.812
geometric_shapes 0.056 0.544 0.456 0.436 0.420 0.128 0.376 0.456 - 0.544 0.392 0.400 0.812 0.440 0.416 - - - 0.288 - - 0.280 0.716 0.416 0.292 0.648 0.204 0.212 0.544 0.604 0.584 0.640
hyperbaton 0.512 0.572 0.680 0.884 0.836 0.108 0.940 0.976 - 0.932 0.740 0.824 0.884 0.880 0.624 - - - 0.656 - - 0.884 0.892 0.936 0.936 0.996 0.636 0.676 0.832 0.792 0.868 0.956
logical_deduction_five_objects 0.176 0.700 0.532 0.568 0.608 0.284 0.604 0.840 - 0.784 0.528 0.516 0.784 0.568 0.484 - - - 0.352 - - 0.600 0.968 0.632 0.532 0.940 0.468 0.528 0.752 0.728 0.876 0.924
logical_deduction_seven_objects 0.152 0.556 0.492 0.560 0.552 0.212 0.640 0.740 - 0.776 0.444 0.500 0.756 0.488 0.408 - - - 0.296 - - 0.616 0.944 0.568 0.500 0.920 0.420 0.436 0.668 0.656 0.792 0.864
logical_deduction_three_objects 0.376 0.868 0.820 0.844 0.892 0.428 0.860 0.992 - 0.912 0.836 0.840 0.960 0.804 0.652 - - - 0.608 - - 0.840 0.988 0.844 0.804 0.992 0.696 0.720 0.940 0.956 0.980 0.992
movie_recommendation 0.424 0.652 0.676 0.552 0.508 0.372 0.536 0.664 - 0.632 0.604 0.648 0.740 0.536 0.456 - - - 0.508 - - 0.684 0.672 0.520 0.508 0.992 0.604 0.568 0.556 0.536 0.672 0.648
multistep_arithmetic_two 0.136 0.944 0.968 0.488 0.472 - 0.868 0.888 - 0.972 0.580 0.524 0.508 0.700 0.532 - - - 0.108 - - 0.832 0.956 0.836 0.420 0.984 0.852 0.876 0.896 0.948 0.964 0.976
navigate 0.540 0.580 0.588 0.596 0.648 0.592 0.648 0.724 - 0.744 0.420 0.420 0.580 0.580 0.580 - - - 0.600 - - 0.680 0.464 0.588 0.584 0.640 0.576 0.572 0.596 0.596 0.624 0.684
object_counting 0.464 0.764 0.820 0.848 0.856 - 0.908 0.908 - 0.976 0.616 0.660 0.892 0.864 0.808 - - - 0.608 - - 0.832 0.984 0.836 0.344 0.996 0.740 0.764 0.848 0.804 0.892 0.896
penguins_in_a_table 0.369 0.842 0.746 0.890 0.842 0.267 0.876 0.986 - 0.739 0.917 0.917 0.958 0.856 0.801 - - - 0.623 - - 0.616 0.993 0.883 0.712 1.000 0.821 0.849 0.945 0.924 0.958 0.986
reasoning_about_colored_objects 0.276 0.860 0.800 0.744 0.900 0.180 0.752 0.888 - 0.844 0.876 0.796 0.940 0.824 0.568 - - - 0.608 - - 0.804 0.992 0.808 0.656 0.968 0.700 0.764 0.904 0.868 0.944 0.984
ruin_names 0.176 0.484 0.636 0.716 0.760 0.172 0.468 0.696 - 0.816 0.696 0.652 0.716 0.744 0.532 - - - 0.400 - - 0.764 0.748 0.612 0.600 0.816 0.396 0.324 0.440 0.544 0.692 0.760
salient_translation_error_detection 0.212 0.448 0.508 0.548 0.568 0.172 0.560 0.640 - 0.564 0.476 0.488 0.580 0.512 0.464 - - - 0.444 - - 0.600 0.656 0.520 0.532 0.636 0.452 0.432 0.560 0.572 0.612 0.700
snarks 0.483 0.685 0.707 0.691 0.719 0.033 0.724 0.803 - 0.634 0.702 0.707 0.769 0.651 0.657 - - - 0.606 - - 0.634 0.786 0.747 0.786 0.882 0.662 0.623 0.747 0.780 0.831 0.865
sports_understanding 0.584 0.672 0.692 0.788 0.816 0.488 0.696 0.804 - 0.844 0.472 0.468 0.668 0.720 0.644 - - - 0.716 - - 0.796 0.708 0.596 0.600 0.740 0.620 0.616 0.676 0.684 0.680 0.748
temporal_sequences 0.164 0.528 0.540 0.708 0.748 0.436 0.988 0.996 - 0.940 0.756 0.840 0.956 0.856 0.712 - - - 0.404 - - 0.844 0.992 0.784 0.508 1.000 0.324 0.388 0.800 0.820 0.988 0.992
tracking_shuffled_objects_five_objects 0.208 0.560 0.616 0.600 0.692 0.508 0.924 1.000 - 0.648 0.544 0.536 0.864 0.656 0.500 - - - 0.344 - - 0.716 0.992 0.940 0.712 1.000 0.420 0.452 0.840 0.908 0.924 0.972
tracking_shuffled_objects_seven_objects 0.140 0.324 0.524 0.572 0.640 0.228 0.884 0.988 - 0.660 0.512 0.436 0.764 0.592 0.420 - - - 0.296 - - 0.744 0.984 0.896 0.612 0.984 0.292 0.312 0.800 0.868 0.848 0.980
tracking_shuffled_objects_three_objects 0.288 0.696 0.732 0.732 0.848 0.808 0.972 0.992 - 0.548 0.620 0.696 0.956 0.728 0.608 - - - 0.436 - - 0.880 0.996 0.960 0.788 1.000 0.604 0.664 0.832 0.872 0.856 0.996
web_of_lies 0.476 0.576 0.520 0.520 0.488 0.488 0.516 0.540 - 0.532 0.476 0.488 0.512 0.512 0.544 - - - 0.488 - - 0.560 0.504 0.488 0.492 0.512 0.512 0.512 0.528 0.532 0.544 0.624
word_sorting 0.056 0.204 0.292 0.404 0.540 0.080 0.236 0.424 - 0.536 0.404 0.392 0.144 0.512 0.360 - - - 0.280 - - 0.632 0.592 0.204 0.152 0.360 0.156 0.156 0.212 0.220 0.292 0.400
BBH 0.334 0.638 0.650 0.664 0.674 0.355 0.711 0.794 - 0.743 0.596 0.608 0.749 0.681 0.566 - - - 0.506 - - 0.714 0.806 0.696 0.592 0.846 0.554 0.567 0.709 0.718 0.775 0.827
MUSR
murder_mystery 0.552 0.640 0.592 0.668 0.576 0.528 0.592 0.608 - 0.552 0.616 0.584 0.620 0.584 0.576 - - - 0.516 - - 0.712 0.680 0.636 0.620 0.708 0.544 0.612 0.604 0.584 0.652 0.640
object_placements 0.429 0.535 0.578 0.519 0.542 0.296 0.480 0.542 - 0.448 0.492 0.531 0.460 0.546 0.523 - - - 0.453 - - 0.516 0.532 0.503 0.457 0.464 0.472 0.476 0.531 0.554 0.519 0.265
team_allocation 0.436 0.512 0.496 0.460 0.476 0.328 0.400 0.560 - 0.572 0.572 0.588 0.448 0.460 0.396 - - - 0.356 - - 0.612 0.576 0.536 0.480 0.628 0.444 0.384 0.512 0.476 0.556 0.592
MUSR 0.472 0.562 0.555 0.548 0.531 0.383 0.490 0.570 - 0.524 0.559 0.567 0.509 0.530 0.498 - - - 0.441 - - 0.613 0.596 0.558 0.518 0.599 0.486 0.490 0.548 0.538 0.575 0.497
GPQA_diamond - - - - - - - - - 0.388 - - - - - 0.479 0.469 - - - - 0.358 0.540 - - - - - - - - -
GPQA - - - - - - - - - 0.388 - - - - - 0.479 0.469 - - - - 0.358 0.540 - - - - - - - - -
MMLUPRO
biology 0.324 0.708 0.702 0.747 0.772 0.361 0.640 0.794 - 0.752 0.676 0.695 0.750 0.686 0.623 - - - 0.582 - - 0.776 0.584 0.702 0.662 0.835 0.610 0.638 0.709 0.729 0.797 0.764
business 0.190 0.624 0.525 0.583 0.626 0.173 0.518 0.659 - 0.616 0.522 0.562 0.628 0.558 0.458 - - - 0.335 - - 0.612 0.756 0.571 0.509 0.785 0.504 0.558 0.647 0.661 0.718 0.755
chemistry 0.166 0.639 0.500 0.503 0.546 0.115 0.380 0.574 - 0.536 0.465 0.467 0.589 0.467 0.390 - - - - - - 0.488 0.728 0.463 0.296 0.765 0.387 0.451 0.559 0.580 0.684 0.701
computer_science 0.197 0.602 0.590 0.482 0.560 0.170 0.421 0.643 - 0.560 0.497 0.502 0.585 0.485 0.414 - - - - - - 0.556 0.680 0.475 0.448 0.734 0.434 0.402 0.590 0.604 0.663 0.734
economics 0.236 0.663 0.662 0.668 0.678 0.206 0.534 0.699 - 0.660 0.617 0.610 0.662 0.568 0.492 - - - - - - 0.648 0.612 0.609 0.587 0.792 0.521 0.550 0.674 0.687 0.721 0.787
engineering 0.157 0.437 0.424 0.406 0.414 0.138 0.253 0.373 - 0.420 0.303 0.298 0.454 0.378 0.302 - - - - - - 0.488 0.544 0.297 0.283 0.589 0.296 0.309 0.418 0.420 0.512 0.573
health 0.158 0.503 0.517 0.545 0.621 0.156 0.399 0.596 - 0.548 0.492 0.496 0.544 0.558 0.437 - - - - - - 0.544 0.616 0.515 0.466 0.700 0.388 0.416 0.556 0.569 0.643 0.690
history 0.149 0.406 0.467 0.493 0.490 0.152 0.354 0.540 - 0.588 0.425 0.438 0.459 0.451 0.380 - - - - - - 0.592 0.568 0.380 0.380 0.627 0.333 0.367 0.459 0.464 0.566 0.624
law 0.123 0.268 0.295 0.343 0.405 0.158 0.263 0.372 - 0.400 0.299 0.284 0.307 0.303 0.243 - - - - - - 0.384 0.328 0.276 0.000 0.500 0.220 0.237 0.300 0.292 0.366 0.455
math 0.203 0.694 0.564 0.538 0.570 0.180 0.586 0.739 - 0.664 0.490 0.523 0.617 0.555 0.511 - - - - - - 0.536 0.812 0.522 0.458 0.816 0.581 0.603 0.712 0.723 0.775 0.814
other 0.164 0.450 0.496 0.551 0.574 0.173 0.428 0.580 - 0.552 0.464 0.458 0.536 0.487 0.389 - - - - - - 0.484 0.592 0.500 0.433 0.706 0.410 0.405 0.529 0.551 0.611 0.664
philosophy 0.148 0.442 0.462 0.448 0.488 0.176 0.356 0.555 - 0.560 0.408 0.412 0.424 0.382 0.326 - - - - - - 0.476 0.580 0.406 0.390 0.633 0.376 0.364 0.480 0.464 0.557 0.599
physics 0.159 0.583 0.493 0.501 0.559 0.125 0.397 0.595 - 0.512 0.441 0.461 0.587 0.488 0.397 - - - - - - 0.488 0.724 0.455 0.425 0.765 0.419 0.456 0.589 0.602 0.702 0.543
psychology 0.258 0.621 0.645 0.647 0.692 0.273 0.567 0.685 - 0.632 0.586 0.602 0.665 0.637 0.518 - - - - - - 0.680 0.544 0.621 0.572 0.759 0.526 0.563 0.636 0.644 0.721 0.749
MMLUPRO 0.186 0.552 0.517 0.528 0.568 0.177 0.436 0.597 - 0.571 0.471 0.480 0.559 0.499 0.419 - - - 0.453 - - 0.553 0.619 0.482 0.408 0.719 0.430 0.457 0.564 0.575 0.649 0.671
CATEGORIES
REASONING 0.367 0.713 0.738 0.788 0.814 0.344 0.598 0.787 0.767 0.809 0.804 0.811 0.815 0.713 0.606 - - - 0.628 0.866 0.877 0.863 0.848 0.724 0.691 0.809 0.689 0.719 0.805 0.809 0.850 0.874
UNDERSTANDING 0.366 0.644 0.670 0.707 0.742 0.327 0.552 0.695 0.741 0.746 0.661 0.670 0.691 0.631 0.579 - - - 0.563 0.775 0.772 0.764 0.756 0.614 0.622 0.728 0.605 0.613 0.692 0.696 0.761 0.793
LANGUAGE 0.524 0.688 0.692 0.735 0.755 0.504 0.635 0.724 0.721 0.742 0.786 0.783 0.662 0.747 0.705 - - - 0.766 0.786 0.789 0.798 0.792 0.677 0.613 0.750 0.685 0.682 0.722 0.724 0.769 0.781
KNOWLEDGE 0.354 0.442 0.496 0.690 0.733 0.353 0.478 0.626 0.560 0.630 0.553 0.543 0.615 0.547 0.536 0.680 0.580 0.540 0.582 0.680 - 0.676 0.653 0.517 0.519 0.676 0.469 0.426 0.595 0.597 0.693 0.725
COT 0.220 0.552 0.530 0.550 0.582 0.201 0.470 0.616 - 0.630 0.485 0.500 0.586 0.530 0.446 0.479 0.469 - 0.498 - - 0.600 0.651 0.506 0.440 0.725 0.443 0.462 0.570 0.581 0.653 0.684
MATHCOT 0.369 0.730 0.752 0.735 0.740 0.417 0.793 0.879 0.882 0.790 0.682 0.679 0.813 0.728 0.647 0.840 0.860 0.860 0.493 0.830 0.780 0.743 0.908 0.767 0.638 0.919 0.667 0.694 0.823 0.821 0.869 0.903
CODE 0.176 0.460 0.534 0.495 0.568 0.241 0.485 0.582 0.829 0.618 0.456 0.475 0.500 0.463 0.366 - - - 0.321 0.841 0.823 0.409 0.619 0.427 0.376 0.568 0.437 0.445 0.510 0.528 0.578 0.612
DISCIPLINES
NLP 0.408 0.647 0.670 0.755 0.786 0.392 0.595 0.748 0.751 0.761 0.729 0.728 0.737 0.677 0.609 - - - 0.642 0.834 0.841 0.808 0.792 0.647 0.637 0.755 0.632 0.630 0.731 0.734 0.791 0.818
MATH 0.294 0.669 0.659 0.637 0.653 0.298 0.659 0.775 0.882 0.727 0.590 0.597 0.720 0.629 0.556 0.840 0.860 0.860 0.451 0.830 0.780 0.678 0.789 0.646 0.543 0.817 0.576 0.599 0.741 0.742 0.799 0.843
SCIENCE 0.350 0.706 0.713 0.739 0.769 0.304 0.580 0.737 - 0.797 0.686 0.698 0.756 0.676 0.605 0.479 0.469 - 0.673 - - 0.806 0.821 0.696 0.660 0.845 0.629 0.657 0.738 0.748 0.815 0.806
ENGINEERING 0.166 0.464 0.453 0.426 0.438 0.158 0.271 0.397 - 0.496 0.334 0.333 0.480 0.397 0.323 - - - 0.393 - - 0.567 0.587 0.323 0.308 0.595 0.315 0.325 0.443 0.444 0.530 0.590
MEDICINE 0.216 0.524 0.540 0.595 0.648 0.182 0.411 0.598 - 0.642 0.521 0.530 0.570 0.577 0.496 - - - 0.447 - - 0.681 0.684 0.537 0.501 0.672 0.459 0.478 0.574 0.580 0.655 0.702
HUMANITIES 0.291 0.550 0.615 0.645 0.679 0.272 0.495 0.641 0.560 0.705 0.593 0.610 0.622 0.578 0.529 0.680 0.580 0.540 0.536 0.680 - 0.710 0.698 0.588 0.567 0.739 0.527 0.533 0.629 0.638 0.716 0.742
BUSINESS 0.252 0.679 0.655 0.678 0.709 0.245 0.537 0.704 - 0.743 0.623 0.637 0.696 0.598 0.517 - - - 0.466 - - 0.749 0.762 0.637 0.604 0.801 0.565 0.596 0.701 0.710 0.759 0.802
LAW 0.200 0.362 0.427 0.483 0.524 0.172 0.316 0.443 - 0.504 0.417 0.429 0.494 0.406 0.344 - - - 0.370 - - 0.537 0.543 0.392 0.310 0.541 0.374 0.383 0.451 0.456 0.541 0.604
COMPOSITE AVERAGE
AVG 0.342 0.612 0.641 0.692 0.724 0.324 0.555 0.701 0.753 0.729 0.648 0.654 0.689 0.629 0.561 0.620 0.595 0.700 0.578 0.833 0.841 0.740 0.757 0.616 0.585 0.748 0.578 0.586 0.686 0.691 0.754 0.783

THINKING MODELS:

MODEL Qwen3-0.6B Qwen3-1.7B Qwen3-4B Qwen3-4B Qwen3-8B Qwen3-8B Qwen3-8B Qwen3-14B Qwen3-14B Qwen3-30B-A3B Qwen3-30B-A3B Qwen3-32B Qwen3-32B QwQ-32B-Preview
params 0.75163B 2.03B 4.02B 4.02B 8.19B 8.19B 8.19B 14.77B 14.77B 30.53B 30.53B 32.8B 32.8B 32.76B
quant Q8_0 Q8_0 Q8_0 Q8_0_H Q4_K_H Q6_K_H Q6_K IQ4_XS Q4_K_H IQ4_XS Q4_K_H IQ4_XS Q4_K_H IQ4_XS
engine llama.cpp version: 5679 llama.cpp version: 5415 llama.cpp version: 5242 llama.cpp version: 5509 llama.cpp version: 5279 llama.cpp version: 5223 llama.cpp version: 5153 llama.cpp version: 5223 llama.cpp version: 5379 llama.cpp version: 5279 llama.cpp version: 5353 llama.cpp version: 5466 llama.cpp version: 5466 llama.cpp version: 4273
TEST acc acc acc acc acc acc acc acc acc acc acc acc acc acc
WG 0.564 0.610 0.662 0.642 0.651 0.689 0.678 0.722 0.726 0.699 0.700 0.712 0.731 0.750
LAMBADA 0.471 0.590 0.644 0.638 0.660 0.700 0.700 0.729 0.714 0.698 0.725 0.692 0.701 0.780
HELLASWAG 0.277 0.553 0.721 0.726 0.718 0.787 0.787 0.827 0.792 0.815 0.832 0.838 0.812 0.875
BOOLQ 0.449 0.531 0.626 0.626 0.641 0.608 0.611 0.662 0.632 - 0.502 0.603 0.574 0.629
STORYCLOZE 0.764 0.833 0.849 0.843 0.809 0.852 0.873 - 0.905 - 0.917 0.960 0.959 0.964
CSQA 0.377 0.567 0.705 0.705 0.685 0.740 0.748 - 0.749 - 0.742 - 0.778 0.796
OBQA 0.394 0.584 0.756 0.754 0.719 0.767 0.774 - 0.787 - 0.836 - 0.869 0.882
COPA 0.569 0.765 0.872 0.865 0.829 0.828 0.864 - 0.919 - 0.919 - 0.946 0.936
PIQA 0.393 0.574 0.710 0.710 0.744 0.769 0.781 - 0.798 - 0.845 - 0.815 0.829
SIQA 0.363 0.569 0.664 0.664 0.637 0.671 0.679 - 0.689 - 0.693 - 0.714 0.714
MEDQA 0.135 0.278 0.435 0.428 0.448 0.499 0.509 - 0.531 - 0.597 - 0.553 0.598
SQA 0.249 0.032 0.039 0.040 0.036 0.039 0.042 - 0.045 - 0.055 - 0.047 -
JEOPARDY 0.640 0.270 0.280 0.220 0.410 0.280 0.240 0.480 0.490 0.520 0.470 0.470 0.470 0.600
GSM8K 0.748 0.920 0.946 0.960 0.946 0.953 0.956 - 0.948 - 0.962 - 0.972 0.962
APPLE 0.460 0.790 0.850 0.840 0.790 0.880 0.890 0.910 0.920 0.820 0.850 0.910 0.910 0.870
HUMANEVAL 0.445 0.682 0.817 0.804 0.835 0.865 0.859 - 0.859 - 0.884 - 0.890 0.414
HUMANEVALP 0.335 0.591 0.682 0.676 0.725 0.713 0.731 - 0.737 - 0.750 - 0.780 0.359
HUMANEVALFIM - - - - - - - - - - - - - -
MBPP 0.408 0.544 0.645 0.642 0.571 0.618 0.630 - 0.700 - 0.677 - 0.684 0.404
MBPPP 0.388 0.482 0.598 0.602 0.580 0.611 0.566 - 0.651 - - - 0.678 0.392
HUMANEVALX_cpp 0.231 0.359 0.463 0.353 0.524 0.615 0.554 - 0.652 - - - 0.737 0.378
HUMANEVALX_java 0.274 0.548 0.731 0.737 0.737 0.780 0.798 - 0.841 - - - 0.847 0.097
HUMANEVALX_js 0.256 0.518 0.719 0.695 0.762 0.774 0.774 - 0.786 - - - 0.817 0.493
HUMANEVALX 0.254 0.475 0.638 0.595 0.674 0.723 0.709 - 0.760 - - - 0.800 0.323
CRUXEVAL_input 0.353 0.406 0.457 0.453 0.445 0.528 0.510 - 0.537 - - - 0.450 0.200
CRUXEVAL_output 0.241 0.338 0.420 0.403 0.405 0.446 0.447 - 0.501 - - - 0.431 0.368
CRUXEVAL 0.297 0.372 0.438 0.428 0.425 0.487 0.478 - 0.519 - - - 0.440 0.284
CRUXEVALFIM_input - - - - - - - - - - - - - -
CRUXEVALFIM_output - - - - - - - - - - - - - -
CRUXEVALFIM - - - - - - - - - - - - - -
TQA_mc 0.261 0.406 0.600 0.598 0.592 0.641 0.635 - 0.676 - - - 0.742 0.795
TQA_tf 0.429 0.445 0.502 0.500 0.513 0.430 0.458 - 0.614 - - - 0.456 0.523
TQA 0.409 0.441 0.514 0.511 0.523 0.455 0.479 - 0.621 - - - 0.490 0.554
ARC_challenge 0.275 0.686 0.854 0.852 0.833 0.882 0.882 - 0.896 - - - 0.910 0.917
ARC_easy 0.502 0.850 0.937 0.933 0.934 0.952 0.955 - 0.964 - - - 0.974 0.975
ARC 0.427 0.796 0.910 0.906 0.901 0.929 0.931 - 0.942 - - - 0.953 0.956
RACE_high 0.359 0.594 0.759 0.756 0.747 0.794 0.798 - 0.826 - - - 0.822 0.871
RACE_middle 0.397 0.652 0.808 0.808 0.818 0.842 0.844 - 0.873 - - - 0.881 -
RACE 0.370 0.611 0.774 0.771 0.768 0.808 0.811 - 0.839 - - - 0.839 0.871
MMLU
abstract_algebra 0.100 0.240 0.410 0.420 0.360 0.430 0.470 - 0.500 - - - 0.470 -
anatomy 0.274 0.437 0.540 0.555 0.592 0.622 0.607 - 0.651 - - - 0.644 -
astronomy 0.328 0.611 0.723 0.730 0.796 0.822 0.828 - 0.861 - - - 0.868 -
business_ethics 0.290 0.460 0.670 0.670 0.610 0.650 0.650 - 0.720 - - - 0.730 -
clinical_knowledge 0.350 0.528 0.690 0.709 0.728 0.758 0.743 - 0.762 - - - 0.781 -
college_biology 0.388 0.604 0.770 0.784 0.784 0.805 0.812 - 0.847 - - - 0.881 -
college_chemistry 0.230 0.340 0.420 0.420 0.420 0.480 0.490 - 0.550 - - - 0.500 -
college_computer_science 0.230 0.380 0.580 0.580 0.610 0.650 0.700 - 0.630 - - - 0.700 -
college_mathematics 0.160 0.300 0.370 0.390 0.390 0.450 0.500 - 0.450 - - - 0.400 -
college_medicine 0.312 0.537 0.676 0.664 0.676 0.716 0.722 - 0.739 - - - 0.722 -
college_physics 0.137 0.254 0.519 0.509 0.500 0.529 0.578 - 0.578 - - - 0.558 -
computer_security 0.430 0.650 0.690 0.700 0.710 0.750 0.740 - 0.770 - - - 0.770 -
conceptual_physics 0.255 0.514 0.685 0.685 0.697 0.736 0.761 - 0.821 - - - 0.821 -
econometrics 0.140 0.394 0.587 0.614 0.552 0.614 0.631 - 0.622 - - - 0.605 -
electrical_engineering 0.324 0.475 0.586 0.579 0.586 0.648 0.648 - 0.717 - - - 0.724 -
elementary_mathematics 0.148 0.391 0.582 0.574 0.584 0.640 0.650 - 0.701 - - - 0.679 -
formal_logic 0.238 0.357 0.547 0.531 0.460 0.476 0.476 - 0.515 - - - 0.579 -
global_facts 0.140 0.110 0.180 0.220 0.240 0.260 0.280 - 0.300 - - - 0.310 -
high_school_biology 0.432 0.600 0.822 0.822 0.822 0.848 0.858 - 0.883 - - - 0.906 -
high_school_chemistry 0.147 0.423 0.600 0.591 0.551 0.635 0.650 - 0.709 - - - 0.665 -
high_school_computer_science 0.340 0.580 0.750 0.750 0.730 0.820 0.830 - 0.830 - - - 0.810 -
high_school_european_history 0.387 0.600 0.703 0.690 0.727 0.818 0.812 - 0.787 - - - 0.806 -
high_school_geography 0.393 0.631 0.803 0.808 0.752 0.787 0.808 - 0.853 - - - 0.878 -
high_school_government_and_politics 0.284 0.621 0.849 0.854 0.834 0.891 0.906 - 0.901 - - - 0.958 -
high_school_macroeconomics 0.292 0.474 0.661 0.656 0.669 0.712 0.712 - 0.787 - - - 0.810 -
high_school_mathematics 0.166 0.292 0.348 0.355 0.351 0.418 0.396 - 0.474 - - - 0.381 -
high_school_microeconomics 0.390 0.609 0.773 0.768 0.802 0.882 0.886 - 0.911 - - - 0.894 -
high_school_physics 0.119 0.350 0.556 0.549 0.549 0.602 0.602 - 0.649 - - - 0.662 -
high_school_psychology 0.526 0.746 0.842 0.844 0.849 0.877 0.877 - 0.891 - - - 0.921 -
high_school_statistics 0.333 0.462 0.648 0.657 0.625 0.694 0.689 - 0.703 - - - 0.736 -
high_school_us_history 0.338 0.553 0.784 0.764 0.710 0.823 0.848 - 0.862 - - - 0.897 -
high_school_world_history 0.459 0.645 0.793 0.776 0.797 0.839 0.831 - 0.827 - - - 0.864 -
human_aging 0.331 0.439 0.587 0.578 0.600 0.609 0.623 - 0.668 - - - 0.771 -
human_sexuality 0.374 0.549 0.641 0.664 0.664 0.740 0.755 - 0.770 - - - 0.809 -
international_law 0.429 0.537 0.669 0.652 0.628 0.694 0.710 - 0.826 - - - 0.801 -
jurisprudence 0.398 0.527 0.675 0.694 0.675 0.731 0.731 - 0.805 - - - 0.777 -
logical_fallacies 0.319 0.613 0.791 0.785 0.717 0.779 0.803 - 0.822 - - - 0.803 -
machine_learning 0.276 0.339 0.526 0.508 0.392 0.491 0.455 - 0.562 - - - 0.455 -
management 0.514 0.640 0.786 0.805 0.834 0.844 0.873 - 0.825 - - - 0.796 -
marketing 0.602 0.739 0.816 0.807 0.820 0.884 0.876 - 0.876 - - - 0.876 -
medical_genetics 0.370 0.580 0.750 0.710 0.750 0.780 0.750 - 0.790 - - - 0.840 -
miscellaneous 0.390 0.597 0.752 0.744 0.757 0.789 0.795 - 0.831 - - - 0.854 -
moral_disputes 0.289 0.473 0.580 0.583 0.540 0.609 0.615 - 0.638 - - - 0.699 -
moral_scenarios 0.109 0.000 0.145 0.140 0.234 0.322 0.269 - 0.330 - - - 0.292 -
nutrition 0.316 0.509 0.660 0.669 0.660 0.702 0.702 - 0.771 - - - 0.771 -
philosophy 0.225 0.501 0.623 0.617 0.575 0.636 0.643 - 0.675 - - - 0.710 -
prehistory 0.345 0.530 0.688 0.688 0.688 0.753 0.743 - 0.774 - - - 0.796 -
professional_accounting 0.237 0.308 0.439 0.443 0.404 0.482 0.482 - 0.510 - - - 0.546 -
professional_law 0.201 0.288 0.378 0.382 0.369 0.414 0.424 - 0.431 - - - 0.475 -
professional_medicine 0.209 0.463 0.698 0.705 0.676 0.757 0.775 - 0.786 - - - 0.830 -
professional_psychology 0.303 0.459 0.651 0.655 0.601 0.676 0.679 - 0.733 - - - 0.766 -
public_relations 0.345 0.500 0.572 0.563 0.581 0.600 0.636 - 0.645 - - - 0.709 -
security_studies 0.412 0.604 0.636 0.636 0.677 0.730 0.730 - 0.746 - - - 0.755 -
sociology 0.427 0.656 0.731 0.746 0.766 0.800 0.815 - 0.781 - - - 0.855 -
us_foreign_policy 0.470 0.610 0.720 0.710 0.780 0.830 0.830 - 0.830 - - - 0.840 -
virology 0.319 0.379 0.433 0.433 0.409 0.463 0.475 - 0.487 - - - 0.487 -
world_religions 0.362 0.637 0.719 0.719 0.783 0.783 0.777 - 0.818 - - - 0.807 -
MMLU 0.298 0.457 0.598 0.598 0.598 0.651 0.654 - 0.684 - - - 0.698 -
AGIEVAL
aquarat 0.572 0.760 0.866 0.840 0.834 0.866 0.897 - 0.885 - - - 0.860 -
logiqa 0.062 0.230 0.451 0.453 0.393 0.420 0.431 - 0.465 - - - 0.520 -
lsatar 0.208 0.313 0.486 0.500 0.430 0.486 0.517 - 0.469 - - - 0.495 -
lsatlr 0.164 0.372 0.601 0.594 0.574 0.641 0.658 - 0.725 - - - 0.768 -
lsatrc 0.327 0.464 0.669 0.665 0.657 0.687 0.713 - 0.713 - - - 0.806 -
saten 0.412 0.655 0.830 0.825 0.825 0.820 0.820 - 0.834 - - - 0.873 -
satmath 0.772 0.950 0.990 0.981 0.986 0.990 0.990 - 0.990 - - - 0.995 -
AGIEVAL 0.282 0.458 0.641 0.636 0.608 0.643 0.659 - 0.678 - - - 0.717 -
AGIEVALC_biology 0.152 0.539 0.765 0.760 0.769 0.834 0.847 - 0.856 - - - 0.878 -
AGIEVALC_chemistry 0.117 0.397 0.622 0.602 0.568 0.647 0.656 - 0.720 - - - 0.803 -
AGIEVALC_chinese 0.081 0.365 0.516 0.504 0.581 0.609 0.634 - 0.678 - - - 0.739 -
AGIEVALC_english 0.477 0.728 0.856 0.849 0.820 0.856 0.856 - 0.866 - - - 0.872 -
AGIEVALC_geography 0.281 0.547 0.708 0.683 0.648 0.758 0.743 - 0.768 - - - 0.829 -
AGIEVALC_history 0.319 0.612 0.736 0.727 0.702 0.761 0.770 - 0.821 - - - 0.872 -
AGIEVALC_jecqaca 0.183 0.303 0.397 0.392 0.378 0.425 0.439 - 0.482 - - - 0.566 -
AGIEVALC_jecqakd 0.123 0.359 0.480 0.484 0.513 0.561 0.574 - 0.613 - - - 0.676 -
AGIEVALC_logiqa 0.122 0.317 0.496 0.483 0.471 0.499 0.497 - 0.562 - - - 0.599 -
AGIEVALC_mathcloze 0.728 0.669 0.923 0.957 0.838 0.830 0.923 - 0.881 - - - 0.932 0.864
AGIEVALC_mathqa 0.500 0.704 0.828 0.812 0.764 0.863 0.851 - 0.813 - - - 0.852 0.828
AGIEVALC_physics 0.080 0.333 0.436 0.431 0.454 0.563 0.545 - 0.626 - - - 0.701 0.741
AGIEVALC 0.225 0.448 0.602 0.589 0.580 0.639 0.646 - 0.680 - - - 0.729 0.811
BBH
boolean_expressions 0.724 0.728 0.820 0.812 0.612 0.560 0.620 - 0.900 - - - 0.832 -
causal_judgement 0.491 0.561 0.593 0.582 0.540 0.588 0.572 - 0.604 - - - 0.631 -
date_understanding 0.504 0.752 0.880 0.888 0.852 0.936 0.912 - 0.916 - - - 0.940 -
disambiguation_qa 0.448 0.464 0.648 0.588 0.464 0.544 0.520 - 0.636 - - - 0.448 -
dyck_languages 0.412 0.524 0.580 0.572 0.672 0.696 0.688 - 0.772 - - - 0.816 -
formal_fallacies 0.800 0.800 0.748 0.776 0.568 0.616 0.604 - 0.768 - - - 0.728 -
geometric_shapes 0.228 0.572 0.536 0.556 0.692 0.716 0.676 - 0.688 - - - 0.728 -
hyperbaton 0.576 0.692 0.872 0.856 0.912 0.952 0.960 - 0.976 - - - 0.948 -
logical_deduction_five_objects 0.416 0.772 0.884 0.868 0.856 0.872 0.928 - 0.936 - - - 0.972 -
logical_deduction_seven_objects 0.360 0.664 0.856 0.840 0.816 0.860 0.880 - 0.888 - - - 0.924 -
logical_deduction_three_objects 0.612 0.932 0.988 0.988 0.988 0.984 0.980 - 0.996 - - - 1.000 -
movie_recommendation 0.360 0.416 0.528 0.504 0.492 0.520 0.544 - 0.572 - - - 0.616 -
multistep_arithmetic_two 0.896 0.988 0.996 0.988 0.984 1.000 1.000 - 0.572 - - - 0.996 -
navigate 0.516 0.576 0.580 0.580 0.508 0.576 0.608 - 0.680 - - - 0.728 -
object_counting 0.664 0.872 0.992 0.996 0.996 0.992 0.996 - 0.996 - - - 1.000 -
penguins_in_a_table 0.602 0.897 0.945 0.958 0.993 1.000 0.993 - 1.000 - - - 1.000 -
reasoning_about_colored_objects 0.520 0.792 0.952 0.960 0.928 0.940 0.960 - 0.948 - - - 0.984 -
ruin_names 0.164 0.512 0.508 0.516 0.604 0.656 0.652 - 0.772 - - - 0.776 -
salient_translation_error_detection 0.316 0.488 0.612 0.632 0.604 0.628 0.572 - 0.660 - - - 0.680 -
snarks 0.471 0.573 0.730 0.685 0.735 0.792 0.735 - 0.780 - - - 0.837 -
sports_understanding 0.472 0.524 0.624 0.596 0.540 0.644 0.636 - 0.560 - - - 0.776 -
temporal_sequences 0.136 0.400 0.912 0.892 0.940 0.992 0.992 - 0.980 - - - 0.992 -
tracking_shuffled_objects_five_objects 0.280 0.648 0.940 0.964 0.968 0.956 0.936 - 0.996 - - - 0.996 -
tracking_shuffled_objects_seven_objects 0.232 0.564 0.852 0.884 0.924 0.952 0.972 - 0.944 - - - 0.948 -
tracking_shuffled_objects_three_objects 0.408 0.736 0.896 0.884 0.920 0.920 0.952 - 0.996 - - - 0.996 -
web_of_lies 0.456 0.460 0.552 0.544 0.488 0.488 0.488 - 0.540 - - - 0.492 -
word_sorting 0.080 0.136 0.220 0.228 0.292 0.292 0.288 - 0.324 - - - 0.324 -
BBH 0.446 0.628 0.748 0.744 0.734 0.763 0.763 - 0.791 - - - 0.817 -
MUSR
murder_mystery 0.524 0.560 0.640 0.668 0.584 0.636 0.652 - 0.672 - - - 0.636 -
object_placements 0.480 0.512 0.566 0.556 0.536 0.582 0.578 - 0.528 - - - 0.516 -
team_allocation 0.280 0.468 0.628 0.612 0.648 0.656 0.668 - 0.564 - - - 0.632 -
MUSR 0.428 0.513 0.611 0.612 0.589 0.624 0.632 - 0.588 - - - 0.594 -
GPQA_diamond 0.262 0.282 0.434 0.489 0.398 0.530 - - 0.439 - - - 0.555 -
GPQA 0.262 0.282 0.434 0.489 0.398 0.530 - - 0.439 - - - 0.555 -
MMLUPRO
biology 0.268 0.596 0.799 0.784 0.804 0.822 0.831 - 0.824 - - - 0.852 -
business 0.348 0.548 0.717 0.692 0.724 0.740 0.738 - 0.784 - - - 0.812 -
chemistry 0.300 0.564 0.720 0.700 0.700 0.746 0.747 - 0.796 - - - 0.780 -
computer_science 0.232 0.532 0.680 0.664 0.676 0.704 0.707 - 0.704 - - - 0.804 -
economics 0.276 0.544 0.716 0.732 0.732 0.741 0.759 - 0.808 - - - 0.832 -
engineering 0.232 0.456 0.557 0.596 0.536 0.587 0.600 - 0.620 - - - 0.676 -
health 0.192 0.312 0.559 0.564 0.632 0.630 0.639 - 0.652 - - - 0.680 -
history 0.200 0.300 0.488 0.524 0.536 0.561 0.556 - 0.600 - - - 0.680 -
law 0.076 0.208 0.295 0.304 0.348 0.353 0.379 - 0.384 - - - 0.468 -
math 0.444 0.676 0.832 0.800 0.780 0.824 0.827 - 0.844 - - - 0.872 -
other 0.188 0.356 0.484 0.528 0.560 0.590 0.593 - 0.656 - - - 0.704 -
philosophy 0.168 0.336 0.504 0.520 0.492 0.559 0.567 - 0.580 - - - 0.664 -
physics 0.232 0.580 0.708 0.740 0.720 0.753 0.752 - 0.756 - - - 0.824 -
psychology 0.212 0.512 0.672 0.632 0.668 0.725 0.692 - 0.728 - - - 0.748 -
MMLUPRO 0.240 0.465 0.622 0.627 0.636 0.674 0.679 - 0.695 - - - 0.742 -
CATEGORIES
REASONING 0.332 0.593 0.744 0.746 0.738 0.787 0.792 0.827 0.802 0.815 0.832 0.838 0.824 0.885
UNDERSTANDING 0.353 0.521 0.653 0.651 0.646 0.693 0.697 0.722 0.725 0.699 0.747 0.849 0.743 0.809
LANGUAGE 0.471 0.590 0.644 0.638 0.660 0.700 0.700 0.729 0.714 0.698 0.725 0.692 0.701 0.780
KNOWLEDGE 0.367 0.456 0.552 0.548 0.558 0.527 0.541 0.657 0.632 0.520 0.501 0.599 0.563 0.581
COT 0.307 0.478 0.616 0.611 0.611 0.667 0.670 - 0.682 - - - 0.717 -
MATHCOT 0.545 0.766 0.900 0.890 0.882 0.884 0.895 0.910 0.904 0.820 0.954 0.910 0.924 0.927
CODE 0.317 0.443 0.538 0.524 0.532 0.582 0.573 - 0.618 - 0.755 - 0.586 0.321
DISCIPLINES
NLP 0.402 0.562 0.674 0.672 0.671 0.694 0.700 0.767 0.739 0.769 0.764 0.768 0.723 0.772
MATH 0.433 0.638 0.789 0.767 0.779 0.808 0.815 0.910 0.824 0.820 0.954 0.910 0.828 0.927
SCIENCE 0.340 0.648 0.783 0.789 0.782 0.808 0.818 - 0.847 - - - 0.866 0.946
ENGINEERING 0.265 0.463 0.561 0.589 0.554 0.595 0.606 - 0.655 - - - 0.693 -
MEDICINE 0.232 0.392 0.554 0.552 0.566 0.613 0.620 - 0.642 - 0.597 - 0.668 0.598
HUMANITIES 0.283 0.449 0.595 0.593 0.598 0.642 0.639 0.480 0.677 0.520 0.470 0.470 0.707 0.600
BUSINESS 0.358 0.555 0.717 0.717 0.725 0.756 0.762 - 0.807 - - - 0.815 -
LAW 0.200 0.329 0.438 0.458 0.448 0.473 0.488 - 0.535 - - - 0.586 -
COMPOSITE AVERAGE
AVG 0.363 0.539 0.664 0.661 0.663 0.693 0.699 0.767 0.732 0.768 0.764 0.767 0.731 0.759

CODE MODELS:

MODEL codegemma-2b codegemma-1.1-7b-it codegemma-7b CodeLlama-7b-hf Codestral-22B-v0.1 Codestral-22B-Instruct-v0.1 CodeQwen1.5-7B-Chat CodeQwen1.5-7B CodeQwen1.5-7B Deepseek-Coder-V2-Lite-Instruct Qwen2.5-Coder-0.5B-32k-Instruct Qwen2.5-Coder-1.5B-Instruct Qwen2.5-3B-32k-Instruct Qwen2.5-Coder-3B-Instruct Qwen2.5-Coder-7B-Instruct Qwen2.5-Coder-7B Qwen2.5-Coder-7B Qwen2.5-Coder-14B-Instruct Qwen2.5-Coder-14B Qwen2.5-Coder-32B-Instruct
params 2.51B 8.54B 8.54B 6.74B 22B 22B 7.25B 7.25B 7.25B 14.77B 0.49403B 1.54B 3.09B 3.09B 7.62B 7.62B 7.62B 14.77B 14.77B 32.76B
quant Q6_K Q6_K Q6_K IQ4_XS IQ4_XS IQ4_XS Q6_K Q6_K Q8_0 IQ4_XS Q6_K Q6_K Q6_K Q6_K IQ4_XS IQ4_XS Q6_K IQ4_XS IQ4_XS IQ4_XS
engine llama.cpp version: 4255 llama.cpp version: 4150 llama.cpp version: 4255 llama.cpp version: 4191 llama.cpp version: 4132 llama.cpp version: 4191 llama.cpp version: 4094 llama.cpp version: 4132 llama.cpp version: 4191 llama.cpp version: 4488 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4094 llama.cpp version: 4295 llama.cpp version: 4132 llama.cpp version: 4120 llama.cpp version: 4150 llama.cpp version: 4150
TEST acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc
HUMANEVAL 0.292 0.591 0.451 0.280 0.664 0.810 0.859 0.518 0.567 0.847 0.518 0.676 0.780 0.835 0.829 0.640 0.713 0.878 0.676 0.884
HUMANEVALP 0.201 0.475 0.335 0.182 0.554 0.682 0.701 0.414 0.445 - 0.432 0.567 0.682 0.719 0.707 0.530 0.579 0.756 0.536 0.756
HUMANEVALFIM 0.268 - 0.463 0.292 0.719 0.719 0.731 0.518 0.475 0.621 0.518 0.524 - 0.634 0.493 0.713 0.756 0.829 0.518 0.890
MBPP 0.447 0.552 0.521 0.404 0.630 0.653 0.712 0.536 0.525 - 0.408 0.560 0.599 0.618 0.735 0.614 0.571 0.727 0.661 0.715
MBPPP 0.415 0.517 0.455 0.375 0.558 0.593 0.665 0.486 0.486 - 0.352 0.504 0.584 0.589 0.687 0.540 0.513 0.665 0.558 0.669
HUMANEVALX_cpp 0.170 0.359 0.384 0.256 0.640 0.621 0.676 0.463 0.475 - 0.286 0.426 0.237 0.567 0.676 0.548 0.475 0.506 0.573 0.689
HUMANEVALX_java 0.341 0.469 0.493 0.371 0.756 0.670 0.774 0.591 0.609 - 0.512 0.609 0.615 0.743 0.798 0.725 0.652 0.201 0.762 0.841
HUMANEVALX_js 0.347 0.560 0.493 0.347 0.658 0.621 0.768 0.567 0.585 - 0.493 0.615 0.682 0.670 0.798 0.628 0.658 0.817 0.695 0.835
HUMANEVALX 0.286 0.463 0.457 0.325 0.684 0.638 0.739 0.540 0.556 - 0.430 0.550 0.512 0.660 0.758 0.634 0.595 0.508 0.676 0.788
CRUXEVAL_input 0.057 0.406 0.206 0.156 0.438 0.351 0.456 0.192 0.203 - 0.435 0.416 0.347 0.481 0.578 0.255 0.267 0.677 0.281 0.676
CRUXEVAL_output 0.253 0.368 0.306 0.281 0.465 0.447 0.363 0.363 0.363 - 0.278 0.332 0.311 0.413 0.507 0.381 0.435 0.577 0.422 0.610
CRUXEVAL 0.155 0.387 0.256 0.218 0.451 0.399 0.410 0.278 0.283 - 0.356 0.374 0.329 0.447 0.543 0.318 0.351 0.627 0.351 0.643
CRUXEVALFIM_input 0.325 - 0.378 0.301 0.295 0.351 0.237 0.206 0.210 - 0.017 0.155 - 0.208 0.322 0.296 0.313 0.421 0.346 0.515
CRUXEVALFIM_output 0.153 - 0.332 0.122 0.441 0.355 0.212 0.280 0.266 - 0.098 0.222 - 0.323 0.481 0.352 0.365 0.546 0.481 0.557
CRUXEVALFIM 0.239 - 0.355 0.211 0.368 0.353 0.225 0.243 0.238 - 0.058 0.188 - 0.266 0.401 0.324 0.339 0.483 0.413 0.536
CODE 0.237 0.441 0.352 0.248 0.483 0.467 0.447 0.339 0.342 0.734 0.278 0.368 0.449 0.453 0.548 0.413 0.427 0.593 0.458 0.648

MATH MODELS:

MODEL Deepseek-R1-Distill-Qwen-1.5B Deepseek-R1-Distill-Qwen-7B Deepseek-R1-Distill-Qwen-14B Deepseek-R1-Distill-Qwen-32B Mathstral-7B-v0.1 openchat-3.5-0106-math Qwen2-Math-7B-Instruct Qwen2.5-7B-32k-Instruct Qwen2.5-Math-1.5B-Instruct Qwen2.5-Math-7B-Instruct Qwen3-32B QwQ-32B
params 1.78B 7.62B 14.77B 32.76B 7.25B 7.24B 7.62B 7.62B 1.54B 7.62B 32.8B 32.76B
quant Q8_0 IQ4_XS IQ4_XS IQ4_XS Q6_K Q6_K Q6_K Q6_K IQ4_XS Q6_K Q4_K_H IQ4_XS
engine llama.cpp version: 4763 llama.cpp version: 4644 llama.cpp version: 4657 llama.cpp version: 4559 llama.cpp version: 4406 llama.cpp version: 4394 llama.cpp version: 4416 llama.cpp version: 4402 llama.cpp version: 4406 llama.cpp version: 4394 llama.cpp version: 5633 llama.cpp version: 4820
TEST acc acc acc acc acc acc acc acc acc acc acc acc
MATH1_algebra 0.918 0.962 0.925 0.962 0.911 0.807 0.985 0.888 0.859 0.955 0.992 0.992
MATH1_counting_and_probability 0.794 0.948 0.923 0.948 0.820 0.615 0.948 0.923 0.897 0.974 1.000 0.974
MATH1_geometry 0.710 0.736 0.868 0.921 0.631 0.421 0.789 0.631 0.710 0.842 0.842 0.921
MATH1_intermediate_algebra 0.730 0.903 0.865 0.961 0.673 0.519 0.865 0.750 0.730 0.711 0.923 1.000
MATH1_number_theory 0.866 0.800 0.700 0.933 0.766 0.366 0.933 0.633 0.766 1.000 0.666 0.800
MATH1_prealgebra 0.883 0.965 0.883 0.953 0.790 0.639 0.906 0.837 0.837 0.883 0.930 0.953
MATH1_precalculus 0.596 0.859 0.842 1.000 0.666 0.456 0.894 0.859 0.631 0.789 0.947 0.982
MATH1 0.814 0.910 0.878 0.958 0.784 0.613 0.919 0.821 0.794 0.885 0.931 0.963
MATH2_algebra 0.825 0.930 0.900 0.995 0.850 0.592 0.940 0.880 0.910 0.860 0.970 0.975
MATH2_counting_and_probability 0.782 0.851 0.841 0.950 0.613 0.425 0.831 0.792 0.683 0.861 0.970 0.990
MATH2_geometry 0.743 0.914 0.792 0.914 0.597 0.475 0.756 0.658 0.621 0.743 0.792 0.963
MATH2_intermediate_algebra 0.664 0.875 0.835 0.968 0.406 0.281 0.835 0.687 0.671 0.710 0.953 0.960
MATH2_number_theory 0.782 0.826 0.891 0.934 0.641 0.413 0.869 0.717 0.695 0.880 0.891 0.945
MATH2_prealgebra 0.887 0.909 0.875 0.971 0.768 0.621 0.909 0.830 0.836 0.881 0.932 0.971
MATH2_precalculus 0.663 0.902 0.805 0.955 0.442 0.964 0.814 0.769 0.557 0.725 0.858 0.964
MATH2 0.777 0.893 0.856 0.963 0.647 0.552 0.866 0.781 0.742 0.817 0.921 0.968
MATH3_algebra 0.854 0.934 0.911 0.992 0.796 0.892 0.946 0.850 0.881 0.850 0.969 0.996
MATH3_counting_and_probability 0.730 0.770 0.830 0.930 0.640 0.820 0.760 0.760 0.710 0.880 0.950 1.000
MATH3_geometry 0.627 0.911 0.794 0.901 0.549 0.882 0.745 0.686 0.696 0.764 0.833 0.970
MATH3_intermediate_algebra 0.635 0.902 0.882 0.964 0.384 0.866 0.697 0.600 0.574 0.738 0.933 0.969
MATH3_number_theory 0.696 0.754 0.770 0.926 0.475 0.713 0.803 0.713 0.655 0.819 0.811 0.942
MATH3_prealgebra 0.763 0.883 0.892 0.946 0.718 0.508 0.915 0.812 0.816 0.883 0.946 0.982
MATH3_precalculus 0.582 0.874 0.818 0.968 0.299 0.181 0.748 0.582 0.480 0.685 0.858 0.897
MATH3 0.719 0.876 0.859 0.954 0.583 0.705 0.824 0.732 0.714 0.810 0.915 0.969
MATH4_algebra 0.805 0.897 0.922 0.957 0.657 0.325 0.893 0.830 0.851 0.865 0.968 0.992
MATH4_counting_and_probability 0.639 0.738 0.711 0.945 0.486 0.270 0.657 0.594 0.558 0.783 0.945 0.981
MATH4_geometry 0.576 0.776 0.768 0.832 0.288 0.304 0.592 0.480 0.432 0.616 0.712 0.872
MATH4_intermediate_algebra 0.588 0.858 0.850 0.935 0.278 0.120 0.596 0.520 0.512 0.649 0.911 0.947
MATH4_number_theory 0.697 0.809 0.725 0.894 0.387 0.232 0.746 0.640 0.619 0.823 0.823 0.943
MATH4_prealgebra 0.785 0.874 0.827 0.921 0.623 0.434 0.858 0.691 0.748 0.801 0.879 0.926
MATH4_precalculus 0.570 0.868 0.728 0.947 0.263 0.105 0.473 0.421 0.333 0.578 0.859 0.973
MATH4 0.684 0.845 0.816 0.925 0.452 0.261 0.718 0.626 0.620 0.746 0.887 0.952
MATH5_algebra 0.752 0.899 0.853 0.970 0.387 0.273 0.781 0.732 0.674 0.762 0.964 0.964
MATH5_counting_and_probability 0.569 0.756 0.699 0.910 0.390 0.308 0.560 0.471 0.495 0.642 0.910 0.934
MATH5_geometry 0.545 0.810 0.727 0.840 0.136 0.204 0.416 0.386 0.348 0.507 0.734 0.833
MATH5_intermediate_algebra 0.453 0.821 0.778 0.810 0.103 0.128 0.242 0.250 0.253 0.389 0.807 0.860
MATH5_number_theory 0.707 0.727 0.792 0.935 0.214 0.123 0.590 0.694 0.525 0.753 0.870 0.941
MATH5_prealgebra 0.720 0.808 0.782 0.875 0.481 0.341 0.637 0.601 0.580 0.797 0.911 0.927
MATH5_precalculus 0.437 0.851 0.792 0.814 0.118 0.125 0.303 0.348 0.259 0.429 0.777 0.851
MATH5 0.609 0.822 0.787 0.884 0.268 0.216 0.518 0.509 0.462 0.617 0.865 0.907
MATHCOT 0.700 0.860 0.831 0.930 0.497 0.433 0.733 0.664 0.637 0.751 0.897 0.948
COMPOSITE AVERAGE
AVG 0.700 0.860 0.831 0.930 0.497 0.433 0.733 0.664 0.637 0.751 0.897 0.948

VISION MODELS:

MODEL gemma-3-4b-it gemma-3-4b-it Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.2-24B-Instruct-2506 Qwen2.5-VL-7B-Instruct Qwen2.5-Omni-7B
params 3.88B 3.88B 23.57B 23.57B 7.62B 7.62B
quant Q6_K Q6_K_H Q4_K_H Q4_K_H Q6_K_H Q6_K_H
engine llama.cpp version: 5706 llama.cpp version: 5819 llama.cpp version: 5662 llama.cpp version: 5780 llama.cpp version: 5745 llama.cpp version: 5752
TEST acc acc acc acc acc acc
CHARTQA 0.464 0.456 0.743 0.716 0.651 0.554
DOCVQA 0.567 0.563 0.892 0.866 0.735 0.744
MMMU_Accounting 0.366 0.400 0.466 0.733 0.533 0.466
MMMU_Agriculture 0.400 0.400 0.500 0.533 0.433 0.333
MMMU_Architecture_and_Engineering 0.200 0.166 0.400 0.400 0.400 0.333
MMMU_Art_Theory 0.533 0.666 0.866 0.700 0.633 0.700
MMMU_Art 0.566 0.566 0.633 0.666 0.500 0.600
MMMU_Basic_Medical_Science 0.333 0.533 0.733 0.600 0.500 0.566
MMMU_Biology 0.300 0.166 0.400 0.433 0.500 0.333
MMMU_Chemistry 0.033 0.266 0.366 0.366 0.166 0.300
MMMU_Clinical_Medicine 0.066 0.466 0.633 0.733 0.566 0.533
MMMU_Computer_Science 0.400 0.466 0.400 0.433 0.433 0.566
MMMU_Design 0.633 0.766 0.666 0.800 0.566 0.800
MMMU_Diagnostics_and_Laboratory_Medicine 0.100 0.200 0.400 0.433 0.300 0.400
MMMU_Economics 0.466 0.533 0.666 0.766 0.600 0.600
MMMU_Electronics 0.066 0.133 0.400 - 0.300 0.366
MMMU_Energy_and_Power 0.333 0.233 0.500 - 0.233 0.400
MMMU_Finance 0.333 0.333 0.500 - 0.400 0.366
MMMU_Geography 0.200 0.266 0.533 - 0.466 0.400
MMMU_History 0.566 0.633 0.500 - 0.533 0.633
MMMU_Literature 0.666 0.766 0.866 - 0.766 0.833
MMMU_Manage 0.233 0.333 0.466 - 0.466 0.500
MMMU_Marketing 0.333 0.400 0.633 - 0.566 0.500
MMMU_Materials 0.133 0.233 0.300 - 0.333 0.266
MMMU_Math 0.300 0.400 0.466 - 0.566 0.466
MMMU_Mechanical_Engineering 0.166 0.166 0.433 - 0.366 0.300
MMMU_Music 0.166 0.333 0.400 - 0.400 0.200
MMMU_Pharmacy 0.333 0.366 0.566 - 0.566 0.366
MMMU_Physics 0.166 0.300 0.433 - 0.333 0.500
MMMU_Psychology 0.366 0.433 0.633 - 0.366 0.433
MMMU_Public_Health 0.433 0.700 0.766 - 0.733 0.666
MMMU_Sociology 0.366 0.600 0.533 - 0.400 0.466
MMMU 0.318 0.407 0.535 - 0.464 0.473
MMMUPRO_Accounting 0.224 0.310 0.551 - 0.396 0.362
MMMUPRO_Agriculture 0.200 0.200 0.283 - 0.233 0.150
MMMUPRO_Architecture_and_Engineering 0.100 0.133 0.316 - 0.266 0.250
MMMUPRO_Art_Theory 0.472 0.490 0.618 - 0.581 0.563
MMMUPRO_Art 0.396 0.452 0.471 - 0.415 0.547
MMMUPRO_Basic_Medical_Science 0.269 0.250 0.384 - 0.423 0.307
MMMUPRO_Biology 0.169 0.237 0.355 - 0.305 0.237
MMMUPRO_Chemistry 0.200 0.266 0.383 - 0.316 0.216
MMMUPRO_Clinical_Medicine 0.118 0.135 0.474 - 0.203 0.271
MMMUPRO_Computer_Science 0.283 0.350 0.333 - 0.366 0.300
MMMUPRO_Design 0.433 0.500 0.533 - 0.550 0.616
MMMUPRO_Diagnostics_and_Laboratory_Medicine 0.116 0.200 0.200 - 0.183 0.216
MMMUPRO_Economics 0.423 0.457 0.661 - 0.491 0.389
MMMUPRO_Electronics 0.233 0.316 0.600 - 0.433 0.400
MMMUPRO_Energy_and_Power 0.172 0.172 0.224 - 0.172 0.155
MMMUPRO_Finance 0.283 0.366 0.600 - 0.383 0.350
MMMUPRO_Geography 0.346 0.307 0.384 - 0.307 0.269
MMMUPRO_History 0.375 0.392 0.428 - 0.392 0.464
MMMUPRO_Literature 0.500 0.461 0.615 - 0.557 0.557
MMMUPRO_Manage 0.220 0.240 0.420 - 0.260 0.320
MMMUPRO_Marketing 0.288 0.305 0.525 - 0.508 0.338
MMMUPRO_Materials 0.083 0.133 0.166 - 0.166 0.166
MMMUPRO_Math 0.283 0.233 0.416 - 0.250 0.283
MMMUPRO_Mechanical_Engineering 0.152 0.186 0.440 - 0.338 0.271
MMMUPRO_Music 0.216 0.250 0.183 - 0.233 0.300
MMMUPRO_Pharmacy 0.298 0.298 0.508 - 0.333 0.385
MMMUPRO_Physics 0.166 0.116 0.466 - 0.266 0.300
MMMUPRO_Psychology 0.366 0.333 0.350 - 0.200 0.400
MMMUPRO_Public_Health 0.241 0.293 0.482 - 0.448 0.327
MMMUPRO_Sociology 0.333 0.462 0.574 - 0.314 0.407
MMMUPRO 0.263 0.293 0.430 - 0.341 0.335
DISCIPLINES
NLP - - - - - -
MATH 0.305 0.338 0.394 - 0.372 0.366
SCIENCE 0.178 0.218 0.354 - 0.289 0.258
ENGINEERING 0.239 0.272 0.442 - 0.360 0.373
MEDICINE 0.222 0.309 0.481 - 0.389 0.371
HUMANITIES 0.392 0.441 0.508 - 0.419 0.470
BUSINESS 0.309 0.360 0.552 - 0.447 0.399
LAW - - - - - -
VISION 0.498 0.500 0.790 - 0.661 0.640
COMPOSITE AVERAGE
AVG 0.471 0.479 0.749 - 0.627 0.608

AUDIO MODELS:

MODEL Qwen2.5-Omni-7B ultravox-v0_5-llama-3_2-1b ultravox-v0_5-llama-3_1-8b
params 7.62B 1.24B 8.03B
quant Q6_K_H Q8_0_H Q6_K_H
engine llama.cpp version: 5780 llama.cpp version: 5819 llama.cpp version: 5780
TEST acc acc acc
BBA_formal_fallacies 0.488 0.468 0.460
BBA_navigate 0.628 0.420 0.436
BBA_object_counting 0.616 0.576 0.856
BBA_web_of_lies 0.508 0.488 0.496
BBA 0.560 0.488 0.562
DISCIPLINES
NLP 0.498 0.478 0.478
MATH 0.622 0.498 0.646
SCIENCE - - -
ENGINEERING - - -
MEDICINE - - -
HUMANITIES - - -
BUSINESS - - -
LAW - - -
AUDIO - - -
COMPOSITE AVERAGE
AVG 0.560 0.488 0.562
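
The tables above follow a simple whitespace-separated layout: one test name per row followed by one score per model column, with "-" marking a test that was not run. A minimal sketch of parsing that layout into a structured form is below; this parser is not part of the benchmark tooling, just an assumed helper for working with the published results.

```python
# Minimal sketch (an assumption, not part of the benchmark tooling):
# parse one whitespace-separated results table, like the AUDIO MODELS
# section above, into {test_name: [scores...]}, treating "-" as missing.

def parse_results(lines):
    """Parse benchmark rows of the form 'NAME v1 v2 ...' into a dict.

    Scores become floats; '-' (test not run) becomes None. Rows whose
    values are not all numeric or '-' (MODEL, quant, engine, DISCIPLINES,
    COMPOSITE AVERAGE, blank lines) are skipped.
    """
    results = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank lines and single-word label rows
        name, values = parts[0], parts[1:]
        parsed = []
        numeric = True
        for v in values:
            if v == "-":
                parsed.append(None)  # test not run for this model
            else:
                try:
                    parsed.append(float(v))
                except ValueError:
                    numeric = False  # header row such as 'quant Q6_K_H ...'
                    break
        if numeric:
            results[name] = parsed
    return results

# Usage example on a fragment of the AUDIO MODELS table:
table = """BBA_navigate 0.628 0.420 0.436
DISCIPLINES
NLP 0.498 0.478 0.478
SCIENCE - - -"""

scores = parse_results(table.splitlines())
print(scores["BBA_navigate"])  # [0.628, 0.42, 0.436]
print(scores["SCIENCE"])       # [None, None, None]
```

The same function works unchanged on the GENERAL, VISION, and AUDIO tables, since they all share the one-row-per-test layout.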