To see results on GitHub go to https://github.com/steampunque/benchlm

To see results on Hugging Face go to https://huggingface.co/spaces/steampunque/benchlm

Independent LLM benchmarks for a wide range of open weight models using custom prompts,
including category and discipline summaries.  The model list is actively updated with the
latest model releases.  Results for older, obsoleted models are not kept.

The primary model families being tracked as of 6/2/25 are:
Meta (Llama), Mistral, Google (Gemma), Qwen (Qwen2.5, Qwen2.5 coder, Qwen3, QwQ)

Secondary model families being tracked as of 6/2/25 are:
Microsoft (Phi), Deepseek (Qwen R1 distills), and the Falcon, InternLM, GLM, and Ultravox families.

Models are selected for tracking based on high popularity, high performance, or other
innovation, combined with non-restrictive open source license terms.

Tests are run using a modified llama.cpp server (supporting logprob completion mode).
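
A minimal sketch of how a harness might drive such a server is shown below.  The /completion
endpoint and its "n_probs" field are part of the stock llama.cpp server API; the server address,
sampling settings, and the complete() helper are illustrative assumptions rather than the
benchmark's actual client code, and the custom logprob completion mode mentioned above is a
local modification that is not shown here.

# Minimal sketch (assumption, not the benchmark harness): driving a local
# llama.cpp server over its /completion endpoint.  "n_probs" asks the stock
# server to return top-N token probabilities alongside the generated text.
import requests

SERVER = "http://127.0.0.1:8080"   # assumed local server address

def complete(prompt: str, n_predict: int = 256, n_probs: int = 0) -> dict:
    """Request a completion; n_probs > 0 also returns per-token probability data."""
    payload = {
        "prompt": prompt,
        "n_predict": n_predict,   # maximum tokens to generate
        "temperature": 0.0,       # deterministic decoding for scoring
        "n_probs": n_probs,       # top-N probabilities per generated token
    }
    r = requests.post(f"{SERVER}/completion", json=payload, timeout=600)
    r.raise_for_status()
    return r.json()               # generated text is in the "content" field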

MODEL CATEGORIES:
   GENERAL : general purpose, text in, text out
   THINK   : RL-tuned reasoning models with a <think> </think> block or equivalent
   CODE    : coding optimized models
   MATH    : math optimized models applied to Hendryks MATH500 set
   VISION  : image + text in, text out
   AUDIO   : audio + text in, text out

METHODOLOGY:
   -All CoT, code, and math tests are zero-shot.  A few BBH tests use few-shot examples.
   -Math CoT tests such as GSM8K, APPLE, MATH, etc. are self-graded against the correct answer using the LLM under test
      (see the self-grading sketch after this list).  If self-grading does not work reliably (such as with a very small model),
      the result is zeroed to mark the test invalid.
   -All non-CoT MC tests make two queries: one with the answers in test order and a second with the answers circularly shifted by one.
      To score a correct answer in MC, both queries must be answered correctly (see the scoring sketch after this list).
   -All non-CoT tests (non-CoT MC, etc.) disable the think block prefix for thinking models.
    The think block is also disabled for all CODE tests with thinking models.
   -CoT MC tests (e.g. MMLUPRO, GPQA) make one query only.
   -Winogrande uses logprob completion: it evaluates the probability of a common completion for the two possible cases (see the Winogrande sketch after this list).
   -The new SQA test is not run on all models.  When it is run, the result is reported only under the SQA test.
      The SQA result is not added to the knowledge and composite averages.  The test is not meant for
      small models, and results are given for information only.
   -MMMU uses the validation split (30 questions from each of 30 categories) and CoT prompting.
   -MMMUPRO uses the 10-option split and CoT prompting.
   -All new tests are run with a maximum of 250 questions per test category on CoT tests.
      This is necessary to contain test time with new thinking models, which can generate very lengthy responses.
      The result is printed in italics if there were more than 10 skipped questions in a test category.
      Note some very old runs had skips due to JSON errors in questions, but these do not significantly impact averages.
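
Self-grading sketch.  The snippet below is a hypothetical illustration of the self-grading step for
math CoT tests: the same model that produced the answer is asked whether its final answer matches
the reference.  The prompt wording and the ask() helper (any function that sends a prompt to the
model under test and returns its reply) are assumptions, not the benchmark's actual grading prompt.

# Hypothetical self-grading sketch for math CoT tests: the LLM under test
# judges whether its own final answer agrees with the reference answer.
def self_grade(model_answer: str, reference: str, ask) -> bool:
    prompt = (
        "Reference answer: " + reference + "\n"
        "Candidate answer: " + model_answer + "\n"
        "Do these two answers express the same result?  Reply YES or NO.\n"
    )
    reply = ask(prompt).strip().upper()
    return reply.startswith("YES")   # unreliable graders are zeroed upstream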
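
Scoring sketch for non-CoT MC tests.  This is an illustrative implementation of the two-query
protocol described above: the item is asked once with the choices in test order and once with the
choices circularly shifted by one, and it only scores if both picks map back to the correct choice.
ask_mc() is a hypothetical helper that returns the index of the option the model selected.

# Hypothetical scoring sketch for non-CoT multiple-choice tests: two queries,
# the second with the choices rotated by one position; credit requires both
# answers to be correct.
def score_mc(question: str, choices: list, correct_idx: int, ask_mc) -> int:
    pick1 = ask_mc(question, choices)                   # original order
    rotated = choices[1:] + choices[:1]                 # circular shift by one
    pick2 = ask_mc(question, rotated)
    correct_rotated = (correct_idx - 1) % len(choices)  # where the answer moved
    return int(pick1 == correct_idx and pick2 == correct_rotated)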
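
Winogrande sketch.  A hypothetical illustration of the logprob-completion scoring described above:
each candidate is substituted into the blank, and the two resulting prefixes are compared by the log
probability they assign to the shared text that follows the blank.  logprob_of() is an assumed
helper that would use the server's logprob completion mode to sum the token log probabilities of a
continuation given a prefix.

# Hypothetical Winogrande sketch: pick the option whose filled-in prefix gives
# the common completion (the text after the blank) the higher log probability.
def winogrande_pick(sentence: str, option1: str, option2: str, logprob_of) -> int:
    prefix, common = sentence.split("_", 1)   # "_" marks the blank in Winogrande
    lp1 = logprob_of(prefix + option1, common)
    lp2 = logprob_of(prefix + option2, common)
    return 1 if lp1 >= lp2 else 2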

TESTS:
   KNOWLEDGE:
      TQA - Truthful QA
      SQA - Simple QA, 4333-question arcane knowledge quiz
      JEOPARDY - 100-question JEOPARDY quiz
   LANGUAGE:
      LAMBADA - Language Modeling Broadened to Account for Discourse Aspects
   UNDERSTANDING:
      WG - Winogrande
      BOOLQ - Boolean questions
      STORYCLOZE - Story questions
      OBQA - Open Book Question / Answer
      SIQA - Social IQ
      RACE - Reading comprehension dataset from examinations
      MMLU - massive multitask language understanding
      MEDQA - medical QA
   REASONING
      CSQA - Common Sense Question Answer
      COPA - Choice of Plausible Alternatives
      HELLASWAG - Harder Endings, Longer contexts, and Low-shot Activities
                  for Situations With Adversarial Generations
      PIQA - Physical Interaction: Question Answering
      ARC - AI2 Reasoning Challenge
      AGIEVAL - AGIEval logiqa, lsat, sat
      AGIEVALC  - Gaokao SAT, logiqa, jec (Chinese)
      MUSR - Multistep Soft Reasoning
   COT:
      GSM8K - Grade School Math CoT
      BBH  - BIG-Bench Hard (Beyond the Imitation Game Benchmark Hard) CoT
      GPQA - Google-Proof QA science CoT
      MMLUPRO - massive multitask language understanding pro CoT
      AGIEVAL - satmath, aquarat
      AGIEVALC  - mathcloze, mathqa (Chinese)
      MUSR - Multistep Soft Reasoning
      APPLE - 100 custom Apple Questions
   MATH:
      MATH1..MATH5 - MATH Datasets level 1 through 5 (Hendrycks et al.)
   CODE:
      HUMANEVAL - Python
      HUMANEVALP - Python, extended test
      HUMANEVALX - Python, Java, Javascript, C++
      MBPP - Python
      MBPPP - Python, extended test
      CRUXEVAL - Python
      {TEST}FIM - fill-in-the-middle (FIM) variant of {TEST}, e.g. HUMANEVAL -> HUMANEVALFIM
   VISION:
      CHARTQA - Chart Question/Answer
      DOCVQA  - Document Vision QA
      MMMU - Massive Multi-discipline Multimodal Understanding (CoT)
      MMMUPRO - Massive Multi-discipline Multimodal Understanding Pro (CoT)
   AUDIO:
      BBA - Big Bench Audio

GENERAL MODELS:

MODEL Falcon3-1B-Instruct Falcon3-7B-Instruct Falcon3-10B-Instruct gemma-2-9b-it gemma-2-27b-it gemma-3-1b-it gemma-3-4b-it gemma-3-12b-it gemma-3-12b-it gemma-3-27b-it glm-4-9b-chat glm-4-9b-chat internlm3-8b-instruct Llama-3.1-8B-Instruct Llama-3.2-3B-Instruct Llama-4-Scout-17B-16E-Instruct Llama-4-Scout-17B-16E-Instruct Llama-4-Scout-17B-16E-Instruct Mistral-7B-Instruct-v0.3 Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.2-24B-Instruct-2506 Phi-3.5-mini-8k-instruct Phi-3.5-mini-128k-instruct Phi-4-mini-instruct Phi-4 Qwen2.5-3B-32k-Instruct Qwen2.5-3B-32k-Instruct Qwen2.5-7B-32k-Instruct Qwen2.5-7B-32k-Instruct Qwen2.5-14B-32k-Instruct Qwen2.5-32B-Instruct Qwen3-4B-Instruct-2507
params 1.67B 7.46B 10.31B 9.24B 27.23B 0.99989B 3.88B 11.77B 11.77B 27.01B 9.40B 9.40B 8.80B 8.03B 3.21B 107.77B 107.77B 107.77B 7.25B 23.57B 23.57B 23.57B 23.57B 3.82B 3.82B 3.84B 14.66B 3.09B 3.09B 7.62B 7.62B 14.77B 32.76B 4.02B
quant IQ4_XS Q6_K IQ4_XS Q6_K IQ4_XS Q8_0 Q6_K IQ4_XS Q4_K_H Q4_K_H IQ4_XS Q6_K IQ4_XS Q6_K Q6_K Q2_K_H Q3_K_H Q4_K_H Q8_0 Q2_K_H Q3_K_H Q4_K_H Q4_K_H Q6_K Q6_K Q6_K IQ4_XS IQ4_XS Q6_K IQ4_XS Q6_K IQ4_XS IQ4_XS Q6_K_H
engine llama.cpp version: 4341 llama.cpp version: 4341 llama.cpp version: 4341 llama.cpp version: 3266 llama.cpp version: 3389 llama.cpp version: 4877 llama.cpp version: 4888 llama.cpp version: 4938 llama.cpp version: 5572 llama.cpp version: 5586 llama.cpp version: 3496 llama.cpp version: 3334 llama.cpp version: 4488 llama.cpp version: 3428 llama.cpp version: 3825 llama.cpp version: 5236 llama.cpp version: 5279 llama.cpp version: 5335 llama.cpp version: 3262 llama.cpp version: 5509 llama.cpp version: 5509 llama.cpp version: 5509 llama.cpp version: 5742 llama.cpp version: 3609 llama.cpp version: 3600 llama.cpp version: 4792 llama.cpp version: 4295 llama.cpp version: 4038 llama.cpp version: 4038 llama.cpp version: 3943 llama.cpp version: 3870 llama.cpp version: 3821 llama.cpp version: 3821 llama.cpp version: 6628
TEST acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc
WG 0.600 0.670 0.700 0.762 0.772 0.576 0.692 0.743 0.741 0.748 0.759 0.753 0.708 0.741 0.685 - - - 0.751 0.775 0.772 0.784 0.780 0.744 0.734 0.707 0.708 0.687 0.695 0.709 0.709 0.754 0.746 0.677
LAMBADA 0.524 0.688 0.692 0.735 0.755 0.504 0.635 0.724 0.721 0.742 0.786 0.783 0.662 0.747 0.705 - - - 0.766 0.786 0.789 0.798 0.792 0.677 0.613 0.653 0.750 0.685 0.682 0.722 0.724 0.769 0.781 0.670
HELLASWAG 0.308 0.684 0.716 0.775 0.810 0.307 0.527 0.779 0.767 0.802 0.834 0.840 0.846 0.696 0.559 - - - 0.591 0.866 0.877 0.899 0.872 0.716 0.669 0.542 0.801 0.670 0.713 0.820 0.822 0.863 0.894 0.775
BOOLQ 0.364 0.591 0.621 0.687 0.739 0.521 0.603 0.669 - 0.701 0.633 0.625 0.562 0.610 0.478 - - - 0.658 - - 0.646 0.684 0.562 0.573 0.453 0.653 0.517 0.533 0.617 0.623 0.647 0.701 0.606
STORYCLOZE 0.774 0.949 0.947 0.958 0.973 0.685 0.900 0.948 - 0.964 0.967 0.976 0.982 0.895 0.870 - - - 0.917 - - 0.968 0.969 0.531 0.921 0.889 0.754 0.913 0.896 0.920 0.915 0.938 0.981 0.928
CSQA 0.488 0.725 0.746 0.751 0.763 0.339 0.614 0.716 - 0.741 0.727 0.733 0.730 0.686 0.642 - - - 0.627 - - 0.756 0.751 0.669 0.660 0.633 0.740 0.701 0.717 0.768 0.781 0.795 0.823 0.737
OBQA 0.380 0.761 0.745 0.846 0.860 0.334 0.648 0.807 - 0.855 0.821 0.802 0.801 0.765 0.709 - - - 0.676 - - 0.866 0.880 0.751 0.720 0.719 0.857 0.700 0.731 0.802 0.804 0.863 0.904 0.804
COPA 0.612 0.870 0.903 0.925 0.949 0.415 0.785 0.932 - 0.944 0.955 0.944 0.927 0.889 0.749 - - - 0.812 - - 0.924 0.932 0.884 0.870 0.834 0.934 0.841 0.858 0.925 0.919 0.935 0.958 0.887
PIQA 0.233 0.696 0.732 0.801 0.841 0.386 0.653 0.784 - 0.818 0.773 0.779 0.777 0.725 0.637 - - - 0.708 - - 0.826 0.831 0.733 0.677 0.674 0.832 0.695 0.713 0.794 0.807 0.848 0.870 0.761
SIQA 0.425 0.658 0.688 0.693 0.731 0.385 0.588 0.699 - 0.716 0.664 0.665 0.706 0.648 0.622 - - - 0.620 - - 0.737 0.710 0.667 0.661 0.645 0.639 0.656 0.663 0.721 0.712 0.746 0.742 0.692
MEDQA 0.141 0.420 0.430 0.501 0.549 0.073 0.292 0.503 - 0.553 0.436 0.445 0.457 0.500 0.413 - - - 0.334 - - 0.593 0.597 0.423 0.395 0.361 0.560 0.344 0.363 0.453 0.458 0.542 0.610 0.494
SQA - 0.033 - - 0.117 - 0.052 0.092 - 0.092 - - 0.039 0.073 - - - - - - - 0.066 0.073 - - 0.039 - - - - - - - 0.058
JEOPARDY 0.010 0.400 0.310 0.580 0.760 - 0.350 0.550 0.560 0.830 0.370 0.420 0.210 0.510 0.350 0.680 0.580 0.540 0.490 0.680 - 0.740 0.640 0.320 0.250 0.280 0.390 0.120 0.120 0.300 0.290 0.540 0.600 0.310
GSM8K 0.485 0.890 0.918 0.890 0.899 - 0.843 0.928 0.928 0.964 0.855 0.839 0.890 0.872 0.822 - - - 0.611 - - 0.940 0.968 0.855 0.714 0.868 0.946 0.829 0.856 0.909 0.880 0.938 0.950 0.960
APPLE 0.150 0.810 0.740 0.750 0.730 - 0.630 0.740 0.770 0.850 0.630 0.610 0.670 0.690 0.610 0.840 0.860 0.860 0.390 0.830 0.780 0.820 0.890 0.560 0.560 0.640 0.910 0.640 0.560 0.740 0.750 0.830 0.860 0.850
HUMANEVAL 0.115 0.737 0.774 0.658 0.743 0.408 0.701 0.859 0.829 0.890 0.737 0.731 0.804 0.652 0.585 - - - 0.390 0.841 0.823 0.853 0.871 0.682 0.621 0.646 0.847 0.695 0.780 0.798 0.817 0.804 0.884 0.841
HUMANEVALP 0.073 0.628 0.664 0.548 0.615 0.317 0.597 0.713 - 0.719 0.615 0.634 0.713 0.536 0.475 - - - 0.329 - - 0.731 0.750 0.591 0.524 0.554 0.725 0.615 0.682 0.670 0.658 0.676 0.768 0.713
HUMANEVALFIM - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
MBPP 0.334 0.677 0.653 0.595 0.642 0.536 0.614 0.692 - 0.677 0.579 0.591 0.552 0.564 0.498 - - - 0.451 - - 0.618 0.642 0.610 0.498 0.501 0.673 0.595 0.599 0.669 0.661 0.669 0.684 0.665
MBPPP 0.312 0.629 0.611 0.584 0.638 0.531 0.598 0.642 - 0.625 0.562 0.575 0.477 0.540 0.482 - - - 0.397 - - 0.593 0.647 0.575 0.477 0.504 0.651 0.540 0.584 0.633 0.651 0.633 0.700 0.629
HUMANEVALX_cpp 0.054 0.506 0.603 0.512 0.579 0.158 0.585 0.756 - 0.780 0.439 0.432 0.402 0.457 0.323 - - - 0.225 - - 0.292 0.713 0.280 0.219 0.445 0.676 0.420 0.237 0.475 0.554 0.323 0.701 0.652
HUMANEVALX_java 0.042 0.640 0.719 0.640 0.768 0.317 0.658 0.804 - 0.810 0.207 0.628 0.597 0.487 0.439 - - - 0.256 - - 0.804 0.829 0.079 0.060 0.536 0.634 0.640 0.615 0.695 0.737 0.780 0.865 0.823
HUMANEVALX_js 0.115 0.676 0.652 0.579 0.743 0.359 0.664 0.835 - 0.841 0.628 0.628 0.670 0.560 0.067 - - - 0.402 - - 0.786 0.786 0.560 0.451 0.548 0.786 0.646 0.689 0.719 0.750 0.798 0.847 0.841
HUMANEVALX 0.071 0.607 0.658 0.577 0.697 0.278 0.636 0.798 - 0.810 0.424 0.563 0.556 0.502 0.276 - - - 0.294 - - 0.628 0.776 0.306 0.243 0.510 0.699 0.569 0.514 0.630 0.680 0.634 0.804 0.772
CRUXEVAL_input 0.210 0.411 0.448 0.462 0.485 0.038 0.388 0.440 - 0.528 0.416 0.406 0.477 0.435 0.353 - - - 0.276 - - 0.547 0.550 0.398 0.388 0.336 0.447 0.350 0.331 0.387 0.412 0.541 0.517 0.518
CRUXEVAL_output 0.152 0.355 0.410 0.375 0.482 0.196 0.348 0.457 - 0.491 0.356 0.338 0.372 0.360 0.291 - - - 0.303 - - 0.516 0.498 0.342 0.296 0.317 0.463 0.275 0.311 0.382 0.386 0.471 0.455 0.463
CRUXEVAL 0.181 0.383 0.429 0.418 0.483 0.117 0.368 0.448 - 0.510 0.386 0.372 0.425 0.397 0.322 - - - 0.290 - - 0.531 0.524 0.370 0.342 0.326 0.455 0.312 0.321 0.385 0.399 0.506 0.486 0.491
CRUXEVALFIM_input - 0.418 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CRUXEVALFIM_output - 0.356 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CRUXEVALFIM - 0.387 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
TQA_mc 0.146 0.523 0.510 0.701 0.767 0.115 0.468 0.663 - 0.696 0.636 0.640 0.637 0.564 0.555 - - - 0.549 - - 0.767 0.713 0.621 0.581 0.477 0.725 0.516 0.548 0.654 0.657 0.747 0.804 0.707
TQA_tf 0.381 0.410 0.431 0.692 0.725 0.390 0.491 0.677 - 0.634 0.484 0.457 0.593 0.512 0.566 - - - 0.548 - - 0.735 0.670 0.483 0.487 0.588 0.686 0.414 0.300 0.574 0.568 0.706 0.731 0.640
TQA 0.354 0.423 0.440 0.693 0.730 0.358 0.488 0.676 - 0.641 0.502 0.478 0.598 0.518 0.565 - - - 0.548 - - 0.739 0.675 0.499 0.498 0.575 0.691 0.426 0.329 0.583 0.578 0.711 0.740 0.648
ARC_challenge 0.374 0.809 0.819 0.882 0.897 0.319 0.699 0.869 - 0.899 0.835 0.853 0.871 0.776 0.706 - - - 0.688 - - 0.912 0.907 0.813 0.802 0.759 0.911 0.750 0.777 0.843 0.851 0.911 0.934 0.875
ARC_easy 0.598 0.925 0.933 0.952 0.963 0.563 0.875 0.955 - 0.971 0.933 0.940 0.945 0.906 0.843 - - - 0.843 - - 0.970 0.968 0.934 0.932 0.908 0.970 0.895 0.904 0.945 0.946 0.969 0.978 0.952
ARC 0.524 0.886 0.895 0.929 0.941 0.482 0.817 0.927 - 0.947 0.901 0.911 0.920 0.863 0.798 - - - 0.792 - - 0.951 0.948 0.894 0.889 0.859 0.950 0.847 0.862 0.911 0.915 0.950 0.963 0.927
RACE_high 0.431 0.698 0.730 0.802 0.833 0.338 0.633 0.795 - 0.829 0.788 0.787 0.830 0.679 0.589 - - - 0.607 - - 0.853 0.853 0.613 0.625 0.690 0.819 0.698 0.712 0.779 0.788 0.852 0.882 0.802
RACE_middle 0.463 0.777 0.793 0.849 0.883 0.398 0.713 0.858 - 0.883 0.816 0.825 0.866 0.734 0.680 - - - 0.696 - - 0.894 0.885 0.706 0.692 0.745 0.861 0.775 0.776 0.841 0.853 0.887 0.923 0.850
RACE 0.440 0.721 0.748 0.816 0.847 0.355 0.656 0.813 - 0.844 0.796 0.798 0.840 0.695 0.615 - - - 0.633 - - 0.865 0.862 0.640 0.645 0.706 0.831 0.720 0.730 0.797 0.807 0.862 0.894 0.816
MMLU
abstract_algebra 0.180 0.410 0.450 0.330 0.310 0.110 0.160 0.350 - 0.410 0.220 0.210 0.300 0.200 0.270 - - - 0.190 - - 0.370 0.430 0.300 0.210 0.210 0.410 0.240 0.250 0.440 0.430 0.570 0.600 0.510
anatomy 0.318 0.577 0.592 0.626 0.607 0.296 0.459 0.577 - 0.629 0.503 0.511 0.637 0.555 0.540 - - - 0.447 - - 0.718 0.725 0.570 0.585 0.503 0.703 0.525 0.562 0.622 0.622 0.644 0.733 0.585
astronomy 0.263 0.736 0.756 0.760 0.828 0.296 0.559 0.769 - 0.848 0.644 0.651 0.802 0.677 0.565 - - - 0.573 - - 0.888 0.888 0.703 0.703 0.611 0.776 0.618 0.657 0.763 0.769 0.868 0.875 0.789
business_ethics 0.260 0.570 0.560 0.620 0.670 0.270 0.480 0.630 - 0.710 0.570 0.610 0.670 0.550 0.480 - - - 0.520 - - 0.730 0.720 0.620 0.620 0.560 0.740 0.630 0.590 0.680 0.710 0.750 0.800 0.680
clinical_knowledge 0.373 0.652 0.683 0.743 0.788 0.316 0.554 0.716 - 0.784 0.618 0.622 0.716 0.675 0.592 - - - 0.581 - - 0.811 0.777 0.713 0.698 0.637 0.781 0.633 0.645 0.709 0.713 0.803 0.815 0.732
college_biology 0.340 0.763 0.777 0.854 0.895 0.256 0.618 0.826 - 0.847 0.687 0.715 0.777 0.722 0.625 - - - 0.625 - - 0.888 0.902 0.805 0.763 0.694 0.868 0.694 0.694 0.784 0.784 0.854 0.923 0.840
college_chemistry 0.180 0.470 0.430 0.470 0.430 0.180 0.260 0.400 - 0.470 0.380 0.380 0.440 0.400 0.310 - - - 0.350 - - 0.480 0.500 0.460 0.430 0.420 0.520 0.310 0.370 0.480 0.490 0.460 0.530 0.490
college_computer_science 0.110 0.540 0.590 0.460 0.580 0.200 0.360 0.500 - 0.560 0.470 0.480 0.640 0.400 0.350 - - - 0.320 - - 0.650 0.580 0.480 0.410 0.420 0.600 0.390 0.460 0.620 0.590 0.630 0.720 0.620
college_mathematics 0.090 0.320 0.320 0.260 0.300 0.080 0.170 0.300 - 0.400 0.240 0.280 0.280 0.260 0.210 - - - 0.180 - - 0.340 0.380 0.270 0.170 0.160 0.340 0.200 0.180 0.380 0.350 0.490 0.540 0.380
college_medicine 0.283 0.566 0.612 0.658 0.716 0.260 0.462 0.624 - 0.682 0.572 0.589 0.676 0.589 0.491 - - - 0.456 - - 0.722 0.728 0.612 0.566 0.531 0.728 0.560 0.606 0.606 0.624 0.710 0.739 0.647
college_physics 0.186 0.372 0.411 0.352 0.421 0.098 0.196 0.411 - 0.470 0.313 0.323 0.362 0.313 0.303 - - - 0.254 - - 0.529 0.539 0.333 0.294 0.284 0.529 0.382 0.392 0.401 0.372 0.519 0.656 0.529
computer_security 0.370 0.710 0.690 0.730 0.710 0.350 0.590 0.740 - 0.760 0.710 0.730 0.720 0.690 0.620 - - - 0.600 - - 0.710 0.720 0.700 0.650 0.700 0.730 0.650 0.690 0.720 0.710 0.730 0.800 0.710
conceptual_physics 0.234 0.680 0.680 0.638 0.727 0.174 0.404 0.634 - 0.748 0.561 0.587 0.646 0.463 0.361 - - - 0.365 - - 0.744 0.731 0.565 0.553 0.544 0.748 0.485 0.519 0.642 0.642 0.800 0.834 0.753
econometrics 0.122 0.649 0.587 0.557 0.587 0.140 0.315 0.535 - 0.570 0.456 0.464 0.578 0.482 0.359 - - - 0.318 - - 0.587 0.614 0.456 0.421 0.368 0.596 0.421 0.438 0.605 0.596 0.649 0.675 0.622
electrical_engineering 0.220 0.641 0.648 0.558 0.593 0.296 0.393 0.558 - 0.627 0.544 0.572 0.655 0.524 0.462 - - - 0.393 - - 0.703 0.662 0.496 0.475 0.572 0.634 0.441 0.434 0.606 0.606 0.648 0.703 0.634
elementary_mathematics 0.113 0.505 0.497 0.476 0.476 0.058 0.288 0.502 - 0.719 0.367 0.373 0.481 0.357 0.280 - - - 0.222 - - 0.653 0.621 0.423 0.388 0.335 0.544 0.407 0.417 0.560 0.568 0.791 0.838 0.616
formal_logic 0.182 0.444 0.484 0.293 0.468 0.142 0.269 0.452 - 0.507 0.325 0.357 0.420 0.420 0.253 - - - 0.277 - - 0.563 0.484 0.452 0.380 0.380 0.531 0.325 0.341 0.452 0.428 0.539 0.626 0.579
global_facts 0.120 0.190 0.290 0.330 0.370 0.070 0.110 0.300 - 0.420 0.200 0.240 0.330 0.150 0.110 - - - 0.160 - - 0.520 0.450 0.240 0.130 0.120 0.320 0.140 0.200 0.260 0.260 0.470 0.430 0.270
high_school_biology 0.348 0.764 0.774 0.851 0.890 0.358 0.645 0.816 - 0.854 0.800 0.809 0.825 0.729 0.677 - - - 0.654 - - 0.874 0.870 0.793 0.774 0.729 0.887 0.722 0.754 0.803 0.806 0.845 0.896 0.858
high_school_chemistry 0.216 0.522 0.507 0.586 0.600 0.167 0.359 0.517 - 0.610 0.546 0.517 0.527 0.467 0.433 - - - 0.310 - - 0.650 0.640 0.512 0.492 0.482 0.655 0.413 0.463 0.532 0.536 0.596 0.724 0.684
high_school_computer_science 0.250 0.740 0.740 0.710 0.770 0.270 0.570 0.750 - 0.830 0.660 0.660 0.760 0.610 0.540 - - - 0.490 - - 0.810 0.790 0.610 0.580 0.590 0.870 0.600 0.660 0.770 0.770 0.830 0.870 0.780
high_school_european_history 0.490 0.757 0.745 0.806 0.830 0.363 0.678 0.818 - 0.824 0.812 0.830 0.787 0.709 0.672 - - - 0.678 - - 0.830 0.830 0.727 0.672 0.678 0.812 0.733 0.733 0.787 0.800 0.824 0.818 0.787
high_school_geography 0.393 0.717 0.747 0.878 0.888 0.424 0.646 0.818 - 0.858 0.792 0.818 0.792 0.757 0.671 - - - 0.671 - - 0.853 0.853 0.792 0.737 0.722 0.888 0.712 0.732 0.833 0.833 0.868 0.883 0.797
high_school_government_and_politics 0.487 0.875 0.875 0.926 0.963 0.450 0.772 0.911 - 0.937 0.875 0.870 0.880 0.818 0.725 - - - 0.805 - - 0.937 0.948 0.849 0.834 0.839 0.937 0.772 0.797 0.917 0.917 0.958 0.968 0.860
high_school_macroeconomics 0.235 0.653 0.687 0.717 0.758 0.238 0.474 0.682 - 0.771 0.651 0.653 0.733 0.556 0.497 - - - 0.478 - - 0.766 0.743 0.646 0.635 0.617 0.807 0.564 0.592 0.684 0.684 0.802 0.825 0.715
high_school_mathematics 0.088 0.344 0.337 0.277 0.325 0.033 0.211 0.337 - 0.422 0.237 0.240 0.285 0.255 0.233 - - - 0.162 - - 0.362 0.348 0.214 0.203 0.185 0.274 0.270 0.244 0.440 0.422 0.500 0.537 0.396
high_school_microeconomics 0.268 0.823 0.827 0.801 0.852 0.315 0.533 0.798 - 0.831 0.760 0.773 0.798 0.684 0.575 - - - 0.540 - - 0.873 0.878 0.794 0.743 0.760 0.861 0.672 0.697 0.827 0.827 0.857 0.907 0.840
high_school_physics 0.099 0.509 0.496 0.423 0.496 0.112 0.198 0.403 - 0.562 0.344 0.364 0.443 0.317 0.211 - - - 0.165 - - 0.596 0.582 0.377 0.384 0.324 0.569 0.317 0.311 0.470 0.456 0.635 0.695 0.602
high_school_psychology 0.445 0.827 0.853 0.896 0.910 0.445 0.750 0.882 - 0.900 0.840 0.858 0.862 0.834 0.761 - - - 0.764 - - 0.913 0.900 0.855 0.844 0.823 0.904 0.803 0.796 0.858 0.856 0.882 0.902 0.860
high_school_statistics 0.185 0.564 0.625 0.574 0.615 0.129 0.337 0.574 - 0.555 0.509 0.500 0.638 0.462 0.342 - - - 0.361 - - 0.648 0.648 0.569 0.523 0.458 0.643 0.481 0.518 0.615 0.648 0.717 0.782 0.671
high_school_us_history 0.436 0.764 0.794 0.829 0.867 0.348 0.705 0.843 - 0.872 0.833 0.867 0.774 0.784 0.696 - - - 0.699 - - 0.897 0.921 0.759 0.735 0.754 0.877 0.715 0.759 0.843 0.852 0.882 0.906 0.803
high_school_world_history 0.535 0.759 0.818 0.872 0.881 0.392 0.696 0.864 - 0.907 0.810 0.827 0.801 0.789 0.725 - - - 0.720 - - 0.869 0.873 0.746 0.742 0.772 0.869 0.776 0.793 0.818 0.827 0.869 0.877 0.822
human_aging 0.309 0.596 0.627 0.690 0.739 0.345 0.524 0.645 - 0.699 0.582 0.591 0.641 0.618 0.569 - - - 0.542 - - 0.713 0.704 0.582 0.547 0.587 0.726 0.569 0.587 0.681 0.690 0.717 0.771 0.641
human_sexuality 0.351 0.648 0.694 0.746 0.755 0.358 0.519 0.740 - 0.763 0.648 0.633 0.648 0.671 0.587 - - - 0.569 - - 0.839 0.816 0.664 0.587 0.603 0.740 0.625 0.625 0.740 0.717 0.786 0.839 0.664
international_law 0.404 0.727 0.776 0.801 0.760 0.495 0.677 0.801 - 0.809 0.735 0.752 0.743 0.776 0.710 - - - 0.710 - - 0.834 0.818 0.735 0.727 0.727 0.892 0.710 0.685 0.768 0.785 0.834 0.867 0.752
jurisprudence 0.444 0.740 0.768 0.785 0.833 0.379 0.648 0.740 - 0.796 0.675 0.722 0.777 0.731 0.574 - - - 0.626 - - 0.833 0.805 0.722 0.750 0.722 0.787 0.694 0.712 0.759 0.750 0.824 0.824 0.759
logical_fallacies 0.380 0.711 0.730 0.811 0.797 0.300 0.644 0.779 - 0.871 0.730 0.754 0.717 0.736 0.687 - - - 0.660 - - 0.797 0.785 0.785 0.754 0.766 0.779 0.705 0.723 0.773 0.766 0.834 0.877 0.803
machine_learning 0.196 0.508 0.491 0.437 0.571 0.169 0.285 0.464 - 0.482 0.419 0.401 0.500 0.366 0.285 - - - 0.321 - - 0.571 0.571 0.437 0.375 0.375 0.544 0.339 0.321 0.437 0.410 0.526 0.642 0.526
management 0.417 0.825 0.786 0.825 0.844 0.475 0.708 0.864 - 0.825 0.737 0.766 0.834 0.737 0.669 - - - 0.708 - - 0.844 0.864 0.786 0.776 0.747 0.854 0.689 0.718 0.805 0.825 0.825 0.864 0.786
marketing 0.517 0.820 0.854 0.863 0.893 0.508 0.782 0.858 - 0.897 0.850 0.858 0.888 0.837 0.799 - - - 0.756 - - 0.893 0.888 0.820 0.803 0.837 0.914 0.811 0.816 0.888 0.893 0.897 0.901 0.833
medical_genetics 0.340 0.720 0.750 0.780 0.810 0.240 0.510 0.720 - 0.790 0.630 0.640 0.710 0.720 0.660 - - - 0.600 - - 0.850 0.880 0.710 0.700 0.700 0.860 0.660 0.690 0.770 0.770 0.820 0.900 0.740
miscellaneous 0.420 0.749 0.768 0.830 0.854 0.401 0.687 0.825 - 0.879 0.775 0.796 0.768 0.773 0.736 - - - 0.727 - - 0.872 0.872 0.777 0.759 0.734 0.864 0.724 0.726 0.807 0.814 0.871 0.885 0.782
moral_disputes 0.323 0.609 0.618 0.680 0.736 0.332 0.511 0.664 - 0.719 0.604 0.612 0.635 0.621 0.560 - - - 0.524 - - 0.748 0.754 0.615 0.621 0.653 0.748 0.537 0.566 0.664 0.676 0.725 0.760 0.606
moral_scenarios 0.115 0.165 0.411 0.325 0.366 0.117 0.143 0.207 - 0.489 0.307 0.360 0.188 0.205 0.410 - - - 0.122 - - 0.482 0.377 0.366 0.404 0.270 0.582 0.130 0.058 0.318 0.368 0.546 0.565 0.268
nutrition 0.313 0.650 0.666 0.683 0.758 0.333 0.565 0.676 - 0.764 0.643 0.653 0.751 0.689 0.620 - - - 0.555 - - 0.843 0.826 0.669 0.620 0.630 0.771 0.647 0.630 0.745 0.745 0.790 0.797 0.686
philosophy 0.327 0.681 0.675 0.658 0.713 0.363 0.536 0.726 - 0.742 0.652 0.659 0.688 0.617 0.578 - - - 0.587 - - 0.736 0.781 0.630 0.588 0.630 0.784 0.562 0.565 0.675 0.688 0.774 0.778 0.649
prehistory 0.308 0.660 0.697 0.728 0.783 0.342 0.577 0.759 - 0.827 0.635 0.663 0.641 0.700 0.604 - - - 0.580 - - 0.805 0.805 0.697 0.663 0.617 0.805 0.641 0.666 0.762 0.756 0.836 0.861 0.703
professional_accounting 0.184 0.418 0.432 0.496 0.514 0.152 0.280 0.436 - 0.531 0.404 0.425 0.429 0.393 0.336 - - - 0.336 - - 0.531 0.517 0.418 0.386 0.421 0.510 0.386 0.414 0.457 0.460 0.560 0.631 0.471
professional_law 0.202 0.397 0.417 0.478 0.528 0.177 0.323 0.441 - 0.489 0.404 0.408 0.417 0.397 0.369 - - - 0.333 - - 0.518 0.505 0.410 0.401 0.394 0.492 0.340 0.337 0.401 0.402 0.477 0.541 0.404
professional_medicine 0.235 0.639 0.636 0.756 0.794 0.113 0.481 0.761 - 0.783 0.654 0.680 0.680 0.724 0.713 - - - 0.564 - - 0.827 0.827 0.687 0.658 0.588 0.823 0.573 0.580 0.680 0.683 0.812 0.845 0.709
professional_psychology 0.300 0.647 0.665 0.728 0.805 0.272 0.495 0.718 - 0.753 0.598 0.609 0.684 0.642 0.509 - - - 0.521 - - 0.790 0.777 0.655 0.617 0.619 0.799 0.586 0.591 0.707 0.702 0.776 0.810 0.668
public_relations 0.409 0.563 0.600 0.700 0.672 0.354 0.509 0.690 - 0.681 0.572 0.627 0.618 0.518 0.545 - - - 0.554 - - 0.736 0.736 0.554 0.572 0.600 0.727 0.563 0.572 0.627 0.645 0.736 0.663 0.636
security_studies 0.240 0.608 0.644 0.746 0.763 0.285 0.632 0.661 - 0.755 0.624 0.632 0.718 0.665 0.616 - - - 0.600 - - 0.787 0.767 0.669 0.673 0.661 0.730 0.620 0.653 0.718 0.718 0.767 0.775 0.714
sociology 0.412 0.781 0.791 0.815 0.860 0.517 0.666 0.820 - 0.850 0.736 0.741 0.810 0.786 0.741 - - - 0.716 - - 0.860 0.875 0.820 0.781 0.766 0.870 0.716 0.736 0.815 0.825 0.855 0.860 0.791
us_foreign_policy 0.510 0.780 0.790 0.868 0.840 0.460 0.740 0.890 - 0.860 0.780 0.800 0.840 0.800 0.800 - - - 0.757 - - 0.920 0.910 0.760 0.770 0.770 0.890 0.750 0.780 0.820 0.820 0.890 0.880 0.770
virology 0.246 0.433 0.445 0.472 0.506 0.283 0.415 0.469 - 0.481 0.415 0.439 0.469 0.439 0.415 - - - 0.387 - - 0.512 0.512 0.403 0.367 0.367 0.500 0.373 0.427 0.463 0.457 0.487 0.518 0.475
world_religions 0.403 0.748 0.801 0.800 0.847 0.350 0.684 0.795 - 0.836 0.766 0.766 0.748 0.789 0.742 - - - 0.747 - - 0.871 0.853 0.742 0.725 0.707 0.836 0.783 0.760 0.818 0.818 0.859 0.871 0.771
MMLU 0.285 0.591 0.623 0.647 0.687 0.269 0.477 0.631 - 0.701 0.580 0.595 0.617 0.570 0.525 - - - 0.486 - - 0.717 0.704 0.599 0.578 0.560 0.710 0.532 0.540 0.639 0.643 0.721 0.757 0.639
AGIEVAL
aquarat 0.374 0.602 0.562 0.665 0.602 0.409 0.763 0.846 - 0.844 0.653 0.637 0.783 0.598 0.633 - - - 0.279 - - 0.516 0.764 0.409 0.574 0.724 0.834 0.732 0.728 0.799 0.830 0.822 0.870 0.896
logiqa 0.208 0.356 0.337 0.447 0.477 0.145 0.342 0.479 - 0.509 0.399 0.416 0.433 0.328 0.265 - - - 0.264 - - 0.468 0.447 0.281 0.267 0.267 0.445 0.316 0.342 0.427 0.436 0.493 0.554 0.465
lsatar 0.213 0.213 0.282 0.208 0.260 0.217 0.213 0.365 - 0.317 0.073 0.217 0.308 0.295 0.239 - - - 0.186 - - 0.269 0.639 0.256 0.247 0.234 0.369 0.230 0.226 0.260 0.300 0.321 0.400 0.791
lsatlr 0.203 0.486 0.537 0.635 0.654 0.115 0.374 0.596 - 0.686 0.505 0.515 0.592 0.441 0.327 - - - 0.366 - - 0.709 0.686 0.415 0.386 0.401 0.621 0.452 0.449 0.598 0.603 0.729 0.811 0.686
lsatrc 0.312 0.594 0.646 0.750 0.754 0.208 0.475 0.702 - 0.717 0.635 0.643 0.706 0.624 0.486 - - - 0.520 - - 0.814 0.806 0.531 0.524 0.557 0.762 0.553 0.617 0.661 0.687 0.810 0.836 0.706
saten 0.470 0.791 0.810 0.834 0.868 0.305 0.728 0.854 - 0.893 0.815 0.820 0.844 0.781 0.689 - - - 0.679 - - 0.893 0.873 0.713 0.708 0.723 0.830 0.733 0.776 0.810 0.844 0.888 0.922 0.839
satmath 0.559 0.790 0.822 0.886 0.768 0.468 0.945 0.981 - 0.936 0.863 0.868 0.968 0.618 0.845 - - - 0.400 - - 0.804 0.813 0.713 0.754 0.890 0.977 0.900 0.922 0.963 0.963 0.990 0.981 0.995
AGIEVAL 0.294 0.503 0.523 0.598 0.602 0.226 0.488 0.639 - 0.663 0.525 0.546 0.611 0.480 0.433 - - - 0.359 - - 0.615 0.665 0.429 0.438 0.475 0.638 0.501 0.520 0.599 0.616 0.681 0.734 0.702
AGIEVALC_biology - 0.365 - - - 0.104 0.334 0.595 - 0.665 0.756 0.778 0.869 - - - - - - - - 0.721 0.739 - - - - 0.660 0.700 0.804 0.813 0.834 0.582 0.826
AGIEVALC_chemistry - 0.269 - - - 0.078 0.289 0.446 - 0.480 0.642 0.691 0.715 - - - - - - - - 0.509 0.509 - - - - 0.441 0.470 0.583 0.627 0.696 0.789 0.691
AGIEVALC_chinese - 0.247 - - - 0.048 0.231 0.373 - 0.439 0.642 0.650 0.723 - - - - - - - - 0.569 0.577 - - - - 0.508 0.504 0.585 0.593 0.760 0.735 0.573
AGIEVALC_english - 0.774 - - - 0.444 0.728 0.862 - 0.866 0.823 0.833 0.905 - - - - - - - - 0.892 0.892 - - - - 0.794 0.839 0.856 0.849 0.915 0.924 0.849
AGIEVALC_geography - 0.407 - - - 0.246 0.396 0.608 - 0.678 0.728 0.728 0.814 - - - - - - - - 0.718 0.718 - - - - 0.643 0.633 0.753 0.778 0.804 0.839 0.693
AGIEVALC_history - 0.374 - - - 0.225 0.421 0.642 - 0.689 0.829 0.834 0.872 - - - - - - - - 0.736 0.736 - - - - 0.740 0.744 0.774 0.800 0.842 0.923 0.770
AGIEVALC_jecqaca - 0.221 - - - 0.142 0.258 0.292 - 0.348 0.414 0.440 0.660 - - - - - - - - 0.416 0.410 - - - - 0.425 0.424 0.482 0.487 0.564 0.622 0.409
AGIEVALC_jecqakd - 0.223 - - - 0.118 0.229 0.356 - 0.400 0.549 0.559 0.759 - - - - - - - - 0.465 0.461 - - - - 0.498 0.526 0.592 0.605 0.732 0.747 0.521
AGIEVALC_logiqa - 0.310 - - - 0.193 0.328 0.488 - 0.523 0.479 0.490 0.556 - - - - - - - - 0.525 0.525 - - - - 0.399 0.405 0.497 0.500 0.565 0.588 0.499
AGIEVALC_mathcloze - 0.508 - - - - 0.567 0.779 - 0.855 0.491 0.542 0.508 - - - - - - - - 0.754 0.915 - - - - 0.508 0.440 0.694 0.686 0.737 0.805 0.949
AGIEVALC_mathqa - 0.569 - - - 0.322 0.616 0.779 - 0.744 0.621 0.648 0.845 - - - - - - - - 0.664 0.844 - - - - 0.595 0.683 0.779 0.755 0.808 0.834 0.932
AGIEVALC_physics - 0.327 - - - 0.091 0.206 0.304 - 0.471 0.396 0.425 0.563 - - - - - - - - 0.431 0.477 - - - - 0.390 0.413 0.431 0.500 0.683 0.770 0.465
AGIEVALC - 0.361 - - - 0.187 0.368 0.514 - 0.554 0.589 0.607 0.724 - - - - - - - - 0.583 0.603 - - - - 0.529 0.548 0.627 0.636 0.716 0.734 0.626
BBH
boolean_expressions 0.544 0.860 0.876 0.768 0.460 0.632 0.880 0.880 - 0.732 0.848 0.868 0.800 0.844 0.480 - - - 0.764 - - 0.872 0.860 0.852 0.832 0.776 0.936 0.756 0.796 0.864 0.880 0.888 0.808 0.700
causal_judgement 0.550 0.577 0.582 0.598 0.604 0.550 0.582 0.652 - 0.620 0.550 0.550 0.641 0.540 0.518 - - - 0.588 - - 0.631 0.582 0.588 0.593 0.577 0.647 0.497 0.529 0.508 0.513 0.647 0.700 0.625
date_understanding 0.324 0.668 0.748 0.748 0.788 0.408 0.868 0.920 - 0.760 0.580 0.572 0.832 0.716 0.664 - - - 0.548 - - 0.728 0.920 0.696 0.576 0.648 0.932 0.616 0.648 0.764 0.740 0.856 0.872 0.936
disambiguation_qa 0.400 0.712 0.668 0.660 0.720 0.284 0.432 0.448 - 0.612 0.584 0.636 0.716 0.516 0.472 - - - 0.600 - - 0.388 0.516 0.720 0.752 0.608 0.768 0.544 0.556 0.656 0.636 0.764 0.780 0.720
dyck_languages 0.424 0.704 0.712 0.728 0.600 0.344 0.636 0.824 - 0.892 0.516 0.544 0.592 0.796 0.680 - - - 0.744 - - 0.792 0.684 0.580 0.468 0.656 0.776 0.596 0.628 0.868 0.836 0.648 0.820 0.540
formal_fallacies 0.624 0.740 0.660 0.832 0.760 0.612 0.876 0.832 - 0.820 0.568 0.660 0.984 0.984 0.816 - - - 0.852 - - 0.964 0.692 0.808 0.808 0.592 0.804 0.928 0.852 0.628 0.628 0.784 0.812 0.980
geometric_shapes 0.056 0.544 0.456 0.436 0.420 0.128 0.376 0.456 - 0.544 0.392 0.400 0.812 0.440 0.416 - - - 0.288 - - 0.280 0.716 0.416 0.292 0.316 0.648 0.204 0.212 0.544 0.604 0.584 0.640 0.812
hyperbaton 0.512 0.572 0.680 0.884 0.836 0.108 0.940 0.976 - 0.932 0.740 0.824 0.884 0.880 0.624 - - - 0.656 - - 0.884 0.892 0.936 0.936 0.860 0.996 0.636 0.676 0.832 0.792 0.868 0.956 0.968
logical_deduction_five_objects 0.176 0.700 0.532 0.568 0.608 0.284 0.604 0.840 - 0.784 0.528 0.516 0.784 0.568 0.484 - - - 0.352 - - 0.600 0.968 0.632 0.532 0.536 0.940 0.468 0.528 0.752 0.728 0.876 0.924 0.960
logical_deduction_seven_objects 0.152 0.556 0.492 0.560 0.552 0.212 0.640 0.740 - 0.776 0.444 0.500 0.756 0.488 0.408 - - - 0.296 - - 0.616 0.944 0.568 0.500 0.472 0.920 0.420 0.436 0.668 0.656 0.792 0.864 0.928
logical_deduction_three_objects 0.376 0.868 0.820 0.844 0.892 0.428 0.860 0.992 - 0.912 0.836 0.840 0.960 0.804 0.652 - - - 0.608 - - 0.840 0.988 0.844 0.804 0.796 0.992 0.696 0.720 0.940 0.956 0.980 0.992 0.988
movie_recommendation 0.424 0.652 0.676 0.552 0.508 0.372 0.536 0.664 - 0.632 0.604 0.648 0.740 0.536 0.456 - - - 0.508 - - 0.684 0.672 0.520 0.508 0.528 0.992 0.604 0.568 0.556 0.536 0.672 0.648 0.496
multistep_arithmetic_two 0.136 0.944 0.968 0.488 0.472 - 0.868 0.888 - 0.972 0.580 0.524 0.508 0.700 0.532 - - - 0.108 - - 0.832 0.956 0.836 0.420 0.408 0.984 0.852 0.876 0.896 0.948 0.964 0.976 0.992
navigate 0.540 0.580 0.588 0.596 0.648 0.592 0.648 0.724 - 0.744 0.420 0.420 0.580 0.580 0.580 - - - 0.600 - - 0.680 0.464 0.588 0.584 0.612 0.640 0.576 0.572 0.596 0.596 0.624 0.684 0.992
object_counting 0.464 0.764 0.820 0.848 0.856 - 0.908 0.908 - 0.976 0.616 0.660 0.892 0.864 0.808 - - - 0.608 - - 0.832 0.984 0.836 0.344 0.596 0.996 0.740 0.764 0.848 0.804 0.892 0.896 0.960
penguins_in_a_table 0.369 0.842 0.746 0.890 0.842 0.267 0.876 0.986 - 0.739 0.917 0.917 0.958 0.856 0.801 - - - 0.623 - - 0.616 0.993 0.883 0.712 0.801 1.000 0.821 0.849 0.945 0.924 0.958 0.986 1.000
reasoning_about_colored_objects 0.276 0.860 0.800 0.744 0.900 0.180 0.752 0.888 - 0.844 0.876 0.796 0.940 0.824 0.568 - - - 0.608 - - 0.804 0.992 0.808 0.656 0.776 0.968 0.700 0.764 0.904 0.868 0.944 0.984 0.996
ruin_names 0.176 0.484 0.636 0.716 0.760 0.172 0.468 0.696 - 0.816 0.696 0.652 0.716 0.744 0.532 - - - 0.400 - - 0.764 0.748 0.612 0.600 0.588 0.816 0.396 0.324 0.440 0.544 0.692 0.760 0.720
salient_translation_error_detection 0.212 0.448 0.508 0.548 0.568 0.172 0.560 0.640 - 0.564 0.476 0.488 0.580 0.512 0.464 - - - 0.444 - - 0.600 0.656 0.520 0.532 0.540 0.636 0.452 0.432 0.560 0.572 0.612 0.700 0.716
snarks 0.483 0.685 0.707 0.691 0.719 0.033 0.724 0.803 - 0.634 0.702 0.707 0.769 0.651 0.657 - - - 0.606 - - 0.634 0.786 0.747 0.786 0.814 0.882 0.662 0.623 0.747 0.780 0.831 0.865 0.842
sports_understanding 0.584 0.672 0.692 0.788 0.816 0.488 0.696 0.804 - 0.844 0.472 0.468 0.668 0.720 0.644 - - - 0.716 - - 0.796 0.708 0.596 0.600 0.544 0.740 0.620 0.616 0.676 0.684 0.680 0.748 0.680
temporal_sequences 0.164 0.528 0.540 0.708 0.748 0.436 0.988 0.996 - 0.940 0.756 0.840 0.956 0.856 0.712 - - - 0.404 - - 0.844 0.992 0.784 0.508 0.768 1.000 0.324 0.388 0.800 0.820 0.988 0.992 0.992
tracking_shuffled_objects_five_objects 0.208 0.560 0.616 0.600 0.692 0.508 0.924 1.000 - 0.648 0.544 0.536 0.864 0.656 0.500 - - - 0.344 - - 0.716 0.992 0.940 0.712 0.852 1.000 0.420 0.452 0.840 0.908 0.924 0.972 0.988
tracking_shuffled_objects_seven_objects 0.140 0.324 0.524 0.572 0.640 0.228 0.884 0.988 - 0.660 0.512 0.436 0.764 0.592 0.420 - - - 0.296 - - 0.744 0.984 0.896 0.612 0.848 0.984 0.292 0.312 0.800 0.868 0.848 0.980 0.948
tracking_shuffled_objects_three_objects 0.288 0.696 0.732 0.732 0.848 0.808 0.972 0.992 - 0.548 0.620 0.696 0.956 0.728 0.608 - - - 0.436 - - 0.880 0.996 0.960 0.788 0.884 1.000 0.604 0.664 0.832 0.872 0.856 0.996 0.992
web_of_lies 0.476 0.576 0.520 0.520 0.488 0.488 0.516 0.540 - 0.532 0.476 0.488 0.512 0.512 0.544 - - - 0.488 - - 0.560 0.504 0.488 0.492 0.512 0.512 0.512 0.512 0.528 0.532 0.544 0.624 1.000
word_sorting 0.056 0.204 0.292 0.404 0.540 0.080 0.236 0.424 - 0.536 0.404 0.392 0.144 0.512 0.360 - - - 0.280 - - 0.632 0.592 0.204 0.152 0.140 0.360 0.156 0.156 0.212 0.220 0.292 0.400 0.272
BBH 0.334 0.638 0.650 0.664 0.674 0.355 0.711 0.794 - 0.743 0.596 0.608 0.749 0.681 0.566 - - - 0.506 - - 0.714 0.806 0.696 0.592 0.627 0.846 0.554 0.567 0.709 0.718 0.775 0.827 0.841
MUSR
murder_mystery 0.552 0.640 0.592 0.668 0.576 0.528 0.592 0.608 - 0.552 0.616 0.584 0.620 0.584 0.576 - - - 0.516 - - 0.712 0.680 0.636 0.620 0.588 0.708 0.544 0.612 0.604 0.584 0.652 0.640 0.636
object_placements 0.429 0.535 0.578 0.519 0.542 0.296 0.480 0.542 - 0.448 0.492 0.531 0.460 0.546 0.523 - - - 0.453 - - 0.516 0.532 0.503 0.457 0.453 0.464 0.472 0.476 0.531 0.554 0.519 0.265 0.532
team_allocation 0.436 0.512 0.496 0.460 0.476 0.328 0.400 0.560 - 0.572 0.572 0.588 0.448 0.460 0.396 - - - 0.356 - - 0.612 0.576 0.536 0.480 0.508 0.628 0.444 0.384 0.512 0.476 0.556 0.592 0.708
MUSR 0.472 0.562 0.555 0.548 0.531 0.383 0.490 0.570 - 0.524 0.559 0.567 0.509 0.530 0.498 - - - 0.441 - - 0.613 0.596 0.558 0.518 0.515 0.599 0.486 0.490 0.548 0.538 0.575 0.497 0.625
GPQA_diamond - - - - - - - - - 0.388 - - - - - 0.479 0.469 - - - - 0.358 0.540 - - - - - - - - - - 0.570
GPQA - - - - - - - - - 0.388 - - - - - 0.479 0.469 - - - - 0.358 0.540 - - - - - - - - - - 0.570
MMLUPRO
biology 0.324 0.708 0.702 0.747 0.772 0.361 0.640 0.794 - 0.752 0.676 0.695 0.750 0.686 0.623 - - - 0.582 - - 0.776 0.584 0.702 0.662 0.682 0.835 0.610 0.638 0.709 0.729 0.797 0.764 0.852
business 0.190 0.624 0.525 0.583 0.626 0.173 0.518 0.659 - 0.616 0.522 0.562 0.628 0.558 0.458 - - - 0.335 - - 0.612 0.756 0.571 0.509 0.588 0.785 0.504 0.558 0.647 0.661 0.718 0.755 0.768
chemistry 0.166 0.639 0.500 0.503 0.546 0.115 0.380 0.574 - 0.536 0.465 0.467 0.589 0.467 0.390 - - - - - - 0.488 0.728 0.463 0.296 0.513 0.765 0.387 0.451 0.559 0.580 0.684 0.701 0.824
computer_science 0.197 0.602 0.590 0.482 0.560 0.170 0.421 0.643 - 0.560 0.497 0.502 0.585 0.485 0.414 - - - - - - 0.556 0.680 0.475 0.448 0.456 0.734 0.434 0.402 0.590 0.604 0.663 0.734 0.736
economics 0.236 0.663 0.662 0.668 0.678 0.206 0.534 0.699 - 0.660 0.617 0.610 0.662 0.568 0.492 - - - - - - 0.648 0.612 0.609 0.587 0.629 0.792 0.521 0.550 0.674 0.687 0.721 0.787 0.756
engineering 0.157 0.437 0.424 0.406 0.414 0.138 0.253 0.373 - 0.420 0.303 0.298 0.454 0.378 0.302 - - - - - - 0.488 0.544 0.297 0.283 0.361 0.589 0.296 0.309 0.418 0.420 0.512 0.573 0.668
health 0.158 0.503 0.517 0.545 0.621 0.156 0.399 0.596 - 0.548 0.492 0.496 0.544 0.558 0.437 - - - - - - 0.544 0.616 0.515 0.466 0.506 0.700 0.388 0.416 0.556 0.569 0.643 0.690 0.644
history 0.149 0.406 0.467 0.493 0.490 0.152 0.354 0.540 - 0.588 0.425 0.438 0.459 0.451 0.380 - - - - - - 0.592 0.568 0.380 0.380 0.409 0.627 0.333 0.367 0.459 0.464 0.566 0.624 0.568
law 0.123 0.268 0.295 0.343 0.405 0.158 0.263 0.372 - 0.400 0.299 0.284 0.307 0.303 0.243 - - - - - - 0.384 0.328 0.276 0 0.294 0.500 0.220 0.237 0.300 0.292 0.366 0.455 0.412
math 0.203 0.694 0.564 0.538 0.570 0.180 0.586 0.739 - 0.664 0.490 0.523 0.617 0.555 0.511 - - - - - - 0.536 0.812 0.522 0.458 0.578 0.816 0.581 0.603 0.712 0.723 0.775 0.814 0.884
other 0.164 0.450 0.496 0.551 0.574 0.173 0.428 0.580 - 0.552 0.464 0.458 0.536 0.487 0.389 - - - - - - 0.484 0.592 0.500 0.433 0.493 0.706 0.410 0.405 0.529 0.551 0.611 0.664 0.604
philosophy 0.148 0.442 0.462 0.448 0.488 0.176 0.356 0.555 - 0.560 0.408 0.412 0.424 0.382 0.326 - - - - - - 0.476 0.580 0.406 0.390 0.394 0.633 0.376 0.364 0.480 0.464 0.557 0.599 0.544
physics 0.159 0.583 0.493 0.501 0.559 0.125 0.397 0.595 - 0.512 0.441 0.461 0.587 0.488 0.397 - - - - - - 0.488 0.724 0.455 0.425 0.500 0.765 0.419 0.456 0.589 0.602 0.702 0.543 0.872
psychology 0.258 0.621 0.645 0.647 0.692 0.273 0.567 0.685 - 0.632 0.586 0.602 0.665 0.637 0.518 - - - - - - 0.680 0.544 0.621 0.572 0.583 0.759 0.526 0.563 0.636 0.644 0.721 0.749 0.684
MMLUPRO 0.186 0.552 0.517 0.528 0.568 0.177 0.436 0.597 - 0.571 0.471 0.480 0.559 0.499 0.419 - - - 0.453 - - 0.553 0.619 0.482 0.408 0.502 0.719 0.430 0.457 0.564 0.575 0.649 0.671 0.701
CATEGORIES
REASONING 0.367 0.713 0.738 0.788 0.814 0.344 0.598 0.787 0.767 0.809 0.804 0.811 0.815 0.713 0.606 - - - 0.628 0.866 0.877 0.863 0.848 0.724 0.691 0.619 0.809 0.689 0.719 0.805 0.809 0.850 0.874 0.785
UNDERSTANDING 0.366 0.644 0.670 0.707 0.742 0.327 0.552 0.695 0.741 0.746 0.661 0.670 0.691 0.631 0.579 - - - 0.563 0.775 0.772 0.764 0.756 0.614 0.622 0.617 0.728 0.605 0.613 0.692 0.696 0.761 0.793 0.695
LANGUAGE 0.524 0.688 0.692 0.735 0.755 0.504 0.635 0.724 0.721 0.742 0.786 0.783 0.662 0.747 0.705 - - - 0.766 0.786 0.789 0.798 0.792 0.677 0.613 0.653 0.750 0.685 0.682 0.722 0.724 0.769 0.781 0.670
KNOWLEDGE 0.354 0.442 0.496 0.690 0.733 0.353 0.478 0.626 0.560 0.630 0.553 0.543 0.615 0.547 0.536 0.680 0.580 0.540 0.582 0.680 - 0.676 0.653 0.517 0.519 0.534 0.676 0.469 0.426 0.595 0.597 0.693 0.725 0.622
COT 0.220 0.552 0.530 0.550 0.582 0.201 0.470 0.616 - 0.630 0.485 0.500 0.586 0.530 0.446 0.479 0.469 - 0.498 - - 0.600 0.651 0.506 0.440 0.513 0.725 0.443 0.462 0.570 0.581 0.653 0.684 0.708
MATHCOT 0.369 0.730 0.752 0.735 0.740 0.417 0.793 0.879 0.882 0.790 0.682 0.679 0.813 0.728 0.647 0.840 0.860 0.860 0.493 0.830 0.780 0.743 0.908 0.767 0.638 0.745 0.919 0.667 0.694 0.823 0.821 0.869 0.903 0.946
CODE 0.176 0.460 0.534 0.495 0.568 0.241 0.485 0.582 0.829 0.618 0.456 0.475 0.500 0.463 0.366 - - - 0.321 0.841 0.823 0.409 0.619 0.427 0.376 0.418 0.568 0.437 0.445 0.510 0.528 0.578 0.612 0.597
DISCIPLINES
NLP 0.408 0.647 0.670 0.755 0.786 0.392 0.595 0.748 0.751 0.761 0.729 0.728 0.737 0.677 0.609 - - - 0.642 0.834 0.841 0.808 0.792 0.647 0.637 0.614 0.755 0.632 0.630 0.731 0.734 0.791 0.818 0.733
MATH 0.294 0.669 0.659 0.637 0.653 0.298 0.659 0.775 0.882 0.727 0.590 0.597 0.720 0.629 0.556 0.840 0.860 0.860 0.451 0.830 0.780 0.678 0.789 0.646 0.543 0.625 0.817 0.576 0.599 0.741 0.742 0.799 0.843 0.831
SCIENCE 0.350 0.706 0.713 0.739 0.769 0.304 0.580 0.737 - 0.797 0.686 0.698 0.756 0.676 0.605 0.479 0.469 - 0.673 - - 0.806 0.821 0.696 0.660 0.681 0.845 0.629 0.657 0.738 0.748 0.815 0.806 0.830
ENGINEERING 0.166 0.464 0.453 0.426 0.438 0.158 0.271 0.397 - 0.496 0.334 0.333 0.480 0.397 0.323 - - - 0.393 - - 0.567 0.587 0.323 0.308 0.388 0.595 0.315 0.325 0.443 0.444 0.530 0.590 0.655
MEDICINE 0.216 0.524 0.540 0.595 0.648 0.182 0.411 0.598 - 0.642 0.521 0.530 0.570 0.577 0.496 - - - 0.447 - - 0.681 0.684 0.537 0.501 0.490 0.672 0.459 0.478 0.574 0.580 0.655 0.702 0.595
HUMANITIES 0.291 0.550 0.615 0.645 0.679 0.272 0.495 0.641 0.560 0.705 0.593 0.610 0.622 0.578 0.529 0.680 0.580 0.540 0.536 0.680 - 0.710 0.698 0.588 0.567 0.562 0.739 0.527 0.533 0.629 0.638 0.716 0.742 0.634
BUSINESS 0.252 0.679 0.655 0.678 0.709 0.245 0.537 0.704 - 0.743 0.623 0.637 0.696 0.598 0.517 - - - 0.466 - - 0.749 0.762 0.637 0.604 0.635 0.801 0.565 0.596 0.701 0.710 0.759 0.802 0.759
LAW 0.200 0.362 0.427 0.483 0.524 0.172 0.316 0.443 - 0.504 0.417 0.429 0.494 0.406 0.344 - - - 0.370 - - 0.537 0.543 0.392 0.310 0.390 0.541 0.374 0.383 0.451 0.456 0.541 0.604 0.514
COMPOSITE AVERAGE
AVG 0.342 0.612 0.641 0.692 0.724 0.324 0.555 0.701 0.753 0.729 0.648 0.654 0.689 0.629 0.561 0.620 0.595 0.700 0.578 0.833 0.841 0.740 0.757 0.616 0.585 0.591 0.748 0.578 0.586 0.686 0.691 0.754 0.783 0.716

THINKING MODELS:

MODEL Qwen3-0.6B Qwen3-1.7B Qwen3-4B Qwen3-4B Qwen3-4B-Thinking-2507 Qwen3-8B Qwen3-8B Qwen3-8B Qwen3-14B Qwen3-14B Qwen3-30B-A3B Qwen3-30B-A3B Qwen3-32B Qwen3-32B QwQ-32B-Preview QwQ-32B Ring-mini-2.0
params 0.75163B 2.03B 4.02B 4.02B 4.02B 8.19B 8.19B 8.19B 14.77B 14.77B 30.53B 30.53B 32.8B 32.8B 32.76B 32.76B 16.26B
quant Q8_0 Q8_0 Q8_0 Q8_0_H Q6_K_H Q4_K_H Q6_K_H Q6_K IQ4_XS Q4_K_H IQ4_XS Q4_K_H IQ4_XS Q4_K_H IQ4_XS Q4_K_H Q6_K_H
engine llama.cpp version: 5679 llama.cpp version: 5415 llama.cpp version: 5242 llama.cpp version: 5509 llama.cpp version: 6653 llama.cpp version: 5279 llama.cpp version: 5223 llama.cpp version: 5153 llama.cpp version: 5223 llama.cpp version: 5379 llama.cpp version: 5279 llama.cpp version: 5353 llama.cpp version: 5466 llama.cpp version: 5466 llama.cpp version: 4273 llama.cpp version: 6118 llama.cpp version: 6815
TEST acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc acc
WG 0.564 0.610 0.662 0.642 0.662 0.651 0.689 0.678 0.722 0.726 0.699 0.700 0.712 0.731 0.750 0.689 0.606
LAMBADA 0.471 0.590 0.644 0.638 0.662 0.660 0.700 0.700 0.729 0.714 0.698 0.725 0.692 0.701 0.780 0.651 0.522
HELLASWAG 0.277 0.553 0.721 0.726 0.722 0.718 0.787 0.787 0.827 0.792 0.815 0.832 0.838 0.812 0.875 0.883 0.689
BOOLQ 0.449 0.531 0.626 0.626 0.485 0.641 0.608 0.611 0.662 0.632 - 0.502 0.603 0.574 0.629 0.658 0.556
STORYCLOZE 0.764 0.833 0.849 0.843 0.809 0.809 0.852 0.873 - 0.905 - 0.917 0.960 0.959 0.964 0.935 0.937
CSQA 0.377 0.567 0.705 0.705 0.704 0.685 0.740 0.748 - 0.749 - 0.742 - 0.778 0.796 0.680 0.724
OBQA 0.394 0.584 0.756 0.754 0.729 0.719 0.767 0.774 - 0.787 - 0.836 - 0.869 0.882 0.815 0.764
COPA 0.569 0.765 0.872 0.865 0.801 0.829 0.828 0.864 - 0.919 - 0.919 - 0.946 0.936 0.962 0.877
PIQA 0.393 0.574 0.710 0.710 0.685 0.744 0.769 0.781 - 0.798 - 0.845 - 0.815 0.829 0.871 0.755
SIQA 0.363 0.569 0.664 0.664 0.640 0.637 0.671 0.679 - 0.689 - 0.693 - 0.714 0.714 0.747 0.664
MEDQA 0.135 0.278 0.435 0.428 0.450 0.448 0.499 0.509 - 0.531 - 0.597 - 0.553 0.598 0.518 0.450
SQA 0.249 0.032 0.039 0.040 0.147 0.036 0.039 0.042 - 0.045 - 0.055 - 0.047 - 0.060 0.022
JEOPARDY 0.640 0.270 0.280 0.220 0.400 0.410 0.280 0.240 0.480 0.490 0.520 0.470 0.470 0.470 0.600 0.440 0.030
GSM8K 0.748 0.920 0.946 0.960 0.960 0.946 0.953 0.956 - 0.948 - 0.962 - 0.972 0.962 0.964 0.952
APPLE 0.460 0.790 0.850 0.840 0.880 0.790 0.880 0.890 0.910 0.920 0.820 0.850 0.910 0.910 0.870 0.880 0.850
HUMANEVAL 0.445 0.682 0.817 0.804 0.786 0.835 0.865 0.859 - 0.859 - 0.884 - 0.890 0.414 0.512 0.853
HUMANEVALP 0.335 0.591 0.682 0.676 0.670 0.725 0.713 0.731 - 0.737 - 0.750 - 0.780 0.359 0.432 0.750
HUMANEVALFIM - - - - - - - - - - - - - - - - -
MBPP 0.408 0.544 0.645 0.642 0.536 0.571 0.618 0.630 - 0.700 - 0.677 - 0.684 0.404 0.568 0.665
MBPPP 0.388 0.482 0.598 0.602 0.517 0.580 0.611 0.566 - 0.651 - - - 0.678 0.392 0.584 0.642
HUMANEVALX_cpp 0.231 0.359 0.463 0.353 0.646 0.524 0.615 0.554 - 0.652 - - - 0.737 0.378 0.603 0.750
HUMANEVALX_java 0.274 0.548 0.731 0.737 0.774 0.737 0.780 0.798 - 0.841 - - - 0.847 0.097 0.280 0.835
HUMANEVALX_js 0.256 0.518 0.719 0.695 0.439 0.762 0.774 0.774 - 0.786 - - - 0.817 0.493 0.493 0.786
HUMANEVALX 0.254 0.475 0.638 0.595 0.619 0.674 0.723 0.709 - 0.760 - - - 0.800 0.323 0.459 0.790
CRUXEVAL_input 0.353 0.406 0.457 0.453 0.500 0.445 0.528 0.510 - 0.537 - - - 0.450 0.200 0.498 0.398
CRUXEVAL_output 0.241 0.338 0.420 0.403 0.440 0.405 0.446 0.447 - 0.501 - - - 0.431 0.368 0.513 0.391
CRUXEVAL 0.297 0.372 0.438 0.428 0.470 0.425 0.487 0.478 - 0.519 - - - 0.440 0.284 0.506 0.395
CRUXEVALFIM_input - - - - - - - - - - - - - - - - -
CRUXEVALFIM_output - - - - - - - - - - - - - - - - -
CRUXEVALFIM - - - - - - - - - - - - - - - - -
TQA_mc 0.261 0.406 0.600 0.598 0.656 0.592 0.641 0.635 - 0.676 - - - 0.742 0.795 0.701 0.608
TQA_tf 0.429 0.445 0.502 0.500 0.519 0.513 0.430 0.458 - 0.614 - - - 0.456 0.523 0.628 0.513
TQA 0.409 0.441 0.514 0.511 0.535 0.523 0.455 0.479 - 0.621 - - - 0.490 0.554 0.637 0.524
ARC_challenge 0.275 0.686 0.854 0.852 0.850 0.833 0.882 0.882 - 0.896 - - - 0.910 0.917 0.843 0.824
ARC_easy 0.502 0.850 0.937 0.933 0.948 0.934 0.952 0.955 - 0.964 - - - 0.974 0.975 0.906 0.937
ARC 0.427 0.796 0.910 0.906 0.916 0.901 0.929 0.931 - 0.942 - - - 0.953 0.956 0.886 0.900
RACE_high 0.359 0.594 0.759 0.756 0.784 0.747 0.794 0.798 - 0.826 - - - 0.822 0.871 0.862 0.748
RACE_middle 0.397 0.652 0.808 0.808 0.830 0.818 0.842 0.844 - 0.873 - - - 0.881 - 0.889 0.802
RACE 0.370 0.611 0.774 0.771 0.797 0.768 0.808 0.811 - 0.839 - - - 0.839 0.871 0.870 0.764
MMLU
abstract_algebra 0.100 0.240 0.410 0.420 0.460 0.360 0.430 0.470 - 0.500 - - - 0.470 - 0.470 0.380
anatomy 0.274 0.437 0.540 0.555 0.614 0.592 0.622 0.607 - 0.651 - - - 0.644 - 0.681 0.622
astronomy 0.328 0.611 0.723 0.730 0.750 0.796 0.822 0.828 - 0.861 - - - 0.868 - 0.802 0.750
business_ethics 0.290 0.460 0.670 0.670 0.630 0.610 0.650 0.650 - 0.720 - - - 0.730 - 0.720 0.630
clinical_knowledge 0.350 0.528 0.690 0.709 0.686 0.728 0.758 0.743 - 0.762 - - - 0.781 - 0.766 0.694
college_biology 0.388 0.604 0.770 0.784 0.805 0.784 0.805 0.812 - 0.847 - - - 0.881 - 0.819 0.694
college_chemistry 0.230 0.340 0.420 0.420 0.460 0.420 0.480 0.490 - 0.550 - - - 0.500 - 0.460 0.490
college_computer_science 0.230 0.380 0.580 0.580 0.620 0.610 0.650 0.700 - 0.630 - - - 0.700 - 0.580 0.470
college_mathematics 0.160 0.300 0.370 0.390 0.360 0.390 0.450 0.500 - 0.450 - - - 0.400 - 0.350 0.390
college_medicine 0.312 0.537 0.676 0.664 0.664 0.676 0.716 0.722 - 0.739 - - - 0.722 - 0.722 0.595
college_physics 0.137 0.254 0.519 0.509 0.460 0.500 0.529 0.578 - 0.578 - - - 0.558 - 0.421 0.441
computer_security 0.430 0.650 0.690 0.700 0.710 0.710 0.750 0.740 - 0.770 - - - 0.770 - 0.630 0.640
conceptual_physics 0.255 0.514 0.685 0.685 0.702 0.697 0.736 0.761 - 0.821 - - - 0.821 - 0.668 0.668
econometrics 0.140 0.394 0.587 0.614 0.622 0.552 0.614 0.631 - 0.622 - - - 0.605 - 0.543 0.447
electrical_engineering 0.324 0.475 0.586 0.579 0.579 0.586 0.648 0.648 - 0.717 - - - 0.724 - 0.517 0.558
elementary_mathematics 0.148 0.391 0.582 0.574 0.589 0.584 0.640 0.650 - 0.701 - - - 0.679 - 0.547 0.476
formal_logic 0.238 0.357 0.547 0.531 0.484 0.460 0.476 0.476 - 0.515 - - - 0.579 - 0.611 0.436
global_facts 0.140 0.110 0.180 0.220 0.180 0.240 0.260 0.280 - 0.300 - - - 0.310 - 0.430 0.200
high_school_biology 0.432 0.600 0.822 0.822 0.832 0.822 0.848 0.858 - 0.883 - - - 0.906 - 0.783 0.790
high_school_chemistry 0.147 0.423 0.600 0.591 0.625 0.551 0.635 0.650 - 0.709 - - - 0.665 - 0.596 0.581
high_school_computer_science 0.340 0.580 0.750 0.750 0.760 0.730 0.820 0.830 - 0.830 - - - 0.810 - 0.800 0.700
high_school_european_history 0.387 0.600 0.703 0.690 0.781 0.727 0.818 0.812 - 0.787 - - - 0.806 - 0.824 0.696
high_school_geography 0.393 0.631 0.803 0.808 0.792 0.752 0.787 0.808 - 0.853 - - - 0.878 - 0.858 0.797
high_school_government_and_politics 0.284 0.621 0.849 0.854 0.834 0.834 0.891 0.906 - 0.901 - - - 0.958 - 0.886 0.823
high_school_macroeconomics 0.292 0.474 0.661 0.656 0.653 0.669 0.712 0.712 - 0.787 - - - 0.810 - 0.692 0.623
high_school_mathematics 0.166 0.292 0.348 0.355 0.403 0.351 0.418 0.396 - 0.474 - - - 0.381 - 0.296 0.355
high_school_microeconomics 0.390 0.609 0.773 0.768 0.798 0.802 0.882 0.886 - 0.911 - - - 0.894 - 0.676 0.735
high_school_physics 0.119 0.350 0.556 0.549 0.529 0.549 0.602 0.602 - 0.649 - - - 0.662 - 0.456 0.437
high_school_psychology 0.526 0.746 0.842 0.844 0.849 0.849 0.877 0.877 - 0.891 - - - 0.921 - 0.790 0.836
high_school_statistics 0.333 0.462 0.648 0.657 0.657 0.625 0.694 0.689 - 0.703 - - - 0.736 - 0.537 0.550
high_school_us_history 0.338 0.553 0.784 0.764 0.759 0.710 0.823 0.848 - 0.862 - - - 0.897 - 0.872 0.754
high_school_world_history 0.459 0.645 0.793 0.776 0.789 0.797 0.839 0.831 - 0.827 - - - 0.864 - 0.877 0.725
human_aging 0.331 0.439 0.587 0.578 0.596 0.600 0.609 0.623 - 0.668 - - - 0.771 - 0.721 0.551
human_sexuality 0.374 0.549 0.641 0.664 0.679 0.664 0.740 0.755 - 0.770 - - - 0.809 - 0.770 0.656
international_law 0.429 0.537 0.669 0.652 0.694 0.628 0.694 0.710 - 0.826 - - - 0.801 - 0.809 0.743
jurisprudence 0.398 0.527 0.675 0.694 0.759 0.675 0.731 0.731 - 0.805 - - - 0.777 - 0.787 0.675
logical_fallacies 0.319 0.613 0.791 0.785 0.791 0.717 0.779 0.803 - 0.822 - - - 0.803 - 0.644 0.723
machine_learning 0.276 0.339 0.526 0.508 0.500 0.392 0.491 0.455 - 0.562 - - - 0.455 - 0.616 0.553
management 0.514 0.640 0.786 0.805 0.805 0.834 0.844 0.873 - 0.825 - - - 0.796 - 0.718 0.834
marketing 0.602 0.739 0.816 0.807 0.837 0.820 0.884 0.876 - 0.876 - - - 0.876 - 0.773 0.794
medical_genetics 0.370 0.580 0.750 0.710 0.810 0.750 0.780 0.750 - 0.790 - - - 0.840 - 0.750 0.750
miscellaneous 0.390 0.597 0.752 0.744 0.761 0.757 0.789 0.795 - 0.831 - - - 0.854 - 0.786 0.740
moral_disputes 0.289 0.473 0.580 0.583 0.578 0.540 0.609 0.615 - 0.638 - - - 0.699 - 0.760 0.589
moral_scenarios 0.109 0 0.145 0.140 0.230 0.234 0.322 0.269 - 0.330 - - - 0.292 - 0.518 0.082
nutrition 0.316 0.509 0.660 0.669 0.699 0.660 0.702 0.702 - 0.771 - - - 0.771 - 0.771 0.656
philosophy 0.225 0.501 0.623 0.617 0.636 0.575 0.636 0.643 - 0.675 - - - 0.710 - 0.755 0.639
prehistory 0.345 0.530 0.688 0.688 0.703 0.688 0.753 0.743 - 0.774 - - - 0.796 - 0.833 0.663
professional_accounting 0.237 0.308 0.439 0.443 0.443 0.404 0.482 0.482 - 0.510 - - - 0.546 - 0.588 0.393
professional_law 0.201 0.288 0.378 0.382 0.370 0.369 0.414 0.424 - 0.431 - - - 0.475 - 0.505 0.393
professional_medicine 0.209 0.463 0.698 0.705 0.720 0.676 0.757 0.775 - 0.786 - - - 0.830 - 0.790 0.632
professional_psychology 0.303 0.459 0.651 0.655 0.620 0.601 0.676 0.679 - 0.733 - - - 0.766 - 0.712 0.616
public_relations 0.345 0.500 0.572 0.563 0.654 0.581 0.600 0.636 - 0.645 - - - 0.709 - 0.672 0.600
security_studies 0.412 0.604 0.636 0.636 0.685 0.677 0.730 0.730 - 0.746 - - - 0.755 - 0.804 0.640
sociology 0.427 0.656 0.731 0.746 0.800 0.766 0.800 0.815 - 0.781 - - - 0.855 - 0.840 0.736
us_foreign_policy 0.470 0.610 0.720 0.710 0.740 0.780 0.830 0.830 - 0.830 - - - 0.840 - 0.860 0.800
virology 0.319 0.379 0.433 0.433 0.457 0.409 0.463 0.475 - 0.487 - - - 0.487 - 0.542 0.415
world_religions 0.362 0.637 0.719 0.719 0.760 0.783 0.783 0.777 - 0.818 - - - 0.807 - 0.865 0.748
MMLU 0.298 0.457 0.598 0.598 0.613 0.598 0.651 0.654 - 0.684 - - - 0.698 - 0.674 0.577
AGIEVAL
aquarat 0.572 0.760 0.866 0.840 0.860 0.834 0.866 0.897 - 0.885 - - - 0.860 - 0.848 0.864
logiqa 0.062 0.230 0.451 0.453 0.462 0.393 0.420 0.431 - 0.465 - - - 0.520 - 0.586 0.414
lsatar 0.208 0.313 0.486 0.500 0.904 0.430 0.486 0.517 - 0.469 - - - 0.495 - 0.678 0.808
lsatlr 0.164 0.372 0.601 0.594 0.672 0.574 0.641 0.658 - 0.725 - - - 0.768 - 0.813 0.703
lsatrc 0.327 0.464 0.669 0.665 0.710 0.657 0.687 0.713 - 0.713 - - - 0.806 - 0.828 0.695
saten 0.412 0.655 0.830 0.825 0.830 0.825 0.820 0.820 - 0.834 - - - 0.873 - 0.898 0.737
satmath 0.772 0.950 0.990 0.981 0.977 0.986 0.990 0.990 - 0.990 - - - 0.995 - 0.972 0.954
AGIEVAL 0.282 0.458 0.641 0.636 0.703 0.608 0.643 0.659 - 0.678 - - - 0.717 - 0.764 0.676
AGIEVALC_biology 0.152 0.539 0.765 0.760 0.773 0.769 0.834 0.847 - 0.856 - - - 0.878 - 0.708 0.695
AGIEVALC_chemistry 0.117 0.397 0.622 0.602 0.661 0.568 0.647 0.656 - 0.720 - - - 0.803 - 0.754 0.558
AGIEVALC_chinese 0.081 0.365 0.516 0.504 0.573 0.581 0.609 0.634 - 0.678 - - - 0.739 - 0.707 0.536
AGIEVALC_english 0.477 0.728 0.856 0.849 0.866 0.820 0.856 0.856 - 0.866 - - - 0.872 - 0.888 0.813
AGIEVALC_geography 0.281 0.547 0.708 0.683 0.628 0.648 0.758 0.743 - 0.768 - - - 0.829 - 0.819 0.628
AGIEVALC_history 0.319 0.612 0.736 0.727 0.774 0.702 0.761 0.770 - 0.821 - - - 0.872 - 0.889 0.702
AGIEVALC_jecqaca 0.183 0.303 0.397 0.392 0.369 0.378 0.425 0.439 - 0.482 - - - 0.566 - 0.652 0.367
AGIEVALC_jecqakd 0.123 0.359 0.480 0.484 0.492 0.513 0.561 0.574 - 0.613 - - - 0.676 - 0.701 0.473
AGIEVALC_logiqa 0.122 0.317 0.496 0.483 0.508 0.471 0.499 0.497 - 0.562 - - - 0.599 - 0.642 0.465
AGIEVALC_mathcloze 0.728 0.669 0.923 0.957 0.957 0.838 0.830 0.923 - 0.881 - - - 0.932 0.864 0.889 0.983
AGIEVALC_mathqa 0.500 0.704 0.828 0.812 0.928 0.764 0.863 0.851 - 0.813 - - - 0.852 0.828 0.904 0.892
AGIEVALC_physics 0.080 0.333 0.436 0.431 0.471 0.454 0.563 0.545 - 0.626 - - - 0.701 0.741 0.528 0.419
AGIEVALC 0.225 0.448 0.602 0.589 0.611 0.580 0.639 0.646 - 0.680 - - - 0.729 0.811 0.733 0.574
BBH
boolean_expressions 0.724 0.728 0.820 0.812 0.648 0.612 0.560 0.620 - 0.900 - - - 0.832 - 0.768 0.840
causal_judgement 0.491 0.561 0.593 0.582 0.604 0.540 0.588 0.572 - 0.604 - - - 0.631 - 0.636 0.636
date_understanding 0.504 0.752 0.880 0.888 0.928 0.852 0.936 0.912 - 0.916 - - - 0.940 - 0.884 0.904
disambiguation_qa 0.448 0.464 0.648 0.588 0.680 0.464 0.544 0.520 - 0.636 - - - 0.448 - 0.436 0.568
dyck_languages 0.412 0.524 0.580 0.572 0.500 0.672 0.696 0.688 - 0.772 - - - 0.816 - 0.848 0.448
formal_fallacies 0.800 0.800 0.748 0.776 1.000 0.568 0.992 0.604 - 0.768 - - - 0.728 - 0.976 0.520
geometric_shapes 0.228 0.572 0.536 0.556 0.804 0.692 0.716 0.676 - 0.688 - - - 0.728 - 0.780 0.764
hyperbaton 0.576 0.692 0.872 0.856 0.960 0.912 0.952 0.960 - 0.976 - - - 0.948 - 0.940 0.984
logical_deduction_five_objects 0.416 0.772 0.884 0.868 1.000 0.856 0.872 0.928 - 0.936 - - - 0.972 - 0.988 0.988
logical_deduction_seven_objects 0.360 0.664 0.856 0.840 1.000 0.816 0.860 0.880 - 0.888 - - - 0.924 - 0.968 0.996
logical_deduction_three_objects 0.612 0.932 0.988 0.988 0.996 0.988 0.984 0.980 - 0.996 - - - 1.000 - 0.988 0.980
movie_recommendation 0.360 0.416 0.528 0.504 0.604 0.492 0.520 0.544 - 0.572 - - - 0.616 - 0.668 0.684
multistep_arithmetic_two 0.896 0.988 0.996 0.988 1.000 0.984 1.000 1.000 - 0.572 - - - 0.996 - 1.000 0.984
navigate 0.516 0.576 0.580 0.580 1.000 0.508 0.992 0.608 - 0.680 - - - 0.728 - 0.996 0.996
object_counting 0.664 0.872 0.992 0.996 1.000 0.996 0.992 0.996 - 0.996 - - - 1.000 - 0.996 0.952
penguins_in_a_table 0.602 0.897 0.945 0.958 0.986 0.993 1.000 0.993 - 1.000 - - - 1.000 - 1.000 0.986
reasoning_about_colored_objects 0.520 0.792 0.952 0.960 0.996 0.928 0.940 0.960 - 0.948 - - - 0.984 - 0.964 0.968
ruin_names 0.164 0.512 0.508 0.516 0.632 0.604 0.656 0.652 - 0.772 - - - 0.776 - 0.768 0.776
salient_translation_error_detection 0.316 0.488 0.612 0.632 0.696 0.604 0.628 0.572 - 0.660 - - - 0.680 - 0.576 0.652
snarks 0.471 0.573 0.730 0.685 0.780 0.735 0.792 0.735 - 0.780 - - - 0.837 - 0.831 0.764
sports_understanding 0.472 0.524 0.624 0.596 0.708 0.540 0.644 0.636 - 0.560 - - - 0.776 - 0.464 0.600
temporal_sequences 0.136 0.400 0.912 0.892 1.000 0.940 0.992 0.992 - 0.980 - - - 0.992 - 0.948 0.996
tracking_shuffled_objects_five_objects 0.280 0.648 0.940 0.964 1.000 0.968 0.956 0.936 - 0.996 - - - 0.996 - 0.988 0.988
tracking_shuffled_objects_seven_objects 0.232 0.564 0.852 0.884 0.996 0.924 0.952 0.972 - 0.944 - - - 0.948 - 0.980 0.952
tracking_shuffled_objects_three_objects 0.408 0.736 0.896 0.884 1.000 0.920 0.920 0.952 - 0.996 - - - 0.996 - 0.992 0.992
web_of_lies 0.456 0.460 0.552 0.544 1.000 0.488 - 0.488 - 0.540 - - - 0.492 - 1.000 1.000
word_sorting 0.080 0.136 0.220 0.228 0.280 0.292 0.292 0.288 - 0.324 - - - 0.324 - 0.400 0.220
BBH 0.446 0.628 0.748 0.744 0.845 0.734 0.763 0.763 - 0.791 - - - 0.817 - 0.825 0.819
MUSR
murder_mystery 0.524 0.560 0.640 0.668 0.588 0.584 0.636 0.652 - 0.672 - - - 0.636 - 0.560 0.620
object_placements 0.480 0.512 0.566 0.556 0.476 0.536 0.582 0.578 - 0.528 - - - 0.516 - 0.436 0.476
team_allocation 0.280 0.468 0.628 0.612 0.708 0.648 0.656 0.668 - 0.564 - - - 0.632 - 0.728 0.464
MUSR 0.428 0.513 0.611 0.612 0.590 0.589 0.624 0.632 - 0.588 - - - 0.594 - 0.574 0.520
GPQA_diamond 0.262 0.282 0.434 0.489 0.616 0.398 0.530 - - 0.439 - - - 0.555 - 0.555 0.540
GPQA 0.262 0.282 0.434 0.489 0.616 0.398 0.530 - - 0.439 - - - 0.555 - 0.555 0.540
MMLUPRO
biology 0.268 0.596 0.799 0.784 0.852 0.804 0.822 0.831 - 0.824 - - - 0.852 - 0.812 0.716
business 0.348 0.548 0.717 0.692 0.784 0.724 0.740 0.738 - 0.784 - - - 0.812 - 0.788 0.676
chemistry 0.300 0.564 0.720 0.700 0.848 0.700 0.746 0.747 - 0.796 - - - 0.780 - 0.740 0.704
computer_science 0.232 0.532 0.680 0.664 0.744 0.676 0.704 0.707 - 0.704 - - - 0.804 - 0.780 0.676
economics 0.276 0.544 0.716 0.732 0.784 0.732 0.741 0.759 - 0.808 - - - 0.832 - 0.796 0.700
engineering 0.232 0.456 0.557 0.596 0.624 0.536 0.587 0.600 - 0.620 - - - 0.676 - 0.476 0.496
health 0.192 0.312 0.559 0.564 0.668 0.632 0.630 0.639 - 0.652 - - - 0.680 - 0.652 0.576
history 0.200 0.300 0.488 0.524 0.604 0.536 0.561 0.556 - 0.600 - - - 0.680 - 0.588 0.492
law 0.076 0.208 0.295 0.304 0.336 0.348 0.353 0.379 - 0.384 - - - 0.468 - 0.392 0.320
math 0.444 0.676 0.832 0.800 0.884 0.780 0.824 0.827 - 0.844 - - - 0.872 - 0.860 0.796
other 0.188 0.356 0.484 0.528 0.600 0.560 0.590 0.593 - 0.656 - - - 0.704 - 0.636 0.544
philosophy 0.168 0.336 0.504 0.520 0.600 0.492 0.559 0.567 - 0.580 - - - 0.664 - 0.612 0.484
physics 0.232 0.580 0.708 0.740 0.852 0.720 0.753 0.752 - 0.756 - - - 0.824 - 0.736 0.724
psychology 0.212 0.512 0.672 0.632 0.700 0.668 0.725 0.692 - 0.728 - - - 0.748 - 0.672 0.624
MMLUPRO 0.240 0.465 0.622 0.627 0.705 0.636 0.674 0.679 - 0.695 - - - 0.742 - 0.681 0.609
CATEGORIES
REASONING 0.332 0.593 0.744 0.746 0.742 0.738 0.787 0.792 0.827 0.802 0.815 0.832 0.838 0.824 0.885 0.850 0.730
UNDERSTANDING 0.353 0.521 0.653 0.651 0.661 0.646 0.693 0.697 0.722 0.725 0.699 0.747 0.849 0.743 0.809 0.731 0.643
LANGUAGE 0.471 0.590 0.644 0.638 0.662 0.660 0.700 0.700 0.729 0.714 0.698 0.725 0.692 0.701 0.780 0.651 0.522
KNOWLEDGE 0.367 0.456 0.552 0.548 0.531 0.558 0.527 0.541 0.657 0.632 0.520 0.501 0.599 0.563 0.581 0.659 0.531
COT 0.307 0.478 0.616 0.611 0.712 0.611 0.667 0.670 - 0.682 - - - 0.717 - 0.674 0.636
MATHCOT 0.545 0.766 0.900 0.890 0.954 0.882 0.884 0.895 0.910 0.904 0.820 0.954 0.910 0.924 0.927 0.948 0.949
CODE 0.317 0.443 0.538 0.524 0.534 0.532 0.582 0.573 - 0.618 - 0.755 - 0.586 0.321 0.506 0.551
DISCIPLINES
NLP 0.402 0.562 0.674 0.672 0.673 0.671 0.694 0.700 0.767 0.739 0.769 0.764 0.768 0.723 0.772 0.765 0.654
MATH 0.433 0.638 0.789 0.767 0.829 0.779 0.808 0.815 0.910 0.824 0.820 0.954 0.910 0.828 0.927 0.829 0.803
SCIENCE 0.340 0.648 0.783 0.789 0.813 0.782 0.808 0.818 - 0.847 - - - 0.866 0.946 0.793 0.774
ENGINEERING 0.265 0.463 0.561 0.589 0.607 0.554 0.595 0.606 - 0.655 - - - 0.693 - 0.491 0.518
MEDICINE 0.232 0.392 0.554 0.552 0.578 0.566 0.613 0.620 - 0.642 - 0.597 - 0.668 0.598 0.642 0.547
HUMANITIES 0.283 0.449 0.595 0.593 0.627 0.598 0.642 0.639 0.480 0.677 0.520 0.470 0.470 0.707 0.600 0.717 0.577
BUSINESS 0.358 0.555 0.717 0.717 0.744 0.725 0.756 0.762 - 0.807 - - - 0.815 - 0.724 0.683
LAW 0.200 0.329 0.438 0.458 0.490 0.448 0.473 0.488 - 0.535 - - - 0.586 - 0.625 0.489
COMPOSITE AVERAGE
AVG 0.363 0.539 0.664 0.661 0.676 0.663 0.693 0.699 0.767 0.732 0.768 0.764 0.767 0.731 0.759 0.743 0.652
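
The CATEGORIES, DISCIPLINES, and COMPOSITE AVERAGE rows above are roll-ups of the individual test accuracies. As an illustration only (not the benchlm implementation, whose exact weighting is not restated here), the sketch below assumes an unweighted mean at each level; the test names and scores are hypothetical placeholders.

```python
# Minimal sketch (assumption: unweighted means) of rolling per-test accuracies
# up into category averages and a composite average like the rows above.
from statistics import mean

# Hypothetical per-test accuracies for one model, keyed by category.
scores = {
    "REASONING":     {"CSQA": 0.78, "BBH": 0.75, "MUSR": 0.61},
    "UNDERSTANDING": {"WG": 0.72, "BOOLQ": 0.88, "MMLU": 0.70},
    "KNOWLEDGE":     {"TQA": 0.55, "JEOPARDY": 0.61},
}

category_avg = {cat: mean(tests.values()) for cat, tests in scores.items()}
composite_avg = mean(category_avg.values())

for cat, avg in category_avg.items():
    print(f"{cat} {avg:.3f}")
print(f"AVG {composite_avg:.3f}")
```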

CODE MODELS:

MODEL Codestral-22B-v0.1 Codestral-22B-Instruct-v0.1 Deepseek-Coder-V2-Lite-Instruct Qwen2.5-Coder-0.5B-32k-Instruct Qwen2.5-Coder-1.5B-Instruct Qwen2.5-3B-32k-Instruct Qwen2.5-Coder-3B-Instruct Qwen2.5-Coder-7B-Instruct Qwen2.5-Coder-7B Qwen2.5-Coder-7B Qwen2.5-Coder-14B-Instruct Qwen2.5-Coder-14B Qwen2.5-Coder-32B-Instruct Qwen3-Coder-30B-A3B-Instruct
params 22B 22B 14.77B 0.49403B 1.54B 3.09B 3.09B 7.62B 7.62B 7.62B 14.77B 14.77B 32.76B 30.53B
quant IQ4_XS IQ4_XS IQ4_XS Q6_K Q6_K Q6_K Q6_K IQ4_XS IQ4_XS Q6_K IQ4_XS IQ4_XS IQ4_XS Q4_K_H
engine llama.cpp version: 4132 llama.cpp version: 4191 llama.cpp version: 4488 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4094 llama.cpp version: 4295 llama.cpp version: 4132 llama.cpp version: 4120 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 5935
TEST acc acc acc acc acc acc acc acc acc acc acc acc acc acc
HUMANEVAL 0.664 0.810 0.847 0.518 0.676 0.780 0.835 0.829 0.640 0.713 0.878 0.676 0.884 0.939
HUMANEVALP 0.554 0.682 - 0.432 0.567 0.682 0.719 0.707 0.530 0.579 0.756 0.536 0.756 0.810
HUMANEVALFIM 0.719 0.719 0.621 0.518 0.524 - 0.634 0.493 0.713 0.756 0.829 0.518 0.890 0.713
MBPP 0.630 0.653 - 0.408 0.560 0.599 0.618 0.735 0.614 0.571 0.727 0.661 0.715 0.696
MBPPP 0.558 0.593 - 0.352 0.504 0.584 0.589 0.687 0.540 0.513 0.665 0.558 0.669 0.669
HUMANEVALX_cpp 0.640 0.621 - 0.286 0.426 0.237 0.567 0.676 0.548 0.475 0.506 0.573 0.689 0.817
HUMANEVALX_java 0.756 0.670 - 0.512 0.609 0.615 0.743 0.798 0.725 0.652 0.201 0.762 0.841 0.884
HUMANEVALX_js 0.658 0.621 - 0.493 0.615 0.682 0.670 0.798 0.628 0.658 0.817 0.695 0.835 0.871
HUMANEVALX 0.684 0.638 - 0.430 0.550 0.512 0.660 0.758 0.634 0.595 0.508 0.676 0.788 0.857
CRUXEVAL_input 0.438 0.351 - 0.435 0.416 0.347 0.481 0.578 0.255 0.267 0.677 0.281 0.676 0.577
CRUXEVAL_output 0.465 0.447 - 0.278 0.332 0.311 0.413 0.507 0.381 0.435 0.577 0.422 0.610 0.558
CRUXEVAL 0.451 0.399 - 0.356 0.374 0.329 0.447 0.543 0.318 0.351 0.627 0.351 0.643 0.568
CRUXEVALFIM_input 0.295 0.351 - 0.017 0.155 - 0.208 0.322 0.296 0.313 0.421 0.346 0.515 0.440
CRUXEVALFIM_output 0.441 0.355 - 0.098 0.222 - 0.323 0.481 0.352 0.365 0.546 0.481 0.557 0.340
CRUXEVALFIM 0.368 0.353 - 0.058 0.188 - 0.266 0.401 0.324 0.339 0.483 0.413 0.536 0.390
CODE 0.483 0.467 0.734 0.278 0.368 0.449 0.453 0.548 0.413 0.427 0.593 0.458 0.648 0.576
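
The HUMANEVAL/MBPP-style rows report the fraction of problems whose generated solution passes the reference tests. The sketch below is an illustration only, not the harness used for these runs: it assumes a simple pass@1 check where each completion is executed against its problem's tests in a fresh interpreter, and the sample schema (`prompt`/`completion`/`test`) is hypothetical; sandboxing and prompt formatting are simplified.

```python
# Minimal sketch (not the benchlm harness) of a HumanEval-style pass@1 check.
import subprocess, sys, tempfile

def passes(prompt: str, completion: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run prompt + completion + tests in a fresh interpreter; True if it exits cleanly."""
    program = prompt + completion + "\n" + test_code + "\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def pass_at_1(samples: list[dict]) -> float:
    """samples: [{'prompt': ..., 'completion': ..., 'test': ...}, ...] (hypothetical schema)."""
    return sum(passes(s["prompt"], s["completion"], s["test"]) for s in samples) / len(samples)
```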

MATH MODELS:

MODEL Deepseek-R1-Distill-Llama-8B Deepseek-R1-Distill-Llama-8B Deepseek-R1-Distill-Qwen-1.5B Deepseek-R1-Distill-Qwen-7B Deepseek-R1-Distill-Qwen-14B Deepseek-R1-Distill-Qwen-32B GLM-Z1-9B-0414 Qwen2.5-Math-1.5B-Instruct Qwen2.5-Math-7B-Instruct Qwen3-32B QwQ-32B QwQ-32B
params 8.03B 8.03B 1.78B 7.62B 14.77B 32.76B 9.40B 1.54B 7.62B 32.8B 32.76B 32.76B
quant Q6_K Q6_K_H Q8_0 IQ4_XS IQ4_XS IQ4_XS Q6_K_H IQ4_XS Q6_K Q4_K_H IQ4_XS Q4_K_H
engine llama.cpp version: 4707 llama.cpp version: 5898 llama.cpp version: 4763 llama.cpp version: 4644 llama.cpp version: 4657 llama.cpp version: 4559 llama.cpp version: 5935 llama.cpp version: 4406 llama.cpp version: 4394 llama.cpp version: 5633 llama.cpp version: 4820 llama.cpp version: 6026
TEST acc acc acc acc acc acc acc acc acc acc acc acc
GSM8K - 0.888 - - - - 0.964 - - - - 0.964
APPLE - 0.870 - - - - 0.880 - - - - 0.880
GPQA_diamond - 0.308 - - - - 0.434 - - - - 0.555
GPQA - 0.308 - - - - 0.434 - - - - 0.555
MATH1_algebra 0.933 0.977 0.918 0.962 0.925 0.962 0.992 0.859 0.955 0.992 0.992 1.000
MATH1_counting_and_probability 0.820 0.948 0.794 0.948 0.923 0.948 1.000 0.897 0.974 1.000 0.974 1.000
MATH1_geometry 0.842 0.868 0.710 0.736 0.868 0.921 0.947 0.710 0.842 0.842 0.921 0.921
MATH1_intermediate_algebra 0.923 0.980 0.730 0.903 0.865 0.961 0.961 0.730 0.711 0.923 1.000 0.980
MATH1_number_theory 0.700 0.900 0.866 0.800 0.700 0.933 0.900 0.766 1.000 0.666 0.800 0.833
MATH1_prealgebra 0.813 0.941 0.883 0.965 0.883 0.953 0.988 0.837 0.883 0.930 0.953 0.976
MATH1_precalculus 0.684 1.000 0.596 0.859 0.842 1.000 0.947 0.631 0.789 0.947 0.982 0.929
MATH1 0.842 0.956 0.814 0.910 0.878 0.958 0.972 0.794 0.885 0.931 0.963 0.965
MATH2_algebra 0.845 0.975 0.825 0.930 0.900 0.995 0.980 0.910 0.860 0.970 0.975 0.990
MATH2_counting_and_probability 0.831 0.930 0.782 0.851 0.841 0.950 0.980 0.683 0.861 0.970 0.990 0.970
MATH2_geometry 0.841 0.963 0.743 0.914 0.792 0.914 0.987 0.621 0.743 0.792 0.963 0.963
MATH2_intermediate_algebra 0.859 0.953 0.664 0.875 0.835 0.968 0.968 0.671 0.710 0.953 0.960 0.984
MATH2_number_theory 0.826 0.913 0.782 0.826 0.891 0.934 0.956 0.695 0.880 0.891 0.945 0.967
MATH2_prealgebra 0.898 0.966 0.887 0.909 0.875 0.971 0.988 0.836 0.881 0.932 0.971 0.960
MATH2_precalculus 0.787 0.964 0.663 0.902 0.805 0.955 0.955 0.557 0.725 0.858 0.964 0.964
MATH2 0.846 0.956 0.777 0.893 0.856 0.963 0.975 0.742 0.817 0.921 0.968 0.973
MATH3_algebra 0.873 0.938 0.854 0.934 0.911 0.992 0.980 0.881 0.850 0.969 0.996 0.984
MATH3_counting_and_probability 0.800 0.890 0.730 0.770 0.830 0.930 0.970 0.710 0.880 0.950 1.000 1.000
MATH3_geometry 0.794 0.901 0.627 0.911 0.794 0.901 0.970 0.696 0.764 0.833 0.970 0.931
MATH3_intermediate_algebra 0.825 0.969 0.635 0.902 0.882 0.964 0.969 0.574 0.738 0.933 0.969 0.938
MATH3_number_theory 0.819 0.934 0.696 0.754 0.770 0.926 0.926 0.655 0.819 0.811 0.942 0.918
MATH3_prealgebra 0.875 0.950 0.763 0.883 0.892 0.946 0.986 0.816 0.883 0.946 0.982 0.977
MATH3_precalculus 0.661 0.929 0.582 0.874 0.818 0.968 0.968 0.480 0.685 0.858 0.897 0.905
MATH3 0.822 0.937 0.719 0.876 0.859 0.954 0.970 0.714 0.810 0.915 0.969 0.955
MATH4_algebra 0.848 0.950 0.805 0.897 0.922 0.957 0.989 0.851 0.865 0.968 0.992 0.985
MATH4_counting_and_probability 0.729 0.882 0.639 0.738 0.711 0.945 0.963 0.558 0.783 0.945 0.981 0.981
MATH4_geometry 0.792 0.896 0.576 0.776 0.768 0.832 0.920 0.432 0.616 0.712 0.872 0.840
MATH4_intermediate_algebra 0.778 0.947 0.588 0.858 0.850 0.935 0.939 0.512 0.649 0.911 0.947 0.907
MATH4_number_theory 0.795 0.950 0.697 0.809 0.725 0.894 0.929 0.619 0.823 0.823 0.943 0.936
MATH4_prealgebra 0.806 0.931 0.785 0.874 0.827 0.921 0.958 0.748 0.801 0.879 0.926 0.942
MATH4_precalculus 0.719 0.956 0.570 0.868 0.728 0.947 0.947 0.333 0.578 0.859 0.973 0.868
MATH4 0.792 0.935 0.684 0.845 0.816 0.925 0.953 0.620 0.746 0.887 0.952 0.930
MATH5_algebra 0.768 0.947 0.752 0.899 0.853 0.970 0.960 0.674 0.762 0.964 0.964 0.977
MATH5_counting_and_probability 0.699 0.910 0.569 0.756 0.699 0.910 0.934 0.495 0.642 0.910 0.934 0.902
MATH5_geometry 0.712 0.886 0.545 0.810 0.727 0.840 0.878 0.348 0.507 0.734 0.833 0.742
MATH5_intermediate_algebra 0.682 0.900 0.453 0.821 0.778 0.810 0.889 0.253 0.389 0.807 0.860 0.800
MATH5_number_theory 0.811 0.909 0.707 0.727 0.792 0.935 0.961 0.525 0.753 0.870 0.941 0.935
MATH5_prealgebra 0.777 0.849 0.720 0.808 0.782 0.875 0.953 0.580 0.797 0.911 0.927 0.948
MATH5_precalculus 0.562 0.903 0.437 0.851 0.792 0.814 0.888 0.259 0.429 0.777 0.851 0.770
MATH5 0.723 0.904 0.609 0.822 0.787 0.884 0.926 0.462 0.617 0.865 0.907 0.879
MATHCOT 0.795 0.930 0.700 0.860 0.831 0.930 0.954 0.637 0.751 0.897 0.948 0.933
COMPOSITE AVERAGE
AVG 0.795 0.907 0.700 0.860 0.831 0.930 0.936 0.637 0.751 0.897 0.948 0.920

VISION MODELS:

MODEL gemma-3-4b-it gemma-3-4b-it gemma-3-12b-it gemma-3-27b-it LFM2-VL-1.6B Llama-4-Scout-17B-16E-Instruct MiniCPM-V-4_5 Mistral-Small-3.1-24B-Instruct-2503 Mistral-Small-3.2-24B-Instruct-2506 Qwen2.5-Omni-7B Qwen2.5-VL-3B-Instruct Qwen2.5-VL-7B-Instruct Qwen2.5-VL-32B-Instruct
params 3.88B 3.88B 11.77B 27.01B 1.17B 107.77B 8.19B 23.57B 23.57B 7.62B 3.09B 7.62B 32.76B
quant Q6_K Q6_K_H Q4_K_H Q4_K_H Q8_0_H Q2_K_H Q6_K_H_0 Q4_K_H Q4_K_H Q6_K_H Q8_0_H Q6_K_H Q4_K_H
engine llama.cpp version: 5706 llama.cpp version: 5819 llama.cpp version: 5819 llama.cpp version: 5780 llama.cpp version: 6768 llama.cpp version: 5935 llama.cpp version: 6451 llama.cpp version: 5662 llama.cpp version: 5780 llama.cpp version: 5752 llama.cpp version: 5819 llama.cpp version: 5745 llama.cpp version: 5902
TEST acc acc acc acc acc acc acc acc acc acc acc acc acc
CHARTQA 0.464 0.456 0.558 0.662 0.444 0.719 0.752 0.743 0.716 0.554 0.706 0.651 0.711
DOCVQA 0.567 0.563 0.711 0.795 0.548 0.862 0.836 0.892 0.866 0.744 0.686 0.735 0.795
MMMU_Accounting 0.366 0.400 0.566 0.700 0.300 0.866 0.766 0.466 0.733 0.466 0.433 0.533 0.533
MMMU_Agriculture 0.400 0.400 0.500 0.533 0.300 0.600 0.600 0.500 0.533 0.333 0.500 0.433 0.566
MMMU_Architecture_and_Engineering 0.200 0.166 0.400 0.333 0.266 0.366 0.500 0.400 0.400 0.333 0.133 0.400 0.366
MMMU_Art_Theory 0.533 0.666 0.833 0.866 0.400 0.800 0.733 0.866 0.700 0.700 0.400 0.633 0.766
MMMU_Art 0.566 0.566 0.700 0.766 0.433 0.766 0.600 0.633 0.666 0.600 0.400 0.500 0.600
MMMU_Basic_Medical_Science 0.333 0.533 0.633 0.566 0.433 0.700 0.600 0.733 0.600 0.566 0.333 0.500 0.633
MMMU_Biology 0.300 0.166 0.300 0.366 0.366 0.500 0.500 0.400 0.433 0.333 0.200 0.500 0.500
MMMU_Chemistry 0.033 0.266 0.333 0.333 0.166 0.433 0.400 0.366 0.366 0.300 0.233 0.166 0.400
MMMU_Clinical_Medicine 0.066 0.466 0.533 0.600 0.300 0.633 0.666 0.633 0.733 0.533 0.333 0.566 0.666
MMMU_Computer_Science 0.400 0.466 0.466 0.600 0.233 0.533 0.500 0.400 0.433 0.566 0.133 0.433 0.566
MMMU_Design 0.633 0.766 0.733 0.766 0.533 0.866 0.800 0.666 0.800 0.800 0.566 0.566 0.800
MMMU_Diagnostics_and_Laboratory_Medicine 0.100 0.200 0.300 0.233 0.300 0.433 0.433 0.400 0.433 0.400 0.300 0.300 0.466
MMMU_Economics 0.466 0.533 0.500 0.600 0.300 0.766 0.700 0.666 0.766 0.600 0.500 0.600 0.666
MMMU_Electronics 0.066 0.133 0.233 0.400 0.200 0.466 0.366 0.400 0.400 0.366 0.233 0.300 0.333
MMMU_Energy_and_Power 0.333 0.233 0.400 0.500 0.166 0.600 0.566 0.500 0.400 0.400 0.266 0.233 0.466
MMMU_Finance 0.333 0.333 0.466 0.533 0.133 0.500 0.466 0.500 0.466 0.366 0.266 0.400 0.466
MMMU_Geography 0.200 0.266 0.333 0.366 0.300 0.533 0.533 0.533 0.433 0.400 0.233 0.466 0.533
MMMU_History 0.566 0.633 0.733 0.800 0.400 0.800 0.766 0.500 0.700 0.633 0.533 0.533 0.866
MMMU_Literature 0.666 0.766 0.866 0.900 0.633 0.800 0.866 0.866 0.866 0.833 0.866 0.766 0.800
MMMU_Manage 0.233 0.333 0.333 0.500 0.366 0.500 0.500 0.466 0.566 0.500 0.366 0.466 0.533
MMMU_Marketing 0.333 0.400 0.466 0.666 0.300 0.800 0.633 0.633 0.700 0.500 0.233 0.566 0.533
MMMU_Materials 0.133 0.233 0.133 0.300 0.133 0.533 0.433 0.300 0.366 0.266 0.133 0.333 0.400
MMMU_Math 0.300 0.400 0.533 0.566 0.266 0.566 0.600 0.466 0.566 0.466 0.433 0.566 0.333
MMMU_Mechanical_Engineering 0.166 0.166 0.300 0.333 0.233 0.733 0.466 0.433 0.466 0.300 0.266 0.366 0.500
MMMU_Music 0.166 0.333 0.200 0.333 0.200 0.233 0.400 0.400 0.433 0.200 0.533 0.400 0.133
MMMU_Pharmacy 0.333 0.366 0.600 0.633 0.300 0.766 0.566 0.566 0.700 0.366 0.433 0.566 0.666
MMMU_Physics 0.166 0.300 0.433 0.600 0.200 0.666 0.566 0.433 0.600 0.500 0.400 0.333 0.633
MMMU_Psychology 0.366 0.433 0.500 0.566 0.400 0.566 0.633 0.633 0.466 0.433 0.500 0.366 0.566
MMMU_Public_Health 0.433 0.700 0.700 0.800 0.400 0.866 0.866 0.766 0.800 0.666 0.333 0.733 0.866
MMMU_Sociology 0.366 0.600 0.600 0.700 0.466 0.666 0.633 0.533 0.733 0.466 0.566 0.400 0.633
MMMU 0.318 0.407 0.487 0.558 0.314 0.628 0.588 0.535 0.575 0.473 0.368 0.464 0.560
MMMUPRO_Accounting 0.224 0.310 0.534 0.603 0.120 0.741 0.689 0.551 0.586 0.362 0.293 0.396 0.603
MMMUPRO_Agriculture 0.200 0.200 0.350 0.450 0.100 0.383 0.300 0.283 0.266 0.150 0.166 0.233 0.250
MMMUPRO_Architecture_and_Engineering 0.100 0.133 0.216 0.333 0.100 0.433 0.233 0.316 0.366 0.250 0.200 0.266 0.366
MMMUPRO_Art_Theory 0.472 0.490 0.636 0.709 0.218 0.672 0.654 0.618 0.527 0.563 0.400 0.581 0.654
MMMUPRO_Art 0.396 0.452 0.547 0.622 0.301 0.603 0.603 0.471 0.528 0.547 0.207 0.415 0.509
MMMUPRO_Basic_Medical_Science 0.269 0.250 0.384 0.442 0.173 0.596 0.423 0.384 0.403 0.307 0.250 0.423 0.365
MMMUPRO_Biology 0.169 0.237 0.288 0.322 0.118 0.423 0.389 0.355 0.372 0.237 0.101 0.305 0.440
MMMUPRO_Chemistry 0.200 0.266 0.333 0.350 0.116 0.366 0.466 0.383 0.450 0.216 0.250 0.316 0.433
MMMUPRO_Clinical_Medicine 0.118 0.135 0.237 0.372 0.135 0.322 0.372 0.474 0.389 0.271 0.101 0.203 0.406
MMMUPRO_Computer_Science 0.283 0.350 0.383 0.300 0.200 0.483 0.483 0.333 0.383 0.300 0.150 0.366 0.400
MMMUPRO_Design 0.433 0.500 0.533 0.616 0.216 0.616 0.650 0.533 0.616 0.616 0.366 0.550 0.683
MMMUPRO_Diagnostics_and_Laboratory_Medicine 0.116 0.200 0.200 0.233 0.100 0.383 0.283 0.200 0.300 0.216 0.100 0.183 0.250
MMMUPRO_Economics 0.423 0.457 0.559 0.644 0.118 0.677 0.610 0.661 0.627 0.389 0.254 0.491 0.576
MMMUPRO_Electronics 0.233 0.316 0.350 0.316 0.033 0.616 0.600 0.600 0.600 0.400 0.266 0.433 0.466
MMMUPRO_Energy_and_Power 0.172 0.172 0.120 0.327 0.086 0.568 0.344 0.224 0.275 0.155 0.137 0.172 0.413
MMMUPRO_Finance 0.283 0.366 0.533 0.516 0.133 0.633 0.616 0.600 0.650 0.350 0.216 0.383 0.433
MMMUPRO_Geography 0.346 0.307 0.346 0.403 0.192 0.480 0.423 0.384 0.384 0.269 0.134 0.307 0.384
MMMUPRO_History 0.375 0.392 0.535 0.553 0.250 0.607 0.517 0.428 0.553 0.464 0.410 0.392 0.553
MMMUPRO_Literature 0.500 0.461 0.634 0.692 0.307 0.730 0.653 0.615 0.634 0.557 0.615 0.557 0.653
MMMUPRO_Manage 0.220 0.240 0.320 0.400 0.120 0.480 0.480 0.420 0.420 0.320 0.320 0.260 0.440
MMMUPRO_Marketing 0.288 0.305 0.440 0.440 0.169 0.627 0.559 0.525 0.593 0.338 0.271 0.508 0.559
MMMUPRO_Materials 0.083 0.133 0.150 0.250 0.116 0.316 0.266 0.166 0.300 0.166 0.150 0.166 0.250
MMMUPRO_Math 0.283 0.233 0.316 0.466 0.150 0.483 0.500 0.416 0.466 0.283 0.200 0.250 0.316
MMMUPRO_Mechanical_Engineering 0.152 0.186 0.271 0.305 0.118 0.474 0.338 0.440 0.372 0.271 0.135 0.338 0.406
MMMUPRO_Music 0.216 0.250 0.266 0.233 0.150 0.183 0.283 0.183 0.233 0.300 0.250 0.233 0.216
MMMUPRO_Pharmacy 0.298 0.298 0.456 0.491 0.210 0.596 0.508 0.508 0.684 0.385 0.333 0.333 0.491
MMMUPRO_Physics 0.166 0.116 0.416 0.400 0.083 0.533 0.433 0.466 0.466 0.300 0.200 0.266 0.433
MMMUPRO_Psychology 0.366 0.333 0.300 0.383 0.100 0.500 0.416 0.350 0.416 0.400 0.166 0.200 0.366
MMMUPRO_Public_Health 0.241 0.293 0.396 0.551 0.120 0.758 0.603 0.482 0.551 0.327 0.172 0.448 0.655
MMMUPRO_Sociology 0.333 0.462 0.574 0.629 0.166 0.592 0.500 0.574 0.518 0.407 0.425 0.314 0.592
MMMUPRO 0.263 0.293 0.384 0.442 0.149 0.527 0.471 0.430 0.463 0.335 0.238 0.341 0.450
DISCIPLINES
NLP - - - - - - - - - - - - -
MATH 0.305 0.338 0.400 0.450 0.200 0.505 0.511 0.394 0.450 0.366 0.211 0.372 0.388
SCIENCE 0.178 0.218 0.318 0.378 0.149 0.452 0.414 0.354 0.400 0.258 0.213 0.289 0.407
ENGINEERING 0.239 0.272 0.337 0.409 0.167 0.563 0.469 0.442 0.463 0.373 0.246 0.360 0.476
MEDICINE 0.222 0.309 0.408 0.467 0.215 0.580 0.502 0.481 0.529 0.371 0.243 0.389 0.511
HUMANITIES 0.392 0.441 0.517 0.571 0.277 0.577 0.552 0.508 0.524 0.470 0.387 0.419 0.530
BUSINESS 0.309 0.360 0.477 0.550 0.183 0.653 0.600 0.552 0.603 0.399 0.300 0.447 0.532
LAW - - - - - - - - - - - - -
VISION 0.498 0.500 0.628 0.714 0.471 0.780 0.767 0.790 0.773 0.640 0.634 0.661 0.728
COMPOSITE AVERAGE
AVG 0.471 0.479 0.602 0.685 0.437 0.752 0.735 0.749 0.739 0.608 0.590 0.627 0.698
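
For reference, vision tests send an image plus a question to the model in a single query. The sketch below is an assumption, not the benchlm client: it posts a base64 data-URI image to a llama.cpp-style OpenAI-compatible /v1/chat/completions endpoint; the URL, port, and payload shape are assumed, and the modified server used for these runs may expose a different interface.

```python
# Minimal sketch (assumed interface) of posting an image + question to an
# OpenAI-compatible chat completions endpoint such as llama.cpp's server.
import base64, json, urllib.request

def ask_about_image(image_path: str, question: str,
                    url: str = "http://localhost:8080/v1/chat/completions") -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
        "temperature": 0.0,
    }
    req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```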

AUDIO MODELS:

MODEL Qwen2.5-Omni-7B ultravox-v0_5-llama-3_1-8b ultravox-v0_5-deepseek-r1-llama-3_1-8b ultravox-v0_5-deepseek-r1-llama-3_1-8b ultravox-v0_6-gemma-3-27b ultravox-v0_6-qwen-3-32b Voxtral-Mini-3B-2507 Voxtral-Mini-3B-2507 Voxtral-Small-24B-2507
params 7.62B 8.03B 8.03B 8.03B 27.01B 32.8B 4.01B 4.01B 23.57B
quant Q6_K_H Q6_K_H Q6_K Q6_K_H Q4_K_H Q4_K_H Q6_K Q6_K_H Q4_K_H
engine llama.cpp version: 5780 llama.cpp version: 5780 llama.cpp version: 5869 llama.cpp version: 5890 llama.cpp version: 5853 llama.cpp version: 5853 llama.cpp version: 6014 llama.cpp version: 6014 llama.cpp version: 6014
TEST acc acc acc acc acc acc acc acc acc
BBA_formal_fallacies 0.472 0.552 0.768 0.848 0.640 0.996 0.544 0.528 0.576
BBA_navigate 0.756 0.776 0.988 0.984 0.716 0.976 0.664 0.656 0.680
BBA_object_counting 0.616 0.864 0.924 0.856 0.800 0.984 0.596 0.640 0.504
BBA_web_of_lies 0.540 0.844 0.932 0.920 0.464 0.784 0.576 0.576 0.660
BBA 0.596 0.759 0.903 0.902 0.655 0.935 0.595 0.600 0.605
DISCIPLINES
NLP 0.506 0.698 0.850 0.884 0.552 0.890 0.560 0.552 0.618
MATH 0.686 0.820 0.956 0.920 0.758 0.980 0.630 0.648 0.592
SCIENCE - - - - - - - - -
ENGINEERING - - - - - - - - -
MEDICINE - - - - - - - - -
HUMANITIES - - - - - - - - -
BUSINESS - - - - - - - - -
LAW - - - - - - - - -
AUDIO - - - - - - - - -
COMPOSITE AVERAGE
AVG 0.596 0.759 0.903 0.902 0.655 0.935 0.595 0.600 0.605