Independent LLM benchmarks for a wide range of models, run with custom prompts.
Results include category and discipline summaries.

Tests are run using a modified llama.cpp server (supporting logprob completion mode) or, where noted, a textsynth server.

METHODOLOGY:
   All CoT, code, and math tests are zero-shot.  A few BBH tests use few-shot examples.
   Math CoT tests such as GSM8K, APPLE, and MATH are self-graded against the correct answer by the LLM under test.
     If self-grading does not work reliably (as with very small models), the result is zeroed to mark the test invalid.
   All MC (multiple-choice) tests issue two queries: the first with the answers in test order, the second with the answers circularly shifted by 1.
     A question is scored correct only if both queries are answered correctly.
   Winogrande is scored using logprob completion (evaluating the probability of the common completion for each of the two possible cases).
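
The MC double-query rule and the Winogrande logprob comparison described above can be sketched roughly as follows. This is a minimal illustration, not the actual test harness: the function names, the `ask` and `completion_logprob` callables, and the direction of the circular shift are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def circular_shift(options: Sequence[str], shift: int = 1) -> List[str]:
    """Rotate options so the entry at index i moves to index (i + shift) % n.

    The rotation direction is an assumption; the source only says the
    answers are circularly shifted by 1.
    """
    n = len(options)
    return [options[(i - shift) % n] for i in range(n)]


def mc_correct(
    ask: Callable[[str, List[str]], int],  # hypothetical: returns chosen index
    question: str,
    options: Sequence[str],
    answer_index: int,
    shift: int = 1,
) -> bool:
    """Score one MC question: both the original and shifted query must be right."""
    n = len(options)
    first_ok = ask(question, list(options)) == answer_index
    shifted_ok = (
        ask(question, circular_shift(options, shift)) == (answer_index + shift) % n
    )
    return first_ok and shifted_ok


def winogrande_pick(
    sentence: str,
    option1: str,
    option2: str,
    completion_logprob: Callable[[str, str], float],  # hypothetical scorer
) -> int:
    """Return 1 or 2, whichever option makes the shared ending more probable.

    The sentence is split at the blank ("_"); the text after the blank is the
    common completion whose log-probability is compared for the two cases.
    """
    prefix, suffix = sentence.split("_", 1)
    lp1 = completion_logprob(prefix + option1, suffix)
    lp2 = completion_logprob(prefix + option2, suffix)
    return 1 if lp1 >= lp2 else 2
```

Requiring both queries to succeed penalizes a model that always favors one answer position: such a model can pass the first query by luck but will fail the shifted one.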

TESTS:
   KNOWLEDGE:
      TQA - Truthful QA
      JEOPARDY - 100 Question JEOPARDY quiz
   LANGUAGE:
      LAMBADA - Language Modeling Broadened to Account for Discourse Aspects
   UNDERSTANDING:
      WG - Winogrande
      BOOLQ - Boolean questions
      STORYCLOZE - Story questions
      OBQA - Open Book Question / Answer
      SIQA - Social IQ
      RACE - Reading comprehension dataset from examinations
      MMLU - Massive Multitask Language Understanding
      MEDQA - Medical QA
   REASONING:
      CSQA - Common Sense Question Answer
      COPA - Choice of Plausible Alternatives
      HELLASWAG - Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations
      PIQA - Physical Interaction: Question Answering
      ARC - AI2 Reasoning Challenge
      AGIEVAL - AGIEval logiqa, lsat, sat
      AGIEVALC  - Gaokao SAT, logiqa, jec (Chinese)
      MUSR - Multistep Soft Reasoning
   COT:
      GSM8K - Grade School Math CoT
      BBH  - BIG-Bench Hard (Beyond the Imitation Game Benchmark) CoT
      MMLUPRO - Massive Multitask Language Understanding Pro CoT
      AGIEVAL - satmath, aquarat
      AGIEVALC  - mathcloze, mathqa (Chinese)
      MUSR - Multistep Soft Reasoning
      APPLE - 100 custom Apple Questions
   MATH:
      MATH1..MATH5 - MATH Datasets level 1 through 5 (Hendrycks et al.)
   CODE:
      HUMANEVAL - Python
      HUMANEVALP - Python, extended test
      HUMANEVALX - Python, Java, Javascript, C++
      MBPP - Python
      MBPPP - Python, extended test
      CRUXEVAL - Python
      USE {TEST}FIM FOR THE FIM (FILL-IN-THE-MIDDLE) VARIANT OF A TEST, e.g. HUMANEVAL->HUMANEVALFIM

GENERAL MODELS:

TEST EXAONE-3.5-2.4B-Instruct EXAONE-3.5-7.8B-Instruct Falcon3-1B-Instruct Falcon3-7B-Instruct Falcon3-10B-Instruct gemma-2-2b-it gemma-2-9b-it gemma-2-9b-it gemma-2-9b-it gemma-2-27b-it glm-4-9b-chat glm-4-9b-chat granite-3.0-2b-instruct granite-3.0-8b-instruct granite-3.1-1b-a400m-instruct granite-3.1-2b-instruct granite-3.1-8b-instruct internlm2_5-7b-chat Meta-Llama-3-8B-Instruct Llama-3.1-8B-Instruct Llama-3.1-8B-Instruct Llama-3.1-8B-Instruct Llama-3.1-8B-Instruct Llama-3.2-1B-Instruct Llama-3.2-3B-Instruct Marco-o1 Ministral-8B-Instruct-2410 Mistral-7B-Instruct-v0.3 Mistral-Nemo-12B-Instruct-2407 openchat-3.5-0106 openchat-3.6-8b-20240522 Phi-3-mini-4k-instruct Phi-3-mini-128k-instruct Phi-3-mini-128k-instruct Phi-3.5-mini-8k-instruct Phi-3.5-mini-128k-instruct Phi-3-medium-128k-instruct Phi-4 Qwen2-7B-Instruct Qwen2-7B-Instruct Qwen2.5-3B-32k-Instruct Qwen2.5-3B-32k-Instruct Qwen2.5-7B-32k-Instruct Qwen2.5-7B-32k-Instruct Qwen2.5-14B-32k-Instruct Qwen2.5-32B-Instruct QwQ-32B-Preview SOLAR-10.7B-Instruct-v1.0 solar-pro-preview-instruct
params 2.67B 7.82B 1.67B 7.46B 10.31B 2.61B 9.24B 9.24B 9.24B 27.23B 9.40B 9.40B 2.63B 8.17B 1.33B 2.53B 8.17B 7.74B 8.03B 8.03B 8.03B 8.03B 8.03B 1.24B 3.21B 7.62B 8.02B 7.25B 12.25B 7.24B 8.03B 3.82B 3.82B 3.82B 3.82B 3.82B 13.96B 14.66B 7.62B 7.62B 3.09B 3.09B 7.62B 7.62B 14.77B 32.76B 32.76B 10.73B 22.14B
quant IQ4_XS Q6_K IQ4_XS Q6_K IQ4_XS Q8_0 IQ4_XS Q4_K_M Q6_K IQ4_XS IQ4_XS Q6_K Q6_K Q6_K Q6_K Q6_K IQ4_XS IQ4_XS Q6_K Q4 IQ4_XS Q4_K_M Q6_K IQ4_XS Q6_K IQ4_XS Q6_K Q8_0 IQ4_XS Q8_0 Q8_0 Q8_0 IQ4_XS Q6_K Q6_K Q6_K IQ4_XS IQ4_XS Q6_K Q4 IQ4_XS Q6_K IQ4_XS Q6_K IQ4_XS IQ4_XS IQ4_XS Q4_K_M IQ4_XS
engine llama.cpp version: 4384 llama.cpp version: 4291 llama.cpp version: 4341 llama.cpp version: 4341 llama.cpp version: 4341 llama.cpp version: 3496 llama.cpp version: 3334 llama.cpp version: 3325 llama.cpp version: 3266 llama.cpp version: 3389 llama.cpp version: 3496 llama.cpp version: 3334 llama.cpp version: 3985 llama.cpp version: 3985 llama.cpp version: 4384 llama.cpp version: 4341 llama.cpp version: 4363 llama.cpp version: 3496 llama.cpp version: 3266 textsynth ts_server version 2024-09-30 llama.cpp version: 3707 llama.cpp version: 3731 llama.cpp version: 3428 llama.cpp version: 4341 llama.cpp version: 3825 llama.cpp version: 4240 llama.cpp version: 3927 llama.cpp version: 3262 llama.cpp version: 3428 llama.cpp version: 3262 llama.cpp version: 3262 llama.cpp version: 3520 llama.cpp version: 3565 llama.cpp version: 3520 llama.cpp version: 3609 llama.cpp version: 3600 llama.cpp version: 3505 llama.cpp version: 4295 llama.cpp version: 3609 textsynth ts_server version 2024-09-30 llama.cpp version: 4038 llama.cpp version: 4038 llama.cpp version: 3943 llama.cpp version: 3870 llama.cpp version: 3821 llama.cpp version: 3821 llama.cpp version: 4273 llama.cpp version: 3235 llama.cpp version: 3790
--------------------------------------------- -------------------------- -------------------------- --------------------- --------------------- ---------------------- --------------- --------------- --------------- --------------- ---------------- --------------- --------------- ------------------------- ------------------------- ------------------------------- ------------------------- ------------------------- --------------------- -------------------------- ----------------------- ----------------------- ----------------------- ----------------------- ----------------------- ----------------------- ---------- ---------------------------- -------------------------- -------------------------------- ------------------- -------------------------- ------------------------ -------------------------- -------------------------- -------------------------- ---------------------------- ---------------------------- ------- ------------------- ------------------- ------------------------- ------------------------- ------------------------- ------------------------- -------------------------- ---------------------- ----------------- --------------------------- ----------------------------
WG 0.636 0.696 0.600 0.670 0.700 0.701 0.756 0.761 0.762 0.772 0.759 0.753 0.679 0.719 0.609 0.700 0.752 0.822 0.707 0.740 0.750 0.745 0.741 0.612 0.685 0.695 0.748 0.751 0.770 0.783 0.760 0.737 0.728 0.727 0.744 0.734 0.744 0.708 0.705 0.681 0.687 0.695 0.709 0.709 0.754 0.746 0.750 0.759 0.779
LAMBADA 0.613 0.680 0.524 0.688 0.692 0.624 0.733 0.732 0.735 0.755 0.786 0.783 0.746 0.799 0.665 0.732 0.790 0.732 0.710 0.729 0.740 0.738 0.747 0.610 0.705 0.715 0.776 0.766 0.714 0.744 0.733 0.613 0.638 0.618 0.677 0.613 0.632 0.750 0.735 0.564 0.685 0.682 0.722 0.724 0.769 0.781 0.780 0.654 0.708
HELLASWAG 0.646 0.788 0.308 0.684 0.716 0.496 0.766 0.798 0.775 0.810 0.834 0.840 0.583 0.696 0.053 0.517 0.693 0.916 0.667 0.739 0.684 0.696 0.696 0.293 0.559 0.809 0.835 0.591 0.726 0.800 0.824 0.738 0.695 0.743 0.716 0.669 0.807 0.801 0.755 0.765 0.670 0.713 0.820 0.822 0.863 0.894 0.875 0.826 0.823
BOOLQ 0.533 0.582 0.364 0.591 0.621 0.542 0.684 0.701 0.687 0.739 0.633 0.625 0.540 0.594 0.558 0.204 0.576 0.572 0.609 0.621 0.587 0.576 0.610 0.023 0.478 0.624 0.574 0.658 0.677 0.719 0.677 0.577 0.562 0.540 0.562 0.573 0.635 0.653 0.633 0.575 0.517 0.533 0.617 0.623 0.647 0.701 0.629 0.651 0.665
STORYCLOZE 0.922 0.950 0.774 0.949 0.947 0.877 0.948 0.959 0.958 0.973 0.967 0.976 0.925 0.924 0.440 0.864 0.955 0.996 0.930 0.685 0.884 0.859 0.895 0.421 0.870 0.895 0.937 0.917 0.921 0.982 0.980 0.907 0.906 0.891 0.531 0.921 0.969 0.754 0.959 0.964 0.913 0.896 0.920 0.915 0.938 0.981 0.964 0.973 0.949
CSQA 0.647 0.740 0.488 0.725 0.746 0.631 0.741 0.751 0.751 0.763 0.727 0.733 0.615 0.692 0.098 0.587 0.686 0.760 0.639 0.732 0.698 0.678 0.686 0.404 0.642 0.760 0.669 0.627 0.664 0.796 0.875 0.679 0.656 0.675 0.669 0.660 0.751 0.740 0.728 0.726 0.701 0.717 0.768 0.781 0.795 0.823 0.796 0.714 0.726
OBQA 0.660 0.773 0.380 0.761 0.745 0.647 0.841 0.846 0.846 0.860 0.821 0.802 0.605 0.714 0.119 0.575 0.719 0.818 0.685 0.787 0.738 0.753 0.765 0.394 0.709 0.800 0.769 0.676 0.730 0.834 0.803 0.773 0.750 0.761 0.751 0.720 0.830 0.857 0.800 0.771 0.700 0.731 0.802 0.804 0.863 0.904 0.882 0.794 0.812
COPA 0.842 0.907 0.612 0.870 0.903 0.823 0.923 0.926 0.925 0.949 0.955 0.944 0.806 0.844 0.222 0.809 0.884 0.927 0.886 0.908 0.878 0.859 0.889 0.557 0.749 0.922 0.887 0.812 0.859 0.967 0.947 0.890 0.898 0.884 0.884 0.870 0.924 0.934 0.923 0.912 0.841 0.858 0.925 0.919 0.935 0.958 0.936 0.910 0.940
PIQA 0.624 0.722 0.233 0.696 0.732 0.591 0.799 0.803 0.801 0.841 0.773 0.779 0.497 0.712 0.171 0.593 0.720 0.787 0.681 0.819 0.707 0.716 0.725 0.342 0.637 0.769 0.694 0.708 0.777 0.794 0.771 0.761 0.745 0.741 0.733 0.677 0.827 0.832 0.784 0.778 0.695 0.713 0.794 0.807 0.848 0.870 0.829 0.799 0.803
SIQA 0.667 0.697 0.425 0.658 0.688 0.597 0.691 0.694 0.693 0.731 0.664 0.665 0.627 0.678 0.139 0.592 0.684 0.735 0.624 0.712 0.634 0.643 0.648 0.374 0.622 0.713 0.684 0.620 0.655 0.730 0.726 0.675 0.662 0.675 0.667 0.661 0.729 0.639 0.699 0.696 0.656 0.663 0.721 0.712 0.746 0.742 0.714 0.727 0.716
MEDQA 0.262 0.374 0.141 0.420 0.430 0.286 0.492 0.498 0.501 0.549 0.436 0.445 0.242 0.336 0.032 0.223 0.359 0.389 0.486 0.556 0.491 0.482 0.500 0.150 0.413 0.422 0.399 0.334 0.465 0.372 0.427 0.457 0.421 0.446 0.423 0.395 0.553 0.560 0.391 0.380 0.344 0.363 0.453 0.458 0.542 0.610 0.598 0.368 0.538
JEOPARDY 0.170 0.420 0.010 0.400 0.310 0.220 0.510 0.570 0.550 0.740 0.370 0.420 0.230 0.470 0.960 0 0.400 0.300 0.370 0.450 0.500 0.390 0.510 0 0.350 0.250 0.330 0.490 0.470 0.450 0.510 0.300 0.220 0.250 0.320 0.250 0.460 0.550 0.200 0.250 0.120 0.120 0.300 0.290 0.540 0.600 0.600 0.480 0.480
GSM8K 0.814 0.902 0.485 0.890 0.918 0.645 0.878 0.881 0.890 0.899 0.855 0.839 0.712 0.811 0.416 0.741 0.843 0.855 0.817 0.851 0.859 0.859 0.872 0.490 0.822 0.912 0.887 0.611 0.828 0.775 0.811 0.870 0.706 0.833 0.855 0.714 0.667 0.946 0.871 0.535 0.829 0.856 0.909 0.917 0.938 0.950 0.962 0.730 0.812
APPLE 0.540 0.720 0.150 0.810 0.740 0.350 0.700 0.690 0.750 0.730 0.630 0.610 0.370 0.590 0.150 0.410 0.560 0.650 0.580 - 0.600 0.690 0.690 0.230 0.610 0.710 0.760 0.390 0.690 0.500 0.520 0.610 0.480 0.540 0.560 0.560 0.650 0.910 0.630 - 0.640 0.560 0.740 0.750 0.830 0.860 0.870 0.510 0.650
HUMANEVAL 0.725 0.804 0.115 0.737 0.774 0.408 0.646 0.621 0.658 0.743 0.737 0.731 0.469 0.621 0.359 0.536 0.689 0.317 0.591 0.652 0.634 0.628 0.652 0.298 0.585 0.810 0.768 0.390 0.689 0.310 0.689 0.707 0.628 0.652 0.682 0.621 0.268 0.847 0.719 0.567 0.695 0.780 0.798 0.817 0.804 0.884 0.414 0.402 0.506
HUMANEVALP 0.646 0.701 0.073 0.628 0.664 0.310 0.548 0.530 0.548 0.615 0.615 0.634 0.402 0.536 0.286 0.469 0.591 0.250 0.335 0.524 0.518 0.524 0.536 0.225 0.475 0.701 0.628 0.329 0.554 0.304 0.579 0.603 0.524 0.567 0.591 0.524 0.219 0.725 0.609 0.475 0.615 0.682 0.670 0.658 0.676 0.768 0.359 0.286 0.432
MBPP 0.548 0.618 0.334 0.669 0.653 0.459 0.591 0.595 0.595 0.642 0.579 0.591 0.470 0.548 0.424 0.513 0.591 0.521 0.050 0.544 0.575 0.583 0.564 0.381 0.498 0.642 0.626 0.451 0.513 0.470 0.326 0.536 0.482 0.451 0.610 0.498 0.412 0.673 0.587 0.501 0.595 0.599 0.669 0.661 0.669 0.684 0.404 0.373 0.330
MBPPP 0.508 0.575 0.312 0.625 0.611 0.441 0.575 0.575 0.584 0.638 0.562 0.575 0.433 0.517 0.392 0.504 0.522 0.477 0.049 0.526 0.526 0.535 0.540 0.397 0.482 0.611 0.580 0.397 0.428 0.410 0.151 0.482 0.459 0.450 0.575 0.477 0.401 0.651 0.580 0.495 0.540 0.584 0.633 0.651 0.633 0.700 0.392 0.366 0.321
HUMANEVALFIM - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0.512 - - - - - - - - - - - - - - - - - - - -
HUMANEVALX_cpp 0.420 0.597 0.054 0.506 0.603 0.152 0.500 0.500 0.512 0.579 0.439 0.432 0.231 0.347 0.213 0.268 0.408 0.231 0.317 0.256 0.420 0.445 0.457 0.158 0.323 0.310 0.573 0.225 0.292 0.347 0.060 0.262 0.243 0.243 0.280 0.219 0 0.676 0.542 0.384 0.420 0.237 0.475 0.554 0.323 0.701 0.378 0.237 0.219
HUMANEVALX_java 0.536 0.689 0.042 0.640 0.719 0.390 0.628 0.640 0.640 0.768 0.207 0.628 0.353 0.365 0.231 0.390 0.518 0.170 0.079 0.036 0.396 0.097 0.487 0 0.439 0.634 0.731 0.256 0.201 0.170 0.073 0.493 0.030 0.024 0.079 0.060 0 0.634 0.628 0.365 0.640 0.615 0.695 0.737 0.780 0.865 0.097 0.347 0
HUMANEVALX_js 0.573 0.731 0.115 0.676 0.652 0.353 0.548 0.567 0.579 0.743 0.628 0.628 0.426 0.560 0.213 0.451 0.573 0.518 0.079 0.548 0.542 0.548 0.560 0.243 0.067 0.750 0.701 0.402 0.615 0.170 0.018 0.567 0.378 0.524 0.560 0.451 0.219 0.786 0.731 0.560 0.646 0.689 0.719 0.750 0.798 0.847 0.493 0.048 0.359
HUMANEVALX 0.510 0.672 0.071 0.607 0.658 0.298 0.558 0.569 0.577 0.697 0.424 0.563 0.337 0.424 0.219 0.369 0.500 0.306 0.158 0.280 0.453 0.363 0.502 0.134 0.276 0.565 0.668 0.294 0.369 0.229 0.050 0.441 0.217 0.264 0.306 0.243 0.073 0.699 0.634 0.436 0.569 0.514 0.630 0.680 0.634 0.804 0.323 0.211 0.193
CRUXEVAL_input 0.333 0.377 0.210 0.411 0.448 0.321 0.455 0.443 0.462 0.485 0.416 0.406 0.288 0.317 0.077 0.298 0.448 0.367 0.340 0.405 0.408 0.440 0.435 0.162 0.353 0.367 0.428 0.276 0.442 0.323 0.383 0.372 0.375 0.390 0.398 0.388 0.456 0.447 0.351 0.298 0.350 0.331 0.387 0.412 0.541 0.517 0.200 0.131 0.438
CRUXEVAL_output 0.262 0.348 0.152 0.355 0.410 0.280 0.373 0.372 0.375 0.482 0.356 0.338 0.282 0.351 0.171 0.253 0.336 0.276 0.318 0.352 0.341 0.356 0.360 0.201 0.291 0.403 0.377 0.303 0.365 0.318 0.323 0.340 0.321 0.340 0.342 0.296 0.423 0.463 0.050 0.175 0.275 0.311 0.382 0.386 0.471 0.455 0.368 0.222 0.388
CRUXEVAL 0.298 0.363 0.181 0.383 0.429 0.300 0.414 0.408 0.418 0.483 0.386 0.372 0.285 0.334 0.124 0.276 0.392 0.321 0.329 0.378 0.375 0.398 0.397 0.181 0.322 0.385 0.403 0.290 0.403 0.321 0.353 0.356 0.348 0.365 0.370 0.342 0.440 0.455 0.200 0.236 0.312 0.321 0.385 0.399 0.506 0.486 0.284 0.176 0.413
CRUXEVALFIM_input - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0.195 - - - - - - - - - - - - - - - - - - - -
CRUXEVALFIM_output - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0.026 - - - - - - - - - - - - - - - - - - - -
CRUXEVALFIM - - - - - - - - - - - - - - - - - - - - - - - - - - - - 0.110 - - - - - - - - - - - - - - - - - - - -
TQA_mc 0.537 0.674 0.146 0.523 0.510 0.356 0.679 0.696 0.701 0.767 0.636 0.640 0.402 0.559 0.072 0.376 0.576 0.648 0.507 0.629 0.626 0.630 0.564 0.274 0.555 0.583 0.561 0.549 0.542 0.624 0.547 0.653 0.631 0.643 0.621 0.581 0.742 0.725 0.503 0.532 0.516 0.548 0.654 0.657 0.747 0.804 0.795 0.563 0.709
TQA_tf 0.510 0.647 0.381 0.410 0.431 0.510 0.675 0.719 0.692 0.725 0.484 0.457 0.421 0.473 0.250 0.510 0.343 0.593 0.504 0.643 0.552 0.563 0.512 0.063 0.566 0.569 0.572 0.548 0.479 0.548 0.536 0.578 0.532 0.541 0.483 0.487 0.670 0.686 0.435 0.602 0.414 0.300 0.574 0.568 0.706 0.731 0.523 0.487 0.485
TQA 0.514 0.650 0.354 0.423 0.440 0.492 0.675 0.716 0.693 0.730 0.502 0.478 0.419 0.483 0.229 0.495 0.370 0.599 0.505 0.641 0.560 0.571 0.518 0.088 0.565 0.571 0.571 0.548 0.486 0.556 0.537 0.587 0.544 0.553 0.499 0.498 0.679 0.691 0.442 0.594 0.426 0.329 0.583 0.578 0.711 0.740 0.554 0.496 0.511
ARC_challenge 0.709 0.825 0.374 0.809 0.819 0.671 0.874 0.881 0.882 0.897 0.835 0.853 0.614 0.742 0.063 0.599 0.744 0.812 0.732 0.794 0.769 0.766 0.776 0.342 0.706 0.825 0.775 0.688 0.773 0.797 0.819 0.838 0.808 0.833 0.813 0.802 0.888 0.911 0.818 0.796 0.750 0.777 0.843 0.851 0.911 0.934 0.917 0.755 0.871
ARC_easy 0.865 0.937 0.598 0.925 0.933 0.846 0.949 0.950 0.952 0.963 0.933 0.940 0.808 0.893 0.105 0.783 0.887 0.936 0.883 0.914 0.898 0.904 0.906 0.576 0.843 0.937 0.910 0.843 0.908 0.910 0.914 0.943 0.926 0.935 0.934 0.932 0.965 0.970 0.923 0.918 0.895 0.904 0.945 0.946 0.969 0.978 0.975 0.899 0.962
ARC 0.813 0.900 0.524 0.886 0.895 0.788 0.925 0.927 0.929 0.941 0.901 0.911 0.744 0.843 0.091 0.722 0.839 0.895 0.833 0.874 0.855 0.858 0.863 0.498 0.798 0.900 0.866 0.792 0.864 0.873 0.883 0.908 0.887 0.901 0.894 0.889 0.939 0.950 0.888 0.878 0.847 0.862 0.911 0.915 0.950 0.963 0.956 0.851 0.932
RACE_high 0.626 0.753 0.431 0.698 0.730 0.580 0.817 0.817 0.802 0.833 0.788 0.787 0.553 0.642 0.066 0.469 0.631 0.826 0.641 0.756 0.679 0.676 0.679 0.377 0.589 0.771 0.736 0.607 0.726 0.771 0.773 0.696 0.625 0.648 0.613 0.625 0.779 0.819 0.779 0.762 0.698 0.712 0.779 0.788 0.852 0.882 0.871 0.741 0.764
RACE_middle 0.704 0.806 0.463 0.777 0.793 0.610 0.860 0.860 0.849 0.883 0.816 0.825 0.631 0.735 0.089 0.579 0.706 0.863 0.708 0.817 0.747 0.744 0.734 0.396 0.680 0.825 0.800 0.696 0.782 0.807 0.834 0.750 0.697 0.722 0.706 0.692 0.832 0.861 0.827 0.811 0.775 0.776 0.841 0.853 0.887 0.923 - 0.809 0.824
RACE 0.648 0.769 0.440 0.721 0.748 0.589 0.830 0.829 0.816 0.847 0.796 0.798 0.576 0.669 0.072 0.501 0.653 0.837 0.660 0.774 0.699 0.696 0.695 0.382 0.615 0.786 0.755 0.633 0.743 0.781 0.791 0.712 0.646 0.670 0.640 0.645 0.795 0.831 0.793 0.776 0.720 0.730 0.797 0.807 0.862 0.894 0.871 0.761 0.781
MMLU
abstract_algebra 0.220 0.350 0.180 0.410 0.450 0.140 0.320 0.320 0.330 0.310 0.220 0.210 0.170 0.200 0.020 0.140 0.270 0.230 0.140 0.390 0.210 0.140 0.200 0.140 0.270 0.370 0.210 0.190 0.330 0.220 0.170 0.340 0.330 0.250 0.300 0.210 0.390 0.410 0.480 0.370 0.240 0.250 0.440 0.430 0.570 0.600 - 0.140 0.340
anatomy 0.370 0.540 0.318 0.577 0.592 0.414 0.604 0.611 0.626 0.607 0.503 0.511 0.362 0.474 0.066 0.407 0.511 0.614 0.544 0.688 0.540 0.511 0.555 0.348 0.540 0.585 0.540 0.447 0.555 0.552 0.537 0.562 0.540 0.577 0.570 0.585 0.666 0.703 0.488 0.488 0.525 0.562 0.622 0.622 0.644 0.733 - 0.477 0.607
astronomy 0.513 0.723 0.263 0.736 0.756 0.467 0.753 0.740 0.760 0.828 0.644 0.651 0.519 0.651 0.059 0.493 0.631 0.723 0.640 0.723 0.651 0.657 0.677 0.421 0.565 0.756 0.671 0.573 0.651 0.620 0.646 0.710 0.684 0.677 0.703 0.703 0.796 0.776 0.723 0.651 0.618 0.657 0.763 0.769 0.868 0.875 - 0.586 0.756
business_ethics 0.550 0.600 0.260 0.570 0.560 0.430 0.620 0.630 0.620 0.670 0.570 0.610 0.460 0.490 0.020 0.400 0.530 0.640 0.510 0.640 0.540 0.520 0.550 0.280 0.480 0.640 0.540 0.520 0.630 0.540 0.530 0.620 0.570 0.620 0.620 0.620 0.710 0.740 0.670 0.620 0.630 0.590 0.680 0.710 0.750 0.800 - 0.570 0.740
clinical_knowledge 0.528 0.649 0.373 0.652 0.683 0.550 0.724 0.743 0.743 0.788 0.618 0.622 0.490 0.584 0.105 0.475 0.656 0.716 0.690 0.750 0.656 0.698 0.675 0.298 0.592 0.701 0.664 0.581 0.664 0.637 0.649 0.716 0.709 0.686 0.713 0.698 0.750 0.781 0.686 0.671 0.633 0.645 0.709 0.713 0.803 0.815 - 0.577 0.735
college_biology 0.590 0.715 0.340 0.763 0.777 0.625 0.847 0.833 0.854 0.895 0.687 0.715 0.486 0.659 0.013 0.465 0.673 0.701 0.652 0.777 0.722 0.687 0.722 0.250 0.625 0.736 0.708 0.625 0.694 0.631 0.659 0.812 0.756 0.791 0.805 0.763 0.819 0.868 0.743 0.701 0.694 0.694 0.784 0.784 0.854 0.923 - 0.618 0.833
college_chemistry 0.320 0.390 0.180 0.470 0.430 0.330 0.440 0.450 0.470 0.430 0.380 0.380 0.340 0.350 0.020 0.310 0.380 0.410 0.380 0.490 0.390 0.390 0.400 0.160 0.310 0.400 0.370 0.350 0.340 0.380 0.400 0.440 0.450 0.450 0.460 0.430 0.440 0.520 0.380 0.340 0.310 0.370 0.480 0.490 0.460 0.530 - 0.330 0.450
college_computer_science 0.360 0.520 0.110 0.540 0.590 0.290 0.480 0.460 0.460 0.580 0.470 0.480 0.250 0.400 0.060 0.260 0.400 0.490 0.340 0.500 0.380 0.340 0.400 0.200 0.350 0.550 0.410 0.320 0.400 0.440 0.400 0.480 0.470 0.440 0.480 0.410 0.510 0.600 0.520 0.460 0.390 0.460 0.620 0.590 0.630 0.720 - 0.370 0.480
college_mathematics 0.170 0.240 0.090 0.320 0.320 0.100 0.290 0.270 0.260 0.300 0.240 0.280 0.120 0.180 0.030 0.200 0.200 0.230 0.200 0.330 0.220 0.200 0.260 0.150 0.210 0.310 0.180 0.180 0.180 0.200 0.200 0.300 0.270 0.200 0.270 0.170 0.340 0.340 0.260 0.280 0.200 0.180 0.380 0.350 0.490 0.540 - 0.170 0.310
college_medicine 0.497 0.618 0.283 0.566 0.612 0.491 0.635 0.641 0.658 0.716 0.572 0.589 0.439 0.520 0.034 0.473 0.514 0.606 0.543 0.664 0.612 0.624 0.589 0.254 0.491 0.583 0.543 0.456 0.572 0.566 0.543 0.618 0.572 0.572 0.612 0.566 0.682 0.728 0.589 0.549 0.560 0.606 0.606 0.624 0.710 0.739 - 0.485 0.653
college_physics 0.284 0.343 0.186 0.372 0.411 0.235 0.401 0.382 0.352 0.421 0.313 0.323 0.215 0.225 0.088 0.215 0.274 0.362 0.313 0.450 0.313 0.294 0.313 0.205 0.303 0.333 0.264 0.254 0.245 0.205 0.303 0.362 0.392 0.352 0.333 0.294 0.372 0.529 0.343 0.333 0.382 0.392 0.401 0.372 0.519 0.656 - 0.245 0.382
computer_security 0.590 0.590 0.370 0.710 0.690 0.580 0.710 0.740 0.730 0.710 0.710 0.730 0.630 0.690 0.070 0.560 0.670 0.690 0.680 0.770 0.700 0.690 0.690 0.360 0.620 0.740 0.640 0.600 0.680 0.670 0.660 0.700 0.690 0.680 0.700 0.650 0.700 0.730 0.650 0.670 0.650 0.690 0.720 0.710 0.730 0.800 - 0.610 0.700
conceptual_physics 0.374 0.570 0.234 0.680 0.680 0.395 0.608 0.629 0.638 0.727 0.561 0.587 0.353 0.463 0.029 0.314 0.442 0.612 0.404 0.570 0.455 0.476 0.463 0.208 0.361 0.587 0.404 0.365 0.446 0.468 0.472 0.604 0.519 0.565 0.565 0.553 0.685 0.748 0.561 0.557 0.485 0.519 0.642 0.642 0.800 0.834 - 0.442 0.693
econometrics 0.289 0.403 0.122 0.649 0.587 0.271 0.566 0.566 0.557 0.587 0.456 0.464 0.228 0.377 0.035 0.219 0.412 0.482 0.371 0.552 0.447 0.438 0.482 0.114 0.359 0.535 0.333 0.318 0.429 0.362 0.424 0.535 0.473 0.456 0.456 0.421 0.543 0.596 0.535 0.508 0.421 0.438 0.605 0.596 0.649 0.675 - 0.345 0.526
electrical_engineering 0.406 0.489 0.220 0.641 0.648 0.462 0.558 0.558 0.558 0.593 0.544 0.572 0.358 0.427 0.048 0.337 0.420 0.544 0.468 0.641 0.558 0.496 0.524 0.248 0.462 0.586 0.455 0.393 0.482 0.468 0.510 0.510 0.544 0.468 0.496 0.475 0.565 0.634 0.606 0.524 0.441 0.434 0.606 0.606 0.648 0.703 - 0.324 0.586
elementary_mathematics 0.312 0.462 0.113 0.505 0.497 0.261 0.484 0.470 0.476 0.476 0.367 0.373 0.171 0.277 0.044 0.211 0.293 0.529 0.296 0.481 0.333 0.312 0.357 0.134 0.280 0.526 0.309 0.222 0.312 0.283 0.304 0.428 0.412 0.373 0.423 0.388 0.537 0.544 0.481 0.497 0.407 0.417 0.560 0.568 0.791 0.838 - 0.304 0.455
formal_logic 0.341 0.468 0.182 0.444 0.484 0.214 0.412 0.420 0.293 0.468 0.325 0.357 0.230 0.333 0.039 0.261 0.380 0.396 0.261 0.460 0.373 0.357 0.420 0.174 0.253 0.404 0.349 0.277 0.396 0.261 0.261 0.444 0.420 0.412 0.452 0.380 0.523 0.531 0.420 0.404 0.325 0.341 0.452 0.428 0.539 0.626 - 0.190 0.484
global_facts 0.120 0.220 0.120 0.190 0.290 0.100 0.330 0.320 0.330 0.370 0.200 0.240 0.120 0.230 0.010 0.140 0.200 0.390 0.140 0.360 0.160 0.150 0.150 0.090 0.110 0.240 0.200 0.160 0.280 0.160 0.210 0.210 0.190 0.220 0.240 0.130 0.360 0.320 0.300 0.180 0.140 0.200 0.260 0.260 0.470 0.430 - 0.220 0.240
high_school_biology 0.664 0.783 0.348 0.764 0.774 0.651 0.845 0.845 0.851 0.890 0.800 0.809 0.577 0.696 0.061 0.525 0.693 0.790 0.680 0.812 0.732 0.738 0.729 0.358 0.677 0.780 0.741 0.654 0.748 0.677 0.706 0.809 0.770 0.774 0.793 0.774 0.861 0.887 0.761 0.745 0.722 0.754 0.803 0.806 0.845 0.896 - 0.670 0.858
high_school_chemistry 0.374 0.487 0.216 0.522 0.507 0.315 0.571 0.561 0.586 0.600 0.546 0.517 0.295 0.413 0.029 0.320 0.384 0.522 0.438 0.635 0.453 0.458 0.467 0.211 0.433 0.517 0.389 0.310 0.379 0.384 0.428 0.522 0.536 0.482 0.512 0.492 0.551 0.655 0.467 0.448 0.413 0.463 0.532 0.536 0.596 0.724 - 0.315 0.581
high_school_computer_science 0.540 0.700 0.250 0.740 0.740 0.440 0.690 0.720 0.710 0.770 0.660 0.660 0.500 0.630 0.040 0.390 0.580 0.700 0.620 0.670 0.610 0.590 0.610 0.230 0.540 0.770 0.610 0.490 0.630 0.580 0.560 0.680 0.620 0.610 0.610 0.580 0.690 0.870 0.710 0.610 0.600 0.660 0.770 0.770 0.830 0.870 - 0.560 0.720
high_school_european_history 0.624 0.733 0.490 0.757 0.745 0.672 0.800 0.800 0.806 0.830 0.812 0.830 0.624 0.745 0.121 0.557 0.684 0.751 0.703 0.642 0.690 0.696 0.709 0.369 0.672 0.775 0.696 0.678 0.709 0.733 0.733 0.751 0.715 0.690 0.727 0.672 0.806 0.812 0.751 0.715 0.733 0.733 0.787 0.800 0.824 0.818 - 0.745 0.787
high_school_geography 0.595 0.777 0.393 0.717 0.747 0.676 0.863 0.868 0.878 0.888 0.792 0.818 0.555 0.732 0.070 0.570 0.707 0.843 0.727 0.797 0.772 0.747 0.757 0.419 0.671 0.838 0.727 0.671 0.752 0.727 0.732 0.813 0.787 0.747 0.792 0.737 0.843 0.888 0.797 0.787 0.712 0.732 0.833 0.833 0.868 0.883 - 0.717 0.843
high_school_government_and_politics 0.694 0.839 0.487 0.875 0.875 0.730 0.921 0.926 0.926 0.963 0.875 0.870 0.709 0.823 0.056 0.658 0.854 0.865 0.821 0.834 0.808 0.792 0.818 0.326 0.725 0.901 0.849 0.805 0.875 0.863 0.836 0.896 0.823 0.875 0.849 0.834 0.937 0.937 0.865 0.839 0.772 0.797 0.917 0.917 0.958 0.968 - 0.805 0.917
high_school_macroeconomics 0.497 0.630 0.235 0.653 0.687 0.487 0.704 0.706 0.717 0.758 0.651 0.653 0.415 0.520 0.035 0.420 0.494 0.633 0.498 0.656 0.558 0.535 0.556 0.176 0.497 0.656 0.525 0.478 0.528 0.532 0.521 0.682 0.661 0.635 0.646 0.635 0.756 0.807 0.687 0.643 0.564 0.592 0.684 0.684 0.802 0.825 - 0.496 0.710
high_school_mathematics 0.200 0.307 0.088 0.344 0.337 0.200 0.307 0.325 0.277 0.325 0.237 0.240 0.159 0.177 0.044 0.155 0.185 0.348 0.162 0.455 0.244 0.225 0.255 0.100 0.233 0.355 0.270 0.162 0.162 0.237 0.203 0.259 0.244 0.203 0.214 0.203 0.281 0.274 0.351 0.414 0.270 0.244 0.440 0.422 0.500 0.537 - 0.174 0.266
high_school_microeconomics 0.588 0.756 0.268 0.823 0.827 0.521 0.780 0.772 0.801 0.852 0.760 0.773 0.462 0.634 0.025 0.516 0.634 0.714 0.654 0.756 0.647 0.630 0.684 0.260 0.575 0.802 0.609 0.540 0.630 0.603 0.654 0.798 0.789 0.773 0.794 0.743 0.886 0.861 0.802 0.773 0.672 0.697 0.827 0.827 0.857 0.907 - 0.594 0.848
high_school_physics 0.178 0.384 0.099 0.509 0.496 0.218 0.463 0.463 0.423 0.496 0.344 0.364 0.145 0.225 0.039 0.211 0.198 0.337 0.211 0.456 0.298 0.337 0.317 0.105 0.211 0.443 0.284 0.165 0.251 0.245 0.331 0.423 0.384 0.384 0.377 0.384 0.463 0.569 0.298 0.344 0.317 0.311 0.470 0.456 0.635 0.695 - 0.192 0.456
high_school_psychology 0.724 0.820 0.445 0.827 0.853 0.761 0.895 0.889 0.896 0.910 0.840 0.858 0.669 0.785 0.042 0.660 0.776 0.838 0.771 0.864 0.840 0.831 0.834 0.456 0.761 0.855 0.809 0.764 0.814 0.817 0.797 0.858 0.838 0.823 0.855 0.844 0.884 0.904 0.827 0.807 0.803 0.796 0.858 0.856 0.882 0.902 - 0.779 0.880
high_school_statistics 0.412 0.532 0.185 0.564 0.625 0.347 0.592 0.601 0.574 0.615 0.509 0.500 0.319 0.393 0.037 0.398 0.453 0.490 0.393 0.569 0.458 0.435 0.462 0.240 0.342 0.615 0.467 0.361 0.476 0.402 0.421 0.550 0.504 0.472 0.569 0.523 0.615 0.643 0.550 0.583 0.481 0.518 0.615 0.648 0.717 0.782 - 0.407 0.555
high_school_us_history 0.617 0.784 0.436 0.764 0.794 0.656 0.829 0.829 0.829 0.867 0.833 0.867 0.602 0.740 0.053 0.588 0.750 0.759 0.709 0.710 0.764 0.759 0.784 0.348 0.696 0.828 0.823 0.699 0.799 0.782 0.792 0.794 0.754 0.764 0.759 0.735 0.833 0.877 0.803 0.779 0.715 0.759 0.843 0.852 0.882 0.906 - 0.803 0.857
high_school_world_history 0.670 0.810 0.535 0.759 0.818 0.700 0.881 0.864 0.872 0.881 0.810 0.827 0.637 0.763 0.088 0.594 0.751 0.780 0.745 0.746 0.805 0.784 0.789 0.405 0.725 0.793 0.776 0.720 0.797 0.750 0.826 0.759 0.759 0.729 0.746 0.742 0.835 0.869 0.805 0.784 0.776 0.793 0.818 0.827 0.869 0.877 - 0.783 0.848
human_aging 0.484 0.587 0.309 0.596 0.627 0.497 0.681 0.690 0.690 0.739 0.582 0.591 0.506 0.560 0.067 0.434 0.488 0.650 0.614 0.690 0.569 0.600 0.618 0.286 0.569 0.672 0.605 0.542 0.609 0.632 0.623 0.587 0.565 0.596 0.582 0.547 0.672 0.726 0.609 0.587 0.569 0.587 0.681 0.690 0.717 0.771 - 0.587 0.695
human_sexuality 0.503 0.603 0.351 0.648 0.694 0.519 0.761 0.730 0.746 0.755 0.648 0.633 0.549 0.702 0.152 0.572 0.603 0.702 0.638 0.793 0.671 0.702 0.671 0.419 0.587 0.709 0.679 0.569 0.618 0.615 0.646 0.671 0.648 0.618 0.664 0.587 0.748 0.740 0.648 0.664 0.625 0.625 0.740 0.717 0.786 0.839 - 0.584 0.770
international_law 0.644 0.710 0.404 0.727 0.776 0.644 0.785 0.785 0.801 0.760 0.735 0.752 0.628 0.694 0.099 0.528 0.677 0.785 0.760 0.826 0.752 0.727 0.776 0.388 0.710 0.743 0.752 0.710 0.694 0.768 0.743 0.768 0.735 0.694 0.735 0.727 0.826 0.892 0.776 0.768 0.710 0.685 0.768 0.785 0.834 0.867 - 0.685 0.859
jurisprudence 0.629 0.712 0.444 0.740 0.768 0.611 0.794 0.794 0.785 0.833 0.675 0.722 0.638 0.675 0.064 0.555 0.657 0.712 0.672 0.796 0.694 0.740 0.731 0.333 0.574 0.731 0.722 0.626 0.750 0.719 0.719 0.777 0.731 0.722 0.722 0.750 0.787 0.787 0.787 0.740 0.694 0.712 0.759 0.750 0.824 0.824 - 0.654 0.824
logical_fallacies 0.619 0.717 0.380 0.711 0.730 0.625 0.792 0.805 0.811 0.797 0.730 0.754 0.656 0.711 0.030 0.619 0.687 0.785 0.691 0.791 0.736 0.723 0.736 0.306 0.687 0.730 0.705 0.660 0.730 0.666 0.691 0.779 0.773 0.791 0.785 0.754 0.852 0.779 0.736 0.717 0.705 0.723 0.773 0.766 0.834 0.877 - 0.641 0.809
machine_learning 0.348 0.410 0.196 0.508 0.491 0.241 0.401 0.410 0.437 0.571 0.419 0.401 0.276 0.366 0.035 0.205 0.419 0.464 0.339 0.410 0.339 0.312 0.366 0.080 0.285 0.419 0.366 0.321 0.366 0.366 0.348 0.437 0.473 0.383 0.437 0.375 0.500 0.544 0.383 0.375 0.339 0.321 0.437 0.410 0.526 0.642 - 0.312 0.526
management 0.708 0.766 0.417 0.825 0.786 0.737 0.805 0.825 0.825 0.844 0.737 0.766 0.572 0.718 0.097 0.553 0.699 0.815 0.747 0.834 0.757 0.757 0.737 0.427 0.669 0.815 0.766 0.708 0.699 0.737 0.737 0.815 0.805 0.786 0.786 0.776 0.815 0.854 0.737 0.718 0.689 0.718 0.805 0.825 0.825 0.864 - 0.747 0.786
marketing 0.722 0.803 0.517 0.820 0.854 0.760 0.871 0.880 0.863 0.893 0.850 0.858 0.726 0.803 0.064 0.717 0.773 0.829 0.824 0.876 0.811 0.824 0.837 0.465 0.799 0.863 0.794 0.756 0.811 0.833 0.816 0.846 0.841 0.824 0.820 0.803 0.880 0.914 0.841 0.837 0.811 0.816 0.888 0.893 0.897 0.901 - 0.782 0.846
medical_genetics 0.570 0.690 0.340 0.720 0.750 0.580 0.780 0.750 0.780 0.810 0.630 0.640 0.570 0.650 0.060 0.510 0.600 0.690 0.700 0.770 0.700 0.690 0.720 0.320 0.660 0.770 0.660 0.600 0.740 0.630 0.660 0.760 0.680 0.710 0.710 0.700 0.830 0.860 0.690 0.660 0.660 0.690 0.770 0.770 0.820 0.900 - 0.640 0.820
miscellaneous 0.637 0.757 0.420 0.749 0.768 0.698 0.832 0.832 0.830 0.854 0.775 0.796 0.650 0.776 0.094 0.646 0.766 0.787 0.760 0.822 0.773 0.776 0.773 0.454 0.736 0.796 0.759 0.727 0.782 0.766 0.756 0.795 0.770 0.756 0.777 0.759 0.837 0.864 0.814 0.798 0.724 0.726 0.807 0.814 0.871 0.885 - 0.746 0.828
moral_disputes 0.468 0.624 0.323 0.609 0.618 0.526 0.686 0.671 0.680 0.736 0.604 0.612 0.456 0.592 0.083 0.442 0.580 0.589 0.607 0.696 0.618 0.578 0.621 0.283 0.560 0.644 0.572 0.524 0.552 0.598 0.645 0.647 0.644 0.635 0.615 0.621 0.696 0.748 0.658 0.630 0.537 0.566 0.664 0.676 0.725 0.760 - 0.554 0.708
moral_scenarios 0.234 0.344 0.115 0.165 0.411 0.227 0.330 0.410 0.325 0.366 0.307 0.360 0.082 0.164 0.007 0.004 0.243 0.280 0.243 0.439 0.184 0.153 0.205 0.136 0.410 0.283 0.246 0.122 0.226 0.229 0.327 0.391 0.317 0.288 0.366 0.404 0.538 0.582 0.336 0.139 0.130 0.058 0.318 0.368 0.546 0.565 - 0.188 0.477
nutrition 0.539 0.666 0.313 0.650 0.666 0.591 0.683 0.669 0.683 0.758 0.643 0.653 0.486 0.611 0.039 0.496 0.591 0.650 0.653 0.751 0.686 0.660 0.689 0.405 0.620 0.735 0.647 0.555 0.614 0.624 0.633 0.660 0.630 0.630 0.669 0.620 0.751 0.771 0.692 0.647 0.647 0.630 0.745 0.745 0.790 0.797 - 0.575 0.722
philosophy 0.491 0.598 0.327 0.681 0.675 0.527 0.654 0.641 0.658 0.713 0.652 0.659 0.472 0.623 0.051 0.485 0.594 0.636 0.590 0.723 0.614 0.639 0.617 0.340 0.578 0.646 0.598 0.587 0.633 0.612 0.580 0.636 0.630 0.598 0.630 0.588 0.704 0.784 0.646 0.578 0.562 0.565 0.675 0.688 0.774 0.778 - 0.554 0.717
prehistory 0.549 0.685 0.308 0.660 0.697 0.518 0.727 0.730 0.728 0.783 0.635 0.663 0.484 0.638 0.061 0.469 0.601 0.669 0.638 0.746 0.679 0.691 0.700 0.348 0.604 0.709 0.623 0.580 0.675 0.697 0.648 0.731 0.688 0.675 0.697 0.663 0.774 0.805 0.675 0.629 0.641 0.666 0.762 0.756 0.836 0.861 - 0.595 0.783
professional_accounting 0.319 0.400 0.184 0.418 0.432 0.326 0.507 0.489 0.496 0.514 0.404 0.425 0.262 0.354 0.028 0.244 0.319 0.453 0.386 0.492 0.358 0.386 0.393 0.102 0.336 0.432 0.382 0.336 0.421 0.361 0.382 0.414 0.421 0.397 0.418 0.386 0.578 0.510 0.443 0.478 0.386 0.414 0.457 0.460 0.560 0.631 - 0.358 0.514
professional_law 0.292 0.393 0.202 0.397 0.417 0.307 0.486 0.478 0.478 0.528 0.404 0.408 0.329 0.383 0.073 0.305 0.387 0.359 0.363 0.367 0.359 0.377 0.397 0.180 0.369 0.372 0.383 0.333 0.379 0.399 0.383 0.440 0.402 0.405 0.410 0.401 0.498 0.492 0.423 0.386 0.340 0.337 0.401 0.402 0.477 0.541 - 0.350 0.481
professional_medicine 0.426 0.558 0.235 0.639 0.636 0.485 0.749 0.774 0.756 0.794 0.654 0.680 0.375 0.588 0.014 0.419 0.591 0.665 0.682 0.761 0.713 0.727 0.724 0.323 0.713 0.643 0.672 0.564 0.705 0.642 0.619 0.705 0.654 0.643 0.687 0.658 0.794 0.823 0.658 0.613 0.573 0.580 0.680 0.683 0.812 0.845 - 0.645 0.794
professional_psychology 0.446 0.591 0.300 0.647 0.665 0.477 0.722 0.717 0.728 0.805 0.598 0.609 0.449 0.535 0.044 0.388 0.547 0.599 0.580 0.686 0.616 0.619 0.642 0.256 0.509 0.669 0.565 0.521 0.602 0.560 0.588 0.686 0.648 0.638 0.655 0.617 0.764 0.799 0.671 0.637 0.586 0.591 0.707 0.702 0.776 0.810 - 0.529 0.759
public_relations 0.463 0.563 0.409 0.563 0.600 0.563 0.690 0.700 0.700 0.672 0.572 0.627 0.500 0.581 0.072 0.454 0.590 0.636 0.590 0.636 0.536 0.518 0.518 0.245 0.545 0.590 0.581 0.554 0.627 0.581 0.518 0.618 0.572 0.627 0.554 0.572 0.672 0.727 0.636 0.618 0.563 0.572 0.627 0.645 0.736 0.663 - 0.554 0.581
security_studies 0.616 0.730 0.240 0.608 0.644 0.616 0.710 0.751 0.746 0.763 0.624 0.632 0.514 0.648 0.102 0.530 0.583 0.759 0.644 0.738 0.665 0.653 0.665 0.440 0.616 0.697 0.673 0.600 0.608 0.612 0.628 0.685 0.693 0.697 0.669 0.673 0.738 0.730 0.665 0.669 0.620 0.653 0.718 0.718 0.767 0.775 - 0.575 0.730
sociology 0.701 0.791 0.412 0.781 0.791 0.666 0.800 0.825 0.815 0.860 0.736 0.741 0.626 0.716 0.039 0.691 0.741 0.810 0.766 0.830 0.751 0.756 0.786 0.452 0.741 0.835 0.771 0.716 0.786 0.776 0.781 0.805 0.800 0.800 0.820 0.781 0.850 0.870 0.825 0.820 0.716 0.736 0.815 0.825 0.855 0.860 - 0.741 0.835
us_foreign_policy 0.630 0.820 0.510 0.780 0.790 0.690 0.838 0.868 0.868 0.840 0.780 0.800 0.700 0.790 0.150 0.680 0.760 0.790 0.777 0.890 0.790 0.770 0.800 0.450 0.800 0.820 0.840 0.757 0.740 0.787 0.787 0.790 0.760 0.740 0.760 0.770 0.850 0.890 0.800 0.820 0.750 0.780 0.820 0.820 0.890 0.880 - 0.757 0.810
virology 0.325 0.463 0.246 0.433 0.445 0.433 0.457 0.475 0.472 0.506 0.415 0.439 0.367 0.439 0.096 0.307 0.379 0.415 0.460 0.506 0.415 0.439 0.439 0.301 0.415 0.463 0.475 0.387 0.421 0.436 0.448 0.421 0.391 0.379 0.403 0.367 0.487 0.500 0.457 0.421 0.373 0.427 0.463 0.457 0.487 0.518 - 0.381 0.487
world_religions 0.637 0.713 0.403 0.748 0.801 0.678 0.800 0.817 0.800 0.847 0.766 0.766 0.643 0.760 0.052 0.614 0.736 0.801 0.729 0.801 0.766 0.783 0.789 0.508 0.742 0.801 0.777 0.747 0.789 0.800 0.747 0.766 0.754 0.742 0.742 0.725 0.801 0.836 0.766 0.748 0.783 0.760 0.818 0.818 0.859 0.871 - 0.705 0.812
MMLU 0.465 0.585 0.285 0.591 0.623 0.475 0.646 0.652 0.647 0.687 0.580 0.595 0.429 0.530 0.055 0.413 0.529 0.595 0.537 0.640 0.556 0.552 0.570 0.281 0.525 0.613 0.550 0.486 0.555 0.544 0.553 0.618 0.590 0.578 0.599 0.578 0.682 0.710 0.610 0.576 0.532 0.540 0.639 0.643 0.721 0.757 - 0.509 0.666
AGIEVAL
aquarat 0.645 0.763 0.374 0.602 0.562 0.460 0.677 0.696 0.665 0.602 0.653 0.637 0.488 0.614 0.279 0.488 0.594 0.657 0.145 0.653 0.681 0.673 0.598 0.370 0.633 0.755 0.712 0.279 0.322 0.582 0.157 0.129 0.212 0.338 0.409 0.574 0.614 0.834 0.712 0.566 0.732 0.728 0.799 0.830 0.822 0.870 - 0.425 0.590
logiqa 0.274 0.359 0.208 0.356 0.337 0.321 0.443 0.440 0.447 0.477 0.399 0.416 0.282 0.324 0.052 0.248 0.321 0.393 0.333 0.413 0.311 0.321 0.328 0.168 0.265 0.376 0.324 0.264 0.311 0.330 0.327 0.339 0.308 0.285 0.281 0.267 0.405 0.445 0.359 0.351 0.316 0.342 0.427 0.436 0.493 0.554 - 0.290 0.391
lsatar 0.260 0.256 0.213 0.213 0.282 0.191 0.234 0.239 0.208 0.260 0.073 0.217 0.234 0.226 0.208 0.221 0.213 0.239 0.200 0.252 0.278 0.278 0.295 0.200 0.239 0.252 0.186 0.186 0.200 0.278 0.208 0.265 0.260 0.252 0.256 0.247 0.208 0.369 0.252 0.178 0.230 0.226 0.260 0.300 0.321 0.400 - 0.173 0.226
lsatlr 0.400 0.525 0.203 0.486 0.537 0.337 0.625 0.627 0.635 0.654 0.505 0.515 0.296 0.415 0.043 0.243 0.449 0.513 0.445 0.490 0.425 0.433 0.441 0.180 0.327 0.541 0.447 0.366 0.445 0.523 0.570 0.456 0.431 0.429 0.415 0.386 0.598 0.621 0.456 0.466 0.452 0.449 0.598 0.603 0.729 0.811 - 0.449 0.576
lsatrc 0.513 0.680 0.312 0.594 0.646 0.431 0.747 0.747 0.750 0.754 0.635 0.643 0.390 0.557 0.048 0.379 0.565 0.643 0.605 0.672 0.591 0.635 0.624 0.223 0.486 0.657 0.635 0.520 0.650 0.613 0.617 0.568 0.513 0.557 0.531 0.524 0.672 0.762 0.583 0.572 0.553 0.617 0.661 0.687 0.810 0.836 - 0.654 0.706
saten 0.703 0.796 0.470 0.791 0.810 0.665 0.839 0.844 0.834 0.868 0.815 0.820 0.519 0.791 0.165 0.582 0.757 0.820 0.762 0.834 0.776 0.762 0.781 0.339 0.689 0.810 0.825 0.679 0.747 0.786 0.786 0.737 0.708 0.713 0.713 0.708 0.800 0.830 0.781 0.776 0.733 0.776 0.810 0.844 0.888 0.922 - 0.757 0.796
satmath 0.840 0.936 0.559 0.790 0.822 0.627 0.900 0.872 0.886 0.768 0.863 0.868 0.577 0.745 0.390 0.650 0.800 0.904 0.377 0.804 0.768 0.822 0.618 0.550 0.845 0.968 0.886 0.400 0.395 0.690 0.413 0.190 0.331 0.509 0.713 0.754 0.727 0.977 0.900 0.731 0.900 0.922 0.963 0.963 0.990 0.981 - 0.540 0.768
AGIEVAL 0.459 0.558 0.294 0.503 0.523 0.398 0.600 0.600 0.598 0.602 0.525 0.546 0.364 0.473 0.131 0.352 0.479 0.547 0.397 0.544 0.489 0.501 0.480 0.253 0.433 0.567 0.512 0.359 0.416 0.501 0.432 0.382 0.381 0.409 0.429 0.438 0.546 0.638 0.522 0.481 0.501 0.520 0.599 0.616 0.681 0.734 - 0.434 0.544
AGIEVALC_biology - - - - - - - - - - 0.756 0.778 0.204 0.334 0.004 0.204 0.326 0.830 - - - - - - - 0.769 0.430 - 0.526 0.304 0.408 - - - - - - - 0.691 - 0.660 0.700 0.804 0.813 0.834 0.582 - 0.356 0.508
AGIEVALC_chemistry - - - - - - - - - - 0.642 0.691 0.142 0.250 0 0.117 0.215 0.598 - - - - - - - 0.509 0.259 - 0.343 0.215 0.313 - - - - - - - 0.563 - 0.441 0.470 0.583 0.627 0.696 0.789 - 0.171 0.348
AGIEVALC_chinese - - - - - - - - - - 0.642 0.650 0.186 0.186 0.024 0.101 0.138 0.682 - - - - - - - 0.552 0.337 - 0.325 0.300 0.313 - - - - - - - 0.650 - 0.508 0.504 0.585 0.593 0.760 0.735 - 0.239 0.272
AGIEVALC_english - - - - - - - - - - 0.823 0.833 0.647 0.748 0.094 0.588 0.728 0.947 - - - - - - - 0.846 0.839 - 0.797 0.807 0.862 - - - - - - - 0.866 - 0.794 0.839 0.856 0.849 0.915 0.924 - 0.830 0.774
AGIEVALC_geography - - - - - - - - - - 0.728 0.728 0.316 0.386 0.040 0.311 0.412 0.778 - - - - - - - 0.718 0.537 - 0.572 0.371 0.457 - - - - - - - 0.768 - 0.643 0.633 0.753 0.778 0.804 0.839 - 0.346 0.577
AGIEVALC_history - - - - - - - - - - 0.829 0.834 0.314 0.400 0.021 0.323 0.412 0.817 - - - - - - - 0.753 0.629 - 0.642 0.378 0.485 - - - - - - - 0.821 - 0.740 0.744 0.774 0.800 0.842 0.923 - 0.357 0.557
AGIEVALC_jecqaca - - - - - - - - - - 0.414 0.440 0.196 0.232 0.022 0.206 0.221 0.514 - - - - - - - 0.425 0.258 - 0.247 0.223 0.273 - - - - - - - 0.566 - 0.425 0.424 0.482 0.487 0.564 0.622 - 0.185 0.266
AGIEVALC_jecqakd - - - - - - - - - - 0.549 0.559 0.179 0.212 0.033 0.148 0.215 0.620 - - - - - - - 0.540 0.290 - 0.304 0.281 0.275 - - - - - - - 0.636 - 0.498 0.526 0.592 0.605 0.732 0.747 - 0.242 0.288
AGIEVALC_logiqa - - - - - - - - - - 0.479 0.490 0.218 0.296 0.035 0.195 0.279 0.519 - - - - - - - 0.442 0.313 - 0.310 0.317 0.330 - - - - - - - 0.462 - 0.399 0.405 0.497 0.500 0.565 0.588 - 0.274 0.357
AGIEVALC_mathcloze - - - - - - - - - - 0.491 0.542 0.186 0.237 - 0.288 0.415 0.576 - - - - - - - 0.652 0.508 - 0.288 0.152 0.245 - - - - - - - 0.449 - 0.508 0.440 0.694 0.686 0.737 0.805 0.864 0.110 0.618
AGIEVALC_mathqa - - - - - - - - - - 0.621 0.648 0.343 0.404 0.281 0.401 0.465 0.662 - - - - - - - 0.656 0.578 - 0.485 0.296 0.334 - - - - - - - 0.543 - 0.595 0.683 0.779 0.755 0.808 0.834 0.828 0.261 0.357
AGIEVALC_physics - - - - - - - - - - 0.396 0.425 0.166 0.229 0.034 0.166 0.189 0.436 - - - - - - - 0.402 0.206 - 0.258 0.183 0.235 - - - - - - - 0.494 - 0.390 0.413 0.431 0.500 0.683 0.770 0.741 0.178 0.310
AGIEVALC - - - - - - - - - - 0.589 0.607 0.257 0.322 0.076 0.248 0.322 0.645 - - - - - - - 0.576 0.409 - 0.404 0.325 0.371 - - - - - - - 0.612 - 0.529 0.548 0.627 0.636 0.716 0.734 0.811 0.298 0.403
BBH
boolean_expressions 0.720 0.740 0.544 0.860 0.876 0.556 0.764 0.776 0.768 0.460 0.848 0.868 0.812 0.856 0.724 0.752 0.700 0.688 0.796 0.804 0.824 0.832 0.844 0.460 0.480 0.896 0.728 0.764 0.780 0.824 0.664 0.816 0.848 0.800 0.852 0.832 0.696 0.936 0.808 0.776 0.756 0.796 0.864 0.880 0.888 0.808 - 0.720 0.540
causal_judgement 0.588 0.598 0.550 0.577 0.582 0.524 0.609 0.604 0.598 0.604 0.550 0.550 0.566 0.598 0.566 0.604 0.588 0.689 0.540 0.513 0.545 0.518 0.540 0.502 0.518 0.577 0.593 0.588 0.625 0.614 0.604 0.556 0.508 0.598 0.588 0.593 0.588 0.647 0.625 0.582 0.497 0.529 0.508 0.513 0.647 0.700 - 0.636 0.641
date_understanding 0.576 0.752 0.324 0.668 0.748 0.592 0.780 0.764 0.748 0.788 0.580 0.572 0.560 0.684 0.284 0.628 0.660 0.832 0.700 0.660 0.732 0.724 0.716 0.400 0.664 0.728 0.772 0.548 0.668 0.592 0.608 0.644 0.464 0.568 0.696 0.576 0.780 0.932 0.544 0.544 0.616 0.648 0.764 0.740 0.856 0.872 - 0.556 0.724
disambiguation_qa 0.612 0.588 0.400 0.712 0.668 0.532 0.688 0.652 0.660 0.720 0.584 0.636 0.628 0.640 0.380 0.616 0.648 0.732 0.584 0.536 0.552 0.540 0.516 0.424 0.472 0.688 0.644 0.600 0.596 0.728 0.704 0.604 0.640 0.592 0.720 0.752 0.692 0.768 0.660 0.696 0.544 0.556 0.656 0.636 0.764 0.780 - 0.576 0.640
dyck_languages 0.596 0.664 0.424 0.704 0.712 0.476 0.752 0.720 0.728 0.600 0.516 0.544 0.560 0.704 0.356 0.700 0.664 0.728 0.792 0.228 0.832 0.724 0.796 0.536 0.680 0.736 0.756 0.744 0.712 0.664 0.732 0.752 0.532 0.424 0.580 0.468 0.532 0.776 0.756 0.576 0.596 0.628 0.868 0.836 0.648 0.820 - 0.684 0.572
formal_fallacies 0.764 0.636 0.624 0.740 0.660 0.532 0.868 0.824 0.832 0.760 0.568 0.660 0.960 0.640 0.896 0.992 0.832 0.920 0.780 0.992 0.920 0.988 0.984 0.992 0.816 0.672 0.532 0.852 0.996 0.632 0.564 0.876 0.876 0.920 0.808 0.808 0.944 0.804 0.632 0.716 0.928 0.852 0.628 0.628 0.784 0.812 - 0.776 0.576
geometric_shapes 0.420 0.520 0.056 0.544 0.456 0.204 0.400 0.384 0.436 0.420 0.392 0.400 0.268 0.368 0.096 0.352 0.488 0.840 0.352 0.400 0.440 0.488 0.440 0.088 0.416 0.564 0.520 0.288 0.404 0.348 0.344 0.468 0.372 0.248 0.416 0.292 0.328 0.648 0.356 0.276 0.204 0.212 0.544 0.604 0.584 0.640 - 0.268 0.400
hyperbaton 0.724 0.872 0.512 0.572 0.680 0.704 0.888 0.856 0.884 0.836 0.740 0.824 0.612 0.724 0.468 0.604 0.704 0.928 0.712 0.752 0.824 0.768 0.880 0.588 0.624 0.664 0.804 0.656 0.644 0.828 0.724 0.800 0.968 0.940 0.936 0.936 0.952 0.996 0.704 0.656 0.636 0.676 0.832 0.792 0.868 0.956 - 0.744 0.900
logical_deduction_five_objects 0.592 0.724 0.176 0.700 0.532 0.300 0.596 0.636 0.568 0.608 0.528 0.516 0.352 0.464 0.204 0.424 0.520 0.660 0.500 0.576 0.540 0.536 0.568 0.236 0.484 0.752 0.592 0.352 0.556 0.384 0.472 0.580 0.464 0.432 0.632 0.532 0.532 0.940 0.556 0.524 0.468 0.528 0.752 0.728 0.876 0.924 - 0.436 0.612
logical_deduction_seven_objects 0.540 0.672 0.152 0.556 0.492 0.284 0.580 0.564 0.560 0.552 0.444 0.500 0.284 0.388 0.140 0.376 0.464 0.648 0.472 0.516 0.472 0.484 0.488 0.216 0.408 0.640 0.500 0.296 0.452 0.320 0.400 0.564 0.476 0.308 0.568 0.500 0.444 0.920 0.464 0.416 0.420 0.436 0.668 0.656 0.792 0.864 - 0.388 0.560
logical_deduction_three_objects 0.780 0.980 0.376 0.868 0.820 0.440 0.860 0.868 0.844 0.892 0.836 0.840 0.524 0.664 0.320 0.596 0.744 0.896 0.632 0.736 0.760 0.764 0.804 0.340 0.652 0.932 0.844 0.608 0.800 0.664 0.620 0.836 0.724 0.688 0.844 0.804 0.884 0.992 0.736 0.716 0.696 0.720 0.940 0.956 0.980 0.992 - 0.664 0.888
movie_recommendation 0.504 0.428 0.424 0.652 0.676 0.568 0.560 0.552 0.552 0.508 0.604 0.648 0.440 0.528 0.224 0.380 0.476 0.884 0.532 0.504 0.548 0.540 0.536 0.336 0.456 0.564 0.604 0.508 0.448 0.552 0.540 0.572 0.544 0.540 0.520 0.508 0.584 0.992 0.548 0.492 0.604 0.568 0.556 0.536 0.672 0.648 - 0.584 0.676
multistep_arithmetic_two 0.368 0.536 0.136 0.944 0.968 0.288 0.480 0.472 0.488 0.472 0.580 0.524 0.340 0.464 0.072 0.272 0.508 0.372 0.248 0.060 0.712 0.704 0.700 0.240 0.532 0.824 0.540 0.108 0.432 0.164 0.292 0.612 0.624 0.272 0.836 0.420 0.460 0.984 0.532 0.324 0.852 0.876 0.896 0.948 0.964 0.976 - 0.252 0.536
navigate 0.556 0.608 0.540 0.580 0.588 0.580 0.588 0.588 0.596 0.648 0.420 0.420 0.580 0.592 0.552 0.588 0.580 0.452 0.576 0.560 0.520 0.580 0.580 0.580 0.580 0.596 0.572 0.600 0.588 0.568 0.580 0.612 0.644 0.596 0.588 0.584 0.636 0.640 0.596 0.592 0.576 0.572 0.596 0.596 0.624 0.684 - 0.520 0.652
object_counting 0.704 0.756 0.464 0.764 0.820 0.612 0.800 0.808 0.848 0.856 0.616 0.660 0.624 0.760 0.460 0.680 0.776 0.644 0.852 0.896 0.820 0.772 0.864 0.524 0.808 0.872 0.908 0.608 0.716 0.564 0.796 0.876 0.696 0.244 0.836 0.344 0.372 0.996 0.660 0.676 0.740 0.764 0.848 0.804 0.892 0.896 - 0.680 0.756
penguins_in_a_table 0.835 0.952 0.369 0.842 0.746 0.506 0.883 0.869 0.890 0.842 0.917 0.917 0.527 0.705 0.260 0.595 0.705 0.815 0.767 0.863 0.856 0.821 0.856 0.356 0.801 0.958 0.917 0.623 0.801 0.575 0.760 0.719 0.486 0.465 0.883 0.712 0.815 1.000 0.835 0.719 0.821 0.849 0.945 0.924 0.958 0.986 - 0.636 0.828
reasoning_about_colored_objects 0.744 0.872 0.276 0.860 0.800 0.484 0.700 0.700 0.744 0.900 0.876 0.796 0.548 0.668 0.200 0.528 0.768 0.904 0.740 0.800 0.760 0.820 0.824 0.276 0.568 0.880 0.904 0.608 0.752 0.648 0.752 0.696 0.664 0.656 0.808 0.656 0.896 0.968 0.764 0.716 0.700 0.764 0.904 0.868 0.944 0.984 - 0.600 0.840
ruin_names 0.428 0.552 0.176 0.484 0.636 0.480 0.720 0.692 0.716 0.760 0.696 0.652 0.348 0.528 0.208 0.356 0.524 0.932 0.724 0.736 0.676 0.680 0.744 0.348 0.532 0.488 0.556 0.400 0.584 0.408 0.592 0.628 0.528 0.596 0.612 0.600 0.636 0.816 0.564 0.560 0.396 0.324 0.440 0.544 0.692 0.760 - 0.536 0.616
salient_translation_error_detection 0.540 0.608 0.212 0.448 0.508 0.420 0.532 0.580 0.548 0.568 0.476 0.488 0.360 0.516 0.164 0.360 0.468 0.644 0.452 0.504 0.436 0.504 0.512 0.188 0.464 0.572 0.556 0.444 0.472 0.524 0.560 0.508 0.448 0.408 0.520 0.532 0.596 0.636 0.456 0.444 0.452 0.432 0.560 0.572 0.612 0.700 - 0.532 0.588
snarks 0.606 0.752 0.483 0.685 0.707 0.584 0.646 0.696 0.691 0.719 0.702 0.707 0.561 0.685 0.488 0.612 0.640 0.820 0.668 0.696 0.651 0.685 0.651 0.488 0.657 0.730 0.691 0.606 0.691 0.533 0.640 0.617 0.612 0.735 0.747 0.786 0.747 0.882 0.657 0.651 0.662 0.623 0.747 0.780 0.831 0.865 - 0.646 0.837
sports_understanding 0.644 0.692 0.584 0.672 0.692 0.724 0.824 0.796 0.788 0.816 0.472 0.468 0.708 0.780 0.460 0.684 0.772 0.920 0.684 0.696 0.636 0.744 0.720 0.572 0.644 0.680 0.640 0.716 0.800 0.836 0.792 0.612 0.600 0.596 0.596 0.600 0.748 0.740 0.776 0.784 0.620 0.616 0.676 0.684 0.680 0.748 - 0.828 0.740
temporal_sequences 0.408 0.796 0.164 0.528 0.540 0.124 0.680 0.680 0.708 0.748 0.756 0.840 0.216 0.576 0.272 0.500 0.700 0.976 0.792 0.688 0.804 0.788 0.856 0.204 0.712 0.508 0.360 0.404 0.544 0.524 0.508 0.860 0.612 0.800 0.784 0.508 0.892 1.000 0.596 0.356 0.324 0.388 0.800 0.820 0.988 0.992 - 0.568 0.920
tracking_shuffled_objects_five_objects 0.976 1.000 0.208 0.560 0.616 0.216 0.536 0.600 0.600 0.692 0.544 0.536 0.588 0.536 0.400 0.496 0.408 0.572 0.552 0.520 0.568 0.596 0.656 0.152 0.500 0.852 0.792 0.344 0.736 0.356 0.468 0.848 0.664 0.612 0.940 0.712 0.776 1.000 0.476 0.400 0.420 0.452 0.840 0.908 0.924 0.972 - 0.364 0.420
tracking_shuffled_objects_seven_objects 0.932 0.952 0.140 0.324 0.524 0.152 0.576 0.572 0.572 0.640 0.512 0.436 0.484 0.596 0.220 0.396 0.344 0.480 0.436 0.420 0.488 0.536 0.592 0.120 0.420 0.760 0.728 0.296 0.596 0.284 0.396 0.780 0.640 0.568 0.896 0.612 0.652 0.984 0.416 0.320 0.292 0.312 0.800 0.868 0.848 0.980 - 0.372 0.436
tracking_shuffled_objects_three_objects 0.996 0.996 0.288 0.696 0.732 0.292 0.716 0.708 0.732 0.848 0.620 0.696 0.604 0.592 0.448 0.724 0.740 0.528 0.680 0.696 0.704 0.704 0.728 0.304 0.608 0.828 0.832 0.436 0.832 0.412 0.724 0.780 0.836 0.572 0.960 0.788 0.888 1.000 0.524 0.420 0.604 0.664 0.832 0.872 0.856 0.996 - 0.536 0.660
web_of_lies 0.524 0.544 0.476 0.576 0.520 0.508 0.524 0.556 0.520 0.488 0.476 0.488 0.480 0.544 0.488 0.508 0.504 0.536 0.524 0.508 0.440 0.512 0.512 0.516 0.544 0.552 0.492 0.488 0.512 0.488 0.512 0.512 0.492 0.512 0.488 0.492 0.548 0.512 0.552 0.488 0.512 0.512 0.528 0.532 0.544 0.624 - 0.488 0.520
word_sorting 0.176 0.276 0.056 0.204 0.292 0.100 0.424 0.424 0.404 0.540 0.404 0.392 0.216 0.360 0.092 0.232 0.292 0.452 0.544 0 0.556 0.500 0.512 0.160 0.360 0.248 0.340 0.280 0.392 0.344 0.500 0.280 0.224 0.168 0.204 0.152 0.236 0.360 0.208 0.188 0.156 0.156 0.212 0.220 0.292 0.400 - 0.336 0.276
BBH 0.621 0.702 0.334 0.638 0.650 0.432 0.663 0.661 0.664 0.674 0.596 0.608 0.507 0.595 0.347 0.536 0.598 0.719 0.613 0.582 0.650 0.659 0.681 0.373 0.566 0.691 0.652 0.506 0.631 0.531 0.583 0.667 0.602 0.549 0.696 0.592 0.658 0.846 0.587 0.536 0.554 0.567 0.709 0.718 0.775 0.827 - 0.549 0.637
MUSR
murder_mystery 0.632 0.620 0.552 0.640 0.592 0.552 0.688 0.672 0.668 0.576 0.616 0.584 0.544 0.492 0.568 0.524 0.568 0.572 0.576 0.560 0.616 0.596 0.584 0.540 0.576 0.636 0.624 0.516 0.656 0.592 0.272 0.616 0.612 0.636 0.636 0.620 0.600 0.708 0.516 0.156 0.544 0.612 0.604 0.584 0.652 0.640 - 0.588 0.532
object_placements 0.468 0.503 0.429 0.535 0.578 0.449 0.539 0.535 0.519 0.542 0.492 0.531 0.480 0.496 0.316 0.484 0.496 0.492 0.500 0.531 0.542 0.488 0.546 0.363 0.523 0.488 0.484 0.453 0.542 0.527 0.523 0.496 0.437 0.496 0.503 0.457 0.519 0.464 0.511 0.425 0.472 0.476 0.531 0.554 0.519 0.265 - 0.500 0.425
team_allocation 0.412 0.508 0.436 0.512 0.496 0.352 0.460 0.484 0.460 0.476 0.572 0.588 0.364 0.460 0.336 0.440 0.396 0.500 0.412 0.512 0.416 0.468 0.460 0.380 0.396 0.488 0.504 0.356 0.448 0.456 0.516 0.540 0.548 0.520 0.536 0.480 0.560 0.628 0.440 0.084 0.444 0.384 0.512 0.476 0.556 0.592 - 0.504 0.556
MUSR 0.503 0.543 0.472 0.562 0.555 0.451 0.562 0.563 0.548 0.531 0.559 0.567 0.462 0.482 0.406 0.482 0.486 0.521 0.496 0.534 0.525 0.517 0.530 0.427 0.498 0.537 0.537 0.441 0.548 0.525 0.437 0.550 0.531 0.550 0.558 0.518 0.559 0.599 0.489 0.223 0.486 0.490 0.548 0.538 0.575 0.497 - 0.530 0.503
MMLUPRO
biology 0.641 0.728 0.324 0.708 0.702 0.582 0.754 0.687 0.747 0.772 0.676 0.695 0.538 0.608 0.207 0.523 0.598 0.687 0.627 0.641 0.675 0.672 0.686 0.334 0.623 0.707 0.659 0.582 0.651 0.619 0.592 0.697 0.656 0.676 0.702 0.662 0.725 0.835 0.668 0.571 0.610 0.638 0.709 0.729 0.797 0.764 - 0.570 0.684
business 0.490 0.624 0.190 0.624 0.525 0.356 0.579 0.679 0.583 0.626 0.522 0.562 0.307 0.423 0.145 0.338 0.441 0.510 0.465 0.555 0.558 0.536 0.558 0.211 0.458 0.617 0.536 0.335 0.510 0.404 0.429 0.520 0.396 0.465 0.571 0.509 0.476 0.785 0.590 0.415 0.504 0.558 0.647 0.661 0.718 0.755 - 0.335 0.496
chemistry 0.399 0.559 0.166 0.639 0.500 0.271 0.502 0.639 0.503 0.546 0.465 0.467 0.227 0.291 0.106 0.244 0.325 0.343 0.332 0.414 0.451 0.439 0.467 0.161 0.390 0.545 0.375 - 0.366 0.263 0.271 0.456 0.367 0.431 0.463 0.296 0.312 0.765 0.413 0.271 0.387 0.451 0.559 0.580 0.684 0.701 - 0.196 0.407
computer_science 0.424 0.600 0.197 0.602 0.590 0.300 0.473 0.507 0.482 0.560 0.497 0.502 0.329 0.443 0.124 0.368 0.446 0.487 0.392 0.521 0.495 0.534 0.485 0.195 0.414 0.551 0.541 - 0.456 0.426 0.424 0.495 0.426 0.458 0.475 0.448 0.521 0.734 0.482 0.370 0.434 0.402 0.590 0.604 0.663 0.734 - 0.339 0.512
economics 0.541 0.659 0.236 0.663 0.662 0.408 0.622 0.648 0.668 0.678 0.617 0.610 0.395 0.510 0.187 0.427 0.528 0.540 0.510 0.556 0.569 0.571 0.568 0.254 0.492 0.630 0.542 - 0.541 0.484 0.490 0.600 0.555 0.558 0.609 0.587 0.575 0.792 0.574 0.521 0.521 0.550 0.674 0.687 0.721 0.787 - 0.463 -
engineering 0.340 0.420 0.157 0.437 0.424 0.253 0.391 0.406 0.406 0.414 0.303 0.298 0.204 0.284 0.117 0.240 0.269 0.301 0.311 0.348 0.391 0.391 0.378 0.157 0.302 0.380 0.330 - 0.317 0.237 0.237 0.274 0.264 0.297 0.297 0.283 0.342 0.589 0.356 0.247 0.296 0.309 0.418 0.420 0.512 0.573 - 0.180 -
health 0.353 0.506 0.158 0.503 0.517 0.333 0.561 0.535 0.545 0.621 0.492 0.496 0.273 0.433 0.134 0.322 0.460 0.458 0.465 0.530 0.556 0.562 0.558 0.220 0.437 0.493 0.464 - 0.498 0.422 0.414 0.541 0.433 0.479 0.515 0.466 0.588 0.700 0.442 0.394 0.388 0.416 0.556 0.569 0.643 0.690 - 0.381 -
history 0.325 0.477 0.149 0.406 0.467 0.275 0.482 0.522 0.493 0.490 0.425 0.438 0.244 0.388 0.139 0.259 0.370 0.391 0.406 0.419 0.433 0.409 0.451 0.149 0.380 0.438 0.409 - 0.425 0.359 0.364 0.406 0.398 0.388 0.380 0.380 0.496 0.627 0.391 0.325 0.333 0.367 0.459 0.464 0.566 0.624 - 0.380 -
law 0.207 0.318 0.123 0.268 0.295 0.198 0.356 0.353 0.343 0.405 0.299 0.284 0.204 0.266 0.128 0.195 0.282 0.237 0.260 0.306 0.295 0.309 0.303 0.129 0.243 0.283 0.276 - 0.279 0.238 0.262 0.327 0.285 0.306 0.276 0 0.384 0.500 0.271 0.207 0.220 0.237 0.300 0.292 0.366 0.455 - 0.217 -
math 0.518 0.686 0.203 0.694 0.564 0.309 0.537 0.621 0.538 0.570 0.490 0.523 0.318 0.417 0.148 0.367 0.454 0.508 0.382 0.496 0.532 0.516 0.555 0.273 0.511 0.679 0.543 - 0.416 0.369 0.418 0.482 0.369 0.468 0.522 0.458 0.391 0.816 0.592 0.385 0.581 0.603 0.712 0.723 0.775 0.814 - 0.270 -
other 0.360 0.514 0.164 0.450 0.496 0.325 0.528 0.542 0.551 0.574 0.464 0.458 0.308 0.423 0.162 0.312 0.432 0.440 0.411 0.493 0.482 0.478 0.487 0.222 0.389 0.589 0.464 - 0.456 0.416 0.401 0.466 0.396 0.457 0.500 0.433 0.532 0.706 0.444 0.406 0.410 0.405 0.529 0.551 0.611 0.664 - 0.400 -
philosophy 0.300 0.450 0.148 0.442 0.462 0.272 0.460 0.436 0.448 0.488 0.408 0.412 0.286 0.366 0.142 0.300 0.352 0.372 0.356 0.434 0.424 0.438 0.382 0.192 0.326 0.424 0.366 - 0.390 0.360 0.346 0.422 0.354 0.386 0.406 0.390 0.494 0.633 0.374 0.336 0.376 0.364 0.480 0.464 0.557 0.599 - 0.326 -
physics 0.404 0.557 0.159 0.583 0.493 0.275 0.491 0.494 0.501 0.559 0.441 0.461 0.222 0.318 0.133 0.280 0.342 0.344 0.334 0.457 0.484 0.492 0.488 0.187 0.397 0.547 0.414 - 0.370 0.317 0.309 0.432 0.352 0.423 0.455 0.425 0.367 0.765 0.457 0.297 0.419 0.456 0.589 0.602 0.702 0.543 - 0.240 -
psychology 0.502 0.642 0.258 0.621 0.645 0.494 0.666 0.657 0.647 0.692 0.586 0.602 0.436 0.531 0.184 0.448 0.543 0.572 0.560 0.581 0.624 0.601 0.637 0.317 0.518 0.604 0.595 - 0.588 0.543 0.552 0.626 0.573 0.583 0.621 0.572 0.676 0.759 0.595 0.536 0.526 0.563 0.636 0.644 0.721 0.749 - 0.525 -
MMLUPRO 0.416 0.554 0.186 0.552 0.517 0.326 0.524 0.553 0.528 0.568 0.471 0.480 0.298 0.395 0.145 0.324 0.410 0.432 0.404 0.475 0.494 0.491 0.499 0.215 0.419 0.539 0.458 0.453 0.436 0.376 0.382 0.475 0.405 0.451 0.482 0.408 0.470 0.719 0.475 0.368 0.430 0.457 0.564 0.575 0.649 0.671 - 0.326 0.509
CATEGORIES
REASONING 0.658 0.774 0.367 0.713 0.738 0.570 0.782 0.800 0.788 0.814 0.804 0.811 0.592 0.702 0.096 0.561 0.703 0.845 0.684 0.754 0.702 0.708 0.713 0.352 0.606 0.791 0.779 0.628 0.730 0.785 0.799 0.744 0.713 0.741 0.724 0.691 0.805 0.809 0.755 0.747 0.689 0.719 0.805 0.809 0.850 0.874 0.885 0.784 0.806
UNDERSTANDING 0.548 0.652 0.366 0.644 0.670 0.538 0.708 0.712 0.707 0.742 0.661 0.670 0.511 0.598 0.116 0.481 0.599 0.685 0.602 0.680 0.622 0.618 0.631 0.330 0.579 0.672 0.633 0.563 0.633 0.644 0.651 0.661 0.629 0.629 0.614 0.622 0.727 0.728 0.674 0.649 0.605 0.613 0.692 0.696 0.761 0.793 0.809 0.617 0.713
LANGUAGE 0.613 0.680 0.524 0.688 0.692 0.624 0.733 0.732 0.735 0.755 0.786 0.783 0.746 0.799 0.665 0.732 0.790 0.732 0.710 0.729 0.740 0.738 0.747 0.610 0.705 0.715 0.776 0.766 0.714 0.744 0.733 0.613 0.638 0.618 0.677 0.613 0.632 0.750 0.735 0.564 0.685 0.682 0.722 0.724 0.769 0.781 0.780 0.654 0.708
KNOWLEDGE 0.516 0.627 0.354 0.476 0.496 0.505 0.677 0.710 0.689 0.733 0.553 0.543 0.406 0.470 0.266 0.358 0.404 0.601 0.536 0.633 0.568 0.571 0.547 0.066 0.536 0.580 0.526 0.582 0.511 0.542 0.533 0.581 0.546 0.546 0.517 0.519 0.663 0.678 0.530 0.585 0.469 0.426 0.595 0.597 0.693 0.725 0.581 0.489 0.521
COT 0.438 0.561 0.220 0.552 0.530 0.350 0.548 0.568 0.550 0.582 0.485 0.500 0.336 0.431 0.188 0.365 0.445 0.505 0.446 0.488 0.522 0.521 0.530 0.255 0.446 0.545 0.479 0.498 0.466 0.416 0.424 0.502 0.437 0.474 0.506 0.440 0.503 0.725 0.492 0.398 0.443 0.462 0.570 0.581 0.653 0.684 - 0.377 0.563
MATHCOT 0.731 0.819 0.369 0.745 0.752 0.480 0.728 0.733 0.735 0.740 0.682 0.679 0.545 0.638 0.359 0.571 0.649 0.700 0.622 0.680 0.708 0.721 0.728 0.386 0.647 0.808 0.750 0.493 0.652 0.546 0.578 0.695 0.612 0.591 0.767 0.638 0.665 0.919 0.671 0.548 0.667 0.694 0.823 0.829 0.869 0.903 0.927 0.535 0.662
CODE 0.416 0.498 0.176 0.499 0.534 0.331 0.487 0.483 0.495 0.568 0.456 0.475 0.339 0.410 0.210 0.356 0.466 0.344 0.269 0.411 0.440 0.439 0.463 0.217 0.366 0.498 0.514 0.321 0.326 0.324 0.316 0.430 0.372 0.389 0.427 0.376 0.350 0.568 0.390 0.346 0.437 0.445 0.510 0.528 0.578 0.612 0.321 0.233 0.368
DISCIPLINES
NLP 0.620 0.720 0.408 0.647 0.670 0.568 0.751 0.767 0.755 0.786 0.729 0.728 0.588 0.667 0.263 0.549 0.648 0.774 0.650 0.713 0.675 0.678 0.677 0.329 0.609 0.722 0.723 0.642 0.685 0.739 0.737 0.681 0.655 0.667 0.647 0.637 0.744 0.755 0.693 0.687 0.632 0.630 0.731 0.734 0.791 0.818 0.772 0.712 0.725
MATH 0.597 0.712 0.294 0.674 0.659 0.398 0.634 0.650 0.637 0.653 0.590 0.597 0.445 0.544 0.265 0.482 0.564 0.638 0.528 0.594 0.612 0.613 0.629 0.318 0.556 0.711 0.634 0.451 0.558 0.483 0.508 0.611 0.527 0.525 0.646 0.543 0.585 0.817 0.612 0.493 0.576 0.599 0.741 0.747 0.799 0.843 0.927 0.455 0.615
SCIENCE 0.606 0.726 0.350 0.741 0.713 0.555 0.735 0.749 0.739 0.769 0.686 0.698 0.481 0.576 0.093 0.480 0.579 0.664 0.608 0.685 0.668 0.668 0.676 0.346 0.605 0.716 0.621 0.673 0.618 0.581 0.596 0.701 0.658 0.685 0.696 0.660 0.697 0.845 0.674 0.618 0.629 0.657 0.738 0.748 0.815 0.806 0.946 0.544 0.731
ENGINEERING 0.349 0.429 0.166 0.464 0.453 0.280 0.412 0.426 0.426 0.438 0.334 0.333 0.224 0.303 0.108 0.253 0.289 0.333 0.332 0.386 0.412 0.404 0.397 0.169 0.323 0.407 0.346 0.393 0.339 0.267 0.272 0.305 0.300 0.319 0.323 0.308 0.371 0.595 0.388 0.283 0.315 0.325 0.443 0.444 0.530 0.590 - 0.199 0.586
MEDICINE 0.379 0.504 0.216 0.524 0.540 0.400 0.591 0.590 0.595 0.648 0.521 0.530 0.346 0.464 0.069 0.347 0.469 0.515 0.544 0.620 0.568 0.572 0.577 0.243 0.496 0.543 0.512 0.447 0.541 0.485 0.503 0.558 0.507 0.525 0.537 0.501 0.633 0.672 0.510 0.482 0.459 0.478 0.574 0.580 0.655 0.702 0.598 0.457 0.635
HUMANITIES 0.472 0.594 0.291 0.572 0.615 0.485 0.643 0.652 0.645 0.679 0.593 0.610 0.416 0.521 0.094 0.395 0.517 0.603 0.545 0.624 0.571 0.563 0.578 0.292 0.529 0.609 0.552 0.536 0.552 0.535 0.547 0.605 0.567 0.567 0.588 0.567 0.671 0.740 0.597 0.544 0.527 0.533 0.629 0.638 0.716 0.742 0.600 0.508 0.660
BUSINESS 0.536 0.657 0.252 0.679 0.655 0.450 0.660 0.697 0.678 0.709 0.623 0.637 0.408 0.523 0.115 0.428 0.530 0.594 0.537 0.626 0.592 0.582 0.598 0.251 0.517 0.667 0.564 0.466 0.565 0.514 0.526 0.632 0.575 0.589 0.637 0.604 0.636 0.801 0.644 0.566 0.565 0.596 0.701 0.710 0.759 0.802 - 0.479 0.652
LAW 0.316 0.419 0.200 0.396 0.427 0.302 0.489 0.485 0.483 0.524 0.417 0.429 0.280 0.344 0.075 0.258 0.348 0.420 0.374 0.412 0.380 0.397 0.406 0.175 0.344 0.420 0.365 0.370 0.367 0.367 0.374 0.427 0.393 0.399 0.392 0.310 0.498 0.541 0.446 0.376 0.374 0.383 0.451 0.456 0.541 0.604 - 0.327 0.465
COMPOSITE AVERAGE
AVG 0.561 0.668 0.342 0.627 0.641 0.502 0.688 0.701 0.692 0.724 0.648 0.654 0.493 0.581 0.198 0.476 0.576 0.670 0.582 0.651 0.622 0.623 0.629 0.306 0.561 0.668 0.635 0.578 0.598 0.610 0.616 0.634 0.595 0.606 0.616 0.585 0.674 0.748 0.633 0.597 0.578 0.586 0.686 0.691 0.754 0.783 0.759 0.578 0.675

CODE MODELS:

TEST codegemma-2b codegemma-1.1-7b-it codegemma-7b CodeLlama-7b-hf Codestral-22B-v0.1 Codestral-22B-Instruct-v0.1 CodeQwen1.5-7B-Chat CodeQwen1.5-7B CodeQwen1.5-7B granite-8b-code-instruct Qwen2.5-Coder-0.5B-32k-Instruct Qwen2.5-Coder-1.5B-Instruct Qwen2.5-3B-32k-Instruct Qwen2.5-Coder-3B-Instruct Qwen2.5-Coder-7B-Instruct Qwen2.5-Coder-7B Qwen2.5-Coder-7B Qwen2.5-Coder-14B-Instruct Qwen2.5-Coder-14B Qwen2.5-Coder-32B-Instruct
params 2.51B 8.54B 8.54B 6.74B 22B 22B 7.25B 7.25B 7.25B 8.05B 0.49403B 1.54B 3.09B 3.09B 7.62B 7.62B 7.62B 14.77B 14.77B 32.76B
quant Q6_K Q6_K Q6_K IQ4_XS IQ4_XS IQ4_XS Q6_K Q6_K Q8_0 Q4_K_M Q6_K Q6_K Q6_K Q6_K IQ4_XS IQ4_XS Q6_K IQ4_XS IQ4_XS IQ4_XS
engine llama.cpp version: 4255 llama.cpp version: 4150 llama.cpp version: 4255 llama.cpp version: 4191 llama.cpp version: 4132 llama.cpp version: 4191 llama.cpp version: 4094 llama.cpp version: 4132 llama.cpp version: 4191 llama.cpp version: 4080 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4150 llama.cpp version: 4094 llama.cpp version: 4295 llama.cpp version: 4132 llama.cpp version: 4120 llama.cpp version: 4150 llama.cpp version: 4150
--------------------------------------------- -------------- --------------------- -------------- ----------------- -------------------- ----------------------------- --------------------- ---------------- ---------------- -------------------------- --------------------------------- ----------------------------- ------------------------- --------------------------- --------------------------- ------------------ ------------------ ---------------------------- ------------------- ----------------------------
HUMANEVAL 0.292 0.591 0.451 0.280 0.664 0.810 0.859 0.518 0.567 0.487 0.518 0.676 0.780 0.835 0.829 0.640 0.713 0.878 0.676 0.884
HUMANEVALP 0.201 0.475 0.335 0.182 0.554 0.682 0.701 0.414 0.445 0.365 0.432 0.567 0.682 0.719 0.707 0.530 0.579 0.756 0.536 0.756
MBPP 0.447 0.552 0.521 0.404 0.630 0.653 0.712 0.536 0.525 0.501 0.408 0.560 0.599 0.618 0.735 0.614 0.571 0.727 0.661 0.715
MBPPP 0.415 0.517 0.455 0.375 0.558 0.593 0.665 0.486 0.486 0.473 0.352 0.504 0.584 0.589 0.687 0.540 0.513 0.665 0.558 0.669
HUMANEVALFIM 0.268 - 0.463 - 0.719 0.719 0.731 0.518 0.475 0.402 0.518 0.524 - 0.634 0.493 0.713 0.756 0.829 0.518 0.890
HUMANEVALX_cpp 0.170 0.359 0.384 0.256 0.640 0.621 0.676 0.463 0.475 0.457 0.286 0.426 0.237 0.567 0.676 0.548 0.475 0.506 0.573 0.689
HUMANEVALX_java 0.341 0.469 0.493 0.371 0.756 0.670 0.774 0.591 0.609 0.524 0.512 0.609 0.615 0.743 0.798 0.725 0.652 0.201 0.762 0.841
HUMANEVALX_js 0.347 0.560 0.493 0.347 0.658 0.621 0.768 0.567 0.585 0.493 0.493 0.615 0.682 0.670 0.798 0.628 0.658 0.817 0.695 0.835
HUMANEVALX 0.286 0.463 0.457 0.325 0.684 0.638 0.739 0.540 0.556 0.491 0.430 0.550 0.512 0.660 0.758 0.634 0.595 0.508 0.676 0.788
CRUXEVAL_input 0.057 0.406 0.206 0.156 0.438 0.351 0.456 0.192 0.203 0.358 0.435 0.416 0.347 0.481 0.578 0.255 0.267 0.677 0.281 0.676
CRUXEVAL_output 0.253 0.368 0.306 0.281 0.465 0.447 0.363 0.363 0.363 0.322 0.278 0.332 0.311 0.413 0.507 0.381 0.435 0.577 0.422 0.610
CRUXEVAL 0.155 0.387 0.256 0.218 0.451 0.399 0.410 0.278 0.283 0.340 0.356 0.374 0.329 0.447 0.543 0.318 0.351 0.627 0.351 0.643
CRUXEVALFIM_input 0.325 - 0.378 - 0.295 0.351 0.237 0.206 0.210 0.171 0.017 0.155 - 0.208 0.322 0.296 0.313 0.421 0.346 0.515
CRUXEVALFIM_output 0.153 - 0.332 - 0.441 0.355 0.212 0.280 0.266 0.323 0.098 0.222 - 0.323 0.481 0.352 0.365 0.546 0.481 0.557
CRUXEVALFIM 0.239 - 0.355 - 0.368 0.353 0.225 0.243 0.238 0.247 0.058 0.188 - 0.266 0.401 0.324 0.339 0.483 0.413 0.536
CODE 0.237 0.441 0.352 0.266 0.483 0.467 0.447 0.336 0.342 0.348 0.278 0.368 0.449 0.453 0.548 0.413 0.427 0.593 0.458 0.648
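
NOTE ON AGGREGATE ROWS:
   For the rows checked here, an aggregate row (such as CRUXEVALFIM) matches the per-column mean of its sub-test rows to within display precision, and "-" appears in the aggregate wherever a sub-test has "-".
   The sketch below illustrates that check using the three CRUXEVALFIM rows copied from the table above.
   It is not the author's scoring code. The helper names (parse_row, column_means, matches) and the 0.0015 tolerance are assumptions made for this example, and the document does not state whether aggregates are computed as a mean of sub-test scores or by pooling questions.

```python
from typing import List, Optional, Tuple

MISSING = "-"  # assumption: "-" marks a test that has no result for that model

Scores = List[Optional[float]]


def parse_row(line: str) -> Tuple[str, Scores]:
    """Split 'LABEL v1 v2 ...' into the label and one score per model column."""
    label, *cells = line.split()
    return label, [None if c == MISSING else float(c) for c in cells]


def column_means(rows: List[Scores]) -> Scores:
    """Per-column mean over sub-test rows; None when any sub-test is missing."""
    means: Scores = []
    for column in zip(*rows):
        if any(v is None for v in column):
            means.append(None)
        else:
            means.append(sum(column) / len(column))
    return means


def matches(aggregate: Scores, means: Scores, tol: float = 0.0015) -> bool:
    """True when each aggregate cell agrees with the mean, or both are missing."""
    if len(aggregate) != len(means):
        return False
    for a, m in zip(aggregate, means):
        if a is None or m is None:
            if a is not m:  # one side missing, the other present
                return False
        elif abs(a - m) > tol:
            return False
    return True


# Rows copied verbatim from the CODE MODELS table.
_, fim_input = parse_row(
    "CRUXEVALFIM_input 0.325 - 0.378 - 0.295 0.351 0.237 0.206 0.210 0.171 "
    "0.017 0.155 - 0.208 0.322 0.296 0.313 0.421 0.346 0.515"
)
_, fim_output = parse_row(
    "CRUXEVALFIM_output 0.153 - 0.332 - 0.441 0.355 0.212 0.280 0.266 0.323 "
    "0.098 0.222 - 0.323 0.481 0.352 0.365 0.546 0.481 0.557"
)
label, fim_total = parse_row(
    "CRUXEVALFIM 0.239 - 0.355 - 0.368 0.353 0.225 0.243 0.238 0.247 "
    "0.058 0.188 - 0.266 0.401 0.324 0.339 0.483 0.413 0.536"
)

ok = matches(fim_total, column_means([fim_input, fim_output]))
print(label, "consistent with mean of sub-tests:", ok)
```

   The tolerance allows for the three-decimal display of every cell. The same check can be pointed at other aggregate rows, but agreement is not guaranteed where sub-tests have different question counts.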