LLM benchmarks for a wide range of models using custom prompts.

METHODOLOGY: All CoT (chain-of-thought) tests are zero-shot.
             Each MC (multiple-choice) test issues two queries: the first with answers in test order, the second with answers circularly shifted by 1.
             An MC question is scored correct only if both queries are answered correctly.
             Winogrande is scored using logprob completion (evaluates the probability of a common completion for the two possible cases).
             Tests are run using a modified llama.cpp server (supporting logprob completion mode) and/or a textsynth server where noted.
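
The MC and Winogrande scoring rules above can be sketched as follows. This is only an illustration under stated assumptions, not the harness actually used for these results: `ask` and `completion_logprob` are hypothetical callables standing in for the model server, and the shift direction (rotate right by one) is assumed, since the methodology does not specify it.

```python
from typing import Callable, List, Sequence


def circular_shift(options: Sequence[str], k: int = 1) -> List[str]:
    """Rotate options right by k: the option at index i moves to (i + k) % n.

    Assumes a non-empty option list. The direction is an assumption.
    """
    n = len(options)
    k %= n
    return list(options[n - k:]) + list(options[:n - k])


def mc_correct(
    question: str,
    options: Sequence[str],
    answer_idx: int,
    ask: Callable[[str, Sequence[str]], int],
) -> bool:
    """Score one MC question with two queries; both must be answered correctly.

    `ask` is a hypothetical callable that presents the question with the
    given option order and returns the index the model selected.
    """
    n = len(options)
    # Query 1: answers in test order.
    first_ok = ask(question, list(options)) == answer_idx
    # Query 2: answers shifted by 1, so the correct index shifts as well.
    second_ok = ask(question, circular_shift(options, 1)) == (answer_idx + 1) % n
    return first_ok and second_ok


def winogrande_choice(
    sentence: str,
    option1: str,
    option2: str,
    completion_logprob: Callable[[str, str], float],
) -> int:
    """Pick option 1 or 2 by comparing the logprob of the common completion.

    The text after the blank ("_") is the same for both cases; only the
    context (text before the blank plus the candidate option) differs.
    `completion_logprob(context, completion)` is a hypothetical scorer.
    """
    before, after = sentence.split("_", 1)
    lp1 = completion_logprob(before + option1, after)
    lp2 = completion_logprob(before + option2, after)
    return 1 if lp1 >= lp2 else 2
```

Requiring both orderings to be answered correctly penalizes a model that favors a fixed answer position: such a model can be right on at most one of the two queries.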

TESTS:
   KNOWLEDGE:
      TQA - Truthful QA
      JEOPARDY - 100 Question JEOPARDY quiz
   LANGUAGE:
      LAMBADA - Language Modeling Broadened to Account for Discourse Aspects
   UNDERSTANDING:
      WG - Winogrande
      BOOLQ - Boolean questions
      STORYCLOZE - Story questions
      OBQA - Open Book Question / Answer
      SIQA - Social IQ
      RACE - Reading comprehension dataset from examinations
      MMLU - Massive Multitask Language Understanding
      MEDQA - Medical QA
   REASONING:
      CSQA - Common Sense Question Answer
      COPA - Choice of Plausible Alternatives
      HELLASWAG - Hella Situations with Adversarial Generations
      PIQA - Physical Interaction: Question Answering
      ARC - AI2 Reasoning Challenge
      AGIEVAL - AGIEval logiqa, lsat, sat
      AGIEVALC  - Gaokao SAT, logiqa, jec (Chinese)
      MUSR - Multistep Soft Reasoning
   COT:
      GSM8K - Grade School Math CoT
      BBH - BIG-Bench Hard (Beyond the Imitation Game Benchmark) CoT
      MMLUPRO - Massive Multitask Language Understanding Pro CoT
      AGIEVAL - satmath, aquarat
      AGIEVALC  - mathcloze, mathqa (Chinese)
      MUSR - Multistep Soft Reasoning
      APPLE - 100 custom Apple Questions
   CODE:
      HUMANEVAL - Python
      HUMANEVALP - Python, extended test
      HUMANEVALX - Python, Java, Javascript, C++
      MBPP - Python
      MBPPP - Python, extended test
      CRUXEVAL - Python
      USE {TEST}FIM FOR THE FIM (FILL-IN-THE-MIDDLE) VERSION OF A TEST, e.g. HUMANEVAL -> HUMANEVALFIM
TEST gemma-2-2b-it gemma-2-9b-it gemma-2-9b-it gemma-2-9b-it gemma-2-27b-it glm-4-9b-chat granite-3.0-2b-instruct granite-3.0-8b-instruct internlm2_5-7b-chat Meta-Llama-3.1-8B-Instruct Meta-Llama-3.1-8B-Instruct Meta-Llama-3.1-8B-Instruct Llama-3.2-3B-Instruct Ministral-8B-Instruct-2410 Mistral-7B-Instruct-v0.3 Mistral-Nemo-12B-Instruct-2407 openchat-3.5-0106 openchat-3.6-8b-20240522 Phi-3-mini-128k-instruct Phi-3-mini-128k-instruct Phi-3.5-mini-8k-instruct Phi-3.5-mini-128k-instruct Phi-3-medium-128k-instruct Qwen2.5-Coder-7B-Instruct Qwen2-7B-Instruct Qwen2.5-7B-Instruct Qwen2.5-7B-Instruct Qwen2.5-32B-Instruct SOLAR-10.7B-Instruct-v1.0 solar-pro-preview-instruct
params 2.61B 9.24B 9.24B 9.24B 27.23B 9.40B 2.63B 8.17B 7.74B 8.03B 8.03B 8.03B 3.21B 8.02B 7.25B 12.25B 7.24B 8.03B 3.82B 3.82B 3.82B 3.82B 13.96B 7.62B 7.62B 7.62B 7.62B 32.76B 10.73B 22.14B
quant Q8_0 IQ4_XS Q4_K_M Q6_K IQ4_XS Q6_K Q6_K Q6_K IQ4_XS IQ4_XS Q4_K_M Q6_K Q6_K Q6_K Q8_0 IQ4_XS Q8_0 Q8_0 IQ4_XS Q6_K Q6_K Q6_K IQ4_XS Q6_K Q6_K IQ4_XS Q6_K IQ4_XS Q4_K_M IQ4_XS
engine llama.cpp version: 3496 llama.cpp version: 3334 llama.cpp version: 3325 llama.cpp version: 3266 llama.cpp version: 3389 llama.cpp version: 3334 llama.cpp version: 3985 (524afeec) llama.cpp version: 3985 (524afeec) llama.cpp version: 3496 llama.cpp version: 3707 llama.cpp version: 3731 llama.cpp version: 3428 llama.cpp version: 3825 llama.cpp version: 3927 (10433e8b) llama.cpp version: 3262 llama.cpp version: 3428 llama.cpp version: 3262 llama.cpp version: 3262 llama.cpp version: 3565 llama.cpp version: 3520 llama.cpp version: 3609 llama.cpp version: 3600 llama.cpp version: 3505 llama.cpp version: 3870 (841713e1) llama.cpp version: 3609 llama.cpp version: 3943 (cda0e4b6) llama.cpp version: 3870 (841713e1) llama.cpp version: 3821 llama.cpp version: 3235 llama.cpp version: 3790
--------------------------------------------- --------------- --------------- --------------- --------------- ---------------- --------------- ------------------------- ------------------------- --------------------- ---------------------------- ---------------------------- ---------------------------- ----------------------- ---------------------------- -------------------------- -------------------------------- ------------------- -------------------------- -------------------------- -------------------------- -------------------------- ---------------------------- ---------------------------- --------------------------- ------------------- --------------------- --------------------- ---------------------- --------------------------- ----------------------------
WG 0.701 0.756 0.761 0.762 0.772 0.753 0.679 0.719 0.822 0.750 0.745 0.741 0.685 0.748 0.751 0.770 0.783 0.760 0.728 0.727 0.744 0.734 0.744 - 0.705 0.709 0.709 0.746 0.759 0.779
LAMBADA 0.624 0.733 0.732 0.735 0.755 0.783 0.746 0.799 0.732 0.740 0.738 0.747 0.705 0.776 0.766 0.714 0.744 0.733 0.638 0.618 0.677 0.613 0.632 - 0.735 0.722 0.724 0.781 0.654 0.708
HELLASWAG 0.496 0.766 0.798 0.775 0.810 0.840 0.583 0.696 0.916 0.684 0.696 0.696 0.559 0.835 0.591 0.726 0.800 0.824 0.695 0.743 0.716 0.669 0.807 - 0.755 0.820 0.822 0.894 0.826 0.823
BOOLQ 0.542 0.684 0.701 0.687 0.739 0.625 0.540 0.594 0.572 0.587 0.576 0.610 0.478 0.574 0.658 0.677 0.719 0.677 0.562 0.540 0.562 0.573 0.635 - 0.633 0.617 0.623 0.701 0.651 0.665
STORYCLOZE 0.877 0.948 0.959 0.958 0.973 0.976 0.925 0.924 0.996 0.884 0.859 0.895 0.870 0.937 0.917 0.921 0.982 0.980 0.906 0.891 0.531 0.921 0.969 - 0.959 0.920 0.915 0.981 0.973 0.949
CSQA 0.631 0.741 0.751 0.751 0.763 0.733 0.615 0.692 0.760 0.698 0.678 0.686 0.642 0.669 0.627 0.664 0.796 0.875 0.656 0.675 0.669 0.660 0.751 - 0.728 0.768 0.781 0.823 0.714 0.726
OBQA 0.647 0.841 0.846 0.846 0.860 0.802 0.605 0.714 0.818 0.738 0.753 0.765 0.709 0.769 0.676 0.730 0.834 0.803 0.750 0.761 0.751 0.720 0.830 - 0.800 0.802 0.804 0.904 0.794 0.812
COPA 0.823 0.923 0.926 0.925 0.949 0.944 0.806 0.844 0.927 0.878 0.859 0.889 0.749 0.887 0.812 0.859 0.967 0.947 0.898 0.884 0.884 0.870 0.924 - 0.923 0.925 0.919 0.958 0.910 0.940
PIQA 0.591 0.799 0.803 0.801 0.841 0.779 0.497 0.712 0.787 0.707 0.716 0.725 0.637 0.694 0.708 0.777 0.794 0.771 0.745 0.741 0.733 0.677 0.827 - 0.784 0.794 0.807 0.870 0.799 0.803
SIQA 0.597 0.691 0.694 0.693 0.731 0.665 0.627 0.678 0.735 0.634 0.643 0.648 0.622 0.684 0.620 0.655 0.730 0.726 0.662 0.675 0.667 0.661 0.729 - 0.699 0.721 0.712 0.742 0.727 0.716
MEDQA 0.286 0.492 0.498 0.501 0.549 0.445 0.242 0.336 0.389 0.491 0.482 0.500 0.413 0.399 0.334 0.465 0.372 0.427 0.421 0.446 0.423 0.395 0.553 - 0.391 0.453 0.458 0.610 0.368 0.538
JEOPARDY 0.220 0.510 0.570 0.550 0.740 0.420 0.230 0.470 0.300 0.500 0.390 0.510 0.350 0.330 0.490 0.470 0.450 0.510 0.220 0.250 0.320 0.250 0.460 - 0.200 0.300 0.290 0.600 0.480 0.480
GSM8K 0.645 0.878 0.881 0.890 0.899 0.839 0.712 0.811 0.855 0.859 0.859 0.872 0.822 0.887 0.611 0.828 0.775 0.811 0.706 0.833 0.855 0.714 0.667 - 0.871 0.909 0.917 0.950 0.730 0.812
HUMANEVAL 0.408 0.646 0.621 0.658 0.743 0.731 0.469 0.621 0.317 0.634 0.628 0.652 0.585 0.768 0.390 0.689 0.310 0.689 0.628 0.652 0.682 0.621 0.268 0.835 0.719 0.798 0.817 0.884 0.402 0.506
HUMANEVALP 0.310 0.548 0.530 0.548 0.615 0.634 0.402 0.536 0.250 0.518 0.524 0.536 0.475 0.628 0.329 0.554 0.304 0.579 0.524 0.567 0.591 0.524 0.219 - 0.609 0.670 0.658 0.768 0.286 0.432
MBPP 0.459 0.591 0.595 0.595 0.642 0.591 0.470 0.548 0.521 0.575 0.583 0.564 0.498 0.626 0.451 0.513 0.470 0.326 0.482 0.451 0.610 0.498 0.412 - 0.587 0.669 0.661 0.684 0.373 0.330
MBPPP 0.441 0.575 0.575 0.584 0.638 0.575 0.433 0.517 0.477 0.526 0.535 0.540 0.482 0.580 0.397 0.428 0.410 0.151 0.459 0.450 0.575 0.477 0.401 - 0.580 0.633 0.651 0.700 0.366 0.321
HUMANEVALX_cpp 0.152 0.500 0.500 0.512 0.579 0.432 0.231 0.347 0.231 0.420 0.445 0.457 0.323 0.573 0.225 0.292 0.347 0.060 0.243 0.243 0.280 0.219 0 - 0.542 0.475 0.554 0.701 0.237 0.219
HUMANEVALX_java 0.390 0.628 0.640 0.640 0.768 0.628 0.353 0.365 0.170 0.396 0.097 0.487 0.439 0.731 0.256 0.201 0.170 0.073 0.030 0.024 0.079 0.060 0 - 0.628 0.695 0.737 0.865 0.347 0
HUMANEVALX_js 0.353 0.548 0.567 0.579 0.743 0.628 0.426 0.560 0.518 0.542 0.548 0.560 0.067 0.701 0.402 0.615 0.170 0.018 0.378 0.524 0.560 0.451 0.219 - 0.731 0.719 0.750 0.847 0.048 0.359
HUMANEVALX 0.298 0.558 0.569 0.577 0.697 0.563 0.337 0.424 0.306 0.453 0.363 0.502 0.276 0.668 0.294 0.369 0.229 0.050 0.217 0.264 0.306 0.243 0.073 - 0.634 0.630 0.680 0.804 0.211 0.193
CRUXEVAL_input 0.321 0.455 0.443 0.462 0.485 0.406 0.288 0.317 0.367 0.408 0.440 0.435 0.353 0.428 0.276 0.442 0.323 0.383 0.375 0.390 0.398 0.388 0.456 0.585 0.351 0.387 0.412 0.517 0.131 0.438
CRUXEVAL_output 0.280 0.373 0.372 0.375 0.482 0.338 0.282 0.351 0.276 0.341 0.356 0.360 0.291 0.377 0.303 0.365 0.318 0.323 0.321 0.340 0.342 0.296 0.423 0.511 0.050 0.382 0.386 0.455 0.222 0.388
CRUXEVAL 0.300 0.414 0.408 0.418 0.483 0.372 0.285 0.334 0.321 0.375 0.398 0.397 0.322 0.403 0.290 0.403 0.321 0.353 0.348 0.365 0.370 0.342 0.440 0.548 0.200 0.385 0.399 0.486 0.176 0.413
TQA_mc 0.356 0.679 0.696 0.701 0.767 0.640 0.402 0.559 0.648 0.626 0.630 0.564 0.555 0.561 0.549 0.542 0.624 0.547 0.631 0.643 0.621 0.581 0.742 - 0.503 0.654 0.657 0.804 0.563 0.709
TQA_tf 0.510 0.675 0.719 0.692 0.725 0.457 0.421 0.473 0.593 0.552 0.563 0.512 0.566 0.572 0.548 0.479 0.548 0.536 0.532 0.541 0.483 0.487 0.670 - 0.435 0.574 0.568 0.731 0.487 0.485
TQA 0.492 0.675 0.716 0.693 0.730 0.478 0.419 0.483 0.599 0.560 0.571 0.518 0.565 0.571 0.548 0.486 0.556 0.537 0.544 0.553 0.499 0.498 0.679 - 0.442 0.583 0.578 0.740 0.496 0.511
ARC_challenge 0.671 0.874 0.881 0.882 0.897 0.853 0.614 0.742 0.812 0.769 0.766 0.776 0.706 0.775 0.688 0.773 0.797 0.819 0.808 0.833 0.813 0.802 0.888 - 0.818 0.843 0.851 0.934 0.755 0.871
ARC_easy 0.846 0.949 0.950 0.952 0.963 0.940 0.808 0.893 0.936 0.898 0.904 0.906 0.843 0.910 0.843 0.908 0.910 0.914 0.926 0.935 0.934 0.932 0.965 - 0.923 0.945 0.946 0.978 0.899 0.962
ARC 0.788 0.925 0.927 0.929 0.941 0.911 0.744 0.843 0.895 0.855 0.858 0.863 0.798 0.866 0.792 0.864 0.873 0.883 0.887 0.901 0.894 0.889 0.939 - 0.888 0.911 0.915 0.963 0.851 0.932
RACE_high 0.580 0.817 0.817 0.802 0.833 0.787 0.553 0.642 0.826 0.679 0.676 0.679 0.589 0.736 0.607 0.726 0.771 0.773 0.625 0.648 0.613 0.625 0.779 - 0.779 0.779 0.788 0.882 0.741 0.764
RACE_middle 0.610 0.860 0.860 0.849 0.883 0.825 0.631 0.735 0.863 0.747 0.744 0.734 0.680 0.800 0.696 0.782 0.807 0.834 0.697 0.722 0.706 0.692 0.832 - 0.827 0.841 0.853 0.923 0.809 0.824
RACE 0.589 0.830 0.829 0.816 0.847 0.798 0.576 0.669 0.837 0.699 0.696 0.695 0.615 0.755 0.633 0.743 0.781 0.791 0.646 0.670 0.640 0.645 0.795 - 0.793 0.797 0.807 0.894 0.761 0.781
MMLU
abstract_algebra 0.140 0.320 0.320 0.330 0.310 0.210 0.170 0.200 0.230 0.210 0.140 0.200 0.270 0.210 0.190 0.330 0.220 0.170 0.330 0.250 0.300 0.210 0.390 - 0.480 0.440 0.430 0.600 0.140 0.340
anatomy 0.414 0.604 0.611 0.626 0.607 0.511 0.362 0.474 0.614 0.540 0.511 0.555 0.540 0.540 0.447 0.555 0.552 0.537 0.540 0.577 0.570 0.585 0.666 - 0.488 0.622 0.622 0.733 0.477 0.607
astronomy 0.467 0.753 0.740 0.760 0.828 0.651 0.519 0.651 0.723 0.651 0.657 0.677 0.565 0.671 0.573 0.651 0.620 0.646 0.684 0.677 0.703 0.703 0.796 - 0.723 0.763 0.769 0.875 0.586 0.756
business_ethics 0.430 0.620 0.630 0.620 0.670 0.610 0.460 0.490 0.640 0.540 0.520 0.550 0.480 0.540 0.520 0.630 0.540 0.530 0.570 0.620 0.620 0.620 0.710 - 0.670 0.680 0.710 0.800 0.570 0.740
clinical_knowledge 0.550 0.724 0.743 0.743 0.788 0.622 0.490 0.584 0.716 0.656 0.698 0.675 0.592 0.664 0.581 0.664 0.637 0.649 0.709 0.686 0.713 0.698 0.750 - 0.686 0.709 0.713 0.815 0.577 0.735
college_biology 0.625 0.847 0.833 0.854 0.895 0.715 0.486 0.659 0.701 0.722 0.687 0.722 0.625 0.708 0.625 0.694 0.631 0.659 0.756 0.791 0.805 0.763 0.819 - 0.743 0.784 0.784 0.923 0.618 0.833
college_chemistry 0.330 0.440 0.450 0.470 0.430 0.380 0.340 0.350 0.410 0.390 0.390 0.400 0.310 0.370 0.350 0.340 0.380 0.400 0.450 0.450 0.460 0.430 0.440 - 0.380 0.480 0.490 0.530 0.330 0.450
college_computer_science 0.290 0.480 0.460 0.460 0.580 0.480 0.250 0.400 0.490 0.380 0.340 0.400 0.350 0.410 0.320 0.400 0.440 0.400 0.470 0.440 0.480 0.410 0.510 - 0.520 0.620 0.590 0.720 0.370 0.480
college_mathematics 0.100 0.290 0.270 0.260 0.300 0.280 0.120 0.180 0.230 0.220 0.200 0.260 0.210 0.180 0.180 0.180 0.200 0.200 0.270 0.200 0.270 0.170 0.340 - 0.260 0.380 0.350 0.540 0.170 0.310
college_medicine 0.491 0.635 0.641 0.658 0.716 0.589 0.439 0.520 0.606 0.612 0.624 0.589 0.491 0.543 0.456 0.572 0.566 0.543 0.572 0.572 0.612 0.566 0.682 - 0.589 0.606 0.624 0.739 0.485 0.653
college_physics 0.235 0.401 0.382 0.352 0.421 0.323 0.215 0.225 0.362 0.313 0.294 0.313 0.303 0.264 0.254 0.245 0.205 0.303 0.392 0.352 0.333 0.294 0.372 - 0.343 0.401 0.372 0.656 0.245 0.382
computer_security 0.580 0.710 0.740 0.730 0.710 0.730 0.630 0.690 0.690 0.700 0.690 0.690 0.620 0.640 0.600 0.680 0.670 0.660 0.690 0.680 0.700 0.650 0.700 - 0.650 0.720 0.710 0.800 0.610 0.700
conceptual_physics 0.395 0.608 0.629 0.638 0.727 0.587 0.353 0.463 0.612 0.455 0.476 0.463 0.361 0.404 0.365 0.446 0.468 0.472 0.519 0.565 0.565 0.553 0.685 - 0.561 0.642 0.642 0.834 0.442 0.693
econometrics 0.271 0.566 0.566 0.557 0.587 0.464 0.228 0.377 0.482 0.447 0.438 0.482 0.359 0.333 0.318 0.429 0.362 0.424 0.473 0.456 0.456 0.421 0.543 - 0.535 0.605 0.596 0.675 0.345 0.526
electrical_engineering 0.462 0.558 0.558 0.558 0.593 0.572 0.358 0.427 0.544 0.558 0.496 0.524 0.462 0.455 0.393 0.482 0.468 0.510 0.544 0.468 0.496 0.475 0.565 - 0.606 0.606 0.606 0.703 0.324 0.586
elementary_mathematics 0.261 0.484 0.470 0.476 0.476 0.373 0.171 0.277 0.529 0.333 0.312 0.357 0.280 0.309 0.222 0.312 0.283 0.304 0.412 0.373 0.423 0.388 0.537 - 0.481 0.560 0.568 0.838 0.304 0.455
formal_logic 0.214 0.412 0.420 0.293 0.468 0.357 0.230 0.333 0.396 0.373 0.357 0.420 0.253 0.349 0.277 0.396 0.261 0.261 0.420 0.412 0.452 0.380 0.523 - 0.420 0.452 0.428 0.626 0.190 0.484
global_facts 0.100 0.330 0.320 0.330 0.370 0.240 0.120 0.230 0.390 0.160 0.150 0.150 0.110 0.200 0.160 0.280 0.160 0.210 0.190 0.220 0.240 0.130 0.360 - 0.300 0.260 0.260 0.430 0.220 0.240
high_school_biology 0.651 0.845 0.845 0.851 0.890 0.809 0.577 0.696 0.790 0.732 0.738 0.729 0.677 0.741 0.654 0.748 0.677 0.706 0.770 0.774 0.793 0.774 0.861 - 0.761 0.803 0.806 0.896 0.670 0.858
high_school_chemistry 0.315 0.571 0.561 0.586 0.600 0.517 0.295 0.413 0.522 0.453 0.458 0.467 0.433 0.389 0.310 0.379 0.384 0.428 0.536 0.482 0.512 0.492 0.551 - 0.467 0.532 0.536 0.724 0.315 0.581
high_school_computer_science 0.440 0.690 0.720 0.710 0.770 0.660 0.500 0.630 0.700 0.610 0.590 0.610 0.540 0.610 0.490 0.630 0.580 0.560 0.620 0.610 0.610 0.580 0.690 - 0.710 0.770 0.770 0.870 0.560 0.720
high_school_european_history 0.672 0.800 0.800 0.806 0.830 0.830 0.624 0.745 0.751 0.690 0.696 0.709 0.672 0.696 0.678 0.709 0.733 0.733 0.715 0.690 0.727 0.672 0.806 - 0.751 0.787 0.800 0.818 0.745 0.787
high_school_geography 0.676 0.863 0.868 0.878 0.888 0.818 0.555 0.732 0.843 0.772 0.747 0.757 0.671 0.727 0.671 0.752 0.727 0.732 0.787 0.747 0.792 0.737 0.843 - 0.797 0.833 0.833 0.883 0.717 0.843
high_school_government_and_politics 0.730 0.921 0.926 0.926 0.963 0.870 0.709 0.823 0.865 0.808 0.792 0.818 0.725 0.849 0.805 0.875 0.863 0.836 0.823 0.875 0.849 0.834 0.937 - 0.865 0.917 0.917 0.968 0.805 0.917
high_school_macroeconomics 0.487 0.704 0.706 0.717 0.758 0.653 0.415 0.520 0.633 0.558 0.535 0.556 0.497 0.525 0.478 0.528 0.532 0.521 0.661 0.635 0.646 0.635 0.756 - 0.687 0.684 0.684 0.825 0.496 0.710
high_school_mathematics 0.200 0.307 0.325 0.277 0.325 0.240 0.159 0.177 0.348 0.244 0.225 0.255 0.233 0.270 0.162 0.162 0.237 0.203 0.244 0.203 0.214 0.203 0.281 - 0.351 0.440 0.422 0.537 0.174 0.266
high_school_microeconomics 0.521 0.780 0.772 0.801 0.852 0.773 0.462 0.634 0.714 0.647 0.630 0.684 0.575 0.609 0.540 0.630 0.603 0.654 0.789 0.773 0.794 0.743 0.886 - 0.802 0.827 0.827 0.907 0.594 0.848
high_school_physics 0.218 0.463 0.463 0.423 0.496 0.364 0.145 0.225 0.337 0.298 0.337 0.317 0.211 0.284 0.165 0.251 0.245 0.331 0.384 0.384 0.377 0.384 0.463 - 0.298 0.470 0.456 0.695 0.192 0.456
high_school_psychology 0.761 0.895 0.889 0.896 0.910 0.858 0.669 0.785 0.838 0.840 0.831 0.834 0.761 0.809 0.764 0.814 0.817 0.797 0.838 0.823 0.855 0.844 0.884 - 0.827 0.858 0.856 0.902 0.779 0.880
high_school_statistics 0.347 0.592 0.601 0.574 0.615 0.500 0.319 0.393 0.490 0.458 0.435 0.462 0.342 0.467 0.361 0.476 0.402 0.421 0.504 0.472 0.569 0.523 0.615 - 0.550 0.615 0.648 0.782 0.407 0.555
high_school_us_history 0.656 0.829 0.829 0.829 0.867 0.867 0.602 0.740 0.759 0.764 0.759 0.784 0.696 0.823 0.699 0.799 0.782 0.792 0.754 0.764 0.759 0.735 0.833 - 0.803 0.843 0.852 0.906 0.803 0.857
high_school_world_history 0.700 0.881 0.864 0.872 0.881 0.827 0.637 0.763 0.780 0.805 0.784 0.789 0.725 0.776 0.720 0.797 0.750 0.826 0.759 0.729 0.746 0.742 0.835 - 0.805 0.818 0.827 0.877 0.783 0.848
human_aging 0.497 0.681 0.690 0.690 0.739 0.591 0.506 0.560 0.650 0.569 0.600 0.618 0.569 0.605 0.542 0.609 0.632 0.623 0.565 0.596 0.582 0.547 0.672 - 0.609 0.681 0.690 0.771 0.587 0.695
human_sexuality 0.519 0.761 0.730 0.746 0.755 0.633 0.549 0.702 0.702 0.671 0.702 0.671 0.587 0.679 0.569 0.618 0.615 0.646 0.648 0.618 0.664 0.587 0.748 - 0.648 0.740 0.717 0.839 0.584 0.770
international_law 0.644 0.785 0.785 0.801 0.760 0.752 0.628 0.694 0.785 0.752 0.727 0.776 0.710 0.752 0.710 0.694 0.768 0.743 0.735 0.694 0.735 0.727 0.826 - 0.776 0.768 0.785 0.867 0.685 0.859
jurisprudence 0.611 0.794 0.794 0.785 0.833 0.722 0.638 0.675 0.712 0.694 0.740 0.731 0.574 0.722 0.626 0.750 0.719 0.719 0.731 0.722 0.722 0.750 0.787 - 0.787 0.759 0.750 0.824 0.654 0.824
logical_fallacies 0.625 0.792 0.805 0.811 0.797 0.754 0.656 0.711 0.785 0.736 0.723 0.736 0.687 0.705 0.660 0.730 0.666 0.691 0.773 0.791 0.785 0.754 0.852 - 0.736 0.773 0.766 0.877 0.641 0.809
machine_learning 0.241 0.401 0.410 0.437 0.571 0.401 0.276 0.366 0.464 0.339 0.312 0.366 0.285 0.366 0.321 0.366 0.366 0.348 0.473 0.383 0.437 0.375 0.500 - 0.383 0.437 0.410 0.642 0.312 0.526
management 0.737 0.805 0.825 0.825 0.844 0.766 0.572 0.718 0.815 0.757 0.757 0.737 0.669 0.766 0.708 0.699 0.737 0.737 0.805 0.786 0.786 0.776 0.815 - 0.737 0.805 0.825 0.864 0.747 0.786
marketing 0.760 0.871 0.880 0.863 0.893 0.858 0.726 0.803 0.829 0.811 0.824 0.837 0.799 0.794 0.756 0.811 0.833 0.816 0.841 0.824 0.820 0.803 0.880 - 0.841 0.888 0.893 0.901 0.782 0.846
medical_genetics 0.580 0.780 0.750 0.780 0.810 0.640 0.570 0.650 0.690 0.700 0.690 0.720 0.660 0.660 0.600 0.740 0.630 0.660 0.680 0.710 0.710 0.700 0.830 - 0.690 0.770 0.770 0.900 0.640 0.820
miscellaneous 0.698 0.832 0.832 0.830 0.854 0.796 0.650 0.776 0.787 0.773 0.776 0.773 0.736 0.759 0.727 0.782 0.766 0.756 0.770 0.756 0.777 0.759 0.837 - 0.814 0.807 0.814 0.885 0.746 0.828
moral_disputes 0.526 0.686 0.671 0.680 0.736 0.612 0.456 0.592 0.589 0.618 0.578 0.621 0.560 0.572 0.524 0.552 0.598 0.645 0.644 0.635 0.615 0.621 0.696 - 0.658 0.664 0.676 0.760 0.554 0.708
moral_scenarios 0.227 0.330 0.410 0.325 0.366 0.360 0.082 0.164 0.280 0.184 0.153 0.205 0.410 0.246 0.122 0.226 0.229 0.327 0.317 0.288 0.366 0.404 0.538 - 0.336 0.318 0.368 0.565 0.188 0.477
nutrition 0.591 0.683 0.669 0.683 0.758 0.653 0.486 0.611 0.650 0.686 0.660 0.689 0.620 0.647 0.555 0.614 0.624 0.633 0.630 0.630 0.669 0.620 0.751 - 0.692 0.745 0.745 0.797 0.575 0.722
philosophy 0.527 0.654 0.641 0.658 0.713 0.659 0.472 0.623 0.636 0.614 0.639 0.617 0.578 0.598 0.587 0.633 0.612 0.580 0.630 0.598 0.630 0.588 0.704 - 0.646 0.675 0.688 0.778 0.554 0.717
prehistory 0.518 0.727 0.730 0.728 0.783 0.663 0.484 0.638 0.669 0.679 0.691 0.700 0.604 0.623 0.580 0.675 0.697 0.648 0.688 0.675 0.697 0.663 0.774 - 0.675 0.762 0.756 0.861 0.595 0.783
professional_accounting 0.326 0.507 0.489 0.496 0.514 0.425 0.262 0.354 0.453 0.358 0.386 0.393 0.336 0.382 0.336 0.421 0.361 0.382 0.421 0.397 0.418 0.386 0.578 - 0.443 0.457 0.460 0.631 0.358 0.514
professional_law 0.307 0.486 0.478 0.478 0.528 0.408 0.329 0.383 0.359 0.359 0.377 0.397 0.369 0.383 0.333 0.379 0.399 0.383 0.402 0.405 0.410 0.401 0.498 - 0.423 0.401 0.402 0.541 0.350 0.481
professional_medicine 0.485 0.749 0.774 0.756 0.794 0.680 0.375 0.588 0.665 0.713 0.727 0.724 0.713 0.672 0.564 0.705 0.642 0.619 0.654 0.643 0.687 0.658 0.794 - 0.658 0.680 0.683 0.845 0.645 0.794
professional_psychology 0.477 0.722 0.717 0.728 0.805 0.609 0.449 0.535 0.599 0.616 0.619 0.642 0.509 0.565 0.521 0.602 0.560 0.588 0.648 0.638 0.655 0.617 0.764 - 0.671 0.707 0.702 0.810 0.529 0.759
public_relations 0.563 0.690 0.700 0.700 0.672 0.627 0.500 0.581 0.636 0.536 0.518 0.518 0.545 0.581 0.554 0.627 0.581 0.518 0.572 0.627 0.554 0.572 0.672 - 0.636 0.627 0.645 0.663 0.554 0.581
security_studies 0.616 0.710 0.751 0.746 0.763 0.632 0.514 0.648 0.759 0.665 0.653 0.665 0.616 0.673 0.600 0.608 0.612 0.628 0.693 0.697 0.669 0.673 0.738 - 0.665 0.718 0.718 0.775 0.575 0.730
sociology 0.666 0.800 0.825 0.815 0.860 0.741 0.626 0.716 0.810 0.751 0.756 0.786 0.741 0.771 0.716 0.786 0.776 0.781 0.800 0.800 0.820 0.781 0.850 - 0.825 0.815 0.825 0.860 0.741 0.835
us_foreign_policy 0.690 0.838 0.868 0.868 0.840 0.800 0.700 0.790 0.790 0.790 0.770 0.800 0.800 0.840 0.757 0.740 0.787 0.787 0.760 0.740 0.760 0.770 0.850 - 0.800 0.820 0.820 0.880 0.757 0.810
virology 0.433 0.457 0.475 0.472 0.506 0.439 0.367 0.439 0.415 0.415 0.439 0.439 0.415 0.475 0.387 0.421 0.436 0.448 0.391 0.379 0.403 0.367 0.487 - 0.457 0.463 0.457 0.518 0.381 0.487
world_religions 0.678 0.800 0.817 0.800 0.847 0.766 0.643 0.760 0.801 0.766 0.783 0.789 0.742 0.777 0.747 0.789 0.800 0.747 0.754 0.742 0.742 0.725 0.801 - 0.766 0.818 0.818 0.871 0.705 0.812
MMLU 0.475 0.646 0.652 0.647 0.687 0.595 0.429 0.530 0.595 0.556 0.552 0.570 0.525 0.550 0.486 0.555 0.544 0.553 0.590 0.578 0.599 0.578 0.682 - 0.610 0.639 0.643 0.757 0.509 0.666
AGIEVAL
aquarat 0.460 0.677 0.696 0.665 0.602 0.637 0.488 0.614 0.657 0.681 0.673 0.598 0.633 0.712 0.279 0.322 0.582 0.157 0.212 0.338 0.409 0.574 0.614 - 0.712 0.799 0.830 0.870 0.425 0.590
logiqa 0.321 0.443 0.440 0.447 0.477 0.416 0.282 0.324 0.393 0.311 0.321 0.328 0.265 0.324 0.264 0.311 0.330 0.327 0.308 0.285 0.281 0.267 0.405 - 0.359 0.427 0.436 0.554 0.290 0.391
lsatar 0.191 0.234 0.239 0.208 0.260 0.217 0.234 0.226 0.239 0.278 0.278 0.295 0.239 0.186 0.186 0.200 0.278 0.208 0.260 0.252 0.256 0.247 0.208 - 0.252 0.260 0.300 0.400 0.173 0.226
lsatlr 0.337 0.625 0.627 0.635 0.654 0.515 0.296 0.415 0.513 0.425 0.433 0.441 0.327 0.447 0.366 0.445 0.523 0.570 0.431 0.429 0.415 0.386 0.598 - 0.456 0.598 0.603 0.811 0.449 0.576
lsatrc 0.431 0.747 0.747 0.750 0.754 0.643 0.390 0.557 0.643 0.591 0.635 0.624 0.486 0.635 0.520 0.650 0.613 0.617 0.513 0.557 0.531 0.524 0.672 - 0.583 0.661 0.687 0.836 0.654 0.706
saten 0.665 0.839 0.844 0.834 0.868 0.820 0.519 0.791 0.820 0.776 0.762 0.781 0.689 0.825 0.679 0.747 0.786 0.786 0.708 0.713 0.713 0.708 0.800 - 0.781 0.810 0.844 0.922 0.757 0.796
satmath 0.627 0.900 0.872 0.886 0.768 0.868 0.577 0.745 0.904 0.768 0.822 0.618 0.845 0.886 0.400 0.395 0.690 0.413 0.331 0.509 0.713 0.754 0.727 - 0.900 0.963 0.963 0.981 0.540 0.768
AGIEVAL 0.398 0.600 0.600 0.598 0.602 0.546 0.364 0.473 0.547 0.489 0.501 0.480 0.433 0.512 0.359 0.416 0.501 0.432 0.381 0.409 0.429 0.438 0.546 - 0.522 0.599 0.616 0.734 0.434 0.544
AGIEVALC_biology - - - - - 0.765 - - 0.830 - - - - - - - 0.304 0.408 - - - - - - - - - - 0.356 -
AGIEVALC_chemistry - - - - - 0.696 - - 0.598 - - - - - - - 0.215 0.313 - - - - - - - - - - 0.171 -
AGIEVALC_chinese - - - - - 0.650 - - 0.682 - - - - - - - 0.300 0.313 - - - - - - - - - - 0.239 -
AGIEVALC_english - - - - - 0.833 - - 0.947 - - - - - - - 0.807 0.862 - - - - - - - - - - 0.830 -
AGIEVALC_geography - - - - - 0.723 - - 0.778 - - - - - - - 0.371 0.457 - - - - - - - - - - 0.346 -
AGIEVALC_history - - - - - 0.842 - - 0.817 - - - - - - - 0.378 0.485 - - - - - - - - - - 0.357 -
AGIEVALC_jecqaca - - - - - 0.450 - - 0.514 - - - - - - - 0.223 0.273 - - - - - - - - - - 0.185 -
AGIEVALC_jecqakd - - - - - 0.574 - - 0.620 - - - - - - - 0.281 0.275 - - - - - - - - - - 0.242 -
AGIEVALC_logiqa - - - - - 0.486 - - 0.519 - - - - - - - 0.317 0.330 - - - - - - - - - - 0.274 -
AGIEVALC_mathcloze - - - - - 0.228 - - 0.194 - - - - - - - 0.050 0 - - - - - - - - - - 0.084 -
AGIEVALC_mathqa - - - - - 0.619 - - 0.662 - - - - - - - 0.296 0.334 - - - - - - - - - - 0.261 -
AGIEVALC_physics - - - - - 0.436 - - 0.436 - - - - - - - 0.183 0.235 - - - - - - - - - - 0.178 -
AGIEVALC - - - - - 0.597 - - 0.632 - - - - - - - 0.322 0.363 - - - - - - - - - - 0.297 -
BBH
boolean_expressions 0.556 0.764 0.776 0.768 0.460 0.868 0.812 0.856 0.688 0.824 0.832 0.844 0.480 0.728 0.764 0.780 0.824 0.664 0.848 0.800 0.852 0.832 0.696 - 0.808 0.864 0.880 0.808 0.720 0.540
causal_judgement 0.524 0.609 0.604 0.598 0.604 0.550 0.566 0.598 0.689 0.545 0.518 0.540 0.518 0.593 0.588 0.625 0.614 0.604 0.508 0.598 0.588 0.593 0.588 - 0.625 0.508 0.513 0.700 0.636 0.641
date_understanding 0.592 0.780 0.764 0.748 0.788 0.572 0.560 0.684 0.832 0.732 0.724 0.716 0.664 0.772 0.548 0.668 0.592 0.608 0.464 0.568 0.696 0.576 0.780 - 0.544 0.764 0.740 0.872 0.556 0.724
disambiguation_qa 0.532 0.688 0.652 0.660 0.720 0.636 0.628 0.640 0.732 0.552 0.540 0.516 0.472 0.644 0.600 0.596 0.728 0.704 0.640 0.592 0.720 0.752 0.692 - 0.660 0.656 0.636 0.780 0.576 0.640
dyck_languages 0.476 0.752 0.720 0.728 0.600 0.544 0.560 0.704 0.728 0.832 0.724 0.796 0.680 0.756 0.744 0.712 0.664 0.732 0.532 0.424 0.580 0.468 0.532 - 0.756 0.868 0.836 0.820 0.684 0.572
formal_fallacies 0.532 0.868 0.824 0.832 0.760 0.660 0.960 0.640 0.920 0.920 0.988 0.984 0.816 0.532 0.852 0.996 0.632 0.564 0.876 0.920 0.808 0.808 0.944 - 0.632 0.628 0.628 0.812 0.776 0.576
geometric_shapes 0.204 0.400 0.384 0.436 0.420 0.400 0.268 0.368 0.840 0.440 0.488 0.440 0.416 0.520 0.288 0.404 0.348 0.344 0.372 0.248 0.416 0.292 0.328 - 0.356 0.544 0.604 0.640 0.268 0.400
hyperbaton 0.704 0.888 0.856 0.884 0.836 0.824 0.612 0.724 0.928 0.824 0.768 0.880 0.624 0.804 0.656 0.644 0.828 0.724 0.968 0.940 0.936 0.936 0.952 - 0.704 0.832 0.792 0.956 0.744 0.900
logical_deduction_five_objects 0.300 0.596 0.636 0.568 0.608 0.516 0.352 0.464 0.660 0.540 0.536 0.568 0.484 0.592 0.352 0.556 0.384 0.472 0.464 0.432 0.632 0.532 0.532 - 0.556 0.752 0.728 0.924 0.436 0.612
logical_deduction_seven_objects 0.284 0.580 0.564 0.560 0.552 0.500 0.284 0.388 0.648 0.472 0.484 0.488 0.408 0.500 0.296 0.452 0.320 0.400 0.476 0.308 0.568 0.500 0.444 - 0.464 0.668 0.656 0.864 0.388 0.560
logical_deduction_three_objects 0.440 0.860 0.868 0.844 0.892 0.840 0.524 0.664 0.896 0.760 0.764 0.804 0.652 0.844 0.608 0.800 0.664 0.620 0.724 0.688 0.844 0.804 0.884 - 0.736 0.940 0.956 0.992 0.664 0.888
movie_recommendation 0.568 0.560 0.552 0.552 0.508 0.648 0.440 0.528 0.884 0.548 0.540 0.536 0.456 0.604 0.508 0.448 0.552 0.540 0.544 0.540 0.520 0.508 0.584 - 0.548 0.556 0.536 0.648 0.584 0.676
multistep_arithmetic_two 0.288 0.480 0.472 0.488 0.472 0.524 0.340 0.464 0.372 0.712 0.704 0.700 0.532 0.540 0.108 0.432 0.164 0.292 0.624 0.272 0.836 0.420 0.460 - 0.532 0.896 0.948 0.976 0.252 0.536
navigate 0.580 0.588 0.588 0.596 0.648 0.420 0.580 0.592 0.452 0.520 0.580 0.580 0.580 0.572 0.600 0.588 0.568 0.580 0.644 0.596 0.588 0.584 0.636 - 0.596 0.596 0.596 0.684 0.520 0.652
object_counting 0.612 0.800 0.808 0.848 0.856 0.660 0.624 0.760 0.644 0.820 0.772 0.864 0.808 0.908 0.608 0.716 0.564 0.796 0.696 0.244 0.836 0.344 0.372 - 0.660 0.848 0.804 0.896 0.680 0.756
penguins_in_a_table 0.506 0.883 0.869 0.890 0.842 0.917 0.527 0.705 0.815 0.856 0.821 0.856 0.801 0.917 0.623 0.801 0.575 0.760 0.486 0.465 0.883 0.712 0.815 - 0.835 0.945 0.924 0.986 0.636 0.828
reasoning_about_colored_objects 0.484 0.700 0.700 0.744 0.900 0.796 0.548 0.668 0.904 0.760 0.820 0.824 0.568 0.904 0.608 0.752 0.648 0.752 0.664 0.656 0.808 0.656 0.896 - 0.764 0.904 0.868 0.984 0.600 0.840
ruin_names 0.480 0.720 0.692 0.716 0.760 0.652 0.348 0.528 0.932 0.676 0.680 0.744 0.532 0.556 0.400 0.584 0.408 0.592 0.528 0.596 0.612 0.600 0.636 - 0.564 0.440 0.544 0.760 0.536 0.616
salient_translation_error_detection 0.420 0.532 0.580 0.548 0.568 0.488 0.360 0.516 0.644 0.436 0.504 0.512 0.464 0.556 0.444 0.472 0.524 0.560 0.448 0.408 0.520 0.532 0.596 - 0.456 0.560 0.572 0.700 0.532 0.588
snarks 0.584 0.646 0.696 0.691 0.719 0.707 0.561 0.685 0.820 0.651 0.685 0.651 0.657 0.691 0.606 0.691 0.533 0.640 0.612 0.735 0.747 0.786 0.747 - 0.657 0.747 0.780 0.865 0.646 0.837
sports_understanding 0.724 0.824 0.796 0.788 0.816 0.468 0.708 0.780 0.920 0.636 0.744 0.720 0.644 0.640 0.716 0.800 0.836 0.792 0.600 0.596 0.596 0.600 0.748 - 0.776 0.676 0.684 0.748 0.828 0.740
temporal_sequences 0.124 0.680 0.680 0.708 0.748 0.840 0.216 0.576 0.976 0.804 0.788 0.856 0.712 0.360 0.404 0.544 0.524 0.508 0.612 0.800 0.784 0.508 0.892 - 0.596 0.800 0.820 0.992 0.568 0.920
tracking_shuffled_objects_five_objects 0.216 0.536 0.600 0.600 0.692 0.536 0.588 0.536 0.572 0.568 0.596 0.656 0.500 0.792 0.344 0.736 0.356 0.468 0.664 0.612 0.940 0.712 0.776 - 0.476 0.840 0.908 0.972 0.364 0.420
tracking_shuffled_objects_seven_objects 0.152 0.576 0.572 0.572 0.640 0.436 0.484 0.596 0.480 0.488 0.536 0.592 0.420 0.728 0.296 0.596 0.284 0.396 0.640 0.568 0.896 0.612 0.652 - 0.416 0.800 0.868 0.980 0.372 0.436
tracking_shuffled_objects_three_objects 0.292 0.716 0.708 0.732 0.848 0.696 0.604 0.592 0.528 0.704 0.704 0.728 0.608 0.832 0.436 0.832 0.412 0.724 0.836 0.572 0.960 0.788 0.888 - 0.524 0.832 0.872 0.996 0.536 0.660
web_of_lies 0.508 0.524 0.556 0.520 0.488 0.488 0.480 0.544 0.536 0.440 0.512 0.512 0.544 0.492 0.488 0.512 0.488 0.512 0.492 0.512 0.488 0.492 0.548 - 0.552 0.528 0.532 0.624 0.488 0.520
word_sorting 0.100 0.424 0.424 0.404 0.540 0.392 0.216 0.360 0.452 0.556 0.500 0.512 0.360 0.340 0.280 0.392 0.344 0.500 0.224 0.168 0.204 0.152 0.236 - 0.208 0.212 0.220 0.400 0.336 0.276
BBH 0.432 0.663 0.661 0.664 0.674 0.608 0.507 0.595 0.719 0.650 0.659 0.681 0.566 0.652 0.506 0.631 0.531 0.583 0.602 0.549 0.696 0.592 0.658 - 0.587 0.709 0.718 0.827 0.549 0.637
MUSR
murder_mystery 0.552 0.688 0.672 0.668 0.576 0.584 0.544 0.492 0.572 0.616 0.596 0.584 0.576 0.624 0.516 0.656 0.592 0.272 0.612 0.636 0.636 0.620 0.600 - 0.516 0.604 0.584 0.640 0.588 0.532
object_placements 0.449 0.539 0.535 0.519 0.542 0.531 0.480 0.496 0.492 0.542 0.488 0.546 0.523 0.484 0.453 0.542 0.527 0.523 0.437 0.496 0.503 0.457 0.519 - 0.511 0.531 0.554 0.265 0.500 0.425
team_allocation 0.352 0.460 0.484 0.460 0.476 0.588 0.364 0.460 0.500 0.416 0.468 0.460 0.396 0.504 0.356 0.448 0.456 0.516 0.548 0.520 0.536 0.480 0.560 - 0.440 0.512 0.476 0.592 0.504 0.556
MUSR 0.451 0.562 0.563 0.548 0.531 0.567 0.462 0.482 0.521 0.525 0.517 0.530 0.498 0.537 0.441 0.548 0.525 0.437 0.531 0.550 0.558 0.518 0.559 - 0.489 0.548 0.538 0.497 0.530 0.503
MMLUPRO
biology 0.582 0.754 0.687 0.747 0.772 0.695 0.538 0.608 0.687 0.675 0.672 0.686 0.623 0.659 0.582 0.651 0.619 0.592 0.656 0.676 0.702 0.662 0.725 - 0.668 0.709 0.729 0.764 0.570 0.684
business 0.356 0.579 0.679 0.583 0.626 0.562 0.307 0.423 0.510 0.558 0.536 0.558 0.458 0.536 0.335 0.510 0.404 0.429 0.396 0.465 0.571 0.509 0.476 - 0.590 0.647 0.661 0.755 0.335 0.496
chemistry 0.271 0.502 0.639 0.503 0.546 0.467 0.227 0.291 0.343 0.451 0.439 0.467 0.390 0.375 - 0.366 0.263 0.271 0.367 0.431 0.463 0.296 0.312 - 0.413 0.559 0.580 0.701 0.196 0.407
computer_science 0.300 0.473 0.507 0.482 0.560 0.502 0.329 0.443 0.487 0.495 0.534 0.485 0.414 0.541 - 0.456 0.426 0.424 0.426 0.458 0.475 0.448 0.521 - 0.482 0.590 0.604 0.734 0.339 0.512
economics 0.408 0.622 0.648 0.668 0.678 0.610 0.395 0.510 0.540 0.569 0.571 0.568 0.492 0.542 - 0.541 0.484 0.490 0.555 0.558 0.609 0.587 0.575 - 0.574 0.674 0.687 0.787 0.463 -
engineering 0.253 0.391 0.406 0.406 0.414 0.298 0.204 0.284 0.301 0.391 0.391 0.378 0.302 0.330 - 0.317 0.237 0.237 0.264 0.297 0.297 0.283 0.342 - 0.356 0.418 0.420 0.573 0.180 -
health 0.333 0.561 0.535 0.545 0.621 0.496 0.273 0.433 0.458 0.556 0.562 0.558 0.437 0.464 - 0.498 0.422 0.414 0.433 0.479 0.515 0.466 0.588 - 0.442 0.556 0.569 0.690 0.381 -
history 0.275 0.482 0.522 0.493 0.490 0.438 0.244 0.388 0.391 0.433 0.409 0.451 0.380 0.409 - 0.425 0.359 0.364 0.398 0.388 0.380 0.380 0.496 - 0.391 0.459 0.464 0.624 0.380 -
law 0.198 0.356 0.353 0.343 0.405 0.284 0.204 0.266 0.237 0.295 0.309 0.303 0.243 0.276 - 0.279 0.238 0.262 0.285 0.306 0.276 0.000 0.384 - 0.271 0.300 0.292 0.455 0.217 -
math 0.309 0.537 0.621 0.538 0.570 0.523 0.318 0.417 0.508 0.532 0.516 0.555 0.511 0.543 - 0.416 0.369 0.418 0.369 0.468 0.522 0.458 0.391 - 0.592 0.712 0.723 0.814 0.270 -
other 0.325 0.528 0.542 0.551 0.574 0.458 0.308 0.423 0.440 0.482 0.478 0.487 0.389 0.464 - 0.456 0.416 0.401 0.396 0.457 0.500 0.433 0.532 - 0.444 0.529 0.551 0.664 0.400 -
philosophy 0.272 0.460 0.436 0.448 0.488 0.412 0.286 0.366 0.372 0.424 0.438 0.382 0.326 0.366 - 0.390 0.360 0.346 0.354 0.386 0.406 0.390 0.494 - 0.374 0.480 0.464 0.599 0.326 -
physics 0.275 0.491 0.494 0.501 0.559 0.461 0.222 0.318 0.344 0.484 0.492 0.488 0.397 0.414 - 0.370 0.317 0.309 0.352 0.423 0.455 0.425 0.367 - 0.457 0.589 0.602 0.543 0.240 -
psychology 0.494 0.666 0.657 0.647 0.692 0.602 0.436 0.531 0.572 0.624 0.601 0.637 0.518 0.595 - 0.588 0.543 0.552 0.573 0.583 0.621 0.572 0.676 - 0.595 0.636 0.644 0.749 0.525 -
MMLUPRO 0.326 0.524 0.553 0.528 0.568 0.480 0.298 0.395 0.432 0.494 0.491 0.499 0.419 0.458 0.453 0.436 0.376 0.382 0.405 0.451 0.482 0.408 0.470 - 0.475 0.564 0.575 0.671 0.326 0.509
CATEGORIES
REASONING 0.570 0.782 0.800 0.788 0.814 0.811 0.592 0.702 0.845 0.702 0.708 0.713 0.606 0.779 0.628 0.730 0.785 0.799 0.713 0.741 0.724 0.691 0.805 - 0.755 0.805 0.809 0.874 0.784 0.806
UNDERSTANDING 0.538 0.708 0.712 0.707 0.742 0.670 0.511 0.598 0.685 0.622 0.618 0.631 0.579 0.633 0.563 0.633 0.644 0.651 0.629 0.629 0.614 0.622 0.727 - 0.674 0.692 0.696 0.793 0.617 0.713
LANGUAGE 0.624 0.733 0.732 0.735 0.755 0.783 0.746 0.799 0.732 0.740 0.738 0.747 0.705 0.776 0.766 0.714 0.744 0.733 0.638 0.618 0.677 0.613 0.632 - 0.735 0.722 0.724 0.781 0.654 0.708
KNOWLEDGE 0.505 0.677 0.710 0.689 0.733 0.544 0.455 0.518 0.601 0.568 0.571 0.547 0.536 0.569 0.582 0.546 0.542 0.533 0.546 0.546 0.517 0.519 0.663 - 0.500 0.591 0.589 0.726 0.489 0.559
COT 0.350 0.548 0.568 0.550 0.582 0.500 0.336 0.431 0.505 0.522 0.521 0.530 0.446 0.479 0.498 0.466 0.416 0.424 0.437 0.474 0.506 0.440 0.503 - 0.492 0.570 0.581 0.684 0.377 0.563
MATHCOT 0.482 0.729 0.734 0.735 0.740 0.671 0.571 0.663 0.693 0.710 0.722 0.729 0.647 0.767 0.495 0.671 0.545 0.574 0.614 0.592 0.771 0.640 0.666 - 0.685 0.830 0.838 0.911 0.535 0.683
CODE 0.331 0.487 0.483 0.495 0.568 0.475 0.339 0.410 0.344 0.440 0.439 0.463 0.366 0.514 0.321 0.433 0.324 0.316 0.372 0.389 0.419 0.376 0.350 0.471 0.390 0.510 0.528 0.627 0.233 0.368
DISCIPLINES
NLP 0.568 0.751 0.767 0.755 0.786 0.728 0.588 0.667 0.774 0.675 0.678 0.677 0.609 0.723 0.642 0.685 0.739 0.737 0.655 0.667 0.647 0.637 0.744 - 0.693 0.731 0.734 0.818 0.712 0.725
MATH 0.398 0.633 0.649 0.635 0.652 0.592 0.453 0.552 0.633 0.613 0.612 0.629 0.556 0.636 0.452 0.563 0.482 0.505 0.528 0.525 0.647 0.543 0.584 - 0.617 0.740 0.747 0.844 0.454 0.625
SCIENCE 0.555 0.735 0.749 0.739 0.769 0.698 0.508 0.602 0.664 0.668 0.668 0.676 0.605 0.645 0.673 0.636 0.581 0.596 0.658 0.685 0.696 0.660 0.697 - 0.678 0.746 0.754 0.813 0.544 0.763
ENGINEERING 0.280 0.412 0.426 0.426 0.438 0.333 0.224 0.303 0.333 0.412 0.404 0.397 0.323 0.346 0.393 0.339 0.267 0.272 0.300 0.319 0.323 0.308 0.371 - 0.388 0.443 0.444 0.590 0.199 0.586
MEDICINE 0.400 0.591 0.590 0.595 0.648 0.530 0.346 0.464 0.515 0.568 0.572 0.577 0.496 0.512 0.447 0.541 0.485 0.503 0.507 0.525 0.537 0.501 0.633 - 0.510 0.574 0.580 0.702 0.457 0.635
HUMANITIES 0.485 0.643 0.652 0.645 0.679 0.610 0.430 0.540 0.603 0.571 0.563 0.578 0.529 0.563 0.536 0.564 0.535 0.547 0.567 0.567 0.588 0.567 0.671 - 0.591 0.629 0.638 0.742 0.508 0.701
LAW 0.302 0.489 0.485 0.483 0.524 0.431 0.303 0.373 0.420 0.380 0.397 0.406 0.344 0.387 0.370 0.390 0.367 0.374 0.393 0.399 0.392 0.310 0.498 - 0.409 0.431 0.435 0.586 0.327 0.527