Quick Start

This guide will help you get started with scorio for performance evaluation.

Basic Concepts

Data Format

scorio works with outcome matrices:

R: M × N integer matrix, where:
- M = number of questions being evaluated
- N = number of trials (samples) per question
- each entry is an outcome category in {0, …, C}
w: weight vector of length C+1, where w[k] is the score assigned to category k
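In practice, outcomes often start out as labels rather than integers. Below is a minimal sketch of turning per-trial labels into R and checking that R and w agree. The label names and the `LABELS` mapping are hypothetical and used only for illustration; only R and w come from the description above.

```python
import numpy as np

# Hypothetical label-to-category mapping (an assumption for illustration).
LABELS = {"incorrect": 0, "partial": 1, "correct": 2}

# One row per question, one entry per trial.
raw = [
    ["incorrect", "partial", "correct", "correct", "partial"],
    ["partial", "partial", "incorrect", "correct", "correct"],
]

# Map each label to its integer category to get the M x N outcome matrix R.
R = np.array([[LABELS[label] for label in row] for row in raw], dtype=int)

# One weight per category 0..C, so len(w) must equal C + 1.
w = np.array([0.0, 0.5, 1.0])

M, N = R.shape
C = len(w) - 1
assert R.min() >= 0 and R.max() <= C, "categories must lie in {0, ..., C}"

# w[R] replaces every category with its score (NumPy integer-array indexing).
scores = w[R]
print(M, N, C)
print(scores)
```

Indexing w with R is a convenient way to see exactly which score each trial contributes.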

Binary Evaluation

For binary outcomes (correct/incorrect):

import numpy as np
from scorio import eval

# 2 questions, 5 trials each
# 0 = incorrect, 1 = correct
R = np.array([[0, 1, 1, 0, 1],
              [1, 1, 0, 1, 1]])

# Weight vector: 0→0.0, 1→1.0
w = np.array([0.0, 1.0])

# Bayesian evaluation
mu, sigma = eval.bayes(R, w)
print(f"Estimate: {mu:.4f} ± {sigma:.4f}")
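To build intuition for what a Bayesian estimate of this kind computes, here is a minimal sketch. It assumes each question has an independent Dirichlet(1, ..., 1) prior over the C+1 categories and reports the posterior mean and standard deviation of the average weighted score. The helper `dirichlet_score` is hypothetical; whether `eval.bayes` uses exactly this model and prior is an assumption, so its numbers may differ.

```python
import numpy as np

def dirichlet_score(R, w):
    """Posterior mean and std of the average weighted score.

    Hypothetical helper, not scorio's API. Assumes an independent
    Dirichlet(1, ..., 1) prior per question.
    """
    R = np.asarray(R)
    w = np.asarray(w, dtype=float)
    M = R.shape[0]
    K = len(w)  # number of categories, C + 1
    # Per-question category counts, shape (M, K).
    counts = np.stack([(R == k).sum(axis=1) for k in range(K)], axis=1)
    alpha = counts + 1.0                       # posterior concentrations
    alpha0 = alpha.sum(axis=1)                 # total concentration per question
    p = alpha / alpha0[:, None]                # posterior mean of category probabilities
    mean_q = p @ w                             # E[w . pi] per question
    # Dirichlet identity: Var[w . pi] = (E_p[w^2] - E_p[w]^2) / (alpha0 + 1)
    var_q = (p @ w**2 - mean_q**2) / (alpha0 + 1.0)
    mu = mean_q.mean()
    sigma = np.sqrt(var_q.sum()) / M           # questions treated as independent
    return mu, sigma

R = np.array([[0, 1, 1, 0, 1],
              [1, 1, 0, 1, 1]])
w = np.array([0.0, 1.0])
mu, sigma = dirichlet_score(R, w)
print(f"Sketch estimate: {mu:.4f} ± {sigma:.4f}")
```

With a uniform prior, each question's estimate is pulled slightly toward the middle of the score range, and the uncertainty shrinks as N grows.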

Multi-Category Evaluation

For outcomes with multiple categories:

# 0 = incorrect, 1 = partial, 2 = correct
R = np.array([[0, 1, 2, 2, 1],
              [1, 1, 0, 2, 2]])

# Weight vector for 3 categories
w = np.array([0.0, 0.5, 1.0])

mu, sigma = eval.bayes(R, w)
print(f"Estimate: {mu:.4f} ± {sigma:.4f}")
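The weight vector decides how much each category is worth. As a quick sketch, the plain average of w[R] (not the Bayesian estimate) shows how the same outcomes score under a partial-credit weighting and under a strict one. Both weightings are illustrative choices, not library defaults.

```python
import numpy as np

R = np.array([[0, 1, 2, 2, 1],
              [1, 1, 0, 2, 2]])

# Two illustrative weightings of categories 0, 1, 2.
partial_credit = np.array([0.0, 0.5, 1.0])  # "partial" earns half credit
strict = np.array([0.0, 0.0, 1.0])          # only "correct" earns credit

# w[R] maps each outcome to its score; .mean() is the plain average.
print(f"partial credit: {partial_credit[R].mean():.2f}")
print(f"strict:         {strict[R].mean():.2f}")
```

This prints 0.60 for partial credit and 0.40 for the strict weighting, because the four "partial" outcomes only count under the first vector.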

Using Prior Knowledge

To incorporate prior outcomes, pass a second outcome matrix R0 with one row per question, as in R:

R = np.array([[0, 1, 2, 2, 1],
              [1, 1, 0, 2, 2]])
w = np.array([0.0, 0.5, 1.0])

# Prior outcomes (2 trials per question)
R0 = np.array([[0, 2],
               [1, 2]])

mu, sigma = eval.bayes(R, w, R0)
print(f"With prior: {mu:.4f} ± {sigma:.4f}")
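One simple way to picture the effect of R0 is to treat prior trials as extra pseudo-observations for the same questions. This sketch assumes exactly that: prior counts are added to a uniform Dirichlet prior. The helper `posterior_mean` is hypothetical, and whether scorio combines R0 this way is an assumption.

```python
import numpy as np

def posterior_mean(R, w, R0=None):
    """Average posterior-mean score under a uniform Dirichlet prior per question.

    Hypothetical helper. Assumption: prior outcomes R0 act as extra
    pseudo-observations, i.e. their category counts are added to the
    Dirichlet concentrations.
    """
    w = np.asarray(w, dtype=float)
    # Prior trials become extra columns for the same questions (rows must match).
    data = R if R0 is None else np.hstack([R0, R])
    K = len(w)
    counts = np.stack([(data == k).sum(axis=1) for k in range(K)], axis=1)
    alpha = counts + 1.0
    p = alpha / alpha.sum(axis=1, keepdims=True)
    return float((p @ w).mean())

R = np.array([[0, 1, 2, 2, 1],
              [1, 1, 0, 2, 2]])
w = np.array([0.0, 0.5, 1.0])
R0 = np.array([[0, 2],
               [1, 2]])

print(f"without prior: {posterior_mean(R, w):.4f}")
print(f"with prior:    {posterior_mean(R, w, R0):.4f}")
```

Under this assumption, the prior outcomes shift each question's estimate toward what they observed and add evidence, which also narrows the uncertainty.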

Pass@k Metrics

Standard Pass@k (at least one correct):

R = np.array([[0, 1, 1, 0, 1],
              [1, 1, 0, 1, 1]])

# Probability that at least 1 of k=2 samples is correct
pass_2 = eval.pass_at_k(R, k=2)
print(f"Pass@2: {pass_2:.4f}")
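For reference, the widely used unbiased pass@k estimator (Chen et al., 2021) computes, for each question with n trials and c correct ones, the probability that a random size-k subset of those trials contains at least one correct trial: 1 - C(n-c, k) / C(n, k). The helper `pass_at_k_sketch` is hypothetical; that `eval.pass_at_k` uses this exact estimator is an assumption.

```python
from math import comb

import numpy as np

def pass_at_k_sketch(R, k):
    """Unbiased pass@k estimator, averaged over questions (hypothetical helper).

    Expects a binary R (1 = correct).
    """
    R = np.asarray(R)
    n = R.shape[1]
    # comb(n - c, k) is 0 when fewer than k incorrect trials exist.
    per_question = [1 - comb(n - int(c), k) / comb(n, k) for c in R.sum(axis=1)]
    return float(np.mean(per_question))

R = np.array([[0, 1, 1, 0, 1],
              [1, 1, 0, 1, 1]])
print(f"Pass@2 (sketch): {pass_at_k_sketch(R, k=2):.4f}")
```

Sampling without replacement from the observed trials is what makes this estimator unbiased, unlike the naive 1 - (1 - c/n)^k.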

Pass^k (all correct):

# Probability that both of k=2 samples are correct
pass_hat_2 = eval.pass_hat_k(R, k=2)
print(f"Pass^2: {pass_hat_2:.4f}")
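Pass^k can be estimated the same way, by drawing k trials without replacement and requiring all of them to be correct: C(c, k) / C(n, k) per question. This sketch assumes that estimator; the helper `pass_hat_k_sketch` is hypothetical and may not match `eval.pass_hat_k` exactly.

```python
from math import comb

import numpy as np

def pass_hat_k_sketch(R, k):
    """Probability that a random size-k subset of a question's trials is all
    correct, averaged over questions (hypothetical helper, binary R)."""
    R = np.asarray(R)
    n = R.shape[1]
    per_question = [comb(int(c), k) / comb(n, k) for c in R.sum(axis=1)]
    return float(np.mean(per_question))

R = np.array([[0, 1, 1, 0, 1],
              [1, 1, 0, 1, 1]])
print(f"Pass^2 (sketch): {pass_hat_k_sketch(R, k=2):.4f}")
```

Because requiring all k trials to be correct is stricter than requiring at least one, Pass^k never exceeds Pass@k for the same R and k.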

Generalized Pass@k (G-Pass@k) with a threshold τ:

# Probability that at least 50% of k=3 samples (i.e., at least 2) are correct
g_pass = eval.g_pass_at_k_tau(R, k=3, tau=0.5)
print(f"G-Pass@3(τ=0.5): {g_pass:.4f}")
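A common way to define G-Pass@k with threshold τ is as a hypergeometric tail: the probability that at least ⌈τ·k⌉ of k trials, drawn without replacement, are correct. This sketch assumes that definition; the helper `g_pass_at_k_tau_sketch` is hypothetical. It also shows why the metric is "generalized": setting ⌈τ·k⌉ = 1 recovers pass@k, and τ = 1 recovers pass^k.

```python
from math import ceil, comb

import numpy as np

def g_pass_at_k_tau_sketch(R, k, tau):
    """P(at least ceil(tau * k) of k sampled trials are correct), averaged over
    questions (hypothetical helper, binary R)."""
    R = np.asarray(R)
    n = R.shape[1]
    j_min = ceil(tau * k)
    total = 0.0
    for c in R.sum(axis=1):
        c = int(c)
        # Sum the hypergeometric probabilities for j = j_min, ..., k correct draws.
        hits = sum(comb(c, j) * comb(n - c, k - j) for j in range(j_min, k + 1))
        total += hits / comb(n, k)
    return total / R.shape[0]

R = np.array([[0, 1, 1, 0, 1],
              [1, 1, 0, 1, 1]])
print(f"G-Pass@3(tau=0.5): {g_pass_at_k_tau_sketch(R, k=3, tau=0.5):.4f}")
print(f"tau=0.5, k=2 (pass@2): {g_pass_at_k_tau_sketch(R, k=2, tau=0.5):.4f}")
print(f"tau=1.0, k=2 (pass^2): {g_pass_at_k_tau_sketch(R, k=2, tau=1.0):.4f}")
```

Note that tau * k is a float, so thresholds such as 0.7 * 10 can round up unexpectedly under ceil; choosing τ values that are exact in binary avoids this.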

Simple Average

For basic accuracy:

R = np.array([[0, 1, 1, 0, 1],
              [1, 1, 0, 1, 1]])

avg, avg_sigma = eval.avg(R)
print(f"Average: {avg:.4f} ± {avg_sigma:.4f}")
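For binary outcomes, a plain average is closely related to pass@1. This sketch assumes the point estimate from `eval.avg` is the mean over all entries of R. It does not reproduce `avg_sigma`, whose definition is not given here. When every question has the same number of trials N, averaging the per-question success rates equals averaging every entry of R.

```python
import numpy as np

R = np.array([[0, 1, 1, 0, 1],
              [1, 1, 0, 1, 1]])

# Fraction of correct trials per question (the per-question pass@1 estimate).
per_question = R.mean(axis=1)

# With equal N per question, the mean of per-question rates equals R.mean().
print(per_question)
print(per_question.mean(), R.mean())
```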

Next Steps

- See Examples for more detailed use cases.
- Check scorio.eval for the complete API documentation.
- Read the paper: https://arxiv.org/abs/2510.04265