scorio.eval
Evaluation metrics for outcome matrices. Import the public API with
from scorio import eval and call functions as eval.bayes(...),
eval.pass_at_k(...), and so on.
Bayes@N
- bayes(R, w=None, R0=None)[source]
Performance evaluation using the Bayes@N framework.
References
Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2026). Don’t Pass@k: A Bayesian Framework for Large Language Model Evaluation. ICLR 2026, arXiv:2510.04265. https://arxiv.org/abs/2510.04265
- Parameters:
R (
ndarray) – \(M \times N\) int matrix with entries in \(\{0,\ldots,C\}\). Row \(\alpha\) are the N outcomes for question \(\alpha\).w (
ndarray|None) – length \((C+1)\) weight vector \((w_0,\ldots,w_C)\) that maps category k to score \(w_k\).R0 (
ndarray|None) – optional \(M \times D\) int matrix supplying D prior outcomes per row. If omitted, \(D=0\).
- Returns:
\((\mu, \sigma)\) performance metric estimate and its uncertainty.
- Return type:
- Notation:
\(\delta_{a,b}\) is the Kronecker delta. For each row \(\alpha\) and class \(k \in \{0,\ldots,C\}\):
\[ \begin{align}\begin{aligned}n_{\alpha k} &= \sum_{i=1}^N \delta_{k, R_{\alpha i}} \quad \text{(counts in R)}\\n^0_{\alpha k} &= 1 + \sum_{i=1}^D \delta_{k, R^0_{\alpha i}} \quad \text{(Dirichlet(+1) prior)}\\\nu_{\alpha k} &= n_{\alpha k} + n^0_{\alpha k}\end{aligned}\end{align} \]Effective sample size: \(T = 1 + C + D + N\) (scalar)
- Formula:
- \[\mu = w_0 + \frac{1}{M \cdot T} \sum_{\alpha=1}^M \sum_{j=0}^C \nu_{\alpha j} (w_j - w_0)\]\[\sigma = \sqrt{ \frac{1}{M^2(T+1)} \sum_{\alpha=1}^M \left[ \sum_j \frac{\nu_{\alpha j}}{T} (w_j - w_0)^2 - \left( \sum_j \frac{\nu_{\alpha j}}{T} (w_j - w_0) \right)^2 \right] }\]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 2, 2, 1], ... [1, 1, 0, 2, 2]]) >>> w = np.array([0.0, 0.5, 1.0]) >>> R0 = np.array([[0, 2], ... [1, 2]])
With prior (D=2 → T=10):
>>> mu, sigma = bayes(R, w, R0) >>> round(mu, 6), round(sigma, 6) (0.575, 0.084275)
Without prior (D=0 → T=8):
>>> mu2, sigma2 = bayes(R, w) >>> round(mu2, 6), round(sigma2, 6) (0.5625, 0.091998)
- bayes_ci(R, w=None, R0=None, confidence=0.95, bounds=None)[source]
Bayes@N posterior mean, uncertainty, and credible interval.
This is the interval-valued companion to
bayes(). It computes \(\mu\) and \(\sigma\) with the Bayes@N posterior moments, then forms a central normal-approximation credible interval.- Parameters:
R (
ndarray) – \(M \times N\) int matrix with entries in \(\{0,\ldots,C\}\).w (
ndarray|None) – optional length \((C+1)\) weight vector \((w_0,\ldots,w_C)\). If omitted,Rmust be binary andw = [0, 1]is used.R0 (
ndarray|None) – optional \(M \times D\) int matrix supplying prior outcomes for each row.confidence (
float) – credibility level of the interval.bounds (
tuple[float,float] |None) – optional(lo, hi)clipping bounds for the interval.
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\).
- Return type:
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> mu, sigma, lo, hi = bayes_ci(R, bounds=(0.0, 1.0)) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.642857, 0.118451, 0.4107, 0.875)
avg@N
- avg(R, w=None)[source]
Avg@N plus a Bayesian uncertainty estimate (uniform prior, no R0).
Under a uniform Dirichlet prior (\(D = 0\)), the Bayesian posterior mean \(\mu\) is an affine transform of the naive (weighted) average a, and the standard deviations are related by Eq. 20 of the Bayes@N paper:
\[\sigma_{\text{avg}} = \frac{T}{N}\,\sigma_{\text{Bayes}}\]This lets you report the familiar avg@N while using the Bayesian framework of
scorioto compute uncertainty on the same scale, without relying on CLT/Wald intervals or bootstrap resampling.- Parameters:
- Returns:
\((a,\; \sigma_a)\) where a is the (weighted) average and \(\sigma_a\) is the Bayesian uncertainty rescaled to the avg@N scale.
- Return type:
- Formula:
Let \(T = 1 + C + N\) (uniform prior, \(D = 0\)).
\[ \begin{align}\begin{aligned}a &= \text{avg}(R, w)\\\sigma_a &= \frac{T}{N}\,\sigma_{\text{Bayes}}(R, w)\end{aligned}\end{align} \]
Examples
Binary (no weights):
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> a, sigma = avg(R) >>> round(a, 6), round(sigma, 6) (0.7, 0.165831)
Weighted:
>>> R = np.array([[0, 1, 2, 2, 1], ... [1, 1, 0, 2, 2]]) >>> w = np.array([0.0, 0.5, 1.0]) >>> a, sigma = avg(R, w) >>> round(a, 6), round(sigma, 6) (0.6, 0.147196)
- avg_ci(R, w=None, confidence=0.95, bounds=None)[source]
Avg@N with Bayesian \(\sigma\) and a normal-approximation credible interval (CrI).
Combines
avg()with a symmetric normal credible interval clipped to optionalbounds.- Parameters:
R (
ndarray) – \(M \times N\) int matrix with entries in \(\{0, \ldots, C\}\). Row \(\alpha\) contains the N outcomes for question \(\alpha\).w (
ndarray|None) – optional length \((C+1)\) weight vector \((w_0, \ldots, w_C)\). If None, R must be binary and \(w = (0, 1)\) is used.confidence (
float) – credibility level of the interval (default 0.95).bounds (
tuple[float,float] |None) – optional(lo, hi)clipping bounds for the interval.
- Returns:
\((a,\; \sigma_a,\; \text{lo},\; \text{hi})\)
- Return type:
- Formula:
- \[\text{lo},\; \text{hi} = a \pm z_{(1+\gamma)/2}\,\sigma_a\]
where \(\gamma\) is the requested
confidencelevel and the interval is clipped toboundswhen provided.
Examples
Binary (no weights):
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> a, sigma, lo, hi = avg_ci(R, bounds=(0.0, 1.0)) >>> round(a, 4), round(sigma, 4), round(lo, 4), round(hi, 4) (0.7, 0.1658, 0.375, 1.0)
Weighted:
>>> R = np.array([[0, 1, 2, 2, 1], ... [1, 1, 0, 2, 2]]) >>> w = np.array([0.0, 0.5, 1.0]) >>> a, sigma, lo, hi = avg_ci(R, w, confidence=0.95) >>> round(a, 4), round(sigma, 4), round(lo, 4), round(hi, 4) (0.6, 0.1472, 0.3115, 0.8885)
Pass@k
unanimous_at_k and unanimous_at_k_ci are aliases for
pass_hat_k and pass_hat_k_ci.
- pass_at_k(R, k)[source]
Unbiased Pass@k estimator.
Computes the probability that at least one of k randomly selected samples is correct, averaged over all M questions.
References
Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374
- Parameters:
- Returns:
The average Pass@k score across all M questions.
- Return type:
- Notation:
For each row \(\alpha\):
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]\(C(a, b)\) denotes the binomial coefficient \(\binom{a}{b}\).
- Formula:
- \[\text{Pass@k}_\alpha = 1 - \frac{C(N - \nu_\alpha, k)}{C(N, k)}\]\[\text{Pass@k} = \frac{1}{M} \sum_{\alpha=1}^{M} \text{Pass@k}_\alpha\]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(pass_at_k(R, 1), 6) 0.7 >>> round(pass_at_k(R, 2), 6) 0.95
- pass_hat_k(R, k)[source]
Pass^k (Pass-hat@k): probability that all k selected trials are correct.
Computes the probability that k randomly selected samples are ALL correct, averaged over all M questions. Also known as G-Pass@k.
References
Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). \(\tau\)-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045
- Parameters:
- Returns:
The average Pass^k score across all M questions.
- Return type:
- Notation:
For each row \(\alpha\):
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]\(C(a, b)\) denotes the binomial coefficient \(\binom{a}{b}\).
- Formula:
- \[\hat{\text{Pass@k}}_\alpha = \frac{C(\nu_\alpha, k)}{C(N, k)}\]\[\hat{\text{Pass@k}} = \frac{1}{M} \sum_{\alpha=1}^{M} \hat{\text{Pass@k}}_\alpha\]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(pass_hat_k(R, 1), 6) 0.7 >>> round(pass_hat_k(R, 2), 6) 0.45
- pass_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Bayesian \(\mu\), \(\sigma\), and credible interval for i.i.d. Pass@k.
Treats each question’s underlying correctness probability \(p\) as latent with a \(\text{Beta}(\alpha_0, \beta_0)\) posterior and propagates uncertainty to the dataset-level metric.
- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.k (
int) – Number of samples to select (\(1 \le k \le N\)).confidence (
float) – credibility level of the interval (default 0.95).bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval (default(0, 1)).alpha0 (
float) – Beta prior parameter \(\alpha_0\) (default 1).beta0 (
float) – Beta prior parameter \(\beta_0\) (default 1).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)
- Return type:
- Notation:
Per-question posterior: \(p_\alpha \mid R \sim \text{Beta}(\alpha_0 + c_\alpha,\; \beta_0 + N - c_\alpha)\) where \(c_\alpha = \sum_i R_{\alpha i}\).
- Formula:
The per-question i.i.d. quantity is:
\[g(p) = 1 - (1 - p)^k\]Its posterior mean and variance are:
\[ \begin{align}\begin{aligned}\mathbb{E}[g(p_\alpha)] &= 1 - \frac{B(\alpha_\alpha,\; \beta_\alpha + k)}{B(\alpha_\alpha, \beta_\alpha)}\\\text{Var}[g(p_\alpha)] &= \mathbb{E}[g(p_\alpha)^2] - \mathbb{E}[g(p_\alpha)]^2\end{aligned}\end{align} \]Dataset-level aggregation:
\[ \begin{align}\begin{aligned}\mu &= \frac{1}{M} \sum_{\alpha} \mathbb{E}[g(p_\alpha)]\\\sigma &= \frac{1}{M} \sqrt{\sum_{\alpha} \text{Var}[g(p_\alpha)]}\end{aligned}\end{align} \]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> mu, sigma, lo, hi = pass_at_k_ci(R, 1) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.642857, 0.118451, 0.4107, 0.875) >>> mu, sigma, lo, hi = pass_at_k_ci(R, 2) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.839286, 0.097263, 0.6487, 1.0)
- pass_hat_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Bayesian \(\mu\), \(\sigma\), and credible interval for i.i.d. Pass^k.
Treats each question’s underlying correctness probability \(p\) as latent with a \(\text{Beta}(\alpha_0, \beta_0)\) posterior and propagates uncertainty to the dataset-level metric.
- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.k (
int) – Number of samples to select (\(1 \le k \le N\)).confidence (
float) – credibility level of the interval (default 0.95).bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval (default(0, 1)).alpha0 (
float) – Beta prior parameter \(\alpha_0\) (default 1).beta0 (
float) – Beta prior parameter \(\beta_0\) (default 1).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)
- Return type:
- Formula:
The per-question i.i.d. quantity is:
\[g(p) = p^k\]Its posterior mean and variance are:
\[ \begin{align}\begin{aligned}\mathbb{E}[g(p_\alpha)] &= \frac{B(\alpha_\alpha + k,\; \beta_\alpha)}{B(\alpha_\alpha, \beta_\alpha)}\\\text{Var}[g(p_\alpha)] &= \mathbb{E}[g(p_\alpha)^2] - \mathbb{E}[g(p_\alpha)]^2\end{aligned}\end{align} \]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> mu, sigma, lo, hi = pass_hat_k_ci(R, 1) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.642857, 0.118451, 0.4107, 0.875) >>> mu, sigma, lo, hi = pass_hat_k_ci(R, 2) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.446429, 0.146167, 0.1599, 0.7329)
AUC@K
- auc_at_k(R, k)[source]
Performance evaluation using AUC@K.
References
Hu, Z., et al. (2026). Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs. arXiv:2601.08763. https://arxiv.org/abs/2601.08763
- Parameters:
- Returns:
The average AUC@K score across all \(M\) questions.
- Return type:
- Notation:
For each row \(\alpha\):
\[ \begin{align}\begin{aligned}\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\\\mathrm{Pass@}j_\alpha = 1 - \frac{\binom{N - \nu_\alpha}{j}}{\binom{N}{j}}\end{aligned}\end{align} \]For \(k > 1\), define trapezoidal coefficients \(c_1 = c_k = \frac{1}{2(k-1)}\) and \(c_j = \frac{1}{k-1}\) for \(2 \le j \le k-1\). For \(k = 1\), \(\mathrm{AUC@1} = \mathrm{Pass@1}\).
- Formula:
- \[\mathrm{AUC@}k_\alpha = \sum_{j=1}^{k} c_j \, \mathrm{Pass@}j_\alpha\]\[\mathrm{AUC@}k = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{AUC@}k_\alpha\]
Equivalently, for \(k > 1\),
\[\mathrm{AUC@}k = \frac{1}{k - 1} \sum_{j=1}^{k-1} \frac{\mathrm{Pass@}j + \mathrm{Pass@}(j + 1)}{2}\]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(auc_at_k(R, 1), 6) 0.7 >>> round(auc_at_k(R, 2), 6) 0.825 >>> round(auc_at_k(R, 3), 6) 0.9
- auc_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Bayesian posterior summary for the latent AUC@K target.
The posterior model treats each question’s success probability as a latent Bernoulli parameter with a Beta prior. It propagates that uncertainty through the AUC@K weighted sum of i.i.d. Pass@j targets. For
k = 1, AUC@1 is Pass@1, so this function returnspass_at_k_ci()withk = 1.- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).k (
int) – Maximum sampling budget with1 <= k <= N.confidence (
float) – credibility level of the interval.bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval.alpha0 (
float) – Beta prior parameter \(\alpha_0\).beta0 (
float) – Beta prior parameter \(\beta_0\).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\).
- Return type:
Majority
- maj_at_k(R, k)[source]
Maj@k: strict-majority correctness over
ksamples.This metric measures the probability that a uniformly sampled subset of
kobserved traces contains strictly more than half correct solutions. It is a binary-outcome proxy for the majority-vote metrics often reported in reasoning papers when only correctness labels are available.- Parameters:
- Returns:
Average strict-majority success probability across prompts.
- Return type:
- Formula:
Let \(\nu_\alpha = \sum_i R_{\alpha i}\) and \(j_0 = \lfloor k/2 \rfloor + 1\). Then
\[\mathrm{Maj@k}_\alpha = \sum_{j=j_0}^{k} \frac{\binom{\nu_\alpha}{j}\binom{N-\nu_\alpha}{k-j}} {\binom{N}{k}}.\]Equivalently, this is \(\mathrm{G\text{-}Pass@k}_{\tau}\) with \(\tau = j_0 / k\).
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(maj_at_k(R, 1), 6) 0.7 >>> round(maj_at_k(R, 2), 6) 0.45 >>> round(maj_at_k(R, 3), 6) 0.85
- maj_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Bayesian posterior summary for
maj_at_k().This reuses the generalized threshold-pass posterior with the strict majority threshold \(j_0 = \lfloor k/2 \rfloor + 1\).
- Parameters:
R (
ndarray) – \(M \times N\) binary outcome matrix.k (
int) – Number of sampled traces, with1 <= k <= N.confidence (
float) – Credibility level for the normal-approximation interval.bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval.alpha0 (
float) – Beta prior parameter \(\alpha_0\).beta0 (
float) – Beta prior parameter \(\beta_0\).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)
- Return type:
- Formula:
This is exactly
g_pass_at_k_tau_ci()evaluated at\[\tau = \frac{\lfloor k/2 \rfloor + 1}{k}.\]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> mu, sigma, lo, hi = maj_at_k_ci(R, 2) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.446429, 0.146167, 0.1599, 0.7329) >>> mu, sigma, lo, hi = maj_at_k_ci(R, 3) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.684524, 0.151958, 0.3867, 0.9824)
Generalized Pass@k
- g_pass_at_k(R, k)[source]
Performance evaluation using G-Pass@k.
Equivalent to
pass_hat_k(), included for literature that uses the G-Pass@k naming convention for the \(\tau = 1\) threshold.References
Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147
- Parameters:
- Returns:
The average G-Pass@k score across all \(M\) questions.
- Return type:
- Notation:
For each row \(\alpha\):
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]\(C(a, b)\) denotes the binomial coefficient \(\binom{a}{b}\).
- Formula:
- \[\mathrm{G\text{-}Pass@}k_\alpha = \frac{C(\nu_\alpha, k)}{C(N, k)}\]\[\mathrm{G\text{-}Pass@}k = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{G\text{-}Pass@}k_\alpha\]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(g_pass_at_k(R, 1), 6) 0.7 >>> round(g_pass_at_k(R, 2), 6) 0.45
- g_pass_at_k_tau(R, k, tau)[source]
Performance evaluation using G-Pass@kτ.
References
Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147
- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.k (
int) – Number of samples to select (\(1 \le k \le N\)).tau (
float) – Threshold parameter \(\tau \in [0, 1]\). Requires at least \(\lceil \tau \cdot k \rceil\) successes. When \(\tau = 0\), equivalent to Pass@k. When \(\tau = 1\), equivalent to Pass^k.
- Returns:
The average G-Pass@kτ score across all \(M\) questions.
- Return type:
- Notation:
For each row \(\alpha\):
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]\(C(a, b)\) denotes the binomial coefficient \(\binom{a}{b}\).
\(j_0 = \lceil \tau \cdot k \rceil\) is the minimum number of successes required.
- Formula:
- \[\mathrm{G\text{-}Pass@}k_{\tau, \alpha} = \sum_{j=j_0}^{k} \frac{C(\nu_\alpha, j) \cdot C(N - \nu_\alpha, k - j)}{C(N, k)}\]\[\mathrm{G\text{-}Pass@}k_\tau = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{G\text{-}Pass@}k_{\tau, \alpha}\]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(g_pass_at_k_tau(R, 2, 0.5), 6) 0.95 >>> round(g_pass_at_k_tau(R, 2, 1.0), 6) 0.45
- mg_pass_at_k(R, k)[source]
Performance evaluation using mG-Pass@k.
References
Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147
- Parameters:
- Returns:
The average mG-Pass@k score across all \(M\) questions.
- Return type:
- Notation:
For each row \(\alpha\):
\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]\(m = \lceil k/2 \rceil\) is the majority threshold. Let \(X_\alpha \sim \mathrm{Hypergeometric}(N, \nu_\alpha, k)\) be the number of correct samples among \(k\) selections for row \(\alpha\).
- Formula:
- \[\mathrm{mG\text{-}Pass@}k_\alpha = 2 \int_{0.5}^{1.0} \mathrm{G\text{-}Pass@}k_{\tau, \alpha} \, d\tau\]\[\mathrm{mG\text{-}Pass@}k_\alpha = \frac{2}{k} \sum_{j=m+1}^{k} (j - m) \cdot P(X_\alpha = j)\]\[P(X_\alpha = j) = \frac{C(\nu_\alpha, j) \cdot C(N - \nu_\alpha, k - j)}{C(N, k)}\]\[\mathrm{mG\text{-}Pass@}k = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{mG\text{-}Pass@}k_\alpha\]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(mg_pass_at_k(R, 2), 6) 0.45 >>> round(mg_pass_at_k(R, 3), 6) 0.166667
- g_pass_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Bayesian posterior summary for G-Pass@k.
G-Pass@k is the all-success threshold, so this is the same posterior target as
pass_hat_k_ci().- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).k (
int) – Number of selected samples with1 <= k <= N.confidence (
float) – credibility level of the interval.bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval.alpha0 (
float) – Beta prior parameter \(\alpha_0\).beta0 (
float) – Beta prior parameter \(\beta_0\).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\).
- Return type:
- g_pass_at_k_tau_ci(R, k, tau, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Bayesian posterior summary for thresholded G-Pass@k.
The latent target is the probability that at least \(\lceil \tau k \rceil\) of
ki.i.d. samples are correct.- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).k (
int) – Number of selected samples with1 <= k <= N.tau (
float) – Threshold parameter in[0, 1].confidence (
float) – credibility level of the interval.bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval.alpha0 (
float) – Beta prior parameter \(\alpha_0\).beta0 (
float) – Beta prior parameter \(\beta_0\).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\).
- Return type:
- mg_pass_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Bayesian posterior summary for mG-Pass@k.
The latent target averages thresholded G-Pass@k over thresholds from
0.5to1.0using the closed-form weighting inmg_pass_at_k().- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).k (
int) – Number of selected samples with1 <= k <= N.confidence (
float) – credibility level of the interval.bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval.alpha0 (
float) – Beta prior parameter \(\alpha_0\).beta0 (
float) – Beta prior parameter \(\beta_0\).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\).
- Return type:
GeoSpectrum
- geom_at_k(R, k, pass_power=0.5, unanimous_power=0.5)[source]
Questionwise Geom@k averaged across questions.
This is
scorio’s primary Geom@k metric. Unlikegeom_ds_at_k(), which blends dataset-level Pass@k and dataset-level Unanimous@k, this function first computes the per-question quantities\[P_{\alpha,k} = 1 - \frac{\binom{N - \nu_\alpha}{k}}{\binom{N}{k}}\]\[U_{\alpha,k} = \frac{\binom{\nu_\alpha}{k}}{\binom{N}{k}}\]forms the geometric blend
\[G_{\alpha,k} = P_{\alpha,k}^{a}\,U_{\alpha,k}^{b},\]and only then averages across questions.
- Parameters:
- Returns:
The average questionwise
Geom@kscore.- Return type:
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(geom_at_k(R, 2), 6) 0.647106
- geom_at_k_ci(R, k, pass_power=0.5, unanimous_power=0.5, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Approximate posterior summary for the questionwise Geom@k target.
This is the uncertainty counterpart of
geom_at_k(): it applies a first-order delta method to each question’s latent Pass@k and Unanimous@k quantities, then averages the resulting question-level geometric blends.- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).k (
int) – Latent resampling budget. Once the posterior is defined, any integer \(k \ge 1\) is allowed.pass_power (
float) – Exponent applied to each question’s latentPass@k.unanimous_power (
float) – Exponent applied to each question’s latentUnanimous@k.confidence (
float) – credibility level of the interval (default 0.95).bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval (default(0, 1)).alpha0 (
float) – Beta prior parameter \(\alpha_0\) (default 1).beta0 (
float) – Beta prior parameter \(\beta_0\) (default 1).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)
- Return type:
- Formula:
Let \(\mu_{P,\alpha}\) and \(\mu_{U,\alpha}\) denote the posterior means of question \(\alpha\)’s latent Pass@k and Unanimous@k quantities. Then
\[\mu \approx \frac{1}{M}\sum_\alpha \mu_{P,\alpha}^{a}\,\mu_{U,\alpha}^{b}\]and \(\sigma\) is computed by per-question first-order delta propagation through \(g(x, y) = x^a y^b\).
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> mu, sigma, lo, hi = geom_at_k_ci(R, 2) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.610666, 0.133107, 0.3498, 0.8716)
- geom_ds_at_k(R, k, pass_power=0.5, unanimous_power=0.5)[source]
Dataset-level Pass/Unanimous geometric blend.
This is the endpoint GeoSpectrum operating point from the paper: it first averages Pass@k and Unanimous@k across questions, then applies the geometric blend. For the questionwise metric that blends before averaging, use
geom_at_k().The default operating point is the geometric mean of dataset-level Pass@k and Unanimous@k (equivalently Pass^k). The same API also exposes nearby operating points by letting callers adjust the exponents on the Pass@k and Unanimous@k terms directly.
- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.k (
int) – Sampling budget with \(1 \le k \le N\).pass_power (
float) – Exponent applied toPass@k.unanimous_power (
float) – Exponent applied toUnanimous@k.
- Returns:
The dataset-level endpoint score.
- Return type:
- Formula:
- \[G_{\mathrm{ds},k}(R; a, b) = \mathrm{Pass}@k(R)^a\, \mathrm{Unanimous}@k(R)^b\]
with the default dataset-level endpoint operating point given by \(a = b = 1/2\).
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(geom_ds_at_k(R, 2), 6) 0.653835
- geom_ds_at_k_ci(R, k, pass_power=0.5, unanimous_power=0.5, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Approximate posterior summary for the dataset-level Geom@k target.
This is the uncertainty counterpart of
geom_ds_at_k()and matches the dataset-level latent quantity introduced in the paper whenpass_power = unanimous_power = 0.5.- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).k (
int) – Latent resampling budget. Once the posterior is defined, any integer \(k \ge 1\) is allowed.pass_power (
float) – Exponent applied to latent dataset-levelPass@k.unanimous_power (
float) – Exponent applied to latent dataset-levelUnanimous@k.confidence (
float) – credibility level of the interval (default 0.95).bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval (default(0, 1)).alpha0 (
float) – Beta prior parameter \(\alpha_0\) (default 1).beta0 (
float) – Beta prior parameter \(\beta_0\) (default 1).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)
- Return type:
- Formula:
Let \(\mu_P\) and \(\mu_U\) denote the posterior means of the latent dataset-level Pass@k and Unanimous@k quantities. Then
\[\mu \approx \mu_P^a\,\mu_U^b\]and \(\sigma\) is computed by first-order delta propagation through \(g(x, y) = x^a y^b\).
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> mu, sigma, lo, hi = geom_ds_at_k_ci(R, 2) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.612112, 0.132755, 0.3519, 0.8723)
- geo_spectrum_at_k(R, k, lam=0.5, weights=None, lambda_=None)[source]
\(\mathrm{GeoSpectrum}_{\lambda,w}@k\) on the observed finite bank.
By default
weights=Noneselects the upper-halfmGweights, so the two-argument callgeo_spectrum_at_k(R, k)remains the special case\[\mathrm{GeoSpectrum}^*@k(R) = \sqrt{\mathrm{Pass}@k(R)\,\mathrm{mG\text{-}Pass}@k(R)}.\]This function also accepts the keyword alias
lambda_=...for callers that prefer naming the coupling parameter after the mathematical symbol.- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).k (
int) – Sampling budget with \(1 \le k \le N\).lam (
float) – The coupling parameter \(\lambda\) in \([0,1]\).weights (
ndarray|list[float] |tuple[float,...] |None) – Spectrum weights \(w\). If omitted, uses the built-in upper-half mG weights. Custom weights must be length-\(k\), non-negative, finite, and satisfy \(\sum_r w_r \le 1\).lambda_ (float | None)
- Returns:
\(\mathrm{GeoSpectrum}_{\lambda,w}@k(R)\).
- Return type:
- Formula:
- \[\mathrm{GeoSpectrum}_{\lambda,w}@k(R) = \mathrm{Pass}@k(R)^\lambda \, S_{w,k}(R)^{1-\lambda}\]
Examples
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(geo_spectrum_at_k(R, 3), 6) 0.408248 >>> round(geo_spectrum_at_k(R, 3, lam=1.0), 6) 1.0
- geo_spectrum_at_k_ci(R, k, lam=0.5, weights=None, lambda_=None, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Approximate posterior summary for latent \(\mathrm{GeoSpectrum}_{\lambda,w}@k\).
As in
geo_spectrum_at_k(), omittingweightsselects theGeoSpectrum*@koperating point.This function also accepts the keyword alias
lambda_=...for callers that prefer naming the coupling parameter after the mathematical symbol.- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).k (
int) – Latent resampling budget. Once the posterior is defined, any integer \(k \ge 1\) is allowed.lam (
float) – The coupling parameter \(\lambda\) in \([0,1]\).weights (
ndarray|list[float] |tuple[float,...] |None) – Spectrum weights \(w\). If omitted, uses the built-in upper-half mG weights unless \(\lambda = 1\), in which case the spectrum term is irrelevant. Custom weights must be length-\(k\), non-negative, finite, and satisfy \(\sum_r w_r \le 1\).confidence (
float) – credibility level of the interval (default 0.95).bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval (default(0, 1)).alpha0 (
float) – Beta prior parameter \(\alpha_0\) (default 1).beta0 (
float) – Beta prior parameter \(\beta_0\) (default 1).lambda_ (float | None)
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)
- Return type:
- Formula:
Let \(x\) denote latent Pass@k and \(y\) denote the latent spectrum \(S_{w,k}\). The posterior mean is approximated by
\[\mu \approx x^\lambda y^{1-\lambda}\]evaluated at the posterior means of \(x\) and \(y\), and \(\sigma\) is obtained by first-order delta propagation through \(g(x, y) = x^\lambda y^{1-\lambda}\).
- geo_spectrum_star_at_k(R, k)[source]
Explicit alias for the default
GeoSpectrum*@koperating point.Equivalent to calling
geo_spectrum_at_k()with the default upper-halfmGspectrum weights.
- geo_spectrum_star_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Approximate posterior summary for latent
GeoSpectrum*@k.Equivalent to
geo_spectrum_at_k_ci()with the default upper-halfmGspectrum weights.- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).k (
int) – Latent resampling budget. Once the posterior is defined, any integer \(k \ge 1\) is allowed.confidence (
float) – credibility level of the interval (default 0.95).bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval (default(0, 1)).alpha0 (
float) – Beta prior parameter \(\alpha_0\) (default 1).beta0 (
float) – Beta prior parameter \(\beta_0\) (default 1).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)
- Return type:
- threshold_spectrum_at_k(R, k, weights)[source]
Finite-bank threshold-spectrum summary \(S_{w,k}(R)\).
- Parameters:
- Returns:
\(S_{w,k}(R)\) averaged across questions.
- Return type:
Notes
This summary is defined by
\[S_{w,k}(R) = \sum_{r=1}^k w_r T_{r,k}(R).\]The implementation uses the equivalent event-score representation from Appendix C.4.
- threshold_spectrum_at_k_ci(R, k, weights, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]
Approximate posterior summary for the latent spectrum \(S_{w,k}(p)\).
- Parameters:
R (
ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).k (
int) – Latent resampling budget. Once the posterior is defined, any integer \(k \ge 1\) is allowed.weights (
ndarray|list[float] |tuple[float,...]) – Non-negative length-\(k\) weights with \(\sum_r w_r \le 1\).confidence (
float) – credibility level of the interval (default 0.95).bounds (
tuple[float,float]) –(lo, hi)clipping bounds for the interval (default(0, 1)).alpha0 (
float) – Beta prior parameter \(\alpha_0\) (default 1).beta0 (
float) – Beta prior parameter \(\beta_0\) (default 1).
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)
- Return type:
Notes
Unlike
threshold_spectrum_at_k(), the posterior target is defined for latent i.i.d. resampling and therefore does not require \(k \le N\).- Formula:
Let \(A_j = \sum_{r \le j} w_r\). The per-question latent target is
\[g(p) = \sum_{j=1}^{k} A_j \binom{k}{j} p^j (1-p)^{k-j}.\]Dataset-level aggregation uses
\[\mu = \frac{1}{M} \sum_{\alpha=1}^{M} \mathbb{E}[g(p_\alpha)]\]\[\sigma = \frac{1}{M} \sqrt{ \sum_{\alpha=1}^{M} \mathrm{Var}[g(p_\alpha)] }.\]
Max-Reward
- max_at_k(R, k, w=None)[source]
Max@k: expected best reward among
ksampled traces.When
w = [0, 1], Max@k reduces exactly to Pass@k. More generally, the reward vectorwmaps categorical outcomes to arbitrary real-valued scores, and Max@k averages the best score obtainable from a subset of sizek.References
Bagirov, F., et al. (2025). The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimization. arXiv:2510.23393. https://arxiv.org/abs/2510.23393 The finite-sample estimator below matches Appendix C.1 / Listing 1.
Walder, C., & Karkhanis, D. (2025). Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems. arXiv:2505.15201.
- Parameters:
- Returns:
Average Max@k score across prompts.
- Return type:
- Formula:
Let \(g_{\alpha 1} \le \cdots \le g_{\alpha N}\) denote the sorted rewards for prompt \(\alpha\). Then the unbiased finite- sample estimator is
\[\mathrm{Max@k}_\alpha = \frac{1}{\binom{N}{k}} \sum_{i=k}^{N} \binom{i-1}{k-1} g_{\alpha i}.\]The dataset-level metric is the average across prompts:
\[\mathrm{Max@k} = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{Max@k}_\alpha\]
Examples
Binary (reduces to Pass@k):
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> round(max_at_k(R, 2), 6) 0.95
Weighted categorical rewards:
>>> R = np.array([[0, 1, 2, 2, 1], ... [1, 1, 0, 2, 2]]) >>> w = np.array([0.0, 0.5, 1.0]) >>> round(max_at_k(R, 2, w=w), 6) 0.85
- max_at_k_ci(R, k, w=None, R0=None, confidence=0.95, bounds=None)[source]
Bayesian posterior summary for
max_at_k().The posterior uses the same Dirichlet-plus-one construction as
bayes(). Whenk = 1,Max@1reduces to the usual single-draw expected score, so this function agrees withbayes_ci().This uncertainty model is a
scorioextension. Bagirov et al. (2025) define the finite-sample max@k point estimator, but do not derive these Bayesian credible intervals.- Parameters:
R (
ndarray) – \(M \times N\) categorical outcome matrix with integer entries in \(\{0, \ldots, C\}\).k (
int) – Selection count. The posterior target is defined for any integerk >= 1;k = 1matchesbayes_ci().w (
ndarray|None) – Optional reward vector of shape(C+1,). If omitted,Rmust be binary andw = [0, 1]is used.R0 (
ndarray|None) – Optional \(M \times D\) matrix of prior outcomes.confidence (
float) – Credibility level for the normal-approximation interval.bounds (
tuple[float,float] |None) – Optional(lo, hi)clipping bounds. If omitted, the interval is clipped to the minimum and maximum reward levels inw.
- Returns:
\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)
- Return type:
- Formula:
Let \(r_1 < \cdots < r_L\) be the unique reward levels and \(A_{\alpha \ell}\) the posterior cumulative probability of obtaining reward at most \(r_\ell\) for prompt \(\alpha\). Then the per-prompt latent target is
\[g_\alpha = r_L - \sum_{\ell=1}^{L-1} (r_{\ell+1} - r_\ell) A_{\alpha \ell}^k\]and posterior moments are computed in closed form under the grouped Dirichlet posterior.
Examples
Binary:
>>> import numpy as np >>> R = np.array([[0, 1, 1, 0, 1], ... [1, 1, 0, 1, 1]]) >>> mu, sigma, lo, hi = max_at_k_ci(R, 2) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.839286, 0.097263, 0.6487, 1.0)
Weighted categorical rewards:
>>> R = np.array([[0, 1, 2, 2, 1], ... [1, 1, 0, 2, 2]]) >>> w = np.array([0.0, 0.5, 1.0]) >>> mu, sigma, lo, hi = max_at_k_ci(R, 2, w=w) >>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4) (0.75, 0.08812, 0.5773, 0.9227)