scorio.eval

Evaluation metrics for outcome matrices. Import the public API with from scorio import eval and call functions as eval.bayes(...), eval.pass_at_k(...), and so on.

Shared conventions

R is an \(M \times N\) matrix, where rows are questions and columns are sampled trials. Binary metrics require entries in \(\{0,1\}\). Categorical metrics use entries in \(\{0,\ldots,C\}\) together with a weight or reward vector w of length \(C+1\).

Point estimators return a scalar score. Companion *_ci functions return (mu, sigma, lo, hi), where mu is the posterior mean or point estimate, sigma is the posterior standard deviation under the metric’s uncertainty model, and lo and hi define a normal-approximation credible interval.

Bayes@N

bayes(R, w=None, R0=None)[source]

Performance evaluation using the Bayes@N framework.

References

Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2026). Don’t Pass@k: A Bayesian Framework for Large Language Model Evaluation. ICLR 2026, arXiv:2510.04265. https://arxiv.org/abs/2510.04265

Parameters:
  • R (ndarray) – \(M \times N\) int matrix with entries in \(\{0,\ldots,C\}\). Row \(\alpha\) are the N outcomes for question \(\alpha\).

  • w (ndarray | None) – length \((C+1)\) weight vector \((w_0,\ldots,w_C)\) that maps category k to score \(w_k\).

  • R0 (ndarray | None) – optional \(M \times D\) int matrix supplying D prior outcomes per row. If omitted, \(D=0\).

Returns:

\((\mu, \sigma)\) performance metric estimate and its uncertainty.

Return type:

tuple[float, float]

Notation:

\(\delta_{a,b}\) is the Kronecker delta. For each row \(\alpha\) and class \(k \in \{0,\ldots,C\}\):

\[ \begin{align}\begin{aligned}n_{\alpha k} &= \sum_{i=1}^N \delta_{k, R_{\alpha i}} \quad \text{(counts in R)}\\n^0_{\alpha k} &= 1 + \sum_{i=1}^D \delta_{k, R^0_{\alpha i}} \quad \text{(Dirichlet(+1) prior)}\\\nu_{\alpha k} &= n_{\alpha k} + n^0_{\alpha k}\end{aligned}\end{align} \]

Effective sample size: \(T = 1 + C + D + N\) (scalar)

Formula:
\[\mu = w_0 + \frac{1}{M \cdot T} \sum_{\alpha=1}^M \sum_{j=0}^C \nu_{\alpha j} (w_j - w_0)\]
\[\sigma = \sqrt{ \frac{1}{M^2(T+1)} \sum_{\alpha=1}^M \left[ \sum_j \frac{\nu_{\alpha j}}{T} (w_j - w_0)^2 - \left( \sum_j \frac{\nu_{\alpha j}}{T} (w_j - w_0) \right)^2 \right] }\]

Examples

>>> import numpy as np
>>> R  = np.array([[0, 1, 2, 2, 1],
...                [1, 1, 0, 2, 2]])
>>> w  = np.array([0.0, 0.5, 1.0])
>>> R0 = np.array([[0, 2],
...                [1, 2]])

With prior (D=2 → T=10):

>>> mu, sigma = bayes(R, w, R0)
>>> round(mu, 6), round(sigma, 6)
(0.575, 0.084275)

Without prior (D=0 → T=8):

>>> mu2, sigma2 = bayes(R, w)
>>> round(mu2, 6), round(sigma2, 6)
(0.5625, 0.091998)
bayes_ci(R, w=None, R0=None, confidence=0.95, bounds=None)[source]

Bayes@N posterior mean, uncertainty, and credible interval.

This is the interval-valued companion to bayes(). It computes \(\mu\) and \(\sigma\) with the Bayes@N posterior moments, then forms a central normal-approximation credible interval.

Parameters:
  • R (ndarray) – \(M \times N\) int matrix with entries in \(\{0,\ldots,C\}\).

  • w (ndarray | None) – optional length \((C+1)\) weight vector \((w_0,\ldots,w_C)\). If omitted, R must be binary and w = [0, 1] is used.

  • R0 (ndarray | None) – optional \(M \times D\) int matrix supplying prior outcomes for each row.

  • confidence (float) – credibility level of the interval.

  • bounds (tuple[float, float] | None) – optional (lo, hi) clipping bounds for the interval.

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\).

Return type:

tuple[float, float, float, float]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> mu, sigma, lo, hi = bayes_ci(R, bounds=(0.0, 1.0))
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.642857, 0.118451, 0.4107, 0.875)

avg@N

avg(R, w=None)[source]

Avg@N plus a Bayesian uncertainty estimate (uniform prior, no R0).

Under a uniform Dirichlet prior (\(D = 0\)), the Bayesian posterior mean \(\mu\) is an affine transform of the naive (weighted) average a, and the standard deviations are related by Eq. 20 of the Bayes@N paper:

\[\sigma_{\text{avg}} = \frac{T}{N}\,\sigma_{\text{Bayes}}\]

This lets you report the familiar avg@N while using the Bayesian framework of scorio to compute uncertainty on the same scale, without relying on CLT/Wald intervals or bootstrap resampling.

Parameters:
  • R (ndarray) – \(M \times N\) int matrix with entries in \(\{0, \ldots, C\}\). Row \(\alpha\) contains the N outcomes for question \(\alpha\).

  • w (ndarray | None) – optional length \((C+1)\) weight vector \((w_0, \ldots, w_C)\). If None, R must be binary and \(w = (0, 1)\) is used.

Returns:

\((a,\; \sigma_a)\) where a is the (weighted) average and \(\sigma_a\) is the Bayesian uncertainty rescaled to the avg@N scale.

Return type:

tuple[float, float]

Formula:

Let \(T = 1 + C + N\) (uniform prior, \(D = 0\)).

\[ \begin{align}\begin{aligned}a &= \text{avg}(R, w)\\\sigma_a &= \frac{T}{N}\,\sigma_{\text{Bayes}}(R, w)\end{aligned}\end{align} \]

Examples

Binary (no weights):

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> a, sigma = avg(R)
>>> round(a, 6), round(sigma, 6)
(0.7, 0.165831)

Weighted:

>>> R = np.array([[0, 1, 2, 2, 1],
...               [1, 1, 0, 2, 2]])
>>> w = np.array([0.0, 0.5, 1.0])
>>> a, sigma = avg(R, w)
>>> round(a, 6), round(sigma, 6)
(0.6, 0.147196)
avg_ci(R, w=None, confidence=0.95, bounds=None)[source]

Avg@N with Bayesian \(\sigma\) and a normal-approximation credible interval (CrI).

Combines avg() with a symmetric normal credible interval clipped to optional bounds.

Parameters:
  • R (ndarray) – \(M \times N\) int matrix with entries in \(\{0, \ldots, C\}\). Row \(\alpha\) contains the N outcomes for question \(\alpha\).

  • w (ndarray | None) – optional length \((C+1)\) weight vector \((w_0, \ldots, w_C)\). If None, R must be binary and \(w = (0, 1)\) is used.

  • confidence (float) – credibility level of the interval (default 0.95).

  • bounds (tuple[float, float] | None) – optional (lo, hi) clipping bounds for the interval.

Returns:

\((a,\; \sigma_a,\; \text{lo},\; \text{hi})\)

Return type:

tuple[float, float, float, float]

Formula:
\[\text{lo},\; \text{hi} = a \pm z_{(1+\gamma)/2}\,\sigma_a\]

where \(\gamma\) is the requested confidence level and the interval is clipped to bounds when provided.

Examples

Binary (no weights):

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> a, sigma, lo, hi = avg_ci(R, bounds=(0.0, 1.0))
>>> round(a, 4), round(sigma, 4), round(lo, 4), round(hi, 4)
(0.7, 0.1658, 0.375, 1.0)

Weighted:

>>> R = np.array([[0, 1, 2, 2, 1],
...               [1, 1, 0, 2, 2]])
>>> w = np.array([0.0, 0.5, 1.0])
>>> a, sigma, lo, hi = avg_ci(R, w, confidence=0.95)
>>> round(a, 4), round(sigma, 4), round(lo, 4), round(hi, 4)
(0.6, 0.1472, 0.3115, 0.8885)

Pass@k

unanimous_at_k and unanimous_at_k_ci are aliases for pass_hat_k and pass_hat_k_ci.

pass_at_k(R, k)[source]

Unbiased Pass@k estimator.

Computes the probability that at least one of k randomly selected samples is correct, averaged over all M questions.

References

Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.

  • k (int) – Number of samples to select (\(1 \le k \le N\)).

Returns:

The average Pass@k score across all M questions.

Return type:

float

Notation:

For each row \(\alpha\):

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]

\(C(a, b)\) denotes the binomial coefficient \(\binom{a}{b}\).

Formula:
\[\text{Pass@k}_\alpha = 1 - \frac{C(N - \nu_\alpha, k)}{C(N, k)}\]
\[\text{Pass@k} = \frac{1}{M} \sum_{\alpha=1}^{M} \text{Pass@k}_\alpha\]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(pass_at_k(R, 1), 6)
0.7
>>> round(pass_at_k(R, 2), 6)
0.95
pass_hat_k(R, k)[source]

Pass^k (Pass-hat@k): probability that all k selected trials are correct.

Computes the probability that k randomly selected samples are ALL correct, averaged over all M questions. Also known as G-Pass@k.

References

Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). \(\tau\)-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.

  • k (int) – Number of samples to select (\(1 \le k \le N\)).

Returns:

The average Pass^k score across all M questions.

Return type:

float

Notation:

For each row \(\alpha\):

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]

\(C(a, b)\) denotes the binomial coefficient \(\binom{a}{b}\).

Formula:
\[\hat{\text{Pass@k}}_\alpha = \frac{C(\nu_\alpha, k)}{C(N, k)}\]
\[\hat{\text{Pass@k}} = \frac{1}{M} \sum_{\alpha=1}^{M} \hat{\text{Pass@k}}_\alpha\]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(pass_hat_k(R, 1), 6)
0.7
>>> round(pass_hat_k(R, 2), 6)
0.45
pass_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Bayesian \(\mu\), \(\sigma\), and credible interval for i.i.d. Pass@k.

Treats each question’s underlying correctness probability \(p\) as latent with a \(\text{Beta}(\alpha_0, \beta_0)\) posterior and propagates uncertainty to the dataset-level metric.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.

  • k (int) – Number of samples to select (\(1 \le k \le N\)).

  • confidence (float) – credibility level of the interval (default 0.95).

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval (default (0, 1)).

  • alpha0 (float) – Beta prior parameter \(\alpha_0\) (default 1).

  • beta0 (float) – Beta prior parameter \(\beta_0\) (default 1).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)

Return type:

tuple[float, float, float, float]

Notation:

Per-question posterior: \(p_\alpha \mid R \sim \text{Beta}(\alpha_0 + c_\alpha,\; \beta_0 + N - c_\alpha)\) where \(c_\alpha = \sum_i R_{\alpha i}\).

Formula:

The per-question i.i.d. quantity is:

\[g(p) = 1 - (1 - p)^k\]

Its posterior mean and variance are:

\[ \begin{align}\begin{aligned}\mathbb{E}[g(p_\alpha)] &= 1 - \frac{B(\alpha_\alpha,\; \beta_\alpha + k)}{B(\alpha_\alpha, \beta_\alpha)}\\\text{Var}[g(p_\alpha)] &= \mathbb{E}[g(p_\alpha)^2] - \mathbb{E}[g(p_\alpha)]^2\end{aligned}\end{align} \]

Dataset-level aggregation:

\[ \begin{align}\begin{aligned}\mu &= \frac{1}{M} \sum_{\alpha} \mathbb{E}[g(p_\alpha)]\\\sigma &= \frac{1}{M} \sqrt{\sum_{\alpha} \text{Var}[g(p_\alpha)]}\end{aligned}\end{align} \]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> mu, sigma, lo, hi = pass_at_k_ci(R, 1)
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.642857, 0.118451, 0.4107, 0.875)
>>> mu, sigma, lo, hi = pass_at_k_ci(R, 2)
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.839286, 0.097263, 0.6487, 1.0)
pass_hat_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Bayesian \(\mu\), \(\sigma\), and credible interval for i.i.d. Pass^k.

Treats each question’s underlying correctness probability \(p\) as latent with a \(\text{Beta}(\alpha_0, \beta_0)\) posterior and propagates uncertainty to the dataset-level metric.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.

  • k (int) – Number of samples to select (\(1 \le k \le N\)).

  • confidence (float) – credibility level of the interval (default 0.95).

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval (default (0, 1)).

  • alpha0 (float) – Beta prior parameter \(\alpha_0\) (default 1).

  • beta0 (float) – Beta prior parameter \(\beta_0\) (default 1).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)

Return type:

tuple[float, float, float, float]

Formula:

The per-question i.i.d. quantity is:

\[g(p) = p^k\]

Its posterior mean and variance are:

\[ \begin{align}\begin{aligned}\mathbb{E}[g(p_\alpha)] &= \frac{B(\alpha_\alpha + k,\; \beta_\alpha)}{B(\alpha_\alpha, \beta_\alpha)}\\\text{Var}[g(p_\alpha)] &= \mathbb{E}[g(p_\alpha)^2] - \mathbb{E}[g(p_\alpha)]^2\end{aligned}\end{align} \]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> mu, sigma, lo, hi = pass_hat_k_ci(R, 1)
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.642857, 0.118451, 0.4107, 0.875)
>>> mu, sigma, lo, hi = pass_hat_k_ci(R, 2)
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.446429, 0.146167, 0.1599, 0.7329)

AUC@K

auc_at_k(R, k)[source]

Performance evaluation using AUC@K.

References

Hu, Z., et al. (2026). Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs. arXiv:2601.08763. https://arxiv.org/abs/2601.08763

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.

  • k (int) – Maximum sampling budget (\(1 \le k \le N\)).

Returns:

The average AUC@K score across all \(M\) questions.

Return type:

float

Notation:

For each row \(\alpha\):

\[ \begin{align}\begin{aligned}\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\\\mathrm{Pass@}j_\alpha = 1 - \frac{\binom{N - \nu_\alpha}{j}}{\binom{N}{j}}\end{aligned}\end{align} \]

For \(k > 1\), define trapezoidal coefficients \(c_1 = c_k = \frac{1}{2(k-1)}\) and \(c_j = \frac{1}{k-1}\) for \(2 \le j \le k-1\). For \(k = 1\), \(\mathrm{AUC@1} = \mathrm{Pass@1}\).

Formula:
\[\mathrm{AUC@}k_\alpha = \sum_{j=1}^{k} c_j \, \mathrm{Pass@}j_\alpha\]
\[\mathrm{AUC@}k = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{AUC@}k_\alpha\]

Equivalently, for \(k > 1\),

\[\mathrm{AUC@}k = \frac{1}{k - 1} \sum_{j=1}^{k-1} \frac{\mathrm{Pass@}j + \mathrm{Pass@}(j + 1)}{2}\]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(auc_at_k(R, 1), 6)
0.7
>>> round(auc_at_k(R, 2), 6)
0.825
>>> round(auc_at_k(R, 3), 6)
0.9
auc_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Bayesian posterior summary for the latent AUC@K target.

The posterior model treats each question’s success probability as a latent Bernoulli parameter with a Beta prior. It propagates that uncertainty through the AUC@K weighted sum of i.i.d. Pass@j targets. For k = 1, AUC@1 is Pass@1, so this function returns pass_at_k_ci() with k = 1.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Maximum sampling budget with 1 <= k <= N.

  • confidence (float) – credibility level of the interval.

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval.

  • alpha0 (float) – Beta prior parameter \(\alpha_0\).

  • beta0 (float) – Beta prior parameter \(\beta_0\).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\).

Return type:

tuple[float, float, float, float]

Majority

maj_at_k(R, k)[source]

Maj@k: strict-majority correctness over k samples.

This metric measures the probability that a uniformly sampled subset of k observed traces contains strictly more than half correct solutions. It is a binary-outcome proxy for the majority-vote metrics often reported in reasoning papers when only correctness labels are available.

Parameters:
  • R (ndarray) – \(M \times N\) binary outcome matrix.

  • k (int) – Number of sampled traces, with 1 <= k <= N.

Returns:

Average strict-majority success probability across prompts.

Return type:

float

Formula:

Let \(\nu_\alpha = \sum_i R_{\alpha i}\) and \(j_0 = \lfloor k/2 \rfloor + 1\). Then

\[\mathrm{Maj@k}_\alpha = \sum_{j=j_0}^{k} \frac{\binom{\nu_\alpha}{j}\binom{N-\nu_\alpha}{k-j}} {\binom{N}{k}}.\]

Equivalently, this is \(\mathrm{G\text{-}Pass@k}_{\tau}\) with \(\tau = j_0 / k\).

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(maj_at_k(R, 1), 6)
0.7
>>> round(maj_at_k(R, 2), 6)
0.45
>>> round(maj_at_k(R, 3), 6)
0.85
maj_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Bayesian posterior summary for maj_at_k().

This reuses the generalized threshold-pass posterior with the strict majority threshold \(j_0 = \lfloor k/2 \rfloor + 1\).

Parameters:
  • R (ndarray) – \(M \times N\) binary outcome matrix.

  • k (int) – Number of sampled traces, with 1 <= k <= N.

  • confidence (float) – Credibility level for the normal-approximation interval.

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval.

  • alpha0 (float) – Beta prior parameter \(\alpha_0\).

  • beta0 (float) – Beta prior parameter \(\beta_0\).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)

Return type:

tuple[float, float, float, float]

Formula:

This is exactly g_pass_at_k_tau_ci() evaluated at

\[\tau = \frac{\lfloor k/2 \rfloor + 1}{k}.\]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> mu, sigma, lo, hi = maj_at_k_ci(R, 2)
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.446429, 0.146167, 0.1599, 0.7329)
>>> mu, sigma, lo, hi = maj_at_k_ci(R, 3)
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.684524, 0.151958, 0.3867, 0.9824)

Generalized Pass@k

g_pass_at_k(R, k)[source]

Performance evaluation using G-Pass@k.

Equivalent to pass_hat_k(), included for literature that uses the G-Pass@k naming convention for the \(\tau = 1\) threshold.

References

Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.

  • k (int) – Number of samples to select (\(1 \le k \le N\)).

Returns:

The average G-Pass@k score across all \(M\) questions.

Return type:

float

Notation:

For each row \(\alpha\):

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]

\(C(a, b)\) denotes the binomial coefficient \(\binom{a}{b}\).

Formula:
\[\mathrm{G\text{-}Pass@}k_\alpha = \frac{C(\nu_\alpha, k)}{C(N, k)}\]
\[\mathrm{G\text{-}Pass@}k = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{G\text{-}Pass@}k_\alpha\]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(g_pass_at_k(R, 1), 6)
0.7
>>> round(g_pass_at_k(R, 2), 6)
0.45
g_pass_at_k_tau(R, k, tau)[source]

Performance evaluation using G-Pass@kτ.

References

Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.

  • k (int) – Number of samples to select (\(1 \le k \le N\)).

  • tau (float) – Threshold parameter \(\tau \in [0, 1]\). Requires at least \(\lceil \tau \cdot k \rceil\) successes. When \(\tau = 0\), equivalent to Pass@k. When \(\tau = 1\), equivalent to Pass^k.

Returns:

The average G-Pass@kτ score across all \(M\) questions.

Return type:

float

Notation:

For each row \(\alpha\):

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]

\(C(a, b)\) denotes the binomial coefficient \(\binom{a}{b}\).

\(j_0 = \lceil \tau \cdot k \rceil\) is the minimum number of successes required.

Formula:
\[\mathrm{G\text{-}Pass@}k_{\tau, \alpha} = \sum_{j=j_0}^{k} \frac{C(\nu_\alpha, j) \cdot C(N - \nu_\alpha, k - j)}{C(N, k)}\]
\[\mathrm{G\text{-}Pass@}k_\tau = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{G\text{-}Pass@}k_{\tau, \alpha}\]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(g_pass_at_k_tau(R, 2, 0.5), 6)
0.95
>>> round(g_pass_at_k_tau(R, 2, 1.0), 6)
0.45
mg_pass_at_k(R, k)[source]

Performance evaluation using mG-Pass@k.

References

Liu, J., Liu, H., Xiao, L., et al. (2024). Are Your LLMs Capable of Stable Reasoning? arXiv preprint arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0, 1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.

  • k (int) – Number of samples to select (\(1 \le k \le N\)).

Returns:

The average mG-Pass@k score across all \(M\) questions.

Return type:

float

Notation:

For each row \(\alpha\):

\[\nu_\alpha = \sum_{i=1}^{N} R_{\alpha i} \quad \text{(number of correct samples)}\]

\(m = \lceil k/2 \rceil\) is the majority threshold. Let \(X_\alpha \sim \mathrm{Hypergeometric}(N, \nu_\alpha, k)\) be the number of correct samples among \(k\) selections for row \(\alpha\).

Formula:
\[\mathrm{mG\text{-}Pass@}k_\alpha = 2 \int_{0.5}^{1.0} \mathrm{G\text{-}Pass@}k_{\tau, \alpha} \, d\tau\]
\[\mathrm{mG\text{-}Pass@}k_\alpha = \frac{2}{k} \sum_{j=m+1}^{k} (j - m) \cdot P(X_\alpha = j)\]
\[P(X_\alpha = j) = \frac{C(\nu_\alpha, j) \cdot C(N - \nu_\alpha, k - j)}{C(N, k)}\]
\[\mathrm{mG\text{-}Pass@}k = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{mG\text{-}Pass@}k_\alpha\]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(mg_pass_at_k(R, 2), 6)
0.45
>>> round(mg_pass_at_k(R, 3), 6)
0.166667
g_pass_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Bayesian posterior summary for G-Pass@k.

G-Pass@k is the all-success threshold, so this is the same posterior target as pass_hat_k_ci().

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Number of selected samples with 1 <= k <= N.

  • confidence (float) – credibility level of the interval.

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval.

  • alpha0 (float) – Beta prior parameter \(\alpha_0\).

  • beta0 (float) – Beta prior parameter \(\beta_0\).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\).

Return type:

tuple[float, float, float, float]

g_pass_at_k_tau_ci(R, k, tau, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Bayesian posterior summary for thresholded G-Pass@k.

The latent target is the probability that at least \(\lceil \tau k \rceil\) of k i.i.d. samples are correct.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Number of selected samples with 1 <= k <= N.

  • tau (float) – Threshold parameter in [0, 1].

  • confidence (float) – credibility level of the interval.

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval.

  • alpha0 (float) – Beta prior parameter \(\alpha_0\).

  • beta0 (float) – Beta prior parameter \(\beta_0\).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\).

Return type:

tuple[float, float, float, float]

mg_pass_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Bayesian posterior summary for mG-Pass@k.

The latent target averages thresholded G-Pass@k over thresholds from 0.5 to 1.0 using the closed-form weighting in mg_pass_at_k().

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Number of selected samples with 1 <= k <= N.

  • confidence (float) – credibility level of the interval.

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval.

  • alpha0 (float) – Beta prior parameter \(\alpha_0\).

  • beta0 (float) – Beta prior parameter \(\beta_0\).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\).

Return type:

tuple[float, float, float, float]

GeoSpectrum

geom_at_k(R, k, pass_power=0.5, unanimous_power=0.5)[source]

Questionwise Geom@k averaged across questions.

This is scorio’s primary Geom@k metric. Unlike geom_ds_at_k(), which blends dataset-level Pass@k and dataset-level Unanimous@k, this function first computes the per-question quantities

\[P_{\alpha,k} = 1 - \frac{\binom{N - \nu_\alpha}{k}}{\binom{N}{k}}\]
\[U_{\alpha,k} = \frac{\binom{\nu_\alpha}{k}}{\binom{N}{k}}\]

forms the geometric blend

\[G_{\alpha,k} = P_{\alpha,k}^{a}\,U_{\alpha,k}^{b},\]

and only then averages across questions.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Sampling budget with \(1 \le k \le N\).

  • pass_power (float) – Exponent applied to per-question Pass@k.

  • unanimous_power (float) – Exponent applied to per-question Unanimous@k.

Returns:

The average questionwise Geom@k score.

Return type:

float

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(geom_at_k(R, 2), 6)
0.647106
geom_at_k_ci(R, k, pass_power=0.5, unanimous_power=0.5, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Approximate posterior summary for the questionwise Geom@k target.

This is the uncertainty counterpart of geom_at_k(): it applies a first-order delta method to each question’s latent Pass@k and Unanimous@k quantities, then averages the resulting question-level geometric blends.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Latent resampling budget. Once the posterior is defined, any integer \(k \ge 1\) is allowed.

  • pass_power (float) – Exponent applied to each question’s latent Pass@k.

  • unanimous_power (float) – Exponent applied to each question’s latent Unanimous@k.

  • confidence (float) – credibility level of the interval (default 0.95).

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval (default (0, 1)).

  • alpha0 (float) – Beta prior parameter \(\alpha_0\) (default 1).

  • beta0 (float) – Beta prior parameter \(\beta_0\) (default 1).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)

Return type:

tuple[float, float, float, float]

Formula:

Let \(\mu_{P,\alpha}\) and \(\mu_{U,\alpha}\) denote the posterior means of question \(\alpha\)’s latent Pass@k and Unanimous@k quantities. Then

\[\mu \approx \frac{1}{M}\sum_\alpha \mu_{P,\alpha}^{a}\,\mu_{U,\alpha}^{b}\]

and \(\sigma\) is computed by per-question first-order delta propagation through \(g(x, y) = x^a y^b\).

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> mu, sigma, lo, hi = geom_at_k_ci(R, 2)
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.610666, 0.133107, 0.3498, 0.8716)
geom_ds_at_k(R, k, pass_power=0.5, unanimous_power=0.5)[source]

Dataset-level Pass/Unanimous geometric blend.

This is the endpoint GeoSpectrum operating point from the paper: it first averages Pass@k and Unanimous@k across questions, then applies the geometric blend. For the questionwise metric that blends before averaging, use geom_at_k().

The default operating point is the geometric mean of dataset-level Pass@k and Unanimous@k (equivalently Pass^k). The same API also exposes nearby operating points by letting callers adjust the exponents on the Pass@k and Unanimous@k terms directly.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\). \(R_{\alpha i} = 1\) if trial \(i\) for question \(\alpha\) passed, 0 otherwise.

  • k (int) – Sampling budget with \(1 \le k \le N\).

  • pass_power (float) – Exponent applied to Pass@k.

  • unanimous_power (float) – Exponent applied to Unanimous@k.

Returns:

The dataset-level endpoint score.

Return type:

float

Formula:
\[G_{\mathrm{ds},k}(R; a, b) = \mathrm{Pass}@k(R)^a\, \mathrm{Unanimous}@k(R)^b\]

with the default dataset-level endpoint operating point given by \(a = b = 1/2\).

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(geom_ds_at_k(R, 2), 6)
0.653835
geom_ds_at_k_ci(R, k, pass_power=0.5, unanimous_power=0.5, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Approximate posterior summary for the dataset-level Geom@k target.

This is the uncertainty counterpart of geom_ds_at_k() and matches the dataset-level latent quantity introduced in the paper when pass_power = unanimous_power = 0.5.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Latent resampling budget. Once the posterior is defined, any integer \(k \ge 1\) is allowed.

  • pass_power (float) – Exponent applied to latent dataset-level Pass@k.

  • unanimous_power (float) – Exponent applied to latent dataset-level Unanimous@k.

  • confidence (float) – credibility level of the interval (default 0.95).

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval (default (0, 1)).

  • alpha0 (float) – Beta prior parameter \(\alpha_0\) (default 1).

  • beta0 (float) – Beta prior parameter \(\beta_0\) (default 1).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)

Return type:

tuple[float, float, float, float]

Formula:

Let \(\mu_P\) and \(\mu_U\) denote the posterior means of the latent dataset-level Pass@k and Unanimous@k quantities. Then

\[\mu \approx \mu_P^a\,\mu_U^b\]

and \(\sigma\) is computed by first-order delta propagation through \(g(x, y) = x^a y^b\).

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> mu, sigma, lo, hi = geom_ds_at_k_ci(R, 2)
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.612112, 0.132755, 0.3519, 0.8723)
geo_spectrum_at_k(R, k, lam=0.5, weights=None, lambda_=None)[source]

\(\mathrm{GeoSpectrum}_{\lambda,w}@k\) on the observed finite bank.

By default weights=None selects the upper-half mG weights, so the two-argument call geo_spectrum_at_k(R, k) remains the special case

\[\mathrm{GeoSpectrum}^*@k(R) = \sqrt{\mathrm{Pass}@k(R)\,\mathrm{mG\text{-}Pass}@k(R)}.\]

This function also accepts the keyword alias lambda_=... for callers that prefer naming the coupling parameter after the mathematical symbol.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Sampling budget with \(1 \le k \le N\).

  • lam (float) – The coupling parameter \(\lambda\) in \([0,1]\).

  • weights (ndarray | list[float] | tuple[float, ...] | None) – Spectrum weights \(w\). If omitted, uses the built-in upper-half mG weights. Custom weights must be length-\(k\), non-negative, finite, and satisfy \(\sum_r w_r \le 1\).

  • lambda_ (float | None)

Returns:

\(\mathrm{GeoSpectrum}_{\lambda,w}@k(R)\).

Return type:

float

Formula:
\[\mathrm{GeoSpectrum}_{\lambda,w}@k(R) = \mathrm{Pass}@k(R)^\lambda \, S_{w,k}(R)^{1-\lambda}\]

Examples

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(geo_spectrum_at_k(R, 3), 6)
0.408248
>>> round(geo_spectrum_at_k(R, 3, lam=1.0), 6)
1.0
geo_spectrum_at_k_ci(R, k, lam=0.5, weights=None, lambda_=None, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Approximate posterior summary for latent \(\mathrm{GeoSpectrum}_{\lambda,w}@k\).

As in geo_spectrum_at_k(), omitting weights selects the GeoSpectrum*@k operating point.

This function also accepts the keyword alias lambda_=... for callers that prefer naming the coupling parameter after the mathematical symbol.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Latent resampling budget. Once the posterior is defined, any integer \(k \ge 1\) is allowed.

  • lam (float) – The coupling parameter \(\lambda\) in \([0,1]\).

  • weights (ndarray | list[float] | tuple[float, ...] | None) – Spectrum weights \(w\). If omitted, uses the built-in upper-half mG weights unless \(\lambda = 1\), in which case the spectrum term is irrelevant. Custom weights must be length-\(k\), non-negative, finite, and satisfy \(\sum_r w_r \le 1\).

  • confidence (float) – credibility level of the interval (default 0.95).

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval (default (0, 1)).

  • alpha0 (float) – Beta prior parameter \(\alpha_0\) (default 1).

  • beta0 (float) – Beta prior parameter \(\beta_0\) (default 1).

  • lambda_ (float | None)

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)

Return type:

tuple[float, float, float, float]

Formula:

Let \(x\) denote latent Pass@k and \(y\) denote the latent spectrum \(S_{w,k}\). The posterior mean is approximated by

\[\mu \approx x^\lambda y^{1-\lambda}\]

evaluated at the posterior means of \(x\) and \(y\), and \(\sigma\) is obtained by first-order delta propagation through \(g(x, y) = x^\lambda y^{1-\lambda}\).

geo_spectrum_star_at_k(R, k)[source]

Explicit alias for the default GeoSpectrum*@k operating point.

Equivalent to calling geo_spectrum_at_k() with the default upper-half mG spectrum weights.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Sampling budget with \(1 \le k \le N\).

Returns:

The GeoSpectrum*@k score.

Return type:

float

geo_spectrum_star_at_k_ci(R, k, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Approximate posterior summary for latent GeoSpectrum*@k.

Equivalent to geo_spectrum_at_k_ci() with the default upper-half mG spectrum weights.

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Latent resampling budget. Once the posterior is defined, any integer \(k \ge 1\) is allowed.

  • confidence (float) – credibility level of the interval (default 0.95).

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval (default (0, 1)).

  • alpha0 (float) – Beta prior parameter \(\alpha_0\) (default 1).

  • beta0 (float) – Beta prior parameter \(\beta_0\) (default 1).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)

Return type:

tuple[float, float, float, float]

threshold_spectrum_at_k(R, k, weights)[source]

Finite-bank threshold-spectrum summary \(S_{w,k}(R)\).

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Sampling budget with \(1 \le k \le N\).

  • weights (ndarray | list[float] | tuple[float, ...]) – Non-negative length-\(k\) weights with \(\sum_r w_r \le 1\).

Returns:

\(S_{w,k}(R)\) averaged across questions.

Return type:

float

Notes

This summary is defined by

\[S_{w,k}(R) = \sum_{r=1}^k w_r T_{r,k}(R).\]

The implementation uses the equivalent event-score representation from Appendix C.4.

threshold_spectrum_at_k_ci(R, k, weights, confidence=0.95, bounds=(0.0, 1.0), alpha0=1.0, beta0=1.0)[source]

Approximate posterior summary for the latent spectrum \(S_{w,k}(p)\).

Parameters:
  • R (ndarray) – \(M \times N\) binary matrix with entries in \(\{0,1\}\).

  • k (int) – Latent resampling budget. Once the posterior is defined, any integer \(k \ge 1\) is allowed.

  • weights (ndarray | list[float] | tuple[float, ...]) – Non-negative length-\(k\) weights with \(\sum_r w_r \le 1\).

  • confidence (float) – credibility level of the interval (default 0.95).

  • bounds (tuple[float, float]) – (lo, hi) clipping bounds for the interval (default (0, 1)).

  • alpha0 (float) – Beta prior parameter \(\alpha_0\) (default 1).

  • beta0 (float) – Beta prior parameter \(\beta_0\) (default 1).

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)

Return type:

tuple[float, float, float, float]

Notes

Unlike threshold_spectrum_at_k(), the posterior target is defined for latent i.i.d. resampling and therefore does not require \(k \le N\).

Formula:

Let \(A_j = \sum_{r \le j} w_r\). The per-question latent target is

\[g(p) = \sum_{j=1}^{k} A_j \binom{k}{j} p^j (1-p)^{k-j}.\]

Dataset-level aggregation uses

\[\mu = \frac{1}{M} \sum_{\alpha=1}^{M} \mathbb{E}[g(p_\alpha)]\]
\[\sigma = \frac{1}{M} \sqrt{ \sum_{\alpha=1}^{M} \mathrm{Var}[g(p_\alpha)] }.\]

Max-Reward

max_at_k(R, k, w=None)[source]

Max@k: expected best reward among k sampled traces.

When w = [0, 1], Max@k reduces exactly to Pass@k. More generally, the reward vector w maps categorical outcomes to arbitrary real-valued scores, and Max@k averages the best score obtainable from a subset of size k.

References

  • Bagirov, F., et al. (2025). The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimization. arXiv:2510.23393. https://arxiv.org/abs/2510.23393 The finite-sample estimator below matches Appendix C.1 / Listing 1.

  • Walder, C., & Karkhanis, D. (2025). Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems. arXiv:2505.15201.

Parameters:
  • R (ndarray) – \(M \times N\) categorical outcome matrix with integer entries in \(\{0, \ldots, C\}\).

  • k (int) – Number of selected samples, with 1 <= k <= N.

  • w (ndarray | None) – Optional reward vector of shape (C+1,). If omitted, R must be binary and w = [0, 1] is used.

Returns:

Average Max@k score across prompts.

Return type:

float

Formula:

Let \(g_{\alpha 1} \le \cdots \le g_{\alpha N}\) denote the sorted rewards for prompt \(\alpha\). Then the unbiased finite- sample estimator is

\[\mathrm{Max@k}_\alpha = \frac{1}{\binom{N}{k}} \sum_{i=k}^{N} \binom{i-1}{k-1} g_{\alpha i}.\]

The dataset-level metric is the average across prompts:

\[\mathrm{Max@k} = \frac{1}{M} \sum_{\alpha=1}^{M} \mathrm{Max@k}_\alpha\]

Examples

Binary (reduces to Pass@k):

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> round(max_at_k(R, 2), 6)
0.95

Weighted categorical rewards:

>>> R = np.array([[0, 1, 2, 2, 1],
...               [1, 1, 0, 2, 2]])
>>> w = np.array([0.0, 0.5, 1.0])
>>> round(max_at_k(R, 2, w=w), 6)
0.85
max_at_k_ci(R, k, w=None, R0=None, confidence=0.95, bounds=None)[source]

Bayesian posterior summary for max_at_k().

The posterior uses the same Dirichlet-plus-one construction as bayes(). When k = 1, Max@1 reduces to the usual single-draw expected score, so this function agrees with bayes_ci().

This uncertainty model is a scorio extension. Bagirov et al. (2025) define the finite-sample max@k point estimator, but do not derive these Bayesian credible intervals.

Parameters:
  • R (ndarray) – \(M \times N\) categorical outcome matrix with integer entries in \(\{0, \ldots, C\}\).

  • k (int) – Selection count. The posterior target is defined for any integer k >= 1; k = 1 matches bayes_ci().

  • w (ndarray | None) – Optional reward vector of shape (C+1,). If omitted, R must be binary and w = [0, 1] is used.

  • R0 (ndarray | None) – Optional \(M \times D\) matrix of prior outcomes.

  • confidence (float) – Credibility level for the normal-approximation interval.

  • bounds (tuple[float, float] | None) – Optional (lo, hi) clipping bounds. If omitted, the interval is clipped to the minimum and maximum reward levels in w.

Returns:

\((\mu,\; \sigma,\; \text{lo},\; \text{hi})\)

Return type:

tuple[float, float, float, float]

Formula:

Let \(r_1 < \cdots < r_L\) be the unique reward levels and \(A_{\alpha \ell}\) the posterior cumulative probability of obtaining reward at most \(r_\ell\) for prompt \(\alpha\). Then the per-prompt latent target is

\[g_\alpha = r_L - \sum_{\ell=1}^{L-1} (r_{\ell+1} - r_\ell) A_{\alpha \ell}^k\]

and posterior moments are computed in closed form under the grouped Dirichlet posterior.

Examples

Binary:

>>> import numpy as np
>>> R = np.array([[0, 1, 1, 0, 1],
...               [1, 1, 0, 1, 1]])
>>> mu, sigma, lo, hi = max_at_k_ci(R, 2)
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.839286, 0.097263, 0.6487, 1.0)

Weighted categorical rewards:

>>> R = np.array([[0, 1, 2, 2, 1],
...               [1, 1, 0, 2, 2]])
>>> w = np.array([0.0, 0.5, 1.0])
>>> mu, sigma, lo, hi = max_at_k_ci(R, 2, w=w)
>>> round(mu, 6), round(sigma, 6), round(lo, 4), round(hi, 4)
(0.75, 0.08812, 0.5773, 0.9227)