scorio.rank

Ranking methods for comparing multiple models.

This module implements ranking estimators for binary response tensors used in scorio evaluation workflows.

Notation

All methods use the same core notation:

\(R \in \{0,1\}^{L \times M \times N}\) is the response tensor.
\(L\) is the number of models.
\(M\) is the number of questions.
\(N\) is the number of trials per question.
\(R_{lmn}=1\) means model \(l\) solves question \(m\) on trial \(n\).

For single-trial input, (L, M) is promoted to (L, M, 1).

Longitudinal variants of scorio.rank.dynamic_irt() (variant="growth" and variant="state_space") interpret axis \(N\) as ordered measurement time rather than i.i.d. trials.

Every ranking method computes raw scores \(s_l\) and then maps those scores to ranks:

\[s_l = f_l(R; \psi), \qquad r = \operatorname{rank\_scores}(s)[\text{method}]\]

By default, methods return r as a one-dimensional array of shape (L,). With return_scores=True, methods return (r, s).

Available Methods

Evaluation metric-based: avg, bayes, pass_at_k, pass_hat_k, g_pass_at_k_tau, mg_pass_at_k.

Paired-comparison probabilistic models: bradley_terry, bradley_terry_map, bradley_terry_davidson, bradley_terry_davidson_map, rao_kupper, rao_kupper_map.

Accuracy-based: inverse_difficulty.

Pairwise rating systems: elo, glicko, trueskill.

Bayesian methods: thompson, bayesian_mcmc.

Voting methods: borda, copeland, win_rate, minimax, schulze, ranked_pairs, kemeny_young, nanson, baldwin, majority_judgment.

Item Response Theory: rasch, rasch_map, rasch_2pl, rasch_2pl_map, rasch_3pl, rasch_3pl_map, rasch_mml, rasch_mml_credible, dynamic_irt, mirt (compensatory multidimensional IRT).

Graph-based: pagerank, spectral, rank_centrality, alpharank, nash.

Seriation-based: serial_rank.

Hodge-theoretic: hodge_rank.

Listwise and setwise choice models (Luce family): plackett_luce, plackett_luce_map, davidson_luce, davidson_luce_map, bradley_terry_luce, bradley_terry_luce_map.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> # Generate sample data: 3 models, 5 questions, 4 trials
>>> R = np.random.randint(0, 2, size=(3, 5, 4))
>>> # Get rankings using different methods
>>> ranks_bayes = rank.bayes(R, w=np.array([0.0, 1.0]))
>>> ranks_pass_at_k = rank.pass_at_k(R, k=2)
>>> # Get raw scores when needed
>>> ranking, scores = rank.avg(R, return_scores=True)

Notation

All ranking methods use response tensors \(R \in \{0,1\}^{L \times M \times N}\), with \(L\) models, \(M\) questions, and \(N\) trials per question. Methods compute raw scores \(s_l\) and then convert scores to ranks with scorio.utils.rank_scores.

Prior Classes

Prior penalties for MAP ranking methods.

Method context:

Several ranking methods in scorio.rank estimate latent log-strengths theta via a MAP objective of the form

\[\mathcal{L}_{\text{MAP}}(\theta) = \mathcal{L}_{\text{NLL}}(\theta) + P(\theta),\]

where P(theta) is the prior penalty (negative log-prior up to constants). This module defines reusable prior classes implementing P(theta) through a common Prior interface.

class Prior[source]

Bases: ABC

Abstract interface for prior penalties on log-strength parameters.

Method context:: Ranking MAP solvers optimize a negative objective and add a prior penalty term. Subclasses define that term via penalty().

Notes

theta should be interpreted as centered log-strengths in most Scorio MAP estimators because strengths are identifiable only up to an additive constant in log-space.

abstractmethod penalty(theta)[source]

Evaluate the prior penalty P(theta).

Parameters:: theta (ndarray) – Log-strength vector of shape (L,).
Return type:: float
Returns:: Scalar penalty to be added to a negative log-likelihood objective.

Notation:: theta_i denotes the latent log-strength for model i.
Formula:: Subclasses implement concrete forms of

\[P(\theta) = -\log p(\theta) + \text{constant}.\]

class EmpiricalPrior(R0, var=1.0, eps=1e-6)[source]

Bases: Prior

Empirical Gaussian prior from prior outcome tensor R0.

Method context:: Builds a model-specific prior mean from empirical accuracies in R0 and applies a shared Gaussian variance around those means. This is useful when a deterministic baseline (for example greedy decoding) should inform stochastic ranking.

References

Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2026). Don’t Pass@k: A Bayesian Framework for Large Language Model Evaluation. https://arxiv.org/abs/2510.04265

Parameters:

R0 (ndarray) – Prior outcomes with shape (L, M, D) or (L, M). (L, M) is treated as D=1.
var (float) – Positive Gaussian variance around empirical logit means.
eps (float) – Clipping constant for stable logit transform.

Examples

>>> import numpy as np
>>> R0 = np.array([
...     [1, 1, 1, 0, 1],
...     [0, 1, 0, 0, 1],
... ])
>>> prior = EmpiricalPrior(R0, var=1.0)
>>> prior.prior_mean.shape
(2,)

Notation:: acc_i: mean empirical accuracy for model i in R0. mu_i: empirical logit mean.
Formula:: \[\mu_i = \operatorname{logit}(\operatorname{clip}(acc_i, \epsilon, 1-\epsilon))\]

\[P(\theta) = \frac{1}{2\,\mathrm{var}} \sum_i (\theta_i - \mu_i)^2\]

Notes

Prior means are centered to match BT-style identifiability constraints.

penalty(theta)[source]

Evaluate empirical-Gaussian penalty around learned prior means.

Parameters:: theta (ndarray) – Log-strength vector of shape (L,).
Return type:: float
Returns:: Scalar penalty value.
Raises:: ValueError – If theta length differs from prior model count.

class GaussianPrior(mean=0.0, var=1.0)[source]

Bases: Prior

Isotropic Gaussian prior on log-strengths.

Method context:: Standard L2-style regularization used in many MAP ranking objectives.

Parameters:

mean (float) – Prior location.
var (float) – Positive prior variance.

Returns:

penalty(theta) returns a scalar quadratic penalty.

Formula:: \[P(\theta) = \frac{1}{2\,\mathrm{var}}\sum_i (\theta_i-\mathrm{mean})^2\]

Examples

>>> prior = GaussianPrior(mean=0.0, var=1.0)
>>> prior.penalty(np.array([0.5, -0.5]))
0.25

penalty(theta)[source]

Evaluate quadratic Gaussian penalty.

Parameters:: theta (ndarray) – Log-strength vector.
Return type:: float
Returns:: Scalar penalty value.

class LaplacePrior(loc=0.0, scale=1.0)[source]

Bases: Prior

Laplace prior on log-strengths.

Method context:: L1-style shrinkage prior that can be more robust than Gaussian around outliers and may encourage sparse deviations from loc.

Parameters:

loc (float) – Prior location (mode).
scale (float) – Positive scale.

Examples

>>> prior = LaplacePrior(loc=0.0, scale=1.0)
>>> prior.penalty(np.array([0.5, -0.5]))
1.0

Formula:: \[P(\theta) = \frac{1}{\mathrm{scale}} \sum_i |\theta_i - \mathrm{loc}|\]

penalty(theta)[source]

Evaluate absolute-deviation Laplace penalty.

Parameters:: theta (ndarray) – Log-strength vector.
Return type:: float
Returns:: Scalar penalty value.

class CauchyPrior(loc=0.0, scale=1.0)[source]

Bases: Prior

Cauchy prior on log-strengths.

Method context:: Heavy-tailed prior that penalizes extreme values less aggressively than Gaussian/Laplace priors.

Parameters:

loc (float) – Prior location.
scale (float) – Positive scale.

Examples

>>> prior = CauchyPrior(loc=0.0, scale=1.0)
>>> prior.penalty(np.array([2.0, -2.0]))
3.218...

Formula:: \[P(\theta) = \sum_i \log\left(1 + \left(\frac{\theta_i-\mathrm{loc}}{\mathrm{scale}}\right)^2\right)\]

penalty(theta)[source]

Evaluate heavy-tailed Cauchy penalty.

Parameters:: theta (ndarray) – Log-strength vector.
Return type:: float
Returns:: Scalar penalty value.

class UniformPrior[source]

Bases: Prior

Improper uniform prior on log-strengths.

Method context:: Disables regularization in MAP routines (equivalent to ML objective).

Examples

>>> prior = UniformPrior()
>>> prior.penalty(np.array([100.0, -100.0]))
0.0

Formula:: \[P(\theta) = 0\]

penalty(theta)[source]

Return zero penalty.

Parameters:: theta (ndarray) – Log-strength vector (ignored).
Return type:: float
Returns:: Always 0.0.

class CustomPrior(penalty_fn)[source]

Bases: Prior

User-defined prior penalty wrapper.

Method context:: Allows custom regularization while preserving the Prior interface expected by MAP estimators.

Parameters:: penalty_fn (Callable[[ndarray], float]) – Callable mapping theta to a scalar penalty.

Examples

>>> def horseshoe_penalty(theta):
...     return np.log1p(theta**2).sum()
>>> prior = CustomPrior(horseshoe_penalty)
>>> float(prior.penalty(np.array([0.0, 1.0]))) > 0.0
True

penalty(theta)[source]

Evaluate user-provided penalty function.

Parameters:: theta (ndarray) – Log-strength vector.
Return type:: float
Returns:: Scalar penalty from penalty_fn(theta).

Evaluation-based Ranking Methods

Evaluation-metric ranking methods.

These methods map each model’s responses to a scalar score and then convert scores to ranks with scorio.utils.rank_scores().

Notation

Let \(R \in \{0,1,\ldots,C\}^{L \times M \times N}\) denote model outcomes, and define per-question correct-count summaries \(k_{lm}=\sum_{n=1}^{N} R_{lmn}\) when outcomes are binary.

The module follows the score template

\[s_l = \frac{1}{M}\sum_{m=1}^{M} g_m(k_{lm}, N; \psi),\]

where \(g_m\) depends on the selected evaluation metric.

avg(R, method='competition', return_scores=False)[source]

Rank models by mean accuracy over all questions and trials.

Method context:: This is the simplest pointwise ranking baseline: each model receives one score equal to its empirical success rate across all M * N outcomes.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to rank_scores. One of "competition", "dense", "avg", "competition_max".
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns scores of shape (L,).

Notation:: R[l, m, n] is the binary outcome for model l, question m, trial n.
Formula:: \[s_l^{\mathrm{avg}} = \frac{1}{MN} \sum_{m=1}^{M}\sum_{n=1}^{N} R_{lmn}\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [0, 1]],
...     [[1, 0], [0, 0]],
... ])
>>> ranks, scores = rank.avg(R, return_scores=True)
>>> scores.round(3).tolist()
[0.75, 0.25]
>>> ranks.tolist()
[1, 2]

bayes(R, w=None, R0=None, quantile=None, method='competition', return_scores=False)[source]

Rank models with Bayes@N posterior statistics.

Method context:: For each model, this method computes Bayes@N posterior summary statistics (mu_l, sigma_l) from categorical outcomes. Ranking can be based on posterior mean (default) or a Normal-quantile conservative score.

References

Hariri, M., Samandar, A., Hinczewski, M., & Chaudhary, V. (2026). Don’t Pass@k: A Bayesian Framework for Large Language Model Evaluation. ICLR 2026, arXiv:2510.04265. https://arxiv.org/abs/2510.04265

Parameters:

R (ndarray) – Categorical outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1). Entries must be integers in {0, ..., C}.
w (ndarray | None) – Weight vector of shape (C+1,) mapping categories to scores. If not provided and R is binary (contains only 0 and 1), defaults to [1, 0]. For non-binary R, w is required.
R0 (ndarray | None) – Optional prior outcomes. Supported shapes: - (M, D): one shared prior matrix reused for all models. - (L, M, D): model-specific prior outcomes.
quantile (float | None) – Optional quantile q in [0, 1]. If None, rank by posterior mean. Otherwise rank by mu_l + Phi^{-1}(q) sigma_l.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule for score-to-rank conversion.
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns per-model scores used for ranking (posterior means or quantile scores), shape (L,).

Notation:: mu_l, sigma_l are Bayes@N posterior mean and uncertainty for model l computed by scorio.eval.bayes().
Formula:: \[\begin{split}s_l = \begin{cases} \mu_l, & \text{if } q\text{ is None} \\ \mu_l + \Phi^{-1}(q)\,\sigma_l, & \text{otherwise} \end{cases}\end{split}\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 0], [1, 1], [0, 0]],
...     [[0, 0], [1, 0], [1, 1]],
... ])
>>> w = np.array([0.0, 1.0])
>>> R0 = np.array([[1, 1], [0, 1], [0, 0]])  # shared prior
>>> ranks, scores = rank.bayes(R, w=w, R0=R0, return_scores=True)
>>> ranks.shape, scores.shape
((2,), (2,))

Notes

Lower quantiles (for example q=0.05) implement conservative ranking by penalizing posterior uncertainty.

pass_at_k(R, k, method='competition', return_scores=False)[source]

Rank models by the Pass@k metric.

Method context:: Pass@k measures the probability that at least one of k draws without replacement is correct for a question. Scores are averaged across questions per model.

References

Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374. https://arxiv.org/abs/2107.03374

Parameters:

R (ndarray) – Binary outcome tensor of shape (L, M, N) or matrix (L, M).
k (int) – Number of selected samples, with 1 <= k <= N.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule for rank_scores.
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns per-model Pass@k scores.

Notation:: nu_lm = sum_{n=1}^N R_lmn is the number of successes for model l on question m.
Formula:: \[s_l^{\mathrm{Pass@}k} = \frac{1}{M} \sum_{m=1}^{M} \left(1 - \frac{{N-\nu_{lm} \choose k}}{{N \choose k}}\right)\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1, 0], [0, 1, 0]],
...     [[1, 0, 0], [0, 0, 0]],
... ])
>>> ranks, scores = rank.pass_at_k(R, k=2, return_scores=True)
>>> ranks.tolist()
[1, 2]

pass_hat_k(R, k, method='competition', return_scores=False)[source]

Rank models by Pass-hat@k (G-Pass@k).

Method context:: Pass-hat@k is the probability that all k selected samples are correct for a question, then averaged across questions.

References

Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045. https://arxiv.org/abs/2406.12045

Parameters:

R (ndarray) – Binary outcome tensor of shape (L, M, N) or matrix (L, M).
k (int) – Number of selected samples, with 1 <= k <= N.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule for rank_scores.
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns per-model Pass-hat@k scores.

Notation:: nu_lm = sum_{n=1}^N R_lmn.
Formula:: \[s_l^{\widehat{\mathrm{Pass@}k}} = \frac{1}{M} \sum_{m=1}^{M} \frac{{\nu_{lm} \choose k}}{{N \choose k}}\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1, 0], [0, 1, 0]],
...     [[1, 0, 0], [0, 0, 0]],
... ])
>>> rank.pass_hat_k(R, k=1).tolist()
[1, 2]

g_pass_at_k_tau(R, k, tau, method='competition', return_scores=False)[source]

Rank models by generalized G-Pass@k_tau.

Method context:: G-Pass@k_tau measures the probability of obtaining at least ceil(tau * k) successes in k draws without replacement. It interpolates between Pass@k (small tau) and Pass-hat@k (tau=1).

References

Liu, J., Liu, H., Xiao, L., et al. (2025). Are Your LLMs Capable of Stable Reasoning? arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Parameters:

R (ndarray) – Binary outcome tensor of shape (L, M, N) or matrix (L, M).
k (int) – Number of selected samples, with 1 <= k <= N.
tau (float) – Threshold parameter in [0, 1].
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule for rank_scores.
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns per-model G-Pass@k_tau scores.

Notation:: X_lm ~ Hypergeom(N, nu_lm, k) where nu_lm is the success count for model l and question m.
Formula:: \[s_l^{\mathrm{G\text{-}Pass@}k_\tau} = \frac{1}{M} \sum_{m=1}^{M} \Pr\left(X_{lm} \ge \lceil \tau k \rceil\right)\]

\[\Pr\left(X_{lm} \ge \lceil \tau k \rceil\right) = \sum_{j=\lceil \tau k \rceil}^{k} \frac{{\nu_{lm} \choose j}{N-\nu_{lm} \choose k-j}} {{N \choose k}}\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1, 0], [0, 1, 0]],
...     [[1, 0, 0], [0, 0, 0]],
... ])
>>> rank.g_pass_at_k_tau(R, k=2, tau=1.0).tolist() == rank.pass_hat_k(R, 2).tolist()
True

mg_pass_at_k(R, k, method='competition', return_scores=False)[source]

Rank models by mG-Pass@k (mean generalized pass metric).

Method context:: mG-Pass@k aggregates G-Pass@k_tau for tau in [0.5, 1] via the discrete summation proposed in the G-Pass literature, producing a stability-focused score.

References

Liu, J., Liu, H., Xiao, L., et al. (2025). Are Your LLMs Capable of Stable Reasoning? arXiv:2412.13147. https://arxiv.org/abs/2412.13147

Parameters:

R (ndarray) – Binary outcome tensor of shape (L, M, N) or matrix (L, M).
k (int) – Number of selected samples, with 1 <= k <= N.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule for rank_scores.
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns per-model mG-Pass@k scores.

Notation:: X_lm ~ Hypergeom(N, nu_lm, k), and m0 = ceil(k/2).
Formula:: \[s_l^{\mathrm{mG\text{-}Pass@}k} = \frac{1}{M} \sum_{m=1}^{M} \frac{2}{k} \sum_{i=m_0+1}^{k} \Pr(X_{lm} \ge i)\]

\[\frac{2}{k} \sum_{i=m_0+1}^{k} \Pr(X_{lm} \ge i) = \frac{2}{k} \, \mathbb{E}\left[(X_{lm}-m_0)_+\right]\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1, 0], [0, 1, 0]],
...     [[1, 0, 0], [0, 0, 0]],
... ])
>>> ranks, scores = rank.mg_pass_at_k(R, k=2, return_scores=True)
>>> ranks.tolist()
[1, 2]

Pointwise Methods

Pointwise ranking methods.

Pointwise methods aggregate each model’s item-level performance into one scalar without explicit head-to-head modeling.

Notation

Let \(R \in \{0,1\}^{L \times M \times N}\) and define \(k_{lm}=\sum_{n=1}^{N} R_{lmn}\), \(\widehat{p}_{lm}=k_{lm}/N\).

The pointwise score template is

\[s_l = \sum_{m=1}^{M} w_m \, g\!\left(\widehat{p}_{lm}, \phi_m\right),\]

where \(\phi_m\) is an item statistic, \(g\) is a fixed transform, and \(w_m\) are nonnegative weights.

inverse_difficulty(R, method='competition', return_scores=False, clip_range=(0.01, 0.99))[source]

Rank models by inverse-difficulty-weighted per-question accuracy.

Method context:: This pointwise method upweights hard questions by assigning each question an inverse weight proportional to the reciprocal global solve rate. It emphasizes rare successes while still aggregating all questions into one scalar score per model.

References

Inverse probability weighting (Wikipedia): https://en.wikipedia.org/wiki/Inverse_probability_weighting

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores(). One of "competition", "competition_max", "dense", or "avg".
return_scores (bool) – If True, return (ranking, scores).
clip_range (tuple) – Two-sided clipping interval (a, b) applied to global solve rates before inversion. Must satisfy 0 < a < b <= 1.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns inverse-difficulty scores of shape (L,).

Notation:: k_lm = sum_{n=1}^N R_lmn and p_hat_lm = k_lm / N. The global per-question solve rate is p_bar_m = (1/L) * sum_l p_hat_lm.
Formula:: \[w_m \propto \frac{1}{\operatorname{clip}(\bar p_m, a, b)}, \quad \sum_{m=1}^{M} w_m = 1\]

\[s_l^{\mathrm{inv\text{-}diff}} = \sum_{m=1}^{M} w_m \, \widehat{p}_{lm}\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [0, 0], [0, 0]],
...     [[0, 0], [1, 1], [0, 0]],
... ])
>>> ranks, scores = rank.inverse_difficulty(R, return_scores=True)
>>> ranks.tolist()
[1, 1]
>>> scores.shape
(2,)

>>> # tighter clipping is allowed
>>> rank.inverse_difficulty(R, clip_range=(0.05, 0.95)).shape
(2,)

Notes

Very small clip lower bounds can make the weighting highly sensitive to a few rare solves. This implementation is a simple inverse-weighted pointwise scorer.

Rank Centrality Methods

Rank Centrality: ranking from pairwise comparisons.

This module implements the Markov-chain estimator of Negahban, Oh, and Shah (2017).

Notation

Let \(R \in \{0,1\}^{L \times M \times N}\). For each pair \((i,j)\), let \(W_{ij}\) be decisive wins of \(i\) over \(j\). Define the pairwise empirical probabilities \(\widehat{P}_{i\succ j}\) from decisive outcomes (optionally tie adjusted).

Rank Centrality builds a row-stochastic transition matrix \(P\) with off-diagonal mass proportional to \(\widehat{P}_{j\succ i}\) and ranks by the stationary distribution \(\pi\):

\[\pi^\top P = \pi^\top, \qquad \sum_i \pi_i = 1.\]

rank_centrality(R, method='competition', return_scores=False, tie_handling='half', smoothing=0.0, teleport=0.0, max_iter=10_000, tol=1e-12)[source]

Rank models with Rank Centrality.

Method context:: Build a row-stochastic random-walk matrix over models where transition probabilities prefer moving from a model to models that beat it. The stationary distribution of this chain is the score vector.

Parameters:

R (ndarray) – Binary tensor of shape (L, M, N) (or (L, M) which is treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Ranking method passed to scorio.utils.rank_scores.
return_scores (bool) – If True, return (ranking, scores) where scores are the stationary distribution.
tie_handling (str) –
- “ignore”: only decisive comparisons (i correct, j incorrect)
- ”half”: treat ties (both same) as 0.5 win for each side
smoothing (float) – Nonnegative pseudocount added to every directed win count. Use this to avoid disconnected graphs when tie_handling=”ignore”.
teleport (float) – Teleportation probability in [0, 1). When > 0, makes the Markov chain ergodic even if the comparison graph is disconnected.
max_iter (int) – Max iterations for the power method.
tol (float) – Convergence tolerance (L1 difference) for the power method.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns stationary-distribution scores (shape (L,)).

Formula:

Let d_max be the maximum degree of the undirected comparison graph. For i != j,

\[P_{ij} = \frac{1}{d_{\max}}\,\widehat{P}_{j\succ i}, \quad P_{ii} = 1 - \sum_{j\neq i} P_{ij}\]

where P_hat is computed from pairwise tied-split outcomes.

References

Negahban, S., Oh, S., & Shah, D. (2017). Rank Centrality: Ranking from Pairwise Comparisons. Operations Research.

HodgeRank

HodgeRank: statistical ranking via combinatorial Hodge theory.

This module implements the weighted \(\ell_2\) HodgeRank estimator from Jiang, Lim, Yao, and Ye (2009).

Notation

Let \(R \in \{0,1\}^{L \times M \times N}\) and define decisive wins \(W_{ij}\) and ties \(T_{ij}\) from pairwise model comparisons. From these counts, construct a skew-symmetric edge flow \(Y\) and nonnegative symmetric edge weights \(w_{ij}\).

HodgeRank estimates a potential vector \(s \in \mathbb{R}^L\) by

\[s^\star \in \arg\min_{s} \sum_{i<j} w_{ij}\left((s_j-s_i)-Y_{ij}\right)^2.\]

This yields the weighted Laplacian normal equations

\[\Delta_0 s^\star = -\operatorname{div}(Y), \qquad s^\star = -\Delta_0^{\dagger}\operatorname{div}(Y),\]

where \(\Delta_0^{\dagger}\) is the Moore-Penrose pseudoinverse. Only score differences are identifiable; adding a constant to \(s\) does not change the ranking.

hodge_rank(R, pairwise_stat='binary', weight_method='total', epsilon=0.5, method='competition', return_scores=False, return_diagnostics=False)[source]

Rank models with l2 HodgeRank on pairwise-comparison graphs.

Method context:: HodgeRank treats aggregated pairwise outcomes as an edge flow and finds global potentials whose gradient best matches that flow in weighted least squares. The minimum-norm solution is obtained from a graph Laplacian pseudoinverse.

References

Jiang, X., Lim, L.-H., Yao, Y., & Ye, Y. (2009). Statistical Ranking and Combinatorial Hodge Theory. https://arxiv.org/abs/0811.1067

Parameters:

R (ndarray) – Binary tensor of shape (L, M, N) (or (L, M) treated as N=1).
pairwise_stat (str) –
- “binary”: Ȳ_ij = P(j>i) - P(i>j)
- ”log_odds”: Ȳ_ij = log(P(j>=i)/P(j<=i)) with additive smoothing
weight_method (str) –
- “total”: w_ij = #comparisons (including ties)
- ”decisive”: w_ij = #non-tie comparisons (wins+losses)
- ”uniform”: w_ij = 1 if comparable else 0
epsilon (float) – Additive smoothing (counts) used only for pairwise_stat=”log_odds”.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Ranking method passed to scorio.utils.rank_scores.
return_scores (bool) – If True, return (ranking, scores) where scores are the HodgeRank potentials s (higher ⇒ better).
return_diagnostics (bool) – If True, also returns a small diagnostics dict with least-squares residual norms.

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, dict[str, float]]

Returns:

Ranking array of shape (L,) by default. If return_scores=True, returns (ranking, scores). If return_diagnostics=True, returns (ranking, scores, diagnostics).

Notation:: Y: skew-symmetric observed edge flow from pairwise outcomes. w_ij: symmetric nonnegative edge weights. (grad s)_ij = s_j - s_i. (div Y)_i = \sum_j w_{ij} Y_{ij}. \Delta_0: weighted graph Laplacian.
Formula:: \[s^{\star} \in \arg\min_{s\in\mathbb{R}^{L}} \sum_{i<j} w_{ij}\left((s_j-s_i)-Y_{ij}\right)^2\]

\[\Delta_0 s^{\star} = -\operatorname{div}(Y), \qquad s^{\star} = -\Delta_0^{\dagger}\operatorname{div}(Y)\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([[[1, 1], [1, 1]], [[0, 0], [0, 0]]])
>>> rank.hodge_rank(R).tolist()
[1, 2]

Notes

Scores are identifiable only up to an additive constant; rankings depend on score differences. Diagnostics report weighted residual norms ||Y - grad(s)||_{2,w}.

Serial Rank

SerialRank: spectral ranking by seriation.

This module implements the SerialRank construction of Fogel, d’Aspremont, and Vojnovic.

Notation

Let \(R \in \{0,1\}^{L \times M \times N}\) and build a skew-symmetric pairwise comparison matrix \(C\), with \(C_{ij}>0\) indicating evidence that model \(i\) is stronger than model \(j\).

SerialRank forms

\[S = \frac{1}{2}\left(L\mathbf{1}\mathbf{1}^{\top} + C C^{\top}\right), \qquad L_S = \operatorname{diag}(S\mathbf{1}) - S,\]

and ranks models by sorting a Fiedler vector of \(L_S\) (the eigenvector associated with the second-smallest eigenvalue).

serial_rank(R, comparison='prob_diff', method='competition', return_scores=False)[source]

Rank models with SerialRank spectral seriation.

Method context:: SerialRank builds a similarity matrix from pairwise comparisons, computes its graph Laplacian, and orders models by a Fiedler vector (second-smallest Laplacian eigenvector). This is a seriation-based ranking method designed to be robust to noisy pairwise outcomes.

References

Fogel, F., d’Aspremont, A., & Vojnovic, M. (2016). Spectral Ranking Using Seriation. Journal of Machine Learning Research. https://jmlr.org/papers/v17/16-035.html

Parameters:

R (ndarray) – Binary tensor of shape (L, M, N) (or (L, M) treated as N=1).
comparison (str) – How to aggregate multiple comparisons between a pair: - “prob_diff”: C_ij = (wins_ij - wins_ji) / total_ij in [-1, 1] - “sign”: C_ij = sign(wins_ij - wins_ji) in {-1, 0, 1}
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Ranking method passed to scorio.utils.rank_scores.
return_scores (bool) – If True, return (ranking, scores) where scores are the oriented Fiedler vector (higher ⇒ better).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, returns (ranking, scores) with scores of shape (L,).

Notation:: L: number of models. W_ij: number of decisive outcomes where model i beats j. T_ij: number of ties between models i and j. C: skew-symmetric comparison matrix. S: SerialRank similarity matrix. L_S: graph Laplacian of S.
Formula:: \[C_{ij} = \frac{W_{ij} - W_{ji}}{W_{ij} + W_{ji} + T_{ij}}, \quad C_{ii} = 0\]

\[S = \frac{1}{2}\left(L\,\mathbf{1}\mathbf{1}^{\top} + C C^{\top}\right), \quad L_S = \operatorname{diag}(S\mathbf{1}) - S\]

Rank by sorting a Fiedler vector of L_S. The sign is chosen to best align with observed pairwise comparisons.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([[[1, 1], [1, 0]], [[0, 0], [0, 1]]])
>>> rank.serial_rank(R).tolist()
[1, 2]

Notes

The original SerialRank paper sets C_ii = 1 for binary tournaments. In this implementation we keep C skew-symmetric with C_ii = 0; this changes S only by a constant diagonal shift and leaves the Laplacian (hence the ranking) unchanged.

If the Fiedler eigenvalue is not simple, the ordering is not identifiable. In that degenerate case, the method falls back to mean accuracy scores.

Graph-based Methods

Graph-based ranking methods.

These methods convert pairwise outcomes into a graph or Markov chain and rank models with stationary-distribution, spectral, or equilibrium concepts.

Notation

Let \(R \in \{0,1\}^{L \times M \times N}\) and define decisive wins \(W_{ij}\) and ties \(T_{ij}\). A shared empirical pairwise relation is

\[\widehat{P}_{i\succ j} = \frac{W_{ij} + \tfrac12 T_{ij}} {W_{ij} + W_{ji} + T_{ij}}.\]

Graph methods construct an operator \(\mathcal{G}(\widehat{P})\) and rank from a derived score vector \(s\), such as a stationary distribution, principal eigenvector, or game-theoretic equilibrium score.

pagerank(R, damping=0.85, max_iter=100, tol=1e-6, method='competition', return_scores=False, teleport=None)[source]

Rank models with PageRank on the pairwise win-probability graph.

Method context:: Build a directed graph where edge j -> i has weight P_hat[i, j] (losers link to winners). Column-normalize to obtain a transition matrix and run the damped PageRank fixed point with a teleportation vector e.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
damping (float) – PageRank damping factor d in (0, 1).
max_iter (int) – Positive maximum number of power iterations.
tol (float) – Positive L1 convergence tolerance.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
teleport (ndarray | None) – Optional teleportation vector e (shape (L,), nonnegative, finite). If None, uses uniform teleportation (1/L) * 1.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns PageRank scores r (shape (L,)).

Notation:: P_hat[i, j] is the tied-split empirical win probability of model i against model j.
Formula:: \[P_{ij} = \frac{\widehat{P}_{i\succ j}} {\sum_{k\neq j}\widehat{P}_{k\succ j}}\]

\[r = d P r + (1-d)\,e\]

where e is a probability vector. The default choice is \(e = \frac{1}{L}\mathbf{1}\).

References

Page, L., et al. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> rank.pagerank(R).tolist()
[1, 2]

spectral(R, max_iter=10_000, tol=1e-12, method='competition', return_scores=False)[source]

Rank models by spectral centrality of pairwise win probabilities.

Method context:: Form a nonnegative matrix W with off-diagonal entries W[i, j] = P_hat[i, j] and diagonal self-loop mass equal to row sum. The normalized dominant right eigenvector is the score vector. This is an eigenvector-based Perron-style spectral ranking heuristic.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
max_iter (int) – Positive max iterations for power iteration.
tol (float) – Positive L1 convergence tolerance.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns normalized spectral scores (shape (L,)).

Formula:: \[W_{ij}=\widehat{P}_{i\succ j}\;(i\neq j),\quad W_{ii}=\sum_{j\neq i}W_{ij}\]

\[v \propto W v,\quad \sum_i v_i = 1\]

References

Vigna, S. (2016). Spectral ranking. Network Science. Keener, J. P. (1993). The Perron-Frobenius theorem and the ranking of football teams. SIAM Review.

Examples

>>> import numpy as np
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> _, scores = spectral(R, return_scores=True)
>>> scores[0] > scores[1]
True

alpharank(R, alpha=1.0, population_size=50, max_iter=100_000, tol=1e-12, method='competition', return_scores=False)[source]

Rank models with single-population alpha-Rank.

Method context:: Treat models as strategies in a symmetric constant-sum game with payoff P_hat[i, j]. Build fixation probabilities under Fermi selection in a finite population and compute the stationary distribution of the induced Markov chain. This corresponds to the single-population setting from the AlphaRank framework (the paper’s general method also covers multi-population and asymmetric games).

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
alpha (float) – Selection intensity alpha >= 0.
population_size (int) – Finite population size m >= 2.
max_iter (int) – Positive max iterations for stationary-distribution iteration.
tol (float) – Positive L1 convergence tolerance.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns alpha-Rank stationary distribution scores (shape (L,)).

Formula:: \[\begin{split}u = \alpha\frac{m}{m-1}\left(\widehat{P}_{r\succ s}-\frac12\right), \quad \rho_{r,s} = \begin{cases} \frac{1-e^{-u}}{1-e^{-mu}}, & u\neq 0 \\ \frac{1}{m}, & u=0 \end{cases}\end{split}\]

\[C_{sr}=\frac{1}{L-1}\rho_{r,s},\quad C_{ss}=1-\sum_{r\neq s}C_{sr}\]

References

Omidshafiei, S., et al. (2019). α-Rank: Multi-Agent Evaluation by Evolution. Scientific Reports.

Examples

>>> import numpy as np
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> _, scores = alpharank(R, return_scores=True)
>>> scores[0] > scores[1]
True

nash(R, n_iter=100, temperature=0.1, solver='lp', score_type='vs_equilibrium', return_equilibrium=False, method='competition', return_scores=False)[source]

Rank models via Nash equilibrium on the zero-sum meta-game.

Method context:: Construct antisymmetric payoff matrix A = 2 * P_hat - 1 (with zero diagonal), solve a maximin mixed strategy x, then score models by expected performance versus x (Nash-averaging style) or alternative score views.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
n_iter (int) – Unused when solver=”lp” (kept for backward compatibility).
temperature (float) – Unused when solver=”lp” (kept for backward compatibility).
solver (str) – Currently only “lp” is supported.
score_type (str) – Which score vector to rank by: - “vs_equilibrium”: expected win probability vs equilibrium opponent. - “equilibrium”: the equilibrium mixture itself. - “advantage_vs_equilibrium”: expected zero-sum advantage vs equilibrium.
return_equilibrium (bool) – If True, also return the equilibrium mixture.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns scores for the selected score_type. If return_equilibrium=True, also returns equilibrium mixture x.

Formula:: \[A_{ij} = 2\widehat{P}_{i\succ j} - 1,\quad A_{ii}=0\]

\[x \in \arg\max_{x\in\Delta^{L-1}}\min_{y\in\Delta^{L-1}} x^\top A y\]

\[s_i = \sum_j \widehat{P}_{i\succ j} x_j\]

Notes

The cited papers establish empirical Nash equilibria and Nash-based population evaluation in symmetric zero-sum games. The score_type choices here are practical per-model summaries derived from that equilibrium for ranking API use.

References

Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W. M., P{‘e}rolat, J., Jaderberg, M., & Graepel, T. (2019). Open-ended Learning in Symmetric Zero-sum Games. ICML. Balduzzi, D., Tuyls, K., P{‘e}rolat, J., & Graepel, T. (2018). Re-evaluating Evaluation. NeurIPS.

Examples

>>> import numpy as np
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks, scores, eq = nash(R, return_scores=True, return_equilibrium=True)
>>> scores[0] > scores[1]
True

Pairwise Methods

Sequential pairwise rating methods.

This module adapts Elo, TrueSkill, and Glicko to binary response tensors by streaming induced pairwise outcomes.

Notation

Let \(R \in \{0,1\}^{L \times M \times N}\). For each event \((m,n)\) and each pair \((i,j)\), define an observed pair score \(S_{ij}^{(m,n)} \in \{0, 0.5, 1\}\) according to the configured tie policy.

Each method updates latent rating states \(\Theta_i\) sequentially:

\[\Theta_i^{(t+1)} = \mathcal{U}_i\left(\Theta^{(t)}, S^{(t)}; \psi\right), \qquad s_i = h(\Theta_i^{(T)}),\]

where \(s_i\) is the final score used for ranking.

elo(R, K=32.0, initial_rating=1500.0, tie_handling='correct_draw_only', method='competition', return_scores=False)[source]

Rank models with sequential Elo updates on induced pairwise matches.

Method context:: Each question-trial pair (m, n) induces head-to-head outcomes for every model pair. Ratings are updated online, so final ratings depend on stream order.

References

Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing. https://archive.org/details/ratingofchesspla0000eloa

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
K (float) – Positive Elo step size.
initial_rating (float) – Initial rating assigned to every model.
tie_handling (Literal['skip', 'draw', 'correct_draw_only']) – Policy for (1,1) and (0,0) outcomes: - "skip": ignore ties - "draw": treat both tie types as draws - "correct_draw_only": draw only for (1,1), skip (0,0)
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, ratings).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns final Elo ratings of shape (L,).

Notation:: r_i is model i rating, and S_{ij} is the observed score for model i against j in one induced match.
Formula:: \[E_{ij} = \frac{1}{1 + 10^{(r_j-r_i)/400}}\]

\[r_i \leftarrow r_i + K(S_{ij} - E_{ij}), \quad r_j \leftarrow r_j + K((1-S_{ij}) - (1-E_{ij}))\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks, ratings = rank.elo(R, return_scores=True)
>>> ranks.tolist()
[1, 2]
>>> float(ratings[0] > ratings[1])
1.0

>>> # tie handling can change updates
>>> rank.elo(np.array([[[1]], [[1]]]), tie_handling="skip").tolist()
[1, 1]

Notes

The implementation processes pairs in index order for each (trial, question) event.

trueskill(R, mu_initial=25.0, sigma_initial=25.0 / 3, beta=25.0 / 6, tau=25.0 / 300, method='competition', return_scores=False, tie_handling='skip', draw_margin=0.0)[source]

Rank models with a two-player TrueSkill update stream.

Method context:: Each model has latent skill belief \(\mathcal{N}(\mu_i, \sigma_i^2)\). Every decisive pairwise result from R applies a two-player TrueSkill update. Ties can be skipped or treated as draw events with an optional draw margin. Ranking uses final mu.

References

Herbrich, R., Minka, T., & Graepel, T. (2006). TrueSkill(TM): A Bayesian Skill Rating System. NeurIPS 19. https://proceedings.neurips.cc/paper_files/paper/2006/file/f44ee263952e65b3610b8ba51229d1f9-Paper.pdf

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
mu_initial (float) – Initial mean skill.
sigma_initial (float) – Initial skill standard deviation. Must be positive.
beta (float) – Performance-noise scale. Must be positive.
tau (float) – Per-round dynamics scale for sigma inflation. Must be nonnegative.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Ranking variant passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, mu).
tie_handling (Literal['skip', 'draw', 'correct_draw_only']) – Policy for tied outcomes (1,1) and (0,0): - "skip": ignore ties - "draw": treat both tie types as draws - "correct_draw_only": draw only for (1,1), skip (0,0)
draw_margin (float) – Nonnegative draw margin \(\epsilon\) in the performance-difference space. 0 recovers no-margin updates.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns final mu values (shape (L,)).

Notation:: For one match between models i and j: c = sqrt(2 beta^2 + sigma_i^2 + sigma_j^2), t = (mu_i - mu_j) / c, epsilon = draw_margin / c.
Formula:: \[v_{win}(t,\epsilon) = \frac{\phi(t-\epsilon)}{\Phi(t-\epsilon)},\quad w_{win}(t,\epsilon)=v_{win}(t,\epsilon)\left(v_{win}(t,\epsilon)+t-\epsilon\right)\]

\[\mu_i' = \mu_i + \frac{\sigma_i^2}{c} v_{win}(t,\epsilon),\quad \mu_j' = \mu_j - \frac{\sigma_j^2}{c} v_{win}(t,\epsilon)\]

\[\sigma_i'^2 = \sigma_i^2\left(1 - \frac{\sigma_i^2}{c^2} w_{win}(t,\epsilon)\right),\quad \sigma_j'^2 = \sigma_j^2\left(1 - \frac{\sigma_j^2}{c^2} w_{win}(t,\epsilon)\right)\]

\[v_{draw}(t,\epsilon) = \frac{\phi(-\epsilon-t)-\phi(\epsilon-t)}{\Phi(\epsilon-t)-\Phi(-\epsilon-t)}\]

\[w_{draw}(t,\epsilon) = v_{draw}(t,\epsilon)^2 + \frac{(\epsilon-t)\phi(\epsilon-t)-(-\epsilon-t)\phi(-\epsilon-t)}{\Phi(\epsilon-t)-\Phi(-\epsilon-t)}\]

\[\mu_i' = \mu_i + \frac{\sigma_i^2}{c} v_{draw}(t,\epsilon),\quad \mu_j' = \mu_j - \frac{\sigma_j^2}{c} v_{draw}(t,\epsilon)\]

\[\sigma_i'^2 = \sigma_i^2\left(1 - \frac{\sigma_i^2}{c^2} w_{draw}(t,\epsilon)\right),\quad \sigma_j'^2 = \sigma_j^2\left(1 - \frac{\sigma_j^2}{c^2} w_{draw}(t,\epsilon)\right)\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks, mu = rank.trueskill(R, return_scores=True)
>>> ranks.tolist()
[1, 2]
>>> float(mu[0] > mu[1])
1.0

Notes

This is a simplified two-player update stream, not full multiplayer factor-graph inference from the original TrueSkill paper.

glicko(R, initial_rating=1500.0, initial_rd=350.0, c=0.0, rd_max=350.0, tie_handling='correct_draw_only', return_deviation=False, method='competition', return_scores=False)[source]

Rank models with Glicko rating and rating-deviation updates.

Method context:: Glicko extends Elo by tracking both rating r and uncertainty RD. Each (question, trial) is treated as one rating period containing all pairwise induced matches for that slice.

References

Glickman, M. E. (1999). Parameter Estimation in Large Dynamic Paired Comparison Experiments. Journal of the Royal Statistical Society: Series C, 48(3), 377-394. https://doi.org/10.1111/1467-9876.00159

Glickman, M. E. (n.d.). The Glicko System. https://www.glicko.net/glicko/glicko.pdf

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
initial_rating (float) – Initial rating r for every model.
initial_rd (float) – Initial rating deviation RD for every model.
c (float) – Nonnegative RD inflation parameter between rating periods.
rd_max (float) – Positive hard cap for RD.
tie_handling (Literal['skip', 'draw', 'correct_draw_only']) – Policy for tied outcomes: - "skip": ignore (1,1) and (0,0) - "draw": treat both as draws - "correct_draw_only": draw for (1,1), skip (0,0)
return_deviation (bool) – If True, return RD in addition to ratings.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule for scorio.utils.rank_scores().
return_scores (bool) – If True, return ratings with ranking.

Returns:

ranking array of shape (L,).

If return_deviation=False and return_scores=True: (ranking, rating) where rating has shape (L,).

If return_deviation=True: (ranking, rating, rd) with both vectors shape (L,).

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, ndarray]

Notation:: q = ln(10)/400, g(RD) = 1 / sqrt(1 + 3 q^2 RD^2 / pi^2). For model i in one period with opponents j: S_ij is observed score and E_ij = 1 / (1 + 10^{-g(RD_j)(r_i-r_j)/400}).
Formula:: \[d_i^2 = \left(q^2 \sum_j g(RD_j)^2 E_{ij}(1-E_{ij})\right)^{-1}\]

\[RD_i' = \left(\frac{1}{RD_i^2} + \frac{1}{d_i^2}\right)^{-1/2}, \quad r_i' = r_i + \frac{q}{\frac{1}{RD_i^2}+\frac{1}{d_i^2}} \sum_j g(RD_j)(S_{ij}-E_{ij})\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks, ratings, rds = rank.glicko(R, return_deviation=True)
>>> ranks.tolist()
[1, 2]
>>> float(ratings[0] > ratings[1])
1.0

Notes

Models with no games in a rating period keep rating unchanged but can still receive RD inflation when c > 0.

Paired-Comparison Probabilistic Models

Paired-comparison (pairwise) probabilistic ranking models.

In Scorio’s binary tensor setting \(R \in \{0,1\}^{L \times M \times N}\), each model pair (i, j) generates decisive wins and ties:

\[W_{ij} = \sum_{m,n} \mathbf{1}\{R_{imn}=1,\ R_{jmn}=0\}, \qquad T_{ij} = \sum_{m,n} \mathbf{1}\{R_{imn}=R_{jmn}\}.\]

Paired-comparison models assume latent positive strengths \(\pi_i\) and map them to pairwise event probabilities. A common likelihood template is

\[\log p(W,T \mid \pi, \eta) = \sum_{i<j} \left[ W_{ij}\log p_{ij} + W_{ji}\log p_{ji} + T_{ij}\log p^{\text{tie}}_{ij} \right],\]

where \(\eta\) denotes tie-related parameter(s) when present.

Main families implemented here:

Bradley-Terry (BT) (no explicit ties):

\[\Pr(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}.\]
Davidson (1970) tie extension with \(\nu > 0\):

\[\Pr(i \sim j) = \frac{\nu\sqrt{\pi_i\pi_j}} {\pi_i + \pi_j + \nu\sqrt{\pi_i\pi_j}}.\]
Rao-Kupper (1967) tie extension with \(\kappa \ge 1\):

\[\Pr(i \succ j) = \frac{\pi_i}{\pi_i + \kappa\pi_j}, \quad \Pr(i \sim j) = \frac{(\kappa^2-1)\pi_i\pi_j} {(\pi_i+\kappa\pi_j)(\kappa\pi_i+\pi_j)}.\]

Each family has ML and MAP variants (MAP adds a prior penalty on the latent log-strengths \(\theta_i=\log\pi_i\) for regularization and numerical stability).

bradley_terry(R, method='competition', return_scores=False, max_iter=500)[source]

Rank models with Bradley-Terry maximum-likelihood strengths.

Method context:: Pairwise decisive outcomes are modeled with positive strengths \(\pi_i\). Tied outcomes are ignored for BT-ML.

References

Bradley, R. A., & Terry, M. E. (1952). Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika. https://doi.org/10.1093/biomet/39.3-4.324

Hunter, D. R. (2004). MM algorithms for generalized Bradley-Terry models. The Annals of Statistics. https://doi.org/10.1214/aos/1079120141

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns BT strengths pi (shape (L,)).

Notation:: W_ij is the number of decisive outcomes where model i beats model j.
Formula:: \[\Pr(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j}\]

\[\log p(W\mid\pi) = \sum_{i\ne j} W_{ij} \left[\log \pi_i - \log(\pi_i+\pi_j)\right]\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks, scores = rank.bradley_terry(R, return_scores=True)
>>> ranks.tolist()
[1, 2]
>>> float(scores[0] > scores[1])
1.0

Notes

If there are no decisive outcomes at all, all strengths are returned equal because relative ability is unidentifiable.

bradley_terry_map(R, prior=1.0, method='competition', return_scores=False, max_iter=500)[source]

Rank models with Bradley-Terry MAP estimation.

Method context:: This method adds a prior penalty on centered log-strengths theta_i = log(pi_i) to regularize BT estimation.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
prior (Prior | float) – Prior on log-strengths. If prior is a float, it is interpreted as Gaussian prior variance using scorio.rank.GaussianPrior with mean 0. If prior is a Prior instance, its penalty is used directly.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns MAP strengths pi (shape (L,)).

Notation:: theta_i = log(pi_i), and P(theta) denotes the prior.
Formula:: \[\hat{\theta} = \arg\min_{\theta} \left[ -\log p(W\mid\theta) + \operatorname{penalty}(\theta) \right]\]

\[\hat{\pi}_i = \exp(\hat{\theta}_i)\]

References

Caron, F., & Doucet, A. (2012). Efficient Bayesian inference for generalized Bradley-Terry models. Journal of Computational and Graphical Statistics. https://doi.org/10.1080/10618600.2012.638220

Examples

>>> import numpy as np
>>> from scorio import rank
>>> from scorio.rank import GaussianPrior
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks, scores = rank.bradley_terry_map(R, prior=1.0, return_scores=True)
>>> ranks.tolist()
[1, 2]

>>> prior = GaussianPrior(mean=0.0, var=0.5)
>>> rank.bradley_terry_map(R, prior=prior).shape
(2,)

Notes

With informative priors, MAP can remain identifiable in settings where BT-ML has weak or degenerate decisive information.

bradley_terry_davidson(R, method='competition', return_scores=False, max_iter=500)[source]

Rank models with the Bradley-Terry-Davidson tie model (ML).

Method context:: Davidson extends BT by introducing a tie parameter nu > 0. In this benchmark setting, ties correspond to (1,1) or (0,0) outcomes for a model pair on the same question-trial event.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns Davidson strengths pi (shape (L,)).

Notation:: W_ij are decisive win counts and T_ij tie counts.
Formula:: \[\Pr(i\succ j) = \frac{\pi_i}{\pi_i+\pi_j+\nu\sqrt{\pi_i\pi_j}}, \quad \Pr(i\sim j) = \frac{\nu\sqrt{\pi_i\pi_j}}{\pi_i+\pi_j+\nu\sqrt{\pi_i\pi_j}}\]

References

Davidson, R. R. (1970). On extending the Bradley-Terry model to accommodate ties in paired comparison experiments. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.1970.10481082

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 0]],
...     [[1, 0], [0, 0]],
... ])
>>> ranks, scores = rank.bradley_terry_davidson(R, return_scores=True)
>>> ranks.tolist()
[1, 2]

Notes

If there are ties but no decisive outcomes, strengths are set equal because relative ability is not identified from ties alone.

bradley_terry_davidson_map(R, prior=1.0, method='competition', return_scores=False, max_iter=500)[source]

Rank models with the Bradley-Terry-Davidson tie model (MAP).

Method context:: Adds a prior penalty on centered log-strengths on top of Davidson’s tie likelihood.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
prior (Prior | float) – Prior on log-strengths. - float: interpreted as Gaussian prior variance with mean 0. - Prior instance: custom prior penalty.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns Davidson-MAP strengths pi (shape (L,)).

Notation:: theta_i = log(pi_i) and nu is the Davidson tie parameter.
Formula:: \[\hat{\theta},\hat{\nu} = \arg\min_{\theta,\nu>0} \left[ -\log p(W,T\mid\theta,\nu) + \operatorname{penalty}(\theta) \right]\]

References

Davidson, R. R. (1970). On extending the Bradley-Terry model to accommodate ties in paired comparison experiments. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.1970.10481082

Caron, F., & Doucet, A. (2012). Efficient Bayesian inference for generalized Bradley-Terry models. https://doi.org/10.1080/10618600.2012.638220

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 0]],
...     [[1, 0], [0, 0]],
... ])
>>> ranks, scores = rank.bradley_terry_davidson_map(
...     R, prior=1.0, return_scores=True
... )
>>> ranks.tolist()
[1, 2]

rao_kupper(R, tie_strength=1.1, method='competition', return_scores=False, max_iter=500)[source]

Rank models with the Rao-Kupper tie model (ML).

Method context:: Rao-Kupper augments paired comparison with a threshold parameter kappa >= 1 controlling tie prevalence. In this API, kappa is a fixed hyperparameter (tie_strength), not estimated.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
tie_strength (float) – Rao-Kupper parameter kappa >= 1. kappa=1 reduces to no-tie BT.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns Rao-Kupper strengths pi (shape (L,)).

Notation:: W_ij are decisive win counts and T_ij tie counts.
Formula:: \[\Pr(i\succ j) = \frac{\pi_i}{\pi_i + \kappa\pi_j}, \quad \Pr(j\succ i) = \frac{\pi_j}{\kappa\pi_i + \pi_j}\]

\[\Pr(i\sim j)= \frac{(\kappa^2-1)\pi_i\pi_j} {(\pi_i+\kappa\pi_j)(\kappa\pi_i+\pi_j)}\]

References

Rao, P. V., & Kupper, L. L. (1967). Ties in paired-comparison experiments: A generalization of the Bradley-Terry model. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.1967.10482901

Hunter, D. R. (2004). MM algorithms for generalized Bradley-Terry models. The Annals of Statistics. https://doi.org/10.1214/aos/1079120141

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 0]],
...     [[1, 0], [0, 0]],
... ])
>>> ranks, scores = rank.rao_kupper(R, tie_strength=1.1, return_scores=True)
>>> ranks.tolist()
[1, 2]

Notes

If tie_strength=1 and tie counts are present, the model is inconsistent and raises ValueError.

rao_kupper_map(R, tie_strength=1.1, prior=1.0, method='competition', return_scores=False, max_iter=500)[source]

Rank models with the Rao-Kupper tie model (MAP).

Method context:: Adds a prior penalty on centered log-strengths to Rao-Kupper likelihood while keeping kappa fixed.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
tie_strength (float) – Rao-Kupper parameter kappa >= 1.
prior (Prior | float) – Prior on log-strengths. - float: interpreted as Gaussian prior variance with mean 0. - Prior instance: custom prior penalty.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns Rao-Kupper MAP strengths pi (shape (L,)).

Notation:: theta_i = log(pi_i) and kappa is fixed.
Formula:: \[\hat{\theta} = \arg\min_{\theta} \left[ -\log p(W,T\mid\theta,\kappa) + \operatorname{penalty}(\theta) \right]\]

References

Rao, P. V., & Kupper, L. L. (1967). Ties in paired-comparison experiments: A generalization of the Bradley-Terry model. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.1967.10482901

Caron, F., & Doucet, A. (2012). Efficient Bayesian inference for generalized Bradley-Terry models. https://doi.org/10.1080/10618600.2012.638220

Bayesian Ranking Methods

Bayesian ranking methods.

These methods rank models using posterior summaries rather than point estimates alone.

Notation

Let \(R \in \{0,1\}^{L \times M \times N}\). Each method introduces latent model parameters \(\theta_l\), computes a posterior \(p(\theta \mid R)\), and ranks with posterior scores \(s_l\).

The generic form is

\[s_l = \mathbb{E}\!\left[g(\theta_l) \mid R\right] \quad\text{or}\quad s_l = \mathbb{Q}_q\!\left(g(\theta_l) \mid R\right),\]

where \(\mathbb{Q}_q\) denotes a posterior quantile.

thompson(R, n_samples=10000, prior_alpha=1.0, prior_beta=1.0, seed=42, method='competition', return_scores=False)[source]

Rank models by Thompson-sampling posterior expected rank.

Method context:: This method assumes each model has one latent Bernoulli success probability over all M*N outcomes and uses the conjugate Beta-Binomial posterior. Ranking score is the negative Monte Carlo estimate of posterior expected rank.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
n_samples (int) – Positive number of posterior Monte Carlo samples T.
prior_alpha (float) – Positive Beta prior alpha.
prior_beta (float) – Positive Beta prior beta.
seed (int) – Random seed for reproducibility.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns scores = - E[r_l | R] approximations (shape (L,)), where higher is better.

Notation:: S_l = sum_{m,n} R[l,m,n] and T_tot = M*N. r_l^(t) is model l rank in posterior draw t.
Formula:: \[p_l \mid R \sim \mathrm{Beta}(\alpha + S_l,\ \beta + T_{\mathrm{tot}} - S_l)\]

\[s_l^{\mathrm{TS}} = -\frac{1}{T}\sum_{t=1}^{T} r_l^{(t)}\]

References

Thompson, W. R. (1933). On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika. https://doi.org/10.1093/biomet/25.3-4.285

Russo, D. J., et al. (2018). A Tutorial on Thompson Sampling. Foundations and Trends in Machine Learning. https://doi.org/10.1561/2200000070

Gelman, A., et al. (2013). Bayesian Data Analysis (3rd ed.). https://doi.org/10.1201/b16018

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks, scores = rank.thompson(R, n_samples=2000, return_scores=True)
>>> ranks.tolist()
[1, 2]

Notes

If all models have identical posterior Beta parameters, the exact expected ranks are equal and the implementation returns equal scores deterministically (instead of Monte Carlo tie-breaking noise).

bayesian_mcmc(R, n_samples=5000, burnin=1000, prior_var=1.0, seed=42, method='competition', return_scores=False)[source]

Rank models via Bayesian Bradley-Terry posterior means from MCMC.

Method context:: This method uses decisive pairwise counts with Bradley-Terry likelihood and independent Gaussian priors on log-strengths. Posterior inference is approximated by random-walk Metropolis-Hastings.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
n_samples (int) – Positive number of retained MCMC samples.
burnin (int) – Nonnegative number of warmup iterations.
prior_var (float) – Positive Gaussian prior variance on theta.
seed (int) – Random seed for reproducibility.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns posterior mean E[theta|W] estimates (shape (L,)).

Notation:: W_ij is decisive wins where model i beats j. theta_i are log-strengths with pi_i = exp(theta_i).
Formula:: \[\Pr(i\succ j\mid\theta)= \frac{\exp(\theta_i)}{\exp(\theta_i)+\exp(\theta_j)},\quad \theta_i\sim\mathcal{N}(0,\sigma^2)\]

\[s_i^{\mathrm{MCMC}} = \mathbb{E}[\theta_i\mid W]\]

References

Bradley, R. A., & Terry, M. E. (1952). Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika. https://doi.org/10.1093/biomet/39.3-4.324

Metropolis, N., et al. (1953). Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics. https://doi.org/10.1063/1.1699114

Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika. https://doi.org/10.1093/biomet/57.1.97

Caron, F., & Doucet, A. (2012). Efficient Bayesian inference for generalized Bradley-Terry models. https://doi.org/10.1080/10618600.2012.638220

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks, scores = rank.bayesian_mcmc(
...     R, n_samples=2000, burnin=500, return_scores=True
... )
>>> ranks.tolist()
[1, 2]

Notes

If there are no decisive outcomes, posterior means are exactly equal under the symmetric Gaussian prior, and zeros are returned directly.

Item Response Theory Methods

Item Response Theory (IRT) ranking methods.

This module estimates latent model abilities and question parameters under binary IRT families.

Notation

Let \(R \in \{0,1\}^{L \times M \times N}\) and \(k_{lm}=\sum_{n=1}^{N} R_{lmn}\). Model abilities are \(\theta_l\); item parameters include difficulty \(b_m\), discrimination \(a_m\), and optional pseudo-guessing \(c_m\).

A general binary IRT response model is

\[P(R_{lmn}=1 \mid \theta_l, a_m, b_m, c_m) = c_m + (1-c_m)\sigma\left(a_m(\theta_l-b_m)\right).\]

Special cases:

1PL (Rasch): \(a_m=1\), \(c_m=0\).
2PL: \(c_m=0\), free \(a_m\) and \(b_m\).
3PL: free \(a_m\), \(b_m\), and \(c_m\).

Rankings are induced by ability scores \(s_l\), typically \(s_l=\hat\theta_l\) or a posterior summary of \(\theta_l\).

The module includes maximum-likelihood and joint maximum-likelihood estimators, MAP variants with configurable priors, and MML-EAP estimators.

rasch(R, method='competition', return_scores=False, max_iter=500, return_item_params=False)[source]

Rank models with Rasch (1PL) IRT via joint MLE.

Method context:: Each model l has latent ability theta_l and each question m has difficulty b_m. We estimate both by maximizing the binomial likelihood over per-question correct counts.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.
return_item_params (bool) – If True, also returns estimated item parameters (difficulty). Implies returning scores.

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, dict[str, ndarray]]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns ability scores theta (shape (L,)). If return_item_params=True, also returns {"difficulty": b} (shape (M,)).

Notation:: k_{lm} = sum_n R_{lmn} is the correct-count for model l and question m.
Formula:: \[k_{lm} \sim \mathrm{Binomial}\left(N,\sigma(\theta_l-b_m)\right)\]

\[b \leftarrow b - \frac{1}{M}\sum_m b_m\]

References

Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks, scores = rank.rasch(R, return_scores=True)
>>> ranks.tolist()
[1, 2]

rasch_map(R, prior=1.0, method='competition', return_scores=False, max_iter=500, return_item_params=False)[source]

Rank models with Rasch (1PL) IRT via MAP estimation.

Method context:: Same likelihood as rasch(), with an additional prior penalty on abilities theta for shrinkage and numerical stability.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
prior (Prior | float) – Ability prior. A float is interpreted as Gaussian prior variance; otherwise must be a Prior instance.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.
return_item_params (bool) – If True, also returns estimated item parameters. Implies returning scores.

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, dict[str, ndarray]]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns MAP ability scores theta. If return_item_params=True, also returns {"difficulty": b}.

Formula:: \[\hat\theta,\hat b = \arg\min_{\theta,b} \left[ -\sum_{l,m}\log p(k_{lm}\mid\theta_l,b_m) + \mathrm{penalty}(\theta) \right]\]

References

Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> rank.rasch_map(R, prior=1.0).tolist()
[1, 2]

rasch_mml(R, method='competition', return_scores=False, max_iter=100, em_iter=20, n_quadrature=21, return_item_params=False)[source]

Rank models with Rasch MML (EM + quadrature) and EAP scoring.

Method context:: Integrates out abilities under a population prior (standard normal), estimates item difficulties by EM, then computes expected-a-posteriori (EAP) model abilities.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive max optimizer iterations in each M-step item update.
em_iter (int) – Positive number of EM iterations.
n_quadrature (int) – Number of Gauss-Hermite nodes (integer >=2).
return_item_params (bool) – If True, also returns estimated item parameters. Implies returning scores.

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, dict[str, ndarray]]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns EAP ability scores. If return_item_params=True, also returns {"difficulty": b, "ability_sd": sd(theta|R)}.

Formula:: \[\hat\theta_l^{\mathrm{EAP}} = \sum_q w_{lq}\,\theta_q, \quad w_{lq} \propto p(k_l\mid\theta_q,b)\,w_q\]

References

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443-459.

Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51(2), 177-195.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> rank.rasch_mml(R).tolist()
[1, 2]

rasch_mml_credible(R, quantile=0.05, method='competition', return_scores=False, max_iter=100, em_iter=20, n_quadrature=21)[source]

Rank models by a posterior quantile under Rasch MML.

Method context:: Uses the discrete posterior from rasch_mml() and ranks by posterior quantile Q_q(theta_l | R). Lower quantiles provide conservative, uncertainty-aware ordering.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
quantile (float) – Posterior quantile q in (0, 1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive max optimizer iterations in each M-step item update.
em_iter (int) – Positive number of EM iterations.
n_quadrature (int) – Number of Gauss-Hermite nodes (integer >=2).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns posterior-quantile scores.

Formula:: \[s_l = Q_q(\theta_l\mid R)\]

rasch_2pl(R, method='competition', return_scores=False, max_iter=500, return_item_params=False, reg_discrimination=0.01)[source]

Rank models with 2PL IRT via joint (optionally regularized) JMLE.

Method context:: Extends Rasch with item discrimination a_m > 0, so items can differ in how strongly they separate abilities. By default, a small L2 penalty is applied on log(a) for numerical stability.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.
return_item_params (bool) – If True, also returns estimated item parameters (difficulty and discrimination). Implies returning scores.
reg_discrimination (float) – Non-negative L2 penalty weight on log(a). Set to 0.0 for pure (unpenalized) JMLE.

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, dict[str, ndarray]]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns ability scores theta. If return_item_params=True, also returns {"difficulty": b, "discrimination": a}.

Formula:: \[k_{lm} \sim \mathrm{Binomial} \left(N,\sigma\left(a_m(\theta_l-b_m)\right)\right)\]

References

Birnbaum, A. (1968). Some Latent Trait Models and Their Use in Inferring an Examinee’s Ability. In Statistical Theories of Mental Test Scores.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> rank.rasch_2pl(R).tolist()
[1, 2]

rasch_2pl_map(R, prior=1.0, method='competition', return_scores=False, max_iter=500, return_item_params=False, reg_discrimination=0.01)[source]

Rank models with 2PL IRT via MAP estimation.

Method context:: Same 2PL likelihood as rasch_2pl(), with a prior penalty on model abilities theta and an optional L2 penalty on log(a).

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
prior (Prior | float) – Ability prior. A float is interpreted as Gaussian prior variance; otherwise must be a Prior instance.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.
return_item_params (bool) – If True, also returns estimated item parameters. Implies returning scores.
reg_discrimination (float) – Non-negative L2 penalty weight on log(a). Set to 0.0 to remove item-discrimination regularization.

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, dict[str, ndarray]]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns MAP ability scores theta. If return_item_params=True, also returns {"difficulty": b, "discrimination": a}.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> rank.rasch_2pl_map(R, prior=1.0).tolist()
[1, 2]

rasch_3pl(R, method='competition', return_scores=False, max_iter=500, fix_guessing=None, return_item_params=False, reg_discrimination=0.01, reg_guessing=0.1, guessing_upper=0.5)[source]

Rank models with 3PL IRT via joint (optionally regularized) JMLE.

Method context:: Extends 2PL with item-specific pseudo-guessing c_m. Estimated guessing is constrained to [0, guessing_upper]; optionally a fixed value can be used. By default, small L2 penalties are applied on log(a) and guessing logits for numerical stability.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.
fix_guessing (float | None) – If provided, fixes the guessing parameter to this value for all questions; must lie in [0, guessing_upper].
return_item_params (bool) – If True, also returns estimated item parameters. Implies returning scores.
reg_discrimination (float) – Non-negative L2 penalty weight on log(a). Set to 0.0 for pure (unpenalized) JMLE.
reg_guessing (float) – Non-negative L2 penalty weight on guessing logits. Set to 0.0 for pure (unpenalized) JMLE.
guessing_upper (float) – Upper bound for item guessing c_m. Must be in (0, 1). Default 0.5 is suitable for binary outcomes.

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, dict[str, ndarray]]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns ability scores theta. If return_item_params=True, also returns {"difficulty": b, "discrimination": a, "guessing": c}.

Formula:: \[p_{lm} = c_m + (1-c_m)\sigma\left(a_m(\theta_l-b_m)\right)\]

References

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Routledge.

Birnbaum, A. (1968). Some Latent Trait Models and Their Use in Inferring an Examinee’s Ability. In Statistical Theories of Mental Test Scores.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> rank.rasch_3pl(R, fix_guessing=0.25).tolist()
[1, 2]

rasch_3pl_map(R, prior=1.0, method='competition', return_scores=False, max_iter=500, fix_guessing=None, return_item_params=False, reg_discrimination=0.01, reg_guessing=0.1, guessing_upper=0.5)[source]

Rank models with 3PL IRT via MAP estimation.

Method context:: Same 3PL likelihood as rasch_3pl(), with prior regularization on model abilities theta and optional L2 regularization on item parameters.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
prior (Prior | float) – Ability prior. A float is interpreted as Gaussian prior variance; otherwise must be a Prior instance.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.
fix_guessing (float | None) – Optional fixed guessing parameter in [0, guessing_upper].
return_item_params (bool) – If True, also return item parameters.
reg_discrimination (float) – Non-negative L2 penalty weight on log(a).
reg_guessing (float) – Non-negative L2 penalty weight on guessing logits.
guessing_upper (float) – Upper bound for item guessing c_m in (0, 1). Default is 0.5 for binary outcomes.

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, dict[str, ndarray]]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns MAP ability scores theta. If return_item_params=True, also returns {"difficulty": b, "discrimination": a, "guessing": c}.

dynamic_irt(R, variant='linear', method='competition', return_scores=False, max_iter=500, return_item_params=False, time_points=None, score_target='final', slope_reg=0.01, state_reg=1.0, assume_time_axis=False)[source]

Rank models with dynamic (longitudinal) IRT variants.

Method context:

variant="linear" is a static Rasch baseline over aggregated counts. variant="growth" fits a longitudinal logistic growth model with per-model baseline theta0_l and slope theta1_l:

\[\theta_{ln}=\theta_{0,l}+\theta_{1,l}t_n\]

variant="state_space" fits a dynamic Rasch trajectory \(\theta_{ln}\) with random-walk smoothness regularization:

\[P(R_{lmn}=1)=\sigma\left(\theta_{ln}-b_m\right)\]

\[\mathrm{penalty}=\lambda\sum_{l,n>0} \frac{\left(\theta_{ln}-\theta_{l,n-1}\right)^2}{t_n-t_{n-1}}\]

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
variant (Literal['linear', 'growth', 'state_space']) – "linear", "growth", or "state_space".
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum number of L-BFGS iterations.
return_item_params (bool) – If True, also returns estimated item parameters. Implies returning scores.
time_points (ndarray | None) – Optional ordered measurement times of length N. If None, uses equally spaced times in [0, 1]. Used only for longitudinal variants.
score_target (Literal['initial', 'final', 'mean', 'gain', 'baseline', 'start', 'end', 'average', 'delta', 'trend']) – Longitudinal score extracted from ability paths for ranking in growth and state-space variants. One of {"initial", "final", "mean", "gain"}.
slope_reg (float) – Non-negative L2 regularization weight on growth slopes. Used only for variant="growth".
state_reg (float) – Non-negative random-walk smoothness penalty in variant="state_space".
assume_time_axis (bool) – Safety switch for longitudinal variants. Set True to acknowledge that axis-2 of R is ordered time, not i.i.d. sampling trials.

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, dict[str, ndarray]]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns scores: theta for linear and dynamic target scores for growth/state_space. If return_item_params=True, also returns {"difficulty": b} (linear), or for longitudinal variants: {"difficulty": b, "ability_path": theta_path, ...}.

Formula:: \[P(R_{lmn}=1) = \sigma\left(\theta_{0,l} + \theta_{1,l} t_n - b_m\right)\]

References

Verhelst, N. D., & Glas, C. A. (1993). A dynamic generalization of the Rasch model. Psychometrika.

Wang, C., & Nydick, S. W. (2020). On Longitudinal Item Response Theory Models: A Didactic. Journal of Educational and Behavioral Statistics.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1, 1], [1, 1, 1]],
...     [[0, 0, 0], [0, 0, 0]],
... ])
>>> rank.dynamic_irt(R, variant="linear").tolist()
[1, 2]

mirt(R, n_factors=2, model='2pl', method='competition', return_scores=False, max_iter=50, em_iter=100, n_quadrature=15, fix_guessing=None, reg_discrimination=0.01, reg_guessing=0.1, guessing_upper=0.5, tol=1e-4, return_item_params=False)[source]

Rank models with compensatory multidimensional IRT (MIRT) via MML-EM.

Method context:

Each model l has a D-dimensional latent ability vector theta_l (D = n_factors) and each question m has a slope (discrimination) vector a_m and intercept d_m. The compensatory dichotomous model is

\[P(R_{lmn}=1\mid\theta_l) = c_m + (1-c_m)\,\sigma\!\left(a_m^{\top}\theta_l + d_m\right),\]

with c_m=0 for model="2pl" and an item pseudo-guessing lower asymptote for model="3pl". Item slopes and intercepts are estimated by marginal maximum likelihood with a Bock-Aitkin EM algorithm, integrating abilities over a standard multivariate-normal population prior on a product Gauss-Hermite quadrature grid. Per-model abilities are then summarized by their expected a posteriori (EAP) values.

Ranking score:

Multidimensional abilities are collapsed to a scalar via the rotation-invariant reference composite - the projection of each ability vector onto the mean item-slope direction a_bar = (1/M) sum_m a_m:

\[s_l = a_{\mathrm{bar}}^{\top}\theta_l.\]

Because the compensatory model is invariant to an orthogonal rotation theta \mapsto Q\theta, a \mapsto Qa, this composite is well-defined without fixing an (otherwise arbitrary) factor rotation. The full per-dimension abilities are available via return_item_params.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
n_factors (int) – Number of latent ability dimensions D (>= 1).
model (Literal['2pl', '3pl']) – "2pl" (no guessing) or "3pl" (item pseudo-guessing).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive maximum L-BFGS iterations per EM M-step.
em_iter (int) – Positive maximum number of EM iterations.
n_quadrature (int) – Gauss-Hermite nodes per dimension (integer >= 2). The product grid has n_quadrature ** n_factors nodes, so keep n_factors small.
fix_guessing (float | None) – Only valid for model="3pl". If provided, fixes the guessing parameter to this value for all questions; must lie in [0, guessing_upper]. Otherwise guessing is estimated.
reg_discrimination (float) – Non-negative L2 (ridge) penalty on slope vectors.
reg_guessing (float) – Non-negative L2 penalty on guessing logits (3PL only).
guessing_upper (float) – Upper bound for item guessing c_m in (0, 1).
tol (float) – Non-negative convergence tolerance on the maximum item-parameter change between EM iterations. Set 0.0 to always run em_iter iterations.
return_item_params (bool) – If True, also return item/ability parameters. Implies returning scores.

Return type:

ndarray | tuple[ndarray, ndarray] | tuple[ndarray, ndarray, dict[str, ndarray]]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns reference-composite ability scores (shape (L,)). If return_item_params=True, also returns a dict with "difficulty" (multidimensional difficulty MDIFF, shape (M,)), "discrimination" (multidimensional discrimination MDISC, shape (M,)), "slopes" (a, shape (M, D)), "intercept" (d, shape (M,)), "abilities" (EAP theta, shape (L, D)), "ability_sd" (posterior SD, shape (L, D)), and, for model="3pl", "guessing" (c, shape (M,)).

Notation:: MDISC_m = ||a_m||_2 and MDIFF_m = -d_m / MDISC_m are the multidimensional discrimination and difficulty of item m.

References

Chalmers, R. P. (2012). mirt: A Multidimensional Item Response Theory Package for the R Environment. Journal of Statistical Software, 48(6), 1-29.

Reckase, M. D. (2009). Multidimensional Item Response Theory. Springer.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443-459.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> rank.mirt(R, n_factors=1, n_quadrature=7).tolist()
[1, 2]

Voting Methods

Voting-based ranking methods.

This module adapts social-choice rules to model ranking from response tensors.

Notation

Let \(R \in \{0,1\}^{L \times M \times N}\) and define question-level grades

\[k_{lm} = \sum_{n=1}^{N} R_{lmn} \in \{0,1,\ldots,N\}.\]

Each question \(m\) is treated as one voter over models based on \(k_{:m}\). Voting rules produce scores \(s_l\) from these per-question preferences, then ranks are derived from \(s\).

When \(N=1\), each question induces a two-level ordering (correct versus incorrect), so many voting rules reduce to simple majority-style behavior.

borda(R, method='competition', return_scores=False)[source]

Rank models with Borda count on per-question rankings.

Method context:: Each question acts as a voter that ranks models by per-question correct count k[l, m]. Borda assigns positional points per question and sums them across questions; ties use averaged ranks.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns Borda scores (shape (L,)).

Notation:: k_{lm} = sum_{n=1}^N R_{lmn} and r_{lm} is model l’s tie-averaged descending rank among k_{:,m} (rank 1 is best).
Formula:: \[s_l^{\mathrm{Borda}} = \sum_{m=1}^{M} (L - r_{lm})\]

References

de Borda, J.-C. (1781/1784). Mémoire sur les élections au scrutin. In Histoire de l’Académie Royale des Sciences.

Brandt, F., Conitzer, V., Endriss, U., Lang, J., & Procaccia, A. D. (2016). Handbook of Computational Social Choice. Cambridge University Press.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1, 1], [1, 0, 0]],  # k: [3, 1]
...     [[1, 1, 0], [0, 1, 0]],  # k: [2, 1]
...     [[0, 0, 0], [1, 1, 1]],  # k: [0, 3]
... ])
>>> ranks, scores = rank.borda(R, return_scores=True)
>>> ranks.tolist()
[1, 3, 2]
>>> scores.round(2).tolist()
[2.5, 1.5, 2.0]

Notes

Adding a constant to every model’s score does not change the ranking.

copeland(R, method='competition', return_scores=False)[source]

Rank models with Copeland pairwise-majority scores.

Method context:: For each model pair (i, j), count how many questions prefer i over j by comparing per-question correct counts. A strict majority contributes +1 to the winner and -1 to the loser; tied pairwise contests contribute 0.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns Copeland scores (shape (L,)).

Notation:: k_{im} = sum_{n=1}^N R_{imn} and W^{(q)}_{ij} = sum_{m=1}^{M} 1[k_{im} > k_{jm}].
Formula:: \[s_i^{\mathrm{Copeland}} = \sum_{j\neq i} \operatorname{sign}\!\left( W^{(q)}_{ij} - W^{(q)}_{ji} \right)\]

References

Copeland, A. H. (1951). A Reasonable Social Welfare Function. Seminar on Applications of Mathematics to the Social Sciences, University of Michigan.

Brandt, F., Conitzer, V., Endriss, U., Lang, J., & Procaccia, A. D. (2016). Handbook of Computational Social Choice. Cambridge University Press.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [0, 0], [1, 0]],  # k: [2, 0, 1]
...     [[1, 0], [1, 1], [0, 0]],  # k: [1, 2, 0]
...     [[0, 0], [1, 0], [1, 1]],  # k: [0, 1, 2]
... ])
>>> ranks, scores = rank.copeland(R, return_scores=True)
>>> ranks.tolist()
[1, 1, 1]
>>> scores.tolist()
[0.0, 0.0, 0.0]

Notes

This implementation compares pairwise question-level majorities and does not use magnitude of per-question margins beyond the sign.

win_rate(R, method='competition', return_scores=False)[source]

Rank models by pairwise question-level win rate.

Method context:: For each model pair, count on how many questions model i has a higher per-question correct count than model j. A model’s score is the fraction of decisive pairwise outcomes it wins.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns win-rate scores in [0, 1] (shape (L,)).

Notation:: k_{im} = sum_{n=1}^N R_{imn} and W^{(q)}_{ij} = sum_{m=1}^{M} 1[k_{im} > k_{jm}].
Formula:: \[s_i^{\mathrm{winrate}} = \frac{\sum_{j\neq i} W^{(q)}_{ij}} {\sum_{j\neq i} \left(W^{(q)}_{ij} + W^{(q)}_{ji}\right)}\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks, scores = rank.win_rate(R, return_scores=True)
>>> ranks.tolist()
[1, 2]
>>> scores.round(2).tolist()
[1.0, 0.0]

Notes

If a model has no decisive pairwise outcomes against any opponent, its score is set to 0.5.

minimax(R, variant='margin', tie_policy='half', method='competition', return_scores=False)[source]

Rank models with the Minimax (Simpson-Kramer) Condorcet rule.

Method context:: Build pairwise question-level preference counts P[i, j] from per-question correct counts. Each model is scored by its worst pairwise defeat; smaller worst defeat is better.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
variant (str) – Defeat-strength definition: - "margin": use margin of defeat - "winning_votes": use opponent’s winning-vote count
tie_policy (str) – How per-question ties contribute to pairwise counts: "ignore" or "half".
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns minimax scores (shape (L,)), where values are non-positive and larger is better.

Notation:: P_{ij} is pairwise preference count and Delta_{ij} = P_{ij} - P_{ji}.
Formula:: \[s_i^{\mathrm{minimax}} = -\max_{j\neq i} \max(0, \Delta_{ji})\]

\[s_i^{\mathrm{wv}} = -\max_{j : P_{ji} > P_{ij}} P_{ji}\]

References

Brandt, F., Conitzer, V., Endriss, U., Lang, J., & Procaccia, A. D. (2016). Handbook of Computational Social Choice. Cambridge University Press.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1], [1, 1]],
...     [[1, 0], [1, 0], [1, 0]],
...     [[0, 0], [0, 0], [0, 0]],
... ])
>>> ranks, scores = rank.minimax(R, return_scores=True)
>>> ranks.tolist()
[1, 2, 2]
>>> scores.tolist()
[-0.0, -3.0, -3.0]

schulze(R, tie_policy='half', method='competition', return_scores=False)[source]

Rank models with the Schulze beatpath Condorcet method.

Method context:: Build pairwise preference counts P from per-question outcomes. Initialize direct victories, then run a strongest-path closure: path strength is the maximum bottleneck strength across all directed paths. Model i is preferred to j when p[i, j] > p[j, i].

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
tie_policy (str) – How per-question ties contribute to pairwise counts: "ignore" or "half".
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns level scores derived from the beatpath dominance relation (shape (L,)).

Notation:: P_{ij} is pairwise preference count and p_{ij} is strongest path strength from i to j.
Formula:: \[\begin{split}p_{ij} = \begin{cases} P_{ij}, & P_{ij} > P_{ji} \\ 0, & \text{otherwise} \end{cases}\end{split}\]

\[p_{jk} \leftarrow \max\bigl(p_{jk}, \min(p_{ji}, p_{ik})\bigr)\]

References

Schulze, M. (2010). A new monotonic, clone-independent, reversal symmetric, and Condorcet-consistent single-winner election method. Social Choice and Welfare.

Brandt, F., Conitzer, V., Endriss, U., Lang, J., & Procaccia, A. D. (2016). Handbook of Computational Social Choice. Cambridge University Press.

ranked_pairs(R, strength='margin', tie_policy='half', method='competition', return_scores=False)[source]

Rank models with the Ranked Pairs (Tideman) Condorcet method.

Method context:: Compute directed pairwise victories and sort them by strength. Lock victories in descending order, skipping any edge that would create a directed cycle. The locked acyclic graph induces the final ranking.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
strength (str) – Primary edge-strength key: - "margin": absolute pairwise margin |P_ij - P_ji| - "winning_votes": winning-vote count P_ij
tie_policy (str) – How per-question ties contribute to pairwise counts: "ignore" or "half".
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns topological level scores from the locked graph (shape (L,)).

Notation:: P_{ij} is pairwise preference count and Delta_{ij} = P_{ij} - P_{ji}.
Formula:: Lock directed wins i -> j in nonincreasing order of selected strength, accepting a lock only if it keeps the locked graph acyclic.

References

Tideman, T. N. (1987). Independence of clones as a criterion for voting rules. Social Choice and Welfare.

Brandt, F., Conitzer, V., Endriss, U., Lang, J., & Procaccia, A. D. (2016). Handbook of Computational Social Choice. Cambridge University Press.

kemeny_young(R, tie_policy='half', method='competition', return_scores=False, time_limit=None, tie_aware=True)[source]

Rank models with Kemeny-Young rank aggregation via exact MILP.

Method context:: Treat each question as a voter and aggregate pairwise preferences into P. Kemeny-Young selects a total order maximizing agreement with pairwise counts.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
tie_policy (str) – How per-question ties contribute to pairwise counts: "ignore" or "half".
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).
time_limit (float | None) – Optional positive MILP time limit in seconds.
tie_aware (bool) – If True (default), derive a tie-aware preorder by checking which pairwise orders are forced across optimal Kemeny solutions. If False, return the single optimal order selected by the MILP solver.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns Kemeny scores (shape (L,)). For tie_aware=True, these are level scores from the forced-order DAG over all optimal Kemeny solutions. For tie_aware=False, scores are the number of opponents ranked below each model in the selected optimal order.

Notation:: P_{ij} is the aggregated pairwise preference count. y_{ij} in {0,1} indicates whether model i is ranked above model j.
Formula:: \[\max_{y} \sum_{i \ne j} P_{ij} y_{ij}\]

\[y_{ij} + y_{ji} = 1, \quad y_{ij} + y_{jk} + y_{ki} \le 2 \;\; (\forall i,j,k \text{ distinct})\]

References

Kemeny, J. G. (1959). Mathematics without Numbers. Daedalus.

Young, H. P. (1977). Extending Condorcet’s rule. Journal of Economic Theory.

Brandt, F., Conitzer, V., Endriss, U., Lang, J., & Procaccia, A. D. (2016). Handbook of Computational Social Choice. Cambridge University Press.

Notes

Exact Kemeny optimization is NP-hard in general; this implementation uses scipy.optimize.milp on the induced linear-ordering ILP. Tie-aware mode solves additional MILPs (up to one per model pair).

nanson(R, rank_ties='average', method='competition', return_scores=False)[source]

Rank models with Nanson’s Borda-elimination rule.

Method context:: Recompute Borda scores among currently active models each round, then eliminate all models scoring strictly below the round mean.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
rank_ties (str) – Tie rule passed to scipy.stats.rankdata() when computing per-question Borda ranks among active models.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns survival-round scores (shape (L,)), where larger means eliminated later.

Notation:: At round t, active set is A_t and Borda score in that round is s_i^{(t)} for i in A_t.
Formula:: \[E_t = \{ i \in A_t : s_i^{(t)} < \overline{s}^{(t)} \} ,\quad A_{t+1} = A_t \setminus E_t\]

References

Brandt, F., Conitzer, V., Endriss, U., Lang, J., & Procaccia, A. D. (2016). Handbook of Computational Social Choice. Cambridge University Press.

Notes

If no active model is strictly below the mean, elimination stops and all remaining models tie.

baldwin(R, rank_ties='average', method='competition', return_scores=False)[source]

Rank models with Baldwin’s iterative Borda-elimination rule.

Method context:: Recompute Borda scores among active models each round and eliminate the model(s) with the minimum score. This implementation removes all models tied at the minimum in a round.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
rank_ties (str) – Tie rule passed to scipy.stats.rankdata() when computing per-question Borda ranks among active models.
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns survival-round scores (shape (L,)), where larger means eliminated later.

Notation:: At round t, active set is A_t and Borda score is s_i^{(t)} for i in A_t.
Formula:: \[E_t = \arg\min_{i \in A_t} s_i^{(t)} ,\quad A_{t+1} = A_t \setminus E_t\]

References

Brandt, F., Conitzer, V., Endriss, U., Lang, J., & Procaccia, A. D. (2016). Handbook of Computational Social Choice. Cambridge University Press.

Notes

If all active models tie at the minimum in a round, elimination stops and all remaining models tie.

majority_judgment(R, method='competition', return_scores=False)[source]

Rank models with Majority Judgment on per-question trial counts.

Method context:: Each model receives per-question integer grades k_{lm} in {0, ..., N}. Models are first compared by median grade. Ties are broken with majority-gauge style iterative median removal.

Parameters:

R (ndarray) – Binary outcome tensor with shape (L, M, N) or matrix (L, M) (treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Tie-handling rule passed to scorio.utils.rank_scores().
return_scores (bool) – If True, return (ranking, scores).

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, also returns level scores induced by the MJ comparison preorder (shape (L,)).

Notation:: k_{lm} = sum_{n=1}^N R_{lmn} is the grade assigned by question m to model l.
Formula:: Rank by median grade, then for tied models iteratively remove one occurrence of the current median grade from each tied model and repeat median comparison.

References

Balinski, M., & Laraki, R. (2011). Majority Judgment: Measuring, Ranking, and Electing. MIT Press.

Notes

The implementation uses the lower median for even sample sizes.

Listwise and Setwise Choice Models

Listwise and setwise probabilistic choice models (Luce family).

In the binary tensor setting \(R \in \{0,1\}^{L \times M \times N}\), each event \((m,n)\) induces a winner set \(W_{mn}\) (correct models) and loser set \(L_{mn}\) (incorrect models).

Luce-family models assign positive strengths \(\pi_i\) and define selection probabilities over comparison sets:

\[\Pr(i \mid S) = \frac{\pi_i}{\sum_{j \in S} \pi_j}.\]

For strict orderings \(i_1 \succ i_2 \succ \cdots \succ i_K\), the Plackett-Luce likelihood is

\[\Pr(i_1 \succ i_2 \succ \cdots \succ i_K) = \prod_{k=1}^{K} \frac{\pi_{i_k}}{\sum_{j=k}^{K} \pi_{i_j}}.\]

This module uses three constructions:

Pairwise reduction (wins only): reduce \(R\) to decisive pairwise counts and fit the Bradley-Terry form of Plackett-Luce.
Setwise likelihood with ties: treat each observed winner set as one tied choice event using Davidson-Luce normalization.
Setwise composite likelihood: convert each winner into a Luce choice from {winner} union {losers} (Bradley-Terry-Luce rank breaking).

Estimation uses MM updates for the Plackett-Luce pairwise reduction and L-BFGS optimization for the setwise models.

References

Plackett, R. L. (1975). The Analysis of Permutations. Journal of the Royal Statistical Society: Series C.

Luce, R. D. (1959). Individual Choice Behavior: A Theoretical Analysis. John Wiley & Sons.

Hunter, D. R. (2004). MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1), 384-406.

Firth, D., Kosmidis, I., & Turner, H. L. (2019). Davidson-Luce model for multi-item choice with ties. arXiv:1909.07123.

plackett_luce(R, method='competition', return_scores=False, max_iter=500, tol=1e-8)[source]

Rank models with Plackett-Luce maximum likelihood.

Method context:: In Scorio’s binary tensor setting, we reduce outcomes to decisive pairwise win counts and fit the Bradley-Terry form of a Plackett-Luce model using Hunter’s MM updates.

References

Plackett, R. L. (1975). The Analysis of Permutations. Luce, R. D. (1959). Individual Choice Behavior. Hunter, D. R. (2004). MM Algorithms for Generalized Bradley-Terry Models.

Parameters:

R (ndarray) – Binary tensor of shape (L, M, N) (or (L, M) treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Ranking method passed to scorio.utils.rank_scores.
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive MM iteration budget.
tol (float) – Positive finite convergence tolerance on max parameter change.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, returns (ranking, scores) where scores are the estimated strengths pi.

Notation:: W_ij: number of decisive outcomes where model i beats j. N_ij = W_ij + W_ji: total decisive pairwise comparisons. pi_i > 0: latent model strengths.
Formula:: \[\pi_i^{(k+1)} = \frac{\sum_j W_{ij}} {\sum_{j \ne i} N_{ij} / (\pi_i^{(k)} + \pi_j^{(k)})}\]

followed by normalization to resolve scale non-identifiability.

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks = rank.plackett_luce(R)
>>> ranks[0] < ranks[1]  # Model 0 has better (lower) rank
True

Notes

This implementation intentionally ignores within-outcome ties (both-correct or both-incorrect), matching pairwise decisive reduction.

plackett_luce_map(R, prior=1.0, method='competition', return_scores=False, max_iter=500)[source]

Rank models with Plackett-Luce maximum a posteriori estimation.

Method context:: Adds a prior penalty on centered log-strengths to the pairwise-reduced Plackett-Luce likelihood. Numeric priors are interpreted as Gaussian prior variances.

References

Luce, R. D. (1959). Individual Choice Behavior. Hunter, D. R. (2004). MM Algorithms for Generalized Bradley-Terry Models.

Parameters:

R (ndarray) – Binary tensor of shape (L, M, N) (or (L, M) treated as N=1).
prior (Prior | float) – Prior instance or positive finite float variance for GaussianPrior(mean=0, var=prior).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Ranking method passed to scorio.utils.rank_scores.
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive optimizer iteration budget.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, returns (ranking, scores).

Formula:: Let theta_i = log(pi_i) and P(theta) be the prior penalty.

\[\hat\theta \in \arg\min_{\theta} \left[ -\sum_{i \ne j} W_{ij} \left(\theta_i - \log(e^{\theta_i}+e^{\theta_j})\right) + P(\theta - \bar\theta) \right]\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks = rank.plackett_luce_map(R, prior=1.0)
>>> ranks[0] < ranks[1]
True

Notes

The MAP objective is solved with L-BFGS-B over centered log-strengths.

davidson_luce(R, method='competition', return_scores=False, max_iter=500, max_tie_order=None)[source]

Rank models with Davidson-Luce maximum likelihood (setwise ties).

Method context:: Each question-trial induces a winner set W and loser set L. Davidson-Luce models tied winners directly with tie-order parameters. Normalization terms are computed with elementary symmetric polynomials.

References

Firth, D., Kosmidis, I., & Turner, H. L. (2019). Davidson-Luce model for multi-item choice with ties. https://arxiv.org/abs/1909.07123

Parameters:

R (ndarray) – Binary tensor of shape (L, M, N) (or (L, M) treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Ranking method passed to scorio.utils.rank_scores.
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive optimizer iteration budget.
max_tie_order (int | None) – Maximum tie order D used in normalization; default is L-1.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, returns (ranking, scores) where scores are strengths alpha.

Notation:: S = W \cup L is the comparison set and t = |W|. delta_1 = 1 and delta_t > 0 for t >= 2. g_t(T) = (\prod_{i\in T} alpha_i)^{1/t}.
Formula:: \[\Pr(W \mid S) = \frac{\delta_t g_t(W)} {\sum_{t'=1}^{\min(D,|S|)} \delta_{t'} \sum_{|T|=t'} g_{t'}(T)}\]

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks = rank.davidson_luce(R)
>>> ranks[0] < ranks[1]  # Model 0 has better (lower) rank
True

Notes

Events with all winners or all losers are dropped as uninformative.

davidson_luce_map(R, prior=1.0, method='competition', return_scores=False, max_iter=500, max_tie_order=None)[source]

Rank models with Davidson-Luce MAP estimation.

Method context:: Adds a prior penalty on centered log-strengths to the Davidson-Luce setwise tie likelihood.

Parameters:

R (ndarray) – Binary tensor of shape (L, M, N) (or (L, M) treated as N=1).
prior (Prior | float) – Prior instance or positive finite float variance for GaussianPrior(mean=0, var=prior).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Ranking method passed to scorio.utils.rank_scores.
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive optimizer iteration budget.
max_tie_order (int | None) – Maximum tie order D used in normalization; default is L-1.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, returns (ranking, scores).

Examples

>>> import numpy as np
>>> from scorio import rank
>>> R = np.array([
...     [[1, 1], [1, 1]],
...     [[0, 0], [0, 0]],
... ])
>>> ranks = rank.davidson_luce_map(R, prior=1.0)
>>> ranks[0] < ranks[1]
True

bradley_terry_luce(R, method='competition', return_scores=False, max_iter=500)[source]

Rank models with Bradley-Terry-Luce composite-likelihood ML.

Method context:: For each event (W, L), each winner i in W is treated as a Luce choice from {i} union L. This yields a rank-breaking composite likelihood objective (not a normalized probability for the whole winner set W as a single event).

References

Luce, R. D. (1959). Individual Choice Behavior.

Parameters:

R (ndarray) – Binary tensor of shape (L, M, N) (or (L, M) treated as N=1).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Ranking method passed to scorio.utils.rank_scores.
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive optimizer iteration budget.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, returns (ranking, scores).

Formula:: For event (W, L), define per-winner choice sets C_i = {i} union L and optimize the composite log-likelihood

\[\ell_{\mathrm{comp}}(\pi) = \sum_{(W,L)}\sum_{i\in W} \left[ \log \pi_i - \log\left(\pi_i + \sum_{j\in L}\pi_j\right) \right]\]

Notes

This objective is a Luce-style composite likelihood induced by rank-breaking, rather than a full normalized likelihood over all possible winner subsets.

bradley_terry_luce_map(R, prior=1.0, method='competition', return_scores=False, max_iter=500)[source]

Rank models with Bradley-Terry-Luce composite-likelihood MAP estimation.

Method context:: Adds a prior penalty on centered log-strengths to the BTL setwise-choice composite likelihood.

Parameters:

R (ndarray) – Binary tensor of shape (L, M, N) (or (L, M) treated as N=1).
prior (Prior | float) – Prior instance or positive finite float variance for GaussianPrior(mean=0, var=prior).
method (Literal['competition', 'competition_max', 'dense', 'avg']) – Ranking method passed to scorio.utils.rank_scores.
return_scores (bool) – If True, return (ranking, scores).
max_iter (int) – Positive optimizer iteration budget.

Return type:

ndarray | tuple[ndarray, ndarray]

Returns:

Ranking array of shape (L,). If return_scores=True, returns (ranking, scores).