
HiBayES: Improving LLM Evaluation with Hierarchical Bayesian Modelling

HiBayES: a flexible, robust statistical modelling framework that accounts for the nuances and hierarchical structure of advanced evaluations.

Accurately assessing the capabilities of advanced Large Language Models (LLMs) has become increasingly challenging. While LLMs have historically been evaluated with question-answer benchmarks, the development of more complex capabilities means evaluations must evolve towards agentic and capability-elicitation settings. These advanced evaluations involve complex scenarios and diverse types of inputs that test a model's capabilities in ways not captured by earlier, simpler benchmarks.

Conventional approaches to the statistical analysis of evaluation results often fall short when faced with these hierarchically structured and complex datasets, small sample sizes, and the inherently stochastic nature of LLM outputs. These new multi-layered and more intricate tasks require deeper analysis, robust uncertainty quantification and more nuanced understanding. Our new paper introduces HiBayES: a hierarchical Bayesian modelling framework designed to address these limitations through principled statistical methods.

The Challenges of LLM Evaluation

AI evaluations face a fundamental challenge: as LLMs and agentic systems become increasingly sophisticated, the demand for reliable, nuanced benchmarks and evaluation metrics is soaring. Accurate assessment of model capabilities and prediction of model behaviour in real-world use cases are not only crucial for driving further model development but also vital for improving model safety and safeguards in deployment.

The challenge lies in achieving precision, robustness, and principled uncertainty quantification for increasingly complex and hierarchically structured evaluation benchmarks. Conventional statistical approaches often fall short when dealing with the intricate and multidimensional nature of modern AI assessments. As a result, there is a pressing need for advanced methods that can provide reliable insights into model performance across diverse scenarios.

Compounding this issue is the economic cost of thoroughly testing performance across benchmarks and in complex, agentic settings – often hundreds of US dollars in tokens to solve a single task. This economic reality necessitates working with smaller datasets, precisely where conventional approaches to data analysis are not robust.

The Conventional Analysis Approach

While conventional analyses that apply simple statistical tests (e.g., t-tests) to pre-averaged data offer a quick means of answering basic questions, they often disregard the complex, hierarchical structure of evaluation data (Figure 1). Such methods are known to overfit evaluation data and do not systematically quantify uncertainty across the levels of the data hierarchy – both of which lead to under- or over-estimation of model capabilities. For organisations committed to deploying safe and effective AI systems, these statistical shortcomings represent a significant barrier to progress.

Figure 1. An example of hierarchically nested evaluation data. Evaluation data is typically acquired across multiple hierarchical levels: repetitions of items/questions within subdomains and domains, often evaluating multiple LLMs on the same set of tasks. This complex structure is often overlooked in current statistical practices.
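To make this structure concrete, evaluation results can be kept in long format – one row per scored attempt – rather than pre-averaged per domain. The snippet below is a hedged, illustrative sketch (the column names and scores are hypothetical, not the HiBayES data schema):

```python
import pandas as pd

# Illustrative long-format evaluation data: one row per scored attempt,
# preserving the full hierarchy (model > domain > subdomain > item > epoch)
# instead of collapsing it through pre-averaging.
df = pd.DataFrame([
    {"model": "llm_a", "domain": "maths",  "subdomain": "algebra",   "item": "q01", "epoch": 1, "score": 1},
    {"model": "llm_a", "domain": "maths",  "subdomain": "algebra",   "item": "q01", "epoch": 2, "score": 0},
    {"model": "llm_a", "domain": "coding", "subdomain": "debugging", "item": "q07", "epoch": 1, "score": 1},
    {"model": "llm_b", "domain": "coding", "subdomain": "debugging", "item": "q07", "epoch": 1, "score": 0},
])

# The conventional route collapses this hierarchy into a single mean per
# model and domain before testing – exactly the pre-averaging discussed above.
print(df.groupby(["model", "domain"])["score"].mean())
```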

HiBayES: A Statistical Modelling Framework

HiBayES addresses these challenges by introducing a flexible, robust statistical modelling framework grounded in hierarchical (multilevel) Bayesian Generalised Linear Models (GLMs). Drawing inspiration from modern statistical practices in natural and social sciences, HiBayES enables nuanced, principled estimation of AI capabilities while providing formal uncertainty quantification.

The framework leverages three key components:

  1. Multilevel GLMs to capture the hierarchical nature and distributional properties of evaluation data,
  2. Bayesian data analysis for robust parameter estimation and uncertainty quantification across nested, hierarchical data structures,
  3. Formal model comparison using information criteria.

Unlike conventional approaches, HiBayES explicitly accounts for the hierarchical nature of evaluation data sets (Figure 1), providing more accurate insights into model performance across varying domains and complexity levels.
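To give a sense of what such a model looks like in practice, here is a minimal sketch of a two-level Bayesian logistic GLM on toy data. It uses PyMC and is not the HiBayES package itself; the variable names, priors, and data are illustrative. Item-level scores are nested within domains, and domain effects are partially pooled towards an overall performance parameter:

```python
import numpy as np
import pymc as pm

# Toy data: number of correct answers out of n_trials attempts per item,
# with items nested in domains.
rng = np.random.default_rng(0)
n_domains, items_per_domain, n_trials = 3, 5, 10
domain_idx = np.repeat(np.arange(n_domains), items_per_domain)
n_correct = rng.binomial(n_trials, 0.6, size=n_domains * items_per_domain)

with pm.Model() as model:
    # Population-level (overall) log-odds of success.
    mu = pm.Normal("mu", 0.0, 1.5)
    # Between-domain variability; domain effects are partially pooled towards mu.
    sigma_domain = pm.HalfNormal("sigma_domain", 1.0)
    domain_effect = pm.Normal("domain_effect", 0.0, sigma_domain, shape=n_domains)
    # Binomial likelihood on a logit link: a hierarchical logistic GLM.
    p = pm.math.invlogit(mu + domain_effect[domain_idx])
    pm.Binomial("obs", n=n_trials, p=p, observed=n_correct)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```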

Practical Applications and Results

We tested HiBayES in three typical evaluation scenarios and found clear advantages over conventional methods*:

1. Single-LLM performance: When evaluating one LLM across multiple benchmarks with varying item counts, HiBayES accurately determined performance differences while avoiding the pitfalls of conventional t-tests, which often struggle with uneven sample sizes (requiring data padding or subsampling) and can lead to inflated effects, especially when estimating unobserved variables (e.g., overall LLM performance) (Figure 2).

Figure 2. Forest plot showing posterior means, distributions, and 95% HPDI (estimated using HiBayES). This plot highlights domain-specific and overall performance (blue), indicating the width (thin lines) and uncertainty (thick lines) of the posterior distributions around the means (dots). For comparison, empirical means and error bars (SEM) per domain (orange) and overall (red) are also shown.
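For readers who want to reproduce this kind of forest plot, posterior means and 95% intervals can be read off a fitted model with standard tooling. Below is a hedged sketch using ArviZ on the toy model above (ArviZ is an assumption here, not necessarily what HiBayES uses internally):

```python
import arviz as az
import matplotlib.pyplot as plt

# Posterior means and 95% intervals per domain and overall, analogous to Figure 2.
# ArviZ labels these "HDI"; hdi_prob=0.95 matches the 95% HPDI in the caption.
print(az.summary(idata, var_names=["mu", "domain_effect"], hdi_prob=0.95))

# Forest plot of the same quantities: one interval per domain effect plus the
# overall parameter, with chains combined.
az.plot_forest(idata, var_names=["mu", "domain_effect"], hdi_prob=0.95, combined=True)
plt.show()
```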

2. Two LLM comparison: When comparing the performance of two LLMs across multiple subdomains, HiBayES provided balanced and nuanced assessments, revealing no meaningful performance differences in cases where conventional statistics (t-tests) indicated statistically significant disparities (Figure 3).

Figure 3. Forest plot of posterior means, distributions, and 95% HPDI (estimated using HiBayES). This plot focuses on LLM-, domain-, and subdomain-specific effects, showing the width (thin lines) and uncertainty (thick lines) of the posterior distributions around the means (blue dots). For comparison, empirical means and SEM per LLM, domain, and subdomain (orange) are also shown.
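The quantity behind this comparison is the posterior distribution of the difference between the two LLMs' coefficients, rather than a t-test on pre-averaged scores. A hedged sketch, assuming a fitted model like the one above but extended with a hypothetical per-LLM coefficient called llm_effect:

```python
import arviz as az

def compare_llms(idata, var="llm_effect", a=0, b=1, hdi_prob=0.95):
    """Posterior contrast between two LLM coefficients.

    Assumes the posterior contains a per-LLM coefficient named `var`
    (hypothetical here); returns the mean difference and its HPDI.
    """
    samples = idata.posterior[var].values      # shape: (chain, draw, n_llms)
    diff = samples[..., a] - samples[..., b]   # per-draw difference, LLM a minus LLM b
    return float(diff.mean()), az.hdi(diff, hdi_prob=hdi_prob)

# If the returned interval comfortably spans zero, the data provide no strong
# evidence of a real performance gap – even when a t-test on pre-averaged
# scores reports a "significant" difference.
```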

3. Elicitation: For evaluating multiple LLMs on an agentic benchmark (GAIA) containing tasks of varying difficulty, HiBayES effectively handled data from tasks that were either always or never solved correctly. The results of our analysis highlight how different levels of LLM reasoning effort (prompting or configuring the LLM to dedicate more “cognitive” resources and processing depth to solving a problem) improve performance for some models (Figure 4).

Figure 4. Agentic evaluation analysis. A) GAIA success rate on different tasks – the data are asymmetric and bimodal. B) GLM comparison using WAIC. Higher WAIC values indicate better GLM fit to the data. The Reasoning Beta-Binomial GLM fits better than the Reasoning Binomial GLM and the Null GLM. C) Forest plot of the effect of reasoning effort and task difficulty on the performance of 4 LLMs, showing posterior means, distributions, and 95% HPDI (blue dots indicate means, thin lines indicate width, and thick lines indicate uncertainty).
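To illustrate the kind of model comparison shown in panel B, the sketch below compares a Beta-Binomial against a plain Binomial likelihood via WAIC on toy all-or-nothing data. It uses PyMC and ArviZ and omits the reasoning-effort and difficulty covariates for brevity; it is an illustration of the technique, not the analysis from the paper:

```python
import numpy as np
import pymc as pm
import arviz as az

# Toy all-or-nothing data mimicking panel A: many tasks are solved on every
# attempt or on none, which a plain Binomial handles poorly.
n_trials = 8
successes = np.array([0, 0, 0, 8, 8, 8, 2, 6, 0, 8])

def fit(beta_binomial: bool):
    with pm.Model():
        mu = pm.Beta("mu", 2.0, 2.0)  # mean success probability
        if beta_binomial:
            # Beta-Binomial: extra dispersion to absorb the all-or-nothing pattern.
            kappa = pm.HalfNormal("kappa", 10.0)
            pm.BetaBinomial("obs", alpha=mu * kappa, beta=(1 - mu) * kappa,
                            n=n_trials, observed=successes)
        else:
            pm.Binomial("obs", n=n_trials, p=mu, observed=successes)
        return pm.sample(1000, tune=1000, chains=2, random_seed=0,
                         idata_kwargs={"log_likelihood": True})

# WAIC comparison (ArviZ reports elpd_waic, where higher means better fit).
print(az.compare({"beta_binomial": fit(True), "binomial": fit(False)}, ic="waic"))
```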

These results underscore that the use of HiBayES provides principled uncertainty quantification and robust parameter estimation, even in low-data regimes with high levels of complexity.

Practical Benefits

The practical implications of HiBayES extend beyond improved statistical methodology. By enabling reliable evaluation with smaller datasets, the framework helps streamline the evaluation process, reducing the amount of data required – and potentially the cost – while maintaining scientific integrity.

HiBayES also scales naturally to comparisons across multiple LLMs or benchmarks without requiring additional statistical adjustments, providing comprehensive and reliable evaluations that support safer and more effective AI development. This approach will help prevent misguided data-driven decisions that could otherwise hinder progress in model development and safety.

Looking Forward

HiBayES provides the building blocks for a more rigorous statistical approach to AI evaluation. Future extensions may include expanding predictive capabilities to estimate correlations across related tasks and domains, deepening our understanding of LLM performance and capability structures. The framework could also be used to reliably predict results from real-world applications like human-uplift studies based on findings from automated benchmarks.

For researchers and practitioners in the AI community, HiBayES offers a ready-to-use software package with step-by-step guidance for implementing multilevel Bayesian GLMs.

As AI systems continue to advance in capabilities and complexity, frameworks like HiBayES will become increasingly essential for ensuring that our evaluations remain both scientifically sound and practically informative, ultimately supporting the development of safer, more secure and more reliable models.

*Note that all LLM results reported here and in the paper are based on data acquired on publicly available benchmarks using the UK AI Security Institute’s evaluation framework Inspect. As such, we make no claims about the integrity of LLM capabilities reported by model developers.