Experimental Design Methods

Classical design-of-experiments instruments — variance decomposition, measurement reliability, space-filling quality, and screening designs.

When to use: You have a structured experimental layout (balanced grid, crossed facets, factorial design) or need to evaluate the quality of a sampling plan. These methods operate on observed data arranged in known designs, not on black-box model evaluations with random sampling.

ANOVA

Fisher (1925) Statistical Methods for Research Workers. [bib]

Theory

Analysis of variance partitions the total sum of squares of a balanced design into components attributable to each factor and their interactions. For a two-way layout with factors A (rows, a levels) and B (columns, b levels):

SS_{\text{total}} = SS_A + SS_B + SS_{AB} + SS_{\text{error}}

where SS_A = b \sum_{i=1}^{a} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 measures the row (factor A) effect, SS_B = a \sum_{j=1}^{b} (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2 measures the column (factor B) effect, and SS_{AB} captures the interaction residual.

Each mean square MS = SS / df is tested against an appropriate denominator via the F-statistic:

F_A = \frac{MS_A}{MS_{AB}}, \quad F_B = \frac{MS_B}{MS_{AB}}

with p-values from the Fisher-Snedecor distribution. A significant F-ratio means the factor’s effect is larger than expected from the interaction noise alone.

The three-way extension adds factors A \times B \times C with all two-way and three-way interaction terms. The implementation uses the three-way interaction MS_{ABC} as the error denominator for all F-tests (unreplicated balanced design — no pure-error term exists).

Code

use salib::estimators::estimate_anova_two_way;
use ndarray::arr2;

// 3 x 4 balanced grid: 3 levels of factor A, 4 levels of factor B
let grid = arr2(&[
    [23.0, 27.0, 21.0, 25.0],
    [30.0, 34.0, 28.0, 32.0],
    [18.0, 22.0, 16.0, 20.0],
]);

let result = estimate_anova_two_way(grid.view()).unwrap();
println!("{result}");

Three-way ANOVA:

use salib::estimators::estimate_anova_three_way;
use ndarray::Array3;

// A x B x C balanced grid
let grid = Array3::<f64>::from_shape_fn((3, 4, 2), |(i, j, k)| {
    10.0 * i as f64 + 3.0 * j as f64 + 1.5 * k as f64
});

let result = estimate_anova_three_way(grid.view()).unwrap();
println!("{result}");

Bootstrap confidence intervals on variance fractions:

use salib::estimators::estimate_anova_two_way_with_bootstrap;
use salib::RngState;
use ndarray::arr2;

let grid = arr2(&[
    [23.0, 27.0, 21.0, 25.0],
    [30.0, 34.0, 28.0, 32.0],
    [18.0, 22.0, 16.0, 20.0],
]);
let mut rng = RngState::from_seed([0u8; 32]);

let result = estimate_anova_two_way_with_bootstrap(
    grid.view(), 2000, 0.05, &mut rng
).unwrap();

Note: The implementation operates on unreplicated balanced grids. The residual sum of squares is identically zero — inferential F-tests use the interaction mean square as the denominator, following the classical unreplicated two-way design convention.

G-Theory

Brennan (2001) Generalizability Theory, Springer. [bib]

Theory

Generalizability theory extends classical reliability (Cronbach’s alpha) to designs with multiple facets. A crossed p \times i \times r design (persons \times items \times raters) decomposes the total variance into seven components:

\sigma^2_{\text{total}} = \sigma^2_p + \sigma^2_i + \sigma^2_r + \sigma^2_{pi} + \sigma^2_{pr} + \sigma^2_{ir} + \sigma^2_{pir}

Each component is estimated from the corresponding mean square via the standard EMS equations (Brennan 2001 Chapter 2). The two reliability coefficients are:

Generalizability coefficient (relative decisions):

\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\delta}

where \sigma^2_\delta = \sigma^2_{pi}/n_i + \sigma^2_{pr}/n_r + \sigma^2_{pir}/(n_i \cdot n_r) is the relative error variance.

Dependability coefficient (absolute decisions):

\Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta}

where \sigma^2_\Delta includes all non-person variance components scaled by their facet sample sizes.

The D-study projects \rho^2 and \Phi to hypothetical designs with different numbers of items and raters — answering “how many raters/items do I need to reach \rho^2 \geq 0.80?”

Code

use salib::estimators::{estimate_g_theory_pir, GTheoryDesign};
use ndarray::Array3;

// p x i x r balanced grid: 10 persons, 5 items, 3 raters
let grid = Array3::<f64>::from_shape_fn((10, 5, 3), |(p, i, r)| {
    50.0 + 6.0 * p as f64 + 2.0 * i as f64 + 0.5 * r as f64
});

let result = estimate_g_theory_pir(grid.view(), GTheoryDesign::Crossed).unwrap();
println!("{result}");

D-study projection:

use salib::estimators::project_g_theory_d_study;

// What if we used 8 items and 4 raters instead?
let projected = project_g_theory_d_study(&result, 8, 4).unwrap();
println!("G = {:.4}, Phi = {:.4}", projected.g_coefficient, projected.phi_coefficient);

Bootstrap confidence intervals:

use salib::estimators::{estimate_g_theory_pir_with_bootstrap, GTheoryDesign};
use salib::RngState;

let mut rng = RngState::from_seed([0u8; 32]);
let result = estimate_g_theory_pir_with_bootstrap(
    grid.view(), GTheoryDesign::Crossed, 2000, 0.05, &mut rng
).unwrap();

When to use: Measurement study design — deciding how many raters, items, or occasions are needed to achieve a target reliability. The D-study projection avoids costly pilot studies by extrapolating from a single G-study.

Scope: Only the fully crossed p \times i \times r design is implemented. Nested and mixed designs are planned for a future release.

Discrepancy

Hickernell (1998) in Monte Carlo and Quasi-Monte Carlo Methods 1996, Springer. [bib]

Theory

Discrepancy measures how uniformly a point set fills the unit hypercube [0, 1]^d. Lower discrepancy implies better space-filling and, by the Koksma-Hlawka inequality, lower quasi-Monte Carlo integration error:

\left|\int_{[0,1]^d} f(\mathbf{x})\,d\mathbf{x} - \frac{1}{N}\sum_{i=1}^{N} f(\mathbf{x}_i)\right| \leq D^*(P_N) \cdot V(f)

where D^*(P_N) is the star discrepancy of the point set and V(f) is the variation of f in the sense of Hardy and Krause.

Four discrepancy measures are computed simultaneously:

Measure	Description
Centered (CD)	Hickernell 1998 Eq 3.8. Symmetric around 0.5.
Wrap-around (WD)	Hickernell 1998 Eq 3.10. Invariant to coordinate shifts modulo 1.
Modified (MD)	Fang et al. 2006. Penalizes boundary clustering more heavily.
L2-star	Niederreiter 1992. The classical L2 star discrepancy.

Code

use salib::estimators::compute_discrepancy;
use ndarray::Array2;

// Evaluate a 100-point Sobol' sequence in 3 dimensions
let sample = Array2::<f64>::from_shape_fn((100, 3), |(i, j)| {
    // (placeholder — use your actual sample matrix)
    ((i * 7 + j * 13) % 100) as f64 / 100.0
});

let result = compute_discrepancy(sample.view()).unwrap();
println!("{result}");

Use case: Evaluate and compare the quality of sampling designs — LHS vs Sobol’ sequence vs Halton sequence vs random. Lower discrepancy values indicate better space-filling. All input values must lie in [0, 1]; rescale if needed.

Fractional Factorial

Box, Hunter & Hunter (1978) Statistics for Experimenters, Wiley. [bib]

Theory

A 2^{k-p} fractional factorial runs a fraction of the full factorial design to estimate main effects and (when resolution permits) two-factor interactions. The implementation uses Plackett-Burman designs — a family of two-level screening designs where N runs (a multiple of 4) can screen up to N - 1 factors.

For each factor i, the main effect is the difference in mean response between the high (+1) and low (-1) levels:

\text{effect}_i = \bar{Y}_{x_i = +1} - \bar{Y}_{x_i = -1}

The coded levels \{-1, +1\} are mapped to the physical factor bounds via x_{\text{phys}} = \text{lo} + \frac{x_{\text{coded}} + 1}{2}(\text{hi} - \text{lo}).

Code

use salib::estimators::estimate_fractional_factorial;
use salib::samplers::build_plackett_burman;
use salib::*;

let problem = ProblemBuilder::new()
    .factor("x1", Distribution::Uniform { lo: 0.0, hi: 1.0 })
    .factor("x2", Distribution::Uniform { lo: 0.0, hi: 1.0 })
    .factor("x3", Distribution::Uniform { lo: 0.0, hi: 1.0 })
    .build()
    .unwrap();

let design = build_plackett_burman(problem.dim()).unwrap();

let effects = estimate_fractional_factorial(&design, &problem, |x| {
    3.0 * x[0] + 1.0 * x[1] + 0.5 * x[2]
});

println!("{effects}");

When to use: Classical DoE screening — identify which factors have large main effects using the fewest runs possible. Most useful when you have a physical experiment (not just a computer model) and runs are expensive. For computer models with cheap evaluations, variance-based methods (Sobol’) or Morris screening are usually preferable.

Choosing among experimental design methods

Method	Input	Output	Best for
ANOVA	Balanced grid	Variance fractions + F-tests	Factor significance in structured experiments.
G-Theory	Crossed p \times i \times r grid	Variance components + \rho^2, \Phi	Measurement reliability, D-study projections.
Discrepancy	Point set in [0, 1]^d	CD, WD, MD, L2*	Evaluating space-filling quality of sampling plans.
Fractional Factorial	Plackett-Burman design	Main effects	Screening with minimal runs in physical experiments.