Architecture - simple

Design Principles

Facade Pattern

STATISTICS is the main facade class that coordinates all statistical operations. This provides a single entry point and simplifies the API.

Design by Contract

Every feature includes formal contracts:

Preconditions (require): What callers must guarantee (non-empty arrays, matching lengths, etc.)
Postconditions (ensure): What the implementation guarantees (range constraints, ordering, etc.)
Invariants: Properties that hold throughout object lifetime
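
Eiffel expresses these contracts natively with require and ensure clauses. As a rough illustration only, the Python sketch below emulates one precondition and one postcondition with assert statements. The function name population_variance and the specific contracts shown are assumptions made for this example, not the library's actual feature signatures.

```python
def population_variance(data):
    """Population variance with contract-style checks (illustrative sketch)."""
    # Precondition (Eiffel `require`): the caller guarantees a non-empty array.
    assert len(data) > 0, "precondition violated: data must be non-empty"

    n = len(data)
    mean = sum(data) / n
    result = sum((x - mean) ** 2 for x in data) / n

    # Postcondition (Eiffel `ensure`): a variance is never negative.
    assert result >= 0.0, "postcondition violated: variance must be >= 0"
    return result
```

Calling population_variance([]) fails at the precondition, which places the blame on the caller rather than on the implementation.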

Immutable Results

TEST_RESULT and REGRESSION_RESULT are immutable: once created, their values cannot change. This makes them safe to share between threads and simplifies reasoning about code.
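
A minimal Python analogue of an immutable result object is sketched below. The class name HypothesisTestResult and its field names are hypothetical; they mirror the three values this document lists for TEST_RESULT but are not taken from the Eiffel source.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class HypothesisTestResult:
    """Immutable record of a hypothesis test outcome (illustrative sketch)."""
    statistic: float
    p_value: float
    degrees_of_freedom: float


def try_to_mutate(result):
    """Return True if the object rejected an attempted field assignment."""
    try:
        result.p_value = 0.5  # frozen dataclasses raise on assignment
    except FrozenInstanceError:
        return True
    return False
```

Because every field is fixed at construction time, a result can be handed to several consumers without defensive copying.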

Numerical Stability

All algorithms use numerically stable variants:

Welford's algorithm: For mean and variance - one-pass incremental updates that limit accumulated rounding error
Kahan summation: For sum - compensated addition to reduce floating-point error
Linear interpolation: For percentiles - stable computation

Class Structure

STATISTICS (Facade)

Main entry point. Stateless - provides all statistical operations.

Methods:
- Descriptive Statistics: mean, median, variance, std_dev, etc.
- Bivariate Analysis: covariance, correlation
- Regression: linear_regression
- Hypothesis Testing: t_test_*, chi_square_test, anova

TEST_RESULT (Data)

Immutable result object from hypothesis tests. Contains test statistic, p-value, and degrees of freedom.

REGRESSION_RESULT (Data)

Immutable regression output. Contains slope, intercept, R², and prediction capability.

ASSUMPTION_CHECK (Data)

Documents validation of statistical test assumptions (normality, homogeneity of variance, etc.).

CLEANED_STATISTICS (Utility)

Data cleaning utility class. Removes NaN and infinite values from arrays.
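
The cleaning step can be pictured with the short Python sketch below. The function name remove_non_finite is hypothetical and the code only illustrates the filtering rule; it does not reproduce the Eiffel class.

```python
import math


def remove_non_finite(data):
    """Return a new list without NaN, +infinity, or -infinity entries."""
    # math.isfinite is False for NaN and for both infinities.
    return [x for x in data if math.isfinite(x)]
```

Returning a new list rather than editing in place leaves the caller's original data untouched.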

Key Algorithms

Mean Computation

Welford's Online Mean

Computes the mean in a single pass while maintaining numerical stability:

mean = 0, n = 0
for each x in data:
    n = n + 1
    mean = mean + (x - mean) / n

Variance Computation

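Two-Pass Variance

First pass: compute the mean. Second pass: compute the sum of squared deviations from that mean. Subtracting the mean before squaring avoids the cancellation that affects the naive sum-of-squares formula.

sum_sq_dev = 0
for each x in data:
    sum_sq_dev += (x - mean)^2
variance = sum_sq_dev / n        (population variance; divide by n - 1 for the sample variance)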
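
The Numerical Stability list names Welford's algorithm for variance as well as for the mean. For comparison with the two-pass formulation, a single-pass Welford update is sketched below in Python. It is an illustration, not the library's Eiffel code; the function name is hypothetical, and like the pseudocode it divides by n (population variance).

```python
def welford_mean_variance(data):
    """Single-pass mean and population variance using Welford's update."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / n         # update the running mean
        m2 += delta * (x - mean)  # uses both the old and the new mean
    if n == 0:
        raise ValueError("data must be non-empty")
    return mean, m2 / n
```

The single-pass form suits streaming input, while the two-pass form is easier to verify by hand.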

Sum Computation

Kahan Summation

Compensated summation that tracks and corrects for floating-point rounding errors:

sum = 0, compensation = 0
for each x in data:
    y = x - compensation
    temp = sum + y
    compensation = (temp - sum) - y
    sum = temp
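
A runnable Python version of compensated summation is sketched below, next to a plain left-to-right loop for comparison. Both function names are hypothetical, and this is the textbook Kahan update rather than a transcription of the Eiffel feature.

```python
def naive_sum(data):
    """Plain left-to-right floating-point summation."""
    total = 0.0
    for x in data:
        total += x
    return total


def kahan_sum(data):
    """Compensated (Kahan) summation of an iterable of floats."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in data:
        y = x - compensation            # apply the correction to the next term
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # rounding error of the addition above
        total = t
    return total
```

For an input such as 1.0 followed by ten thousand copies of 1e-16, the plain loop returns exactly 1.0 because each small term is rounded away, whereas the compensated sum stays close to 1.000000000001.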

Percentile Computation

R-7 Linear Interpolation (Hyndman-Fan Type 7)

After sorting, with 0-based indices: h = (p/100) * (n-1), then interpolate linearly between the values at floor(h) and ceil(h).
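
The rule h = (p/100) * (n-1) can be tried out with the Python sketch below. The function name and the error handling are assumptions made for illustration, not the library's interface.

```python
import math


def percentile(data, p):
    """Percentile by linear interpolation at h = (p/100) * (n - 1)."""
    if len(data) == 0:
        raise ValueError("data must be non-empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(data)
    h = (p / 100) * (len(ordered) - 1)  # fractional 0-based position
    lower = math.floor(h)
    upper = math.ceil(h)
    # Interpolate between the two neighbouring order statistics.
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])
```

With the data [10, 20, 30, 40] and p = 25, h is 0.75, so the result lies three quarters of the way from 10 to 20, that is 17.5.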

Correlation

Pearson Correlation

r = cov(x,y) / (std(x) * std(y)), bounded to [-1, 1].
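
The formula above can be sketched in Python as follows. The function name is hypothetical, and the final clamp is one assumed way of enforcing the [-1, 1] bound against rounding error; the source does not say how the library enforces it.

```python
import math


def pearson_correlation(x, y):
    """Pearson r from centred sums; the 1/n factors cancel in the ratio."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    r = sxy / math.sqrt(sxx * syy)
    # Guard against rounding pushing r marginally outside [-1, 1].
    return max(-1.0, min(1.0, r))
```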

Linear Regression

Ordinary Least Squares

Uses the closed-form solution of the normal equations for a single predictor: slope = cov(x,y) / var(x), intercept = mean(y) - slope * mean(x).
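
A small Python sketch of these two formulas follows. The function name and the returned tuple are assumptions for illustration; R² is computed here as the squared Pearson correlation, which holds for simple linear regression with an intercept.

```python
def simple_linear_regression(x, y):
    """Return (slope, intercept, r_squared) for a least-squares line."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0:
        raise ValueError("x must not be constant")
    slope = sxy / sxx                    # cov(x, y) / var(x)
    intercept = mean_y - slope * mean_x
    # For constant y the fit is exact, so report 1.0 (a choice made here).
    r_squared = (sxy * sxy) / (sxx * syy) if syy != 0 else 1.0
    return slope, intercept, r_squared
```

For x = [1, 2, 3, 4] and y = [3, 5, 7, 9] the fitted line is y = 2x + 1 with R² equal to 1.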

Hypothesis Tests

t-test: Computes t-statistic = (mean - mu) / std_err for the one-sample test, where std_err = std_dev / sqrt(n). Dof = n-1 for one-sample; Welch-Satterthwaite approximation for two-sample.
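
The two-sample case can be sketched in Python as below, using the standard Welch statistic and the Welch-Satterthwaite degrees of freedom. The function name is hypothetical and the sketch returns no p-value.

```python
import math


def welch_t(a, b):
    """Return (t, dof) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least 2 observations")
    mean_a = sum(a) / na
    mean_b = sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)  # sample variance
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    se2_a = var_a / na  # squared standard error of each sample mean
    se2_b = var_b / nb
    if se2_a + se2_b == 0:
        raise ValueError("both samples are constant")
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation; generally not an integer.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, dof
```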

ANOVA: Computes F-statistic = MS_between / MS_within. Dof = (groups - 1) between groups and (N - groups) within groups, where N is the total number of observations.
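
A one-way ANOVA F-statistic can be computed as in the Python sketch below. The function name and return shape are assumptions for illustration, and no p-value is derived.

```python
def one_way_anova(groups):
    """Return (f_statistic, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need 2+ non-empty groups and more rows than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group sum of squares around each group's own mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    if ss_within == 0:
        raise ValueError("F is undefined when within-group variation is zero")
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within
```

For the groups [1, 2, 3], [2, 3, 4], and [3, 4, 5], both sums of squares equal 6, so MS_between = 3, MS_within = 1, and F = 3.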

Chi-square: Computes chi² = sum((O - E)² / E) over all categories, where O is the observed and E the expected frequency.
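
The chi-square statistic is a direct sum, as the Python sketch below shows. The function name and the input checks are assumptions made for illustration.

```python
def chi_square_statistic(observed, expected):
    """Return sum((O - E)^2 / E) over paired observed/expected counts."""
    if len(observed) != len(expected) or len(observed) == 0:
        raise ValueError("observed and expected need the same non-zero length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected frequencies must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

With observed counts [10, 20, 30] and expected counts [20, 20, 20], the terms are 5, 0, and 5, giving a statistic of 10.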

Dependencies

External Libraries

simple_math: For sqrt function (one method call)

ISE Eiffel Base Library

ARRAY for data storage
REAL_64 for numeric operations
HASH_TABLE for frequency counting in mode()
ARRAYED_LIST for data cleaning

No external numerical libraries or scientific frameworks - everything is implemented from first principles.

Future Enhancements

Phase 2: Distribution CDFs

Implement proper p-value computation for hypothesis tests:

t-distribution CDF for t-tests
Chi-square distribution CDF
F-distribution CDF for ANOVA

Phase 3: Advanced Statistics

Multiple regression
Logistic regression
Non-parametric tests (Mann-Whitney U, Kruskal-Wallis, etc.)
Confidence intervals

Phase 4: Matrix Operations

Matrix algebra for multivariate analysis
Principal component analysis (PCA)
Factor analysis