simple_statistics

Architecture & Design

Design Principles

Facade Pattern

STATISTICS is the main facade class that coordinates all statistical operations. This provides a single entry point and simplifies the API.

Design by Contract

Every feature includes formal contracts in the Eiffel sense: preconditions (require) and postconditions (ensure).

Immutable Results

TEST_RESULT and REGRESSION_RESULT are immutable. Once created, their values cannot change. This ensures thread safety and simplifies reasoning about code.
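The immutability idea can be illustrated outside Eiffel. The following is a minimal Python sketch, not the library's implementation; the class name HypothesisTestResult and its field names are hypothetical stand-ins for TEST_RESULT.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class HypothesisTestResult:
    """Hypothetical analogue of TEST_RESULT: fields are fixed at creation."""
    statistic: float
    p_value: float
    degrees_of_freedom: float


result = HypothesisTestResult(statistic=2.1, p_value=0.04, degrees_of_freedom=9)

try:
    result.p_value = 0.5  # any attempt to mutate is rejected
except FrozenInstanceError:
    mutation_rejected = True
```

Because no field can change after construction, the object can be passed between callers without defensive copying.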

Numerical Stability

All algorithms use numerically stable variants, for example Welford's online mean and Kahan summation, both described under Key Algorithms.

Class Structure

STATISTICS (Facade)

Main entry point. Stateless - provides all statistical operations.

Methods:
- Descriptive Statistics: mean, median, variance, std_dev, etc.
- Bivariate Analysis: covariance, correlation
- Regression: linear_regression
- Hypothesis Testing: t_test_*, chi_square_test, anova

TEST_RESULT (Data)

Immutable result object from hypothesis tests. Contains test statistic, p-value, and degrees of freedom.

REGRESSION_RESULT (Data)

Immutable regression output. Contains slope, intercept, R², and prediction capability.

ASSUMPTION_CHECK (Data)

Documents validation of statistical test assumptions (normality, homogeneity of variance, etc.).

CLEANED_STATISTICS (Utility)

Data cleaning utility class. Removes NaN and infinite values from arrays.
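The cleaning step amounts to keeping only finite values. A minimal Python sketch, assuming plain lists of floats; the function name remove_non_finite is hypothetical and is not a feature of CLEANED_STATISTICS.

```python
import math


def remove_non_finite(values):
    """Return a new list without NaN, +inf, or -inf entries."""
    return [v for v in values if math.isfinite(v)]
```

For example, remove_non_finite([1.0, float("nan"), 2.0, float("inf")]) returns [1.0, 2.0].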

Key Algorithms

Mean Computation

Welford's Online Mean

Computes the mean in a single pass while maintaining numerical stability:

mean = 0, n = 0
for each x in data:
    n = n + 1
    mean = mean + (x - mean) / n
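The update rule above can be sketched in Python as follows. This is an illustration under stated assumptions, not the library's Eiffel code; the function name running_mean and the choice to reject empty input are hypothetical.

```python
def running_mean(data):
    """Single-pass mean using the incremental update mean += (x - mean) / n."""
    mean = 0.0
    n = 0
    for x in data:
        n += 1
        mean += (x - mean) / n
    if n == 0:
        raise ValueError("mean of empty data is undefined")
    return mean
```

Because each step adds a small correction to the current estimate rather than accumulating a large running total, the intermediate values stay close to the magnitude of the data.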

Variance Computation

Two-Pass Variance

First pass: compute mean. Second pass: compute sum of squared deviations.

sum_sq_dev = 0
for each x in data:
    sum_sq_dev += (x - mean)^2
variance = sum_sq_dev / n        (population variance; divide by n - 1 for the sample variance)
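A minimal Python sketch of the two passes. The function name two_pass_variance and the sample flag are hypothetical; the source does not state whether the library divides by n or n - 1, so both are shown.

```python
def two_pass_variance(data, sample=False):
    """Two-pass variance: first the mean, then the squared deviations."""
    n = len(data)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough data points")
    mean = sum(data) / n                              # first pass
    sum_sq_dev = sum((x - mean) ** 2 for x in data)   # second pass
    divisor = n - 1 if sample else n                  # assumption: caller chooses
    return sum_sq_dev / divisor
```

For the data [2, 4, 4, 4, 5, 5, 7, 9] the mean is 5 and the sum of squared deviations is 32, so the population variance is 4.0.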

Sum Computation

Kahan Summation

Compensated summation that tracks and corrects for floating-point rounding errors:

sum = 0, compensation = 0
for each x in data:
    y = x - compensation
    temp = sum + y
    compensation = (temp - sum) - y
    sum = temp
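For reference, the standard form of Kahan summation in Python. This is a sketch, not the library's Eiffel code, and the function name kahan_sum is hypothetical. The compensation term must be computed from the new and old running totals, so the total is updated last.

```python
def kahan_sum(data):
    """Compensated summation: carry the rounding error of each addition forward."""
    total = 0.0
    compensation = 0.0
    for x in data:
        y = x - compensation            # apply the error carried from the last step
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was lost (algebraically zero)
        total = t
    return total
```

With the input [1.0] followed by ten copies of 1e-16, a plain left-to-right loop returns exactly 1.0 because each small term is rounded away, while kahan_sum returns a value close to 1.000000000000001.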

Percentile Computation

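R-7 Linear Interpolation

After sorting, with zero-based indexing: h = (p/100) * (n-1), then interpolate between the values at floor(h) and ceil(h). This is the Hyndman-Fan type 7 definition (the default in R and NumPy); the NIST handbook's primary method uses p * (n+1) instead.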
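The interpolation can be sketched in Python as below, assuming zero-based indexing into the sorted data and p given on a 0 to 100 scale. The function name percentile_r7 is hypothetical.

```python
import math


def percentile_r7(data, p):
    """Percentile by linear interpolation at h = (p/100) * (n - 1)."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be in [0, 100]")
    xs = sorted(data)
    h = (p / 100) * (len(xs) - 1)
    lo = math.floor(h)
    hi = math.ceil(h)
    # When h is an integer, lo == hi and the second term vanishes.
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])
```

For example, percentile_r7([1, 2, 3, 4], 50) gives h = 1.5 and returns 2.5, halfway between the second and third sorted values.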

Correlation

Pearson Correlation

r = cov(x,y) / (std(x) * std(y)), bounded to [-1, 1].
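A minimal Python sketch of the formula. The normalising factors in the covariance and standard deviations cancel, so it works directly with sums of deviations. The function name pearson_r and the explicit clamp are assumptions about how the bound might be enforced, not the library's code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation, clamped to [-1, 1] to absorb rounding error."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    r = sxy / math.sqrt(sxx * syy)
    return max(-1.0, min(1.0, r))
```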

Linear Regression

Ordinary Least Squares

Uses normal equations: slope = cov(x,y) / var(x), intercept = mean(y) - slope * mean(x).
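The closed-form fit can be sketched in Python as follows. The function name fit_line is hypothetical, and the tuple return value is an illustration only; the library returns a REGRESSION_RESULT.

```python
def fit_line(xs, ys):
    """Simple linear regression: slope = cov(x, y) / var(x)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x has zero variance; slope is undefined")
    slope = sxy / sxx  # the 1/n (or 1/(n-1)) factors cancel
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For points lying exactly on y = 2x + 1, such as x = [0, 1, 2, 3], the fit returns a slope of 2.0 and an intercept of 1.0.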

Hypothesis Tests

t-test: Computes t-statistic = (mean - mu) / std_err. Dof = n - 1 for the one-sample test; the two-sample test uses the Welch-Satterthwaite approximation.
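A Python sketch of the one-sample statistic and the Welch-Satterthwaite degrees of freedom. The function names are hypothetical, and the p-value is deliberately left out.

```python
import math


def sample_variance(data):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least 2 points")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def one_sample_t(data, mu):
    """Return (t, dof) for H0: population mean == mu."""
    n = len(data)
    std_err = math.sqrt(sample_variance(data) / n)
    if std_err == 0:
        raise ValueError("t is undefined for constant data")
    return (sum(data) / n - mu) / std_err, n - 1


def welch_dof(a, b):
    """Welch-Satterthwaite approximation for two samples with unequal variances."""
    va, vb = sample_variance(a) / len(a), sample_variance(b) / len(b)
    return (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
```

When both samples have the same size n and the same variance, welch_dof reduces to 2 * (n - 1), the pooled-variance value.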

ANOVA: Computes F-statistic = MS_between / MS_within. Dof = groups - 1 (between groups) and N - groups (within groups), where N is the total number of observations.
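A minimal Python sketch of the one-way F-statistic, assuming each group is a non-empty list of numbers. The function name one_way_anova_f is hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in g)
    df_between = k - 1
    df_within = n_total - k
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within
```

For the groups [1, 2, 3], [2, 3, 4], and [3, 4, 5], both sums of squares equal 6, giving MS_between = 3, MS_within = 1, and F = 3.0 with (2, 6) degrees of freedom.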

Chi-square: Computes chi² = sum((O - E)² / E).
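The chi-square sum can be sketched in Python as below, assuming observed and expected counts are supplied as equal-length sequences. The function name chi_square_statistic is hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected) or not observed:
        raise ValueError("observed and expected must be non-empty and equal length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For observed counts [10, 20, 30] against expected counts [20, 20, 20], the statistic is 100/20 + 0 + 100/20 = 10.0.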

Dependencies

External Libraries

ISE Eiffel Base Library

No external numerical libraries or scientific frameworks - everything is implemented from first principles.

Future Enhancements

Phase 2: Distribution CDFs

Implement proper p-value computation for hypothesis tests.

Phase 3: Advanced Statistics

Phase 4: Matrix Operations