Design Principles
Facade Pattern
STATISTICS is the main facade class that coordinates all statistical operations. This provides a single entry point and simplifies the API.
Design by Contract
Every feature includes formal contracts:
- Preconditions (require): What callers must guarantee (non-empty arrays, matching lengths, etc.)
- Postconditions (ensure): What the implementation guarantees (range constraints, ordering, etc.)
- Invariants: Properties that hold throughout object lifetime
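As an illustration of the require/ensure idea, here is a minimal Python sketch. It is not the library's Eiffel code: the helpers `require` and `ensure`, the `ContractViolation` exception, and the function `median_with_contracts` are hypothetical names chosen for this example.

```python
class ContractViolation(Exception):
    """Raised when a precondition or postcondition does not hold."""


def require(condition, message):
    # Precondition: the caller's obligation.
    if not condition:
        raise ContractViolation("require: " + message)


def ensure(condition, message):
    # Postcondition: the implementation's obligation.
    if not condition:
        raise ContractViolation("ensure: " + message)


def median_with_contracts(data):
    require(len(data) > 0, "data is non-empty")
    xs = sorted(data)
    mid = len(xs) // 2
    if len(xs) % 2 == 1:
        result = xs[mid]
    else:
        result = (xs[mid - 1] + xs[mid]) / 2
    # Range constraint: the median lies between the minimum and the maximum.
    ensure(xs[0] <= result <= xs[-1], "result is within [min, max]")
    return result
```

In Eiffel the same checks would live in the feature's require and ensure clauses rather than in helper calls.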
Immutable Results
TEST_RESULT and REGRESSION_RESULT are immutable. Once created, their values cannot change. This makes them safe to share between threads and simplifies reasoning about code.
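A rough Python analogue of an immutable result object is a frozen dataclass. The class name `HypothesisTestResult` and its field names are assumptions for this sketch, based on the contents listed for TEST_RESULT (test statistic, p-value, degrees of freedom); the Eiffel classes may expose different names.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class HypothesisTestResult:
    """Immutable value object: fields are set once, at construction."""
    statistic: float
    p_value: float
    degrees_of_freedom: float


result = HypothesisTestResult(statistic=2.1, p_value=0.04, degrees_of_freedom=9.0)

try:
    result.p_value = 0.5  # any attempt to mutate is rejected
except FrozenInstanceError:
    mutation_rejected = True
else:
    mutation_rejected = False
```

Because no field can change after construction, a result can be handed to several consumers without defensive copying.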
Numerical Stability
All algorithms use numerically stable variants:
- Welford's algorithm: For mean and variance - one-pass updates that avoid the catastrophic cancellation of the naive sum-of-squares formula
- Kahan summation: For sum - compensated addition to reduce floating-point error
- Linear interpolation: For percentiles - interpolates between the two nearest order statistics
Class Structure
STATISTICS (Facade)
Main entry point. Stateless - provides all statistical operations.
Methods:
- Descriptive Statistics: mean, median, variance, std_dev, etc.
- Bivariate Analysis: covariance, correlation
- Regression: linear_regression
- Hypothesis Testing: t_test_*, chi_square_test, anova
TEST_RESULT (Data)
Immutable result object from hypothesis tests. Contains test statistic, p-value, and degrees of freedom.
REGRESSION_RESULT (Data)
Immutable regression output. Contains slope, intercept, R², and prediction capability.
ASSUMPTION_CHECK (Data)
Documents validation of statistical test assumptions (normality, homogeneity of variance, etc.).
CLEANED_STATISTICS (Utility)
Data cleaning utility class. Removes NaN and infinite values from arrays.
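The cleaning step amounts to a filter on finiteness. A minimal Python sketch, with `remove_non_finite` as a hypothetical name (not the Eiffel feature name):

```python
import math


def remove_non_finite(data):
    """Return a new list without NaN, +inf, or -inf; the input is left untouched."""
    return [x for x in data if math.isfinite(x)]
```

For example, `remove_non_finite([1.0, float("nan"), 2.0, float("inf")])` keeps only `1.0` and `2.0`.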
Key Algorithms
Mean Computation
Welford's Online Mean
Computes the mean in a single pass while maintaining numerical stability:
n = 0
mean = 0
for each x in data:
    n = n + 1
    mean = mean + (x - mean) / n
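The update rule above translates directly into runnable code. This is a sketch in Python for illustration; `running_mean` is a hypothetical name, and the empty-input check stands in for the precondition the library would state as a contract.

```python
def running_mean(data):
    """Single-pass mean using the update mean += (x - mean) / n."""
    n = 0
    mean = 0.0
    for x in data:
        n += 1
        mean += (x - mean) / n
    if n == 0:
        raise ValueError("data must be non-empty")
    return mean
```

Because each step adds a small correction to the current mean instead of growing one large sum, intermediate values stay on the scale of the data.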
Variance Computation
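Two-Pass Variance
First pass: compute the mean. Second pass: accumulate the squared deviations from it. (This is the textbook two-pass method; Welford's algorithm proper is a one-pass update.)
sum_sq_dev = 0
for each x in data:
    sum_sq_dev += (x - mean)^2
variance = sum_sq_dev / n
Dividing by n gives the population variance; the sample variance divides by n - 1 instead.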
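For comparison, the sketch below shows the two-pass computation next to Welford's one-pass update, which yields the same value without a separate pass for the mean. The function names and the `sample` flag are assumptions for this example, not part of the library's API.

```python
def two_pass_variance(data, sample=False):
    """Pass 1: mean. Pass 2: sum of squared deviations from that mean."""
    n = len(data)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough data points")
    mean = sum(data) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in data)
    return sum_sq_dev / (n - 1 if sample else n)


def welford_variance(data, sample=False):
    """One pass: update the mean and the sum of squared deviations (m2) together."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the updated mean
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough data points")
    return m2 / (n - 1 if sample else n)
```

Both avoid the naive formula E[x^2] - E[x]^2, which loses precision when the mean is large relative to the spread.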
Sum Computation
Kahan Summation
Compensated summation that tracks and corrects for floating-point rounding errors:
sum = 0, compensation = 0
for each x in data:
    y = x - compensation
    temp = sum + y
    compensation = (temp - sum) - y
    sum = temp
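The standard formulation of Kahan summation can be sketched in Python as follows. `kahan_sum` is a hypothetical name; the variable names follow the usual textbook presentation rather than the library's source.

```python
def kahan_sum(data):
    """Compensated summation: carry the rounding error of each addition forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in data:
        y = x - compensation            # re-inject the previously lost part
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total
```

The effect is visible when many small terms are added to a large one: a plain loop over `[1.0] + [1e-16] * 10000` never moves off `1.0`, while the compensated sum accumulates the small terms.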
Percentile Computation
R-7 Linear Interpolation (Hyndman-Fan Type 7)
After sorting (0-based indexing): h = (p/100) * (n-1), then interpolate linearly between the values at floor(h) and ceil(h), weighted by the fractional part of h.
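The rule h = (p/100) * (n-1) can be sketched in Python as below. `percentile_r7` is a hypothetical name, and the argument checks stand in for the library's preconditions.

```python
import math


def percentile_r7(data, p):
    """Percentile by linear interpolation at index h = (p/100) * (n - 1), 0-based."""
    if len(data) == 0:
        raise ValueError("data must be non-empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be in [0, 100]")
    xs = sorted(data)
    h = (p / 100) * (len(xs) - 1)
    lo = math.floor(h)
    hi = math.ceil(h)
    # When h is an integer, lo == hi and the weight (h - lo) is zero.
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])
```

With this rule, p = 0 returns the minimum, p = 100 returns the maximum, and p = 50 on `[1, 2, 3, 4]` falls halfway between 2 and 3.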
Correlation
Pearson Correlation
r = cov(x,y) / (std(x) * std(y)), bounded to [-1, 1].
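A minimal Python sketch of this formula follows. `pearson_r` is a hypothetical name. The normalising factors of cov and std cancel, so the sketch works with raw sums of products; the final clamp to [-1, 1] is an assumption about how the bound is enforced against rounding error.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation from centred sums of products."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    r = sxy / math.sqrt(sxx * syy)
    # Rounding can push r marginally outside the mathematical bound.
    return max(-1.0, min(1.0, r))
```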
Linear Regression
Ordinary Least Squares
Uses normal equations: slope = cov(x,y) / var(x), intercept = mean(y) - slope * mean(x).
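The two formulas can be sketched in Python as follows. `simple_ols` is a hypothetical name, and returning a plain tuple is a simplification; the library returns a REGRESSION_RESULT.

```python
def simple_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    slope = sxy / sxx  # equals cov(x, y) / var(x); the 1/n factors cancel
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For points lying exactly on y = 1 + 2x, such as x = 0, 1, 2, 3, the fit recovers slope 2 and intercept 1.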
Hypothesis Tests
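t-test: Computes t-statistic = (mean - mu) / std_err. Dof = n - 1 for the one-sample test, or the Welch-Satterthwaite approximation for the two-sample test.
ANOVA: Computes F-statistic = MS_between / MS_within. Dof = (groups - 1) between groups and (N - groups) within groups, where N is the total number of observations.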
Chi-square: Computes chi² = sum((O - E)² / E).
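The three statistics can be sketched in Python as below. All function names are hypothetical. The sketch assumes an unequal-variance (Welch) two-sample t-test and a one-way ANOVA, and it computes only the statistics and degrees of freedom, not p-values.

```python
import math


def welch_t(a, b):
    """Two-sample t statistic with Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least 2 points")
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    sa, sb = var_a / na, var_b / nb  # squared standard errors of the two means
    if sa + sb == 0:
        raise ValueError("both samples are constant")
    t = (mean_a - mean_b) / math.sqrt(sa + sb)
    dof = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, dof


def one_way_anova_f(groups):
    """Return (F, dof_between, dof_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need at least 2 non-empty groups and N > groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variation")
    dof_between, dof_within = k - 1, n_total - k
    f = (ss_between / dof_between) / (ss_within / dof_within)
    return f, dof_between, dof_within


def chi_square_statistic(observed, expected):
    """chi^2 = sum((O - E)^2 / E) over matching categories."""
    if len(observed) != len(expected) or len(observed) == 0:
        raise ValueError("observed and expected must be non-empty and equal length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```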
Dependencies
External Libraries
- simple_math: For sqrt function (one method call)
ISE Eiffel Base Library
- ARRAY for data storage
- REAL_64 for numeric operations
- HASH_TABLE for frequency counting in mode()
- ARRAYED_LIST for data cleaning
No external numerical libraries or scientific frameworks - everything is implemented from first principles.
Future Enhancements
Phase 2: Distribution CDFs
Implement proper p-value computation for hypothesis tests:
- t-distribution CDF for t-tests
- Chi-square distribution CDF
- F-distribution CDF for ANOVA
Phase 3: Advanced Statistics
- Multiple regression
- Logistic regression
- Non-parametric tests (Mann-Whitney U, Kruskal-Wallis, etc.)
- Confidence intervals
Phase 4: Matrix Operations
- Matrix algebra for multivariate analysis
- Principal component analysis (PCA)
- Factor analysis