simple_statistics

Code Recipes & Patterns

Code Recipes

Recipe 1: Outlier Detection

Detect values more than 3 standard deviations from the mean:

local
    stats: STATISTICS
    data: ARRAY [REAL_64]
    mean_val, std_val: REAL_64
    i: INTEGER
do
    create stats.make
    mean_val := stats.mean (data)
    std_val := stats.std_dev (data)

    from i := data.lower until i > data.upper loop
        if (data[i] - mean_val).abs > 3 * std_val then
            print ("Outlier detected at index " + i.out + ": " + data[i].out + "%N")
        end
        i := i + 1
    end
end

Recipe 2: Group Comparison

Compare two groups and report if they differ significantly:

local
    stats: STATISTICS
    control, treatment: ARRAY [REAL_64]
    result: TEST_RESULT
do
    create stats.make
    result := stats.t_test_two_sample (control, treatment)

    print ("Control mean: " + stats.mean (control).out + "%N")
    print ("Treatment mean: " + stats.mean (treatment).out + "%N")
    print ("t-statistic: " + result.statistic.out + "%N")

    if result.is_significant (0.05) then
        print ("RESULT: Treatment has significant effect (p < 0.05)%N")
    else
        print ("RESULT: No significant difference (p >= 0.05)%N")
    end
end

Recipe 3: Relationship Strength

Measure how strongly two variables are related:

local
    stats: STATISTICS
    var1, var2: ARRAY [REAL_64]
    corr: REAL_64
do
    create stats.make
    corr := stats.correlation (var1, var2)

    if corr > 0.9 then
        print ("Very strong positive relationship%N")
    elseif corr > 0.7 then
        print ("Strong positive relationship%N")
    elseif corr > 0.5 then
        print ("Moderate positive relationship%N")
    elseif corr > 0.3 then
        print ("Weak positive relationship%N")
    elseif corr > -0.3 then
        print ("Little or no relationship%N")
    elseif corr > -0.5 then
        print ("Weak negative relationship%N")
    elseif corr > -0.7 then
        print ("Moderate negative relationship%N")
    elseif corr > -0.9 then
        print ("Strong negative relationship%N")
    else
        print ("Very strong negative relationship%N")
    end
end

Recipe 4: Prediction from Model

Build a model and make predictions for new data:

local
    stats: STATISTICS
    x_training, y_training: ARRAY [REAL_64]
    x_new: REAL_64
    result: REGRESSION_RESULT
    y_predicted: REAL_64
do
    create stats.make

    -- Build regression model on training data
    result := stats.linear_regression (x_training, y_training)

    -- Print model
    print ("Model: y = " + result.slope.out + " * x + " + result.intercept.out + "%N")
    print ("R-squared: " + result.r_squared.out + "%N")

    -- Make prediction for new x value
    x_new := 42.0
    y_predicted := result.predict (x_new)
    print ("Prediction for x=" + x_new.out + ": y=" + y_predicted.out + "%N")
end

Recipe 5: Data Quality Assessment

Assess and clean data with issues:

local
    stats: STATISTICS
    clean: CLEANED_STATISTICS
    raw_data, clean_data: ARRAY [REAL_64]
do
    create stats.make
    create clean.make

    -- Assess raw data
    print ("Original data size: " + raw_data.count.out + "%N")
    if clean.has_nan (raw_data) then
        print ("WARNING: Data contains NaN values%N")
    end
    if clean.has_infinite (raw_data) then
        print ("WARNING: Data contains infinite values%N")
    end

    -- Clean data
    clean_data := clean.clean (raw_data)
    print ("Cleaned data size: " + clean_data.count.out + "%N")
    print ("Removed " + (raw_data.count - clean_data.count).out + " invalid entries%N")

    -- Proceed with analysis
    if clean_data.count >= 2 then
        print ("Mean of clean data: " + stats.mean (clean_data).out + "%N")
    end
end

Design Patterns

Pattern 1: Exploratory Data Analysis (EDA)

Quick summary of a dataset:

local
    stats: STATISTICS
    data: ARRAY [REAL_64]
do
    create stats.make

    print ("=== Data Summary ===%N")
    print ("Count: " + data.count.out + "%N")
    print ("Min: " + stats.min_value (data).out + "%N")
    print ("Q1: " + stats.quartiles (data)[1].out + "%N")
    print ("Median: " + stats.median (data).out + "%N")
    print ("Mean: " + stats.mean (data).out + "%N")
    print ("Q3: " + stats.quartiles (data)[3].out + "%N")
    print ("Max: " + stats.max_value (data).out + "%N")
    print ("Std Dev: " + stats.std_dev (data).out + "%N")
end

Pattern 2: Hypothesis Testing Pipeline

Standardized workflow for statistical testing:

1. Formulate hypothesis (null and alternative)
2. Collect data and choose alpha level (e.g., 0.05)
3. Check assumptions (normality, equal variance)
4. Run appropriate test
5. Interpret results based on p-value
6. Report conclusions
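
The testing and interpretation steps (4 and 5) map directly onto the features shown in Recipe 2. A minimal sketch (recall that in v1.0 the p-values are placeholders):

```eiffel
run_pipeline_test (group_a, group_b: ARRAY [REAL_64])
        -- Steps 4-5: run a two-sample t-test at alpha = 0.05
        -- and interpret the result.
    local
        stats: STATISTICS
        result: TEST_RESULT
    do
        create stats.make
        result := stats.t_test_two_sample (group_a, group_b)
        if result.is_significant (0.05) then
            print ("Reject the null hypothesis (p < 0.05)%N")
        else
            print ("Fail to reject the null hypothesis (p >= 0.05)%N")
        end
    end
```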

Pattern 3: Model Validation

Use train/test split for regression:

-- Split data into training (80%) and test (20%)
-- Train model on training set
-- Evaluate R-squared on test set
-- If test R-squared is high (close to the training value), the model generalizes well
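
A sketch of this workflow, assuming 1-based arrays and using only the features shown in Recipe 4 (linear_regression, predict); the held-out R-squared is computed by hand from the model's predictions:

```eiffel
validate_model (x, y: ARRAY [REAL_64])
        -- Train on the first 80% of the data, evaluate on the rest.
    local
        stats: STATISTICS
        x_train, y_train: ARRAY [REAL_64]
        model: REGRESSION_RESULT
        split, i: INTEGER
        ss_res, ss_tot, y_mean, r_squared_test: REAL_64
    do
        create stats.make
        split := (x.count * 8) // 10

        -- Copy the first `split` elements into the training arrays
        create x_train.make_filled (0.0, 1, split)
        create y_train.make_filled (0.0, 1, split)
        from i := 1 until i > split loop
            x_train [i] := x [i]
            y_train [i] := y [i]
            i := i + 1
        end

        model := stats.linear_regression (x_train, y_train)

        -- Mean of the held-out y values
        from i := split + 1 until i > x.count loop
            y_mean := y_mean + y [i]
            i := i + 1
        end
        y_mean := y_mean / (x.count - split)

        -- Test R-squared = 1 - SS_res / SS_tot on the held-out 20%
        from i := split + 1 until i > x.count loop
            ss_res := ss_res + (y [i] - model.predict (x [i])) ^ 2
            ss_tot := ss_tot + (y [i] - y_mean) ^ 2
            i := i + 1
        end
        r_squared_test := 1.0 - ss_res / ss_tot
        print ("Test R-squared: " + r_squared_test.out + "%N")
    end
```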

Pattern 4: Multiple Comparisons

When comparing many groups, use ANOVA instead of multiple t-tests to control Type I error.
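
The reason to avoid repeated t-tests: with k independent tests at alpha = 0.05, the probability of at least one false positive is 1 - 0.95^k. A small helper to illustrate the inflation:

```eiffel
print_familywise_error (k: INTEGER)
        -- Probability of at least one Type I error across `k`
        -- independent tests, each at alpha = 0.05.
        -- e.g. k = 10 gives about 0.40, far above 0.05.
    local
        p: REAL_64
    do
        p := 1.0 - (0.95 ^ k.to_double)
        print ("Tests: " + k.out + ", familywise error: " + p.out + "%N")
    end
```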

Troubleshooting

Q: I get a precondition violation

A: Check preconditions before calling features. For example, mean requires non-empty data, so verify data.count > 0 before calling it.
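
For example, guard the call site:

```eiffel
if data.count > 0 then
    print ("Mean: " + stats.mean (data).out + "%N")
else
    print ("Cannot compute mean: data is empty%N")
end
```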

Q: My correlation is NaN

A: This happens when variance is zero (all values identical). Check your data and handle this edge case.
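
A guard built from std_dev (which Recipe 1 already uses) sidesteps the NaN:

```eiffel
if stats.std_dev (var1) > 0.0 and stats.std_dev (var2) > 0.0 then
    corr := stats.correlation (var1, var2)
    print ("Correlation: " + corr.out + "%N")
else
    print ("Correlation undefined: one variable is constant%N")
end
```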

Q: R-squared is negative

A: This shouldn't happen in v1.0 - it's clamped to [0, 1]. If you see it, report a bug.

Q: P-values are always 0.5

A: Yes - in v1.0, p-values are placeholders. This will be fixed when distribution CDFs are implemented.

Q: Data cleaning lost too much data

A: Use remove_nan and remove_infinite separately to see which values are problematic. Investigate why data has NaN/infinite values.
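
A diagnostic sketch; it assumes remove_nan and remove_infinite take and return an ARRAY [REAL_64], the same shape as clean in Recipe 5:

```eiffel
local
    clean: CLEANED_STATISTICS
    without_nan, without_inf: ARRAY [REAL_64]
do
    create clean.make

    -- Count each kind of problem value separately
    without_nan := clean.remove_nan (raw_data)
    without_inf := clean.remove_infinite (raw_data)
    print ("NaN entries: " + (raw_data.count - without_nan.count).out + "%N")
    print ("Infinite entries: " + (raw_data.count - without_inf.count).out + "%N")
end
```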