OLSengine: Assisted Simplicity Tutorial

1. Introduction

The philosophy of Assisted Simplicity aims to bridge the gap between complex computation and methodological rigor. OLSengine provides six estimation engines in a single, unified interface: OLS regression, ANOVA/t-tests, logistic regression, panel data, instrumental variables (2SLS), and difference-in-differences.

All methods are implemented in pure base R with integrated Methodological Customs that audit assumptions and guide researchers toward robust alternatives.


2. OLS Regression: Handling Heteroskedasticity

In social sciences, variance often increases with the scale of the predictor. Let’s see how OLSengine detects and handles this.

library(OLSengine)

# Simulate data with non-constant variance
set.seed(123)
n <- 200
x <- rnorm(n, 50, 10)
y <- 10 + 0.5 * x + rnorm(n, 0, x * 0.2) # Heteroskedasticity

df <- data.frame(y, x)

# Run the engine
model <- paper_engine(y ~ x, data = df, model = "ols")
model$messages
## [1] "CRITICAL WARNING: Heteroskedasticity detected (Breusch-Pagan p < .05). Reported classic standard errors are biased. It is highly recommended to run the model with 'robust = TRUE' or 'robust = \"auto\"'."

Correcting with Robust Standard Errors

Following the Aduana’s advice, we apply HC3 robust standard errors:

model_robust <- paper_engine(y ~ x, data = df, model = "ols", robust = TRUE)
model_robust$tables$Table2_OLS_Estimation
##               Predictor    B   SE    t      p CI_95_Low CI_95_High   f2
## (Intercept) (Intercept) 9.70 4.28 2.27   0.02      1.26      18.14   NA
## x                     x 0.51 0.09 5.72 < .001      0.34       0.69 0.23
plot_engine(model_robust)


3. ANOVA/t-tests: Experimental Comparisons

In experimental research, we often compare groups. OLSengine automatically checks normality and variance homogeneity.

# Simulating 3 groups with non-normal distribution
set.seed(789)
group_data <- data.frame(
  score = c(rgamma(30, 2, 0.5), rgamma(30, 5, 0.5), rgamma(30, 3, 0.5)),
  group = rep(c("Control", "Treatment A", "Treatment B"), each = 30)
)

# Auto-pilot switches to non-parametric if normality fails
model_anova <- paper_engine(score ~ group, data = group_data, 
                            model = "anova", non_parametric = "auto")

model_anova$tables$Table4_Mean_Differences
##             Test     Metric Statistic df      p   Effect_Metric Effect_Size
## 1 Kruskal-Wallis Chi-square     42.93  2 < .001 Eta-squared (H)        0.47
plot_engine(model_anova)


4. Logistic Regression: Binary Outcomes

When the outcome is binary (e.g., Success/Failure), the engine calculates Odds Ratios, McFadden’s and Nagelkerke’s Pseudo R², and classification accuracy.

# Simulating binary data
set.seed(101)
n_logit <- 100
age <- rnorm(n_logit, 40, 10)
passed <- rbinom(n_logit, 1, plogis(-5 + 0.12 * age))

logit_df <- data.frame(passed, age)

# Run Logit engine
model_logit <- paper_engine(passed ~ age, data = logit_df, model = "logit")

model_logit$tables$Table2_Logit_Estimation
##               Predictor     B   OR   SE     z      p OR_CI_95_Low OR_CI_95_High
## (Intercept) (Intercept) -5.97 0.00 1.35 -4.42 < .001         0.00          0.04
## age                 age  0.14 1.15 0.03  4.41 < .001         1.08          1.23
model_logit$messages
## [1] "INFO: Classification Accuracy (Threshold 0.5) = 72.0%"          
## [2] "INFO: McFadden Pseudo R² = 0.207 | Nagelkerke Pseudo R² = 0.332"
plot_engine(model_logit)


5. Panel Data: Fixed and Random Effects

Panel data tracks the same entities over time. The Hausman test automatically selects between Fixed Effects (FE) and Random Effects (RE).

# Simulate panel data: 50 workers observed over 3 years
set.seed(456)
n_entities <- 50
n_time <- 3

panel_df <- data.frame(
  worker_id = rep(1:n_entities, each = n_time),
  year = rep(2018:2020, times = n_entities),
  wage = rnorm(n_entities * n_time, 50, 10) + 
         rep(rnorm(n_entities, 0, 5), each = n_time), # Entity fixed effect
  experience = rep(5:7, times = n_entities) + rnorm(n_entities * n_time, 0, 1)
)

# Hausman test decides between FE and RE
model_panel <- paper_engine(wage ~ experience, 
                            data = panel_df,
                            model = "panel",
                            entity_id = "worker_id",
                            time_id = "year",
                            method = "auto")

model_panel$tables[[1]]
##             Predictor    B   SE    t    p CI_95_Low CI_95_High
## experience experience 0.76 0.58 1.31 0.19     -0.39        1.9
model_panel$messages
## [1] "Customs Auto-Pilot: Hausman test supports random effects (p = 0.810). Random Effects estimator selected for efficiency gains (Wooldridge, 2010)."
## [2] "INFO: Panel structure - N entities = 50, T periods = 3, Total obs = 150"                                                                         
## [3] "INFO: R² (overall) = 0.000"                                                                                                                      
## [4] "INFO: Hausman test statistic = 0.06 (p = 0.810)"
plot_engine(model_panel)


6. Instrumental Variables: Addressing Endogeneity

When a predictor is correlated with the error term (endogeneity), Instrumental Variables (IV) provide consistent estimates using Two-Stage Least Squares (2SLS).

# Simulate IV data with endogeneity
set.seed(789)
n_iv <- 200

z <- rnorm(n_iv, 10, 3)  # Instrument
u <- rnorm(n_iv, 0, 2)   # Unobserved confounder

x <- 2 + 0.8 * z + u + rnorm(n_iv, sd = 1)  # Endogenous predictor
y <- 5 + 1.5 * x + u + rnorm(n_iv, sd = 3)  # Outcome

iv_df <- data.frame(y, x, z)

# 2SLS estimation
model_iv <- paper_engine(y ~ x, 
                         data = iv_df,
                         model = "iv",
                         instruments = ~ z)

model_iv$tables$Table2_IV_2SLS
##   Predictor   B  SE    t      p CI_95_Low CI_95_High
## x         x 1.4 0.1 13.7 < .001       1.2        1.6
model_iv$messages  # Reports first-stage F-stat
## [1] "INFO: First-stage F-statistic = 242.98 (> 10). Instruments pass relevance threshold (Stock & Yogo, 2005)."
## [2] "INFO: Exact identification (# instruments = # endogenous variables). Sargan test not applicable."         
## [3] "INFO: Sample size N = 200, Number of instruments = 1"                                                     
## [4] "INFO: R² = 0.724"
plot_engine(model_iv)


7. Difference-in-Differences: Causal Policy Evaluation

DiD estimates treatment effects by comparing changes over time between treated and control groups.

# Simulate policy intervention data
set.seed(321)
n_per_group <- 100

did_df <- data.frame(
  outcome = c(rnorm(n_per_group, 50, 10), rnorm(n_per_group, 53, 10),  # Control: pre/post
              rnorm(n_per_group, 50, 10), rnorm(n_per_group, 58, 10)), # Treated: pre/post
  group = rep(c("Control", "Treated"), each = n_per_group * 2),
  period = rep(c("Pre", "Post"), times = n_per_group * 2)
)

# DiD estimation
model_did <- paper_engine(outcome ~ 1, 
                          data = did_df,
                          model = "did",
                          treatment_var = "group",
                          time_var = "period",
                          treatment_level = "Treated",
                          post_level = "Post")

model_did$tables$Table2_DiD_Estimation
##                 Predictor     B   SE     t      p CI_95_Low CI_95_High
## (Intercept)   (Intercept) 51.08 1.04 49.10 < .001     49.04      53.13
## Treated           Treated  3.84 1.47  2.61   0.01      0.94       6.73
## Post                 Post  0.91 1.47  0.62   0.54     -1.99       3.80
## Treated:Post Treated:Post -0.26 2.08 -0.12   0.90     -4.35       3.83
model_did$tables$Group_Means
##     Group Time     Mean
## 1 Control  Pre 51.08468
## 2 Control Post 51.99048
## 3 Treated  Pre 54.92013
## 4 Treated Post 55.56740
model_did$messages  # Reports parallel trends test
## [1] "WARNING (Parallel Trends): Significant pre-treatment difference detected (p = 0.008). The parallel trends assumption may be violated. Consider including covariates or using alternative identification strategies (Roth et al., 2023)."
## [2] "INFO: DiD Effect (Treated:Post interaction) = -0.259 (SE = 2.081, p 0.90)"                                                                                                                                                              
## [3] "INFO: Sample size N = 400"                                                                                                                                                                                                              
## [4] "INFO: R² = 0.032"
plot_engine(model_did)


8. Real Data Example: Academic Salaries

The package includes a real dataset of 397 U.S. college professors.

data(academic_salaries)

# Explore salary determinants
salary_model <- paper_engine(salary ~ rank + discipline + years_since_phd + sex,
                             data = academic_salaries,
                             model = "ols",
                             robust = "auto")

salary_model$tables$Table2_OLS_Estimation
##                       Predictor        B      SE     t      p CI_95_Low
## (Intercept)         (Intercept) 67884.32 2784.30 24.38 < .001  62410.24
## rankAssocProf     rankAssocProf 13104.15 2243.19  5.84 < .001   8693.92
## rankProf               rankProf 46032.55 3359.53 13.70 < .001  39427.55
## disciplineB         disciplineB 13937.47 2372.64  5.87 < .001   9272.73
## years_since_phd years_since_phd    61.01  155.64  0.39   0.70   -244.99
## sexMale                 sexMale  4349.37 2423.53  1.79   0.07   -415.42
##                 CI_95_High   f2
## (Intercept)       73358.40   NA
## rankAssocProf     17514.38 0.03
## rankProf          52637.55 0.30
## disciplineB       18602.21 0.09
## years_since_phd     367.02 0.00
## sexMale            9114.15 0.00
salary_model$messages
## [1] "Customs Auto-Pilot: Heteroskedasticity detected (p < .05). Robust Standard Errors (HC3) were applied automatically (Hayes & Cai, 2007)."                             
## [2] "INFO (Multicollinearity): Maximum VIF = 3.07. Moderate collinearity detected but below critical threshold. Standard errors may be slightly inflated (O'Brien, 2007)."
## [3] "WARNING (Normality): p < .05 in Shapiro-Wilk. With N=397, OLS may be robust due to CLT (Lumley et al., 2002)."

Conclusion

OLSengine simplifies the transition from raw data to paper-ready results across six estimation methods. Every step is backed by rigorous methodological audits, ensuring researchers can focus on interpretation rather than implementation.

Key Features: - Zero external dependencies (pure base R) - Integrated diagnostics with actionable warnings - APA-formatted tables and publication-ready plots - Unified interface across OLS, ANOVA, Logit, Panel, IV, and DiD