The philosophy of Assisted Simplicity aims to bridge the gap between complex computation and methodological rigor. OLSengine provides six estimation engines in a single, unified interface: OLS regression, ANOVA/t-tests, logistic regression, panel data, instrumental variables (2SLS), and difference-in-differences.
All methods are implemented in pure base R with integrated Methodological Customs that audit assumptions and guide researchers toward robust alternatives.
In social sciences, variance often increases with the scale of the
predictor. Let’s see how OLSengine detects and handles
this.
library(OLSengine)
# Simulate data with non-constant variance
set.seed(123)
n <- 200
x <- rnorm(n, 50, 10)
y <- 10 + 0.5 * x + rnorm(n, 0, x * 0.2) # Heteroskedasticity
df <- data.frame(y, x)
# Run the engine
model <- paper_engine(y ~ x, data = df, model = "ols")
model$messages## [1] "CRITICAL WARNING: Heteroskedasticity detected (Breusch-Pagan p < .05). Reported classic standard errors are biased. It is highly recommended to run the model with 'robust = TRUE' or 'robust = \"auto\"'."
Following the Aduana’s advice, we apply HC3 robust standard errors:
model_robust <- paper_engine(y ~ x, data = df, model = "ols", robust = TRUE)
model_robust$tables$Table2_OLS_Estimation## Predictor B SE t p CI_95_Low CI_95_High f2
## (Intercept) (Intercept) 9.70 4.28 2.27 0.02 1.26 18.14 NA
## x x 0.51 0.09 5.72 < .001 0.34 0.69 0.23
In experimental research, we often compare groups.
OLSengine automatically checks normality and variance
homogeneity.
# Simulating 3 groups with non-normal distribution
set.seed(789)
group_data <- data.frame(
score = c(rgamma(30, 2, 0.5), rgamma(30, 5, 0.5), rgamma(30, 3, 0.5)),
group = rep(c("Control", "Treatment A", "Treatment B"), each = 30)
)
# Auto-pilot switches to non-parametric if normality fails
model_anova <- paper_engine(score ~ group, data = group_data,
model = "anova", non_parametric = "auto")
model_anova$tables$Table4_Mean_Differences## Test Metric Statistic df p Effect_Metric Effect_Size
## 1 Kruskal-Wallis Chi-square 42.93 2 < .001 Eta-squared (H) 0.47
When the outcome is binary (e.g., Success/Failure), the engine calculates Odds Ratios, McFadden’s and Nagelkerke’s Pseudo R², and classification accuracy.
# Simulating binary data
set.seed(101)
n_logit <- 100
age <- rnorm(n_logit, 40, 10)
passed <- rbinom(n_logit, 1, plogis(-5 + 0.12 * age))
logit_df <- data.frame(passed, age)
# Run Logit engine
model_logit <- paper_engine(passed ~ age, data = logit_df, model = "logit")
model_logit$tables$Table2_Logit_Estimation## Predictor B OR SE z p OR_CI_95_Low OR_CI_95_High
## (Intercept) (Intercept) -5.97 0.00 1.35 -4.42 < .001 0.00 0.04
## age age 0.14 1.15 0.03 4.41 < .001 1.08 1.23
## [1] "INFO: Classification Accuracy (Threshold 0.5) = 72.0%"
## [2] "INFO: McFadden Pseudo R² = 0.207 | Nagelkerke Pseudo R² = 0.332"
Panel data tracks the same entities over time. The Hausman test automatically selects between Fixed Effects (FE) and Random Effects (RE).
# Simulate panel data: 50 workers observed over 3 years
set.seed(456)
n_entities <- 50
n_time <- 3
panel_df <- data.frame(
worker_id = rep(1:n_entities, each = n_time),
year = rep(2018:2020, times = n_entities),
wage = rnorm(n_entities * n_time, 50, 10) +
rep(rnorm(n_entities, 0, 5), each = n_time), # Entity fixed effect
experience = rep(5:7, times = n_entities) + rnorm(n_entities * n_time, 0, 1)
)
# Hausman test decides between FE and RE
model_panel <- paper_engine(wage ~ experience,
data = panel_df,
model = "panel",
entity_id = "worker_id",
time_id = "year",
method = "auto")
model_panel$tables[[1]]## Predictor B SE t p CI_95_Low CI_95_High
## experience experience 0.76 0.58 1.31 0.19 -0.39 1.9
## [1] "Customs Auto-Pilot: Hausman test supports random effects (p = 0.810). Random Effects estimator selected for efficiency gains (Wooldridge, 2010)."
## [2] "INFO: Panel structure - N entities = 50, T periods = 3, Total obs = 150"
## [3] "INFO: R² (overall) = 0.000"
## [4] "INFO: Hausman test statistic = 0.06 (p = 0.810)"
When a predictor is correlated with the error term (endogeneity), Instrumental Variables (IV) provide consistent estimates using Two-Stage Least Squares (2SLS).
# Simulate IV data with endogeneity
set.seed(789)
n_iv <- 200
z <- rnorm(n_iv, 10, 3) # Instrument
u <- rnorm(n_iv, 0, 2) # Unobserved confounder
x <- 2 + 0.8 * z + u + rnorm(n_iv, sd = 1) # Endogenous predictor
y <- 5 + 1.5 * x + u + rnorm(n_iv, sd = 3) # Outcome
iv_df <- data.frame(y, x, z)
# 2SLS estimation
model_iv <- paper_engine(y ~ x,
data = iv_df,
model = "iv",
instruments = ~ z)
model_iv$tables$Table2_IV_2SLS## Predictor B SE t p CI_95_Low CI_95_High
## x x 1.4 0.1 13.7 < .001 1.2 1.6
## [1] "INFO: First-stage F-statistic = 242.98 (> 10). Instruments pass relevance threshold (Stock & Yogo, 2005)."
## [2] "INFO: Exact identification (# instruments = # endogenous variables). Sargan test not applicable."
## [3] "INFO: Sample size N = 200, Number of instruments = 1"
## [4] "INFO: R² = 0.724"
DiD estimates treatment effects by comparing changes over time between treated and control groups.
# Simulate policy intervention data
set.seed(321)
n_per_group <- 100
did_df <- data.frame(
outcome = c(rnorm(n_per_group, 50, 10), rnorm(n_per_group, 53, 10), # Control: pre/post
rnorm(n_per_group, 50, 10), rnorm(n_per_group, 58, 10)), # Treated: pre/post
group = rep(c("Control", "Treated"), each = n_per_group * 2),
period = rep(c("Pre", "Post"), times = n_per_group * 2)
)
# DiD estimation
model_did <- paper_engine(outcome ~ 1,
data = did_df,
model = "did",
treatment_var = "group",
time_var = "period",
treatment_level = "Treated",
post_level = "Post")
model_did$tables$Table2_DiD_Estimation## Predictor B SE t p CI_95_Low CI_95_High
## (Intercept) (Intercept) 51.08 1.04 49.10 < .001 49.04 53.13
## Treated Treated 3.84 1.47 2.61 0.01 0.94 6.73
## Post Post 0.91 1.47 0.62 0.54 -1.99 3.80
## Treated:Post Treated:Post -0.26 2.08 -0.12 0.90 -4.35 3.83
## Group Time Mean
## 1 Control Pre 51.08468
## 2 Control Post 51.99048
## 3 Treated Pre 54.92013
## 4 Treated Post 55.56740
## [1] "WARNING (Parallel Trends): Significant pre-treatment difference detected (p = 0.008). The parallel trends assumption may be violated. Consider including covariates or using alternative identification strategies (Roth et al., 2023)."
## [2] "INFO: DiD Effect (Treated:Post interaction) = -0.259 (SE = 2.081, p 0.90)"
## [3] "INFO: Sample size N = 400"
## [4] "INFO: R² = 0.032"
The package includes a real dataset of 397 U.S. college professors.
data(academic_salaries)
# Explore salary determinants
salary_model <- paper_engine(salary ~ rank + discipline + years_since_phd + sex,
data = academic_salaries,
model = "ols",
robust = "auto")
salary_model$tables$Table2_OLS_Estimation## Predictor B SE t p CI_95_Low
## (Intercept) (Intercept) 67884.32 2784.30 24.38 < .001 62410.24
## rankAssocProf rankAssocProf 13104.15 2243.19 5.84 < .001 8693.92
## rankProf rankProf 46032.55 3359.53 13.70 < .001 39427.55
## disciplineB disciplineB 13937.47 2372.64 5.87 < .001 9272.73
## years_since_phd years_since_phd 61.01 155.64 0.39 0.70 -244.99
## sexMale sexMale 4349.37 2423.53 1.79 0.07 -415.42
## CI_95_High f2
## (Intercept) 73358.40 NA
## rankAssocProf 17514.38 0.03
## rankProf 52637.55 0.30
## disciplineB 18602.21 0.09
## years_since_phd 367.02 0.00
## sexMale 9114.15 0.00
## [1] "Customs Auto-Pilot: Heteroskedasticity detected (p < .05). Robust Standard Errors (HC3) were applied automatically (Hayes & Cai, 2007)."
## [2] "INFO (Multicollinearity): Maximum VIF = 3.07. Moderate collinearity detected but below critical threshold. Standard errors may be slightly inflated (O'Brien, 2007)."
## [3] "WARNING (Normality): p < .05 in Shapiro-Wilk. With N=397, OLS may be robust due to CLT (Lumley et al., 2002)."
OLSengine simplifies the transition from raw data to
paper-ready results across six estimation methods. Every step is backed
by rigorous methodological audits, ensuring researchers can focus on
interpretation rather than implementation.
Key Features: - Zero external dependencies (pure base R) - Integrated diagnostics with actionable warnings - APA-formatted tables and publication-ready plots - Unified interface across OLS, ANOVA, Logit, Panel, IV, and DiD