Package 'OLSengine' reference manual

Title:	Transparent Linear and Causal Inference Models for Social Sciences
Description:	Unified estimation, diagnostics, and reporting for ordinary least squares (OLS) regression, ANOVA/t-tests, logistic regression, panel data (fixed/random effects with Hausman test), instrumental variables (2SLS with weak instrument diagnostics), and difference-in-differences. Designed for applied researchers in social sciences with integrated "Methodological Customs" that audit assumptions and provide literature references. All methods implemented in pure base R without external dependencies beyond stats and graphics packages.
Authors:	Manuel Soto-Pérez [aut, cre] (ORCID: <https://orcid.org/0000-0002-8249-7410>)
Maintainer:	Manuel Soto-Pérez <[email protected]>
License:	MIT + file LICENSE
Version:	1.1.0
Built:	2026-05-15 08:16:47 UTC
Source:	https://github.com/msoto-perez/olsengine

Academic Salaries Dataset for U.S. College Professors

Description

Real data on 9-month academic salaries for assistant professors, associate professors, and full professors at a U.S. college. This dataset is provided for educational purposes to demonstrate regression modeling, ANOVA, and logistic regression with paper_engine.

Usage

academic_salaries
academic_salaries

Format

A data frame with 397 observations and 7 variables:

rank: Factor with 3 levels: "AsstProf" (Assistant Professor), "AssocProf" (Associate Professor), "Prof" (Full Professor). Represents academic rank.
discipline: Factor with 2 levels: "A" (theoretical departments, e.g., mathematics, physics) and "B" (applied departments, e.g., engineering, business). Represents academic discipline category.
years_since_phd: Numeric. Number of years since the faculty member earned their PhD.
years_service: Numeric. Number of years the faculty member has served at this institution.
sex: Factor with 2 levels: "Female" and "Male".
salary: Numeric. Nine-month academic salary in U.S. dollars (2008-09 academic year).
high_earner: Integer. Binary indicator (0 = No, 1 = Yes) marking faculty in the top 33% of salaries. Created for logistic regression demonstrations.

Details

This dataset enables demonstration of OLSengine's three core methods:

OLS Regression: Modeling salary as a function of rank, discipline, experience, and sex to assess wage determinants and potential gender disparities.
ANOVA: Comparing mean salaries across academic ranks or disciplines.
Logistic Regression: Predicting the probability of being a high earner based on experience, rank, and discipline.

The data were collected in the 2008-09 academic year and reflect institutional salary structures at that time. Gender wage gap research in academia remains an active area of inquiry (Ginther & Kahn, 2021).

Source

This dataset is adapted from the Salaries dataset in the carData package (Fox & Weisberg, 2019), which was originally compiled for the textbook An R Companion to Applied Regression (Fox & Weisberg, 2011). The original data source is a U.S. college during the 2008-09 academic year.

Licensed under GPL (>= 2), consistent with the carData package license.

References

Fox, J., & Weisberg, S. (2011). An R Companion to Applied Regression (2nd ed.). Thousand Oaks, CA: Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/

Fox, J., & Weisberg, S. (2019). carData: Companion to Applied Regression Data Sets. R package version 3.0-3. https://CRAN.R-project.org/package=carData

Ginther, D. K., & Kahn, S. (2021). Women in academic science: A changing landscape. Psychological Science in the Public Interest, 22(1), 3-65.

Examples

# Load the dataset
data(academic_salaries)

# Explore structure
str(academic_salaries)
summary(academic_salaries)

# OLS: Modeling salary determinants
ols_model <- paper_engine(
  salary ~ rank + discipline + years_since_phd + sex,
  data = academic_salaries,
  model = "ols",
  robust = "auto"
)
print(ols_model$tables$Table2_OLS_Estimation)
print(ols_model$messages)

# ANOVA: Salary differences across academic ranks
anova_model <- paper_engine(
  salary ~ rank,
  data = academic_salaries,
  model = "anova"
)
print(anova_model$tables$Descriptive_Means)

# Logit: Predicting high earner status
logit_model <- paper_engine(
  high_earner ~ years_since_phd + rank + discipline,
  data = academic_salaries,
  model = "logit"
)
print(logit_model$tables$Table2_Logit_Estimation)

# Visualization
plot_engine(ols_model)

# Load the dataset
data(academic_salaries)

# Explore structure
str(academic_salaries)
summary(academic_salaries)

# OLS: Modeling salary determinants
ols_model <- paper_engine(
  salary ~ rank + discipline + years_since_phd + sex,
  data = academic_salaries,
  model = "ols",
  robust = "auto"
)
print(ols_model$tables$Table2_OLS_Estimation)
print(ols_model$messages)

# ANOVA: Salary differences across academic ranks
anova_model <- paper_engine(
  salary ~ rank,
  data = academic_salaries,
  model = "anova"
)
print(anova_model$tables$Descriptive_Means)

# Logit: Predicting high earner status
logit_model <- paper_engine(
  high_earner ~ years_since_phd + rank + discipline,
  data = academic_salaries,
  model = "logit"
)
print(logit_model$tables$Table2_Logit_Estimation)

# Visualization
plot_engine(ols_model)

Transparent and Assisted Linear Modeling Engine

Description

Estimates OLS regression, ANOVA/t-tests, binary logistic regression, panel data models, instrumental variables, or difference-in-differences using pure base R matrix algebra. Automatically audits statistical assumptions through an integrated methodological customs layer and returns publication-ready APA-formatted tables. Designed for applied researchers and early-career academics who need a single, transparent workflow from estimation to reporting.

Usage

paper_engine(
  formula,
  data,
  model = "ols",
  robust = FALSE,
  non_parametric = FALSE,
  paired = FALSE,
  entity_id = NULL,
  time_id = NULL,
  method = "auto",
  instruments = NULL,
  treatment_var = NULL,
  time_var = NULL,
  treatment_level = NULL,
  post_level = NULL,
  digits = 2
)
paper_engine(
  formula,
  data,
  model = "ols",
  robust = FALSE,
  non_parametric = FALSE,
  paired = FALSE,
  entity_id = NULL,
  time_id = NULL,
  method = "auto",
  instruments = NULL,
  treatment_var = NULL,
  time_var = NULL,
  treatment_level = NULL,
  post_level = NULL,
  digits = 2
)

Arguments

formula

A formula object specifying the model (e.g., y ~ x1 + x2).

data

A data frame containing all variables referenced in formula.

model

A character string indicating the estimation engine. One of "ols" (default), "anova", "logit", "panel", "iv", or "did".

robust

Logical or "auto". Controls heteroskedasticity-robust standard errors (HC3) for OLS models. If TRUE, HC3 SEs are always applied. If "auto", they are applied only when the Breusch-Pagan test detects heteroskedasticity (p < .05). Default is FALSE.

non_parametric

Logical or "auto". Controls non-parametric fallback for ANOVA/t-test models. If TRUE, Kruskal-Wallis or Wilcoxon tests are used. If "auto", transition occurs when Shapiro-Wilk detects non-normality (p < .05). Default is FALSE.

paired

Logical. If TRUE, assumes paired/dependent samples for ANOVA/t-test models (pre-post designs). Default is FALSE.

entity_id

Character string. Name of the entity/individual identifier variable for panel data models. Required when model = "panel".

time_id

Character string. Name of the time period identifier variable for panel data models. Required when model = "panel".

method

Character string for panel data. One of "auto" (default, uses Hausman test to select between FE and RE), "fe" (Fixed Effects), or "re" (Random Effects). Only used when model = "panel".

instruments

A formula specifying instrumental variables for IV models (e.g., ~ z1 + z2). Required when model = "iv". Instruments must satisfy relevance (correlated with endogenous X) and exogeneity (uncorrelated with error term).

treatment_var

Character string. Name of the treatment group variable for DiD models. Required when model = "did".

time_var

Character string. Name of the time period variable (pre/post) for DiD models. Required when model = "did".

treatment_level

Character string. Which level of treatment_var represents the treated group. If NULL, the second level is used.

post_level

Character string. Which level of time_var represents the post-treatment period. If NULL, the second level is used.

digits

Integer. Number of decimal places in output tables. Default is 2.

Value

An object of class basic_model, which is a list containing:

tables: A list of formatted data frames with estimation results.
diagnostics: A list of raw diagnostic statistics (p-values, fit indices).
messages: A character vector of methodological guidance messages from the customs layer.
method: A character string indicating the engine used ("ols", "anova", "logit", "panel", "iv", or "did").
data: The cleaned data frame used for estimation (after listwise deletion).

Examples

# OLS example
set.seed(42)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
result <- paper_engine(y ~ x1 + x2, data = df, model = "ols")
print(result$tables)
print(result$messages)

# ANOVA example
df2 <- data.frame(score = c(rnorm(30, 5), rnorm(30, 7)),
                  group = rep(c("A", "B"), each = 30))
result2 <- paper_engine(score ~ group, data = df2, model = "anova")
print(result2$tables)

# Logit example
df3 <- data.frame(y = rbinom(100, 1, 0.5), x = rnorm(100))
result3 <- paper_engine(y ~ x, data = df3, model = "logit")
print(result3$tables)

# OLS example
set.seed(42)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
result <- paper_engine(y ~ x1 + x2, data = df, model = "ols")
print(result$tables)
print(result$messages)

# ANOVA example
df2 <- data.frame(score = c(rnorm(30, 5), rnorm(30, 7)),
                  group = rep(c("A", "B"), each = 30))
result2 <- paper_engine(score ~ group, data = df2, model = "anova")
print(result2$tables)

# Logit example
df3 <- data.frame(y = rbinom(100, 1, 0.5), x = rnorm(100))
result3 <- paper_engine(y ~ x, data = df3, model = "logit")
print(result3$tables)

Generate Publication-Ready Plots for Basic Models

Description

Produces minimalist APA-style plots from a basic_model object returned by paper_engine. The plot type is selected automatically based on the estimation method: a forest plot of coefficients with 95 and a logistic probability curve for logistic regression.

Usage

plot_engine(model_object, y_label = NULL, x_label = NULL)
plot_engine(model_object, y_label = NULL, x_label = NULL)

Arguments

model_object

An object of class basic_model generated by paper_engine.

y_label

A character string for the Y-axis label. If NULL (default), a label is generated automatically from the model type.

x_label

A character string for the X-axis label. If NULL (default), a label is generated automatically from the model type.

Value

A base R plot rendered in the active graphics device. The function is called for its side effect (the plot) and returns NULL invisibly.

Examples

set.seed(42)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
result <- paper_engine(y ~ x1 + x2, data = df, model = "ols")
plot_engine(result, y_label = "Outcome", x_label = "Predictors")

set.seed(42)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
result <- paper_engine(y ~ x1 + x2, data = df, model = "ols")
plot_engine(result, y_label = "Outcome", x_label = "Predictors")

Package 'OLSengine'

Help Index

Academic Salaries Dataset for U.S. College Professors

Description

Usage

Format

Details

Source

References

Examples

Transparent and Assisted Linear Modeling Engine

Description

Usage

Arguments

Value

Examples

Generate Publication-Ready Plots for Basic Models

Description

Usage

Arguments

Value

Examples