| Title: | Transparent Linear and Causal Inference Models for Social Sciences |
|---|---|
| Description: | Unified estimation, diagnostics, and reporting for ordinary least squares (OLS) regression, ANOVA/t-tests, logistic regression, panel data (fixed/random effects with Hausman test), instrumental variables (2SLS with weak instrument diagnostics), and difference-in-differences. Designed for applied researchers in social sciences with integrated "Methodological Customs" that audit assumptions and provide literature references. All methods implemented in pure base R without external dependencies beyond stats and graphics packages. |
| Authors: | Manuel Soto-Pérez [aut, cre] (ORCID: <https://orcid.org/0000-0002-8249-7410>) |
| Maintainer: | Manuel Soto-Pérez <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.0 |
| Built: | 2026-05-15 08:16:47 UTC |
| Source: | https://github.com/msoto-perez/olsengine |
Real data on 9-month academic salaries for assistant professors,
associate professors, and full professors at a U.S. college. This dataset
is provided for educational purposes to demonstrate regression modeling,
ANOVA, and logistic regression with paper_engine.
academic_salaries
A data frame with 397 observations and 7 variables:
Factor with 3 levels: "AsstProf" (Assistant Professor),
"AssocProf" (Associate Professor), "Prof" (Full Professor).
Represents academic rank.
Factor with 2 levels: "A" (theoretical departments,
e.g., mathematics, physics) and "B" (applied departments, e.g.,
engineering, business). Represents academic discipline category.
Numeric. Number of years since the faculty member earned their PhD.
Numeric. Number of years the faculty member has served at this institution.
Factor with 2 levels: "Female" and "Male".
Numeric. Nine-month academic salary in U.S. dollars (2008-09 academic year).
Integer. Binary indicator (0 = No, 1 = Yes) marking faculty in the top 33% of salaries. Created for logistic regression demonstrations.
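The exact rule used to build the high-earner indicator is not stated here. As a minimal sketch, assuming the cutoff is the upper tercile of salary, an equivalent 0/1 variable could be derived in base R as follows. The simulated `salary` vector and the names `cutoff` and `high_earner_demo` are illustrative only and are not part of the package.

```r
# Hypothetical stand-in for the salary column (simulated, not the real data)
set.seed(1)
salary <- round(rnorm(397, mean = 110000, sd = 30000))

# Assumed rule: flag observations strictly above the 2/3 quantile
cutoff <- quantile(salary, probs = 2 / 3)
high_earner_demo <- as.integer(salary > cutoff)

# Share of flagged observations; should be close to one third
mean(high_earner_demo)
```

A strict inequality is used so that observations tied at the cutoff are coded 0; a different tie rule would change the share slightly.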
This dataset enables demonstration of OLSengine's three core methods:
OLS Regression: Modeling salary as a function of rank, discipline, experience, and sex to assess wage determinants and potential gender disparities.
ANOVA: Comparing mean salaries across academic ranks or disciplines.
Logistic Regression: Predicting the probability of being a high earner based on experience, rank, and discipline.
The data were collected in the 2008-09 academic year and reflect institutional salary structures at that time. Gender wage gap research in academia remains an active area of inquiry (Ginther & Kahn, 2021).
This dataset is adapted from the Salaries dataset in the carData
package (Fox & Weisberg, 2019), which was originally compiled for the textbook
An R Companion to Applied Regression (Fox & Weisberg, 2011). The
original data source is a U.S. college during the 2008-09 academic year.
Licensed under GPL (>= 2), consistent with the carData package license.
Fox, J., & Weisberg, S. (2011). An R Companion to Applied Regression (2nd ed.). Thousand Oaks, CA: Sage. https://socialsciences.mcmaster.ca/jfox/Books/Companion/
Fox, J., & Weisberg, S. (2019). carData: Companion to Applied Regression Data Sets. R package version 3.0-3. https://CRAN.R-project.org/package=carData
Ginther, D. K., & Kahn, S. (2021). Women in academic science: A changing landscape. Psychological Science in the Public Interest, 22(1), 3-65.
    # Load the dataset
    data(academic_salaries)

    # Explore structure
    str(academic_salaries)
    summary(academic_salaries)

    # OLS: Modeling salary determinants
    ols_model <- paper_engine(
      salary ~ rank + discipline + years_since_phd + sex,
      data = academic_salaries,
      model = "ols",
      robust = "auto"
    )
    print(ols_model$tables$Table2_OLS_Estimation)
    print(ols_model$messages)

    # ANOVA: Salary differences across academic ranks
    anova_model <- paper_engine(
      salary ~ rank,
      data = academic_salaries,
      model = "anova"
    )
    print(anova_model$tables$Descriptive_Means)

    # Logit: Predicting high earner status
    logit_model <- paper_engine(
      high_earner ~ years_since_phd + rank + discipline,
      data = academic_salaries,
      model = "logit"
    )
    print(logit_model$tables$Table2_Logit_Estimation)

    # Visualization
    plot_engine(ols_model)
Estimates OLS regression, ANOVA/t-tests, binary logistic regression, panel data models, instrumental variables, or difference-in-differences using pure base R matrix algebra. Automatically audits statistical assumptions through an integrated methodological customs layer and returns publication-ready APA-formatted tables. Designed for applied researchers and early-career academics who need a single, transparent workflow from estimation to reporting.
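To illustrate what estimation by base R matrix algebra can look like, the sketch below computes OLS coefficients and classical standard errors from the normal equations and cross-checks them against `lm()`. It is not taken from the package source; the object names (`X`, `XtX_inv`, `beta_hat`, `se_hat`) are illustrative assumptions.

```r
# Simulated data, same shape as the examples in this manual
set.seed(42)
df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

# Design matrix (with intercept) and response vector
X <- model.matrix(y ~ x1 + x2, data = df)
y <- df$y

# Coefficients from the normal equations: (X'X)^{-1} X'y
XtX_inv <- solve(crossprod(X))
beta_hat <- XtX_inv %*% crossprod(X, y)

# Residual variance and classical (homoskedastic) standard errors
res <- y - X %*% beta_hat
sigma2 <- sum(res^2) / (nrow(X) - ncol(X))
se_hat <- sqrt(diag(sigma2 * XtX_inv))

# Cross-check against the reference implementation in stats
fit <- lm(y ~ x1 + x2, data = df)
all.equal(as.numeric(beta_hat), unname(coef(fit)))
```

Inverting `X'X` directly keeps the algebra readable; production code would often prefer a QR decomposition for numerical stability.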
    paper_engine(
      formula,
      data,
      model = "ols",
      robust = FALSE,
      non_parametric = FALSE,
      paired = FALSE,
      entity_id = NULL,
      time_id = NULL,
      method = "auto",
      instruments = NULL,
      treatment_var = NULL,
      time_var = NULL,
      treatment_level = NULL,
      post_level = NULL,
      digits = 2
    )
formula |
A |
data |
A data frame containing all variables referenced in |
model |
A character string indicating the estimation engine.
One of |
robust |
Logical or |
non_parametric |
Logical or |
paired |
Logical. If |
entity_id |
Character string. Name of the entity/individual identifier
variable for panel data models. Required when |
time_id |
Character string. Name of the time period identifier variable
for panel data models. Required when |
method |
Character string for panel data. One of |
instruments |
A |
treatment_var |
Character string. Name of the treatment group variable for
DiD models. Required when |
time_var |
Character string. Name of the time period variable (pre/post) for
DiD models. Required when |
treatment_level |
Character string. Which level of |
post_level |
Character string. Which level of |
digits |
Integer. Number of decimal places in output tables.
Default is |
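The treatment and time arguments above describe a two-group, two-period design. As a rough illustration of the quantity a difference-in-differences model targets, independent of how paper_engine computes it internally, the sketch below shows that the difference of cell-mean differences equals the interaction coefficient of a saturated linear model. The data are simulated and all variable names (`treated`, `post`, `did_df`) are hypothetical.

```r
# Simulated 2x2 design: group indicator, period indicator, true effect of 2
set.seed(42)
n <- 200
treated <- rep(c(0, 1), each = n / 2)
post <- rep(c(0, 1), times = n / 2)
y <- 1 + 0.5 * treated + 0.3 * post + 2 * treated * post + rnorm(n)
did_df <- data.frame(y, treated, post)

# Difference-in-differences from the four cell means
cell_means <- tapply(did_df$y, list(did_df$treated, did_df$post), mean)
did_by_means <- (cell_means["1", "1"] - cell_means["1", "0"]) -
  (cell_means["0", "1"] - cell_means["0", "0"])

# Same quantity as the interaction term of a saturated regression
fit_did <- lm(y ~ treated * post, data = did_df)
did_by_lm <- unname(coef(fit_did)["treated:post"])

all.equal(unname(did_by_means), did_by_lm)
```

The regression form is usually preferred in practice because it extends to covariates and yields a standard error for the interaction term.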
An object of class basic_model, which is a list containing:
A list of formatted data frames with estimation results.
A list of raw diagnostic statistics (p-values, fit indices).
A character vector of methodological guidance messages from the customs layer.
A character string indicating the engine used ("ols",
"anova", "logit", "panel", "iv", or "did").
The cleaned data frame used for estimation (after listwise deletion).
    # OLS example
    set.seed(42)
    df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
    result <- paper_engine(y ~ x1 + x2, data = df, model = "ols")
    print(result$tables)
    print(result$messages)

    # ANOVA example
    df2 <- data.frame(score = c(rnorm(30, 5), rnorm(30, 7)),
                      group = rep(c("A", "B"), each = 30))
    result2 <- paper_engine(score ~ group, data = df2, model = "anova")
    print(result2$tables)

    # Logit example
    df3 <- data.frame(y = rbinom(100, 1, 0.5), x = rnorm(100))
    result3 <- paper_engine(y ~ x, data = df3, model = "logit")
    print(result3$tables)
Produces minimalist APA-style plots from a basic_model
object returned by paper_engine. The plot type is selected
automatically based on the estimation method: a forest plot of coefficients
with 95% confidence intervals,
and a logistic probability curve for logistic regression.
plot_engine(model_object, y_label = NULL, x_label = NULL)
model_object |
An object of class |
y_label |
A character string for the Y-axis label. If |
x_label |
A character string for the X-axis label. If |
A base R plot rendered in the active graphics device. The function
is called for its side effect (the plot) and returns NULL invisibly.
    set.seed(42)
    df <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
    result <- paper_engine(y ~ x1 + x2, data = df, model = "ols")
    plot_engine(result, y_label = "Outcome", x_label = "Predictors")