Research Workflow in R

Author
Affiliation

Mingze Gao, PhD

Macquarie University

Published

August 28, 2025

Modified

August 29, 2025

This is an example template for a research workflow in R.

After many years working with SAS and Stata, I am experimenting a pure R workflow. This template outlines a typical research workflow in R, including data loading, cleaning, analysis, and presentation of results.1

1 You can download the template by clicking the Code button at the top right of this page and selecting View Source.

1 Setup

This section initializes the R environment, sets project parameters, and loads required libraries for data processing and analysis.

Initial setup.
# Clear the environment
rm(list = ls())

# Disable scientific notation
options(scipen = 999)

# A hack to set the working directory to the location of this file
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  wd <- dirname(rstudioapi::getSourceEditorContext()$path)
  setwd(wd)
}
# Set the root directory for the project
here::i_am("index.qmd")

We first define some key project parameters.

year_start <- 1990
year_end <- 2024
remove_financial_firms <- TRUE
remove_utility_firms <- TRUE
force_data_download <- FALSE
Load libraries.
# Database and data manipulation
library(RPostgres)
library(here)
library(glue)
library(tidyverse)
library(slider)

# Data analysis and statistics
library(fixest)
library(modelsummary)

# Plotting and presentation
library(ggplot2)
library(tinytable)

# Variable labelling
library(labelled)
Define helper functions.
# Load data from local file or WRDS database.
load_data <- function(local, wrds, sql, force_download = FALSE) {
  if (file.exists(local) && !force_download) {
    df <- readRDS(local)
  } else {
    res <- dbSendQuery(wrds, sql)
    df <- dbFetch(res, n = -1)
    dbClearResult(res)
    saveRDS(df, local, compress = "gzip")
  }
  return(df)
}

# Check for duplicate rows in a data frame based on specified columns.
check_duplicates <- function(df, cols) {
  if (any(duplicated(df[, cols]))) {
    warning(
      glue(
        "Found duplicate rows based on columns: ",
        "{paste(cols, collapse = ', ')}."
      )
    )
  } else {
    message(
      glue(
        "Rows in {deparse(substitute(df))} are unique based on columns: ",
        "{paste(cols, collapse = ', ')}."
      )
    )
  }
}

# Winsorize a numeric vector at specified percentiles.
winsorize_vec <- function(x, probs = c(0.01, 0.99), na.rm = TRUE) {
  stopifnot(length(probs) == 2, probs[1] <= probs[2])
  q <- stats::quantile(x, probs = probs, na.rm = na.rm, names = FALSE)
  x <- pmin(pmax(x, q[1]), q[2])
  x
}

Define some custom CSS styles for the tables in the rendered HTML. Notably, rendering this document can produce an HTML page containing formatted outputs (tables and figures), while such outputs can also be exported as LaTeX files and image files to be used elsewhere (e.g., Overleaf, Word).

Define CSS for table formatting.
table_css <- "
.regtable table {
  border-collapse: collapse;      /* removes default cell gaps */
  border-spacing: 0;
  font-size: 0.9rem;              /* relative, responsive */
}

.regtable caption {
  caption-side: top;
  font-weight: bold;
  color: black;
}

.regtable th,
.regtable td {
  font-weight: normal;            /* normal weight for data cells */
  line-height: 1;                 /* tight vertical spacing */
  padding: 0.2em 0.4em;           /* minimal padding for readability */
  white-space: nowrap;            /* no wrapping */
  margin: 0;                      /* avoid extra vertical gaps */
}
"

2 Data

Data used in an empirical research project can be retrieved from either some remote sources (e.g., web pages, database like WRDS) or from local files, depending on availability.

In this section, we load data from WRDS or from local storage in the data directory.

2.1 Fetch from WRDS

Before running the code chunks in this section, ensure that you have followed the guide on WRDS to set up your local R environment for database connection.

Connect to WRDS database. We will use wrds to access the data. However, if force_data_download is set to FALSE, we will attempt to load data from local files first.

wrds <- NULL 
if (force_data_download) {
  wrds <- dbConnect(
    Postgres(),
    host = "wrds-pgdata.wharton.upenn.edu",
    port = 9737,
    dbname = "wrds",
    sslmode = "require",
    user = Sys.getenv("WRDS_USERNAME")
  )
}

My WRDS username is stored in the environment variable WRDS_USERNAME. You can do the same or simply replace Sys.getenv("WRDS_USERNAME") with your actual WRDS username.

2.1.1 Compustat

Try to load Compustat Fundamentals Annual (funda) data from local storage first. If the local file does not exist, we download the data from WRDS and save it locally for future use.

local_path <- here("data", "comp.funda.rds")
sql <- glue("
  select * from comp.funda
  where
    extract(year from datadate) between {year_start} and {year_end}
    and indfmt = 'INDL'
    and datafmt = 'STD'
    and popsrc = 'D'
    and consol = 'C'
")
comp.funda <- load_data(local_path, wrds, sql) 

Following Pflueger, Siriwardane, and Sunderam (2020), for firms that change fiscal year within a calendar year, we take the last reported date when extracting financial data. This leaves us with one set of observations for each firm (gvkey) in each year.

comp.funda <- comp.funda |>
  group_by(gvkey, fyear) |>
  filter(datadate == max(datadate, na.rm = TRUE)) |>
  ungroup()

2.1.2 CRSP

Need the CRSP-Compustat Merged Link Table to link Compustat gvkey with CRSP permno and permco.

local_path <- here("data", "crsp.ccmxpf_lnkhist.rds")
sql <- "
  select * from crsp.ccmxpf_lnkhist
  where
    linktype in ('LC', 'LU')
    and linkprim in ('C', 'P')"
crsp.ccmxpf_lnkhist <- load_data(local_path, wrds, sql)

Because Compustat records only the current company information. We need the historical SIC code hsiccd, which is only available in CRSP.

local_path <- here("data", "crsp.msenames.rds")
sql <- "
  select * from crsp.msenames
  where
    shrcd in (10, 11)"
crsp.msenames <- load_data(local_path, wrds, sql)

shrcd in (10, 11) ensures common stocks.

Later, we will merge crsp.msenames and comp.funda to add the historical SIC code.

Download daily stock return data from CRSP.

local_path <- here("data", "crsp.dsf.rds")
sql <- glue("
  select date, permno, ret
  from crsp.dsf
  where
    extract(year from date) between {year_start} and {year_end}
")
crsp.dsf <- load_data(local_path, wrds, sql)

Housekeeping.

Clear up temporary variables and disconnect from WRDS
rm(local_path, sql)

if (!is.null(wrds)) {
  dbDisconnect(wrds)
}

gc() # garbage collection
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   1686192   90.1    4313734  230.4   4313734  230.4
Vcells 554681525 4231.9 1090411812 8319.2 942236294 7188.7

2.2 Load from Local Storage

If we have other datasets stored locally, we can load them here.

3 Sample

After loading the necessary datasets, we proceed to create the sample for analysis.

3.1 Merge Compustat and CRSP Data

We first merge Compustat and CRSP data using the link table crsp.ccmxpf_lnkhist. The linking is based on gvkey and the date datadate falling between the link start date linkdt and link end date linkenddt.

ccm <- inner_join(
  comp.funda,
  crsp.ccmxpf_lnkhist |>
    mutate(
      linkdt = coalesce(linkdt, as.Date("1900-01-01")),
      linkenddt = coalesce(linkenddt, as.Date("9999-12-31"))
    ),
  join_by(
    gvkey == gvkey,
    between(datadate, linkdt, linkenddt)
  )
)

Next, we add the historical SIC code hsiccd from crsp.msenames. The linking is based on permno and the date datadate falling between the name start date namedt and name end date nameendt.

ccm <- left_join(
  ccm,
  crsp.msenames |> select(permno, permco, namedt, nameendt, hsiccd),
  join_by(
    lpermno == permno,
    between(datadate, namedt, nameendt)
  )
)

Before proceeding, we verify that we have unique gvkey-permco and gvkey-permno links for each date datadate.

check_duplicates(ccm, c("gvkey", "datadate"))
check_duplicates(ccm, c("lpermco", "gvkey", "datadate"))
check_duplicates(ccm, c("lpermno", "gvkey", "datadate"))

3.2 Create the Sample

After all data is fetched and loaded, we create the sample for analysis.

3.2.1 Stock Return Volatility

Compute the standard deviation of daily stock returns over the past 12 months for each firm-year observation in ccm. We require at least 60 daily return observations within the 12-month window to compute the standard deviation; otherwise, we set it to missing.

ccm_returns <- ccm |>
  # create a lookup table of firm-year (permno, datadate) with the 12-month window
  select(gvkey, lpermno, datadate) |>
  rename(permno = lpermno) |>
  mutate(
    window_start = datadate %m-% months(12),
    window_end = datadate
  ) |>
  # join with daily returns to get daily returns in the past 12 months
  left_join(
    crsp.dsf,
    join_by(permno, between(y$date, x$window_start, x$window_end))
  ) |>
  group_by(gvkey, datadate) |>
  summarise(
    n_obs = sum(!is.na(ret)),
    ret_sd_12m = ifelse(n_obs >= 60, sd(ret, na.rm = TRUE), NA_real_),
    .groups = "drop"
  ) |>
  # keep only gvkey, datadate, ret_sd_12m
  select(gvkey, datadate, ret_sd_12m)

# Merge back to ccm
ccm <- left_join(ccm, ccm_returns, by = c("gvkey", "datadate"))

# Remove temporary data frame
rm(ccm_returns)

3.2.2 Final Sample Creation

We now create the final sample for analysis.

df_firm <- ccm |>
  # Create new variables
  mutate(
    # sic2: 2-digit SIC code
    sic2 = floor(hsiccd / 100), 
    # size: log of total assets
    size = log(at),
    # mktcap: market capitalization
    mktcap = log(abs(prcc_f * csho)),
    # seq: stockholder's equity
    seq = coalesce(seq, ceq + pstk, at - lt - mib),
    # pref: preferred stock
    pref = coalesce(pstkrv, pstkl, pstk, 0),
    # be: book equity
    be = seq - pref,
    # tangibility
    tangibility = coalesce(ppent, 0) / at,
    # market-to-book ratio
    mtb = (prcc_f * csho) / be,
    # ROA
    roa = ib / at,
    # lev: leverage
    lev = (dltt + dlc) / at,
    # Altman Z-score
    z_score = (3.3 * ebit + sale + 1.4 * re + 1.2 * (act - lct)) / at 
      + 0.6 * (prcc_f * csho / lt),
    # CF is operating income before depreciation scaled by total assets
    cf = coalesce(oibdp, 0) / at,
    # Capital expenditures (capex) is capital expenditures scaled by total assets
    capex = coalesce(capx, 0) / at,
    # R&D is research and development expenses scaled by total assets
    rd = coalesce(xrd, 0) / at,
  ) |>
  # Remove rows without historical SIC code
  filter(!is.na(hsiccd)) |>
  # Remove rows without total assets
  filter(!is.na(at)) |>
  # Remove rows without permco or permno
  filter(!is.na(lpermco) & !is.na(lpermno)) |>
  # Keep non-financial and non-utility firms
  filter(!(remove_financial_firms & sic2 %in% c(60:69))) |>
  filter(!(remove_utility_firms & sic2 %in% c(49))) |>
  filter(
    # Remove obs with negative book equity
    be > 0,
    # Remove obs with negative market-to-book ratio
    mtb > 0,
    # Remove obs with negative or zero market capitalization
    mktcap > 0,
    # Remove obs with R&D expenses greater than total assets
    rd <= 1
  )

Define variable labels.

label_map <- list(
  size = "Size",
  mktcap = "Market capitalization",
  tangibility = "Asset tangibility",
  roa = "ROA",
  mtb = "Market-to-book",
  lev = "Leverage",
  z_score = "Altman Z-score",
  capex = "Capital expenditures",
  rd = "R&D",
  cf = "Cash flow",
  ret_sd_12m = "Stock return volatility"
)

# list of variables to be used in the analysis
vars <- names(label_map)

# Set variable labels
var_label(df_firm) <- label_map

Finally, winsorize variables to reduce the impact of outliers. Specifically, we winsorize all continuous variables at the 1st and 99th percentiles.

df_firm <- df_firm |>
  # Keep only rows with non-missing values for all vars
  filter(if_all(all_of(vars), ~ !is.na(.))) |>
  # Winsorize variables at the 1st and 99th percentiles
  mutate(
    across(all_of(vars), \(x) winsorize_vec(x, probs = c(0.01, 0.99)))
  )

3.3 Summary Statistics

The summary statistics for the final sample are presented in the table below.

Summary statistics for the analysis.
modelsummary::datasummary(
  as.formula(paste(
    paste(vars, collapse = " + "),
    "~ N + Mean + SD + Min + P25 + Median + P75 + Max"
  )),
  data = df_firm,
  title = "Summary Statistics",
  fmt = 3,
  notes = "Continuous variables are winsorized at the 1st and 99th percentiles."
) |>
  theme_html(class="table table-borderless regtable", css_rule=table_css)
Summary Statistics
N Mean SD Min P25 Median P75 Max
Continuous variables are winsorized at the 1st and 99th percentiles.
Size 115405 5.456 2.149 1.151 3.860 5.329 6.945 10.787
Market capitalization 115405 5.531 2.254 0.938 3.853 5.443 7.083 11.174
Asset tangibility 115405 0.247 0.223 0.001 0.075 0.174 0.354 0.895
ROA 115405 -0.060 0.259 -1.312 -0.078 0.025 0.071 0.259
Market-to-book 115405 3.856 5.509 0.290 1.279 2.211 4.025 39.609
Leverage 115405 0.201 0.189 0.000 0.018 0.165 0.331 0.722
Altman Z-score 115405 5.177 8.624 -13.228 1.805 3.333 5.821 55.799
Capital expenditures 115405 0.054 0.062 0.000 0.015 0.034 0.067 0.347
R&D 115405 0.063 0.117 0.000 0.000 0.006 0.078 0.626
Cash flow 115405 0.029 0.244 -1.066 -0.002 0.098 0.159 0.379
Stock return volatility 115405 0.040 0.024 0.011 0.023 0.035 0.051 0.135

Visualization is also a powerful tool to understand the data. The following figure shows the time series of average capital expenditure and the number of firms in the sample.

Plot time series of average capital expenditure and number of firms.
df_capex <- df_firm |>
  group_by(fyear) |>
  summarise(
    mean_capex = mean(capex, na.rm = TRUE),
    n_firms = n_distinct(gvkey)
  )

ggplot(df_capex, aes(x = fyear)) +
  geom_col(aes(y = n_firms / max(n_firms) * max(mean_capex) * 0.3), 
           fill = "#bdbdbd", width = 0.8, alpha = 0.6) +
  geom_line(aes(y = mean_capex), color = "#A6192E", linewidth = 1.2) +
  scale_y_continuous(
    name = "Average Capital Expenditure",
    sec.axis = sec_axis(
      transform = ~ . / (max(df_capex$mean_capex, na.rm = TRUE) * 0.3) * max(df_capex$n_firms),
      name = "Number of Firms"
    )
  ) +
  labs(
    title = "Time Series of Average Capital Expenditure and Number of Firms",
    x = "Year"
  ) +
  theme_minimal()

4 Empirical Analysis

4.1 Baseline Contemporaneous Model

In practice, our research questions vary. For illustration, we examine the relationship between capital expenditures and cash flow, controlling for a set of other firm characteristics.2

2 This is just a hypothetical example. Let’s not get too caught up by whether this is a good research question or not.

Specifically, we estimate the following regression model:

\[ \begin{equation} \textit{CAPEX}_{it} = \beta \textit{CF}_{it} + \gamma X_{it} + \theta_t + \lambda_i + \varepsilon_{it} \end{equation}, \tag{1}\]

where \(\textit{CAPEX}_{it}\) is capital expenditures scaled by total assets for firm \(i\) in year \(t\), \(\textit{CF}_{it}\) is cash flow scaled by total assets, \(X_{it}\) is a vector of control variables including firm size, market-to-book ratio, leverage, ROA, asset tangibility, R&D expenses, and Altman Z-score. \(\theta_t\) and \(\lambda_i\) represent year and firm fixed effects, respectively. Standard errors are clustered at the firm level.

I will use fixest to estimate the model and modelsummary to create the regression table.

Below are several examples of how to present regression results using modelsummary, from a minimal example to more complex customizations.

Minimal regression table.
models <- list(
  "Model 1" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  ),
  "Model 2" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  )
)

# Report results using modelsummary
modelsummary(
  models,
  title = "Capital Expenditures and Cash Flow",
  stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
  note = "Robust standard errors in parentheses.",
) |>
  theme_html(class="table table-borderless regtable", css_rule=table_css)
Capital Expenditures and Cash Flow
Model 1 Model 2
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
cf 0.006*** 0.006**
(0.002) (0.002)
size 0.001*** 0.001***
(0.000) (0.000)
mtb 0.001*** 0.001***
(0.000) (0.000)
lev -0.037*** -0.035***
(0.002) (0.002)
roa 0.003 0.002
(0.002) (0.002)
tangibility 0.165*** 0.166***
(0.004) (0.004)
rd 0.021*** 0.023***
(0.003) (0.003)
z_score 0.000***
(0.000)
Num.Obs. 115405 115405
R2 0.666 0.666
R2 Adj. 0.624 0.625
R2 Within 0.114 0.114
R2 Within Adj. 0.114 0.114
AIC -416745.4 -416795.6
BIC -293937.8 -293978.3
RMSE 0.04 0.04
Std.Errors by: gvkey by: gvkey
FE: fyear X X
FE: gvkey X X

Use the coef_map argument in the modelsummary function to apply the variable labels and ordering.

Show variable labels instead of raw variable names.
models <- list(
  "Model 1" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  ),
  "Model 2" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  )
)

# Sort `label_map` to have `cf` to be the first var
label_map <- label_map[c("cf", setdiff(names(label_map), "cf"))]

# Report results using modelsummary
modelsummary(
  models,
  title = "Capital Expenditures and Cash Flow",
  coef_map = label_map, # Use label_map to rename, reorder, and subset reported vars
  stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
  note = "Robust standard errors in parentheses.",
) |>
  theme_html(class="table table-borderless regtable", css_rule=table_css)
Capital Expenditures and Cash Flow
Model 1 Model 2
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow 0.006*** 0.006**
(0.002) (0.002)
Size 0.001*** 0.001***
(0.000) (0.000)
Asset tangibility 0.165*** 0.166***
(0.004) (0.004)
ROA 0.003 0.002
(0.002) (0.002)
Market-to-book 0.001*** 0.001***
(0.000) (0.000)
Leverage -0.037*** -0.035***
(0.002) (0.002)
Altman Z-score 0.000***
(0.000)
R&D 0.021*** 0.023***
(0.003) (0.003)
Num.Obs. 115405 115405
R2 0.666 0.666
R2 Adj. 0.624 0.625
R2 Within 0.114 0.114
R2 Within Adj. 0.114 0.114
AIC -416745.4 -416795.6
BIC -293937.8 -293978.3
RMSE 0.04 0.04
Std.Errors by: gvkey by: gvkey
FE: fyear X X
FE: gvkey X X

I will customize the table’s bottom goodness of fit (GoF) panel in a way that I find most useful:

  • Report fixed effects included in each model.
  • Report the clustering information.
  • Report the mean of the dependent variable.
  • Report the number of observations and adjusted R-squared.

I use custom functions to extract and format the information.

Customize GoF panel.
models <- list(
  "Model 1" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  ),
  "Model 2" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  ),
  "Model 3" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + sic2,
    data = df_firm,
    cluster = ~gvkey
  )
)

# Extract clustering variables
get_cluster_vars <- function(x) {
  # assert x is a fixest object
  if (!inherits(x, "fixest")) stop("Input must be a fixest object")
  typ <- attr(x$se, "type")
  if (is.null(typ) || !nzchar(typ)) return(NULL)
  if (!grepl("\\(", typ)) return(NULL)
  inside <- sub(".*\\((.*)\\).*", "\\1", typ)
  vars <- trimws(unlist(strsplit(inside, "[,&+]")))
  vars[nzchar(vars)]
}

# Pretty print clustering variables
# Rename `gvkey` to `Firm`, `fyear` to `Year`, and `sic2` to `Industry`
pretty_cluster <- function(vars) {
  map <- c(gvkey = "Firm", fyear = "Year", sic2 = "Industry")
  pretty <- ifelse(vars %in% names(map), unname(map[vars]), vars)
  paste(pretty, collapse = " & ")
}

# Custom glance method for fixest objects
glance_custom.fixest <- function(x) {
  out <- list()
  # Fixed effects
  for (v in x$fixef_vars) {
    out[[paste0("FE: ", v)]] <- "✓"
  }
  # Cluster info
  cl_vars <- get_cluster_vars(x)
  if (length(cl_vars)) {
    out[["vcov.type"]] <- paste("by", pretty_cluster(cl_vars))
  } else {
    out[["vcov.type"]] <- "Not clustered"
  }
  # Mean of dependent variable
  dv <- insight::get_response(x)
  dv <- mean(dv, na.rm = TRUE)
  out[["Mean dep. var."]] <- dv

  return(data.frame(out, check.names = FALSE))
}

# Format numbers with commas
f_comma <- function(x) format(round(x, 3), big.mark = ",")

# This list must have an element for each statistics we want to print. 
# Each element must be a named list with exactly three elements:
# raw: a string with the name of a column produced by get_gof(model).
# clean: a string with the “clean” name of the statistic you want to appear in your final table.
# fmt: a string, numeric value, or function which will be used to round/format the string in question.
gm <- list(
  list("raw" = "FE: gvkey", "clean" = "Firm fixed effects", "fmt" = 0),
  list("raw" = "FE: sic2", "clean" = "Industry fixed effects", "fmt" = 0),
  list("raw" = "FE: fyear", "clean" = "Year fixed effects", "fmt" = 0),
  list("raw" = "nobs", "clean" = "N", "fmt" = f_comma),
  list("raw" = "adj.r.squared", "clean" = "Adjusted R<sup>2</sup>", "fmt" = 3),
  list("raw" = "Mean dep. var.", "clean" = "Mean of dep. var.", "fmt" = 3),
  list("raw" = "vcov.type", "clean" = "Standard errors clustering", "fmt" = 0)
)

# Report results using modelsummary
modelsummary(
  models,
  title = "Capital Expenditures and Cash Flow",
  coef_map = label_map, # Use label_map to rename, reorder, and subset reported vars
  gof_map = gm,
  stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
  note = "Robust standard errors in parentheses.",
) |>
  theme_html(class="table table-borderless regtable", css_rule=table_css)
Capital Expenditures and Cash Flow
Model 1 Model 2 Model 3
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow 0.006*** 0.006** 0.028***
(0.002) (0.002) (0.003)
Size 0.001*** 0.001*** -0.000*
(0.000) (0.000) (0.000)
Asset tangibility 0.165*** 0.166*** 0.173***
(0.004) (0.004) (0.003)
ROA 0.003 0.002 -0.021***
(0.002) (0.002) (0.002)
Market-to-book 0.001*** 0.001*** 0.001***
(0.000) (0.000) (0.000)
Leverage -0.037*** -0.035*** -0.033***
(0.002) (0.002) (0.002)
Altman Z-score 0.000*** 0.000***
(0.000) (0.000)
R&D 0.021*** 0.023*** 0.016***
(0.003) (0.003) (0.003)
Firm fixed effects
Industry fixed effects
Year fixed effects
N 115,405 115,405 115,405
Adjusted R2 0.624 0.625 0.466
Mean of dep. var. 0.054 0.054 0.054
Standard errors clustering by Firm by Firm by Firm

We can also group the models when presenting the results.

Group models
models <- list(
  "Model 1" = feols(
    capex ~ cf | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  ),
  "Model 2" = feols(
    capex ~ cf | fyear + sic2,
    data = df_firm,
    cluster = ~gvkey
  ),
  "Model 3" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  ),
  "Model 4" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + sic2,
    data = df_firm,
    cluster = ~gvkey
  )
)

# Report results using modelsummary
modelsummary(
  models,
  title = "Capital Expenditures and Cash Flow",
  coef_map = label_map, # Use label_map to rename, reorder, and subset reported vars
  gof_map = gm,
  stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
  note = "Robust standard errors in parentheses.",
) |>
  group_tt(j = list(
    "Without Controls" = c(2,3),
    "With Controls" = c(4,5)
    )) |>
  theme_html(class="table table-borderless regtable", css_rule=table_css)
Without Controls With Controls
Capital Expenditures and Cash Flow
Model 1 Model 2 Model 3 Model 4
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow 0.003* 0.017*** 0.006** 0.028***
(0.002) (0.001) (0.002) (0.003)
Size 0.001*** -0.000*
(0.000) (0.000)
Asset tangibility 0.166*** 0.173***
(0.004) (0.003)
ROA 0.002 -0.021***
(0.002) (0.002)
Market-to-book 0.001*** 0.001***
(0.000) (0.000)
Leverage -0.035*** -0.033***
(0.002) (0.002)
Altman Z-score 0.000*** 0.000***
(0.000) (0.000)
R&D 0.023*** 0.016***
(0.003) (0.003)
Firm fixed effects
Industry fixed effects
Year fixed effects
N 115,405 115,405 115,405 115,405
Adjusted R2 0.576 0.271 0.625 0.466
Mean of dep. var. 0.054 0.054 0.054 0.054
Standard errors clustering by Firm by Firm by Firm by Firm

If the models list is unnamed, modelsummary will number the models automatically.

Number models
models <- list(
  feols(
    capex ~ cf | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  ),
  feols(
    capex ~ cf | fyear + sic2,
    data = df_firm,
    cluster = ~gvkey
  ),
  feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  ),
  feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + sic2,
    data = df_firm,
    cluster = ~gvkey
  )
)

# Report results using modelsummary
modelsummary(
  models,
  title = "Capital Expenditures and Cash Flow",
  coef_map = label_map,
  gof_map = gm,
  stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
  note = "Robust standard errors in parentheses.",
) |>
  group_tt(j = list(
    "Without Controls" = c(2,3),
    "With Controls" = c(4,5)
    )) |>
  theme_html(class="table table-borderless regtable", css_rule=table_css)
Without Controls With Controls
Capital Expenditures and Cash Flow
(1) (2) (3) (4)
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow 0.003* 0.017*** 0.006** 0.028***
(0.002) (0.001) (0.002) (0.003)
Size 0.001*** -0.000*
(0.000) (0.000)
Asset tangibility 0.166*** 0.173***
(0.004) (0.003)
ROA 0.002 -0.021***
(0.002) (0.002)
Market-to-book 0.001*** 0.001***
(0.000) (0.000)
Leverage -0.035*** -0.033***
(0.002) (0.002)
Altman Z-score 0.000*** 0.000***
(0.000) (0.000)
R&D 0.023*** 0.016***
(0.003) (0.003)
Firm fixed effects
Industry fixed effects
Year fixed effects
N 115,405 115,405 115,405 115,405
Adjusted R2 0.576 0.271 0.625 0.466
Mean of dep. var. 0.054 0.054 0.054 0.054
Standard errors clustering by Firm by Firm by Firm by Firm

If the models list is named, we can manually add column numbers.

Number and name models
models <- list(
  "Model 1" = feols(
    capex ~ cf | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  ),
  "Model 2" = feols(
    capex ~ cf | fyear + sic2,
    data = df_firm,
    cluster = ~gvkey
  ),
  "Model 3" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,
    data = df_firm,
    cluster = ~gvkey
  ),
  "Model 4" = feols(
    capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + sic2,
    data = df_firm,
    cluster = ~gvkey
  )
)

n_cols <- length(models)

# Report results using modelsummary
table <-
  modelsummary(
    models,
    title = "Capital Expenditures and Cash Flow",
    coef_map = label_map,
    gof_map = gm,
    stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
    note = "Robust standard errors in parentheses.",
  ) |>
    group_tt(j = setNames(as.list(2:(n_cols + 1)), paste0("(", seq_len(n_cols), ")"))) |>
    group_tt(j = list(
      "Dependent Variable: CAPEX" = c(2:(n_cols + 1))
      ))
    
table |>
  theme_html(class="table table-borderless regtable", css_rule=table_css)
Dependent Variable: CAPEX
(1) (2) (3) (4)
Capital Expenditures and Cash Flow
Model 1 Model 2 Model 3 Model 4
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow 0.003* 0.017*** 0.006** 0.028***
(0.002) (0.001) (0.002) (0.003)
Size 0.001*** -0.000*
(0.000) (0.000)
Asset tangibility 0.166*** 0.173***
(0.004) (0.003)
ROA 0.002 -0.021***
(0.002) (0.002)
Market-to-book 0.001*** 0.001***
(0.000) (0.000)
Leverage -0.035*** -0.033***
(0.002) (0.002)
Altman Z-score 0.000*** 0.000***
(0.000) (0.000)
R&D 0.023*** 0.016***
(0.003) (0.003)
Firm fixed effects
Industry fixed effects
Year fixed effects
N 115,405 115,405 115,405 115,405
Adjusted R2 0.576 0.271 0.625 0.466
Mean of dep. var. 0.054 0.054 0.054 0.054
Standard errors clustering by Firm by Firm by Firm by Firm

The tables presented here are in HTML format but can also be easily exported to other formats such as LaTeX.3

3 See documentation of tinytable for details, especially the required preamble.

We need to tweak the table formatting for LaTeX export.

For example, the label of variable R&D contains a special character &, which needs to be escaped in LaTeX. We can use the escape = TRUE argument in modelsummary to handle this.

Also, the term Adjusted R<sup>2</sup> in the GoF panel contains HTML code, which needs to be replaced with Adjusted R\textsuperscript{2} or Adjusted $R^2$ for LaTeX. We can modify the clean element of the corresponding item in gm to achieve this.

Finally, the checkmark symbol used to indicate the inclusion of fixed effects is a Unicode character that may not render correctly in LaTeX. We can modify the glance_custom.fixest function to return \checkmark instead.

Modify various formatting for LaTeX export.
# Update label_map and gm for LaTeX export

label_map[["rd"]] <- "R\\&D"
var_label(df_firm) <- label_map

gm <- list(
  list("raw" = "FE: gvkey", "clean" = "Firm fixed effects", "fmt" = 0),
  list("raw" = "FE: sic2", "clean" = "Industry fixed effects", "fmt" = 0),
  list("raw" = "FE: fyear", "clean" = "Year fixed effects", "fmt" = 0),
  list("raw" = "nobs", "clean" = "N", "fmt" = f_comma),
  list("raw" = "adj.r.squared", "clean" = "Adjusted $R^2$", "fmt" = 3),
  list("raw" = "Mean dep. var.", "clean" = "Mean of dep. var.", "fmt" = 3),
  list("raw" = "vcov.type", "clean" = "Standard errors clustering", "fmt" = 0)
)

# Update to use LaTeX checkmark instead of unicode
glance_custom.fixest <- function(x) {
  out <- list()
  # Fixed effects
  for (v in x$fixef_vars) {
    out[[paste0("FE: ", v)]] <- "\\checkmark"
  }
  # Cluster info
  cl_vars <- get_cluster_vars(x)
  if (length(cl_vars)) {
    out[["vcov.type"]] <- paste("by", pretty_cluster(cl_vars))
  } else {
    out[["vcov.type"]] <- "Not clustered"
  }
  # Mean of dependent variable
  dv <- insight::get_response(x)
  dv <- mean(dv, na.rm = TRUE)
  out[["Mean dep. var."]] <- dv

  return(data.frame(out, check.names = FALSE))
}

The code exports the regression table to a LaTeX file. Note that escape = TRUE is used to ensure special characters are properly formatted for LaTeX.

output_path <- here("output", "regression_table.tex")

modelsummary(
  models,
  title = "Capital Expenditures and Cash Flow",
  coef_map = label_map,
  gof_map = gm,
  stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
  note = "Robust standard errors in parentheses.",
  escape = TRUE
) |>
  group_tt(j = setNames(as.list(2:(n_cols + 1)), paste0("(", seq_len(n_cols), ")"))) |>
  group_tt(j = list(
    "Dependent Variable: CAPEX" = c(2:(n_cols + 1))
    )) |>
  save_tt(output_path, overwrite = TRUE)

The output is

Raw LaTeX content of output/regression_table.tex.
output/regression_table.tex
\begin{table}
\centering
\begin{talltblr}[         %% tabularray outer open
caption={Capital Expenditures and Cash Flow},
note{}={* p < 0.1, ** p < 0.05, *** p < 0.01},
note{ }={Robust standard errors in parentheses.},
]                     %% tabularray outer close
{                     %% tabularray inner open
colspec={Q[]Q[]Q[]Q[]Q[]},
column{1}={}{halign=l,},
cell{1,3-26}{3}={}{halign=c,},
cell{1,3-26}{4}={}{halign=c,},
cell{1,3-26}{5}={}{halign=c,},
cell{1}{2}={c=4,}{halign=c, halign=c,},
cell{2}{2}={}{halign=c, halign=c,},
cell{2}{3}={}{halign=c, halign=c,},
cell{2}{4}={}{halign=c, halign=c,},
cell{2}{5}={}{halign=c, halign=c,},
cell{3-26}{2}={}{halign=c,},
hline{20}={1-5}{solid, black, 0.05em},
}                     %% tabularray inner close
\toprule
& Dependent Variable: CAPEX &  &  &  \\ \cmidrule[lr]{2-5}
& (1) & (2) & (3) & (4) \\ \cmidrule[lr]{2-2}\cmidrule[lr]{3-3}\cmidrule[lr]{4-4}\cmidrule[lr]{5-5}
& Model 1 & Model 2 & Model 3 & Model 4 \\ \midrule %% TinyTableHeader
Cash flow & 0.003* & 0.017*** & 0.006** & 0.028*** \\
& (0.002) & (0.001) & (0.002) & (0.003) \\
Size &  &  & 0.001*** & -0.000* \\
&  &  & (0.000) & (0.000) \\
Asset tangibility &  &  & 0.166*** & 0.173*** \\
&  &  & (0.004) & (0.003) \\
ROA &  &  & 0.002 & -0.021*** \\
&  &  & (0.002) & (0.002) \\
Market-to-book &  &  & 0.001*** & 0.001*** \\
&  &  & (0.000) & (0.000) \\
Leverage &  &  & -0.035*** & -0.033*** \\
&  &  & (0.002) & (0.002) \\
Altman Z-score &  &  & 0.000*** & 0.000*** \\
&  &  & (0.000) & (0.000) \\
R\&D &  &  & 0.023*** & 0.016*** \\
&  &  & (0.003) & (0.003) \\
Firm fixed effects & \checkmark &  & \checkmark &  \\
Industry fixed effects &  & \checkmark &  & \checkmark \\
Year fixed effects & \checkmark & \checkmark & \checkmark & \checkmark \\
N & 115,405 & 115,405 & 115,405 & 115,405 \\
Adjusted $R^2$ & 0.576 & 0.271 & 0.625 & 0.466 \\
Mean of dep. var. & 0.054 & 0.054 & 0.054 & 0.054 \\
Standard errors clustering & by Firm & by Firm & by Firm & by Firm \\
\bottomrule
\end{talltblr}
\end{table}

The rendered PDF is shown below. Note that we need to include several LaTeX packages in the preamble to ensure proper rendering of the table.

This PDF can be downloaded here. The table is surely of publication quality.

\documentclass{article}

% For \checkmark
\usepackage{amssymb}

% For tinytable export
\usepackage{tabularray}
\usepackage{float}
\usepackage{graphicx}
\usepackage{rotating}
\usepackage[normalem]{ulem}
\UseTblrLibrary{booktabs}
\UseTblrLibrary{siunitx}
\newcommand{\tinytableTabularrayUnderline}[1]{\underline{#1}}
\newcommand{\tinytableTabularrayStrikeout}[1]{\sout{#1}}
\NewTableCommand{\tinytableDefineColor}[3]{\definecolor{#1}{#2}{#3}}

\begin{document}

\input{output/regression_table}

\end{document}

4.2 Panel with Lagged Variables

Very often in empirical finance, we need to control for lagged independent variables (e.g., lagged firm characteristics). For example, we can estimate the following model:

\[ \begin{equation} \textit{CAPEX}_{it} = \beta \textit{CF}_{i,t-1} + \gamma X_{i,t-1} + \theta_t + \lambda_i + \varepsilon_{it} \end{equation}. \tag{2}\]

Equation 2 includes lagged values of the independent variables instead of contemporaneous values in Equation 1.

Before proceeding, we need to set the panel.

panel_sample = panel(df_firm, ~gvkey+fyear)
Restore HTML formatting for variable labels and GoF panel.
label_map[["rd"]] <- "R&D"
var_label(df_firm) <- label_map

gm <- list(
  list("raw" = "FE: gvkey", "clean" = "Firm fixed effects", "fmt" = 0),
  list("raw" = "FE: sic2", "clean" = "Industry fixed effects", "fmt" = 0),
  list("raw" = "FE: fyear", "clean" = "Year fixed effects", "fmt" = 0),
  list("raw" = "nobs", "clean" = "N", "fmt" = f_comma),
  list("raw" = "adj.r.squared", "clean" = "Adjusted R-squared", "fmt" = 3),
  list("raw" = "Mean dep. var.", "clean" = "Mean of dep. var.", "fmt" = 3),
  list("raw" = "vcov.type", "clean" = "Standard errors clustering", "fmt" = 0)
)

# Update to use LaTeX checkmark instead of unicode
glance_custom.fixest <- function(x) {
  out <- list()
  # Fixed effects
  for (v in x$fixef_vars) {
    out[[paste0("FE: ", v)]] <- "Y"
  }
  # Cluster info
  cl_vars <- get_cluster_vars(x)
  if (length(cl_vars)) {
    out[["vcov.type"]] <- paste("by", pretty_cluster(cl_vars))
  } else {
    out[["vcov.type"]] <- "Not clustered"
  }
  # Mean of dependent variable
  dv <- insight::get_response(x)
  dv <- mean(dv, na.rm = TRUE)
  out[["Mean dep. var."]] <- dv

  return(data.frame(out, check.names = FALSE))
}
Estimate model with lagged variables.
# Variables to lag
lag_vars <- c("cf", "size", "mtb", "lev", "roa", "tangibility", "rd", "z_score")
fml <- as.formula(
          paste0(
            # dependent
            "capex ~ ",
            # lagged variables
            paste0("l(", lag_vars, ",1)", collapse = " + "),
            # fixed effects
            " | fyear + gvkey"
          )
        )

models <- list(
  feols(
    capex ~ l(cf,0:1) | fyear + gvkey,
    data = panel_sample,
    cluster = ~gvkey
  ),
  feols(fml,
    data = panel_sample,
    cluster = ~gvkey
  )
)
NOTE: 14,696 observations removed because of NA values (RHS: 14,696).
NOTE: 14,696 observations removed because of NA values (RHS: 14,696).

Below are examples of presenting regression results using modelsummary, from a minimal example to more complex customizations.

Panel data with lagged variables. No variable labelling.
# Report results using modelsummary
table <-
  modelsummary(
    models,
    title = "Capital Expenditures and Cash Flow",
    gof_map = gm,
    stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
    note = "Robust standard errors in parentheses.",
  ) |>
    group_tt(j = list(
      "Dependent Variable: CAPEX" = c(2:(length(models) + 1))
      ))
    
table |>
  theme_html(class="table table-borderless regtable", css_rule=table_css)
Dependent Variable: CAPEX
Capital Expenditures and Cash Flow
(1) (2)
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
cf -0.007***
(0.002)
l(cf, 1) 0.036*** 0.029***
(0.002) (0.003)
l(size, 1) -0.001***
(0.000)
l(mtb, 1) 0.001***
(0.000)
l(lev, 1) -0.040***
(0.002)
l(roa, 1) 0.004**
(0.002)
l(tangibility, 1) 0.023***
(0.004)
l(rd, 1) 0.016***
(0.004)
l(z_score, 1) 0.000***
(0.000)
Firm fixed effects Y Y
Year fixed effects Y Y
N 100,709 100,709
Adjusted R-squared 0.591 0.601
Mean of dep. var. 0.054 0.054
Standard errors clustering by Firm by Firm
Lagged variables with variable labelling and ordering.
# Create a new label_map for lagged variables
label_map_lagged <- list()

# Add lagged variable labels
# The order here determines the display order in the table
label_map_lagged[["cf"]] <- "Cash flow (t)"
label_map_lagged[["l(cf, 1)"]] <- "Cash flow (t-1)"
label_map_lagged[["l(size, 1)"]] <- "Size (t-1)"
label_map_lagged[["l(mtb, 1)"]] <- "Market-to-book (t-1)"
label_map_lagged[["l(lev, 1)"]] <- "Leverage (t-1)"
label_map_lagged[["l(roa, 1)"]] <- "ROA (t-1)"
label_map_lagged[["l(tangibility, 1)"]] <- "Asset tangibility (t-1)"
label_map_lagged[["l(rd, 1)"]] <- "R&D (t-1)"
label_map_lagged[["l(z_score, 1)"]] <- "Altman Z-score (t-1)"

# Report results using modelsummary
table <-
  modelsummary(
    models,
    title = "Capital Expenditures and Cash Flow",
    coef_map = label_map_lagged,
    gof_map = gm,
    stars = c("*" = 0.1, "**" = 0.05, "***" = 0.01),
    note = "Robust standard errors in parentheses.",
  ) |>
    group_tt(j = list(
      "Dependent Variable: CAPEX" = c(2:(length(models) + 1))
      ))
    
table |>
  theme_html(class="table table-borderless regtable", css_rule=table_css)
Dependent Variable: CAPEX
Capital Expenditures and Cash Flow
(1) (2)
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow (t) -0.007***
(0.002)
Cash flow (t-1) 0.036*** 0.029***
(0.002) (0.003)
Size (t-1) -0.001***
(0.000)
Market-to-book (t-1) 0.001***
(0.000)
Leverage (t-1) -0.040***
(0.002)
ROA (t-1) 0.004**
(0.002)
Asset tangibility (t-1) 0.023***
(0.004)
R&D (t-1) 0.016***
(0.004)
Altman Z-score (t-1) 0.000***
(0.000)
Firm fixed effects Y Y
Year fixed effects Y Y
N 100,709 100,709
Adjusted R-squared 0.591 0.601
Mean of dep. var. 0.054 0.054
Standard errors clustering by Firm by Firm

5 Conclusion

In this example template for a research workflow in R, I showcase various steps in the research process from data collection, manipulation, model estimation, to result presentation.

While I initially planned to cover a more comprehensive set of empirical analyses, I realized that that this document would be bloated if I included too many examples.

I will add more examples of various statistical analysis and tabulation with different customizations in separate posts. Of course, feedback and suggestions are always welcome!

Back to top

References

Pflueger, Carolin, Emil Siriwardane, and Adi Sunderam. 2020. “Financial Market Risk Perceptions and the Macroeconomy.” The Quarterly Journal of Economics 135 (3): 1443–91.