This is an example template for a research workflow in R.
After many years working with SAS and Stata, I am experimenting a pure R workflow. This template outlines a typical research workflow in R, including data loading, cleaning, analysis, and presentation of results.1
1 You can download the template by clicking the Code button at the top right of this page and selecting View Source.
1 Setup
This section initializes the R environment, sets project parameters, and loads required libraries for data processing and analysis.
Initial setup.
# Clear the environmentrm(list =ls())# Disable scientific notationoptions(scipen =999)# A hack to set the working directory to the location of this fileif(requireNamespace("rstudioapi", quietly =TRUE)&&rstudioapi::isAvailable()){wd<-dirname(rstudioapi::getSourceEditorContext()$path)setwd(wd)}# Set the root directory for the projecthere::i_am("index.qmd")
# Load data from local file or WRDS database.load_data<-function(local, wrds, sql, force_download=FALSE){if(file.exists(local)&&!force_download){df<-readRDS(local)}else{res<-dbSendQuery(wrds, sql)df<-dbFetch(res, n =-1)dbClearResult(res)saveRDS(df, local, compress ="gzip")}return(df)}# Check for duplicate rows in a data frame based on specified columns.check_duplicates<-function(df, cols){if(any(duplicated(df[, cols]))){warning(glue("Found duplicate rows based on columns: ","{paste(cols, collapse = ', ')}."))}else{message(glue("Rows in {deparse(substitute(df))} are unique based on columns: ","{paste(cols, collapse = ', ')}."))}}# Winsorize a numeric vector at specified percentiles.winsorize_vec<-function(x, probs=c(0.01, 0.99), na.rm=TRUE){stopifnot(length(probs)==2, probs[1]<=probs[2])q<-stats::quantile(x, probs =probs, na.rm =na.rm, names =FALSE)x<-pmin(pmax(x, q[1]), q[2])x}
Define some custom CSS styles for the tables in the rendered HTML. Notably, rendering this document can produce an HTML page containing formatted outputs (tables and figures), while such outputs can also be exported as LaTeX files and image files to be used elsewhere (e.g., Overleaf, Word).
Data used in an empirical research project can be retrieved from either some remote sources (e.g., web pages, database like WRDS) or from local files, depending on availability.
In this section, we load data from WRDS or from local storage in the data directory.
2.1 Fetch from WRDS
Before running the code chunks in this section, ensure that you have followed the guide on WRDS to set up your local R environment for database connection.
Connect to WRDS database. We will use wrds to access the data. However, if force_data_download is set to FALSE, we will attempt to load data from local files first.
wrds<-NULLif(force_data_download){wrds<-dbConnect(Postgres(), host ="wrds-pgdata.wharton.upenn.edu", port =9737, dbname ="wrds", sslmode ="require", user =Sys.getenv("WRDS_USERNAME"))}
My WRDS username is stored in the environment variable WRDS_USERNAME. You can do the same or simply replace Sys.getenv("WRDS_USERNAME") with your actual WRDS username.
2.1.1 Compustat
Try to load Compustat Fundamentals Annual (funda) data from local storage first. If the local file does not exist, we download the data from WRDS and save it locally for future use.
local_path<-here("data", "comp.funda.rds")sql<-glue(" select * from comp.funda where extract(year from datadate) between {year_start} and {year_end} and indfmt = 'INDL' and datafmt = 'STD' and popsrc = 'D' and consol = 'C'")comp.funda<-load_data(local_path, wrds, sql)
Following Pflueger, Siriwardane, and Sunderam (2020), for firms that change fiscal year within a calendar year, we take the last reported date when extracting financial data. This leaves us with one set of observations for each firm (gvkey) in each year.
Need the CRSP-Compustat Merged Link Table to link Compustat gvkey with CRSP permno and permco.
local_path<-here("data", "crsp.ccmxpf_lnkhist.rds")sql<-" select * from crsp.ccmxpf_lnkhist where linktype in ('LC', 'LU') and linkprim in ('C', 'P')"crsp.ccmxpf_lnkhist<-load_data(local_path, wrds, sql)
Because Compustat records only the current company information. We need the historical SIC code hsiccd, which is only available in CRSP.
local_path<-here("data", "crsp.msenames.rds")sql<-" select * from crsp.msenames where shrcd in (10, 11)"crsp.msenames<-load_data(local_path, wrds, sql)
shrcd in (10, 11) ensures common stocks.
Later, we will merge crsp.msenames and comp.funda to add the historical SIC code.
Download daily stock return data from CRSP.
local_path<-here("data", "crsp.dsf.rds")sql<-glue(" select date, permno, ret from crsp.dsf where extract(year from date) between {year_start} and {year_end}")crsp.dsf<-load_data(local_path, wrds, sql)
Housekeeping.
Clear up temporary variables and disconnect from WRDS
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1686192 90.1 4313734 230.4 4313734 230.4
Vcells 554681525 4231.9 1090411812 8319.2 942236294 7188.7
2.2 Load from Local Storage
If we have other datasets stored locally, we can load them here.
3 Sample
After loading the necessary datasets, we proceed to create the sample for analysis.
3.1 Merge Compustat and CRSP Data
We first merge Compustat and CRSP data using the link table crsp.ccmxpf_lnkhist. The linking is based on gvkey and the date datadate falling between the link start date linkdt and link end date linkenddt.
Next, we add the historical SIC code hsiccd from crsp.msenames. The linking is based on permno and the date datadate falling between the name start date namedt and name end date nameendt.
After all data is fetched and loaded, we create the sample for analysis.
3.2.1 Stock Return Volatility
Compute the standard deviation of daily stock returns over the past 12 months for each firm-year observation in ccm. We require at least 60 daily return observations within the 12-month window to compute the standard deviation; otherwise, we set it to missing.
ccm_returns<-ccm|># create a lookup table of firm-year (permno, datadate) with the 12-month windowselect(gvkey, lpermno, datadate)|>rename(permno =lpermno)|>mutate( window_start =datadate%m-%months(12), window_end =datadate)|># join with daily returns to get daily returns in the past 12 monthsleft_join(crsp.dsf,join_by(permno, between(y$date, x$window_start, x$window_end)))|>group_by(gvkey, datadate)|>summarise( n_obs =sum(!is.na(ret)), ret_sd_12m =ifelse(n_obs>=60, sd(ret, na.rm =TRUE), NA_real_), .groups ="drop")|># keep only gvkey, datadate, ret_sd_12mselect(gvkey, datadate, ret_sd_12m)# Merge back to ccmccm<-left_join(ccm, ccm_returns, by =c("gvkey", "datadate"))# Remove temporary data framerm(ccm_returns)
3.2.2 Final Sample Creation
We now create the final sample for analysis.
df_firm<-ccm|># Create new variablesmutate(# sic2: 2-digit SIC code sic2 =floor(hsiccd/100), # size: log of total assets size =log(at),# mktcap: market capitalization mktcap =log(abs(prcc_f*csho)),# seq: stockholder's equity seq =coalesce(seq, ceq+pstk, at-lt-mib),# pref: preferred stock pref =coalesce(pstkrv, pstkl, pstk, 0),# be: book equity be =seq-pref,# tangibility tangibility =coalesce(ppent, 0)/at,# market-to-book ratio mtb =(prcc_f*csho)/be,# ROA roa =ib/at,# lev: leverage lev =(dltt+dlc)/at,# Altman Z-score z_score =(3.3*ebit+sale+1.4*re+1.2*(act-lct))/at+0.6*(prcc_f*csho/lt),# CF is operating income before depreciation scaled by total assets cf =coalesce(oibdp, 0)/at,# Capital expenditures (capex) is capital expenditures scaled by total assets capex =coalesce(capx, 0)/at,# R&D is research and development expenses scaled by total assets rd =coalesce(xrd, 0)/at,)|># Remove rows without historical SIC codefilter(!is.na(hsiccd))|># Remove rows without total assetsfilter(!is.na(at))|># Remove rows without permco or permnofilter(!is.na(lpermco)&!is.na(lpermno))|># Keep non-financial and non-utility firmsfilter(!(remove_financial_firms&sic2%in%c(60:69)))|>filter(!(remove_utility_firms&sic2%in%c(49)))|>filter(# Remove obs with negative book equitybe>0,# Remove obs with negative market-to-book ratiomtb>0,# Remove obs with negative or zero market capitalizationmktcap>0,# Remove obs with R&D expenses greater than total assetsrd<=1)
Define variable labels.
label_map<-list( size ="Size", mktcap ="Market capitalization", tangibility ="Asset tangibility", roa ="ROA", mtb ="Market-to-book", lev ="Leverage", z_score ="Altman Z-score", capex ="Capital expenditures", rd ="R&D", cf ="Cash flow", ret_sd_12m ="Stock return volatility")# list of variables to be used in the analysisvars<-names(label_map)# Set variable labelsvar_label(df_firm)<-label_map
Finally, winsorize variables to reduce the impact of outliers. Specifically, we winsorize all continuous variables at the 1st and 99th percentiles.
df_firm<-df_firm|># Keep only rows with non-missing values for all varsfilter(if_all(all_of(vars), ~!is.na(.)))|># Winsorize variables at the 1st and 99th percentilesmutate(across(all_of(vars), \(x)winsorize_vec(x, probs =c(0.01, 0.99))))
3.3 Summary Statistics
The summary statistics for the final sample are presented in the table below.
Summary statistics for the analysis.
modelsummary::datasummary(as.formula(paste(paste(vars, collapse =" + "),"~ N + Mean + SD + Min + P25 + Median + P75 + Max")), data =df_firm, title ="Summary Statistics", fmt =3, notes ="Continuous variables are winsorized at the 1st and 99th percentiles.")|>theme_html(class="table table-borderless regtable", css_rule=table_css)
Summary Statistics
N
Mean
SD
Min
P25
Median
P75
Max
Continuous variables are winsorized at the 1st and 99th percentiles.
Size
115405
5.456
2.149
1.151
3.860
5.329
6.945
10.787
Market capitalization
115405
5.531
2.254
0.938
3.853
5.443
7.083
11.174
Asset tangibility
115405
0.247
0.223
0.001
0.075
0.174
0.354
0.895
ROA
115405
-0.060
0.259
-1.312
-0.078
0.025
0.071
0.259
Market-to-book
115405
3.856
5.509
0.290
1.279
2.211
4.025
39.609
Leverage
115405
0.201
0.189
0.000
0.018
0.165
0.331
0.722
Altman Z-score
115405
5.177
8.624
-13.228
1.805
3.333
5.821
55.799
Capital expenditures
115405
0.054
0.062
0.000
0.015
0.034
0.067
0.347
R&D
115405
0.063
0.117
0.000
0.000
0.006
0.078
0.626
Cash flow
115405
0.029
0.244
-1.066
-0.002
0.098
0.159
0.379
Stock return volatility
115405
0.040
0.024
0.011
0.023
0.035
0.051
0.135
Visualization is also a powerful tool to understand the data. The following figure shows the time series of average capital expenditure and the number of firms in the sample.
Plot time series of average capital expenditure and number of firms.
df_capex<-df_firm|>group_by(fyear)|>summarise( mean_capex =mean(capex, na.rm =TRUE), n_firms =n_distinct(gvkey))ggplot(df_capex, aes(x =fyear))+geom_col(aes(y =n_firms/max(n_firms)*max(mean_capex)*0.3), fill ="#bdbdbd", width =0.8, alpha =0.6)+geom_line(aes(y =mean_capex), color ="#A6192E", linewidth =1.2)+scale_y_continuous( name ="Average Capital Expenditure", sec.axis =sec_axis( transform =~./(max(df_capex$mean_capex, na.rm =TRUE)*0.3)*max(df_capex$n_firms), name ="Number of Firms"))+labs( title ="Time Series of Average Capital Expenditure and Number of Firms", x ="Year")+theme_minimal()
4 Empirical Analysis
4.1 Baseline Contemporaneous Model
In practice, our research questions vary. For illustration, we examine the relationship between capital expenditures and cash flow, controlling for a set of other firm characteristics.2
2 This is just a hypothetical example. Let’s not get too caught up by whether this is a good research question or not.
Specifically, we estimate the following regression model:
where \(\textit{CAPEX}_{it}\) is capital expenditures scaled by total assets for firm \(i\) in year \(t\), \(\textit{CF}_{it}\) is cash flow scaled by total assets, \(X_{it}\) is a vector of control variables including firm size, market-to-book ratio, leverage, ROA, asset tangibility, R&D expenses, and Altman Z-score. \(\theta_t\) and \(\lambda_i\) represent year and firm fixed effects, respectively. Standard errors are clustered at the firm level.
I will use fixest to estimate the model and modelsummary to create the regression table.
Below are several examples of how to present regression results using modelsummary, from a minimal example to more complex customizations.
models<-list("Model 1"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd|fyear+gvkey, data =df_firm, cluster =~gvkey),"Model 2"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd+z_score|fyear+gvkey, data =df_firm, cluster =~gvkey))# Report results using modelsummarymodelsummary(models, title ="Capital Expenditures and Cash Flow", stars =c("*"=0.1, "**"=0.05, "***"=0.01), note ="Robust standard errors in parentheses.",)|>theme_html(class="table table-borderless regtable", css_rule=table_css)
Capital Expenditures and Cash Flow
Model 1
Model 2
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
cf
0.006***
0.006**
(0.002)
(0.002)
size
0.001***
0.001***
(0.000)
(0.000)
mtb
0.001***
0.001***
(0.000)
(0.000)
lev
-0.037***
-0.035***
(0.002)
(0.002)
roa
0.003
0.002
(0.002)
(0.002)
tangibility
0.165***
0.166***
(0.004)
(0.004)
rd
0.021***
0.023***
(0.003)
(0.003)
z_score
0.000***
(0.000)
Num.Obs.
115405
115405
R2
0.666
0.666
R2 Adj.
0.624
0.625
R2 Within
0.114
0.114
R2 Within Adj.
0.114
0.114
AIC
-416745.4
-416795.6
BIC
-293937.8
-293978.3
RMSE
0.04
0.04
Std.Errors
by: gvkey
by: gvkey
FE: fyear
X
X
FE: gvkey
X
X
Use the coef_map argument in the modelsummary function to apply the variable labels and ordering.
Show variable labels instead of raw variable names.
models<-list("Model 1"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd|fyear+gvkey, data =df_firm, cluster =~gvkey),"Model 2"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd+z_score|fyear+gvkey, data =df_firm, cluster =~gvkey))# Sort `label_map` to have `cf` to be the first varlabel_map<-label_map[c("cf", setdiff(names(label_map), "cf"))]# Report results using modelsummarymodelsummary(models, title ="Capital Expenditures and Cash Flow", coef_map =label_map, # Use label_map to rename, reorder, and subset reported vars stars =c("*"=0.1, "**"=0.05, "***"=0.01), note ="Robust standard errors in parentheses.",)|>theme_html(class="table table-borderless regtable", css_rule=table_css)
Capital Expenditures and Cash Flow
Model 1
Model 2
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow
0.006***
0.006**
(0.002)
(0.002)
Size
0.001***
0.001***
(0.000)
(0.000)
Asset tangibility
0.165***
0.166***
(0.004)
(0.004)
ROA
0.003
0.002
(0.002)
(0.002)
Market-to-book
0.001***
0.001***
(0.000)
(0.000)
Leverage
-0.037***
-0.035***
(0.002)
(0.002)
Altman Z-score
0.000***
(0.000)
R&D
0.021***
0.023***
(0.003)
(0.003)
Num.Obs.
115405
115405
R2
0.666
0.666
R2 Adj.
0.624
0.625
R2 Within
0.114
0.114
R2 Within Adj.
0.114
0.114
AIC
-416745.4
-416795.6
BIC
-293937.8
-293978.3
RMSE
0.04
0.04
Std.Errors
by: gvkey
by: gvkey
FE: fyear
X
X
FE: gvkey
X
X
I will customize the table’s bottom goodness of fit (GoF) panel in a way that I find most useful:
Report fixed effects included in each model.
Report the clustering information.
Report the mean of the dependent variable.
Report the number of observations and adjusted R-squared.
I use custom functions to extract and format the information.
Customize GoF panel.
models<-list("Model 1"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd|fyear+gvkey, data =df_firm, cluster =~gvkey),"Model 2"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd+z_score|fyear+gvkey, data =df_firm, cluster =~gvkey),"Model 3"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd+z_score|fyear+sic2, data =df_firm, cluster =~gvkey))# Extract clustering variablesget_cluster_vars<-function(x){# assert x is a fixest objectif(!inherits(x, "fixest"))stop("Input must be a fixest object")typ<-attr(x$se, "type")if(is.null(typ)||!nzchar(typ))return(NULL)if(!grepl("\\(", typ))return(NULL)inside<-sub(".*\\((.*)\\).*", "\\1", typ)vars<-trimws(unlist(strsplit(inside, "[,&+]")))vars[nzchar(vars)]}# Pretty print clustering variables# Rename `gvkey` to `Firm`, `fyear` to `Year`, and `sic2` to `Industry`pretty_cluster<-function(vars){map<-c(gvkey ="Firm", fyear ="Year", sic2 ="Industry")pretty<-ifelse(vars%in%names(map), unname(map[vars]), vars)paste(pretty, collapse =" & ")}# Custom glance method for fixest objectsglance_custom.fixest<-function(x){out<-list()# Fixed effectsfor(vinx$fixef_vars){out[[paste0("FE: ", v)]]<-"✓"}# Cluster infocl_vars<-get_cluster_vars(x)if(length(cl_vars)){out[["vcov.type"]]<-paste("by", pretty_cluster(cl_vars))}else{out[["vcov.type"]]<-"Not clustered"}# Mean of dependent variabledv<-insight::get_response(x)dv<-mean(dv, na.rm =TRUE)out[["Mean dep. var."]]<-dvreturn(data.frame(out, check.names =FALSE))}# Format numbers with commasf_comma<-function(x)format(round(x, 3), big.mark =",")# This list must have an element for each statistics we want to print. # Each element must be a named list with exactly three elements:# raw: a string with the name of a column produced by get_gof(model).# clean: a string with the “clean” name of the statistic you want to appear in your final table.# fmt: a string, numeric value, or function which will be used to round/format the string in question.gm<-list(list("raw"="FE: gvkey", "clean"="Firm fixed effects", "fmt"=0),list("raw"="FE: sic2", "clean"="Industry fixed effects", "fmt"=0),list("raw"="FE: fyear", "clean"="Year fixed effects", "fmt"=0),list("raw"="nobs", "clean"="N", "fmt"=f_comma),list("raw"="adj.r.squared", "clean"="Adjusted R<sup>2</sup>", "fmt"=3),list("raw"="Mean dep. var.", "clean"="Mean of dep. var.", "fmt"=3),list("raw"="vcov.type", "clean"="Standard errors clustering", "fmt"=0))# Report results using modelsummarymodelsummary(models, title ="Capital Expenditures and Cash Flow", coef_map =label_map, # Use label_map to rename, reorder, and subset reported vars gof_map =gm, stars =c("*"=0.1, "**"=0.05, "***"=0.01), note ="Robust standard errors in parentheses.",)|>theme_html(class="table table-borderless regtable", css_rule=table_css)
Capital Expenditures and Cash Flow
Model 1
Model 2
Model 3
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow
0.006***
0.006**
0.028***
(0.002)
(0.002)
(0.003)
Size
0.001***
0.001***
-0.000*
(0.000)
(0.000)
(0.000)
Asset tangibility
0.165***
0.166***
0.173***
(0.004)
(0.004)
(0.003)
ROA
0.003
0.002
-0.021***
(0.002)
(0.002)
(0.002)
Market-to-book
0.001***
0.001***
0.001***
(0.000)
(0.000)
(0.000)
Leverage
-0.037***
-0.035***
-0.033***
(0.002)
(0.002)
(0.002)
Altman Z-score
0.000***
0.000***
(0.000)
(0.000)
R&D
0.021***
0.023***
0.016***
(0.003)
(0.003)
(0.003)
Firm fixed effects
✓
✓
Industry fixed effects
✓
Year fixed effects
✓
✓
✓
N
115,405
115,405
115,405
Adjusted R2
0.624
0.625
0.466
Mean of dep. var.
0.054
0.054
0.054
Standard errors clustering
by Firm
by Firm
by Firm
We can also group the models when presenting the results.
Group models
models<-list("Model 1"=feols(capex~cf|fyear+gvkey, data =df_firm, cluster =~gvkey),"Model 2"=feols(capex~cf|fyear+sic2, data =df_firm, cluster =~gvkey),"Model 3"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd+z_score|fyear+gvkey, data =df_firm, cluster =~gvkey),"Model 4"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd+z_score|fyear+sic2, data =df_firm, cluster =~gvkey))# Report results using modelsummarymodelsummary(models, title ="Capital Expenditures and Cash Flow", coef_map =label_map, # Use label_map to rename, reorder, and subset reported vars gof_map =gm, stars =c("*"=0.1, "**"=0.05, "***"=0.01), note ="Robust standard errors in parentheses.",)|>group_tt(j =list("Without Controls"=c(2,3),"With Controls"=c(4,5)))|>theme_html(class="table table-borderless regtable", css_rule=table_css)
Without Controls
With Controls
Capital Expenditures and Cash Flow
Model 1
Model 2
Model 3
Model 4
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow
0.003*
0.017***
0.006**
0.028***
(0.002)
(0.001)
(0.002)
(0.003)
Size
0.001***
-0.000*
(0.000)
(0.000)
Asset tangibility
0.166***
0.173***
(0.004)
(0.003)
ROA
0.002
-0.021***
(0.002)
(0.002)
Market-to-book
0.001***
0.001***
(0.000)
(0.000)
Leverage
-0.035***
-0.033***
(0.002)
(0.002)
Altman Z-score
0.000***
0.000***
(0.000)
(0.000)
R&D
0.023***
0.016***
(0.003)
(0.003)
Firm fixed effects
✓
✓
Industry fixed effects
✓
✓
Year fixed effects
✓
✓
✓
✓
N
115,405
115,405
115,405
115,405
Adjusted R2
0.576
0.271
0.625
0.466
Mean of dep. var.
0.054
0.054
0.054
0.054
Standard errors clustering
by Firm
by Firm
by Firm
by Firm
If the models list is unnamed, modelsummary will number the models automatically.
Number models
models<-list(feols(capex~cf|fyear+gvkey, data =df_firm, cluster =~gvkey),feols(capex~cf|fyear+sic2, data =df_firm, cluster =~gvkey),feols(capex~cf+size+mtb+lev+roa+tangibility+rd+z_score|fyear+gvkey, data =df_firm, cluster =~gvkey),feols(capex~cf+size+mtb+lev+roa+tangibility+rd+z_score|fyear+sic2, data =df_firm, cluster =~gvkey))# Report results using modelsummarymodelsummary(models, title ="Capital Expenditures and Cash Flow", coef_map =label_map, gof_map =gm, stars =c("*"=0.1, "**"=0.05, "***"=0.01), note ="Robust standard errors in parentheses.",)|>group_tt(j =list("Without Controls"=c(2,3),"With Controls"=c(4,5)))|>theme_html(class="table table-borderless regtable", css_rule=table_css)
Without Controls
With Controls
Capital Expenditures and Cash Flow
(1)
(2)
(3)
(4)
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow
0.003*
0.017***
0.006**
0.028***
(0.002)
(0.001)
(0.002)
(0.003)
Size
0.001***
-0.000*
(0.000)
(0.000)
Asset tangibility
0.166***
0.173***
(0.004)
(0.003)
ROA
0.002
-0.021***
(0.002)
(0.002)
Market-to-book
0.001***
0.001***
(0.000)
(0.000)
Leverage
-0.035***
-0.033***
(0.002)
(0.002)
Altman Z-score
0.000***
0.000***
(0.000)
(0.000)
R&D
0.023***
0.016***
(0.003)
(0.003)
Firm fixed effects
✓
✓
Industry fixed effects
✓
✓
Year fixed effects
✓
✓
✓
✓
N
115,405
115,405
115,405
115,405
Adjusted R2
0.576
0.271
0.625
0.466
Mean of dep. var.
0.054
0.054
0.054
0.054
Standard errors clustering
by Firm
by Firm
by Firm
by Firm
If the models list is named, we can manually add column numbers.
Number and name models
models<-list("Model 1"=feols(capex~cf|fyear+gvkey, data =df_firm, cluster =~gvkey),"Model 2"=feols(capex~cf|fyear+sic2, data =df_firm, cluster =~gvkey),"Model 3"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd+z_score|fyear+gvkey, data =df_firm, cluster =~gvkey),"Model 4"=feols(capex~cf+size+mtb+lev+roa+tangibility+rd+z_score|fyear+sic2, data =df_firm, cluster =~gvkey))n_cols<-length(models)# Report results using modelsummarytable<-modelsummary(models, title ="Capital Expenditures and Cash Flow", coef_map =label_map, gof_map =gm, stars =c("*"=0.1, "**"=0.05, "***"=0.01), note ="Robust standard errors in parentheses.",)|>group_tt(j =setNames(as.list(2:(n_cols+1)), paste0("(", seq_len(n_cols), ")")))|>group_tt(j =list("Dependent Variable: CAPEX"=c(2:(n_cols+1))))table|>theme_html(class="table table-borderless regtable", css_rule=table_css)
Dependent Variable: CAPEX
(1)
(2)
(3)
(4)
Capital Expenditures and Cash Flow
Model 1
Model 2
Model 3
Model 4
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow
0.003*
0.017***
0.006**
0.028***
(0.002)
(0.001)
(0.002)
(0.003)
Size
0.001***
-0.000*
(0.000)
(0.000)
Asset tangibility
0.166***
0.173***
(0.004)
(0.003)
ROA
0.002
-0.021***
(0.002)
(0.002)
Market-to-book
0.001***
0.001***
(0.000)
(0.000)
Leverage
-0.035***
-0.033***
(0.002)
(0.002)
Altman Z-score
0.000***
0.000***
(0.000)
(0.000)
R&D
0.023***
0.016***
(0.003)
(0.003)
Firm fixed effects
✓
✓
Industry fixed effects
✓
✓
Year fixed effects
✓
✓
✓
✓
N
115,405
115,405
115,405
115,405
Adjusted R2
0.576
0.271
0.625
0.466
Mean of dep. var.
0.054
0.054
0.054
0.054
Standard errors clustering
by Firm
by Firm
by Firm
by Firm
The tables presented here are in HTML format but can also be easily exported to other formats such as LaTeX.3
We need to tweak the table formatting for LaTeX export.
For example, the label of variable R&D contains a special character &, which needs to be escaped in LaTeX. We can use the escape = TRUE argument in modelsummary to handle this.
Also, the term Adjusted R<sup>2</sup> in the GoF panel contains HTML code, which needs to be replaced with Adjusted R\textsuperscript{2} or Adjusted $R^2$ for LaTeX. We can modify the clean element of the corresponding item in gm to achieve this.
Finally, the checkmark symbol ✓ used to indicate the inclusion of fixed effects is a Unicode character that may not render correctly in LaTeX. We can modify the glance_custom.fixest function to return \checkmark instead.
Modify various formatting for LaTeX export.
# Update label_map and gm for LaTeX exportlabel_map[["rd"]]<-"R\\&D"var_label(df_firm)<-label_mapgm<-list(list("raw"="FE: gvkey", "clean"="Firm fixed effects", "fmt"=0),list("raw"="FE: sic2", "clean"="Industry fixed effects", "fmt"=0),list("raw"="FE: fyear", "clean"="Year fixed effects", "fmt"=0),list("raw"="nobs", "clean"="N", "fmt"=f_comma),list("raw"="adj.r.squared", "clean"="Adjusted $R^2$", "fmt"=3),list("raw"="Mean dep. var.", "clean"="Mean of dep. var.", "fmt"=3),list("raw"="vcov.type", "clean"="Standard errors clustering", "fmt"=0))# Update to use LaTeX checkmark instead of unicodeglance_custom.fixest<-function(x){out<-list()# Fixed effectsfor(vinx$fixef_vars){out[[paste0("FE: ", v)]]<-"\\checkmark"}# Cluster infocl_vars<-get_cluster_vars(x)if(length(cl_vars)){out[["vcov.type"]]<-paste("by", pretty_cluster(cl_vars))}else{out[["vcov.type"]]<-"Not clustered"}# Mean of dependent variabledv<-insight::get_response(x)dv<-mean(dv, na.rm =TRUE)out[["Mean dep. var."]]<-dvreturn(data.frame(out, check.names =FALSE))}
The code exports the regression table to a LaTeX file. Note that escape = TRUE is used to ensure special characters are properly formatted for LaTeX.
output_path<-here("output", "regression_table.tex")modelsummary(models, title ="Capital Expenditures and Cash Flow", coef_map =label_map, gof_map =gm, stars =c("*"=0.1, "**"=0.05, "***"=0.01), note ="Robust standard errors in parentheses.", escape =TRUE)|>group_tt(j =setNames(as.list(2:(n_cols+1)), paste0("(", seq_len(n_cols), ")")))|>group_tt(j =list("Dependent Variable: CAPEX"=c(2:(n_cols+1))))|>save_tt(output_path, overwrite =TRUE)
This PDF can be downloaded here. The table is surely of publication quality.
\documentclass{article}% For \checkmark\usepackage{amssymb}% For tinytable export\usepackage{tabularray}\usepackage{float}\usepackage{graphicx}\usepackage{rotating}\usepackage[normalem]{ulem}\UseTblrLibrary{booktabs}\UseTblrLibrary{siunitx}\newcommand{\tinytableTabularrayUnderline}[1]{\underline{#1}}\newcommand{\tinytableTabularrayStrikeout}[1]{\sout{#1}}\NewTableCommand{\tinytableDefineColor}[3]{\definecolor{#1}{#2}{#3}}\begin{document}\input{output/regression_table}\end{document}
4.2 Panel with Lagged Variables
Very often in empirical finance, we need to control for lagged independent variables (e.g., lagged firm characteristics). For example, we can estimate the following model:
Panel data with lagged variables. No variable labelling.
# Report results using modelsummarytable<-modelsummary(models, title ="Capital Expenditures and Cash Flow", gof_map =gm, stars =c("*"=0.1, "**"=0.05, "***"=0.01), note ="Robust standard errors in parentheses.",)|>group_tt(j =list("Dependent Variable: CAPEX"=c(2:(length(models)+1))))table|>theme_html(class="table table-borderless regtable", css_rule=table_css)
Dependent Variable: CAPEX
Capital Expenditures and Cash Flow
(1)
(2)
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
cf
-0.007***
(0.002)
l(cf, 1)
0.036***
0.029***
(0.002)
(0.003)
l(size, 1)
-0.001***
(0.000)
l(mtb, 1)
0.001***
(0.000)
l(lev, 1)
-0.040***
(0.002)
l(roa, 1)
0.004**
(0.002)
l(tangibility, 1)
0.023***
(0.004)
l(rd, 1)
0.016***
(0.004)
l(z_score, 1)
0.000***
(0.000)
Firm fixed effects
Y
Y
Year fixed effects
Y
Y
N
100,709
100,709
Adjusted R-squared
0.591
0.601
Mean of dep. var.
0.054
0.054
Standard errors clustering
by Firm
by Firm
Lagged variables with variable labelling and ordering.
# Create a new label_map for lagged variableslabel_map_lagged<-list()# Add lagged variable labels# The order here determines the display order in the tablelabel_map_lagged[["cf"]]<-"Cash flow (t)"label_map_lagged[["l(cf, 1)"]]<-"Cash flow (t-1)"label_map_lagged[["l(size, 1)"]]<-"Size (t-1)"label_map_lagged[["l(mtb, 1)"]]<-"Market-to-book (t-1)"label_map_lagged[["l(lev, 1)"]]<-"Leverage (t-1)"label_map_lagged[["l(roa, 1)"]]<-"ROA (t-1)"label_map_lagged[["l(tangibility, 1)"]]<-"Asset tangibility (t-1)"label_map_lagged[["l(rd, 1)"]]<-"R&D (t-1)"label_map_lagged[["l(z_score, 1)"]]<-"Altman Z-score (t-1)"# Report results using modelsummarytable<-modelsummary(models, title ="Capital Expenditures and Cash Flow", coef_map =label_map_lagged, gof_map =gm, stars =c("*"=0.1, "**"=0.05, "***"=0.01), note ="Robust standard errors in parentheses.",)|>group_tt(j =list("Dependent Variable: CAPEX"=c(2:(length(models)+1))))table|>theme_html(class="table table-borderless regtable", css_rule=table_css)
Dependent Variable: CAPEX
Capital Expenditures and Cash Flow
(1)
(2)
* p < 0.1, ** p < 0.05, *** p < 0.01
Robust standard errors in parentheses.
Cash flow (t)
-0.007***
(0.002)
Cash flow (t-1)
0.036***
0.029***
(0.002)
(0.003)
Size (t-1)
-0.001***
(0.000)
Market-to-book (t-1)
0.001***
(0.000)
Leverage (t-1)
-0.040***
(0.002)
ROA (t-1)
0.004**
(0.002)
Asset tangibility (t-1)
0.023***
(0.004)
R&D (t-1)
0.016***
(0.004)
Altman Z-score (t-1)
0.000***
(0.000)
Firm fixed effects
Y
Y
Year fixed effects
Y
Y
N
100,709
100,709
Adjusted R-squared
0.591
0.601
Mean of dep. var.
0.054
0.054
Standard errors clustering
by Firm
by Firm
5 Conclusion
In this example template for a research workflow in R, I showcase various steps in the research process from data collection, manipulation, model estimation, to result presentation.
While I initially planned to cover a more comprehensive set of empirical analyses, I realized that that this document would be bloated if I included too many examples.
I will add more examples of various statistical analysis and tabulation with different customizations in separate posts. Of course, feedback and suggestions are always welcome!
Pflueger, Carolin, Emil Siriwardane, and Adi Sunderam. 2020. “Financial Market Risk Perceptions and the Macroeconomy.”The Quarterly Journal of Economics 135 (3): 1443–91.
Source Code
---title: "Research Workflow in R"date: 2025-08-28date-modified: 2025-08-29execute: freeze: trueformat: html: embed-resources: true toc: true number-sections: true code-fold: false code-tools: true code-link: true cap-location: top---[This is an example template for a research workflow in R.]{.mark} After many years working with SAS and Stata, I am experimenting a pure R workflow. This template outlines a typical research workflow in R, including data loading, cleaning, analysis, and presentation of results.[^download-template][^download-template]: You can download the template by clicking the `Code` button at the top right of this page and selecting `View Source`. ## SetupThis section initializes the R environment, sets project parameters, and loads required libraries for data processing and analysis.```{r}#| label: init-setup#| message: false#| code-summary: Initial setup.#| code-fold: true# Clear the environmentrm(list =ls())# Disable scientific notationoptions(scipen =999)# A hack to set the working directory to the location of this fileif (requireNamespace("rstudioapi", quietly =TRUE) && rstudioapi::isAvailable()) { wd <-dirname(rstudioapi::getSourceEditorContext()$path)setwd(wd)}# Set the root directory for the projecthere::i_am("index.qmd")```We first define some key project parameters.```{r}#| label: define-parameters#| code-fold: false#| code-summary: Define key parameters.year_start <-1990year_end <-2024remove_financial_firms <-TRUEremove_utility_firms <-TRUEforce_data_download <-FALSE``````{r}#| label: load-libraries#| code-summary: Load libraries.#| code-fold: true#| message: false# Database and data manipulationlibrary(RPostgres)library(here)library(glue)library(tidyverse)library(slider)# Data analysis and statisticslibrary(fixest)library(modelsummary)# Plotting and presentationlibrary(ggplot2)library(tinytable)# Variable labellinglibrary(labelled)``````{r}#| label: define-helper-functions#| code-fold: true#| code-summary: Define helper functions.# Load data from local file or WRDS database.load_data <-function(local, wrds, sql, force_download =FALSE) {if (file.exists(local) &&!force_download) { df <-readRDS(local) } else { res <-dbSendQuery(wrds, sql) df <-dbFetch(res, n =-1)dbClearResult(res)saveRDS(df, local, compress ="gzip") }return(df)}# Check for duplicate rows in a data frame based on specified columns.check_duplicates <-function(df, cols) {if (any(duplicated(df[, cols]))) {warning(glue("Found duplicate rows based on columns: ","{paste(cols, collapse = ', ')}." ) ) } else {message(glue("Rows in {deparse(substitute(df))} are unique based on columns: ","{paste(cols, collapse = ', ')}." ) ) }}# Winsorize a numeric vector at specified percentiles.winsorize_vec <-function(x, probs =c(0.01, 0.99), na.rm =TRUE) {stopifnot(length(probs) ==2, probs[1] <= probs[2]) q <- stats::quantile(x, probs = probs, na.rm = na.rm, names =FALSE) x <-pmin(pmax(x, q[1]), q[2]) x}```Define some custom CSS styles for the tables in the rendered HTML. Notably, rendering this document can produce an HTML page containing formatted outputs (tables and figures), while such outputs can also be exported as LaTeX files and image files to be used elsewhere (e.g., Overleaf, Word).```{r}#| label: define-css#| code-summary: Define CSS for table formatting. #| code-fold: true table_css <-".regtable table { border-collapse: collapse; /* removes default cell gaps */ border-spacing: 0; font-size: 0.9rem; /* relative, responsive */}.regtable caption { caption-side: top; font-weight: bold; color: black;}.regtable th,.regtable td { font-weight: normal; /* normal weight for data cells */ line-height: 1; /* tight vertical spacing */ padding: 0.2em 0.4em; /* minimal padding for readability */ white-space: nowrap; /* no wrapping */ margin: 0; /* avoid extra vertical gaps */}"```## DataData used in an empirical research project can be retrieved from either some remote sources (e.g., web pages, database like WRDS) or from local files, depending on availability. In this section, we load data from WRDS or from local storage in the `data` directory.### Fetch from WRDS::: {.callout-note}Before running the code chunks in this section, ensure that you have followed the guide on WRDS to set up your local R environment for database connection.:::Connect to WRDS database. We will use `wrds` to access the data. However, if `force_data_download` is set to `FALSE`, we will attempt to load data from local files first.```{r}#| label: connect-wrds#| code-summary: Connect to WRDS database.wrds <-NULLif (force_data_download) { wrds <-dbConnect(Postgres(),host ="wrds-pgdata.wharton.upenn.edu",port =9737,dbname ="wrds",sslmode ="require",user =Sys.getenv("WRDS_USERNAME") )}```::: {.callout-note}My WRDS username is stored in the environment variable `WRDS_USERNAME`. You can do the same or simply replace `Sys.getenv("WRDS_USERNAME")` with your actual WRDS username.:::#### CompustatTry to load Compustat Fundamentals Annual (funda) data from local storage first. If the local file does not exist, we download the data from WRDS and save it locally for future use.```{r}#| label: load-compustat-funda#| code-summary: Load Compustat Funda data.local_path <-here("data", "comp.funda.rds")sql <-glue(" select * from comp.funda where extract(year from datadate) between {year_start} and {year_end} and indfmt = 'INDL' and datafmt = 'STD' and popsrc = 'D' and consol = 'C'")comp.funda <-load_data(local_path, wrds, sql) ```Following @pflueger2020financial, for firms that change fiscal year within a calendar year, we take the last reported date when extracting financial data. This leaves us with one set of observations for each firm (`gvkey`) in each year.```{r}#| label: remove-duplicate-dates#| code-summary: Remove duplicate fiscal year-end dates.comp.funda <- comp.funda |>group_by(gvkey, fyear) |>filter(datadate ==max(datadate, na.rm =TRUE)) |>ungroup()```#### CRSPNeed the CRSP-Compustat Merged Link Table to link Compustat `gvkey` with CRSP `permno` and `permco`.```{r}#| label: load-crsp-link-history#| code-summary: Load CRSP Link History data.local_path <-here("data", "crsp.ccmxpf_lnkhist.rds")sql <-" select * from crsp.ccmxpf_lnkhist where linktype in ('LC', 'LU') and linkprim in ('C', 'P')"crsp.ccmxpf_lnkhist <-load_data(local_path, wrds, sql)```Because Compustat records only the *current* company information. We need the historical SIC code `hsiccd`, which is only available in CRSP. ```{r}#| label: load-crsp-msenames#| code-summary: Load CRSP Msenames data.local_path <-here("data", "crsp.msenames.rds")sql <-" select * from crsp.msenames where shrcd in (10, 11)"crsp.msenames <-load_data(local_path, wrds, sql)```::: {.callout-note}`shrcd` in (10, 11) ensures common stocks.:::Later, we will merge `crsp.msenames` and `comp.funda` to add the historical SIC code.Download daily stock return data from CRSP.```{r}#| label: load-crsp-daily-returns#| code-summary: Load stock return data.local_path <-here("data", "crsp.dsf.rds")sql <-glue(" select date, permno, ret from crsp.dsf where extract(year from date) between {year_start} and {year_end}")crsp.dsf <-load_data(local_path, wrds, sql)```Housekeeping.```{r}#| message: false#| label: housekeeping#| code-fold: true #| code-summary: Clear up temporary variables and disconnect from WRDSrm(local_path, sql)if (!is.null(wrds)) {dbDisconnect(wrds)}gc() # garbage collection```### Load from Local Storage::: {.callout-note}If we have other datasets stored locally, we can load them here.:::## SampleAfter loading the necessary datasets, we proceed to create the sample for analysis.### Merge Compustat and CRSP DataWe first merge Compustat and CRSP data using the link table `crsp.ccmxpf_lnkhist`. The linking is based on `gvkey` and the date `datadate` falling between the link start date `linkdt` and link end date `linkenddt`.```{r}#| label: merge-compustat-crsp#| code-summary: Merge Compustat and CRSP data.ccm <-inner_join( comp.funda, crsp.ccmxpf_lnkhist |>mutate(linkdt =coalesce(linkdt, as.Date("1900-01-01")),linkenddt =coalesce(linkenddt, as.Date("9999-12-31")) ),join_by( gvkey == gvkey,between(datadate, linkdt, linkenddt) ))```Next, we add the historical SIC code `hsiccd` from `crsp.msenames`. The linking is based on `permno` and the date `datadate` falling between the name start date `namedt` and name end date `nameendt`.```{r}#| label: add-hsiccd#| code-summary: Add historical SIC code.ccm <-left_join( ccm, crsp.msenames |>select(permno, permco, namedt, nameendt, hsiccd),join_by( lpermno == permno,between(datadate, namedt, nameendt) ))```Before proceeding, we verify that we have unique `gvkey`-`permco` and `gvkey`-`permno` links for each date `datadate`.```{r message=FALSE}#| label: verify-unique-links#| code-summary: Verify that we have unique gvkey-permco and gvkey-permno links.check_duplicates(ccm, c("gvkey", "datadate"))check_duplicates(ccm, c("lpermco", "gvkey", "datadate"))check_duplicates(ccm, c("lpermno", "gvkey", "datadate"))```### Create the SampleAfter all data is fetched and loaded, we create the sample for analysis. #### Stock Return VolatilityCompute the standard deviation of daily stock returns over the past 12 months for each firm-year observation in `ccm`. We require at least 60 daily return observations within the 12-month window to compute the standard deviation; otherwise, we set it to missing.```{r}#| label: compute-ret-sd#| code-summary: Compute the volatility of daily returns over the past 12 months.ccm_returns <- ccm |># create a lookup table of firm-year (permno, datadate) with the 12-month windowselect(gvkey, lpermno, datadate) |>rename(permno = lpermno) |>mutate(window_start = datadate %m-%months(12),window_end = datadate ) |># join with daily returns to get daily returns in the past 12 monthsleft_join( crsp.dsf,join_by(permno, between(y$date, x$window_start, x$window_end)) ) |>group_by(gvkey, datadate) |>summarise(n_obs =sum(!is.na(ret)),ret_sd_12m =ifelse(n_obs >=60, sd(ret, na.rm =TRUE), NA_real_),.groups ="drop" ) |># keep only gvkey, datadate, ret_sd_12mselect(gvkey, datadate, ret_sd_12m)# Merge back to ccmccm <-left_join(ccm, ccm_returns, by =c("gvkey", "datadate"))# Remove temporary data framerm(ccm_returns)```#### Final Sample CreationWe now create the final sample for analysis.```{r}#| label: create-sample#| code-summary: Create the sample.df_firm <- ccm |># Create new variablesmutate(# sic2: 2-digit SIC codesic2 =floor(hsiccd /100), # size: log of total assetssize =log(at),# mktcap: market capitalizationmktcap =log(abs(prcc_f * csho)),# seq: stockholder's equityseq =coalesce(seq, ceq + pstk, at - lt - mib),# pref: preferred stockpref =coalesce(pstkrv, pstkl, pstk, 0),# be: book equitybe = seq - pref,# tangibilitytangibility =coalesce(ppent, 0) / at,# market-to-book ratiomtb = (prcc_f * csho) / be,# ROAroa = ib / at,# lev: leveragelev = (dltt + dlc) / at,# Altman Z-scorez_score = (3.3* ebit + sale +1.4* re +1.2* (act - lct)) / at +0.6* (prcc_f * csho / lt),# CF is operating income before depreciation scaled by total assetscf =coalesce(oibdp, 0) / at,# Capital expenditures (capex) is capital expenditures scaled by total assetscapex =coalesce(capx, 0) / at,# R&D is research and development expenses scaled by total assetsrd =coalesce(xrd, 0) / at, ) |># Remove rows without historical SIC codefilter(!is.na(hsiccd)) |># Remove rows without total assetsfilter(!is.na(at)) |># Remove rows without permco or permnofilter(!is.na(lpermco) &!is.na(lpermno)) |># Keep non-financial and non-utility firmsfilter(!(remove_financial_firms & sic2 %in%c(60:69))) |>filter(!(remove_utility_firms & sic2 %in%c(49))) |>filter(# Remove obs with negative book equity be >0,# Remove obs with negative market-to-book ratio mtb >0,# Remove obs with negative or zero market capitalization mktcap >0,# Remove obs with R&D expenses greater than total assets rd <=1 )```Define variable labels.```{r}#| label: variable-labels#| code-summary: Variable labels for the analysis.label_map <-list(size ="Size",mktcap ="Market capitalization",tangibility ="Asset tangibility",roa ="ROA",mtb ="Market-to-book",lev ="Leverage",z_score ="Altman Z-score",capex ="Capital expenditures",rd ="R&D",cf ="Cash flow",ret_sd_12m ="Stock return volatility")# list of variables to be used in the analysisvars <-names(label_map)# Set variable labelsvar_label(df_firm) <- label_map```Finally, winsorize variables to reduce the impact of outliers. Specifically, we winsorize all continuous variables at the 1st and 99th percentiles.```{r}#| label: winsorize-variables#| code-summary: Winsorize variables to reduce the impact of outliers.df_firm <- df_firm |># Keep only rows with non-missing values for all varsfilter(if_all(all_of(vars), ~!is.na(.))) |># Winsorize variables at the 1st and 99th percentilesmutate(across(all_of(vars), \(x) winsorize_vec(x, probs =c(0.01, 0.99))) )```### Summary StatisticsThe summary statistics for the final sample are presented in the table below. ```{r}#| label: summary-statistics#| code-fold: true#| code-summary: Summary statistics for the analysis.#| output: asis modelsummary::datasummary(as.formula(paste(paste(vars, collapse =" + "),"~ N + Mean + SD + Min + P25 + Median + P75 + Max" )),data = df_firm,title ="Summary Statistics",fmt =3,notes ="Continuous variables are winsorized at the 1st and 99th percentiles.") |>theme_html(class="table table-borderless regtable", css_rule=table_css)```Visualization is also a powerful tool to understand the data. The following figure shows the time series of average capital expenditure and the number of firms in the sample.```{r}#| label: plot-capex-n-firms#| code-fold: true #| code-summary: Plot time series of average capital expenditure and number of firms.df_capex <- df_firm |>group_by(fyear) |>summarise(mean_capex =mean(capex, na.rm =TRUE),n_firms =n_distinct(gvkey) )ggplot(df_capex, aes(x = fyear)) +geom_col(aes(y = n_firms /max(n_firms) *max(mean_capex) *0.3), fill ="#bdbdbd", width =0.8, alpha =0.6) +geom_line(aes(y = mean_capex), color ="#A6192E", linewidth =1.2) +scale_y_continuous(name ="Average Capital Expenditure",sec.axis =sec_axis(transform =~ . / (max(df_capex$mean_capex, na.rm =TRUE) *0.3) *max(df_capex$n_firms),name ="Number of Firms" ) ) +labs(title ="Time Series of Average Capital Expenditure and Number of Firms",x ="Year" ) +theme_minimal()```## Empirical Analysis### Baseline Contemporaneous ModelIn practice, our research questions vary. For illustration, we examine the relationship between capital expenditures and cash flow, controlling for a set of other firm characteristics.[^research-question][^research-question]: This is just a hypothetical example. Let's not get too caught up by whether this is a good research question or not.Specifically, we estimate the following regression model:$$\begin{equation}\textit{CAPEX}_{it} = \beta \textit{CF}_{it} + \gamma X_{it} + \theta_t + \lambda_i + \varepsilon_{it}\end{equation},$$ {#eq-capex-regression}where $\textit{CAPEX}_{it}$ is capital expenditures scaled by total assets for firm $i$ in year $t$, $\textit{CF}_{it}$ is cash flow scaled by total assets, $X_{it}$ is a vector of control variables including firm size, market-to-book ratio, leverage, ROA, asset tangibility, R&D expenses, and Altman Z-score. $\theta_t$ and $\lambda_i$ represent year and firm fixed effects, respectively. Standard errors are clustered at the firm level.::: {.callout-note}I will use `fixest` to estimate the model and `modelsummary` to create the regression table.:::Below are several examples of how to present regression results using `modelsummary`, from a minimal example to more complex customizations.::: {.column-page-inset}::: {.panel-tabset}## Minimal Example```{r}#| label: baseline-model#| code-fold: true#| code-summary: Minimal regression table.#| output: asis models <-list("Model 1"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd | fyear + gvkey,data = df_firm,cluster =~gvkey ),"Model 2"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,data = df_firm,cluster =~gvkey ))# Report results using modelsummarymodelsummary( models,title ="Capital Expenditures and Cash Flow",stars =c("*"=0.1, "**"=0.05, "***"=0.01),note ="Robust standard errors in parentheses.",) |>theme_html(class="table table-borderless regtable", css_rule=table_css)```## Variable Labels and OrderingUse the `coef_map` argument in the `modelsummary` function to apply the variable labels and ordering.```{r}#| label: baseline-model-with-variable-labels-and-ordering#| code-fold: true#| code-summary: Show variable labels instead of raw variable names.#| output: asis models <-list("Model 1"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd | fyear + gvkey,data = df_firm,cluster =~gvkey ),"Model 2"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,data = df_firm,cluster =~gvkey ))# Sort `label_map` to have `cf` to be the first varlabel_map <- label_map[c("cf", setdiff(names(label_map), "cf"))]# Report results using modelsummarymodelsummary( models,title ="Capital Expenditures and Cash Flow",coef_map = label_map, # Use label_map to rename, reorder, and subset reported varsstars =c("*"=0.1, "**"=0.05, "***"=0.01),note ="Robust standard errors in parentheses.",) |>theme_html(class="table table-borderless regtable", css_rule=table_css)```## Customize the GoF PanelI will customize the table's bottom goodness of fit (GoF) panel in a way that I find most useful:- Report fixed effects included in each model.- Report the clustering information.- Report the mean of the dependent variable.- Report the number of observations and adjusted R-squared.::: {.callout-note}I use custom functions to extract and format the information.:::```{r}#| label: baseline-model-gof#| code-fold: true#| code-summary: Customize GoF panel.#| output: asis models <-list("Model 1"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd | fyear + gvkey,data = df_firm,cluster =~gvkey ),"Model 2"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,data = df_firm,cluster =~gvkey ),"Model 3"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + sic2,data = df_firm,cluster =~gvkey ))# Extract clustering variablesget_cluster_vars <-function(x) {# assert x is a fixest objectif (!inherits(x, "fixest")) stop("Input must be a fixest object") typ <-attr(x$se, "type")if (is.null(typ) ||!nzchar(typ)) return(NULL)if (!grepl("\\(", typ)) return(NULL) inside <-sub(".*\\((.*)\\).*", "\\1", typ) vars <-trimws(unlist(strsplit(inside, "[,&+]"))) vars[nzchar(vars)]}# Pretty print clustering variables# Rename `gvkey` to `Firm`, `fyear` to `Year`, and `sic2` to `Industry`pretty_cluster <-function(vars) { map <-c(gvkey ="Firm", fyear ="Year", sic2 ="Industry") pretty <-ifelse(vars %in%names(map), unname(map[vars]), vars)paste(pretty, collapse =" & ")}# Custom glance method for fixest objectsglance_custom.fixest <-function(x) { out <-list()# Fixed effectsfor (v in x$fixef_vars) { out[[paste0("FE: ", v)]] <-"✓" }# Cluster info cl_vars <-get_cluster_vars(x)if (length(cl_vars)) { out[["vcov.type"]] <-paste("by", pretty_cluster(cl_vars)) } else { out[["vcov.type"]] <-"Not clustered" }# Mean of dependent variable dv <- insight::get_response(x) dv <-mean(dv, na.rm =TRUE) out[["Mean dep. var."]] <- dvreturn(data.frame(out, check.names =FALSE))}# Format numbers with commasf_comma <-function(x) format(round(x, 3), big.mark =",")# This list must have an element for each statistics we want to print. # Each element must be a named list with exactly three elements:# raw: a string with the name of a column produced by get_gof(model).# clean: a string with the “clean” name of the statistic you want to appear in your final table.# fmt: a string, numeric value, or function which will be used to round/format the string in question.gm <-list(list("raw"="FE: gvkey", "clean"="Firm fixed effects", "fmt"=0),list("raw"="FE: sic2", "clean"="Industry fixed effects", "fmt"=0),list("raw"="FE: fyear", "clean"="Year fixed effects", "fmt"=0),list("raw"="nobs", "clean"="N", "fmt"= f_comma),list("raw"="adj.r.squared", "clean"="Adjusted R<sup>2</sup>", "fmt"=3),list("raw"="Mean dep. var.", "clean"="Mean of dep. var.", "fmt"=3),list("raw"="vcov.type", "clean"="Standard errors clustering", "fmt"=0))# Report results using modelsummarymodelsummary( models,title ="Capital Expenditures and Cash Flow",coef_map = label_map, # Use label_map to rename, reorder, and subset reported varsgof_map = gm,stars =c("*"=0.1, "**"=0.05, "***"=0.01),note ="Robust standard errors in parentheses.",) |>theme_html(class="table table-borderless regtable", css_rule=table_css)```## Group ModelsWe can also group the models when presenting the results.```{r}#| label: baseline-model-group#| code-fold: true#| code-summary: Group models#| output: asis models <-list("Model 1"=feols( capex ~ cf | fyear + gvkey,data = df_firm,cluster =~gvkey ),"Model 2"=feols( capex ~ cf | fyear + sic2,data = df_firm,cluster =~gvkey ),"Model 3"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,data = df_firm,cluster =~gvkey ),"Model 4"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + sic2,data = df_firm,cluster =~gvkey ))# Report results using modelsummarymodelsummary( models,title ="Capital Expenditures and Cash Flow",coef_map = label_map, # Use label_map to rename, reorder, and subset reported varsgof_map = gm,stars =c("*"=0.1, "**"=0.05, "***"=0.01),note ="Robust standard errors in parentheses.",) |>group_tt(j =list("Without Controls"=c(2,3),"With Controls"=c(4,5) )) |>theme_html(class="table table-borderless regtable", css_rule=table_css)```## Column NumberingIf the `models` list is *unnamed*, `modelsummary` will number the models automatically.```{r}#| label: baseline-model-column-numbering#| code-fold: true#| code-summary: Number models#| output: asis models <-list(feols( capex ~ cf | fyear + gvkey,data = df_firm,cluster =~gvkey ),feols( capex ~ cf | fyear + sic2,data = df_firm,cluster =~gvkey ),feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,data = df_firm,cluster =~gvkey ),feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + sic2,data = df_firm,cluster =~gvkey ))# Report results using modelsummarymodelsummary( models,title ="Capital Expenditures and Cash Flow",coef_map = label_map,gof_map = gm,stars =c("*"=0.1, "**"=0.05, "***"=0.01),note ="Robust standard errors in parentheses.",) |>group_tt(j =list("Without Controls"=c(2,3),"With Controls"=c(4,5) )) |>theme_html(class="table table-borderless regtable", css_rule=table_css)```## Column Numbering and NamingIf the `models` list is *named*, we can manually add column numbers.```{r}#| label: baseline-model-column-numbering-naming#| code-fold: true#| code-summary: Number and name models#| output: asis models <-list("Model 1"=feols( capex ~ cf | fyear + gvkey,data = df_firm,cluster =~gvkey ),"Model 2"=feols( capex ~ cf | fyear + sic2,data = df_firm,cluster =~gvkey ),"Model 3"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + gvkey,data = df_firm,cluster =~gvkey ),"Model 4"=feols( capex ~ cf + size + mtb + lev + roa + tangibility + rd + z_score | fyear + sic2,data = df_firm,cluster =~gvkey ))n_cols <-length(models)# Report results using modelsummarytable <-modelsummary( models,title ="Capital Expenditures and Cash Flow",coef_map = label_map,gof_map = gm,stars =c("*"=0.1, "**"=0.05, "***"=0.01),note ="Robust standard errors in parentheses.", ) |>group_tt(j =setNames(as.list(2:(n_cols +1)), paste0("(", seq_len(n_cols), ")"))) |>group_tt(j =list("Dependent Variable: CAPEX"=c(2:(n_cols +1)) ))table |>theme_html(class="table table-borderless regtable", css_rule=table_css)```:::<!-- end of panel -->:::<!-- end of .column-screen-inset-shaded -->The tables presented here are in HTML format but can also be easily exported to other formats such as LaTeX.[^tinytable][^tinytable]: See [documentation of `tinytable`](https://vincentarelbundock.github.io/tinytable/man/save_tt.html) for details, especially the required preamble.::: {.callout-important}We need to tweak the table formatting for LaTeX export.For example, the label of variable `R&D` contains a special character `&`, which needs to be escaped in LaTeX. We can use the `escape = TRUE` argument in `modelsummary` to handle this. Also, the term `Adjusted R<sup>2</sup>` in the GoF panel contains HTML code, which needs to be replaced with `Adjusted R\textsuperscript{2}` or `Adjusted $R^2$` for LaTeX. We can modify the `clean` element of the corresponding item in `gm` to achieve this.Finally, the checkmark symbol `✓` used to indicate the inclusion of fixed effects is a Unicode character that may not render correctly in LaTeX. We can modify the `glance_custom.fixest` function to return `\checkmark` instead.:::```{r}#| label: prepare-latex-export#| code-fold: true#| code-summary: Modify various formatting for LaTeX export.#| output: asis # Update label_map and gm for LaTeX exportlabel_map[["rd"]] <-"R\\&D"var_label(df_firm) <- label_mapgm <-list(list("raw"="FE: gvkey", "clean"="Firm fixed effects", "fmt"=0),list("raw"="FE: sic2", "clean"="Industry fixed effects", "fmt"=0),list("raw"="FE: fyear", "clean"="Year fixed effects", "fmt"=0),list("raw"="nobs", "clean"="N", "fmt"= f_comma),list("raw"="adj.r.squared", "clean"="Adjusted $R^2$", "fmt"=3),list("raw"="Mean dep. var.", "clean"="Mean of dep. var.", "fmt"=3),list("raw"="vcov.type", "clean"="Standard errors clustering", "fmt"=0))# Update to use LaTeX checkmark instead of unicodeglance_custom.fixest <-function(x) { out <-list()# Fixed effectsfor (v in x$fixef_vars) { out[[paste0("FE: ", v)]] <-"\\checkmark" }# Cluster info cl_vars <-get_cluster_vars(x)if (length(cl_vars)) { out[["vcov.type"]] <-paste("by", pretty_cluster(cl_vars)) } else { out[["vcov.type"]] <-"Not clustered" }# Mean of dependent variable dv <- insight::get_response(x) dv <-mean(dv, na.rm =TRUE) out[["Mean dep. var."]] <- dvreturn(data.frame(out, check.names =FALSE))}```The code exports the regression table to a LaTeX file. Note that `escape = TRUE` is used to ensure special characters are properly formatted for LaTeX.```{r}#| label: export-regression-table#| code-summary: Export regression table to LaTeX.output_path <-here("output", "regression_table.tex")modelsummary( models,title ="Capital Expenditures and Cash Flow",coef_map = label_map,gof_map = gm,stars =c("*"=0.1, "**"=0.05, "***"=0.01),note ="Robust standard errors in parentheses.",escape =TRUE) |>group_tt(j =setNames(as.list(2:(n_cols +1)), paste0("(", seq_len(n_cols), ")"))) |>group_tt(j =list("Dependent Variable: CAPEX"=c(2:(n_cols +1)) )) |>save_tt(output_path, overwrite =TRUE)```The output is```{latex filename="output/regression_table.tex"}#| label: raw-latex-regression-table#| code-fold: true #| code-summary: Raw LaTeX content of `output/regression_table.tex`.{{< include output/regression_table.tex >}}```The rendered PDF is shown below. Note that we need to include several LaTeX packages in the preamble to ensure proper rendering of the table.::: {.panel-tabset}## Rendered PDFThis PDF can be [downloaded here](./output/table.pdf). The table is surely of publication quality.{height=850 width=100%}## Raw LaTeX```latex\documentclass{article}% For \checkmark\usepackage{amssymb}% For tinytable export\usepackage{tabularray}\usepackage{float}\usepackage{graphicx}\usepackage{rotating}\usepackage[normalem]{ulem}\UseTblrLibrary{booktabs}\UseTblrLibrary{siunitx}\newcommand{\tinytableTabularrayUnderline}[1]{\underline{#1}}\newcommand{\tinytableTabularrayStrikeout}[1]{\sout{#1}}\NewTableCommand{\tinytableDefineColor}[3]{\definecolor{#1}{#2}{#3}}\begin{document}\input{output/regression_table}\end{document}```:::<!-- end of tabset -->### Panel with Lagged VariablesVery often in empirical finance, we need to control for lagged independent variables (e.g., lagged firm characteristics). For example, we can estimate the following model:$$\begin{equation}\textit{CAPEX}_{it} = \beta \textit{CF}_{i,t-1} + \gamma X_{i,t-1} + \theta_t + \lambda_i + \varepsilon_{it}\end{equation}.$$ {#eq-capex-regression-lagged}@eq-capex-regression-lagged includes lagged values of the independent variables instead of contemporaneous values in @eq-capex-regression. Before proceeding, we need to set the panel.```{r}#| label: create-panelpanel_sample =panel(df_firm, ~gvkey+fyear)``````{r}#| label: restore-html-formatting#| code-fold: true#| code-summary: Restore HTML formatting for variable labels and GoF panel.label_map[["rd"]] <-"R&D"var_label(df_firm) <- label_mapgm <-list(list("raw"="FE: gvkey", "clean"="Firm fixed effects", "fmt"=0),list("raw"="FE: sic2", "clean"="Industry fixed effects", "fmt"=0),list("raw"="FE: fyear", "clean"="Year fixed effects", "fmt"=0),list("raw"="nobs", "clean"="N", "fmt"= f_comma),list("raw"="adj.r.squared", "clean"="Adjusted R-squared", "fmt"=3),list("raw"="Mean dep. var.", "clean"="Mean of dep. var.", "fmt"=3),list("raw"="vcov.type", "clean"="Standard errors clustering", "fmt"=0))# Update to use LaTeX checkmark instead of unicodeglance_custom.fixest <-function(x) { out <-list()# Fixed effectsfor (v in x$fixef_vars) { out[[paste0("FE: ", v)]] <-"Y" }# Cluster info cl_vars <-get_cluster_vars(x)if (length(cl_vars)) { out[["vcov.type"]] <-paste("by", pretty_cluster(cl_vars)) } else { out[["vcov.type"]] <-"Not clustered" }# Mean of dependent variable dv <- insight::get_response(x) dv <-mean(dv, na.rm =TRUE) out[["Mean dep. var."]] <- dvreturn(data.frame(out, check.names =FALSE))}``````{r}#| label: estimate-model-with-lagged-variables#| code-fold: true#| code-summary: Estimate model with lagged variables.# Variables to laglag_vars <-c("cf", "size", "mtb", "lev", "roa", "tangibility", "rd", "z_score")fml <-as.formula(paste0(# dependent"capex ~ ",# lagged variablespaste0("l(", lag_vars, ",1)", collapse =" + "),# fixed effects" | fyear + gvkey" ) )models <-list(feols( capex ~l(cf,0:1) | fyear + gvkey,data = panel_sample,cluster =~gvkey ),feols(fml,data = panel_sample,cluster =~gvkey ))```Below are examples of presenting regression results using `modelsummary`, from a minimal example to more complex customizations.::: {.column-page-inset}::: {.panel-tabset}## Minimal Example```{r}#| label: panel-lagged#| code-fold: true #| code-summary: Panel data with lagged variables. No variable labelling.# Report results using modelsummarytable <-modelsummary( models,title ="Capital Expenditures and Cash Flow",gof_map = gm,stars =c("*"=0.1, "**"=0.05, "***"=0.01),note ="Robust standard errors in parentheses.", ) |>group_tt(j =list("Dependent Variable: CAPEX"=c(2:(length(models) +1)) ))table |>theme_html(class="table table-borderless regtable", css_rule=table_css)```## With Variable Labelling and Ordering```{r}#| label: panel-lagged-var-label#| code-fold: true #| code-summary: Lagged variables with variable labelling and ordering.# Create a new label_map for lagged variableslabel_map_lagged <-list()# Add lagged variable labels# The order here determines the display order in the tablelabel_map_lagged[["cf"]] <-"Cash flow (t)"label_map_lagged[["l(cf, 1)"]] <-"Cash flow (t-1)"label_map_lagged[["l(size, 1)"]] <-"Size (t-1)"label_map_lagged[["l(mtb, 1)"]] <-"Market-to-book (t-1)"label_map_lagged[["l(lev, 1)"]] <-"Leverage (t-1)"label_map_lagged[["l(roa, 1)"]] <-"ROA (t-1)"label_map_lagged[["l(tangibility, 1)"]] <-"Asset tangibility (t-1)"label_map_lagged[["l(rd, 1)"]] <-"R&D (t-1)"label_map_lagged[["l(z_score, 1)"]] <-"Altman Z-score (t-1)"# Report results using modelsummarytable <-modelsummary( models,title ="Capital Expenditures and Cash Flow",coef_map = label_map_lagged,gof_map = gm,stars =c("*"=0.1, "**"=0.05, "***"=0.01),note ="Robust standard errors in parentheses.", ) |>group_tt(j =list("Dependent Variable: CAPEX"=c(2:(length(models) +1)) ))table |>theme_html(class="table table-borderless regtable", css_rule=table_css)```:::<!-- end of tabset -->:::<!-- end of .column-screen-inset-shaded -->## ConclusionIn this example template for a research workflow in R, I showcase various steps in the research process from data collection, manipulation, model estimation, to result presentation.While I initially planned to cover a more comprehensive set of empirical analyses, I realized that that this document would be bloated if I included too many examples.I will add more examples of various statistical analysis and tabulation with different customizations in separate posts. Of course, feedback and suggestions are always welcome!