
Research Notes

Identify Retail Investors

Retail investors and their trading behaviour attract considerable research interest. One strand of the literature uses proprietary datasets to identify retail investors; the other relies on algorithms. A recent JF paper, Boehmer et al. (2021), proposes a simple algorithm based only on the trade price that also signs the trade direction effectively. Even more interestingly, I just read a follow-up forthcoming in the JF, Barber et al. (2023), in which the authors placed 85,000 retail trades themselves to validate the Boehmer et al. (2021) algorithm.
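The Boehmer et al. (2021) rule keys off sub-penny price improvement: retail marketable orders are typically filled by wholesalers at a fraction of a cent better than the quote. A minimal sketch of the rule, applied to a single trade price:

```python
def classify_retail(price: float) -> str:
    """Classify a trade with the Boehmer et al. (2021) sub-penny rule.

    Retail sells receive improvement just above the bid (price fraction
    of a penny in (0, 0.4)); retail buys receive improvement just below
    the ask (fraction in (0.6, 1)). Other prints are left unclassified.
    """
    z = (price * 100) % 1  # fraction of a penny
    if 0 < z < 0.4:
        return "retail sell"
    if 0.6 < z < 1:
        return "retail buy"
    return "unclassified"  # on-penny, mid-penny, or ambiguous
```

For example, a print at $10.0001 (just above the $10.00 bid) is signed as a retail sell, while $10.0099 (just below the $10.01 ask) is a retail buy.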

Translog Production and Cost Functions

In this post, I'll carefully explain the derivation of the cost function from a CES production function, as well as the derivations of the translog (transcendental logarithmic) production and cost functions.

```mermaid
flowchart TB
    subgraph Production
    A[Production Function] -. approximation .-> D(Translog Production Function)
    end
    subgraph Cost
    B[Cost Function] -. approximation .-> C(Translog Cost Function)
    end
    A == Conversion via Duality ==> B
```

Before I start, the chart above illustrates the relations. Specifically, we can derive the cost function from a CES production function via the duality theorem. The translog production and translog cost functions are Taylor-expansion approximations to the production function and the corresponding cost function, respectively.
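Concretely, the translog production function is the second-order Taylor expansion of $\ln Q$ in the logs of the inputs:

```latex
\ln Q = \alpha_0 + \sum_i \alpha_i \ln x_i
      + \frac{1}{2} \sum_i \sum_j \beta_{ij} \ln x_i \ln x_j,
\qquad \beta_{ij} = \beta_{ji},
```

and the translog cost function takes the same quadratic-in-logs form with input prices $w_i$ and output $Q$ as arguments.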

Difference-in-Differences Estimation

Empirical researchers have long used difference-in-differences (DiD) estimation to identify an event's average treatment effect on the treated (ATT). This post is a non-technical note on my understanding of the DiD approach as it has evolved over the past years, especially the problems and solutions that arise when multiple treatment events are staggered.
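The workhorse behind DiD is the two-way fixed effects regression:

```latex
y_{it} = \alpha_i + \lambda_t + \beta\, D_{it} + \varepsilon_{it},
```

where $\alpha_i$ and $\lambda_t$ are entity and time fixed effects and $D_{it}$ indicates treated entities after treatment. In the canonical two-group, two-period design $\hat\beta$ recovers the ATT; with staggered adoption and heterogeneous effects it can be a poorly weighted average of comparisons, which is what the newer estimators address.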

Correlated Random Effects

Can we estimate the coefficient of gender while controlling for individual fixed effects? This sounds impossible, as an individual's gender typically does not vary and hence would be absorbed by individual fixed effects. However, Correlated Random Effects (CRE) may actually help.

At last year's FMA Annual Meeting, I learned this CRE estimation technique while discussing a paper titled "Gender Gap in Returns to Publications" by Piotr Spiewanowski, Ivan Stetsyuk and Oleksandr Talavera. Let me recollect and summarize the technique in this post.
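In a nutshell, CRE (the Mundlak device) models the correlation between the individual effect and the covariates by adding individual means of the time-varying regressors, so a time-invariant variable like gender stays identified:

```latex
y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \gamma\,\text{female}_i
       + \bar{\mathbf{x}}_i'\boldsymbol{\delta} + a_i + \varepsilon_{it},
```

where $\bar{\mathbf{x}}_i$ is the within-individual mean of $\mathbf{x}_{it}$. The $\boldsymbol{\beta}$ estimates coincide with the fixed-effects (within) estimates, while $\gamma$ remains estimable.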

Adding Another Factor to Principal-Agent Model

In a traditional principal-agent model, firm output is a function of the agent's effort, and the principal observes only the output, not the agent's effort. The principal carefully designs the agent's compensation package, especially the sensitivity of the agent's pay to firm output, to maximize firm value. Now, what if we add another factor to the relationship between firm output and the agent's effort? How would the optimal pay sensitivity change?
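As a benchmark (the standard linear-contract setup with CARA utility, not necessarily the exact model in the post): output is $\pi = e + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$, pay is $w = \alpha + \beta\pi$, the agent has risk aversion $r$ and effort cost $\tfrac{c}{2}e^2$. The optimal pay-performance sensitivity is

```latex
\beta^{*} = \frac{1}{1 + r c \sigma^{2}},
```

so anything that raises the noise $\sigma^2$ in the output signal lowers the optimal sensitivity, which is the margin the added factor operates on.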

Estimate Organization Capital

As in Eisfeldt and Papanikolaou (2013), we obtain firm-year accounting data from Compustat and compute each firm's stock of organization capital (OC) using the perpetual inventory method, which recursively builds up the stock of OC by accumulating the deflated value of SG&A expenses.

Download M&A Deals from SDC Platinum

The Thomson One Banker SDC Platinum database provides comprehensive M&A transaction data from the early 1980s and is perhaps the most widely used M&A database in the world.

This post documents the steps of downloading M&A deals from the SDC Platinum database. Specifically, I show how to download the complete M&A data where:

  • both the acquiror and the target are US firms,
  • the acquiror is a public firm or a private firm,
  • the target is a public firm, a private firm, or a subsidiary,
  • the deal value is at least $1m, and
  • the form of the deal is an acquisition, a merger, or an acquisition of majority interest.
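If the deals are exported to a flat file, the five filters above can be replicated in pandas. The column names and category labels below are placeholders; the actual SDC export labels will differ:

```python
import pandas as pd

def filter_ma_deals(deals: pd.DataFrame) -> pd.DataFrame:
    """Apply the five sample filters. All column names and category
    labels are hypothetical stand-ins for the SDC export's own."""
    mask = (
        (deals["acquiror_nation"] == "United States")
        & (deals["target_nation"] == "United States")
        & deals["acquiror_status"].isin(["Public", "Private"])
        & deals["target_status"].isin(["Public", "Private", "Subsidiary"])
        & (deals["deal_value_musd"] >= 1)
        & deals["form"].isin(["Acquisition", "Merger", "Acq. Maj. Int."])
    )
    return deals[mask]
```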

Specification Curve Analysis


More often than not, empirical researchers need to argue that their chosen model specification is the right one. If not, they need to run a battery of tests on alternative specifications and report them. The problem is that a paper can fit at best a few tables, each with a few models, and it is extremely hard for readers to know whether the reported results are cherry-picked.

So, why not run all possible model specifications and find a concise way to report them all?
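That is the idea of a specification curve: estimate the coefficient of interest under every subset of controls and plot the sorted estimates. A minimal sketch with plain OLS via NumPy:

```python
from itertools import combinations

import numpy as np

def specification_curve(y, x, controls):
    """Estimate the coefficient on x under every subset of controls.

    y: (n,) outcome; x: (n,) variable of interest; controls: dict
    mapping name -> (n,) array. Returns (subset, beta) pairs sorted
    by the estimate -- the specification curve.
    """
    names = list(controls)
    results = []
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            cols = [np.ones_like(y), x] + [controls[c] for c in subset]
            X = np.column_stack(cols)
            beta = np.linalg.lstsq(X, y, rcond=None)[0]
            results.append((subset, beta[1]))  # coefficient on x
    return sorted(results, key=lambda r: r[1])
```

With $m$ candidate controls this runs $2^m$ regressions; a real application would also vary samples, outcomes, and estimators, and plot the sorted coefficients with their confidence intervals.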

Firm Historical Headquarter State from SEC 10K/Q Filings

Why the need to use SEC filings?

In the Compustat database, a firm's headquarters state (and other identifying information) is in fact the current record. This means that once a firm relocates (or updates its state of incorporation, address, etc.), all historical observations are overwritten and no longer record the historical state information.

To resolve this issue, an effective way is to use the firm's historical SEC filings. You can follow my previous post Textual Analysis on SEC filings to extract the header information, which includes a wide range of metadata. Alternatively, the University of Notre Dame's Software Repository for Accounting and Finance provides an augmented 10-X header dataset.
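A minimal sketch of pulling the state and ZIP code out of a filing's header, assuming the common `<SEC-HEADER>` layout where the `BUSINESS ADDRESS:` block carries `STATE:` and `ZIP:` tags (the mail address block is deliberately skipped):

```python
import re

def hq_state_zip(header_text: str):
    """Extract (state, zip) from the BUSINESS ADDRESS block of an
    EDGAR filing header; returns None if the block is missing."""
    block = re.search(
        r"BUSINESS ADDRESS:(.*?)(?:MAIL ADDRESS:|$)",
        header_text, re.S,
    )
    if not block:
        return None
    state = re.search(r"STATE:\s*(\S+)", block.group(1))
    zipcode = re.search(r"ZIP:\s*(\S+)", block.group(1))
    return (
        state.group(1) if state else None,
        zipcode.group(1) if zipcode else None,
    )
```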

2023 March Update

In this update I use 1,491,368 8-K filings of U.S. firms from 2004 to Dec 2022 and extract their HQ state and ZIP code.

Compute Jackknife Coefficient Estimates in SAS

In certain scenarios, we want to estimate, for each observation, a model's parameters on the sample with that observation excluded. This can be achieved by repeatedly estimating the model on the leave-one-out samples, but that is very inefficient. If we instead estimate the model once on the full sample, the coefficient estimates will certainly be biased. Thankfully, the jackknife method corrects for the bias and produces the jackknifed coefficient estimates for each observation.
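For OLS, the leave-one-out estimates follow from a single full-sample fit via the hat-matrix identity $\beta_{(i)} = \hat\beta - (X'X)^{-1} x_i \hat e_i / (1 - h_i)$, so no refitting is needed. A NumPy sketch of this identity (the post itself does it in SAS):

```python
import numpy as np

def jackknife_betas(X, y):
    """Leave-one-out OLS coefficients for every observation, computed
    from one full-sample fit via the hat-matrix (DFBETA) identity."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverage h_i
    # Column i of (XtX_inv @ X.T) is (X'X)^{-1} x_i; scale by e_i/(1-h_i).
    adjust = (XtX_inv @ X.T) * (resid / (1 - h))
    return beta[None, :] - adjust.T  # row i = beta estimated without obs i
```

Each row of the result exactly matches re-estimating OLS with that observation deleted, at a fraction of the cost.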

Textual Analysis on SEC Filings

Nowadays top journals favour more granular studies. Sometimes it's useful to dig into the raw SEC filings and perform textual analysis. This note documents how I download all historical SEC filings via EDGAR and conduct some textual analyses.
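The bulk download starts from EDGAR's quarterly full index, a plain-text `form.idx` listing every filing. A small sketch of building the index URL and splitting one index line; the parsing assumes a single-token form type (multi-word types such as "NT 10-K" would need fixed-width parsing instead):

```python
def edgar_index_url(year: int, quarter: int) -> str:
    """URL of EDGAR's quarterly form index (plain-text form.idx)."""
    return (
        "https://www.sec.gov/Archives/edgar/full-index/"
        f"{year}/QTR{quarter}/form.idx"
    )

def parse_idx_line(line: str):
    """Split one data line of form.idx. Company names may contain
    spaces, so the right-hand fields are split off from the end."""
    head, cik, date, path = line.rsplit(None, 3)
    form, name = head.split(None, 1)
    return {"form": form, "company": name, "cik": cik,
            "date": date, "path": path}
```

Each parsed `path` is relative to `https://www.sec.gov/Archives/`, from which the raw filing can be fetched.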

Merge Compustat and CRSP

Using the CRSP/Compustat Merged Database (CCM) to extract data is one of the fundamental steps in most finance studies. Here I document several SAS programs for annual, quarterly and monthly data, inspired by and adapted from several examples from WRDS.
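The core of the merge, sketched in pandas rather than SAS: join Compustat to the CCM link table on `gvkey`, keep the reliable link types (LC/LU), and require `datadate` to fall inside the link's validity window. Column names follow the usual WRDS conventions; in practice one typically also restricts `linkprim` to primary links ('P'/'C'):

```python
import pandas as pd

def merge_ccm(comp: pd.DataFrame, link: pd.DataFrame) -> pd.DataFrame:
    """Attach CRSP permno to Compustat records via the CCM link table."""
    link = link[link["linktype"].isin(["LC", "LU"])]
    m = comp.merge(link, on="gvkey")
    # A missing linkenddt means the link is still active.
    return m[
        (m["datadate"] >= m["linkdt"])
        & (m["datadate"] <= m["linkenddt"].fillna(pd.Timestamp.max))
    ]
```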

Decomposing Herfindahl–Hirschman (HHI) Index

Herfindahl–Hirschman (HHI) Index is a well-known market concentration measure determined by two factors:

  1. the size distribution (variance) of firms, and
  2. the number of firms.

Intuitively, a hundred similar-sized gas stations in town make for a far less concentrated market than just one or two; and when the number of firms is held constant, their size distribution, or variance, determines the magnitude of market concentration.

Since these two properties jointly determine the HHI measure of concentration, we naturally want a decomposition of HHI that reflects these two dimensions separately. This is particularly useful when two distinct markets have the same level of HHI but the concentration results from different sources. Note that the two markets do not have to be industry A versus industry B; they can be the same industry niche in two geographical areas, for example.

Thus, we can think of HHI as the sum of the actual market's deviation from 1) a state where all firms have the same size, and 2) a fully competitive environment with an infinite number of firms. Some simple math solves our problem.
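Writing $s_i$ for market shares with mean $\bar s = 1/N$, the algebra gives $HHI = \sum_i s_i^2 = 1/N + N \cdot \mathrm{Var}(s)$: the first term shrinks with the number of firms, the second grows with share dispersion. A quick numerical check:

```python
import numpy as np

def hhi_decomposition(sales):
    """Return (HHI, 1/N, N * Var(s)): HHI decomposed into a
    number-of-firms term and a share-dispersion term."""
    shares = np.asarray(sales, dtype=float)
    shares = shares / shares.sum()
    n = len(shares)
    hhi = float((shares ** 2).sum())
    return hhi, 1 / n, n * float(np.var(shares))  # population variance
```

For a two-firm market with sales 3 and 1, shares are 0.75 and 0.25, so HHI is 0.625, decomposed as 0.5 (two firms) plus 0.125 (unequal sizes).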

Identify Chinese State-Owned Enterprise using CSMAR

Many research papers on Chinese firms include a control variable indicating whether the firm is a state-owned enterprise (SOE). This is important, as SOEs and non-SOEs differ in many aspects and may have structural differences. This post documents how to construct this indicator variable from the CSMAR databases.
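Once the ownership-nature field is in hand, constructing the dummy is a one-liner. The column name and category labels below are placeholders, not CSMAR's actual codes; consult the CSMAR codebook for the equity-nature variable and its coding:

```python
import pandas as pd

def flag_soe(firms: pd.DataFrame) -> pd.DataFrame:
    """Add an SOE dummy from an ownership-nature field.
    Column name and codes are hypothetical stand-ins."""
    soe_codes = {"Central SOE", "Local SOE"}  # placeholder categories
    firms = firms.copy()
    firms["soe"] = firms["equity_nature"].isin(soe_codes).astype(int)
    return firms
```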