Textual Analysis on SEC Filings
Nowadays, top journals favour more granular studies, and it is sometimes useful to dig into the raw SEC filings and perform textual analysis. This note documents how I download all historical SEC filings via EDGAR and conduct some textual analyses.
If you don’t require a very customized textual analysis, you may want to try an existing service such as SeekEdgar.com.
Alternatively, I’ve written a program specifically for downloading SEC filings from EDGAR, so you can skip steps 1 and 2 below.
1. Build a master index of SEC filings
I use python-edgar to download quarterly zipped index files to ./edgar-idx.
$ mkdir ~/edgar && cd ~/edgar
$ git clone https://github.com/edouardswiac/python-edgar.git
$ python ./python-edgar/run.py -d ./edgar-idx
Then merge the downloaded tsv files into a master file using cat.
$ cat ./edgar-idx/*.tsv > ./edgar-idx/master.tsv
$ du -h ./edgar-idx/master.tsv
The resulting master.tsv is about 2.6 GB as of Feb 2020. I then use the following Python script to load it into a SQLite database for more efficient querying.
# Load index files in `edgar-idx` to a sqlite database.
import sqlite3

EDGAR_BASE = "https://www.sec.gov/Archives/"

def parse(line):
    # each line: "cik|firm_name|file_type|date|url_txt|url_html"
    # an example:
    # "99780|TRINITY INDUSTRIES INC|8-K|2020-01-15|edgar/data/99780/0000099780-\
    # 20-000008.txt|edgar/data/99780/0000099780-20-000008-index.html"
    l = line.split('|')[:5]     # keep cik, firm_name, file_type, date, url_txt
    l[-1] = EDGAR_BASE + l[-1]  # turn the relative path into a full URL
    return tuple(l)

if __name__ == '__main__':
    conn = sqlite3.connect(r"edgar-idx.sqlite3")
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS edgar_idx
                 (cik TEXT, firm_name TEXT, file_type TEXT, date DATE, url TEXT,
                  PRIMARY KEY(cik, file_type, date));''')
    filename = './edgar-idx/master.tsv'
    with open(filename, 'r') as f:
        lines = f.readlines()
    data = [parse(line) for line in lines]
    c.executemany('INSERT OR IGNORE INTO edgar_idx '
                  '(cik, firm_name, file_type, date, url) VALUES (?,?,?,?,?)', data)
    conn.commit()
    conn.close()
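Once loaded, the index is easy to sanity-check. For example, a minimal sketch that counts filings by form type against the table created above:

# Quick sanity check: count filings by form type in the new index.
import sqlite3

conn = sqlite3.connect("edgar-idx.sqlite3")
c = conn.cursor()
c.execute('''SELECT file_type, count(*) FROM edgar_idx
             GROUP BY file_type ORDER BY count(*) DESC LIMIT 10;''')
for file_type, n in c.fetchall():
    print(f'{file_type}\t{n}')
conn.close()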
2. Download filings from EDGAR
I write the following script to download filings from EDGAR. Note that this script is only a skeleton: the full implementation has proper logging, speed control, and detailed error handling. For example, you’ll need to keep track of failures and re-download them later (a sketch of that bookkeeping follows the script).
This example script downloads all 8-K filings to ./data/{cik}/{file_type}/{date}.txt.gz.
Compression is highly recommended unless you have terabytes of free disk space!
# Download all 8-K filings.
import os
import sqlite3
import requests
import concurrent.futures
import gzip
import tqdm

def download(job):
    cik, _, file_type, date, url = job
    filename = f'./data/{cik}/{file_type}/{date}.txt.gz'
    try:
        # NOTE: per the SEC's fair-access guidelines, throttle your request
        # rate and set a descriptive User-Agent header.
        res = requests.get(url)
        if res.status_code == 200:
            with gzip.open(filename, 'wb') as f:
                f.write(res.content)
    except Exception:
        pass

if __name__ == "__main__":
    # select what to download
    conn = sqlite3.connect(r"edgar-idx.sqlite3")
    c = conn.cursor()
    c.execute('SELECT * FROM edgar_idx WHERE file_type = ?;', ('8-K',))
    jobs = c.fetchall()
    conn.close()
    # start downloading
    progress = tqdm.tqdm(total=len(jobs))
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as exe:
        for job in jobs:
            cik, _, file_type, date, url = job
            filename = f'./data/{cik}/{file_type}/{date}.txt.gz'
            os.makedirs(os.path.dirname(filename), exist_ok=True)
            if os.path.exists(filename):
                progress.update()  # already downloaded; skip
            else:
                f = exe.submit(download, job)
                # the done-callback receives the finished future, which
                # tqdm's update() does not expect, so wrap it in a lambda
                f.add_done_callback(lambda _: progress.update())
                futures.append(f)
        for f in concurrent.futures.as_completed(futures):
            pass
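As noted above, the full implementation tracks failures for later re-download. A minimal sketch of that bookkeeping, assuming a simple failures.log file (my own convention, not part of EDGAR):

# Sketch: record failed downloads in failures.log for a later retry pass.
import gzip
import requests

def download(job):
    cik, _, file_type, date, url = job
    filename = f'./data/{cik}/{file_type}/{date}.txt.gz'
    try:
        res = requests.get(url)
        if res.status_code != 200:
            raise RuntimeError(f'HTTP {res.status_code}')
        with gzip.open(filename, 'wb') as f:
            f.write(res.content)
    except Exception as e:
        # With many worker threads, guard this with a lock or use `logging`.
        with open('failures.log', 'a') as log:
            log.write(f'{url}\t{e}\n')

A later pass can then re-read failures.log and re-submit just those URLs.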
3. Example textual analyses
The downloaded txt files are the text versions of the filings, and they are generally well structured. Specifically, each filing is laid out as:
<SEC-DOCUMENT>
<SEC-HEADER>
...
</SEC-HEADER>
<DOCUMENT>
<TYPE>
<SEQUENCE>
<FILENAME>
<DESCRIPTION>
<TEXT>
...
</TEXT>
</DOCUMENT>
<DOCUMENT>...</DOCUMENT>
<DOCUMENT>...</DOCUMENT>
...
</SEC-DOCUMENT>
Note that <TYPE>, <SEQUENCE>, <FILENAME>, and <DESCRIPTION> are single tag-value lines rather than paired tags. For example, here is an 8-K with one attached exhibit:
<SEC-DOCUMENT>
<SEC-HEADER>
...
</SEC-HEADER>
<DOCUMENT>
<TYPE>8-K
<SEQUENCE>1
<FILENAME>f13478e8vk.htm
<DESCRIPTION>FORM 8-K
<TEXT>
...
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-99.1
<SEQUENCE>2
<FILENAME>f13478exv99w1.htm
<DESCRIPTION>EXHIBIT 99.1
<TEXT>
...
</TEXT>
</DOCUMENT>
<DOCUMENT>...</DOCUMENT>
...
</SEC-DOCUMENT>
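Given this layout, a filing can be split into its component <DOCUMENT> blocks before any further processing. Here is a minimal sketch, assuming the gzipped files from step 2 (the helper names are my own):

# Split a filing into its <DOCUMENT> blocks, keyed by the <TYPE> tag-line.
import re
import gzip

DOC_PAT = re.compile(rb'<DOCUMENT>(.*?)</DOCUMENT>', re.DOTALL)
TYPE_PAT = re.compile(rb'<TYPE>([^\s<]+)')

def split_documents(filepath):
    """Yield (doc_type, raw_bytes) for each <DOCUMENT> in a gzipped filing."""
    with gzip.open(filepath, 'rb') as f:
        raw = f.read()
    for m in DOC_PAT.finditer(raw):
        doc = m.group(1)
        t = TYPE_PAT.search(doc)
        doc_type = t.group(1).decode() if t else None
        yield doc_type, doc

# e.g. keep only the main 8-K document of a filing:
# main = [d for t, d in split_documents('./data/99780/8-K/2020-01-15.txt.gz')
#         if t == '8-K']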
3.1 Extract all items reported in 8-K filings since 2004
Since 2004, the SEC has required companies to file an 8-K within four business days of many types of events. For a short description, see the SEC’s fast answer on Form 8-K. The detailed instructions (PDF) are available here.
There are several ways to extract the items reported in each filing since 2004. First, I can use a regular expression to extract every "Item X.XX" from the 8-K <DOCUMENT> (a minimal sketch of this route follows the header example below). Or, I can take advantage of the information in <SEC-HEADER>. Below is an example <SEC-HEADER>1, in which the ITEM INFORMATION lines describe the items reported in the filing.
1 The original file is at https://www.sec.gov/Archives/edgar/data/0000008192/0000079732-02-000036.txt
<SEC-HEADER>0000079732-02-000036.hdr.sgml : 20020802
<ACCEPTANCE-DATETIME>20020802082752
ACCESSION NUMBER: 0000079732-02-000036
CONFORMED SUBMISSION TYPE: 8-K
PUBLIC DOCUMENT COUNT: 4
CONFORMED PERIOD OF REPORT: 20020801
ITEM INFORMATION: Changes in control of registrant
ITEM INFORMATION: Financial statements and exhibits
FILED AS OF DATE: 20020802
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: ATLANTIC CITY ELECTRIC CO
CENTRAL INDEX KEY: 0000008192
STANDARD INDUSTRIAL CLASSIFICATION: ELECTRIC SERVICES [4911]
IRS NUMBER: 210398280
STATE OF INCORPORATION: NJ
FISCAL YEAR END: 1231
FILING VALUES:
FORM TYPE: 8-K
SEC ACT: 1934 Act
SEC FILE NUMBER: 001-03559
FILM NUMBER: 02717802
BUSINESS ADDRESS:
STREET 1: 800 KING STREET
STREET 2: PO BOX 231
CITY: WILMINGTON
STATE: DE
ZIP: 19899
BUSINESS PHONE: 6096454100
MAIL ADDRESS:
STREET 1: 800 KING STREET
STREET 2: PO BOX 231
CITY: WILMINGTON
STATE: DE
ZIP: 19899</SEC-HEADER>
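For comparison, the regex route mentioned above might look like the minimal sketch below. It assumes the HTML tags have already been stripped from the document text, which (as I explain next) is exactly the messy part:

# Hypothetical sketch of the regex route: find item numbers such as
# "Item 1.01" in the (already HTML-stripped) text of a post-2004 8-K.
import re

ITEM_PAT = re.compile(r'item\s+(\d\.\d{2})', re.IGNORECASE)

def find_items(text):
    """Return the distinct item numbers mentioned in `text`."""
    return sorted(set(ITEM_PAT.findall(text)))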
Following the <SEC-HEADER> strategy, I write the code below to extract all items reported in 8-K filings since 2004. I didn’t use regex on the document body for this task because the text portion of the filing is actually dirty: for instance, you’ll need to remove all HTML tags and be careful about entities such as the non-breaking space (&nbsp;). My experience is that using <SEC-HEADER> for this task works best.
# Extract all items reported in 8-K filings since 2004.
import os
import gzip
import tqdm
import sqlite3
import concurrent.futures

BASE_DIR = './data'
FILE_TYPE = '8-K'
DB = "result.sqlite3"

def walk_dirpath(cik, file_type):
    """ Yield paths of all files for a given cik and file type """
    for root, _, files in os.walk(os.path.join(BASE_DIR, cik, file_type)):
        for filename in files:
            yield os.path.join(root, filename)

def regsearch(cik):
    matches = []
    for filepath in walk_dirpath(cik, FILE_TYPE):
        # remove the '.txt.gz' suffix to recover the date
        # (note: str.strip('.txt.gz') would strip characters, not the suffix)
        date = os.path.split(filepath)[1].replace('.txt.gz', '')
        if int(date.split('-')[0]) < 2004:
            continue
        with gzip.open(filepath, 'rb') as f:
            data = f.readlines()
        ls = [l for l in data if l.startswith(b'ITEM INFORMATION')]
        for l in ls:
            item = l.decode().replace('\t', '').replace('ITEM INFORMATION:', '')
            if len(item.strip()):
                matches.append((cik, FILE_TYPE, date, item.strip()))
    return matches

if __name__ == "__main__":
    conn = sqlite3.connect(DB)
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS files_all_items
                 (cik TEXT, file_type TEXT, date DATE, item TEXT,
                  PRIMARY KEY(cik, file_type, date, item));''')
    conn.commit()
    _, ciks, _ = next(os.walk(BASE_DIR))
    progress = tqdm.tqdm(total=len(ciks))
    with concurrent.futures.ProcessPoolExecutor(max_workers=16) as exe:
        futures = [exe.submit(regsearch, cik) for cik in ciks]
        for f in concurrent.futures.as_completed(futures):
            res = f.result()
            c.executemany("INSERT OR IGNORE INTO files_all_items "
                          "(cik, file_type, date, item) VALUES (?,?,?,?)", res)
            conn.commit()
            progress.update()
    conn.close()
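With the table populated, it helps to eyeball the most common items before querying further. For example:

-- SQLite
SELECT item, count(*) AS n
FROM `files_all_items`
GROUP BY item
ORDER BY n DESC
LIMIT 20;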
3.2 Find all 8-K filings with Item 1.01 and/or Item 2.03
To get those filings that have either:
- Item 1.01 Entry into a Material Definitive Agreement, or
- Item 2.03 Creation of a Direct Financial Obligation or an Obligation under an Off-Balance Sheet Arrangement of a Registrant
I run the following SQL query:
-- SQLite
CREATE TABLE `files_with_items_101_or_203` AS
SELECT DISTINCT cik, file_type, date
FROM `files_all_items`
WHERE
    instr(lower(item), 'creation of a direct financial obligation') > 0 OR
    instr(lower(item), 'entry into a material definitive agreement') > 0
ORDER BY cik, file_type, date;
To get those with both items, use the same conditions but keep only the filings matched twice (the primary key means each item appears at most once per filing, so a count of two implies both items are present):
-- SQLite
CREATE TABLE `files_with_items_101_and_203` AS
SELECT cik, file_type, date
FROM `files_all_items`
WHERE
    instr(lower(item), 'creation of a direct financial obligation') > 0 OR
    instr(lower(item), 'entry into a material definitive agreement') > 0
GROUP BY cik, file_type, date
HAVING count(*) > 1
ORDER BY cik, file_type, date;
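An equivalent, perhaps more explicit, way to express the "both items" condition is to intersect the two single-item sets:

-- SQLite
SELECT cik, file_type, date FROM `files_all_items`
WHERE instr(lower(item), 'entry into a material definitive agreement') > 0
INTERSECT
SELECT cik, file_type, date FROM `files_all_items`
WHERE instr(lower(item), 'creation of a direct financial obligation') > 0
ORDER BY cik, file_type, date;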
3.3 Nini, Smith, and Sufi (2009)
This example finds the appearance of any of the 10 search phrases used in “Creditor control rights and firm investment policy” by Nini, Smith, and Sufi (JFE 2009), which the authors use to identify loan contracts attached to SEC filings.
import re
import os
import sys
import gzip
import tqdm
import sqlite3
import logging
import concurrent.futures

logging.basicConfig(stream=sys.stdout, level=logging.WARN)

BASE_DIR = './data'
FILE_TYPE = '10-Q'
DB = "result.sqlite3"

# Regex pattern used to remove html tags
cleanr = re.compile(b'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

# Regex pattern used to find the appearance of any of the 10 search words used
# in "Creditor control rights and firm investment policy"
# by Nini, Smith and Sufi (JFE 2009)
# pat_10_words = r"CREDIT FACILITY|REVOLVING CREDIT|(CREDIT|LOAN|(LOAN (AND|&) \
# SECURITY)|(FINANCING (AND|&) SECURITY)|CREDIT (AND|&) GUARANTEE) AGREEMENT"
NSS_10_words = ['credit facility',
                'revolving credit',
                'credit agreement',
                'loan agreement',
                'loan and security agreement',
                'loan & security agreement',
                'credit and guarantee agreement',
                'credit & guarantee agreement',
                'financing and security agreement',
                'financing & security agreement']
NSS_10_words_str = '|'.join([word.upper() for word in NSS_10_words])
pat_10_words = re.compile(NSS_10_words_str.encode())

# Regex pattern used in this search
pattern = pat_10_words

def walk_dirpath(cik, file_type):
    """ Yield paths of all files for a given cik and file type """
    for root, _, files in os.walk(os.path.join(BASE_DIR, cik, file_type)):
        for filename in files:
            yield os.path.join(root, filename)

def regsearch(cik):
    matches = []
    for filepath in walk_dirpath(cik, FILE_TYPE):
        date = os.path.split(filepath)[1].replace('.txt.gz', '')
        try:
            with gzip.open(filepath, 'rb') as f:
                data = b' '.join(f.read().splitlines())
            data = re.sub(cleanr, b'', data)  # strip html tags and entities
            match = pattern.search(data)
            if match:
                matches.append((cik, FILE_TYPE, date))
                logging.info(f'{filepath}, {match.group()}')
        except Exception as e:
            logging.error(f'failed at {filepath}, {e}')
    return matches

if __name__ == "__main__":
    conn = sqlite3.connect(DB)
    c = conn.cursor()
    # create a table to store the indices
    c.execute('''CREATE TABLE IF NOT EXISTS files_with_10_words
                 (cik TEXT, file_type TEXT, date DATE,
                  PRIMARY KEY(cik, file_type, date));''')
    conn.commit()
    _, ciks, _ = next(os.walk(BASE_DIR))
    progress = tqdm.tqdm(total=len(ciks))
    with concurrent.futures.ProcessPoolExecutor(max_workers=16) as exe:
        futures = [exe.submit(regsearch, cik) for cik in ciks]
        for f in concurrent.futures.as_completed(futures):
            matches = f.result()
            c.executemany("INSERT OR IGNORE INTO files_with_10_words "
                          "(cik, file_type, date) VALUES (?,?,?)", matches)
            conn.commit()
            progress.update()
    conn.close()
    logging.info('complete')
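The resulting table can then be summarized directly, e.g. counting matched firms per year (a quick sketch against the table created above, assuming the date column keeps the YYYY-MM-DD format of the filenames):

-- SQLite
SELECT strftime('%Y', date) AS year, count(DISTINCT cik) AS n_firms
FROM `files_with_10_words`
GROUP BY year
ORDER BY year;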