Stata - Working with datasets
We can use the command sysuse
to use builtin datasets, and use
to load other external datasets.
In the introduction, we briefly mentioned how to load Stata datasets to use. Now, let’s take a more in-depth look at how we work with datasets in Stata.
Datasets, here and there
Datasets are stored at different places, locally on our computer’s hard disk or remotely on a server. For Stata to use them, we need to load them into Stata, or putting them into memory.
Because Stata commands (e.g., summarize
, describe
, count
, etc.) operate on the current dataset in memory, working simultaneously on multiple datasets was painful – one needs to save current, load the other dataset, perform tasks and save/load again. But since Stata 16, a feature called frame
is introduced, where different datasets can be loaded into memory at the same time, but in different “frames”. The chart below gives a simple illustration.
Although we still can only operate Stata commands on a single frame/dataset at a given time, we no longer need to save/load datasets as they all reside in memory frames.
We as beginners can be agnostic about frame
, especially when dealing with only one dataset throughout. Technically, we are loading data into the default
frame1, and work in the default frame.
1 default
frame is just a frame named “default”.
But we start with loading and saving
Stata’s dta
The dta
is Stata’s proprietary binary data file format, and is the default file format used by Stata.
What I like very much about the dta
data format include:
- Variable Types:
dta
files can store different types of variables, including numeric variables (e.g., integers, floats) and string variables (text). It can represent missing values too. - Metadata:
dta
files can store metadata, such as variable labels (descriptive names for variables), value labels (labels for specific variable values), and variable formats (e.g., date formats). - Cross-Platform Compatibility:
dta
files created in one version of Stata can generally be read by other versions of Stata, ensuring cross-platform compatibility.
Also, dta
files are typically compressed to reduce file size and optimize storage.
More importantly, Stata provides commands (use
, save
, etc.) to read data from dta
files into memory and save data from memory to dta
files. This makes it extremely easy to work with.
For example, you can easily load a Stata dataset online to your Stata via use
:
use "http://www.stata-press.com/data/r13/nlswork.dta", clear
Note that , clear
option tells Stata to clear the memory (1) in case there is already some other dataset in it. { .annotate }
- Note that it does not clear the entire memory - datasets in other frames are not cleared. Only the dataset, if any, in the current frame is cleared when the new dataset is loaded.
Alternatively, you can download the nlswork.dta
dataset to your computer, and load it from your local computer:
use "/Users/mgao/Downloads/nlswork.dta", clear
use "C:\Users\mgao\Downloads\nlswork.dta", clear
After some work on the dataset, say, keeping only observations where year
is 88,
keep if year==88
we can save the modified dataset either to its original place, overwriting the original dataset, or to a different place, creating a new dataset:
save "/Users/mgao/Downloads/nlswork.dta", replace
save "C:\Users\mgao\Downloads\nlswork.dta", replace
Mighty csv
We all love csv
or “comma-separated-values” files. They are simple and readable without requiring any special software.2 Many datasets are also published online in csv
format.
2 You don’t need Excel or Numbers to open csv
files, in case you didn’t know…
What if we want to save a csv
version of the dataset? Easy, we use export
command:
export delimited using "/Users/mgao/Downloads/nlswork.csv", replace
export delimited using "C:\Users\mgao\Downloads\nlswork.csv", replace
Of course, we can import the csv
file back to Stata using import
command:
using "/Users/mgao/Downloads/nlswork.csv", clear import delimited
using "C:\Users\mgao\Downloads\nlswork.csv", clear import delimited
In some rare cases where the text file is not delimited/separated by comma, we can manually specify the delimiter. For example, some datasets use “tab-separated-values” or tsv
format:
"path/to/datafile.tsv", delimiter(tab) import delimited
Did someone say “Excel”?
Stata gets you covered. import excel
is all you need.
For example, we next are to use an Excel spreadsheet named :octicons-download-24: “BUSS7902 Chapter 4A Lecture (Data).xlsx”.
Before everything, we can ask Stata to describe the file:
"~/Downloads/BUSS7902 Chapter 4A Lecture (Data).xlsx", describe
. import excel
Sheet | Range
-----------------+-----------------
Magic Box | A1:C101
Assembly | A1:H76
Distance | A1:L42 Insurance+Survey | A1:H1501
As shown, the spreadsheet contains four sheets of different names, “Magic Box” and so on. Let’s say we are interested in the data in the “Magic Box” sheet,3 we can instruct Stata to load data from the sheet and optionally specify the data range in the sheet.4
3 The “Magic Box” sheet, as with other sheets, contains more than just data, but also some text. We want only data in the first column A.
4 We also use the firstrow
option to tell Stata to treat the first row as variable name, not value of an observation.
"~/Downloads/BUSS7902 Chapter 4A Lecture (Data).xlsx", firstrow sheet("Magic Box") cellrange(A1:A101) clear
. import excel var, 100 obs) (1
And that’s it! Stata will take care of the variable type and etc., and is pretty good at it most of the time.
So you want more frame
s?
This is for the tech-savvy. You don’t need frame
almost surely.
Okay. So you noticed that every time we import/use a new dataset, we set the clear
option to clear the memory, discarding whatever dataset we currently work on to make room for new new dataset. This is troublesome. What if we don’t want to give up the intermediate results while taking a peak at the different dataset?
We make a new frame
and load the new dataset into the new frame.5
5 Remember that the frame we work on so far is named “default”.
Let’s have a look first at what frames are currently there:
list
. frame default 100 x 1
*
data. Note: Frames marked with * contain unsaved
Create a new frame
We create a new frame
to which a new dataset can be loaded without clearing the existing dataset in the default frame. We can name it whichever we like, say, “assembly”:
"assembly" frame create
Now check again the frames, we can see it is indeed there.
list
. frame
assembly 0 x 0default 100 x 1
*
data. Note: Frames marked with * contain unsaved
Change to the new frame
Let’s now switch to the newly created “assembly” frame, leaving the “Magic Box” data untouched in the default frame.
frame change assembly
Forgot which frame you are in?
. frame (current frame is assembly)
You may notice that the Variables window is now blank, showing that there is no variable in this frame. Rest assured that the x
variable we imported earlier from “Magic Box” sheet stays in memory and in the default frame.
We can now load data into this empty frame using the same methods as discussed above.
For example, I can now load the data in the “Assembly” sheet into the frame.
"~/Downloads/BUSS7902 Chapter 4A Lecture (Data).xlsx", firstrow sheet("Assembly") cellrange(A1:A76) clear case(lower)
. import excel var, 75 obs) (1
If we check the frames, we can see that now both datasets exist in memory, albeit in two frames.
list
. frame
* assembly 75 x 1default 100 x 1
*
data. Note: Frames marked with * contain unsaved
To go back to the original frame (named “default”), use frame change default
.