Due: October 24
In this exercise, you will demonstrate a basic understanding of how to use econometrics
in practice with software such as Stata or R. We will walk you through the commands for
Stata, but if you prefer to use R, feel free to do so.
We will use a random sample of the 1988 National Maternal and Infant Health Survey,
collected by the U.S. Department of Health and Human Services, and presented in Jeff
Wooldridge’s Introductory Econometrics textbook, 3rd edition. This survey was one of
the first large-scale data collections to examine the correlation between maternal smoking
and infant weight at birth. Though the research design does not support a causal interpretation,
the large correlations between maternal smoking and low infant birth weight among women
between the ages of 15 and 49 who had a pregnancy in 1988 spurred a tidal wave of
literature on the subject that continues to progress.
Please hand in, via bCourses, your log file, do file, and a concise write-up of
your findings, answering all questions below.
Note that many of the commands that we ask you to run in Stata are also available in
Stata’s dropdown menu. Please do not use it. Instead, write each command in your do file.
1. Download the dataset birthweight_sample.dta from the “Replication Exercise” folder
under the “Files” tab on bCourses. (This will not show up on your Stata log or your
2. Open Stata and create a new do file by clicking on the “New Do File Editor” icon.
3. Open birthweight_sample.dta in Stata using the “use” command.
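The setup steps above could be organized in a do file along the following lines. This is only a sketch: the folder path and the log-file name are placeholders (they are not specified in this handout), so substitute your own.

```stata
* Minimal do-file skeleton (a sketch; the path and log-file name are placeholders).
clear all
capture log close                  // close any log left open by an earlier run
cd "/path/to/your/working/folder"  // hypothetical folder holding the .dta file
log using "replication_exercise.log", replace text

use "birthweight_sample.dta", clear
describe                           // list the variables and their storage types

* ... the commands for the remaining sections go here ...

log close
```

Opening the log at the top and closing it at the bottom means that rerunning the do file regenerates the complete log in one pass.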
2 Descriptive Statistics
1. Locate the data entry error in the dataset. Briefly describe the error in your
write-up. (Stata hint: use the summarize command sum to search for the error.)
2. Replace the incorrect data entry with the missing entry symbol. (Stata hint: missing
values are coded with a . in Stata. Use replace variablename = . if
3. Report the median, mean, standard deviation, maximum and minimum of each variable
in the dataset. (Stata hint: sum, detail)
4. Graph histograms for our two main variables of interest, bwght and cigs. (Stata hint:
histogram). Briefly describe your findings.
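As a rough illustration of the commands hinted at above, the descriptive step might look like the sketch below. It assumes the variables named elsewhere in this handout are all numeric and that cigs takes integer values; check both with describe before relying on it. The graph titles are placeholders.

```stata
* Quick overview of every variable: N, mean, sd, min, max.
summarize

* Detailed statistics (percentiles, including the median) for the two main variables.
summarize bwght cigs, detail

* Compact table of mean, median, sd, min and max (variables assumed numeric).
tabstat bwght lbwght cigs motheduc fatheduc parity male white, ///
    statistics(mean median sd min max) columns(statistics)

* Histograms; "discrete" gives each integer value of cigs its own bar.
histogram bwght, frequency title("Birth weight")
histogram cigs, discrete frequency title("Cigarettes")
```

The /// line continuation works in a do file but not when typing commands interactively.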
3 A Simple Regression
1. Run a univariate linear regression of the variable bwght on the variable cigs. In your
write-up, interpret the coefficient on the right-hand side variable as it relates to bwght
and indicate whether the coefficient is statistically significant at the 95% confidence
level. (Stata hint: regress bwght cigs, robust)
2. Make a publication-quality table for the above regression. Attach this table to your
write-up. (Stata: use the command ssc install outreg2 to install the outreg2
package. Use the command outreg2 using "~/olstable.xls", excel replace to
make a table displaying the results from the previous regression, where "~" should be
replaced by the directory on your computer where you would like the exported
.xls file to be placed, e.g. "/Academics/Econ 191/Replication Exercise".)
3. Create a scatter plot of all observations’ values of bwght and cigs. Title the graph
“Scatter Plot of Birth Weight (oz.) on Maternal Smoking.” Label the horizontal and
vertical axes. (Stata hint: twoway scatter with options ytitle, xtitle, and title).
4. Add a best-fit prediction line to your scatter plot. Again, title the graph and the two
axes. (Stata hint: twoway (scatter bwght cigs) (lfit bwght cigs))
5. In your write-up, describe the relationship you see between the two variables. Does
the best fit line explain anything that was not apparent in the simple scatter plot?
Explain. Finally, save the graph with best fit line and include it in your write-up.
(Stata hint: graph export)
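One way the regression and graph hints above could fit together is sketched below. The axis labels and the output file name are assumptions made for illustration (the handout only states that birth weight is in ounces), and PNG export may not be available in every Stata installation.

```stata
* Simple regression with heteroskedasticity-robust standard errors
* (vce(robust) is the same request as the robust option in the hint).
regress bwght cigs, vce(robust)

* Scatter plot with the fitted line overlaid, plus title and axis labels.
twoway (scatter bwght cigs) (lfit bwght cigs), ///
    title("Scatter Plot of Birth Weight (oz.) on Maternal Smoking") ///
    ytitle("Birth weight (oz.)") xtitle("Cigarettes smoked") ///
    legend(order(1 "Observations" 2 "Linear fit"))

* Save the graph currently in memory; the file name is a placeholder.
graph export "bwght_cigs_fit.png", replace
```

The slope of the lfit line is the same number as the coefficient on cigs reported by regress, so the graph and the table should tell a consistent story.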
4 More Regressions
An obvious concern with the previous regression is the possibility that there are other
factors which are correlated with both birth weight and cigarette consumption. This could
lead to what is called omitted variable bias, and a spurious correlation between birth weight
and cigarette consumption. For example, it could be the case that mother’s education is
correlated both with birth weight and cigarette consumption. One way to address this issue
is to add control variables to the regression.
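The omitted variable bias described above can be written out for the mother's education example. The notation below is introduced here purely for illustration and is not part of the original handout; it assumes that mother's education is the only omitted factor and that cigs is uncorrelated with the error term u.

```latex
\begin{align*}
\mathit{bwght} &= \beta_0 + \beta_1\,\mathit{cigs} + \beta_2\,\mathit{motheduc} + u, \\
\mathit{motheduc} &= \delta_0 + \delta_1\,\mathit{cigs} + v, \qquad
  \delta_1 = \frac{\operatorname{Cov}(\mathit{cigs},\mathit{motheduc})}
                  {\operatorname{Var}(\mathit{cigs})}, \\
\operatorname{plim}\,\tilde{\beta}_1 &= \beta_1 + \beta_2\,\delta_1 .
\end{align*}
```

Here \tilde{\beta}_1 is the slope from the regression of bwght on cigs alone. The bias term \beta_2 \delta_1 vanishes only if mother's education has no effect on birth weight or is uncorrelated with smoking. If, hypothetically, \beta_2 > 0 and \delta_1 < 0, the simple regression would make smoking look more harmful than it is.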
1. Run a multiple linear regression of the variable bwght on the variables cigs, motheduc,
fatheduc, and parity. In your write-up, interpret the coefficient on each right-hand
side variable as it relates to bwght and indicate whether each coefficient is
statistically significant at the 95% confidence level. Append the results of this regression
to your table. (Stata hint: outreg2 using "~/olstable.xls", excel append)
2. Add the dummy variables male and white to the regression you ran in 1. Interpret the
coefficients of this new regression in your write-up, again indicating whether each
coefficient is statistically significant at the 95% confidence level. Did anything change
from the results you found above? Append the results of this regression to your table.
3. Create a new variable cigssq, the square of cigs. Add the new variable cigssq to
the regression you ran in 1. Interpret the coefficients of this new regression in your
write-up, again indicating whether each coefficient is statistically significant at the 95%
confidence level. Did anything change from the results you found above? Append the results of
this regression to your table. (Stata hint: gen cigssq = cigs^2)
4. Note that the dataset contains a log version of the left-hand side variable bwght:
lbwght. Rerun your regression from 1 with lbwght as the left-hand side variable.
Interpret and assess the statistical significance of each coefficient in this regression in
your write-up. Append the results of this regression to your table.
5. Finally, add state fixed effects to the regression you ran in 1 using the variable state.
What type of variation does this control for? Interpret your results and append
the results of this regression to your table (without including the coefficients for
the state dummies). (Stata hint: areg bwght cigs motheduc fatheduc parity,
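A few of the specifications in this section could be chained into one table roughly as follows. This is a sketch under several assumptions: outreg2 is already installed, the table was started earlier with the replace option, the file is written to the current working directory, and the column titles are placeholders. The last regression uses factor-variable notation (i.state) rather than the areg command named in the hint; it is a different route to the same state dummies and requires state to be stored as a numeric variable.

```stata
* Baseline specification from question 1, appended to the existing table.
regress bwght cigs motheduc fatheduc parity, vce(robust)
outreg2 using "olstable.xls", excel append ctitle("Controls")

* Quadratic in cigs (question 3).
generate cigssq = cigs^2
regress bwght cigs cigssq motheduc fatheduc parity, vce(robust)
outreg2 using "olstable.xls", excel append ctitle("Quadratic")

* State dummies through factor-variable notation (assumes state is numeric).
regress bwght cigs motheduc fatheduc parity i.state, vce(robust)

* keep() restricts the exported rows so the state dummies stay out of the table.
outreg2 using "olstable.xls", excel append ctitle("State FE") ///
    keep(cigs motheduc fatheduc parity) addtext(State FE, Yes)
```

Giving each column its own ctitle() makes it easier to refer to individual specifications in the write-up.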
5 Instrumental Variables
So far, we have not been able to infer a causal relationship between cigarette smoking
and birth weight. Instrumental variables are sometimes useful to infer causal relationships.
Suppose that we believe that mother’s education is an appropriate instrument for cigarette
smoking, our endogenous independent variable.
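The two steps this setup implies can be sketched as follows. The sketch uses ivregress 2sls, the current Stata command for two-stage least squares; the robust standard errors and the post-estimation check are optional additions, not requirements of the exercise.

```stata
* First stage: regress the endogenous variable on the instrument.
regress cigs motheduc, vce(robust)
test motheduc                      // F-test of instrument relevance

* Two-stage least squares: cigs is instrumented by motheduc.
ivregress 2sls bwght (cigs = motheduc), vce(robust)
estat firststage                   // first-stage summary statistics
```

A strong first stage only establishes that the instrument is relevant. Whether it is also excludable from the birth weight equation cannot be checked this way and has to be argued.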
1. Run a regression of cigarette smoking on mother’s education to verify that there is a
significant correlation between our instrument and our endogenous variable. This is
called the first-stage regression.
2. Run the two-stage-least-squares regression using mother’s education as an instrument
for cigarette smoking, with birth weight as the dependent variable. (Stata hint: ivreg
bwght (cigs=motheduc)). Interpret your result.
3. We have used mother’s education as an instrument for cigarette smoking. However,
this is a very poor choice for an instrument! Explain why. Can you think of a more