Douglas H. Wrenn, H. Allen Klaiber, and David A. Newburn, 
"Confronting Price Endogeneity in a Duration Model of Residential 
Subdivision Development", Journal of Applied Econometrics, Vol. 32,
No. 3, 2017, pp. 661-682.

Overview

All of the data and Stata .do files needed to replicate the results in
the tables in the paper are included in the zip file wkn-files.zip.
Each .do file is labeled as to which table and/or model it runs. To
generate the data needed for the second-stage duration models, you
will need to first run the "first-stage OLS" files (explained below). 
Everything should be clearly labeled within each file. 

Note: The raw housing- and land-price data used to create the price
indices in the paper are proprietary, so they are not included. 
However, the online appendix provides sufficient information on how to
set up the data and run the hedonic models that created the price
indices, so those interested in implementing our model in a different
context should be able to easily collect housing- and land-price data
and generate similar indices. The price data included in this archive
are limited to the resulting price indices.


Stata .do Files
---------------

TableII_First_Stage_OLS.do

This file runs the first-stage OLS price regression models based on
nearest-neighbor cutoffs for the IV variables. The file estimates the
OLS price regressions, generates the F-stats, creates the residuals
used as controls in each of the nearest-neighbor duration models, and 
merges the residuals with the main duration dataset:
Second_Stage_Duration_Data.txt. The file produces the results in Table
II in the paper. (Note: This file must be run first to create the
datasets used in the main duration model.)

 
TableIII_Overidentification_Tests.do

This file runs a series of overidentification tests. The excluded
variable is the percentage of undeveloped land in each census tract
and year. The IV variables are based on a nearest-neighbor cutoff.
This file produces the results in Table III in the paper.


TablesIV_and_V_Second_Stage_Duration.do

This file runs the second-stage non-IV and IV duration models and
produces the parameter estimates and elasticity values in Tables IV
and V in the paper. The models are based on the nearest-neighbor
cutoffs.

 
TableA3_First_Stage_OLS.do

This file runs the first-stage OLS price regression models based on
distance cutoffs for the IV variables. The file estimates the OLS
price regressions, generates the F-stats, creates the residuals used
as controls in each of the distance-based duration models, and
merges the residuals with the main duration dataset:
Second_Stage_Duration_Data.txt. The file produces the results in Table
A3 in the appendix. (Note: This file must be run first to create the
datasets used in the distance-based duration models.)


TableA3_Overidentification_Tests.do

This file runs a series of overidentification tests. The excluded
variable is the percentage of undeveloped land in each census tract
and year. The IV variables are based on a distance cutoff. This file
produces the results in Table A3 in the appendix.


TableA3_Second_Stage_Duration.do

This file runs the second-stage non-IV and IV duration models and
produces the parameter estimates and elasticity values in Table A3 in
the appendix. The models are based on the distance cutoffs.


TableA4_Truncated_Poisson_Model.do

This file estimates a non-IV and IV truncated Poisson model and
produces the elasticity results in Table A4 in the appendix.


TableA5_Second_Stage_Duration_Neighborhood_Cluster.do

This file estimates the same series of non-IV and IV duration models
as in the TablesIV_and_V_Second_Stage_Duration.do file, but clusters
the standard errors at the census tract-by-year level. This file
produces the results in Table A5 in the appendix.


Data Files
----------

First_Stage_OLS_Data.txt 

This file contains the neighborhood-level data (census tract-by-year
data) used to estimate the first-stage OLS models. This file is used
with the TableII_First_Stage_OLS.do and TableA3_First_Stage_OLS.do
files. These files loop over the data and estimate the first-stage OLS
models for the nearest-neighbor models (the main model in the paper)
and distance-based models (included in the appendix). 

The first-stage model regresses housing prices (houseprices1k: in
$1,000s) on: a constant (constant), the percentage of preservation in
the census tract in each year (prespercent), the percentage of
undeveloped area in the tract in each year (udarea), the number of
buildable lots allowed by zoning in each year (zndltqnt_ct), the
number of lots approved for development over the previous year
(apprvltqnt), a yearly land-price index in each census tract
(landprices1k: in $1,000s), a set of time dummies (i.timeprd), a set
of county fixed effects (carrind and harfind; Baltimore is the
excluded dummy), and a set of IV variables based on either a
nearest-neighbor cutoff or a distance cutoff. (See Table I in the
paper for additional information.) This file also includes a
tract-by-year variable (fe), which identifies the neighborhood and is
a combination of 14 years (timeprd) and 229 census tracts. Thus, this
file should contain 3,206 observations for the neighborhoods in the
paper. All of these data were created using either GIS data or
housing-price data from the Maryland Property View Database (MDPV).

The IV variables are based on taking a weighted average value for the
variables prespercent, udarea, and zndltqnt_ct in distant
neighborhoods and adding them to the right-hand side of the OLS
models.  The notion of distance is defined either by 8-12 nearest
neighbors or 3-5 mile distance cutoffs. Each of these cutoffs has
separate variables labeled as such in this file. For example, the
preservation variable for the cutoff of 8 neighbors is labeled
prespercent_fe_8nn. Each of the other variables and cutoffs is labeled
accordingly. 
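
As a rough illustration of the first-stage specification described
above, a minimal Stata sketch for the 8-nearest-neighbor cutoff might
look like the following. The variable names come from the description
in this file, except udarea_fe_8nn and zndltqnt_ct_fe_8nn, which are
assumed to follow the same pattern as prespercent_fe_8nn. The import
command and its defaults are also an assumption about the file layout.
The TableII_First_Stage_OLS.do file remains the authoritative version.

```stata
* Sketch only; see TableII_First_Stage_OLS.do for the actual code.
* Assumes a delimited text file with variable names in the first row.
import delimited "First_Stage_OLS_Data.txt", clear

* 14 years x 229 census tracts = 3,206 neighborhood-year observations
assert _N == 3206

* Instruments for the 8-nearest-neighbor cutoff.
* The last two names are ASSUMED from the prespercent_fe_8nn pattern.
local ivs prespercent_fe_8nn udarea_fe_8nn zndltqnt_ct_fe_8nn

* First-stage OLS: housing prices on the exogenous variables and IVs
* (regress adds the constant automatically)
regress houseprices1k prespercent udarea zndltqnt_ct apprvltqnt ///
    landprices1k i.timeprd carrind harfind `ivs'

* Joint F-test that the instrument coefficients are all zero
test `ivs'

* Residuals from the first stage, used later as the control function
predict double confunc_p, residuals
```

The sketch stops at the first residual term; how confunc_p2 is built
is defined in the .do files and is not reproduced here.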


Second_Stage_Duration_Data.txt

This file contains the data used to run the second-stage (main)
duration model. It is used by the TableII_First_Stage_OLS.do and
TableA3_First_Stage_OLS.do Stata files to merge with the residuals
generated by the first-stage OLS model (described above). After these
two first-stage .do files are executed, you will have a series of .dta
Stata data files labeled "Second_Stage_Duration_Data_X_NearN" and
"Second_Stage_Duration_Data_X_Miles", where the X stands for 8-12
nearest-neighbors or 3-5 mile distance bands.
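
The loop-and-merge step described above could be sketched roughly as
follows for the nearest-neighbor cutoffs. This is an assumption-laden
outline, not the code in the .do files: the instrument names other
than prespercent_fe_8nn, the use of fe as the merge key, the import
defaults, and the exact output file names (here with X replaced by the
cutoff) are all assumed for illustration.

```stata
* Sketch only; the shipped first-stage .do files do this for you.
tempfile resid

forvalues k = 8/12 {
    * First stage for the k-nearest-neighbor cutoff
    import delimited "First_Stage_OLS_Data.txt", clear
    regress houseprices1k prespercent udarea zndltqnt_ct apprvltqnt ///
        landprices1k i.timeprd carrind harfind                      ///
        prespercent_fe_`k'nn udarea_fe_`k'nn zndltqnt_ct_fe_`k'nn

    * Keep one residual per neighborhood (tract-by-year) identifier
    predict double confunc_p, residuals
    keep fe confunc_p
    save `resid', replace

    * Attach the residuals to every parcel-year in that neighborhood
    import delimited "Second_Stage_Duration_Data.txt", clear
    merge m:1 fe using `resid', keep(master match) nogenerate

    * ASSUMED file name pattern, with X replaced by the cutoff
    save "Second_Stage_Duration_Data_`k'_NearN.dta", replace
}
```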

The land-use (subdivision) data in this file were generated by
combining historical zoning and subdivision plat maps with GIS parcel
data from Baltimore, Harford, and Carroll counties in Maryland. The
parcel data are available through Maryland's Property View online
database and the plat maps are available online through the Maryland
Historical Archives. The final dataset used in this paper was created
by manually placing each individual housing parcel in the GIS data
inside of its respective subdivision based on individual PDF files of
the plats of each subdivision. 

Each duration model (a binary probit model) regresses a binary
indicator variable for development (development) on: the exogenous
variables from the first-stage OLS model, including housing prices
(houseprices1k), and a set of parcel-level cost variables. The
parcel-level variables are: the distance in kilometers of each parcel
from Baltimore City (dstbaltcen), the distance in kilometers from the
parcel to the closest primary highway (surprimerd), the size of the
parcel in acres (areaacre), the number of building lots allowed on the
parcel according to the zoning on the parcel (zndltqnt), an indicator as
to whether the parcel has public sewer (sewer), an indicator as to
whether the parcel is in a flood zone (floodplne), an indicator as to
whether the parcel is suitable for a septic system (septicsuit), a
variable for the percentage of the parcel that has a slope greater
than 15%, and an indicator variable for whether the parcel has an
existing structure. 

Each of the datasets also has two variables -- confunc_p and
confunc_p2 -- which are the residuals or control functions generated
in the first-stage OLS model. These variables will vary over each of
the "Second_Stage_Duration_Data_X_NearN" and 
"Second_Stage_Duration_Data_X_Miles" datasets based on which
first-stage OLS model was used to generate the data; all of the other
variables are constant across each of the duration models. Finally,
the dataset contains a set of variables that indicate: the parcel
(id), the census tract (ctvar), the year (timeprd and y_1-y_14), the
neighborhood (fe), and the county (baltind, carrind, harfind).

The final dataset contains 183,580 observations, which represent all
of the parcel-year combinations in the subdivision data. 
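
For orientation, a control-function probit of the kind described above
might be sketched as follows. The dataset name assumes the 8-neighbor
file produced by the first-stage step, and the slope and
existing-structure variables are left out because their names are not
given in this file. The clustering shown is the tract-by-year variant
described for Table A5. Treat this as an illustration and rely on the
.do files for the exact specification.

```stata
* Sketch only; see TablesIV_and_V_Second_Stage_Duration.do and
* TableA5_Second_Stage_Duration_Neighborhood_Cluster.do.
* ASSUMED file name for the 8-nearest-neighbor dataset.
use "Second_Stage_Duration_Data_8_NearN.dta", clear

* All parcel-year combinations in the subdivision data
assert _N == 183580

* Neighborhood-level and parcel-level regressors named in this file.
* The slope and existing-structure variables are omitted here because
* their names are not listed in this README.
local nbhd   houseprices1k prespercent udarea zndltqnt_ct apprvltqnt ///
             landprices1k
local parcel dstbaltcen surprimerd areaacre zndltqnt sewer floodplne ///
             septicsuit

* IV version: adds the first-stage control-function terms.
* Standard errors clustered by neighborhood (tract-by-year), as in A5.
probit development `nbhd' `parcel' i.timeprd carrind harfind ///
    confunc_p confunc_p2, vce(cluster fe)
```

Dropping confunc_p and confunc_p2 from the variable list gives the
corresponding non-IV specification.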


Second_Stage_Poisson_Data.txt

This file contains the data used to estimate the truncated Poisson
model in Table A4 in the appendix. All of the independent variables
are the same as in the duration models. The dependent variable in this
dataset is ltqnt and represents the number of lots created by each
subdivision event. There were a total of 2,385 subdivision events
during our study period, so this dataset contains 2,385 observations.
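
A non-IV truncated Poisson model on these data could be sketched as
below. The truncation point of zero and the import defaults are
assumptions made for illustration only, and the slope and
existing-structure variables are omitted because their names are not
given in this file. The TableA4_Truncated_Poisson_Model.do file
defines the actual model.

```stata
* Sketch only; see TableA4_Truncated_Poisson_Model.do.
* Assumes a delimited text file with variable names in the first row.
import delimited "Second_Stage_Poisson_Data.txt", clear

* One observation per subdivision event
assert _N == 2385

* Non-IV specification with the regressors named in this README.
* ll(0) ASSUMES left truncation at zero lots; check the .do file.
tpoisson ltqnt houseprices1k prespercent udarea zndltqnt_ct apprvltqnt ///
    landprices1k dstbaltcen surprimerd areaacre zndltqnt sewer         ///
    floodplne septicsuit i.timeprd carrind harfind, ll(0)
```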


Douglas H. Wrenn
dhw121 [AT] psu.edu