Douglas H. Wrenn, H. Allen Klaiber, and David A. Newburn, "Confronting Price Endogeneity in a Duration Model of Residential Subdivision Development", Journal of Applied Econometrics, Vol. 32, No. 3, 2017, pp. 661-682.

Overview

All of the data and Stata .do files needed to replicate the results in the tables in the paper are included in the zip file wkn-files.zip. Each .do file is named for the table and/or model it runs. To generate the data needed for the second-stage duration models, you must first run the "first-stage OLS" files (explained below). Everything should be clearly labeled within each file.

Note: The raw housing- and land-price data used to create the price indices in the paper are proprietary, so they are not included. However, the online appendix provides sufficient information on how to set up the data and run the hedonic models that created the price indices, so those interested in implementing our model in a different context should be able to collect housing- and land-price data and generate similar indices. Of the price data, this archive includes only the price indices themselves.

Stata .do Files
---------------

TableII_First_Stage_OLS.do
This file runs the first-stage OLS price regression models based on nearest-neighbor cutoffs for the IV variables. The file estimates the OLS price regressions, generates the F-stats, creates the residuals used as controls in each of the nearest-neighbor duration models, and merges the residuals with the main duration dataset: Second_Stage_Duration_Data.txt. The file produces the results in Table II in the paper. (Note: This file must be run first to create the datasets used in the main duration model.)

TableIII_Overidentification_Tests.do
This file runs a series of overidentification tests. The excluded variable is the percentage of undeveloped land in each census tract and year. The IV variables are based on a nearest-neighbor cutoff. This file produces the results in Table III in the paper.

TablesIV_and_V_Second_Stage_Duration.do
This file runs the second-stage non-IV and IV duration models and produces the parameter estimates and elasticity values in Tables IV and V in the paper. The models are based on the nearest-neighbor cutoffs.

TableA3_First_Stage_OLS.do
This file runs the first-stage OLS price regression models based on distance cutoffs for the IV variables. The file estimates the OLS price regressions, generates the F-stats, creates the residuals used as controls in each of the distance-based duration models, and merges the residuals with the main duration dataset: Second_Stage_Duration_Data.txt. The file produces the results in Table A3 in the appendix. (Note: This file must be run first to create the datasets used in the distance-based duration models.)

TableA3_Overidentification_Tests.do
This file runs a series of overidentification tests. The excluded variable is the percentage of undeveloped land in each census tract and year. The IV variables are based on a distance cutoff. This file produces the results in Table A3 in the appendix.

TableA3_Second_Stage_Duration.do
This file runs the second-stage non-IV and IV duration models and produces the parameter estimates and elasticity values in Table A3 in the appendix. The models are based on the distance cutoffs.

TableA4_Truncated_Poisson_Model.do
This file estimates a non-IV and an IV truncated Poisson model and produces the elasticity results in Table A4 in the appendix.

TableA5_Second_Stage_Duration_Neighborhood_Cluster.do
This file estimates the same series of non-IV and IV duration models as the TablesIV_and_V_Second_Stage_Duration.do file, but clusters the standard errors at the census tract-by-year level. This file produces the results in Table A5 in the appendix.

Data Files
----------

First_Stage_OLS_Data.txt
This file contains the neighborhood-level data (census tract-by-year data) used to estimate the first-stage OLS models.
This file is used with the TableII_First_Stage_OLS.do and TableA3_First_Stage_OLS.do files. These files loop over the data and estimate the first-stage OLS models for the nearest-neighbor models (the main model in the paper) and the distance-based models (included in the appendix).

Each first-stage model regresses housing prices (houseprices1k: in $1,000s) on: a constant (constant), the percentage of preservation in the census tract in each year (prespercent), the percentage of undeveloped area in the tract in each year (udarea), the number of buildable lots allowed by zoning in each year (zndltqnt_ct), the number of lots approved for development over the previous year (apprvltqnt), a yearly land-price index in each census tract (landprices1k: in $1,000s), a set of time dummies (i.timeprd), a set of county fixed effects (carrind and harfind; Baltimore is the excluded dummy), and a set of IV variables based on either a nearest-neighbor cutoff or a distance cutoff. (See Table I in the paper for additional information.)

This file also includes a tract-by-year variable (fe), which identifies the neighborhood and is a combination of 14 years (timeprd) and 229 census tracts. Thus, this file should contain 3,206 observations (14 x 229) for the neighborhoods in the paper. All of these data were created using either GIS data or housing-price data from the Maryland Property View Database (MDPV).

The IV variables are weighted averages of the variables prespercent, udarea, and zndltqnt_ct in distant neighborhoods, which are added to the right-hand side of the OLS models. Distance is defined either by the 8 to 12 nearest neighbors or by distance cutoffs of 3 to 5 miles. Each of these cutoffs has a separate variable, labeled as such in this file. For example, the preservation variable for the cutoff of 8 neighbors is labeled prespercent_fe_8nn. Each of the other variables and cutoffs is labeled accordingly.

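As a quick check on the naming scheme and panel size described above, here is a minimal Python sketch (the archive itself uses Stata). Only the name prespercent_fe_8nn is confirmed by this README; the sketch assumes that udarea and zndltqnt_ct follow the same "<variable>_fe_<k>nn" pattern, and the helper name iv_names is purely illustrative. The labels for the distance-based cutoffs are not spelled out here, so they are left out.

```python
# Hypothetical helper, not part of the archive. Only "prespercent_fe_8nn" is
# documented; the other names are an ASSUMED extension of that pattern.
BASE_VARS = ["prespercent", "udarea", "zndltqnt_ct"]
NN_CUTOFFS = range(8, 13)  # nearest-neighbor cutoffs 8 through 12


def iv_names(k):
    """Return the assumed IV variable names for a k-nearest-neighbor cutoff."""
    return [f"{var}_fe_{k}nn" for var in BASE_VARS]


# Expected size of the tract-by-year panel: 14 years x 229 census tracts.
N_YEARS = 14
N_TRACTS = 229
EXPECTED_NEIGHBORHOODS = N_YEARS * N_TRACTS

for k in NN_CUTOFFS:
    print(k, iv_names(k))
print("expected neighborhood observations:", EXPECTED_NEIGHBORHOODS)
```

Comparing the generated names against the variable list in First_Stage_OLS_Data.txt is a cheap way to confirm (or correct) the assumed pattern before running the first-stage .do files.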
Second_Stage_Duration_Data.txt
This file contains the data used to run the second-stage (main) duration model. It is used by the TableII_First_Stage_OLS.do and TableA3_First_Stage_OLS.do Stata files to merge with the residuals generated by the first-stage OLS model (described above). After these two first-stage .do files are executed, you will have a series of .dta Stata data files labeled "Second_Stage_Duration_Data_X_NearN" and "Second_Stage_Duration_Data_X_Miles", where X stands for the 8 to 12 nearest neighbors or the 3 to 5 mile distance bands.

The land-use (subdivision) data in this file were generated by combining historical zoning and subdivision plat maps with GIS parcel data from Baltimore, Harford, and Carroll counties in Maryland. The parcel data are available through Maryland's Property View online database, and the plat maps are available online through the Maryland Historical Archives. The final dataset used in this paper was created by manually placing each individual housing parcel in the GIS data inside its respective subdivision, based on individual PDF files of the plats of each subdivision.

Each duration model (a binary probit model) regresses a binary indicator variable for development (development) on: the exogenous variables from the first-stage OLS model, including housing prices (houseprices1k), and a set of parcel-level cost variables.
The parcel-level variables are: the distance in kilometers of each parcel from Baltimore City (dstbaltcen), the distance in kilometers from the parcel to the closest primary highway (surprimerd), the size of the parcel in acres (areaacre), the number of building lots allowed on the parcel according to its zoning (zndltqnt), an indicator as to whether the parcel has public sewer (sewer), an indicator as to whether the parcel is in a flood zone (floodplne), an indicator as to whether the parcel is suitable for a septic system (septicsuit), a variable for the percentage of the parcel that has a slope greater than 15%, and an indicator variable for whether the parcel has an existing structure.

Each of the datasets also has two variables -- confunc_p and confunc_p2 -- which are the residuals or control functions generated in the first-stage OLS model. These variables vary across the "Second_Stage_Duration_Data_X_NearN" and "Second_Stage_Duration_Data_X_Miles" datasets according to which first-stage OLS model was used to generate the data; all of the other variables are constant across the duration models.

Finally, the dataset contains a set of variables that indicate: the parcel (id), the census tract (ctvar), the year (timeprd and y_1-y_14), the neighborhood (fe), and the county (baltind, carrind, harfind). The final dataset contains 183,580 observations, representing all of the parcel-year combinations in the subdivision data.

Second_Stage_Poisson_Data.txt
This file contains the data used to estimate the truncated Poisson model in Table A4 in the appendix. All of the independent variables are the same as in the duration models. The dependent variable in this dataset is ltqnt, which represents the number of lots created by each subdivision event. There were a total of 2,385 subdivision events during our study period, so this dataset contains 2,385 observations.

Douglas H. Wrenn
dhw121 [AT] psu.edu