Sabine Deij, Jakob B. Madsen, and Laura Puzzello, "When Are Instruments Generated From Geographic Characteristics in Bilateral Relationships Invalid?", Journal of Applied Econometrics, Vol. 36, No. 4, 2021, pp. 437-452.

There are three folders: 1) "Data" contains all the data sets saved as .csv (delimiter is ",") as well as the original .dta files (in two separate folders); 2) "Programs" contains all the Stata .do-files; and 3) "Output", to which each table and figure will be written. The folder "Data" is the working directory for all .do-files. Except for the .dta files, all files are ASCII text files in DOS format.

Data files

The file gravity_98_final.csv contains bilateral data for the 98 countries from the Mankiw et al. (1992) sample and 162 partner countries, i.e., each country has 161 partners. Although relevant data are available for a larger set of partner countries, the analysis follows Frankel and Romer (1999) and limits partner countries outside the sample to those whose population is greater than 100,000. Bilateral trade data for the year 1985 from the IMF's Direction of Trade Statistics (DOTS) are used to construct symmetric bilateral trade flows. Bilateral trade shares are calculated by dividing bilateral trade in nominal terms by the destination country's GDP. The latter is the product of real GDP per capita (base year 1985) and the country's population, both from the Penn World Tables (PWT) Mark 5.6. In addition, the bilateral data set contains data on area, bilateral distance, border and landlocked status from the CEPII GeoDist database. Population data are from the PWT Mark 5.6 and, when missing, from the World Development Indicators (WDI). The variables and variable labels are listed in variable_descr.txt under the heading gravity_98_final.

The file structural_98_final.csv contains the country-level data set, which includes countries' real GDP and real openness as well as the controls used in models 2 through 14 of the paper.
Real income per capita, trade openness, and population data are taken from PWT Mark 5.6. Area is sourced from CEPII. Data on the percentage of land or population in the tropics, and on continents, are from the Centre for International Development (CID). Latitude and distance to the equator are sourced from CEPII. Legal origin is from La Porta et al. (2008) and, when missing, from the CIA World Factbook. The index of ethno-linguistic fractionalization is from Easterly and Levine (1997). Data on constraints on the executive are from the Polity IV Project (2014). Finally, data on corruption and the quality of governance come from the International Country Risk Guide (ICRG) provided by the Political Risk Services Group. The variables and variable labels are listed in variable_descr.txt under the heading structural_98_final.

The file WB_income_classification.csv contains the World Bank classification of low, lower-middle, upper-middle and high income countries used to calculate the averages in footnote 9. We combine lower-middle and upper-middle countries into one category, middle income countries. We use the value for 1987 as this is the first observation available in this data set.

The files T_hat_98_ols.csv, T_hat_98r_ppml.csv, vars_r_98_1000dr_org.csv, T_hat_r_98_1000dr_org.csv, and results_r_98_1000dr_org.csv are provided to facilitate running the program files. Please note that the file named T_hat_r_98_1000dr_org.csv contains 30,002 variables with 98 observations each, so it cannot be viewed in its entirety in Excel.

The following four additional data sets are used only for the analysis described in Online Appendix E.

The file data_regulation_share.csv contains data from Helpman et al. (2008).
Our analysis uses two indicators of high fixed costs of trade: 1) the dummy High Regulation Cost for each reporter (ind_cost_new), which takes the value of 1 if country i's relative costs are above the cross-country median; and 2) the dummy High # of days and procedures for each reporter (ind_procdays_new), which equals 1 if country i's required number of days and legal procedures are above the median, and zero otherwise. The data underlying our variables are based on firm entry regulation costs as a percentage of GDP per capita as well as the number of days and legal procedures that are required for an entrepreneur to legally start operating a business. The data set did not include separate data for Belgium and Luxembourg, only for Benelux. We assumed that regulation costs were the same for both countries and used the value for Benelux for both Belgium and Luxembourg.

The file trans_infr_limao-venables.csv contains data from Limão and Venables (2001), including two indexes used in our analysis. The first index, 'Own Infrastructure' (infr), is estimated as the average of road density, rail density, and the number of telephone lines per capita, raised to the power of -0.3. The other index, 'Transit Infrastructure' (infr_trans), applies to landlocked countries only and is the average infrastructure index of the transit countries that a country needs to pass through to reach the sea. For both infrastructure indexes, a higher value indicates worse infrastructure.

The file data1980s_share_iso3.csv contains variables from another data set from Helpman et al. (2008), from which we use island status and common religion between each reporter and partner to supplement the data in gravity_98_final.csv for the analysis in Online Appendix E. Island status (n_islands) is a dummy variable equal to 1 when both reporter and partner are island countries and zero otherwise.
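
The median-split indicator described above for data_regulation_share.csv (ind_cost_new) can be illustrated with a small Python sketch. It is not part of the replication programs, which are Stata .do-files. The cost values and the country codes AAA to DDD are made up, the function name median_split_indicator is hypothetical, and assigning the Benelux value to Belgium and Luxembourg before taking the median is an assumption about the order of steps.

```python
from statistics import median


def median_split_indicator(values):
    """Return 1 for entries strictly above the cross-country median, else 0."""
    cutoff = median(values.values())
    return {country: int(value > cutoff) for country, value in values.items()}


# Made-up regulation costs (% of GDP per capita); NOT values from the data set.
costs = {"AAA": 5.0, "BBB": 12.0, "CCC": 30.0, "DDD": 48.0, "Benelux": 9.0}

# Use the Benelux value for both Belgium and Luxembourg, as described in the text.
# Doing this before computing the median is an assumption of this sketch.
benelux_value = costs.pop("Benelux")
costs["BEL"] = benelux_value
costs["LUX"] = benelux_value

ind_cost = median_split_indicator(costs)
print(ind_cost)
```

With these toy numbers the median is 10.5, so Belgium and Luxembourg (both at 9.0) receive the same indicator value, which is the practical consequence of the Benelux assumption.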
Common religion (religion_same_recoded) is a variable with values between 0 and 1, where larger values indicate that the countries' religious affiliations are more similar.

An additional note for these variables: Helpman et al. (2008) do not include data for Belgium, Luxembourg, and Botswana. For Belgium and Luxembourg we therefore use the observations for Benelux, replacing Benelux with either Belgium or Luxembourg. For the bilateral observation Belgium - Luxembourg we use the bilateral observation Benelux - France, in which we replace Benelux with Belgium and France with Luxembourg. According to the CIA World Factbook, the shares of religious affiliation in France are the most similar to those of Luxembourg. For Botswana, a duplicate of the observations for South Africa is used. The bilateral observation South Africa - Botswana is a duplicate of the bilateral observation South Africa - Angola, in which we substitute Botswana for Angola. Again, the choice of Angola is based on the CIA World Factbook: after South Africa, Angola is the most similar country in terms of the shares of religious affiliation of its population.

The file gravdata_headetal_1985.csv contains a subset of variables from the data set by Head et al. (2010), which we add to gravity_98_final.csv for the analysis in Online Appendix E. Common legal system (comleg) is a dummy variable equal to 1 if country i's and country j's legal systems have the same origin (for example, English or French origin). The variables common language (comlang_off) and common currency (comcur) are dummy variables equal to 1 if both countries have an official language in common and the same currency, respectively. Regional trade agreement (rta) is a dummy variable equal to 1 if both countries are part of the same regional trade agreement. Data on WTO membership is used to construct two dummy variables.
The first (d_wto_none) is a dummy equal to 1 if both countries in the pair are not members of the WTO; the second (d_wto_both) is a dummy equal to 1 if both countries are members.

List of variables:

gravity_98_final

name            variable_label
year            Year
iso3_partner    Iso3 of partner
iso3_reporter   Iso3 of reporter
lbil_tsh        Log of bil_tsh; =ln(bil_tsh)
ldistw          Log of distw; =ln(distw)
lpop_r          Log of pop_r; =ln(pop_r)
lpop_p          Log of pop_p; =ln(pop_p)
lpop_r_bord     Interaction lpop_r with border
lpop_p_bord     Interaction lpop_p with border
larea_r         Log of area_r; =ln(area_r)
larea_p         Log of area_p; =ln(area_p)
larea_r_bord    Interaction larea_r with border
larea_p_bord    Interaction larea_p with border
sumll           Sum of ll; equal to 2 if both cty's are ll, 1 if one cty is ll and 0 otherwise
border          1 if country pair have a shared border
ldistw_bord     Interaction ldistw with border
sumll_bord      Interaction sumll with border
distcap         simple distance between capitals (capitals, km)
distw           weighted distance (pop-wt, km)
distwces        weighted distance (pop-wt, km) CES distances with theta=-1
pop_r           Population, total; source PWT5.6
pop_p           Population, total; source PWT5.6
area_r          Area in sq. kms
area_p          Area in sq. kms
ll_r            1 if landlocked
ll_p            1 if landlocked
pwtdata_p       Dummy equal to 1 if pop data is from PWT5.6
dfr150_r        Dummy equal to 1 if reporter is included in FR sample
dfr98_r         Dummy equal to 1 if reporter is included in FR98 sample
pop2_p          Population, total
bil_trade       Value of bilateral trade, mlns US Dollars, current prices
dnodots         Dummy equal to one if country is not found in DOTS
totgdp_r        Total GDP = Nominal GDP; =rgdpch*pop*1000
bil_tsh         Bilateral TSH; =(bil_trade*1000*1000)/totgdp
dfr150_p        Dummy equal to 1 if country is included in full FR sample
dfr98_p         Dummy equal to 1 if country is included in FR98 sample
i               tag(iso3_partner)
j               tag(iso3_reporter)
id_p            Iso3 of partner
id_r            Iso3 of reporter
bil_tsh_2       Bilateral trade share with missing replaced by zeros
ind_tij         1 if bilateral trade is observed
numpar_i        Number of trading partners

structural_98_final

name            variable_label
iso3            ISO or ALPHA-3 code
year            Year
rgdpch          Real GDP per capita, in constant $ (Chain index) (1985 international prices)
open            Openness (Exports+Imports)/CGDP (current international prices)
pop             Population, total; Source PWT5.6; =original series*1000
area            Area in sq. kms
lrgdpch         Log of rgdpch; =ln(rgdpch)
lpop            Log of pop; =ln(pop)
larea           Log of area; =ln(area)
dfr150          Dummy equal to 1 if included in FR150 sample
dfr98           Dummy equal to 1 if included in FR98 sample
lat             Latitude in degrees
d_africa        1 if country is in Africa
d_america       1 if country is in America
d_asia          1 if country is in Asia
d_europe        1 if country is in Europe
d_pacific       1 if country is in Pacific
latitude        Latitude, =lat/90, [-1,1]
dist2equ        Distance to the equator, =abs(lat/90), [0,1]
tropicar        % land in geographical tropics
troppop         % population in geographical tropics, 1994
d_subsahafrica  1 if country is in Sub-Saharan Africa; source CID
d_latamerica    1 if country is in Latin America; source CID
d_eseasia       1 if country is in East and Southeast Asia; source CID
year_xconst     Year of xconst score
xconst          PolityIV Executive constraint
elf60           Index of ethnic-linguistic fractionalization, Source: Easterly & Levine (1997)
legorigin       Legal Origin as used by Noguer and Siscart (2005)
corrup          Corruption in government, rescaled to [0,1]
icrg_kk         Index of Government Quality, ICRG, following Knack and Keefer (1995)
numpar_i        Number of trading partners

Programs / .do-files

All tables and figures were generated using Stata version 13.1. Stata SE or MP is required. You will also need to have the following Stata packages installed: estadd; ivreg2; mstore; outreg2; ppml (st0225.pkg); svmat2.

The Stata .do-file called "2020_replicate_output.do" generates the set of results specified by listing the desired outputs at the start of the .do-file in the global macro "output". To run the entire randomization exercise, please adjust the global macro "version" in "2020_replicate_output.do". Please note that replicating the entire randomization exercise takes several days for all 1000 iterations. Also note that the output generated for Table 3, Table OD.1 and Table OD.2 may deviate slightly from the results reported in the paper because each new set of draws differs from the original set of draws.
To lower the run time, the global macro "s" in "2020_replicate_output.do" allows the number of iterations to be adjusted from the default of 1000.

List of Tables and Figures with the programs and data sets used:

Table 1 – Gravity estimates using OLS
  --> execute bilateral_equ.do
  --> using gravity_98_final
  --> creating T_hat_98_ols

Table 2 – Estimates of the Income Equation using Actual Data and the "OLS instrument" (models 1 through 4)
  --> execute inc_equ.do
  --> using gravity_98_final, structural_98_final, T_hat_98_ols
  --> creating alt_T_hat_98_ols

Table 3 – Estimates of the Income Equation using Randomized Instruments (1000 replications) (models 1 through 4, limited)
  --> execute simulation.do
  --> using gravity_98_final, structural_98_final
  --> creating vars_r_98_1000dr, T_hat_r_98_1000dr, results_r_98_1000dr

Figure 1 – First-stage and reduced-form: Trade openness and income per capita versus \tilde{T}_i^{Pos*}
  --> execute figure_1.do
  --> using structural_98_final

Table 4 – Estimates of the Income Equation: Controlling for the number of trading partners (models 1 through 4)
  --> execute inc_equ_control_num.do
  --> using structural_98_final, T_hat_98_ols & alt_T_hat_98_ols

Outputs in the ONLINE APPENDIX

Table OB.1 – Estimates of the Income Equation using Actual Data and the "OLS instrument": additional controls included (models 5 through 14)
  --> execute inc_equ.do
  --> using gravity_98_final, structural_98_final, T_hat_98_ols
  --> creating alt_T_hat_98_ols

Table OC.1 – Gravity estimates using PPML
  --> execute bilateral_equ.do
  --> using gravity_98_final
  --> creating T_hat_98_ppml

Table OC.2 – Estimates of the Income Equation using Actual Data and the "PPML instrument" (models 1 through 4)
AND
Table OC.3 – Estimates of the Income Equation using Actual Data and the "PPML instrument": additional controls included (models 5 through 14)
  --> execute inc_equ.do
  --> using gravity_98_final, structural_98_final, T_hat_98_ppml
  --> creating alt_T_hat_98_ppml

Table OD.1 – Estimates of the Income Equation using Randomized Instruments (1000 replications)
AND
Table OD.2 – Estimates of the Income Equation using Randomized Instruments: additional controls included
  --> execute simulation.do
  --> using gravity_98_final, structural_98_final
  --> creating vars_r_98_1000dr, T_hat_r_98_1000dr, results_r_98_1000dr

Table OE.1 – Number of partners, trading costs and income per capita
  --> execute z_appendix_numpar_trade-costs.do
  --> using structural_98_final, data_regulation_share & trans_infr_limao-venables

Table OE.2 – Probability of positive bilateral trade versus zero or missing values
  --> execute z_appendix_zero-vs-missing.do
  --> using gravity_98_final, gravdata_headetal_1985, structural_98_final & data1980s_share_iso3

Figure OE.1 – Distribution of fitted bilateral trade shares for missing and zero bilateral trade
  --> execute z_appendix_figure_OE1.do
  --> using gravity_98_final

Table OF.1 – Estimates of the Income Equation: Controlling for the number of trading partners, additional controls included
  --> execute inc_equ_control_num.do
  --> using structural_98_final, T_hat_98_ols & alt_T_hat_98_ols

Sabine Deij, Jakob B. Madsen and Laura Puzzello
December 2020

Sabine Deij
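
As an informal cross-check on the variable definitions listed for gravity_98_final, the Python sketch below recomputes a few derived variables from their stated formulas. It is not part of the replication programs, which are Stata .do-files. The function name gravity_row and all input numbers are hypothetical. Two points are assumptions rather than statements from this readme: that pop is measured in thousands (suggested by the factor of 1000 in totgdp) and that each "interaction" variable is the simple product of the two variables named in its label.

```python
import math


def gravity_row(bil_trade_mln, rgdpch, pop, distw, ll_r, ll_p, border):
    """Recompute selected gravity_98_final variables from their listed formulas.

    bil_trade_mln is bilateral trade in millions of US dollars, or None if missing.
    """
    totgdp = rgdpch * pop * 1000  # totgdp = rgdpch*pop*1000 (pop assumed in thousands)
    observed = bil_trade_mln is not None
    # bil_tsh = (bil_trade*1000*1000)/totgdp
    bil_tsh = (bil_trade_mln * 1000 * 1000) / totgdp if observed else None
    ldistw = math.log(distw)
    sumll = ll_r + ll_p  # 2 if both landlocked, 1 if one is, 0 otherwise
    return {
        "totgdp_r": totgdp,
        "bil_tsh": bil_tsh,
        "lbil_tsh": math.log(bil_tsh) if observed and bil_tsh > 0 else None,
        "bil_tsh_2": bil_tsh if observed else 0.0,  # missing replaced by zero
        "ldistw": ldistw,
        "sumll": sumll,
        "ldistw_bord": ldistw * border,  # assumed: interaction = product
        "sumll_bord": sumll * border,    # assumed: interaction = product
    }


# Made-up inputs; NOT values from the data set.
row = gravity_row(bil_trade_mln=2500, rgdpch=10000, pop=50000,
                  distw=1000, ll_r=1, ll_p=0, border=1)
missing = gravity_row(bil_trade_mln=None, rgdpch=10000, pop=50000,
                      distw=1000, ll_r=0, ll_p=0, border=0)
print(row)
print(missing)
```

To compare against the actual data, the same formulas could be applied to each row of gravity_98_final.csv read with Python's csv module.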