Martin Huber and Blaise Melly, "A Test of the Conditional Independence
Assumption in Sample Selection Models", Journal of Applied
Econometrics, Vol. 30, No. 7, 2015, pp. 1144-1168.

The data used for the simulations in section 4 of the paper are in the
file simulations.txt, an ASCII file in DOS format. The same data are
also in the Stata dataset simulations.dta. The data used for the
application in section 5 of the paper are in the file application.txt,
also an ASCII file. The same data are also in the Stata dataset
application.dta.

The two text files are zipped in the file hm-data-ascii.zip.
Unix/Linux users should use "unzip -a". The two .dta files are zipped
in the file hm-data-stata.zip. Unix/Linux users should use "unzip".

The econometric results in this study were produced using the
programming language R. The code used to generate Figure 1 is in
figure1.R. The code used to generate Figure 2 is in figure2.R. The
code used to generate the simulation results in Section 4 is in
simulations.R. The code used to generate the empirical results in
Section 5 is in application.R.

All the R files are ASCII files in DOS format. They are zipped in the
file hm-codes.zip. Unix/Linux users should use "unzip -a".


Data Description:

The original source for both datasets is the "merged outgoing rotation
groups" extract of the CPS for the year 2011. This file (morg11.dta)
can be downloaded from

  http://data.nber.org/morg/annual/.

This dataset is documented in detail on the same NBER website. The
Stata do file "CPS-ORG-2011-data-preparation.do", which generated
application.dta and simulations.dta starting from morg11.dta, is
provided in hm-codes.zip, but it is not strictly needed because its
only goal is to produce these two datasets. simulations.dta contains
63,697 observations and 11 variables. application.dta contains 45,296
observations and 23 variables.
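
After unzipping, the dimensions stated above can be used as a quick
sanity check. The following is only a minimal sketch, not part of the
replication files: check_dims is a hypothetical helper, and the code
assumes that the .dta files sit in the working directory and that
their Stata format version is one that read.dta from the "foreign"
package can read. If it is not, another Stata reader would be needed.

```r
# Hypothetical helper (not part of the replication files): compare a data
# frame's dimensions with the counts stated in this readme.
check_dims <- function(df, n_obs, n_vars) {
  stopifnot(is.data.frame(df))
  nrow(df) == n_obs && ncol(df) == n_vars
}

# Expected dimensions, copied from the data description above.
expected <- list(
  simulations = c(n_obs = 63697, n_vars = 11),
  application = c(n_obs = 45296, n_vars = 23)
)

# Assumption: the .dta files have been unzipped into the working directory
# and are in a Stata format that foreign::read.dta supports.
for (name in names(expected)) {
  path <- paste0(name, ".dta")
  if (file.exists(path)) {
    df <- foreign::read.dta(path)
    ok <- check_dims(df,
                     expected[[name]][["n_obs"]],
                     expected[[name]][["n_vars"]])
    cat(path, if (ok) "matches" else "does NOT match",
        "the stated dimensions\n")
  } else {
    # Files that are missing are skipped rather than treated as an error.
    cat(path, "not found; skipping\n")
  }
}
```

A mismatch would most likely point to an incomplete download or to a
file that was regenerated from a different version of morg11.dta.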