Adam Nowak and Patrick Smith, "Textual Analysis in Real Estate", Journal 
of Applied Econometrics, Vol. 32, No. 4, 2017, pp. 896-918.

In our study we employ two unique datasets, both of which include
confidential data.

The primary data source used in our study was provided by the Georgia
Multiple Listings Service (GAMLS). We were given consent to use the
GAMLS data solely for research purposes, but do not have permission to
disseminate the data.

We also collected data from tax assessor offices in the counties of
interest to our study. Although these data are publicly available, we
do not have permission to disseminate them. Individuals interested in
obtaining the county tax assessor data must contact the offices
directly and provide documentation that the data will be used for
research purposes (the tax assessor offices charge a fee for
commercial use).

Our contacts at the GAMLS and tax assessor offices are as follows:
Brian Chew (GAMLS), Lisa Ballouk (Gwinnett County), Karen Bess (DeKalb
County), Constance Mackey (Fulton County), Rodney McDaniel (Clayton
County), and Peggy Parker (Cobb County). 

Description of the Data

The GAMLS website states "Listing content includes, but is not
limited to, photographs, images, graphics, audio and video recordings,
virtual tours, drawings, descriptions, remarks, narratives, pricing
information, and other details or information related to listed
property."  Our study incorporates property attributes (square
footage and age), transaction attributes (short sale or agent owned
indicators), and location fields (address) from the GAMLS. 

The program "Import-Data.R" reads in the raw GAMLS data and creates
variables used in the estimation. All variable names are
self-explanatory. The output is "MLS_Atlanta.csv", a file of sale
prices and listing attributes.

Of particular interest in this study are the "public remarks" that are
entered by real estate agents who have toured and assessed the
characteristics, quality, and condition of the properties.  The program
"Token-Maker-Program" is used to (1) clean the remarks, (2) create
tokens for each remark, and (3) save a list of bigrams and unigrams as
the R data objects "big.bigram.list" and "big.unigram.list",
respectively.  
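
The three steps above can be illustrated with a minimal sketch in base
R. This is not the code in "Token-Maker-Program"; the example remarks,
the helper names (clean_remark, tokenize, make_bigrams), the cleaning
rules, and the output file are assumptions made for illustration
only. Only the object names "big.bigram.list" and "big.unigram.list"
come from the description above.

```r
# Hypothetical remarks; the actual GAMLS remarks are confidential.
remarks <- c("Beautiful home, NEW roof & granite counters!",
             "Granite counters; needs TLC.")

# (1) Clean: lower-case the text and replace every run of
# non-letter characters with a single space (assumed cleaning rule).
clean_remark <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z]+", " ", x)
  trimws(x)
}

# (2) Tokenize: split each cleaned remark on spaces, giving one
# character vector of unigrams per remark.
tokenize <- function(x) strsplit(clean_remark(x), " ", fixed = TRUE)

# Form bigrams by pasting each token to the token that follows it.
make_bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

unigrams <- tokenize(remarks)
bigrams  <- lapply(unigrams, make_bigrams)

# (3) Collect the distinct unigrams and bigrams across all remarks
# and save them as R data objects (the file location is an assumption).
big.unigram.list <- sort(unique(unlist(unigrams)))
big.bigram.list  <- sort(unique(unlist(bigrams)))
save(big.bigram.list, big.unigram.list,
     file = tempfile(fileext = ".RData"))
```

In this sketch, bigrams are formed within a remark only, so no bigram
spans two different listings.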

Both R programs are ASCII files in DOS format. They are zipped in the
file ns-programs.zip. Unix/Linux users should use "unzip -a".