Andrew M. Jones, James Lomas, and Nigel Rice, "Applying Beta-type Size Distributions to Healthcare Cost Regressions", Journal of Applied Econometrics, Vol. 29, No. 4, 2014, pp. 649-670.

The data used in the paper are based upon the Hospital Episode Statistics administered by the Health and Social Care Information Centre (formerly NHS Information Centre). This data source is confidential, and access has to be requested. Extraction carries a fee. For more information, see http://www.hscic.gov.uk/hes.

The dependent variable is 'individual patient annual NHS hospital cost for all spells finishing in the financial year 2007-2008'. Unit costs were applied to inpatient activity data, excluding maternity and mental health care. The dominant HRG of each spell is costed using the 2008/9 tariff where possible; if not, 2005/6 reference costs or specialty function average costs are used.

For more details see Dixon J, Asaria P, Georghiou T, Billings J, Gravelle H, Martin S, Rice N, Smith P, Wennberg D, DeLorenzo M, Siegal M, Russell R, Filipova N. 2009. Developing a person based resource allocation formula for general practices in England. Report to the Department of Health. In particular, see Appendix 5. Website: http://www.nuffieldtrust.org.uk/our-work/projects/person-based-resource-allocation-pbra.

Note that maternity costs were also excluded in our study, and only inpatient costs were included (not outpatient costs).

Independent variables were created using the following Stata code:

    program datasetup
        gen age = 2007 - yob
        gen age2 = age^2
        gen age3 = age^3
        gen female = sex - 1
        gen fage = female*age
        gen fage2 = female*age2
        gen fage3 = female*age3
        egen epiA = rmax(A00A09_7epi - A90A99_7epi)
        egen epiB = rmax(B00B09_7epi - B85B99_7epi)
        egen epiC = rmax(C00C14_7epi - C81C96_7epi)
        egen epiD = rmax(D00D48_7epi - D65D89_7epi)
        egen epiE = rmax(E00E07_7epi - E15E90_7epi)
        egen epiF = rmax(F00F03_7epi - F80F99_7epi)
        egen epiG = rmax(G00G09_7epi - G80G83_7epi)
        egen epiH = rmax(H00H06H15H22H30H36H43H59_7epi - H60H95_7epi)
        egen epiI = rmax(I00I09_7epi - I95I99_7epi)
        egen epiJ = rmax(J00J06_7epi - J80J99_7epi)
        egen epiK = rmax(K00K14_7epi - K90K93_7epi)
        egen epiL = rmax(L00L14L55L99_7epi - L50L54_7epi)
        egen epiM = rmax(M00M25_7epi - M95M99_7epi)
        egen epiN = rmax(N00N08N10N16_7epi - N99_7epi)
        egen epiO = rmax(O00O08_7epi - O80O84_7epi)
        egen epiP = rmax(P00P04_7epi - P05P96_7epi)
        egen epiQ = rmax(Q00Q89_7epi - Q90Q99_7epi)
        egen epiR = rmax(R00R09_7epi - R95R99_7epi)
        egen epiS = rmax(S00S09_7epi - S90S99_7epi)
        egen epiT = rmax(T00T07_7epi - T90T98_7epi)
        gen epiU = UUU_7epi
        gen epiV = VVV_7epi
        gen epiW = WWW_7epi
        gen epiX = XXX_7epi
        gen epiY = YYY_7epi
        egen epiZ = rmax(Z00Z13_7epi - Z80Z99_7epi)
        gen epiOP = max(epiO,epiP)
    end

and the samples were set up using the following Stata code:

    use "gb2estimationrnr.dta", clear
    drop if tincostXM07==0
    set seed 45678 // set seed to enable replication
    gen split = runiform() // generate random numbers
    egen pos = rank(split)
    replace split = 1
    replace split = 0 if pos < 3082058
    drop pos
    saveold "gb2estimationrnrXM.dta", replace
    drop if split == 1
    summ split
    save "gb2validationrnrXM.dta" , replace // save validation sample

    forvalue i= 1(1)100 {
        use "gb2estimationrnrXM.dta", clear
        set seed `i'
        drop if split==0
        sample 5000, count
        gen sort = runiform()
        datasetup
        saveold "gb2rnrXM_5000_`i'.dta", replace
    }

    forvalue i= 1(1)100 {
        use "gb2estimationrnrXM.dta", clear
        set seed `i'
        drop if split==0
        sample 10000, count
        gen sort = runiform()
        datasetup
        saveold "gb2rnrXM_10000_`i'.dta", replace
    }

    forvalue i= 1(1)100 {
        use "gb2estimationrnrXM.dta", clear
        set seed `i'
        drop if split==0
        sample 50000, count
        gen sort = runiform()
        datasetup
        saveold "gb2rnrXM_50000_`i'.dta", replace
    }

    forvalue i= 1(1)100 {
        use "gb2estimationrnrXM.dta", clear
        set seed `i'
        drop if split==0
        sample 100000, count
        gen sort = runiform()
        datasetup
        saveold "gb2rnrXM_100000_`i'.dta", replace
    }
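
The four sampling loops above differ only in the sample size. The following is a sketch of a more compact form, not part of the original replication code. It assumes the program datasetup and the file gb2estimationrnrXM.dta defined above, and the loop macro n is introduced here for illustration. Because each iteration performs the same commands in the same order after set seed `i', it should write the same files as the four separate loops.

```stata
* Sketch only: one nested loop in place of the four separate loops.
foreach n of numlist 5000 10000 50000 100000 {
    forvalues i = 1/100 {
        use "gb2estimationrnrXM.dta", clear
        set seed `i'                // same seed per replicate as the original loops
        drop if split == 0          // keep the estimation subset only
        sample `n', count           // draw n observations without replacement
        gen sort = runiform()
        datasetup                   // create the independent variables
        saveold "gb2rnrXM_`n'_`i'.dta", replace
    }
}
```

Only the order in which the files are written changes (all replicates of one size, then the next size), which does not affect their contents since the seed is reset inside the inner loop.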