## Wednesday, February 6, 2013

### Non-Parametric PDF Fit Test

* This is an idea that I decided to explore before inspecting how others have addressed the problem.

* As noted in my previous post, we cannot use standard reasoning based on independent draws to test model fit.

* The following command simulates random draws from the distribution being tested and measures how closely they match that distribution's theoretical quantiles.

* It then uses that information to test whether we can reject the null hypothesis that the observed distribution is the same as the null distribution.
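Before the Stata implementation below, the core logic can be sketched in Python. This is a minimal sketch with hypothetical function names, covering only the normal case: compute a t-statistic on the squared deviations between standardized order statistics and theoretical quantiles, then compare the observed t against a simulated null distribution of that statistic.

```python
import random
import statistics
from statistics import NormalDist

def fit_stat(x):
    """t-statistic for the mean of squared quantile deviations
    (the analogue of `reg d' followed by _b[_cons]/_se[_cons])."""
    n = len(x)
    xs = sorted(x)
    m, s = statistics.mean(xs), statistics.stdev(xs)
    z_emp = [(v - m) / s for v in xs]                  # standardized order statistics
    z_th = [NormalDist().inv_cdf((i + 0.5) / n)        # plotting-position quantiles
            for i in range(n)]
    d = [(a - b) ** 2 for a, b in zip(z_th, z_emp)]
    se = statistics.stdev(d) / n ** 0.5                # standard error of the mean
    return statistics.mean(d) / se

def dist_check(x, reps=100, seed=1):
    """Simulate the null distribution of the statistic under normality
    and return a one-sided empirical p-value."""
    random.seed(seed)
    n = len(x)
    t_null = sorted(fit_stat([random.gauss(0, 1) for _ in range(n)])
                    for _ in range(reps))
    t_obs = fit_stat(x)
    pos = sum(t <= t_obs for t in t_null)              # rank of observed t among null draws
    return 1 - pos / reps
```

The Stata version below follows the same structure but also handles the Poisson and uniform cases and reports two-sided intervals.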

cap program drop dist_check
program define dist_check, rclass

* The syntax command parses the input into useful macros used for later calculations.
syntax  varlist, Tgenerate(numlist >= 100) [Dist(string)]

if "`dist'"=="" local dist="normal"

if "`dist'"=="normal" di "Testing if `varlist' is normally distributed"
if "`dist'"=="uniform" di "Testing if `varlist' is uniformly distributed"
if "`dist'"=="poisson" di "Testing if `varlist' has a poisson distributed"

di "Simulate `tgenerate' draws"

* Construct a t-vector in Mata to store estimation results.
mata: tdist = J(`tgenerate', 1, .)

*******
* This will randomly draw from the distribution being tested a number of times equal to tgenerate.
*******

preserve
qui: drop if missing(`varlist')  // also drops extended missing values (.a-.z)

forv i=1(1)`tgenerate' {

tempvar x zx z p d

if "`dist'"=="normal" qui: gen `x' = rnormal()

if "`dist'"=="poisson" {
qui: sum `varlist'
qui: gen `x' = rpoisson(r(mean))
}

if "`dist'"=="uniform" qui: gen `x' = runiform()

sort `x'
qui: gen `p' = (_n-.5)/_N

if "`dist'"=="normal" {
qui: egen `zx' = std(`x')
qui: gen `z' = invnormal(`p')
qui: gen `d' = (`z'-`zx')^2
}

if "`dist'"=="poisson" {
qui: sum `x'
qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
* Stata's invpoisson() is parameterized in reverse of what one might expect, so 1-p is necessary here.
qui: gen `d' = (`z'-`x')^2
}

if "`dist'"=="uniform" {
qui: gen `z' = _n/_N
qui: gen `d' = (`z'-`x')^2
}

qui: reg `d'
local t = _b[_cons]/_se[_cons]
mata: tdist[`i', 1]=`t'

drop `x' `z' `p' `d'
cap drop `zx'
}

* From the above loop we should have a vector of t values.
* We can use that vector to construct confidence intervals, using the t values observed when the null is true as cutoff points.

mata: tsorted = sort(tdist, 1)

* One tailed test
local c90 = floor(`tgenerate'*.90)+1
mata: st_local("t90", strofreal(tsorted[`c90']))

local c95 = floor(`tgenerate'*.95)+1
mata: st_local("t95", strofreal(tsorted[`c95']))

local c99 = floor(`tgenerate'*.99)+1
mata: st_local("t99", strofreal(tsorted[`c99']))

* Two Tailed
* 90% CI
local c90U = floor((`tgenerate'+.5)*.95)+1
mata: st_local("t90U", strofreal(tsorted[`c90U']))

local c90L = ceil((`tgenerate'+.5)*.05)-1
mata: st_local("t90L", strofreal(tsorted[`c90L']))

* 95% CI
local c95U = floor((`tgenerate'+.5)*.975)+1
mata: st_local("t95U", strofreal(tsorted[`c95U']))

local c95L = ceil((`tgenerate'+.5)*.025)-1
mata: st_local("t95L", strofreal(tsorted[`c95L']))

* 99% CI
local c99U = floor((`tgenerate'+.5)*.99)+1
mata: st_local("t99U", strofreal(tsorted[`c99U']))

local c99L = ceil((`tgenerate'+.5)*.01)-1
mata: st_local("t99L", strofreal(tsorted[`c99L']))

** Now we do the estimation
tempvar x zx z p d

qui: gen `x' = `varlist'

sort `x'
qui: gen `p' = (_n-.5)/_N

* We transform the data in different ways depending upon what distribution we are assuming.
if "`dist'"=="normal" {
qui: egen `zx' = std(`x')
qui: gen `z' = invnormal(`p')
qui: gen `d' = (`z'-`zx')^2
}

if "`dist'"=="poisson" {
qui: sum `x'
qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
qui: gen `d' = (`z'-`x')^2
}

if "`dist'"=="uniform" {
qui: sum `x'
qui: gen `z' = (`x'-r(min))/(r(max)-r(min))
qui: gen `d' = (`z'-`x')^2
}

* This is the regression of interest.
qui: reg `d'
local t = _b[_cons]/_se[_cons]

* Now we compare our t with the ts generated when the draws truly came from the null distribution.
di _newline "Estimated t:  `: di %9.3f `t''" _newline

di "One-way Analysis"
di "  CI   (1%) :    0.000   to `: di %9.3f `t99''"
di "  CI   (5%) :    0.000   to `: di %9.3f `t95''"
di "  CI   (10%):    0.000   to `: di %9.3f `t90''"

mata: psearch = abs(tsorted:-`t')
mata: a = 1::`tgenerate'
mata: b = psearch :== min(psearch)
mata: st_local("position", strofreal(sum(b:*a)))
local p1 = 1-`position'/`tgenerate'
local p2 = min(`position'/`tgenerate',1-`position'/`tgenerate')*2

di "One-sided  p:`: di %9.3f `p1''   (`position' out of `tgenerate')"

di _newline "Two-way Analysis"
di "  CI   (1%): `: di %9.3f `t99L''    to `: di %9.3f `t99U''"
di "  CI   (5%): `: di %9.3f `t95L''    to `: di %9.3f `t95U''"
di "  CI  (10%): `: di %9.3f `t90L''    to `: di %9.3f `t90U''"
di "Two-sided p: `: di %9.3f `p2''"

return scalar p1 = `p1'
return scalar p2 = `p2'

restore
end

* Let's see how this works on a sample draw.
clear
set obs 1000
gen x = rnormal()+1
dist_check x, t(100) dist(poisson)
* Interestingly, some data distributions seem to fit the pdfs of other distributions "better".
* Thus the estimated t is smaller for some distributions.
* For this reason I have included a two sided confidence interval.
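The one- and two-sided p-values come from the rank of the observed t among the simulated null ts. The calculation can be sketched in Python (a hypothetical helper mirroring the Mata search above):

```python
def empirical_p(t_obs, t_null):
    """One- and two-sided empirical p-values from the rank of the observed
    statistic within the sorted null-simulation draws."""
    t_sorted = sorted(t_null)
    reps = len(t_sorted)
    # position (1-based) of the null draw closest to the observed statistic
    pos = min(range(reps), key=lambda i: abs(t_sorted[i] - t_obs)) + 1
    p1 = 1 - pos / reps                        # one-sided: mass above the observed t
    p2 = 2 * min(pos / reps, 1 - pos / reps)   # two-sided: doubled smaller tail
    return p1, p2
```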

* The following small program will be used to construct a Monte Carlo simulation to see how well the dist_check command is working.
cap program drop checker
program define checker
syntax [anything], obs(numlist > 0) rdraw(string) reps(numlist > 0) dist(string)
clear
set obs `obs'
gen x = `rdraw'
dist_check x, t(`reps') dist(`dist')
end

* Easily generate data
checker , obs(1000) rdraw(runiform()*30) reps(200) dist(normal)

* Now let's see it in action
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(50) rdraw(rnormal()*30) reps(100) dist(normal)
* Because the data here are generated under the null, we know almost by definition that the test is properly sized, but this serves as a reference.

* We should reject at the 10% level for both one sided and two sided confidence intervals.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum
* mean r1_10 and mean r2_10 are the rejection rates for the one-tailed and two-tailed tests, respectively.

* Because the distribution truly satisfies the null, we reject the null only about 10% of the time at the 10% level.

* This looks great.
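What the size check above computes is simply the share of simulated p-values at or below the nominal level; as a Python sketch (hypothetical helper):

```python
def rejection_rate(pvals, level=0.10):
    """Share of Monte Carlo p-values at or below the level; under the null
    this should be close to the level itself (the test's empirical size)."""
    return sum(p <= level for p in pvals) / len(pvals)
```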

* Now let's see about the test's power to reject the null when the true distribution is not the null.

checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* This distribution is clearly not uniform
hist x

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* Here the data are not generated under the null, so the mean rejection rates estimate the test's power rather than its size.

gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum

checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* This distribution is almost uniform
hist x

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* Again, the data are not generated under the null, so the mean rejection rates estimate the test's power rather than its size.

gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum

* I think there might be some slight errors.  I hope this post is useful, though I would not recommend using this particular command: more effective and better-established commands for testing distributional fit already exist.