Wednesday, February 6, 2013
Non-Parametric PDF Fit Test
* This is an idea that I decided to explore before inspecting how others have addressed the problem.
* As noted in my previous post, we cannot use standard independence-based reasoning about draws to test model fit.
* The following command simulates repeated draws from the distribution being tested and measures how closely each simulated draw matches that distribution's exact quantiles.
* It then uses that information to test whether we can reject the null that the observed data follow the hypothesized distribution.
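* In outline, the procedure implemented below is:
*  1. Repeatedly draw samples from the null distribution; for each draw compute
*     t = mean(d)/se(d), where d is the squared gap between the sorted draw and
*     the corresponding theoretical quantiles.
*  2. Compute the same t statistic for the observed data.
*  3. Reject if the observed t is extreme relative to the simulated null ts.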
cap: program drop dist_check
program define dist_check, rclass
* The syntax command parses the input into useful macros used for later calculations.
syntax varlist, Tgenerate(numlist >= 100) [Dist(string)]
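* tgenerate (abbreviable to t()) sets the number of simulated reference draws, restricted here to at least 100; dist() is optional and defaults to normal.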
if "`dist'"=="" local dist="normal"
if "`dist'"=="normal" di "Testing if `varlist' is normally distributed"
if "`dist'"=="uniform" di "Testing if `varlist' is uniformly distributed"
if "`dist'"=="poisson" di "Testing if `varlist' has a poisson distributed"
di "Simulate `tgenerate' draws"
* Construct a t-vector in Mata to store estimation results.
mata: tdist = J(`tgenerate', 1, .)
*******
* This will randomly draw from the distribution being tested a number of times equal to tgenerate.
*******
preserve
qui: drop if missing(`varlist')
forv i=1(1)`tgenerate' {
tempvar x zx z p d
if "`dist'"=="normal" qui: gen `x' = rnormal()
if "`dist'"=="poisson" {
qui: sum `varlist'
qui: gen `x' = rpoisson(r(mean))
}
if "`dist'"=="uniform" qui: gen `x' = runiform()
sort `x'
qui: gen `p' = (_n-.5)/_N
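* (_n-.5)/_N gives continuity-corrected plotting positions: the empirical CDF values at which the theoretical quantiles will be evaluated.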
if "`dist'"=="normal" {
qui: egen `zx' = std(`x')
qui: gen `z' = invnormal(`p')
qui: gen `d' = (`z'-`zx')^2
}
if "`dist'"=="poisson" {
qui: sum `x'
qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
* Stata's invpoisson() is parameterized in reverse of the usual quantile convention; using 1-p is necessary.
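* (poisson(m,k) is the CDF P(X<=k), which falls as m rises, so invpoisson(k,q) falls as q rises; passing 1-p makes the implied quantile increase with p, as a quantile function should.)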
qui: gen `d' = (`z'-`x')^2
}
if "`dist'"=="uniform" {
qui: gen `z' = _n/_N
qui: gen `d' = (`z'-`x')^2
}
qui: reg `d'
local t = _b[_cons]/_se[_cons]
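* Regressing d on nothing but a constant recovers mean(d) and its standard error, so t is a one-sample t statistic for the average squared quantile discrepancy.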
mata: tdist[`i', 1]=`t'
drop `x' `z' `p' `d'
cap drop `zx'
}
* From the above loop we should have a vector of t values.
* We can use that vector to construct confidence intervals, taking the t values observed when the null is true as the cutoff points.
mata: tsorted = sort(tdist, 1)
* One tailed test
local c90 = floor(`tgenerate'*.90)+1
mata: st_local("t90", strofreal(tsorted[`c90']))
local c95 = floor(`tgenerate'*.95)+1
mata: st_local("t95", strofreal(tsorted[`c95']))
local c99 = floor(`tgenerate'*.99)+1
mata: st_local("t99", strofreal(tsorted[`c99']))
* Two Tailed
* 90% CI
local c90U = floor((`tgenerate'+.5)*.95)+1
mata: st_local("t90U", strofreal(tsorted[`c90U']))
local c90L = ceil((`tgenerate'+.5)*.05)-1
mata: st_local("t90L", strofreal(tsorted[`c90L']))
* 95% CI
local c95U = floor((`tgenerate'+.5)*.975)+1
mata: st_local("t95U", strofreal(tsorted[`c95U']))
local c95L = ceil((`tgenerate'+.5)*.025)-1
mata: st_local("t95L", strofreal(tsorted[`c95L']))
* 99% CI (the .995 and .005 quantiles give a two-tailed 99% interval)
local c99U = floor((`tgenerate'+.5)*.995)+1
mata: st_local("t99U", strofreal(tsorted[`c99U']))
local c99L = max(1, ceil((`tgenerate'+.5)*.005)-1)
mata: st_local("t99L", strofreal(tsorted[`c99L']))
** Now we do the estimation
tempvar x zx z p d
qui: gen `x' = `varlist'
sort `x'
qui: gen `p' = (_n-.5)/_N
* We transform the data in different ways depending upon what distribution we are assuming.
if "`dist'"=="normal" {
qui: egen `zx' = std(`x')
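* Standardizing removes location and scale, so only the shape of the data is compared against the standard normal quantiles.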
qui: gen `z' = invnormal(`p')
qui: gen `d' = (`z'-`zx')^2
}
if "`dist'"=="poisson" {
qui: sum `x'
qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
qui: gen `d' = (`z'-`x')^2
}
if "`dist'"=="uniform" {
qui: sum `x'
qui: gen `z' = (`x'-r(min))/(r(max)-r(min))
qui: gen `d' = (`z'-`x')^2
}
* This is the regression of interest.
qui: reg `d'
local t = _b[_cons]/_se[_cons]
* Now we compare our t with the ts that were drawn when random variables were drawn from our distribution.
di _newline "Estimated t: `: di %9.3f `t''" _newline
di "One-way Analysis"
di " CI (1%) : 0.000 to `: di %9.3f `t99''"
di " CI (5%) : 0.000 to `: di %9.3f `t95''"
di " CI (10%): 0.000 to `: di %9.3f `t90''"
mata: psearch = abs(tsorted:-`t')
mata: a = 1::`tgenerate'
mata: b = psearch :== min(psearch)
mata: st_local("position", strofreal(sum(b:*a)))
local p1 = 1-`position'/`tgenerate'
local p2 = min(`position'/`tgenerate',1-`position'/`tgenerate')*2
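* position is the rank of the simulated t closest to the observed t; p1 is the share of null draws above that rank, and p2 doubles the smaller tail.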
di "One-sided p:`: di %9.3f `p1'' (`position' out of `tgenerate')"
di _newline "Two-way Analysis"
di " CI (1%): `: di %9.3f `t99L'' to `: di %9.3f `t99U''"
di " CI (5%): `: di %9.3f `t95L'' to `: di %9.3f `t95U''"
di " CI (10%): `: di %9.3f `t90L'' to `: di %9.3f `t90U''"
di "Two-sided p: `: di %9.3f `p2''"
return scalar p1 = `p1'
return scalar p2 = `p2'
restore
end
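* Because dist_check is rclass, the p-values are returned in r(p1) and r(p2), which is what allows simulate to collect them in the Monte Carlo runs below.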
* Let's see how this works on a sample draw.
clear
set obs 1000
gen x = rnormal()+1
dist_check x, t(100) dist(poisson)
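* For comparison, we can also test x against its true distribution, the normal:
dist_check x, t(100) dist(normal)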
* Interestingly, some data distributions seem to fit the PDFs of other distributions "better" than draws from those distributions themselves.
* Thus the estimated t can be smaller than the typical simulated t.
* For this reason I have included a two sided confidence interval.
* The following small program will be used to construct a Monte Carlo simulation to see how well the dist_check command is working.
cap program drop checker
program define checker
syntax [anything], obs(numlist > 0) rdraw(string) reps(numlist > 0) dist(string)
clear
set obs `obs'
gen x = `rdraw'
dist_check x, t(`reps') dist(`dist')
end
* checker makes it easy to generate data and run the test in one step.
checker , obs(1000) rdraw(runiform()*30) reps(200) dist(normal)
* Now let's see it in action
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(50) rdraw(rnormal()*30) reps(100) dist(normal)
* Because the simulated data are drawn from the null distribution, the test should be properly sized almost by construction; this run serves as a reference.
* We should reject at the 10% level about 10% of the time for both the one-sided and two-sided tests.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1
sum
* mean r1_10 and mean r2_10 are rejection rates for the one-tailed and two-tailed test respectively.
* Because the data truly follow the null distribution, we reject the null only about 10% of the time at the 10% level.
* This looks great.
* Now let's see about the test's power to reject the null when the true distribution is not the null.
checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* This distribution is clearly not uniform, so the test should reject readily.
hist x
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* Here the data are not drawn from the null, so the rejection rates below measure the test's power.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1
sum
checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* This distribution is almost uniform
hist x
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* Again the data are not drawn from the null; because this alternative is close to uniform, the rejection rates below measure power against a nearby alternative.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1
sum
* I think there might still be some slight errors in this implementation.
* I hope this post is useful, though I would not recommend using this particular command: more effective and better-established commands for testing distributional fit already exist.