Wednesday, February 6, 2013

Non-Parametric PDF Fit Test


* This is an idea that I decided to explore before inspecting how others have addressed the problem.

* As noted in my previous post, we cannot use standard reasoning based on independent draws to test model fit.

* The following command simulates random draws from the distribution being tested and measures how closely they fit the exact pdf quantiles.

* It then uses that information to test whether we can reject the null hypothesis that the observed distribution is the same as the null distribution.
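
* Before defining the full command, here is a minimal sketch of the core
* statistic for a single simulated normal sample (an illustration only;
* the sample size of 500 is arbitrary).  The program below wraps this
* logic in a simulation loop.
clear
set obs 500
gen x = rnormal()
sort x
gen p = (_n-.5)/_N    // plotting positions
egen zx = std(x)      // standardized sample quantiles
gen z = invnormal(p)  // theoretical normal quantiles
gen d = (z-zx)^2      // squared Q-Q discrepancies
reg d                 // the t on _cons (mean of d over its se) is the fit statistic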

cap: program drop dist_check
program define dist_check, rclass

  * The syntax command parses the input into macros used in later calculations.
  syntax  varlist, Tgenerate(numlist >= 100) [Dist(string)]
 
 
  if "`dist'"=="" local dist="normal"
 
  if "`dist'"=="normal" di "Testing if `varlist' is normally distributed"
  if "`dist'"=="uniform" di "Testing if `varlist' is uniformly distributed"
  if "`dist'"=="poisson" di "Testing if `varlist' has a poisson distributed"

  di "Simulate `tgenerate' draws"
 
  * Construct a t-vector in Mata to store estimation results.
  mata: tdist = J(`tgenerate', 1, .)
 
  *******
  * This will randomly draw from the distribution being tested a number of times equal to tgenerate.
  *******
 
  preserve
  qui: drop if missing(`varlist')
 
  forv i=1(1)`tgenerate' {
 
    tempvar x zx z p d

    if "`dist'"=="normal" qui: gen `x' = rnormal()

if "`dist'"=="poisson" {
      qui: sum `varlist'
      qui: gen `x' = rpoisson(r(mean))
    }

if "`dist'"=="uniform" qui: gen `x' = runiform()

    sort `x'
    qui: gen `p' = (_n-.5)/_N

if "`dist'"=="normal" {
      qui: egen `zx' = std(`x')
      qui: gen `z' = invnormal(`p')
      qui: gen `d' = (`z'-`zx')^2
   }

if "`dist'"=="poisson" {
      qui: sum `x'
      qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
 * The invpoisson distribution in Stata is misspecified.  1-p is neccessary.
      qui: gen `d' = (`z'-`x')^2
    }

if "`dist'"=="uniform" {
      qui: gen `z' = _n/_N
      qui: gen `d' = (`z'-`x')^2
    }
   
    * Regressing `d' on a constant alone recovers mean(`d') and its
    * standard error; the resulting t statistic is our fit measure.
    qui: reg `d'
    local t = _b[_cons]/_se[_cons]
    mata: tdist[`i', 1]=`t'

    drop `x' `z' `p' `d'
    cap drop `zx'
  }
 
  * From the above loop we should have a vector of t values.
  * We can use that vector to construct confidence intervals, taking the t values observed when the null is true as cutoff points.
 
  mata: tsorted = sort(tdist, 1)

  * One tailed test
  local c90 = floor(`tgenerate'*.90)+1
  mata: st_local("t90", strofreal(tsorted[`c90']))
 
  local c95 = floor(`tgenerate'*.95)+1
  mata: st_local("t95", strofreal(tsorted[`c95']))

  local c99 = floor(`tgenerate'*.99)+1
  mata: st_local("t99", strofreal(tsorted[`c99']))
 
  * Two Tailed
    * 90% CI
  local c90U = floor((`tgenerate'+.5)*.95)+1
  mata: st_local("t90U", strofreal(tsorted[`c90U']))
 
  local c90L = ceil((`tgenerate'+.5)*.05)-1
  mata: st_local("t90L", strofreal(tsorted[`c90L']))
 
    * 95% CI
  local c95U = floor((`tgenerate'+.5)*.975)+1
  mata: st_local("t95U", strofreal(tsorted[`c95U']))

  local c95L = ceil((`tgenerate'+.5)*.025)-1
  mata: st_local("t95L", strofreal(tsorted[`c95L']))

    * 99% CI (tail cutoffs .995/.005 so that each tail holds .5%)
  local c99U = floor((`tgenerate'+.5)*.995)+1
  mata: st_local("t99U", strofreal(tsorted[`c99U']))

  local c99L = max(ceil((`tgenerate'+.5)*.005)-1, 1)
  mata: st_local("t99L", strofreal(tsorted[`c99L']))

  ** Now we do the estimation
  tempvar x zx z p d

  qui: gen `x' = `varlist'

  sort `x'
  qui: gen `p' = (_n-.5)/_N

  * We transform the data in different ways depending upon what distribution we are assuming.
if "`dist'"=="normal" {
      qui: egen `zx' = std(`x')
      qui: gen `z' = invnormal(`p')
      qui: gen `d' = (`z'-`zx')^2
   }

if "`dist'"=="poisson" {
      qui: sum `x'
      qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
      qui: gen `d' = (`z'-`x')^2
    }

if "`dist'"=="uniform" {
      qui: sum `x'
      qui: gen `z' = (`x'-r(min))/(r(max)-r(min))
      qui: gen `d' = (`z'-`x')^2
    }

  * This is the regression of interest: the t statistic on the constant
  * is the mean squared discrepancy scaled by its standard error.
  qui: reg `d'
  local t = _b[_cons]/_se[_cons]

  * Now we compare our t with the t values obtained when the draws truly came from the null distribution.
di _newline "Estimated t:  `: di %9.3f `t''" _newline

di "One-way Analysis"
di "  CI   (1%) :    0.000   to `: di %9.3f `t99''"
di "  CI   (5%) :    0.000   to `: di %9.3f `t95''"
di "  CI   (10%):    0.000   to `: di %9.3f `t90''"

    * The empirical one-sided p-value is the share of simulated t values
    * above the observed t; e.g., if 93 of 100 simulated values fall at or
    * below t, then p1 = .07 and p2 = .14.
    mata: st_local("position", strofreal(sum(tsorted :<= `t')))
    local p1 = 1-`position'/`tgenerate'
    local p2 = min(`position'/`tgenerate', 1-`position'/`tgenerate')*2

di "One-sided  p:`: di %9.3f `p1''   (`position' out of `tgenerate')"

di _newline "Two-way Analysis"
di "  CI   (1%): `: di %9.3f `t99L''    to `: di %9.3f `t99U''"
di "  CI   (5%): `: di %9.3f `t95L''    to `: di %9.3f `t95U''"
di "  CI  (10%): `: di %9.3f `t90L''    to `: di %9.3f `t90U''"
di "Two-sided p: `: di %9.3f `p2''"

    return scalar p1 = `p1'
    return scalar p2 = `p2'

  restore
end

* Let's see how this works on a sample draw.
clear
set obs 1000
gen x = rnormal()+1
dist_check x, t(100) dist(poisson)
* Interestingly, some data distributions seem to fit the pdfs of other distributions "better" (with a smaller discrepancy) than draws from those distributions themselves do.
* Thus the estimated t can fall below the bulk of the simulated t distribution.
* For this reason I have included a two-tailed confidence interval.
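
* For comparison, here is a run where the null is true (a quick sketch;
* the Poisson mean of 5 and the sample size are arbitrary choices).
* Both p-values should typically be large in this case.
clear
set obs 1000
gen x = rpoisson(5)
dist_check x, t(100) dist(poisson)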

* The following small program will be used to construct a Monte Carlo simulation to see how well the dist_check command is working.
cap program drop checker
program define checker
  syntax [anything], obs(numlist > 0) rdraw(string) reps(numlist > 0) dist(string)
  clear
  set obs `obs'
  gen x = `rdraw'
  dist_check x, t(`reps') dist(`dist')
end

* A quick check that checker runs and generates data
checker , obs(1000) rdraw(runiform()*30) reps(200) dist(normal)

* Now let's see it in action
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(50) rdraw(rnormal()*30) reps(100) dist(normal)
* Because the data in this simulation are drawn from the null distribution, the test should be correctly sized almost by definition; this run serves as a reference.
 
* We should reject about 10% of the time at the 10% level for both the one-tailed and two-tailed tests.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum
* mean r1_10 and mean r2_10 are the rejection rates for the one-tailed and two-tailed tests, respectively.
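
* We can tabulate rejection rates at other conventional levels in the same
* way as r1_10 and r2_10 above:
gen r1_05 = p1 <= .05
gen r2_05 = p2 <= .05
gen r1_01 = p1 <= .01
gen r2_01 = p2 <= .01
sum r1_* r2_*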

* Because the true distribution is the null, we reject the null only about 10% of the time at the 10% level.

* This looks great.

* Now let's see about the test's power to reject the null when the true distribution is not the null.

checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
  * This distribution is clearly not uniform
hist x

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* Because the data here are not drawn from the null, the rejection rates below measure the test's power.
 
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum

checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
  * This distribution is almost uniform
hist x

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* Because the data here are close to but not exactly uniform, the rejection rates below measure the test's power against a nearby alternative; we should expect them to be lower than in the previous simulation.
 
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum

* I think there might still be some slight errors. I hope this post is useful, though I would not recommend using this particular command in practice: more effective and better-established commands for testing distributional fit already exist.
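
* For reference, a few of Stata's built-in alternatives (shown commented
* out since no variable x is in memory at this point; they assume the
* data to test are in a variable x):
* sktest x                                // skewness/kurtosis normality test
* swilk x                                 // Shapiro-Wilk normality test
* qui sum x
* ksmirnov x = normal((x-r(mean))/r(sd))  // one-sample Kolmogorov-Smirnov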
