Wednesday, February 6, 2013

Non-Parametric PDF Fit Test


* This is an idea that I decided to explore before inspecting how others have addressed the problem.

* As noted in my previous post, we cannot use standard reasoning based on independent draws to test model fit.

* The following command simulates random draws from the distribution being tested and measures how closely they fit the exact pdf quantiles.

* It then uses that information to test whether we can reject the null hypothesis that the observed distribution is the same as the null distribution.
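
* Before defining the full command, here is a minimal sketch of the core
* statistic for a single simulated normal sample (an illustration only;
* the sample size of 500 is arbitrary).  The program below wraps this
* logic in a simulation loop.
clear
set obs 500
gen x = rnormal()
sort x
gen p = (_n-.5)/_N    // plotting positions
egen zx = std(x)      // standardized sample quantiles
gen z = invnormal(p)  // theoretical normal quantiles
gen d = (z-zx)^2      // squared Q-Q discrepancies
reg d                 // the t on _cons (mean of d over its se) is the fit statistic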

cap: program drop dist_check
program define dist_check, rclass

  * The syntax command parses the input into macros used in later calculations.
  syntax  varlist, Tgenerate(numlist >= 100) [Dist(string)]
 
 
  if "`dist'"=="" local dist="normal"
 
  if "`dist'"=="normal" di "Testing if `varlist' is normally distributed"
  if "`dist'"=="uniform" di "Testing if `varlist' is uniformly distributed"
  if "`dist'"=="poisson" di "Testing if `varlist' has a poisson distributed"

  di "Simulate `tgenerate' draws"
 
  * Construct a t-vector in Mata to store estimation results.
  mata: tdist = J(`tgenerate', 1, .)
 
  *******
  * This will randomly draw from the distribution being tested a number of times equal to tgenerate.
  *******
 
  preserve
  qui: drop if missing(`varlist')
 
  forv i=1(1)`tgenerate' {
 
    tempvar x zx z p d

    if "`dist'"=="normal" qui: gen `x' = rnormal()

if "`dist'"=="poisson" {
      qui: sum `varlist'
      qui: gen `x' = rpoisson(r(mean))
    }

if "`dist'"=="uniform" qui: gen `x' = runiform()

    sort `x'
    qui: gen `p' = (_n-.5)/_N

if "`dist'"=="normal" {
      qui: egen `zx' = std(`x')
      qui: gen `z' = invnormal(`p')
      qui: gen `d' = (`z'-`zx')^2
   }

if "`dist'"=="poisson" {
      qui: sum `x'
      qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
 * The invpoisson distribution in Stata is misspecified.  1-p is neccessary.
      qui: gen `d' = (`z'-`x')^2
    }

if "`dist'"=="uniform" {
      qui: gen `z' = _n/_N
      qui: gen `d' = (`z'-`x')^2
    }
   
    * Regressing `d' on a constant alone recovers mean(`d') and its
    * standard error; the resulting t statistic is our fit measure.
    qui: reg `d'
    local t = _b[_cons]/_se[_cons]
    mata: tdist[`i', 1]=`t'

    drop `x' `z' `p' `d'
    cap drop `zx'
  }
 
  * From the above loop we should have a vector of t values.
  * We can use that vector to construct confidence intervals, taking the t values observed when the null is true as cutoff points.
 
  mata: tsorted = sort(tdist, 1)

  * One tailed test
  local c90 = floor(`tgenerate'*.90)+1
  mata: st_local("t90", strofreal(tsorted[`c90']))
 
  local c95 = floor(`tgenerate'*.95)+1
  mata: st_local("t95", strofreal(tsorted[`c95']))

  local c99 = floor(`tgenerate'*.99)+1
  mata: st_local("t99", strofreal(tsorted[`c99']))
 
  * Two Tailed
    * 90% CI
  local c90U = floor((`tgenerate'+.5)*.95)+1
  mata: st_local("t90U", strofreal(tsorted[`c90U']))
 
  local c90L = ceil((`tgenerate'+.5)*.05)-1
  mata: st_local("t90L", strofreal(tsorted[`c90L']))
 
    * 95% CI
  local c95U = floor((`tgenerate'+.5)*.975)+1
  mata: st_local("t95U", strofreal(tsorted[`c95U']))

  local c95L = ceil((`tgenerate'+.5)*.025)-1
  mata: st_local("t95L", strofreal(tsorted[`c95L']))

    * 99% CI (tail cutoffs .995/.005 so that each tail holds .5%)
  local c99U = floor((`tgenerate'+.5)*.995)+1
  mata: st_local("t99U", strofreal(tsorted[`c99U']))

  local c99L = max(ceil((`tgenerate'+.5)*.005)-1, 1)
  mata: st_local("t99L", strofreal(tsorted[`c99L']))

  ** Now we do the estimation
  tempvar x zx z p d

  qui: gen `x' = `varlist'

  sort `x'
  qui: gen `p' = (_n-.5)/_N

  * We transform the data in different ways depending upon what distribution we are assuming.
if "`dist'"=="normal" {
      qui: egen `zx' = std(`x')
      qui: gen `z' = invnormal(`p')
      qui: gen `d' = (`z'-`zx')^2
   }

if "`dist'"=="poisson" {
      qui: sum `x'
      qui: gen `z' = round(invpoisson(r(mean), 1-`p'))
      qui: gen `d' = (`z'-`x')^2
    }

if "`dist'"=="uniform" {
      qui: sum `x'
      qui: gen `z' = (`x'-r(min))/(r(max)-r(min))
      qui: gen `d' = (`z'-`x')^2
    }

  * This is the regression of interest: the t statistic on the constant
  * is the mean squared discrepancy scaled by its standard error.
  qui: reg `d'
  local t = _b[_cons]/_se[_cons]

  * Now we compare our t with the t values obtained when the draws truly came from the null distribution.
di _newline "Estimated t:  `: di %9.3f `t''" _newline

di "One-way Analysis"
di "  CI   (1%) :    0.000   to `: di %9.3f `t99''"
di "  CI   (5%) :    0.000   to `: di %9.3f `t95''"
di "  CI   (10%):    0.000   to `: di %9.3f `t90''"

    * The empirical one-sided p-value is the share of simulated t values
    * above the observed t; e.g., if 93 of 100 simulated values fall at or
    * below t, then p1 = .07 and p2 = .14.
    mata: st_local("position", strofreal(sum(tsorted :<= `t')))
    local p1 = 1-`position'/`tgenerate'
    local p2 = min(`position'/`tgenerate', 1-`position'/`tgenerate')*2

di "One-sided  p:`: di %9.3f `p1''   (`position' out of `tgenerate')"

di _newline "Two-way Analysis"
di "  CI   (1%): `: di %9.3f `t99L''    to `: di %9.3f `t99U''"
di "  CI   (5%): `: di %9.3f `t95L''    to `: di %9.3f `t95U''"
di "  CI  (10%): `: di %9.3f `t90L''    to `: di %9.3f `t90U''"
di "Two-sided p: `: di %9.3f `p2''"

    return scalar p1 = `p1'
    return scalar p2 = `p2'

  restore
end

* Let's see how this works on a sample draw.
clear
set obs 1000
gen x = rnormal()+1
dist_check x, t(100) dist(poisson)
* Interestingly, some data distributions seem to fit the pdfs of other distributions "better" (with a smaller discrepancy) than draws from those distributions themselves do.
* Thus the estimated t can fall below the bulk of the simulated t distribution.
* For this reason I have included a two-tailed confidence interval.
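
* For comparison, here is a run where the null is true (a quick sketch;
* the Poisson mean of 5 and the sample size are arbitrary choices).
* Both p-values should typically be large in this case.
clear
set obs 1000
gen x = rpoisson(5)
dist_check x, t(100) dist(poisson)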

* The following small program will be used to construct a Monte Carlo simulation to see how well the dist_check command is working.
cap program drop checker
program define checker
  syntax [anything], obs(numlist > 0) rdraw(string) reps(numlist > 0) dist(string)
  clear
  set obs `obs'
  gen x = `rdraw'
  dist_check x, t(`reps') dist(`dist')
end

* A quick check that checker runs and generates data
checker , obs(1000) rdraw(runiform()*30) reps(200) dist(normal)

* Now let's see it in action
simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(50) rdraw(rnormal()*30) reps(100) dist(normal)
* Because the data in this simulation are drawn from the null distribution, the test should be correctly sized almost by definition; this run serves as a reference.
 
* We should reject about 10% of the time at the 10% level for both the one-tailed and two-tailed tests.
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum
* mean r1_10 and mean r2_10 are the rejection rates for the one-tailed and two-tailed tests, respectively.
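
* We can tabulate rejection rates at other conventional levels in the same
* way as r1_10 and r2_10 above:
gen r1_05 = p1 <= .05
gen r2_05 = p2 <= .05
gen r1_01 = p1 <= .01
gen r2_01 = p2 <= .01
sum r1_* r2_*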

* Because the true distribution is the null, we reject the null only about 10% of the time at the 10% level.

* This looks great.

* Now let's see about the test's power to reject the null when the true distribution is not the null.

checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
  * This distribution is clearly not uniform
hist x

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(1000) rdraw(rnormal()) reps(100) dist(uniform)
* Because the data here are not drawn from the null, the rejection rates below measure the test's power.
 
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum

checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
  * This distribution is almost uniform
hist x

simulate p1 = r(p1) p2 = r(p2) , reps(100): ///
  checker , obs(100) rdraw(rnormal()+runiform()*50) reps(100) dist(uniform)
* Because the data here are close to but not exactly uniform, the rejection rates below measure the test's power against a nearby alternative; we should expect them to be lower than in the previous simulation.
 
gen r1_10 = p1<= .1
gen r2_10 = p2<= .1

sum

* I think there might still be some slight errors. I hope this post is useful, though I would not recommend using this particular command in practice: more effective and better-established commands for testing distributional fit already exist.
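
* For reference, a few of Stata's built-in alternatives (shown commented
* out since no variable x is in memory at this point; they assume the
* data to test are in a variable x):
* sktest x                                // skewness/kurtosis normality test
* swilk x                                 // Shapiro-Wilk normality test
* qui sum x
* ksmirnov x = normal((x-r(mean))/r(sd))  // one-sample Kolmogorov-Smirnov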
