Friday, November 2, 2012

Inverting a Binomial Distribution in Stata - harder than it sounds

Original Code

* I was recently trying to figure out how to invert the CDF of a binomial distribution in Stata and realized that there is not built in command.  There is an invbinomial command but it is inverting a different value than I expect.

* The description reads:
/* returns the inverse of the cumulative binomial; that is,
                         it returns the probability of success on one trial
                         such that the probability of observing floor(k) or
                         fewer successes in floor(n) trials is p.

* I might be confused but this is not the definition that I think of as an inverse CDF.

* Sure it takes a cumulative probability and returns a parameter but normally I think of a invCDF as taking a cumulative probability and parameters and returning a value.  For example invnormal takes the p value and returns the z value that corresponds to that p value.  Likewise, invgamma take an "a" and "p" value and returns an x value.

* invbinomial(n,k,p)

set obs 10
* We will set the number of trials to be 10.

gen inv=invbinomial(10,_n-1,.5)

gen pdf=binomialp(10,_n-1,.5)

* The standard CDF can be calculated as follows from the pdf:
gen cdf=sum(pdf)

* Instead the CDF seems to be giving us invCDF(P,k)=theta, which is certaily useful for some purposes but not what I want.

* So, I will attempt to write my own CDFinverse.  I don't know how to write functions in Stata so instead I will write a small program that may do what I want.

cap program drop CDFinv
program define CDFinv

  * Let's define the arguments of the program.
  * N is the total number of trials.
  * P is the CDF value that we would like inverted
  * p is the probability of one outcome being a success.
  * newvar is the name of the new variable to be generated.
  args newvar P N p

  * The challenging thing about trying to invert a CDF for a count distribution is that the CDF is a probability mass function with discontinuous regions.
  * However, the particularly nice thing about the binomial distribution is the the pool of outcomes is well defined. (1,2,3..N)
  * Thus we can simply count how many times the "steps" of the CDF fall below the target CDF level.
  cap drop `newvar'
  qui gen `newvar'=0
    label var `newvar' "Inverse CDF of binomial"

  forv k=1(1)`N' {
    * Add 1 to the variable value every time the CDF(n,k-1,p) is less than the target CDF value P.
qui replace `newvar'=`newvar'+1 if binomial(`N', `k'-1, `p')<(`P')


* Using the first cdf generated by the summing of the pdf, this command seems to be working pretty well.
CDFinv cdfinv cdf 10 .5
sum cdfinv

* Though I am not sure why it jumps from 1 to 3.  I suspect this is the result of rounding error.

* The true test is if I were to plug a uniform distribution into it if it would give me the same distribution as the random binomial distribution generator.

* Let's start with a bigger data set:
set obs 100000

* To establish our benchmark.
gen bin_benchmark = rbinomial(10,.75)
  label var bin_benchmark "Random Binomial Generated with rbinomial"

* Now let's generate our uniform
gen P = runiform()

CDFinv bin_gen P 10 .75
  label var bin_gen "Binomial Generated with CDFinv command"
sum bin*
* Both the means and standard deviations are very close.

* If the invCDF worked properly then P=CDF(invCDF(P))
gen P2 = binomial(10, bin_gen, .75)
  label var P2 "CDF(invCDF(P))"

cor P P2
* Close but not perfect.  Let's look at the differences graphically

* We can get a step graph to attempt to see the difference.
sort P
twoway (connected P2 bin_gen, msize(zero) color(red)) ///
       (connected P bin_gen, msize(zero)),     ///
  scheme(sj) title("CDF (N=10,p=.75)")

* Okay, this is does not look not bad.

* To further confirm that things are working properly we might try comparing the two histograms.
hist bin_benchmark, nodraw name(Benchmark, replace)
hist bin_gen, nodraw name(Generated, replace)

graph combine Benchmark Generated, title("Both Methods Produce Nearly Identical Results(N=10,p=.75)")
* I cannot tell the difference between the two distributions.  Thus I am finally satisfied.

No comments:

Post a Comment