Original Code

* I was recently trying to figure out how to invert the CDF of a binomial distribution in Stata and realized that there is not built in command. There is an invbinomial command but it is inverting a different value than I expect.

* The description reads:

/* returns the inverse of the cumulative binomial; that is,

it returns the probability of success on one trial

such that the probability of observing floor(k) or

fewer successes in floor(n) trials is p.

*/

* I might be confused but this is not the definition that I think of as an inverse CDF.

* Sure it takes a cumulative probability and returns a parameter but normally I think of a invCDF as taking a cumulative probability and parameters and returning a value. For example invnormal takes the p value and returns the z value that corresponds to that p value. Likewise, invgamma take an "a" and "p" value and returns an x value.

* invbinomial(n,k,p)

clear

set obs 10

* We will set the number of trials to be 10.

gen inv=invbinomial(10,_n-1,.5)

gen pdf=binomialp(10,_n-1,.5)

* The standard CDF can be calculated as follows from the pdf:

gen cdf=sum(pdf)

* Instead the CDF seems to be giving us invCDF(P,k)=theta, which is certaily useful for some purposes but not what I want.

* So, I will attempt to write my own CDFinverse. I don't know how to write functions in Stata so instead I will write a small program that may do what I want.

cap program drop CDFinv

program define CDFinv

* Let's define the arguments of the program.

* N is the total number of trials.

* P is the CDF value that we would like inverted

* p is the probability of one outcome being a success.

* newvar is the name of the new variable to be generated.

args newvar P N p

* The challenging thing about trying to invert a CDF for a count distribution is that the CDF is a probability mass function with discontinuous regions.

* However, the particularly nice thing about the binomial distribution is the the pool of outcomes is well defined. (1,2,3..N)

* Thus we can simply count how many times the "steps" of the CDF fall below the target CDF level.

cap drop `newvar'

qui gen `newvar'=0

label var `newvar' "Inverse CDF of binomial"

forv k=1(1)`N' {

* Add 1 to the variable value every time the CDF(n,k-1,p) is less than the target CDF value P.

qui replace `newvar'=`newvar'+1 if binomial(`N', `k'-1, `p')<(`P')

}

end

* Using the first cdf generated by the summing of the pdf, this command seems to be working pretty well.

CDFinv cdfinv cdf 10 .5

sum cdfinv

* Though I am not sure why it jumps from 1 to 3. I suspect this is the result of rounding error.

* The true test is if I were to plug a uniform distribution into it if it would give me the same distribution as the random binomial distribution generator.

* Let's start with a bigger data set:

clear

set obs 100000

* To establish our benchmark.

gen bin_benchmark = rbinomial(10,.75)

label var bin_benchmark "Random Binomial Generated with rbinomial"

* Now let's generate our uniform

gen P = runiform()

CDFinv bin_gen P 10 .75

label var bin_gen "Binomial Generated with CDFinv command"

sum bin*

* Both the means and standard deviations are very close.

* If the invCDF worked properly then P=CDF(invCDF(P))

gen P2 = binomial(10, bin_gen, .75)

label var P2 "CDF(invCDF(P))"

cor P P2

* Close but not perfect. Let's look at the differences graphically

* We can get a step graph to attempt to see the difference.

sort P

twoway (connected P2 bin_gen, msize(zero) color(red)) ///

(connected P bin_gen, msize(zero)), ///

scheme(sj) title("CDF (N=10,p=.75)")

* Okay, this is does not look not bad.

* To further confirm that things are working properly we might try comparing the two histograms.

hist bin_benchmark, nodraw name(Benchmark, replace)

hist bin_gen, nodraw name(Generated, replace)

graph combine Benchmark Generated, title("Both Methods Produce Nearly Identical Results(N=10,p=.75)")

* I cannot tell the difference between the two distributions. Thus I am finally satisfied.

## No comments:

## Post a Comment