Original Code
* I was recently trying to figure out how to invert the CDF of a binomial distribution in Stata and realized that there is not built in command. There is an invbinomial command but it is inverting a different value than I expect.
* The description reads:
/* returns the inverse of the cumulative binomial; that is,
it returns the probability of success on one trial
such that the probability of observing floor(k) or
fewer successes in floor(n) trials is p.
*/
* I might be confused but this is not the definition that I think of as an inverse CDF.
* Sure it takes a cumulative probability and returns a parameter but normally I think of a invCDF as taking a cumulative probability and parameters and returning a value. For example invnormal takes the p value and returns the z value that corresponds to that p value. Likewise, invgamma take an "a" and "p" value and returns an x value.
* invbinomial(n,k,p)
clear
set obs 10
* We will set the number of trials to be 10.
gen inv=invbinomial(10,_n-1,.5)
gen pdf=binomialp(10,_n-1,.5)
* The standard CDF can be calculated as follows from the pdf:
gen cdf=sum(pdf)
* Instead the CDF seems to be giving us invCDF(P,k)=theta, which is certaily useful for some purposes but not what I want.
* So, I will attempt to write my own CDFinverse. I don't know how to write functions in Stata so instead I will write a small program that may do what I want.
cap program drop CDFinv
program define CDFinv
* Let's define the arguments of the program.
* N is the total number of trials.
* P is the CDF value that we would like inverted
* p is the probability of one outcome being a success.
* newvar is the name of the new variable to be generated.
args newvar P N p
* The challenging thing about trying to invert a CDF for a count distribution is that the CDF is a probability mass function with discontinuous regions.
* However, the particularly nice thing about the binomial distribution is the the pool of outcomes is well defined. (1,2,3..N)
* Thus we can simply count how many times the "steps" of the CDF fall below the target CDF level.
cap drop `newvar'
qui gen `newvar'=0
label var `newvar' "Inverse CDF of binomial"
forv k=1(1)`N' {
* Add 1 to the variable value every time the CDF(n,k-1,p) is less than the target CDF value P.
qui replace `newvar'=`newvar'+1 if binomial(`N', `k'-1, `p')<(`P')
}
end
* Using the first cdf generated by the summing of the pdf, this command seems to be working pretty well.
CDFinv cdfinv cdf 10 .5
sum cdfinv
* Though I am not sure why it jumps from 1 to 3. I suspect this is the result of rounding error.
* The true test is if I were to plug a uniform distribution into it if it would give me the same distribution as the random binomial distribution generator.
* Let's start with a bigger data set:
clear
set obs 100000
* To establish our benchmark.
gen bin_benchmark = rbinomial(10,.75)
label var bin_benchmark "Random Binomial Generated with rbinomial"
* Now let's generate our uniform
gen P = runiform()
CDFinv bin_gen P 10 .75
label var bin_gen "Binomial Generated with CDFinv command"
sum bin*
* Both the means and standard deviations are very close.
* If the invCDF worked properly then P=CDF(invCDF(P))
gen P2 = binomial(10, bin_gen, .75)
label var P2 "CDF(invCDF(P))"
cor P P2
* Close but not perfect. Let's look at the differences graphically
* We can get a step graph to attempt to see the difference.
sort P
twoway (connected P2 bin_gen, msize(zero) color(red)) ///
(connected P bin_gen, msize(zero)), ///
scheme(sj) title("CDF (N=10,p=.75)")
* Okay, this is does not look not bad.
* To further confirm that things are working properly we might try comparing the two histograms.
hist bin_benchmark, nodraw name(Benchmark, replace)
hist bin_gen, nodraw name(Generated, replace)
graph combine Benchmark Generated, title("Both Methods Produce Nearly Identical Results(N=10,p=.75)")
* I cannot tell the difference between the two distributions. Thus I am finally satisfied.
No comments:
Post a Comment