I recently found myself in need of a function to sample randomly from an arbitrarily defined probability density function. An excellent post by Quantitations shows how to accomplish this using some of Rs fairly sophisticated functional approximation tools such as integrate and uniroot. The only problem with this excellent post was that the machine cost was enormous with samples of 1000 draws taking 5 seconds on my machine with repeated samples of 100,000+ draws (which I was after) clearly being unworkable.

Thus I decided to take my own crack at it. First let us review the basics of drawing random variables from non-uniform distributions. The standard method I think most algorithms use works as follows:

Assumptions

1. You can draw pseudo-random uniform variable u

2. You can integrate the pdf to construct a cdf

$$p = F(x) = \int_{-\infty}^\infty f(x) dx$$

3. You can invert the cdf in order to solve for p

$$G(F(x))=F^{-1}(F(x))=F^{-1}(p)=x$$

The method thus relies upon the somewhat simple method of calculating x by drawing u and plugging into G (the inverse of F).

Now let us assume that we are in a bit of a bind. We can neither integrate f(x) nor invert F(x). The previously mentioned post by Quantitations demonstrates how to do the operation in this case by directly approximating the cdf followed by approximating the inverse. This process is computationally intensive for simple functions and extremely time consuming for complex functions.

In contrast I approximate the cdf by drawing x values and associated probabilities from the pdf along a user specified range. I create a matrix of bins over which the approximate cdf is divided into. After that I apply some function (mean or median) over the range of x which corresponds to each bin providing an x point value for each cdf bin value. The gacdf returns a list of results from this process.

The ricdf function then takes the list of returns and is able to draw values from the approximated inverse cdf. Once the gicdf has completed its operation, ricdf is able to generate variables nearly as fast as that of standard non-uniform random variables.

As a matter of comparison, I define the funciton f as the pdf of the normal (dnorm) in R and draw from it 1000 time. Using rnorm, ricdf [defined in this post], and samplepdf [defined at Quantitations] I graphically see how the three draws compare below.

On the graph, Black type=0, is rnorm. Dark blue type=1, is ricdf. Light blue, type=2 is samplepdf. Redrawing the graph several times, visually I could not tell the difference between the three methods.

Comparing run times of the three methods over ten repetitions rnorm used no seconds, ricdf (with ficdf) used .8 seconds, and samplepdf used 5. Because ricdf and gicdf are separate functions with gicdf setting up the table to draw from, we only need to set it up once per pdf so ricf can be benchmarked separately from gicfd. Comparing 1000 replications of rnorm and ricdf, rnorm took .1 seconds and ricdf took .11 seconds. I did not compare with samplepdf which I expected would take 5200 seconds (87 minutes).

For fun I have generated several other strange 'proportional' pdfs. I say proportional because strictly speaking pdfs are required to integrate to 1. However, gicdf forces the probabilities to sum to 1 across the range making the formal requirement not necessary.

I define a piecewise pdf

And sample from it.

I also define a multimodal pdf

And sample from it.

Find the R code below or the gist on github.

Created by Pretty R at inside-R.org

Thus I decided to take my own crack at it. First let us review the basics of drawing random variables from non-uniform distributions. The standard method I think most algorithms use works as follows:

Assumptions

1. You can draw pseudo-random uniform variable u

2. You can integrate the pdf to construct a cdf

$$p = F(x) = \int_{-\infty}^\infty f(x) dx$$

3. You can invert the cdf in order to solve for p

$$G(F(x))=F^{-1}(F(x))=F^{-1}(p)=x$$

The method thus relies upon the somewhat simple method of calculating x by drawing u and plugging into G (the inverse of F).

Now let us assume that we are in a bit of a bind. We can neither integrate f(x) nor invert F(x). The previously mentioned post by Quantitations demonstrates how to do the operation in this case by directly approximating the cdf followed by approximating the inverse. This process is computationally intensive for simple functions and extremely time consuming for complex functions.

In contrast I approximate the cdf by drawing x values and associated probabilities from the pdf along a user specified range. I create a matrix of bins over which the approximate cdf is divided into. After that I apply some function (mean or median) over the range of x which corresponds to each bin providing an x point value for each cdf bin value. The gacdf returns a list of results from this process.

The ricdf function then takes the list of returns and is able to draw values from the approximated inverse cdf. Once the gicdf has completed its operation, ricdf is able to generate variables nearly as fast as that of standard non-uniform random variables.

As a matter of comparison, I define the funciton f as the pdf of the normal (dnorm) in R and draw from it 1000 time. Using rnorm, ricdf [defined in this post], and samplepdf [defined at Quantitations] I graphically see how the three draws compare below.

Three random sampling procedures for the random normal. |

Comparing run times of the three methods over ten repetitions rnorm used no seconds, ricdf (with ficdf) used .8 seconds, and samplepdf used 5. Because ricdf and gicdf are separate functions with gicdf setting up the table to draw from, we only need to set it up once per pdf so ricf can be benchmarked separately from gicfd. Comparing 1000 replications of rnorm and ricdf, rnorm took .1 seconds and ricdf took .11 seconds. I did not compare with samplepdf which I expected would take 5200 seconds (87 minutes).

For fun I have generated several other strange 'proportional' pdfs. I say proportional because strictly speaking pdfs are required to integrate to 1. However, gicdf forces the probabilities to sum to 1 across the range making the formal requirement not necessary.

I define a piecewise pdf

And sample from it.

I also define a multimodal pdf

And sample from it.

Find the R code below or the gist on github.

gicdf <- function(fun, min=-3.5, max=3.5, bins=1000, nqratio=10, grouping=mean, ...) { # Generate an inverse CDF of an arbitrary function fun <- match.fun(fun) grouping <- match.fun(grouping) # Number of points to draw nq=nqratio*bins # Draw indexes qdraw <- seq(min, max,length.out=nq) # Calculate proportional probability of each draw pdraw <- fun(qdraw,...) # Rescale probability sum to 1, rescale pdraw <- pdraw/sum(pdraw) # Calculate the cumulative probability at each qdraw cpdraw <- cumsum(pdraw) # Caculate the cumulative probability at each bin pbin <- (1:bins)/bins xbin <- NA*(1:bins) for (i in 1:bins) { xbin[i] <- grouping(qdraw[cpdraw<pbin[i]&cpdraw>0], na.rm = TRUE) cpdraw[cpdraw<pbin[i]] <- 2 } (draw.set <- list(digits=floor(log10(bins)), xbin=xbin, pbin=pbin)) } # Draw from acdf ricdf <- function(N, draw.set) { digits <- draw.set$digits pdraws <- ceiling(runif(N)*10^digits)/10^digits draw.set$xbin[match(pdraws,draw.set$pbin)] } # Define an arbitrary pdf f: to compare lets start with normal pdf f <- function(x) dnorm(x) library(ggplot2) # set the number to sample nsamples <- 1000 # Draw from rnorm sample0 <- cbind(draw=rnorm(nsamples), type=0) # Draw inverted cdf distribution information for range between -4 and 4 # using mean approach sample1 <- cbind(draw=ricdf(nsamples, gicdf(f,min=-4,max=4)), type=1) # samplepdf and endsign defined on Quantitations blog # begin Quantitations code endsign <- function(f, sign = 1) { b <- sign while (sign * f(b) < 0) b <- 10 * b return(b) } samplepdf <- function(n, pdf, ..., spdf.lower = -Inf, spdf.upper = Inf) { vpdf <- function(v) sapply(v, pdf, ...) # vectorize cdf <- function(x) integrate(vpdf, spdf.lower, x)$value invcdf <- function(u) { subcdf <- function(t) cdf(t) - u if (spdf.lower == -Inf) spdf.lower <- endsign(subcdf, -1) if (spdf.upper == Inf) spdf.upper <- endsign(subcdf) return(uniroot(subcdf, lower=spdf.lower, upper=spdf.upper)$root) } sapply(runif(n), invcdf) } # end Quantitations code sample2 <- cbind(draw=samplepdf(nsamples, f), type=2) # join the three samples together samples <- as.data.frame(rbind(sample0, sample1, sample2)) ggplot(samples, aes(x=draw, color=type))+ geom_density(aes(group=type, size=-type)) library(rbenchmark) # Let us do some speed comparisons benchmark(rnorm(nsamples), ricdf(nsamples, gicdf(f,min=-4,max=4)), samplepdf(nsamples, f), replications=10) # test replications elapsed relative user.self #2 ricdf(nsamples, gicdf(f, min = -4, max = 4)) 10 7.97 #1 rnorm(nsamples) 10 0.00 #3 samplepdf(nsamples, f) 10 52.39 # This sets the entire process of generating and drawing from ricdf as about 7 times # faster than samplepdf. However, if we seperate the gicdf function from the # ricdf function which only need be run each time a new pdf is input, then we can # considerably increase run speeds. myicdf <- gicdf(f,min=-4,max=4) # Sets up the initial icdf to draw from. benchmark(rnorm(nsamples), ricdf(nsamples, myicdf), replications=1000) # elapsed time for rnorm = .10 seconds and ricdf = .11 # thus drawing from ricdf can be extremely fast # Now let's look at an unusual proportional pdf distribution f <- function(x) { p <- rep(0, length(x)) p[x>0&x<1] <- x[x>0&x<1]^2 p[x>2&x<3] <- 3-x[x>2&x<3]^.5 p[x>4&x<5] <- x[x>4&x<5] p } x <- seq(-1,6,.01) plot(x,f(x), type='l', xlab='Proportional Probability', main='Proportional pdf') # A key thing to note is that the pdf does not need to integrate to 1. # The gicdf function will rescale it to ensure it does. # However, it should all be positive # I define bins as equal to 10,000 when normally they are equal to 1000 myicdf <- gicdf(f,min=-1,max=6, bins=1000) samples <- data.frame(draws=ricdf(100000,myicdf)) ggplot(samples, aes(x=draws))+ geom_histogram(binwidth=.1)+ aes(y = ..density..) # Okay here is one more probability f <- function(x) dnorm(x)*abs(1+x^4) x <- seq(-5,5,.01) plot(x,f(x), type='l', xlab='Proportional Probability', main='Proportional pdf2') myicdf <- gicdf(f,min=-5,max=5, bins=1000) samples <- data.frame(draws=ricdf(100000,myicdf)) ggplot(samples, aes(x=draws))+ geom_histogram(binwidth=.2)+ aes(y = ..density..) # I will leave you to define your own pdfs to draw from. # If you end up using this method, please cite!

## No comments:

## Post a Comment