Been a little light on the postings recently. Here is why:

# Econometrics by Simulation

Simulations in *Stata* and #R# - github.com/EconometricsBySimulation/

## Thursday, August 21, 2014

## Tuesday, August 5, 2014

### Stata: Generate a Spatial Moving Average

Often times we may be interested in generating a spatial moving average of a characteristic X. We

may use this moving average to help control for heterogeneity in the population which may be related to the spatial distribution of observations. In order to do that we need to have a method of generating a spatial mean.

I code this manually because I do not have experience with spatial data in Stata and do not know what the built in command is (assuming there is one). If you are just looking for the spatial mean then you may favor the built in command. However, this method is flexible and easily modifiable if for instance you would like to use measures beyond the Euclidean 2D distance formula and would instead prefer the 3D formula or nD formula really. Likewise moving average statistic might easily be replaced by moving variance or any other statistic that could be generated via the egen command. Thus this exercise might be useful to examine even if redundant.

global Nobs = 1000

clear

set obs $Nobs

* Generate 2D coordinates

gen latt = runiform()*100

gen longg = runiform()*100

* Generate the variable of interest. The variable will

* have a random component and a spatially dependent

* component.

gen X = (latt+longg)/100+rnormal()

two (scatter latt X) (scatter longg X)

* We can see that though there is a general trend to larger values as longitude or latitude increase it is hard to identify any strong pattern.

* Now let's calculate the moving average of X for each

* observation. (There is probably a command for this

* which I do not know).

global meanrange=30

gen Xave = .

gen dist = .

forv i=1/$Nobs {

* Calculate the distance of all points from obs i

replace dist = ((latt-latt[`i'])^2+(longg-longg[`i'])^2)^.5

* Calculate the mean of X if distance is within the range of interest

egen tempx = mean(X) if dist<$meanrange

replace Xave = tempx if _n==`i'

drop tempx

}

drop dist

two (scatter latt Xave) (scatter longg Xave)

* Now, looking at the moving average we can easily visually identify the effect of location on the expected value of X.

may use this moving average to help control for heterogeneity in the population which may be related to the spatial distribution of observations. In order to do that we need to have a method of generating a spatial mean.

I code this manually because I do not have experience with spatial data in Stata and do not know what the built in command is (assuming there is one). If you are just looking for the spatial mean then you may favor the built in command. However, this method is flexible and easily modifiable if for instance you would like to use measures beyond the Euclidean 2D distance formula and would instead prefer the 3D formula or nD formula really. Likewise moving average statistic might easily be replaced by moving variance or any other statistic that could be generated via the egen command. Thus this exercise might be useful to examine even if redundant.

global Nobs = 1000

clear

set obs $Nobs

* Generate 2D coordinates

gen latt = runiform()*100

gen longg = runiform()*100

* Generate the variable of interest. The variable will

* have a random component and a spatially dependent

* component.

gen X = (latt+longg)/100+rnormal()

two (scatter latt X) (scatter longg X)

* We can see that though there is a general trend to larger values as longitude or latitude increase it is hard to identify any strong pattern.

* Now let's calculate the moving average of X for each

* observation. (There is probably a command for this

* which I do not know).

global meanrange=30

gen Xave = .

gen dist = .

forv i=1/$Nobs {

* Calculate the distance of all points from obs i

replace dist = ((latt-latt[`i'])^2+(longg-longg[`i'])^2)^.5

* Calculate the mean of X if distance is within the range of interest

egen tempx = mean(X) if dist<$meanrange

replace Xave = tempx if _n==`i'

drop tempx

}

drop dist

two (scatter latt Xave) (scatter longg Xave)

* Now, looking at the moving average we can easily visually identify the effect of location on the expected value of X.

## Wednesday, July 30, 2014

### More Readable Code with Pipes in R

Several blog posts have made mention of the 'magrittr' package which allows functional arguments to be passed to functions in a pipes style fashion (David Smith ).

This stylistic option has several advantages:

1. Reduced requirements of nested parenthesizes

2. Order of functional operations now read from left to right

3. Organizational style of the code may be improved

The library uses a new operator %>% which basically tells R to take the value of that which is to the left and pass it to the right as an argument. Let us see this in action with some text functions.

Created by Pretty R at inside-R.org

This stylistic option has several advantages:

1. Reduced requirements of nested parenthesizes

2. Order of functional operations now read from left to right

3. Organizational style of the code may be improved

The library uses a new operator %>% which basically tells R to take the value of that which is to the left and pass it to the right as an argument. Let us see this in action with some text functions.

require('magrittr') # Let's play with some strings str1 = "A scratch? Your arm's off." str2 = "I've had worse." str1 %>% substr(3,9) #[1]Evaluates to "scratch" str1 %>% strsplit('?',fixed=TRUE) #[[1]] #[1] "A scratch" " Your arm's off." # Pipes can be chained as well str1 %>% paste(str2) %>% toupper() # [1] "A SCRATCH? YOUR ARM'S OFF. I'VE HAD WORSE." # Let's see how pipes might work with drawing random variables # I like to define a function that allows an element by element maximization vmax <- function(x, maximum=0) x %>% cbind(0) %>% apply(1, max) -5:5 %>% vmax # [1] 0 0 0 0 0 0 1 2 3 4 5 # This is identical to defining the function as: vmax <- function(x, maximum=0) apply(cbind(x,0), 1, max) vmax(-5:5) # Notice that the latter formation uses the same number of parenthsize # and be more readable. # However recently I was drawing data for a simulation in which I wanted to # draw Nitem values from the quantiles of the normal distribution, censor the # values at 0 and then randomize their order. Nitem <- 100 ctmean <- 1 ctsd <- .5 draws <- seq(0, 1, length.out = Nitem+2)[-c(1,Nitem+2)] %>% qnorm(ctmean,ctsd) %>% vmax %>% sample(Nitem) # While this looks ugly, let's see how worse it would have been without pipes draws <- sample(vmax(qnorm(seq(0, 1, length.out = Nitem+2)[-c(1,Nitem+2)] ,ctmean,ctsd)),Nitem) # Both functional sequences are ugly though I think I prefer the first which # I can easily read as seq is passed to qnorm passed to vmax passed to sample # A few things to note with the %>% operator. If you want to send the value to # an argument which is not the first or is a named value, use the '.' mydata <- seq(0, 1, length.out = Nitem+2)[-c(1,Nitem+2)] %>% qnorm(ctmean,ctsd) %>% vmax %>% sample(Nitem) %>% data.frame(index = 1:Nitem , theta = .) # Also not that the operator is not as slow as you might think it should be. # Thus: 1 + 8 %>% sqrt # Returns 3.828427 # Rather than (1 + 8) %>% sqrt # [1] 3

## Wednesday, July 23, 2014

### Making random draws from an arbitrarily defined pdf

I recently found myself in need of a function to sample randomly from an arbitrarily defined probability density function. An excellent post by Quantitations shows how to accomplish this using some of Rs fairly sophisticated functional approximation tools such as integrate and uniroot. The only problem with this excellent post was that the machine cost was enormous with samples of 1000 draws taking 5 seconds on my machine with repeated samples of 100,000+ draws (which I was after) clearly being unworkable.

Thus I decided to take my own crack at it. First let us review the basics of drawing random variables from non-uniform distributions. The standard method I think most algorithms use works as follows:

Assumptions

1. You can draw pseudo-random uniform variable u

2. You can integrate the pdf to construct a cdf

$$p = F(x) = \int_{-\infty}^\infty f(x) dx$$

3. You can invert the cdf in order to solve for p

$$G(F(x))=F^{-1}(F(x))=F^{-1}(p)=x$$

The method thus relies upon the somewhat simple method of calculating x by drawing u and plugging into G (the inverse of F).

Now let us assume that we are in a bit of a bind. We can neither integrate f(x) nor invert F(x). The previously mentioned post by Quantitations demonstrates how to do the operation in this case by directly approximating the cdf followed by approximating the inverse. This process is computationally intensive for simple functions and extremely time consuming for complex functions.

In contrast I approximate the cdf by drawing x values and associated probabilities from the pdf along a user specified range. I create a matrix of bins over which the approximate cdf is divided into. After that I apply some function (mean or median) over the range of x which corresponds to each bin providing an x point value for each cdf bin value. The gacdf returns a list of results from this process.

The ricdf function then takes the list of returns and is able to draw values from the approximated inverse cdf. Once the gicdf has completed its operation, ricdf is able to generate variables nearly as fast as that of standard non-uniform random variables.

As a matter of comparison, I define the funciton f as the pdf of the normal (dnorm) in R and draw from it 1000 time. Using rnorm, ricdf [defined in this post], and samplepdf [defined at Quantitations] I graphically see how the three draws compare below.

On the graph, Black type=0, is rnorm. Dark blue type=1, is ricdf. Light blue, type=2 is samplepdf. Redrawing the graph several times, visually I could not tell the difference between the three methods.

Comparing run times of the three methods over ten repetitions rnorm used no seconds, ricdf (with ficdf) used .8 seconds, and samplepdf used 5. Because ricdf and gicdf are separate functions with gicdf setting up the table to draw from, we only need to set it up once per pdf so ricf can be benchmarked separately from gicfd. Comparing 1000 replications of rnorm and ricdf, rnorm took .1 seconds and ricdf took .11 seconds. I did not compare with samplepdf which I expected would take 5200 seconds (87 minutes).

For fun I have generated several other strange 'proportional' pdfs. I say proportional because strictly speaking pdfs are required to integrate to 1. However, gicdf forces the probabilities to sum to 1 across the range making the formal requirement not necessary.

I define a piecewise pdf

And sample from it.

I also define a multimodal pdf

And sample from it.

Find the R code below or the gist on github.

Created by Pretty R at inside-R.org

Thus I decided to take my own crack at it. First let us review the basics of drawing random variables from non-uniform distributions. The standard method I think most algorithms use works as follows:

Assumptions

1. You can draw pseudo-random uniform variable u

2. You can integrate the pdf to construct a cdf

$$p = F(x) = \int_{-\infty}^\infty f(x) dx$$

3. You can invert the cdf in order to solve for p

$$G(F(x))=F^{-1}(F(x))=F^{-1}(p)=x$$

The method thus relies upon the somewhat simple method of calculating x by drawing u and plugging into G (the inverse of F).

Now let us assume that we are in a bit of a bind. We can neither integrate f(x) nor invert F(x). The previously mentioned post by Quantitations demonstrates how to do the operation in this case by directly approximating the cdf followed by approximating the inverse. This process is computationally intensive for simple functions and extremely time consuming for complex functions.

In contrast I approximate the cdf by drawing x values and associated probabilities from the pdf along a user specified range. I create a matrix of bins over which the approximate cdf is divided into. After that I apply some function (mean or median) over the range of x which corresponds to each bin providing an x point value for each cdf bin value. The gacdf returns a list of results from this process.

The ricdf function then takes the list of returns and is able to draw values from the approximated inverse cdf. Once the gicdf has completed its operation, ricdf is able to generate variables nearly as fast as that of standard non-uniform random variables.

As a matter of comparison, I define the funciton f as the pdf of the normal (dnorm) in R and draw from it 1000 time. Using rnorm, ricdf [defined in this post], and samplepdf [defined at Quantitations] I graphically see how the three draws compare below.

Three random sampling procedures for the random normal. |

Comparing run times of the three methods over ten repetitions rnorm used no seconds, ricdf (with ficdf) used .8 seconds, and samplepdf used 5. Because ricdf and gicdf are separate functions with gicdf setting up the table to draw from, we only need to set it up once per pdf so ricf can be benchmarked separately from gicfd. Comparing 1000 replications of rnorm and ricdf, rnorm took .1 seconds and ricdf took .11 seconds. I did not compare with samplepdf which I expected would take 5200 seconds (87 minutes).

For fun I have generated several other strange 'proportional' pdfs. I say proportional because strictly speaking pdfs are required to integrate to 1. However, gicdf forces the probabilities to sum to 1 across the range making the formal requirement not necessary.

I define a piecewise pdf

And sample from it.

I also define a multimodal pdf

And sample from it.

Find the R code below or the gist on github.

gicdf <- function(fun, min=-3.5, max=3.5, bins=1000, nqratio=10, grouping=mean, ...) { # Generate an inverse CDF of an arbitrary function fun <- match.fun(fun) grouping <- match.fun(grouping) # Number of points to draw nq=nqratio*bins # Draw indexes qdraw <- seq(min, max,length.out=nq) # Calculate proportional probability of each draw pdraw <- fun(qdraw,...) # Rescale probability sum to 1, rescale pdraw <- pdraw/sum(pdraw) # Calculate the cumulative probability at each qdraw cpdraw <- cumsum(pdraw) # Caculate the cumulative probability at each bin pbin <- (1:bins)/bins xbin <- NA*(1:bins) for (i in 1:bins) { xbin[i] <- grouping(qdraw[cpdraw<pbin[i]&cpdraw>0], na.rm = TRUE) cpdraw[cpdraw<pbin[i]] <- 2 } (draw.set <- list(digits=floor(log10(bins)), xbin=xbin, pbin=pbin)) } # Draw from acdf ricdf <- function(N, draw.set) { digits <- draw.set$digits pdraws <- ceiling(runif(N)*10^digits)/10^digits draw.set$xbin[match(pdraws,draw.set$pbin)] } # Define an arbitrary pdf f: to compare lets start with normal pdf f <- function(x) dnorm(x) library(ggplot2) # set the number to sample nsamples <- 1000 # Draw from rnorm sample0 <- cbind(draw=rnorm(nsamples), type=0) # Draw inverted cdf distribution information for range between -4 and 4 # using mean approach sample1 <- cbind(draw=ricdf(nsamples, gicdf(f,min=-4,max=4)), type=1) # samplepdf and endsign defined on Quantitations blog # begin Quantitations code endsign <- function(f, sign = 1) { b <- sign while (sign * f(b) < 0) b <- 10 * b return(b) } samplepdf <- function(n, pdf, ..., spdf.lower = -Inf, spdf.upper = Inf) { vpdf <- function(v) sapply(v, pdf, ...) # vectorize cdf <- function(x) integrate(vpdf, spdf.lower, x)$value invcdf <- function(u) { subcdf <- function(t) cdf(t) - u if (spdf.lower == -Inf) spdf.lower <- endsign(subcdf, -1) if (spdf.upper == Inf) spdf.upper <- endsign(subcdf) return(uniroot(subcdf, lower=spdf.lower, upper=spdf.upper)$root) } sapply(runif(n), invcdf) } # end Quantitations code sample2 <- cbind(draw=samplepdf(nsamples, f), type=2) # join the three samples together samples <- as.data.frame(rbind(sample0, sample1, sample2)) ggplot(samples, aes(x=draw, color=type))+ geom_density(aes(group=type, size=-type)) library(rbenchmark) # Let us do some speed comparisons benchmark(rnorm(nsamples), ricdf(nsamples, gicdf(f,min=-4,max=4)), samplepdf(nsamples, f), replications=10) # test replications elapsed relative user.self #2 ricdf(nsamples, gicdf(f, min = -4, max = 4)) 10 7.97 #1 rnorm(nsamples) 10 0.00 #3 samplepdf(nsamples, f) 10 52.39 # This sets the entire process of generating and drawing from ricdf as about 7 times # faster than samplepdf. However, if we seperate the gicdf function from the # ricdf function which only need be run each time a new pdf is input, then we can # considerably increase run speeds. myicdf <- gicdf(f,min=-4,max=4) # Sets up the initial icdf to draw from. benchmark(rnorm(nsamples), ricdf(nsamples, myicdf), replications=1000) # elapsed time for rnorm = .10 seconds and ricdf = .11 # thus drawing from ricdf can be extremely fast # Now let's look at an unusual proportional pdf distribution f <- function(x) { p <- rep(0, length(x)) p[x>0&x<1] <- x[x>0&x<1]^2 p[x>2&x<3] <- 3-x[x>2&x<3]^.5 p[x>4&x<5] <- x[x>4&x<5] p } x <- seq(-1,6,.01) plot(x,f(x), type='l', xlab='Proportional Probability', main='Proportional pdf') # A key thing to note is that the pdf does not need to integrate to 1. # The gicdf function will rescale it to ensure it does. # However, it should all be positive # I define bins as equal to 10,000 when normally they are equal to 1000 myicdf <- gicdf(f,min=-1,max=6, bins=1000) samples <- data.frame(draws=ricdf(100000,myicdf)) ggplot(samples, aes(x=draws))+ geom_histogram(binwidth=.1)+ aes(y = ..density..) # Okay here is one more probability f <- function(x) dnorm(x)*abs(1+x^4) x <- seq(-5,5,.01) plot(x,f(x), type='l', xlab='Proportional Probability', main='Proportional pdf2') myicdf <- gicdf(f,min=-5,max=5, bins=1000) samples <- data.frame(draws=ricdf(100000,myicdf)) ggplot(samples, aes(x=draws))+ geom_histogram(binwidth=.2)+ aes(y = ..density..) # I will leave you to define your own pdfs to draw from. # If you end up using this method, please cite!

## Saturday, July 19, 2014

### The Rise and Fall of the name of Jennifer

A recent fascinating post by Analysis at Large got me interested in naming trends analysis. The original post used Social Security data by state to map out the more frequent names by state. While this sounded interesting it turned out a single name (you can read the post yourself to discover that name) dominated the map for all but two states (California and Nevada) which had Jennifer as their most frequent name. The male name distribution was more heterogeneous but still not very informative.

Looking at the data from SS I wanted to see not only what were the most popular names since 1880 (as the previous post had found) but also how the popularity of the names changed over time. First looking at the Figure 1 and 2 show the most popular Male and Female names in terms of the total number of Americans named a particular name. For males there were over 5 million Johns and James and for females there was a little over 4 million Marys.

Figures 3 and 4 show how the top 7 male and female names have risen and fallen in count values over time. It is interesting to note the distinctly different pattern among male names as female names. Male name popularity among the top 7 seems generally unimodal while female names have distinctly different peaks at different places.

In order to cut down on extraneous noise caused from rises and falls of birth rates I also included measurements of what proportion of the total names listed (in the entire SS list) each of the top 7 names represented (Figure 5 and 6). I find these figures much more interesting as they show how some names such as John and William which extremely popular in the 1880s have since fallen dramatically in popularity. Likewise among women the name Mary has fallen almost constantly from a unique peak to a dominative level among female names. It is interesting that among women female names seem to peak with high popularity from very low levels of usage such as the names Jennifer, Margarette, Linda, and Barbara then once again fall in popularity.

Finally in an attempt to get at the right most tails of these trends which uniformly seem to be trending down, I ranked the names among the total number of instances per year for each name. Figure 7 and 8 show the results. In both figures the y axis is scaled by log 10. For men the name Micheal is particularly interesting with very low ranking on usage in the early 20th century and emerging as the most popular name in the decades between 1960 and 2000. Even more interesting, the among the female names the name Jennifer first appeared around 1918 with a usage ranking around 5000 only to steadily gain popularity until being the most popular name is use in the 1970s to early 1980s. Other popular names such as Patricia, Margette, and Barbara followed similar patterns though none so stark.

The total rankings of female names confirms a lose in popularity of traditionally popular names with perhaps the slight exception of Elizabeth which while never extremely popular has managed to stay around the top 10 most popular names in recent decades. Among male names there does not seem to be as much of a faddish behavior with all of the 7 names observed remaining within the top 60 names in total of those chosen.

To see how these tables are produced you may find my R code below the figures or on github.

Created by Pretty R at inside-R.org

Looking at the data from SS I wanted to see not only what were the most popular names since 1880 (as the previous post had found) but also how the popularity of the names changed over time. First looking at the Figure 1 and 2 show the most popular Male and Female names in terms of the total number of Americans named a particular name. For males there were over 5 million Johns and James and for females there was a little over 4 million Marys.

Figures 3 and 4 show how the top 7 male and female names have risen and fallen in count values over time. It is interesting to note the distinctly different pattern among male names as female names. Male name popularity among the top 7 seems generally unimodal while female names have distinctly different peaks at different places.

In order to cut down on extraneous noise caused from rises and falls of birth rates I also included measurements of what proportion of the total names listed (in the entire SS list) each of the top 7 names represented (Figure 5 and 6). I find these figures much more interesting as they show how some names such as John and William which extremely popular in the 1880s have since fallen dramatically in popularity. Likewise among women the name Mary has fallen almost constantly from a unique peak to a dominative level among female names. It is interesting that among women female names seem to peak with high popularity from very low levels of usage such as the names Jennifer, Margarette, Linda, and Barbara then once again fall in popularity.

Finally in an attempt to get at the right most tails of these trends which uniformly seem to be trending down, I ranked the names among the total number of instances per year for each name. Figure 7 and 8 show the results. In both figures the y axis is scaled by log 10. For men the name Micheal is particularly interesting with very low ranking on usage in the early 20th century and emerging as the most popular name in the decades between 1960 and 2000. Even more interesting, the among the female names the name Jennifer first appeared around 1918 with a usage ranking around 5000 only to steadily gain popularity until being the most popular name is use in the 1970s to early 1980s. Other popular names such as Patricia, Margette, and Barbara followed similar patterns though none so stark.

The total rankings of female names confirms a lose in popularity of traditionally popular names with perhaps the slight exception of Elizabeth which while never extremely popular has managed to stay around the top 10 most popular names in recent decades. Among male names there does not seem to be as much of a faddish behavior with all of the 7 names observed remaining within the top 60 names in total of those chosen.

To see how these tables are produced you may find my R code below the figures or on github.

Figure 1

Figure 2
Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

R Code

require(plyr) require(ggplot2) require(scales) # Download data from: # http://www.ssa.gov/oact/babynames/names.zip setwd("C:/Data/SS-names/") files<-list.files() files<-files[grepl(".txt",files)] ###### Reading files namedata <- matrix(0,ncol=4,nrow=0) for (i in 1:length(files)) namedata<-rbind(namedata, cbind(read.csv(files[i],header=F), substr(files[i],4,7))) colnames(namedata)<-c("name","gender","count", "year") dim(namedata) # 1.8 million rows Mdata<-namedata[namedata$gender=="M",] Fdata<-namedata[namedata$gender=="F",] Msums <- ddply(Mdata, .(name), summarize, sum=sum(count)) Fsums <- ddply(Fdata, .(name), summarize, sum=sum(count)) nrow(Msums); nrow(Fsums) # There are 38601 male names and 64089 female names Morder <- Msums[order(Msums[,2], decreasing = TRUE),] Forder <- Fsums[order(Fsums[,2], decreasing = TRUE),] c <- ggplot(Morder[1:20,], aes(x = name, y = sum, size=sum)) c + geom_point() + coord_flip() + theme(legend.position="none")+ ggtitle("20 Most Popular Male Names Since 1880")+ xlab("")+

scale_y_continuous(name="Names Recorded With Social Security Administration",

labels = comma) # Figure 1 c <- ggplot(Forder[1:20,], aes(x = name, y = sum, size=sum)) c + geom_point() + coord_flip() + theme(legend.position="none")+ ggtitle("20 Most Popular Female Names Since 1880")+ xlab("")+

scale_y_continuous(name="Names Recorded With Social Security Administration",

labels = comma) # Figure 2 Mdata$order <- Fdata$torder <- NA # Create a variable for Mdata$prop <- Fdata$prop <- NA for (i in 1880:2013) { Mdata[Mdata$year==i, "torder"] <- order(-Mdata[Mdata$year==i, "count"]) Mdata[Mdata$year==i, "prop"] <- (Mdata[Mdata$year==i, "count"])/ sum((Mdata[Mdata$year==i, "count"])) Fdata[Fdata$year==i, "torder"] <- order(-Fdata[Fdata$year==i, "count"]) Fdata[Fdata$year==i, "prop"] <- (Fdata[Fdata$year==i, "count"])/ sum((Fdata[Fdata$year==i, "count"])) } top <- 7 Mrestricted <- Mdata[Mdata$name%in%Morder[1:top,1],] Frestricted <- Fdata[Fdata$name%in%Forder[1:top,1],] ggplot(Mrestricted, aes(x=year, y=count, group=name, color=name))+ geom_line(size=1)+scale_x_discrete(breaks=seq(1880,2010,20)) ggplot(Mrestricted, aes(x=year, y=prop, group=name, color=name))+ geom_line(size=1)+scale_x_discrete(breaks=seq(1880,2010,20))+ ylab("Proportion of Total Names") ggplot(Mrestricted, aes(x=year, y=torder, group=name, color=name, size=torder))+ geom_line()+scale_x_discrete(breaks=seq(1880,2010,20))+ ylab("Order of Total Names That Year (log10)")+scale_y_log10() ggplot(Frestricted, aes(x=year, y=count, group=name, color=name))+ geom_line(size=1)+scale_x_discrete(breaks=seq(1880,2010,20)) ggplot(Frestricted, aes(x=year, y=prop, group=name, color=name))+ geom_line(size=1)+scale_x_discrete(breaks=seq(1880,2010,20))+ ylab("Proportion of Total Names") ggplot(Frestricted, aes(x=year, y=torder, group=name, color=name, size=torder))+ geom_line()+scale_x_discrete(breaks=seq(1880,2010,20))+ ylab("Order of Total Names That Year (log10)")+scale_y_log10()

## Thursday, July 17, 2014

### RStudio Webinar with Hadley Wickham: The Grammar and Graphics of Data Science

RStudio has recently announced a series of free webinars open to the public. The first of these

seminars is given by Hadley Wickham, Rice University Professor, RStudio Chief Scientist, and general super-star of the R development world. Contributing author of the popular R packages ggplot2, plyr, testhat, reshape, and several of the other most innovative and widely used R packages currently available.

The first seminar is entitled

Upcoming seminars include one focused on "Reproducible Reporting" for which I believe will lead to a revolution in the Social Sciences in the next decade in terms of a tremendous increase in the quality and quantity of scientific work.

The third upcoming seminar is entitled "Interactive Reporting" and is focused on the development of tools to create dynamic reporting, in particular the popular package and server environment named "shiny".

To find a link to the registration site for these webinars go to:

pages.rstudio.net/Webniar-Series-Essential-Tools-for-R.html

seminars is given by Hadley Wickham, Rice University Professor, RStudio Chief Scientist, and general super-star of the R development world. Contributing author of the popular R packages ggplot2, plyr, testhat, reshape, and several of the other most innovative and widely used R packages currently available.

The first seminar is entitled

__The Grammar and Graphics of Data Science__and I would imagine it involving much information with regards to ggplot2 as well as many other essential graphical tools in the R environment.Upcoming seminars include one focused on "Reproducible Reporting" for which I believe will lead to a revolution in the Social Sciences in the next decade in terms of a tremendous increase in the quality and quantity of scientific work.

The third upcoming seminar is entitled "Interactive Reporting" and is focused on the development of tools to create dynamic reporting, in particular the popular package and server environment named "shiny".

To find a link to the registration site for these webinars go to:

pages.rstudio.net/Webniar-Series-Essential-Tools-for-R.html

## Thursday, July 10, 2014

### Item Response Theory and Item Information Exploration

In the following post I will map out some item information functions for item response theory (IRT) models using the common 3 parameter logistic model for binary responses. The model takes three parameters (obviously) which relate to the item features easily named a, b, and c. a refers to item discrimination or ability to tell between subjects with high ability and low ability while b is connected with item difficulty and c is commonly known as the guessing parameter. This is a value which represents the minimum probability a subject should have of getting the answer correct if the subject knows nothing.

The 3PL takes the form of a logistic model. The probability of observing a correct outcome given a subject with ability theta is defined as follows

$$P(u=1|\theta,a,b,c)=c+\frac{1-c}{1+\exp(-Da(\theta-b))}$$

See below to see this probability mapped out. This post is not so much about the probability of correct responses using the 3PL but also the information functions associated with the 3PL model.

Information is a tricky concept as it seems strange that there is a functions that produces information. In a sense the information is a measure of how much certainty we know about a specific parameter. In IRT we are primarily concerned with how much information we know about the ability scores of our test takers. Typical computerized adaptive tests (CAT) incorporate algorithms that select items which are meant to maximize postential information or some other criteria related to information. Information for a test item with regards to subject ability estimates theta is defined by Lord (1980) as

$$I_{\theta}=a^2 *\frac{(pi-c)^2}{(1-c)^2}\frac{Q(\theta)}{P(\theta)}$$

with Q defined as (1-P. In order to make informed analysis of our test takers we must also know how much information we have with regards to our item parameters. Stocking (1990) derives the item parameter information functions for the parameters a,b, and c

$$I_a = \frac{D^2}{(1-c)^2}(\theta-b)^2(P-c)^2 \frac{Q}{P}$$

$$I_b = \frac{D^2a^2}{(1-c)^2}(P-c)^2 \frac{Q}{P}$$

$$I_c = \frac{1}{(1-c)^2}\frac{Q}{P}$$

Now let us see how they look. I will not explain the components of each graph as such things have been explained by many others. Reading the wikipedia article on Item Response Theory and the Stocking 1990 paper "Specifying Optimum Examinees for Item Parameter Estimation in Item Response Theory" will give the curious fairly detailed sources to learn from.

Okay, okay, I will add some commentary. With regards to item information about subject ability it is clear that we want to target items at the point closest to where the subject has a equal or near equal probability of answering correctly or incorrectly. The more questions we drop on that location the more likely the response pattern will be something like 1,0,1,1,0,0,0,1 which will tell use that the subject's ability is at or very close to that level. As we move away from that point the information we have with regards to subject ability begins to decline.

Notice how items which have a higher a parameter not only have steeper information functions but also considerably larger values (in terms of magnitude) which is significant. For this reason computerized adaptive tests must be carefully controlled through various exposure mechanisms so as to prevent those items which have the most discrimination from always being given.

Note how the information with regards to the b parameter and theta is maximized at the same point when c=0 (which is b=theta). When c>0 the optimal point for getting information for theta is offset to the right but so it is for theta (though not depicted in a graph). Thus regardless of which information you are seeking, item difficulty or test taker ability, you face a similar optimization when delivering items.

This is in stark contrast to item parameter a which has information minimized when b=theta. Information with regards to a parameter is best garnished when the question is taken by a test taker with considerably higher or lower ability (when c=0) and only with considerably higher ability (when c=.15). This is in contrast yet again with the information regarding the c parameter which is optimized when a question is taken by a subject who has considerably lower ability than the item difficulty.

The 3PL takes the form of a logistic model. The probability of observing a correct outcome given a subject with ability theta is defined as follows

$$P(u=1|\theta,a,b,c)=c+\frac{1-c}{1+\exp(-Da(\theta-b))}$$

See below to see this probability mapped out. This post is not so much about the probability of correct responses using the 3PL but also the information functions associated with the 3PL model.

Information is a tricky concept as it seems strange that there is a functions that produces information. In a sense the information is a measure of how much certainty we know about a specific parameter. In IRT we are primarily concerned with how much information we know about the ability scores of our test takers. Typical computerized adaptive tests (CAT) incorporate algorithms that select items which are meant to maximize postential information or some other criteria related to information. Information for a test item with regards to subject ability estimates theta is defined by Lord (1980) as

$$I_{\theta}=a^2 *\frac{(pi-c)^2}{(1-c)^2}\frac{Q(\theta)}{P(\theta)}$$

with Q defined as (1-P. In order to make informed analysis of our test takers we must also know how much information we have with regards to our item parameters. Stocking (1990) derives the item parameter information functions for the parameters a,b, and c

$$I_a = \frac{D^2}{(1-c)^2}(\theta-b)^2(P-c)^2 \frac{Q}{P}$$

$$I_b = \frac{D^2a^2}{(1-c)^2}(P-c)^2 \frac{Q}{P}$$

$$I_c = \frac{1}{(1-c)^2}\frac{Q}{P}$$

Now let us see how they look. I will not explain the components of each graph as such things have been explained by many others. Reading the wikipedia article on Item Response Theory and the Stocking 1990 paper "Specifying Optimum Examinees for Item Parameter Estimation in Item Response Theory" will give the curious fairly detailed sources to learn from.

Okay, okay, I will add some commentary. With regards to item information about subject ability it is clear that we want to target items at the point closest to where the subject has a equal or near equal probability of answering correctly or incorrectly. The more questions we drop on that location the more likely the response pattern will be something like 1,0,1,1,0,0,0,1 which will tell use that the subject's ability is at or very close to that level. As we move away from that point the information we have with regards to subject ability begins to decline.

Notice how items which have a higher a parameter not only have steeper information functions but also considerably larger values (in terms of magnitude) which is significant. For this reason computerized adaptive tests must be carefully controlled through various exposure mechanisms so as to prevent those items which have the most discrimination from always being given.

Note how the information with regards to the b parameter and theta is maximized at the same point when c=0 (which is b=theta). When c>0 the optimal point for getting information for theta is offset to the right but so it is for theta (though not depicted in a graph). Thus regardless of which information you are seeking, item difficulty or test taker ability, you face a similar optimization when delivering items.

This is in stark contrast to item parameter a which has information minimized when b=theta. Information with regards to a parameter is best garnished when the question is taken by a test taker with considerably higher or lower ability (when c=0) and only with considerably higher ability (when c=.15). This is in contrast yet again with the information regarding the c parameter which is optimized when a question is taken by a subject who has considerably lower ability than the item difficulty.

Pr <- function(theta, it, D=1) { a <- it[,1] b <- it[,2] c <- it[,3] e <- exp(D * a * (theta - b)) c + (1 - c) * e/(1 + e) } ### Set parameters to be mapped over theta <- seq(-3.5,3.5,.1) require(ggplot2) ### Plotting Item response functions cfix=0 afix=1 presp <- data.frame(theta=rep(theta,3), prob=c(Pr(theta, cbind(afix, -2, c=cfix)), Pr(theta, cbind(afix, 0, c=cfix)), Pr(theta, cbind(afix, 2, c=cfix))), b=rep(c(-2,0,2),each=length(theta))) ggplot(presp,aes(x=theta, y=prob, group=b, colour = b))+ geom_line(size=2) + ggtitle(paste0("Information response function when c=",cfix, " and a=", afix)) + xlab(expression(theta))

### Plotting Item response functions cfix=.2 afix=2 presp <- data.frame(theta=rep(theta,3), prob=c(Pr(theta, cbind(afix, -2, c=cfix)), Pr(theta, cbind(afix, 0, c=cfix)), Pr(theta, cbind(afix, 2, c=cfix))), b=rep(c(-2,0,2),each=length(theta))) ggplot(presp,aes(x=theta, y=prob, group=b, colour = b))+ geom_line(size=2) + ggtitle(paste0("Information response function when c=",cfix, " and a=", afix)) + xlab(expression(theta))

# Item information with regards to the estimate of person # is a well known equation. Itheta <- function(theta=theta, a=1,b=0,c=0, D=1) { pi <- Pr(theta=theta, it=cbind(a,b,c),D) a^2 * (pi-c)^2 * (1-pi) / ((1-c)^2*pi) } # Calculate the information about each item parameter Ia <- function(theta=theta, a=1,b=0,c=0, D=1) { pi <- Pr(theta=theta, it=cbind(a,b,c),D) D^2/(1-c)^2*(theta-b)^2*(pi-c)^2*(1-pi)/pi } Ib <- function(theta=0, a=1,b=0,c=0, D=1) { pi <- Pr(theta=theta, it=cbind(a,b,c),D) (D*a*(pi-c)/(1-c))^2*(1-pi)/pi } Ic <- function(theta=0, a=1,b=0,c=0, D=1) { pi <- Pr(theta=theta, it=cbind(a,b,c),D) 1/(1-c)^2*(1-pi)/pi } # Let's see these item information functions mapped out ### ---- Map out item information for latent ability theta cfix=0 afix=1 ainfo <- data.frame(theta=rep(theta,3), information=c(Itheta(theta, b=-2, a=afix, c=cfix), Itheta(theta, b=0 , a=afix,c=cfix), Itheta(theta, b=2 , a=afix,c=cfix)), b=rep(c(-2,0,2),each=length(theta))) ggplot(ainfo,aes(x=theta, y=information, group=b, colour = b))+ geom_line(size=2) + ggtitle(paste0("Information function for a when c=",cfix, " and a=", afix)) + xlab(expression(theta)) ### ---- Map out item information for latent ability theta cfix=0 afix=2 ainfo <- data.frame(theta=rep(theta,3), information=c(Itheta(theta, b=-2, a=afix, c=cfix), Itheta(theta, b=0 , a=afix,c=cfix), Itheta(theta, b=2 , a=afix,c=cfix)), b=rep(c(-2,0,2),each=length(theta)))

ggplot(ainfo,aes(x=theta, y=information, group=b, colour = b))+ geom_line(size=2) + ggtitle(paste0("Information function for a when c=",cfix, " and a=", afix)) + xlab(expression(theta)) ### ---- Map out item information for latent ability theta cfix=.5 afix=1 ainfo <- data.frame(theta=rep(theta,3), information=c(Itheta(theta, b=-2, a=afix, c=cfix), Itheta(theta, b=0 , a=afix,c=cfix), Itheta(theta, b=2 , a=afix,c=cfix)), b=rep(c(-2,0,2),each=length(theta))) ggplot(ainfo,aes(x=theta, y=information, group=b, colour = b))+ geom_line(size=2) + ggtitle(paste0("Information function for a when c=",cfix, " and a=", afix)) + xlab(expression(theta))

### ---- Map out item information for parameter a # c=0 cfix=0 ainfo <- data.frame(theta=rep(theta,3), information=c(Ia(theta, b=-2, c=cfix), Ia(theta, b=0 , c=cfix), Ia(theta, b=2 , c=cfix)), b=rep(c(-2,0,2),each=length(theta))) ggplot(ainfo,aes(x=theta, y=information, group=b, colour = b))+ geom_line(size=2) + ggtitle(paste0("Information function for a when c=",cfix)) + xlab(expression(theta))

### ---- Map out item information for parameter a # c=0.15 cfix=.15 ainfo <- data.frame(theta=rep(theta,3), information=c(Ia(theta, b=-2, c=cfix), Ia(theta, b=0 , c=cfix), Ia(theta, b=2 , c=cfix)), b=rep(c(-2,0,2),each=length(theta)))

ggplot(ainfo,aes(x=theta, y=information, group=b, colour = b))+ geom_line(size=2) + ggtitle(paste0("Information function for parameter a when c=",cfix)) + xlab(expression(theta)) ### ---- Map out item information for parameter b # c=0 cfix=0 binfo <- data.frame(theta=rep(theta,3), information=c(Ib(theta, b=-2, c=cfix), Ib(theta, b=0 , c=cfix), Ib(theta, b=2 , c=cfix)), b=rep(c(-2,0,2),each=length(theta))) ggplot(binfo,aes(x=theta, y=information, group=b, colour = b))+ geom_line(size=2) + ggtitle(paste0("Information function for parameter b when c=",cfix)) + xlab(expression(theta))

### ---- Map out item information for parameter b # c=0.15 cfix=.15 binfo <- data.frame(theta=rep(theta,3), information=c(Ib(theta, b=-2, c=cfix), Ib(theta, b=0 , c=cfix), Ib(theta, b=2 , c=cfix)), b=rep(c(-2,0,2),each=length(theta))) ggplot(binfo,aes(x=theta, y=information, group=b, colour = b))+ geom_line(size=2) + ggtitle(paste0("Information function for parameter b when c=",cfix)) + xlab(expression(theta))

### ---- Map out item information for parameter c # a=1, b=0 afix=1 bfix=0 cinfo <- data.frame(theta=rep(theta,3), information=c(Ic(theta, c=.15 , a=afix), Ic(theta, c=.3 , a=afix), Ic(theta, c=.45 , a=afix)), c=rep(c(.15,.3,.45),each=length(theta))) ggplot(cinfo,aes(x=theta, y=information, group=c, colour = c))+ geom_line(size=2) + ggtitle( paste0("Information function for parameter c when a=",afix, " and b=", bfix)) + xlab(expression(theta))

### ---- Map out item information for parameter c # a=2, b=0 afix=2 bfix=0 cinfo <- data.frame(theta=rep(theta,3), information=c(Ic(theta, c=.15 , a=afix), Ic(theta, c=.3 , a=afix), Ic(theta, c=.45 , a=afix)), c=rep(c(.15,.3,.45),each=length(theta))) ggplot(cinfo,aes(x=theta, y=information, group=c, colour = c))+ geom_line(size=2) + ggtitle( paste0("Information function for parameter c when a=",afix, " and b=", bfix)) + xlab(expression(theta))

` `

` `

Subscribe to:
Posts
(
Atom
)