Friday, August 30, 2013

Making regex examples work for you!

One of the most frequently used string recognition algorithms out there is regex and R implements regex.  However, users can often be frustrated with how despite taking examples verbatim from many sources such as stackoverflow they do not seem to work.  From my own experience, I have found that the largest issue is really about what characters need to be escaped from R.

For example:

Listing all files whose names match a simple pattern.

Looking at "/^.*icon.*\.png$/i" from

I was able to get "^.*icon.*.png$" to work in R though I lost the case insensitivity.  I think including the "^." ensures that only files in the current directory, not subdirectory are matched but I am not sure.

So, the following code will return a list of file names from the folder Clipart which match the pattern [anything]icon.png

list.files("C:/Clipart/", pattern="^.*icon.*.png$")
[1] "manicon.png"     "handicon.png"     "bookicon.png"

Looking at the original entry we can see that what was causing us problems was the attempt to escape the "^" which does not need to be escaped in R.

Before looking at another example lets modify the previous command slightly to show how we can make it match differently.

list.files("C:/Clipart/", pattern="^.*icon*.*.png$")
[1] "manicon.png"     "handicon.png"     "bookicon.png"    "iconnew.png"    

There are a lot of resources available for regex since it is really its own text matching language supported by many different programming languages.  A good introductory guide can be found:


Tuesday, August 27, 2013

Rstylizer - Shiny, Stata HTML Syntax Highlighter

I have released a shiny app through the Rstudio Spark server that allows for users to copy and paste Stata code into the app and it returns html formatted code.  The syntax structure is very easy to modify and I am looking forward to updating it when I have time, permitting extensions of the app to other languages as well as allowing users to customize the particular aspects of of the formatting such as making some commands bold or changing the color or how comments are identified.  In addition, I also hope to include an input field for users to add to language definitions, allowing for the creation of custom definitions.

[[The app can be found at:]]

The git can be found at:

In general the html generated is not very efficient since tags are repeated when css embedded styles is probably the preferred option.  However, since it makes very little difference in terms of load time, I figure this should work fine.  In addition, Blogger's gui seems to handle the tags better than embedded style sheets.

(I created an outdated git: as well but could not figure out how to change names so sorry for the messiness.)

Friday, August 23, 2013

The Delta Method Revisited

* In a previous post or two I have attempted to explore how to use the delta method to derive standard errors.  The purpose of those posts as with many of my posts was to understand the method better.  

* Unsurprisingly I made a fairly significant mistake.  Fortunately, Richard Hofler recently caught the mistake and has emailed me corrected code.  I have included the code (below).

* To see the original post: the-delta-method-to-estimate-standard

set seed 101

set obs 1000

gen x1 = rnormal()
gen x2 = rnormal() * 4

global b0 = 1
global b1 = 1.5
global b2 = .3

*          G = (3*b0-b1)*b2^2 
local true_G = (3*${b0}-${b1})*${b2}^2

di `true_G'

gen y = ${b0} + ${b1}*x1 + ${b2}*x2 + rnormal()*8

reg y x1 x2

* now retrieve the variance-covariance mtx for the estimated coefficients
* need this for var_hat(G) 
matrix Varcov = e(V)
* display the entire matrix
* note the lower diagonal form
matrix list Varcov

* put elements of Varcov into scalars for variances and covariances
scalar varb0=Varcov[3,3]
scalar varb1=Varcov[1,1]
scalar varb2=Varcov[2,2]
scalar covb1b2=Varcov[2,1]
scalar covb0b1=Varcov[3,1]
scalar covb0b2=Varcov[3,2]

*        G = (3*b0-b1)*b2^2 
local Ghat = (3*_b[_cons]-_b[x1])*_b[x2]^2

di "Ghat = `Ghat' is our estimate (true = `true_G')"

* Let's see if we can use the delta method to derive a standard error.
* correct expression 
* var_hat(G) = b2^2*(4*(3*b0 - b1)*b2*(3*covb0b2 - covb1b2) + 
* b2^2*(-6*covb0b1 + 9*varb0 + varb1) + 4*(-3*b0 + b1)^2*varb2)
scalar var_Ghat = _b[x2]^2*(4*(3*_b[_cons] - _b[x1])*_b[x2]*    /// 
(3*covb0b2 - covb1b2) + _b[x2]^2*(-6*covb0b1 + 9*varb0 + varb1) +     /// 
4*(-3*_b[_cons] + _b[x1])^2*varb2)

* take sq. root of var_Ghat to get standard error of G-hat
scalar se_Ghat = sqrt(var_Ghat)

di "Standard error by delta method is " se_Ghat

* Francis Smart - Coding again

* We can also easily calculate the standard error via bootstrap.

cap program drop gest
program define gest

  reg y x1 x2
  tempvar G
  gen `G' = (3*_b[_cons]-_b[x1])*_b[x2]^2
  sum `G'


bs r(mean), rep(200): gest
* I get a standard error of .105 which is quite close to the standard error derived by
* the delta method.

* However what we really want to be comparing our standard error estimates to is the true
* standard deviation of the estimate of G.

* This we can do because we are simulating our data.  Thus we can estimate G repeatedly
* by repeating the simulation.

cap program drop ggen
program define ggen

  set obs 1000

  gen x1 = rnormal()
  gen x2 = rnormal() * 4

  gen y = ${b0} + ${b1}*x1 + ${b2}*x2 + rnormal()*8
  reg y x1 x2

  gen G = (3*_b[_cons]-_b[x1])*_b[x2]^2
  sum G


simulate  r(mean), rep(200): ggen
* I get a standard deviation of G estimate of .097 which is quite close to our other methods of estimating G.

Formatted By Econometrics by Simulation

Wednesday, August 21, 2013

Doodling in R!

# I am working on creating some functions that will be capable of creating shapes and plots that look hand drawn.
# I have made some progress in this goal.

# In that process I have also discovered that I can make some doodles that look hand drawn as well.
# In order to accomplish the goal of simulating hand drawing I want to simulate the momentum of hand writing.
# In order to do that I will break the task down into a goal oriented system where each end point is a target.
doodle <- function(
  targets = rbind(c(0,10),c(10,10), c(10,0), c(0,0)) ,
  tdist = .25,
  speed = c(0,0),
  accel = .1,
  resis = .005,
  jitter = .0005,
  chncStp = 0) {
  # start - We start with the starting position
  # targ - Points that will be pursued (initially just a square)
  # tdist - How close we need to get to each point before moving on
  # speed - Initial speed
  # accel - How fast does the drawer accelerate towards that point
  # resis - What percentage of speed is lost each round
  # jitter - A normal draw random jitter that moves the writing tool in an unexpected direction.
  # chncStp - There is some chance that the drawing tool will kill all momentum and stop.
  # First off I define a function uvect to convert any two sets of points 
  # into a unit vector and measure the distance between the two points.
  uvect <- function(p1,p2=NULL) {
    if (is.null(p2)) {
      p2 <- p1[[2]]
      p1 <- p1[[1]]
    list(vect=(p2-p1)/sqrt(sum((p1-p2)^2)), dist=sqrt(sum((p1-p2)^2)))
  # Starup parameters
  i <- 1
  plist <- position <- start # plist saves all of the points that the drawing tool has passed through
  vect <- uvect(position,targets[i,])
  while(i<=nrow(targets)) {
    # Calculate the appropriate unit vector and distance from end point
    vect <- uvect(position,targets[i,])
    # Remove some amount of speed from previous velocity
    speed <- speed*(1-resis)
    # IF drawer randomly stops remove all speed
    if (rbinom(1,1,chncStp)) speed<-0
    speed <- speed + accel*vect[[1]] + rnorm(2)*jitter
    position <- position + speed
    plist <- rbind(plist,position)
    vect <- uvect(position,targets[i,])
    if (vect[[2]]<tdist) i <- i+1
plist <- doodle()
plot(plist, type="n", lwd=3)
lcol <- rainbow(nrow(plist-1))
for (i in 1:(nrow(plist)-1)) lines(plist[c(i:(i+1)),], type="l", lwd=3, col=lcol[i])

# However this was not the primary intention of this function.
# The main intention is to be able to make plots that look hand drawn.
shape1 <- doodle(cbind(c(0,10,10,0),c(10,10,0,0)),resis=.25)
plot(shape1, type="l", lwd=1)

shape2 <- doodle(cbind(c(0,-2,5,15,10,0),c(5,9,10,5,2,0)),resis=.25)
plot(shape2, type="l", lwd=1)
Created by Pretty R at

# To tell you the truth.  I don't know what is going on some I am going to have to debug this function for a while.  In the mean time it is making unexpected shapes which look pretty crazy.

Monday, August 19, 2013

Export R Results Tables to Excel - Please don't kick me out of your club

This post is written as a result of finding the following exchange on one of the R mailing lists:


[Make sure to check out the many great comments on the bottom of the post.  That is where some better answers to this problem can be found.]

Question: Is there a way to export regression output to an excel spreadsheet?
Translation: I would like to be able to do a very simple thing that almost any statistical programming language can easily do, please suggest a basic command to do that.

Response1: ?lm ?coef ?write.csv ...
Translation: Read the manual and try this bit of incomplete code.

Questioner: I am very very new with R... Is there some simple code I could just paste?
Translation: Really? Isn't there anything you could suggest?

Response2: This is the help-you-learn-R mailing list, not the do-my-work-for-me mailing list...
Translation: Go F%$# yourself, freeloader. We only answer interesting questions.

I just wanted to say that it is just this type of response that gives R-users a bad reputation. I am an active R user and very happy to contribute to R in whatever way is possible but when I see posts like this, it makes me want to switch to a language in which the users are NICE people.  Okay, I know, I know.  A few bad apples should not spoil the basket, but sometimes things just taste rotten.

First thing, the reason I even stumbled across this post was because I had the same very similar question. Looking at the hits on the bottom of the page I can see that there are over 400 people who have viewed this discussion I am guessing most of them because they were looking for a specific solution rather than being interested in seeing how quickly experienced R users could could offend new users (for which there are numerous other examples).

In all likelihood, a lot of other new R users have come across this same post and been equally confounded but this rude and ridiculous response.

The original user who asked this question asked a very simple question for which any statistical language should have a very simple canned response.  Something along the lines:

lmOut(mylm, file="results", filetype="csv")

Yet the response that was instead produced was one which was overly complex, patronizing, and ultimately needlessly insulting.

I have written a little program to help format regression summary statistics into spreadsheet formats easily read by excel. Sorry if this is redundant. I am sure hundreds of people have programmed similar solutions. But I think it might be useful to many users who are not very familiar with how R constructs results.

Find the code on this gist:

Output looks something like this:

Looking over the original exchange it does not look like this code would even work for a logit for which it was originally needed.  However, I will post it on github and perhaps others will find the concept useful enough to make revisions (unlikely).  I might take another go at making it more general in the future though in all likelihood some user will send me an angry message saying "this has already been done by ...".

Question and Answer: Generating Binary and Discrete Response Data

I was recently contacted by a reader with two very specific questions and I thought that this would be a good topic to publicity respond to. He would like to simulate his data:
I have firm level data and the model is discrete choice with the main explanatory variable also a binary choice:  First question is how can I calibrate the data generation model? 


This is a fundamental question for any kind of econometric model.  How you calibrate your data implies the inherent structure of your data which in term implies what method you should use to attempt to recover your parameters.  Now some data generating processes exist out there which do not yet have econometric solutions to.  Yet there are many that do.

In general you can calibrate your data by i. modifying the parameters, ii. the distribution of explanatory variables, or iii. the distribution of the errors.

In a binary response case the most common models are probit/logit in which case in order to simulate data you would generate your underlying model and overlay the appropriate CDF over it which gives you probabilities of a success.  Finally you would make a random draw based on those probabilities for each outcome being simulated.

I have numerous example code demonstrating this:
Stata: (Reverse Engineering a Probit) (Probit vs Logit)
Nobs <- 10^4
X <- cbind(cons=1, X1=rnorm(Nobs),X2=rnorm(Nobs),X3=rnorm(Nobs))
B <- c(B0=-.2, B1=-.1,B2=0,B3=-.2)
P <- pnorm(X%*%B)
SData <-,1,P), X))
summary(glm(Y ~ X1 + X2 + X3, family = binomial(link = "probit"), data = SData))

Discrete Data
As for discrete data, it is less clear what the optimal choice is. I prefer the logistic regression which is basically an extension of the Logit model with a few interesting caveats.

Stata: (Simulating Multinomial Logit)
R: (here is an article dealing specifically with using R to create discrete response data

Nobs <- 10^4
X <- cbind(cons=1, X1=rnorm(Nobs),X2=rnorm(Nobs),X3=rnorm(Nobs))
# Coefficients, each input vector (c) is associated with a different outcome
B <- cbind(0, c(B0=-.2, B1=-.1,B2=0,B3=-.2), c(B0=.3, B1=0,B2=.6,B3=.4))
# Everything is relative to option 1 which is the default
num <- exp(X%*%B) # Numerator
den <- apply(num,1,sum) # Denominator
P <- num * 1/cbind(den,den,den) # Probability
CP <- cbind(P[,1],P[,1]+P[,2]) # Cumulative probabilities
U <- runif(Nobs) # Draw from the uniform draw
Y <- rep(0,Nobs) ; Y[U>CP[,1]]<-1; Y[U>CP[,2]]<-2 # Calculate outcome

SData <-, X)) # Combine Datarequire("nnet")
summary(Mlogit <- multinom(Y ~ X1 + X2 + X3, data = SData))

Friday, August 16, 2013

Study design - test power, test rejection rate, permutation test

In an experiment involving social policy, a question has arisen as to what statistical method to use. The outcome is 0/1 -- let's say, whether a criminal defendant shows up for trial or not. Then there are a control group and 4 treatment groups (thus, 5 indicator variables showing which group each defendant is in). So the point is to predict the 0/1 outcome based on which treatment group the defendant is in.

One possibility is to use ANOVA, but I'm not sure of that due to the assumption that the outcome is normally distributed, which an indicator variable is not.

I'd like to simulate a dataset such that the outcome (0/1) is as follows: Control group, 20% chance of not showing up (label that as 1, so 20% of the controls should show 1 in the outcome column and 80% should show zeros); treatment group 1, 18% chance of not showing up), treatment group 2, 16% chance of not showing up; treatment group 3, 14% chance of not showing up; and treatment group 4, 12% chance of not showing up. Then I'd like to be able to compare the results from ANOVA as well as a permutation test's

Stata Code:

* Define a program called sampler to generate our data and run our tests.
cap program drop sampler
program define sampler , rclass

    * The first argument of the sampler command is the sample
 * size to be generated.
    gl samp_size = `1'

 * Start generating data
 set obs 5

 * A token observation for each group defined by treatment status
 gen treat = _n

 * Generate the numbers in each group size
 gen     grp_size = ${samp_size} * ${ntreat1}  if treat==1
 replace grp_size = ${samp_size} * ${ntreat2}  if treat==2
 replace grp_size = ${samp_size} * ${ntreat3}  if treat==3
 replace grp_size = ${samp_size} * ${ntreat4}  if treat==4
 replace grp_size = ${samp_size} * ${ntreat5}  if treat==5
 tab treat, gen(treat)

 * Generate a variable to store the true probability that a defendent 
 *   will show up for trial.
 gen     p_show_up = ${treat1_trial}  if treat==1
 replace p_show_up = ${treat2_trial}  if treat==2
 replace p_show_up = ${treat3_trial}  if treat==3
 replace p_show_up = ${treat4_trial}  if treat==4
 replace p_show_up = ${treat5_trial}  if treat==5

 * Now let's expand our token observations to a full data set
 expand grp_size

 * Now we have the individuals make the actual decisions to show up
 gen     show_up = rbinomial(1,p_show_up)
 * We can do an simple anova by using a multiple regression
    reg show_up treat2-treat5 
 test treat2=treat3=treat4=treat5=0
 return scalar F05 = `=r(p)<.05'
 return scalar F10 = `=r(p)<.1'
    test treat2=0
 return scalar t2_05 = `=r(p)<.05'
 return scalar t2_1  = `=r(p)<.1'

    test treat3=0
 return scalar t3_05 = `=r(p)<.05'
 return scalar t3_1  = `=r(p)<.1'

 test treat4=0
 return scalar t4_05 = `=r(p)<.05'
 return scalar t4_1  = `=r(p)<.1'

    test treat5=0
 return scalar t5_05 = `=r(p)<.05'
 return scalar t5_1  = `=r(p)<.1'


* Define a program for defining an exact statistics for false rejection by
* randomly sorting the dependent variable.
cap program drop perm
program define perm, rclass
* Perm takes three arguments
* command: the command to be run after the dependent variables
*   have been scrambled
* dep_var: a list of dependent variables to be scrambled
* tests: a list of tests to be run after running the command 

  syntax [anything], COmmand(string) DEp_var(varlist) tests(string)

  foreach v in `dep_var' {
    tempvar sorter orderer
    gen `sorter' = rnormal()
    gen `orderer' = _n
 sort `sorter'
 replace `v' = `v'[`orderer']
 di "reorder `v'"
  local t=1
  gl return_list
  foreach v in `tests' {
    test `v'
    return scalar t`t'_05 = `=r(p)<.05'
 return scalar t`t'_10  = `=r(p)<.1'
    gl return_list ${return_list} t`t'_05=r(t`t'_05) t`t'_10=r(t`t'_10)
    local t=1+`t'

* Begin Estimation

* Treatment one is the control group
* Define the fraction of subjects in each group
gl ntreat1 = 1/5
gl ntreat2 = 1/5
gl ntreat3 = 1/5
gl ntreat4 = 1/5
gl ntreat5 = 1/5

* Probability that a criminal defendent shows up for trial
gl treat1_trial = .20
gl treat2_trial = .18
gl treat3_trial = .16
gl treat4_trial = .14
gl treat5_trial = .12

sampler 5500

* Let's see how the permutation test works
perm, co(reg show_up treat2-treat5) de(show_up)         /// 
      tests("treat2=treat3=treat4=treat5=0" "treat2=0"  /// 
         "treat3=0" "treat4=0" "treat5=0")
simulate ${return_list}, rep(10000):  /// 
    perm, co(reg show_up treat2-treat5) de(show_up)   /// 
    tests("treat2=treat3=treat4=treat5=0" "treat2=0"  /// 
       "treat3=0" "treat4=0" "treat5=0")

/*  Variable |       Obs        Mean    Std. Dev.       Min        Max
       t1_05 |     10000       .0507    .2193954          0          1
       t1_10 |     10000       .1002    .3002815          0          1
       t2_05 |     10000       .0524    .2228435          0          1
       t2_10 |     10000       .1015    .3020048          0          1
       t3_05 |     10000        .054     .226029          0          1
       t3_10 |     10000       .1049    .3064398          0          1
       t4_05 |     10000       .0519    .2218362          0          1
       t4_10 |     10000       .1008    .3010788          0          1
       t5_05 |     10000       .0514    .2208233          0          1
       t5_10 |     10000       .1018    .3024002          0          1 

These numbers are looking pretty good.  I would suspect any bias in rejection
rates is very small from these estimates.                             */

gl save_values F05=r(F05) F10=r(F10) t2_05=r(t2_05) t2_1=r(t2_1)  /// 
         t3_05=r(t3_05) t3_1=r(t3_1) t4_05=r(t4_05)               /// 
   t4_1=r(t4_1) t5_05=r(t5_05)                              /// 
   b2=_b[treat2] b3=_b[treat3] b4=_b[treat4] b5=_b[treat5]

* Now let's use a Monte Carlo resampler to see how frequently we reject the null.
simulate  ${save_values} , rep(5000): sampler 5500
/*  Variable |       Obs        Mean    Std. Dev.       Min        Max
         F05 |      5000        .999    .0316101          0          1
         F10 |      5000       .9994      .02449          0          1
       t2_05 |      5000       .2672    .4425419          0          1
        t2_1 |      5000        .381    .4856811          0          1
       t3_05 |      5000       .7202    .4489457          0          1
        t3_1 |      5000       .8114    .3912297          0          1
       t4_05 |      5000       .9686    .1744137          0          1
        t4_1 |      5000       .9846    .1231498          0          1
       t5_05 |      5000       .9994      .02449          0          1
          b2 |      5000   -.0200947    .0163725  -.0754545   .0527273
          b3 |      5000   -.0399916    .0161132  -.1045455   .0254545
          b4 |      5000   -.0601151    .0158372  -.1154545  -.0045455
          b5 |      5000   -.0801233    .0155707  -.1354546  -.0254545

Looking at the estiamates of the coeficients b2-b5 the estimates seem
to be unbiased since their divergence from -.02, -.04, -.06, and -.08
is very small.  However, we can test this as well.

reg b2
test _cons==-.02

reg b3
test _cons==-.04

reg b4
test _cons==-.06

reg b5
test _cons==-.08

* All of the p values are well above the 10% rejection rate.

Formatted By

Monday, August 12, 2013

A Stata HTML syntax highlighter in R

So I have been having difficulty getting my Stata code to look the way I want it to look when I post it to my blog.  To alleviate this condition I have written a html encoder in R.  I don't know much about html so it is likely to be a little clunkier in terms of tags than it need be.  It still needs some work but I thought to post what I have so far in case others would like to use the code to format their Stata code or modify it to format any language of their choosing.

I would like to build a Shiny app in which all the user need do is paste the code and submit it.  But that will be for a future post.  Here is the code and the example using my Stata post from July 30th.  You can also find the code on github.  Please feel free to submit possible solutions to the two technical hurdles I discuss in my code (the inability to find and format numbers and the difficulty of finding and formatting punctuation).


# A Stata HTML formatter in R
# Load up your Stata do file.
txt <- readLines(
# First subsititute out all of the < and > which can be misinterpretted as 
# tags in HTML.
txt <- gsub("<","&lt;",txt) 
txt <- gsub(">","&gt;",txt) 
# Choose the formatting tags you would like applied to each field type.
comment.start <- '<span style="color: #669933">'
comment.end   <- '</span>'
# I would like to auto format all numbers but I have nto yet been able to figure
# out how to do this.
num.start <- '<span style="color: #990000"><b>'
num.end   <- '</b></span>'
punc.start <- '<span style="color: #0000FF">'
punc.end   <- '</span>'
command1.start <- '<span style="color: #0000CC"><b>'
command1.end   <- '</b></span>'
command2.start <- '<span style="color: #9900FF">'
command2.end   <- '</span>'
command3.start <- '<span style="color: #990033">'
command3.end   <- '</span>'
# I am not sure where exactly I got this 
stata.commands1 <- unlist(strsplit(readLines(
                            , split=" "))
stata.commands2 <- unlist(strsplit(readLines(
                            , split=" "))
stata.commands3 <- unlist(strsplit(readLines(
                            , split=" "))
punc <- unlist(strsplit(readLines(
"")                          , split=" "))
# I want to figure out how to highlight the puncuation as well but I am having trouble
# with that.
# for (v in punc) txt<-  gsub(v, 
#        paste0(punc.start,v,punc.end), txt)
# Create a vector to tell R to ignore entire lines.
comment <- (1:length(txt))*0
# '*' Star comment recognizer
for (i in grep("[:*:]", txt)) {
  # Break each line to discover is the first symbol which is not a space is a *
  txt2 <- strsplit(txt[i], split=" ")[[1]]
  if (txt2[txt2!=""][1]=="*") {
    txt.rep <- paste(c(comment.start,txt[[i]],comment.end), collapse="")
    txt[[i]] <- txt.rep
    comment[i] <- 1
# '//' Comment recognizer
for (i in (grep("//", txt))) if (comment[i]==0) {
  txt2 <- strsplit(txt[i], split=" ")[[1]] <- grep("//", txt2)[1]-1
  txt.rep <- paste(c(txt2[], comment.start, 
                     txt2[-(],comment.end), collapse=" ")
    txt[[i]] <- txt.rep
# Format stata commands that match each list
# "\\<",v,"\\>" ensures only entire word matches
# are used.
for (v in stata.commands1) txt[comment==0]<-
for (v in stata.commands2) txt[comment==0]<-
for (v in stata.commands3) txt[comment==0]<-
# This is my attempt at highlighting all numbers that are not words.
# It did not work.  
# <a href ="">stackoverflow topic</a>
# txt <- gsub(".*([[:digit:]]+).*", paste0(num.start,"\\1",num.end), txt)
# Add tags to the end and beginning to help control the general format.
txt <- c('<pre><span style="font-family: monospace',txt,
         '\nFormatted By <a href=""></a>',
# Copy formatted HTML to the clipboard.
writeClipboard(paste(txt, collapse="\n"))
Formatted by Pretty R at

Stata code formatting example:


set obs 4000

gen id = _n

gen eta1 = rnormal()
gen eta2 = rnormal()

* Generate 5 irrelevant factors that might affect each of the
* different responses on the pretest
gen f1 = rnormal()
gen f2 = rnormal()
gen f3 = rnormal()
gen f4 = rnormal()
gen f5 = rnormal()

* Now let's apply the treatment
expand 2, gen(t)   // double our data 

gen treat=0
replace treat=1 if ((id<=_N/4)&(t==1))

* Now let's generate our changes in etas
replace eta1 = eta1 + treat*1 + t*.5
replace eta2 = eta2 + treat*.5 + t*1

* Finally we generate out pre and post test responses
gen v1 = f1*.8  + eta1*1  + eta2*.4   // eta1 has more loading on 
gen v2 = f2*1.5 + eta1*1  + eta2*.3   // the first few questions 
gen v3 = f3*2   + eta1*1  + eta2*1  
gen v4 = f4*1   + eta1*.2 + eta2*1   // eta2 has more loading on 
gen v5 = f5*1   +           eta2*1   // the last few questions 

* END Simulation
* Begin Estimation

sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==0
predict L1 L2, latent

sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==1
predict L12 L22, latent

replace  L1 = L12 if t==1
replace  L2 = L22 if t==1

* Now let's see if our latent predicted factors are correlated with our true factors.
corr eta1 eta2 L1 L2

* We can see already that we are having problems.  
* I am no expert on SEM so I don't really know what is going wrong except
* that eta1 is reasonably highly correlated with L1 and L2 and
* eta2 is less highly correlated with L1 and L2 equally each
* individually, which is not what we want.

* Well too late to stop now.  Let's do our diff in diff estimation.
* In this case we can easily accomplish it by generating one more variable.

* Let's do a seemingly unrelated regression form to make a single joint estimator.

sureg (L1 t id treat) (L2 t id treat)

* Now we have estimated the effect of the treatment given a control for the
* time effect and individual differences.  Can we be sure of our results?
* Not quite.  We are treating L1 and L2 like observed varaibles rather than
* random variables we estimated.  We need to adjust out standard errors to
* take this into account.  The easiest way though computationally intensive is
* to use a bootstrap routine.

* This is how it is done.  Same as above but we will use temporary variables.
cap program drop SEMdnd
program define SEMdnd

  tempvar L1 L2 L12 L22
  sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==0
  predict `L1' `L2', latent
  sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==1
  predict `L12' `L22', latent

  replace  `L1' = `L12' if t==1
  replace  `L2' = `L22' if t==1

  sureg (`L1' t id treat) (`L2' t id treat)

  drop `L1' `L2' `L12' `L22'


SEMdnd   // Looking good 

* This should do it though I don't hae the machine time available to wait
* for it to finish.
bs , rep(200) cluster(id): SEMdnd 

Formatted By

Monday, August 5, 2013

Item Equating with same Group - SAT, ACT example

# Item equating is the practice of making the results from two 
# different assessments equivalent.  This can be done by either
# 1. having the same group take both assessments
# 2. having equivalent groups take the different assessments
# 3. having non-equivalent groups which use common items take the different
#    assessments.
# In this post I will cover topic 1.
  # For this code I will use the catR package to generate my assessments # and responses. library("catR")   # Let's attempt the first proceedure: # First let's generate our item parameters for assessment 1.   # Let's create an item bank with 100 items for the assessments nitems <- 100 bank1 <- createItemBank(model="2PL", items=nitems)$itemPar bank2 <- createItemBank(model="2PL", items=nitems)$itemPar   # Now let's generate a 1000 person population sample to take our assessment npeep <- 1000 theta <- rnorm(npeep)   # Calculate the score on both assessments resp1 <- resp2 <- matrix(0, nrow=npeep, ncol=nitems) for (i in 1:npeep) { resp1[i,Pi(theta[i],bank1)$Pi<runif(nitems)] <- 1 # Test 1 resp2[i,Pi(theta[i],bank2)$Pi<runif(nitems)] <- 1 # Test 2 }   # To calculate total score on the tests we can sum the results of each row score1 <- apply(resp1, 1, sum) score2 <- apply(resp2, 1, sum)   # Since we know both forms of the test are parrellel we can check the correlation # between scores on the different forms of the test as a measure of reliability.   cor(score1,score2) # This gives me a correlation estimate of .95 which is very good.   # However, we are not interested in reliability right now. We would like to # equate the two tests using the information thus far garnished.   # First let's estimate our parameters. # I will use the ltm command in the package ltm for this. library("ltm")   est2pl1 <- ltm(resp1~z1) est2pl2 <- ltm(resp2~z1)   # Now we have two tests with the different items. # We want to make sure the items are on the same scale # so that it does not matter which test individuals take. # Their expected score will be the same.   cbind(score1,score2)   # Because we simulated the generation of both tests which are very lengthy # the tests are pretty much already equated by design which is frankly # much easier to do by simulation and perhaps impossible to do with # real assessments.   # Nevertheless, let's act as if our tests were not already parrellel and # equate them.   # Let's first do some linear equating which is done by setting the standardized # scores of the two exames equal to each other. (page 33) # See   # X1 and X2 refers to total scores for each individual on exams 1 and 2 # (X1 - mean(X1))/Sd(X1) = (X2 - mean(X2))/Sd(X2) # X1 = Sd(X1)/Sd(X2)*X2 + (mean(X1)-Sd(X1)/Sd(X2)*mean(X2)) = A*X2 + B # Where A=Sd(x1)/Sd(x2) and B=mean(X1)-A*mean(X2)   (A <- sd(score1)/sd(score2)) # A being close to 1 indicates that the tests are scaled similary   (B <- mean(score1)-A*mean(score2)) # B being close to 0 indicates the tests are of similar difficulty   # Now let's calculate what score2 would be if scaled on assessment 1.   score2scaled <- A*score2 + B   summary(cbind(score1,score2,score2scaled)) # We can see that score2scaled looks slightly closer to score1.   # Let's try the same thing with slightly more interesting.   # Let's imagine that score1 is for ACT from 0 to 32 and score 2 is from SAT 200 to 800   # The standard deviation is 6 and average score around 20 for the ACT ACT <- (score1-mean(score1))/sd(score1)*6 + 20 ACT[ACT>32] <- 32 ACT[ACT<4] <- 4 ACT <- round(ACT) # ACT rounds to whole numbers   # The standard deviation is 100 and average score around 500 for the ACT SAT <- (score2-mean(score2))/sd(score2)*100 + 500 SAT[SAT>800] <- 800 SAT[SAT<200] <- 200 SAT <- round(SAT/10)*10 # SAT rounds to nearest 10     summary(cbind(SAT,ACT))   # Now let's see if we can transform our SAT scores to be on our ACT scale (A <- sd(ACT)/sd(SAT)) (B <- mean(ACT)-A*mean(SAT))   SATscaled <- A*SAT + B summary(cbind(SAT,ACT,SATscaled))   # The results from taking parrellel tests should fall on a linear form plot(ACT,SATscaled, main="SAT results placed on ACT scale")  

# This is the easiest method of equating two tests.  
# However, it is not usually the most practical since it is costly to get the same
# group of individuals to take two different tests.  In addition, there
# may be issues with fatigue which could be alleviated somewhat if for half of the
# group the first assessment was given first and for a different half 
# the second assessment assessment was given first.

Formatted by Pretty R at