## Thursday, November 29, 2012

### Pre and Post Test Data Merge/Append

* Stata Script

* Imagine that you have two sets of data.

* 1 pre-test
* 2 current

* You would like to merge the data together.

* You have two options.

* 1 Create a tall data set (using the append command)
* 2 Create a wide data set (using the merge command)

* Both forms of joined data are common.  Let's see hwo to do this.

* First let's create our pretest data.

clear
set obs 1000

* We have some unique identifier on each person/student
gen ID = _n

gen score = rnormal()

* Now let's generate say 88 variables (v11 through v99)
forv i=11/99 {
gen vi' = runiform()
label var vi' "Random uniform variable i' - pretest"
}

save pretest, replace

* Now we have our pretest data saved

* Let's create our "current" data

* Let's imagine that there is some kind of treatment
gen treatment = rbinomial(1,.3)

* It has a positive effect
replace score = score+.4*treatment

* Now let's generate say 88 variables (v11 through v99)
forv i=11/99 {
local change = rbinomial(1,.1)
* There is a 10% chance that one of your other variables will change

if change'==1 replace vi' = vi'+runiform()
label var vi' "Random uniform variable i' - current"

}

save current, replace

clear

* END DATA SIMULATION
******************************************************************
* Begin file management

* Now we have two data sets with different variable in each.
* Let's start by appending the data together into a long file.

use pretest, clear

* First let's generate a variable to indicate pretest
gen phase = "Pretest"

* Now, let's append the data together:
append using current

sum
* We can see that now we have 2000 observations as expected.

* Now our data is in a tall format

* We might also be interested in making sure the "treatment" variable is duplicated for every observation.
bysort ID: egen treat = mean(treatment)

* Drop the old treatment variable
drop treatment
rename var treat treatment

clear
*************
* Alternatively, we may want to put our data into a long format.

* First we need to rename variables so that they will not be overwritten.
use pretest, clear

foreach v of varlist * {
rename v' v'_0
}

* We need to make sure only that the merging variable keeps the same name.

rename ID_0 ID

save pretest_rename, replace

* Now we load the current test data.

use current, clear

* We will rename the variables in the current as well
foreach v of varlist * {
rename v' v'_1
}

* Making sure to change ID_1 back to ID
rename ID_1 ID

* Now we merge the data together
merge 1:1 ID using pretest_rename

* Now our data is wide.  Note, the reshape command could be use to change data from wide to tall or tall to wide.

* It might be useful to order all of the variables from before and after next to each other.

order *, alphabetic

## Wednesday, November 28, 2012

### CAT using Kullback–Leibler vs Fisher Information

R Code

# This post builds on my recent post that builds a simulation of a computer adaptive test (CAT) selecting items from an infinite item pool using fisher information as the criteria.

# This post is very similar except that it uses instead the Kullback-Leibler (KL) information criteria to select items.

http://www.econometricsbysimulation.com/2012/11/selecting-your-first-item-on-computer.html

#########################################################
# Specify the initial conditions:

true.ability = rnorm(1)

# Load three parameter model ICC
PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))

# Load three parameter item information:
PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c)

#########################################################

# First let's specify and initial guess

est.start = 0

# How much do we adjust our estimate of person ability when all answer's are either right or wrong.
est.jump = .7

# Number of items on the test. This will be the end condition.
num.items = 50

# Set the other parameters in the 3PL.
a.base = 3
c.base = .1

# We will draw all of the random outcomes in advance so that both methods of selecting items get the "same" draws.
rdraw = runif(num.items)

### Let's generate a vector to hold both ability estimates.  Fisher and KL start at the same estimate.
ability.est = est.start

# Let's first generate a empty data frame to hold the set of item taken.
items = data.frame(a=NA,b=NA,c=NA,response=NA,p=NA, ability.est=NA)

i = 1

# For this first mock test we will not select items from a pool but instead assume the pool is infinite and has an item with an a=a.base, c=c.base, and b equal to whatever the current guess is.

# Let's select our first item - a,b,c, response are scalars that will be reused to simplify coding.
a=a.base
c=c.base

### This is the first divergence from the alternative CAT simulation.

### In order to select an item we must specify first a theta distribution to select the best item for.

### I will use 1000 draws from theta~normal()
theta.dist = rnorm(1000)

### Drawing from the second link posted above the KL information is defined as:
KL = function(theta.true,theta.estimate, a, b, c) {
# For the true value
p.true = PL3(theta.true,a,b,c)
q.true = 1-p.true

# For the estimate
p.estimate = PL3(theta.estimate,a,b,c)
q.estimate = 1-p.estimate

# The following line is the value to be returned to the KL function:
p.true*log(p.true/p.estimate) + q.true*log(q.true/q.estimate)
}

### In order to select the best item from all the infinite number of items I will use a maximimization routine to select the b while a and c are held constant.

### I want to define a function to maximize the KL over for a given theta estimate.
KLmax = function(b) mean(KL(theta.true=theta.dist, theta.estimate=ability.est[i], a=a, b=b, c=c))

### Now we want to optimize the KLmax function looking for the choice of b that gives us the highest value.
optim(0,fn=KLmax , method="Brent", lower=-6, upper=6, control=list(fnscale = -1))

b=optim(0,fn=KLmax , method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par # Probability of getting the item correct p=PL3(true.ability, a,b,c) # Note that all of the ps are already draw response = p > rdraw[i] # The Item Characteristic Curve (ICC) gives the probability of getting the item correct. # Thus, a .9 is the max of the runifrom that should produce a response of TRUE or correct or 1 (as far as R is concerned these TRUE and 1 are the same as is FALSE and 0) items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i]) # We have now successfully administered our first item. # Should do our first MLE estimation? # Not quite, unfortunately MLE requires in the bianary case that the student has answered at least one question right and at least one question wrong. i=1+1 # Instead we will just adjust the ability estimate by the fixed factor (est.jump) ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump # Now we administer the second item: # We will continue this until we get some heterogeneity in the responses response.v = items$response
response.ave = sum(response.v)/length(response.v)

while ((response.ave == ceiling(response.ave)) & (num.items >= i) ) {
# This condition will no longer be true when at least one of the items is ansered correctly and one of the items answered incorrectly.
ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump

a=a.base
c=c.base

### Using the KL information criteria
b=optim(0,fn=KLmax , method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par p=PL3(true.ability, a,b,c) response = p > rdraw[i] items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i]) response.v = items$response
response.ave = sum(response.v)/length(response.v)

i=i+1
}

items
true.ability

# Now that we have some heterogeneity of responses we can use the MLE estimator
MLE = function(theta) sum(log((items$response==T)*PL3(theta, items$a, items$b, items$c) +
(items$response==F)*(1-PL3(theta, items$a, items$b, items$c))))

optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))
# Okay, it seems to be working properly  now we will loop through using the above function.

# The only thing we need change is the ability estimate.
while (num.items >= i) {
ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par a=a.base c=c.base ### Using the KL information criteria b=optim(0,fn=KLmax , method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par

p=PL3(true.ability, a,b,c)

response =  p > rdraw[i]

items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])

response.v = items$response response.ave = sum(response.v)/length(response.v) i=i+1 } ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par

# Save the paths for comparison with Fisher information
items.KL = items
ability.est.KL = ability.est

#######################################################################
#####
##### Now let's compare this with the Fisher information item selection
#####
#######################################################################

ability.est = est.start
items = data.frame(a=NA,b=NA,c=NA,response=NA,p=NA, ability.est=NA)
i = 1
a=a.base
c=c.base
### Now b just equals the ability estimate
b=ability.est[i]
p=PL3(true.ability, a,b,c)
response =  p > rdraw[i]
items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])
i=1+1
ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump
response.v = items$response response.ave = sum(response.v)/length(response.v) while ((response.ave == ceiling(response.ave)) & (num.items >= i) ) { ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump a=a.base c=c.base ### Change how b is selected b=ability.est[i] p=PL3(true.ability, a,b,c) response = p > rdraw[i] items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i]) response.v = items$response
response.ave = sum(response.v)/length(response.v)
i=i+1
}
while (num.items >= i) {
ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par a=a.base c=c.base ### Change how b is selected b=ability.est[i] p=PL3(true.ability, a,b,c) response = p > rdraw[i] items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i]) response.v = items$response
response.ave = sum(response.v)/length(response.v)
i=i+1
}

# One final ability estimate calculation
ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par ability.est.Fisher = ability.est items.Fisher = items ####################################################################### minmax = function(x) c(min(x),max(x)) plot(0:num.items, ability.est.Fisher, type="l", main="CAT Estimates Ideally Converge on True Ability", , xlab="Number of Items Administered", ylim=minmax(c(ability.est.Fisher,ability.est.KL)), ylab="Estimated Ability", lwd=3, col="blue") lines(0:num.items, ability.est.KL, lwd=2, col="green") abline(h=true.ability, col="red", lwd=2) ## Graph I # I have included two different graphs because I thought both of these were quite peculiar. # The first graph is strange because in this chance draw the KL looks like it is doing as well or better than the Fisher. # In almost all other samples that I ran it seemed to do considerably worse. ## Graph II # Let's see what items both methods are choosing and how they relate to the true ability estimate. plot(1:num.items, items.Fisher$b, type="l", main="Item's are selected using different criteria",
, xlab="Number of Items Administered", ylim=minmax(c(items.KL$b,items.Fisher$b)),
ylab="Estimated Ability", lwd=3, col="blue")
lines(1:num.items, items.KL$b, lwd=2, col="green") abline(h=true.ability, col="red", lwd=2) # Blue is Fisher # Green is KL # I have included two different graphs because I thought both of these were quite peculiar. # The first graph is strange because in this chance draw the KL looks like it is doing as well or better than the Fisher. # In almost all other samples that I ran it seemed to do considerably worse. # Let's see what items both methods are choosing and how they relate to the true ability estimate. plot(1:num.items, items.Fisher$b, type="l", main="Item's are selected using different criteria",
, xlab="Number of Items Administered", ylim=minmax(c(items.KL$b,items.Fisher$b)),
ylab="Item Difficulty", lwd=3, col="blue")
lines(1:num.items, items.KL$b, lwd=2, col="green") abline(h=true.ability, col="red", lwd=2) # The y-axis should read item difficulty. # We can see the items tend to be closer to zero in difficulty for the KL selection (these items correspond to the most recent graph) # From this simulation, it appears KL is an inferior item selection criteria. # It is possible that this simulation is not taking into consideration all of the features of the KL criteria. Perhaps, if the item pool has differences in the parameters a and c then KL will perform better. I will leave that for a later simulation. ## Sunday, November 25, 2012 ### Commodity Demand Estimation - Fish Demand Stata Do File * My field exam is coming up and I am therefore giving some thought to commodity demand once again. * This simulation will follow the Paper "Demand Structure for Fish" Asche, Bjorndal, and Gordan SNF Working Paper No. 37/05 * In this simulation I will attempt to simulate commodity data that fits several empirical model specifications. * A common initial attempt at estimating demand was to specify: * ln(q_it) = alpha_i + sum(across j) of Beta_j*ln(pjt) + e_i*ln(X_t) * The advantage of this formation was that it allowed the elasticities to be easily recovered. * If consumers of fish have a fixed budget for fish and no strong substitutes then we expect that the price elasticity of fish to be around -1. * Let's see if we can generate aggrogate data that looks like this. clear * Let's say we have 20 years of weekly price data ~1000 observations set obs 1000 gen t=_n * Let's generate some substitute products gen pbeef = (rpoisson(10)+1)/10 gen pchicken = (rpoisson(7)+1)/10 label var pchicken "Price of Chicken" gen ppork = (rpoisson(12)+1)/14 gen pfish = (rpoisson(15)+1)/12 label var pfish "Price of Fish" * Let's generate some fish covariates. * There was a scare that commercial fishing was killing dolphins. Let's say that happened at time 500. gen dolphin = 0 replace dolphin = 1 if t>500 * Let's imagine that the fishing industry is also campainging to encourage the consumption of fish. gen fishad = rnbinomial(10,.1) * Now let's generate the natural logs of the above values foreach v in pbeef pchicken ppork pfish fishad { gen lnv' = ln(v') } * Generate an error term gen e=rnormal()/4 * Now let's generate quantity demand for fish! gen lnqfish = 1 + .3*lnpbeef + .5*lnpchicken + .3*lnppork - 1*lnpfish - .3*dolphin + .1*lnfishad + e * Done with simulation * I won't even bother estimating this. The reason is that by construction our data perfectly fits our OLS model. * Well, just a reference point I will estimate the above. reg lnqfish lnpbeef lnpchicken lnppork lnpfish dolphin lnfishad * We can see all of our estimates work exceedingly well when the model is perfectly specified. * However, it might be instructive to look at the relationships between variables. gen qfish = exp(lnqfish) label var qfish "Quantity of Fish Purchased" twoway (scatter pchicken qfish) (lfit pchicken qfish) /// (scatter pfish qfish) (lfit pfish qfish) , /// title(Relationship Between Own Price and Quantity) * This graph is a little bit busy and might be hard to read. * Notice that each point in blue is also represented in green. * This is because while the horizontal axis is the same the vertical axis changes from price of chicken to price of fish. * Let's imagine that rather than estimating using logs we estimated using quantities and prices. reg qfish pbeef pchicken ppork pfish dolphin fishad * It is interesting to note that while the estimates when the functional form is not speficied correctly are not as good as previously they still seem to be working well. * Let's try converting some of them to point elasticities and see how they compare. foreach v in pbeef pchicken ppork pfish fishad { qui sum v' di "Point Elasticity of v' = =_b[v']/r(mean)'" } * Some of the point elasticities are better estimates of the constant elasticity than just the coefficients but not consistently so. * Okay, so this has gotten our feet a little wet with the idea of estimating demand. * Now, let's shift into demand estimation given a more flexible forms for the demand to take. ***************************************************** * Houthakker and Taylor's habit formation model. * ln(q_it) = alpha_i + c_i*ln(q_i,t-1) + sum(across j) of Beta_j*ln(pjt) + e_i*ln(X_t) * The difference between this model and the last is that now we have a form of temporal dependence. * Consumption this period depends upon consumption last period. gen lnqfish2 = 1 + .75*lnpbeef + .5*lnpchicken + .3*lnppork - 1*lnpfish - .3*dolphin + .1*lnfishad + e if t==1 replace lnqfish2 = .75*lnqfish2[_n-1] + .3*lnpbeef + .5*lnpchicken + .3*lnppork - 1*lnpfish - .3*dolphin + .1*lnfishad + e if t>1 label var lnqfish2 "Auto-correlated Fish Quantity Demanded" * In order to see the time trends forv i=2/12 { gen ti'=t^(i'/3) } reg lnqfish2 t t? t?? predict lnqfish2_hat two (line lnqfish2 t) (line lnqfish2_hat t) , title(Fish Quantity Demanded with Temporal Dependency) * We can see long term movement in quantity demanded as a result of the dophin scare in additition to some short term temporal dependency as a result of habit formation. gen lnqfish2_l = lnqfish2[_n-1] * To verify that we can estimate the model directly: reg lnqfish2 lnqfish2_l lnpbeef lnpchicken lnppork lnpfish dolphin lnfishad * As before, when our model is correctly specified we have little difficulty estimating the parameters. * If we fail to include the once lagged fish quantity: reg lnqfish2 lnpbeef lnpchicken lnppork lnpfish dolphin lnfishad * We have less explanatory power, but generally almost all of the estimates are identical. predict resid, resid twoway (scatter resid lnqfish2) (lfit resid lnqfish2) * There is a clear relationship between the y variable and the residual suggesting autocorrelation. * Even though there is autocorrelation of the errors this does not substantially affect the estimates of the demand function. * However, this is somewhat an artifact of the simulation. * We should think that there might be a causal relationship between amount of fish demanded during the previous period and current period prices if there exists some friction in the fish markets. * Thus perhaps the higher the quantity demanded in the current period the higher the price is expected to be during the next period because of fish reserves. * This demands a more complex simulation in which we calculate each period individually. gen lnpfish2 = ln((rpoisson(15)+1)/12) if t==1 gen lnqfish3 = 1 + .3*lnpbeef + .5*lnpchicken + .3*lnppork - 1*lnpfish2 - .3*dolphin + .1*lnfishad + e if t==1 * For the first period we assume prices are exogenous. * For each additional period we will make prices a function of quantity demanded. forv i=2/1000 { qui replace lnpfish2 = ln((rpoisson(15)+1)/12) + .7*lnqfish3[i'-1] if t==i' * Thus the price elasticity is something less than 1/.7= 1.4. Less because there is the random poisson variation that diminishes the relationship. qui replace lnqfish3 = .75*lnqfish3[_n-1] + .3*lnpbeef + .5*lnpchicken + .3*lnppork - 1*lnpfish2 - .3*dolphin + .1*lnfishad + e if t==i' } reg lnqfish3 lnpbeef lnpchicken lnppork lnpfish2 dolphin lnfishad * Now failing to take into account the previous quantity of fish demanded causes the price elasticity to be downward biased. * The reason is that high quantity demanded last period will cause the habit of high quantity demanded this period. * However, high quantity demanded last period will also cause the price to be higher this period. * If you do not control for high demand last period then the higher price will look like it is explaining both the higher habitual consumption of fish and the lower consumption of fish due to the higher price. * Thus, controlling for habit formation is necessary for unbiased estimates. gen lnqfish3_l = lnqfish3[_n-1] reg lnqfish3 lnqfish3_l lnpbeef lnpchicken lnppork lnpfish2 dolphin lnfishad predict res3, residual scatter res3 t if t<200 autocorrelation="autocorrelation" o="o" of="of" p="p" remaining="remaining" residuals="residuals" title="title"> gen res3_1 = res3[_n-1] reg res3_1 res3 * After controlling for AR1 errors, there is no longer autocorrelation of residuals. ## Saturday, November 24, 2012 ### Moving Averages & Data Cleaning Original Code * Imagine you have data on prices for many products. * For each of the products you record weekly price information. clear set obs 200 gen prod_id = _n * Each product has a unique average price gen prod_price = rpoisson(5)/7 * You have data on weekly prices for 200 weeks. expand 200 bysort prod_id: gen t=_n label var t "Week" sort prod_id t * There is also some seasonal variation gen seasonal = .2*sin(_pi*t/50) * As well as a general time trend gen trend = t*.005 * The first observation is not correlated with anything gen price = prod_price*2.5 + trend + rpoisson(10)/10 if t==1 replace price = prod_price*2 + trend + seasonal + .7*price[_n-1] + .3*rpoisson(10)/10 if t==2 replace price = prod_price + trend + seasonal + .5*price[_n-1] + .2*price[_n-2] + .3*rpoisson(10)/10 if t==3 replace price = prod_price + trend + seasonal + .3*price[_n-1] + .2*price[_n-2] + .2*price[_n-3] + .3*rpoisson(10)/10 if t==4 replace price = prod_price + trend + seasonal + .3*price[_n-1] + .175*price[_n-2] + .125*price[_n-3] + .1*price[_n-4] +.3*rpoisson(10)/10 if t>4 * Create a globabl to store global twograph forv i = 1/6 { global twograph${twograph} (line price t if prod_id == i')
}

twoway $twograph, legend(off) title(True price trends for first six products) * Now let's imagine that the above generated data is the true price information which is fundamentally unobservable. * Instead you have multiple collections of data per week on prices which each vary by some random addative error. expand 3 bysort prod_id t: gen prod_obs = _n gen price_collect = price + rnormal()*.25 * However the price information that you have has some entries that 10% have been mistakenly entered wrong. gen entry_error = rbinomial(1,.1) gen scalar_error = rnormal()+1 gen price_obs = price_collect*(1+entry_error*scalar_error) label var price_obs "Recorded Price" * In addition, 35% of your price data was never collected gen missing = rbinomial(1,.35) drop if missing==1 * Create a globabl to store global twograph forv i = 1/6 { global twograph${twograph} (line price_obs t if prod_id == i' & prod_obs==1)
}

twoway $twograph, legend(off) title(Observed price trends for first six products) keep t price_obs prod_id entry_error * I am keeping entry error in the data set as a means of comparison though it would not be directly observed. ************************************************************************** * DONE with simulation * The question is: * Can you now with this messy data recover price data that is similar to the original? * The first thing that we should exploit is the duplicate recorded data. scatter price_obs t if prod_id == 1, title(It is easy to see individual deviations) * It is easy to see individual deviations but we do not want to go through all 200 products to identify individually price outliers. * We want to come up with a system to identify outliers. * Let's generate a mean by product and time bysort prod_id t: egen price_mean = mean(price_obs) * Let's flag any observation that is 120% greater than the mean or 80% less than the mean. gen flag = (price_mean > price_obs*1.2 | price_mean < price_obs*.8) * Let's see how it is working: two (scatter price_obs t if prod_id == 1) /// (scatter price_obs t if prod_id == 1 & flag==1 , msymbol(lgx)) /// , title(Some of outliers can be identified just looking at the mean) legend(off) corr flag entry_error * Our flag is correlated about 45% with the entry errors. This is good but we can do better. * I propose that rather than using just the mean that we construct a moving average of prices and see how each entry deviates from the average. * The only problem is that the moving average command requires xtset and that requires only one entry per time period. * So, I say we rescale the time variable and add in as if recorded at a different time of the week the observation number. * We need to newly generate prod_obs since we do not know which observation is missing from each product. bysort prod_id t: gen prod_obs = _n gen t2 = t*4 + prod_obs * xtset sets the panel data panel id and time series level. xtset prod_id t2 * The command we will be using is "tssmooth" * It is coded such that by specifying ma it means moving average and window tells Stata how many time periods to count ahead and how many behind in the moving aerage. * This command can take a little while. tssmooth ma ma_price_obs=price_obs, window(23 0 23) * 23 is in effect 5 weeks ahead and 5 weeks behind * The 0 tells stata not to include inself in that average * The moving average two (scatter price_obs t if prod_id == 1) /// (line ma_price_obs t if prod_id == 1) /// (line price_mean t if prod_id == 1) /// , title(The Moving Average is less succeptable to outliers) * The moving average is more stable than just the time average. * Let's try flagging using the moving average cap drop flag2 gen flag2 = (ma_price_obs > price_obs*1.2 | ma_price_obs < price_obs*.8) two (scatter price_obs t if prod_id == 1) /// (scatter price_obs t if prod_id == 1 & flag2==1 , msymbol(lgx)) /// , title(The Moving Average can also be useful) legend(off) corr flag2 entry_error * Drop our flagged data drop if flag2==1 * Collapse to the weekly level collapse price_obs, by(prod_id t) label var price_obs "Mean price observed" global twograph forv i = 1/6 { global twograph${twograph} (scatter price_obs t if prod_id == i')
}

twoway $twograph, legend(off) title(Observed price trends for first six products) * The data is looking a lot better but we still clearly have some unwanted outliers. * We could take advantage of the cross product trends to help identify outliers within product prices. bysort t: egen ave_price = mean(price_obs) reg price_obs ave_price if prod_id == 1 predict resid1, residual reg price_obs ave_price if prod_id == 2 predict resid2, residual reg price_obs ave_price if prod_id == 3 predict resid3, residual twoway (line resid1 t if prod_id == 1) /// (line price_obs t if prod_id == 1) /// (line resid2 t if prod_id == 2) /// (line price_obs t if prod_id == 2) /// (line resid3 t if prod_id == 3) /// (line price_obs t if prod_id == 3) , title(The residuals are clear indicators of outliers) legend(off) * Finally, let us drop observations with residuals that are greater than 1.5 standard deviations from the mean. drop resid? gen flag = 0 qui forv i=1/200 { reg price_obs ave_price if prod_id == i' predict resid_temp, residual sum resid_temp replace flag = ((resid_temp-r(mean)>r(sd)*1.5 | resid_temp-r(mean) drop resid_temp } * Let's see how it is working: two (scatter price_obs t if prod_id == 2) /// (scatter price_obs t if prod_id == 2 & flag==1 , msymbol(lgx)) /// , title(Now just trying remove some final outliers) legend(off) * Plotting product 1 pricing relative to outliers. global twograph forv i = 1/6 { global twograph${twograph} (line price_obs t if prod_id == i')
}

* Finally dropping the outliers
drop if flag

* One final graph
global twograph

forv i = 1/6 {
global twograph ${twograph} (scatter price_obs t if prod_id == i') } twoway$twograph, legend(off) title(Observed price trends for first six products)

* Not as clean as our first graph but definitely much improved.

## Friday, November 23, 2012

### Computer Adaptive Test Assuming an Infinite Item Pool

Original Code

# In order for us to understand what a Computer Adaptive Test (CAT) is, let's first think about how the CAT works.

# A CAT test starts by having some kind of initial assessment of student ability (test taker's ability).

# This is typically at the population mean.

# The test then selects an item that (in the most straightforward case) has the most information at that initial guess.

# If the student answers that question correctly then the program reassesses student ability and finds the next question which has the most information at the new assessment of student ability.

# The computer continues to select items until the termination conditions are met.  These conditions might be anything from a fixed length of time for the test, a fixed number of questions for the test, or more interestingly a sufficient level of precision achieved for student ability.  See flow chart:

# In order to asses student ability I will use code form:

http://www.econometricsbysimulation.com/2012/11/estimating-person-characteristics-from.html

#########################################################
# Specify the initial conditions:

true.ability = rnorm(1)

# Load three parameter model ICC
PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))

# Load three parameter item information:
PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c)

#########################################################

# First let's specify and initial guess

est.start = 0

# How much do we adjust our estimate of person ability when all answer's are either right or wrong.
est.jump = .7

# Number of items on the test. This will be the end condition.
num.items = 50

# Set the other parameters in the 3PL
a.base = 3
c.base = .1

# Let's generate a vector to hold ability estimates.
ability.est <- est.start="est.start" p="p">
# Let's first generate a empty data frame to hold the set of item taken.
items <- a="NA,b=NA,c=NA,response=NA,p=NA," ability.est="NA)</p" data.frame="data.frame">
i = 1

# For this first mock test we will not select items from a pool but instead assume the pool is infinite and has an item with an a=a.base, c=c.base, and b equal to whatever the current guess is.

# Let's select our first item - a,b,c, response are scalars that will be reused to simplify coding.
a=a.base
c=c.base
b=ability.est[i]

# Probability of getting the item correct
p=PL3(true.ability, a,b,c)
response = runif(1) < p
# The Item Characteristic Curve (ICC) gives the probability of getting the item correct.
# Thus, a .9 is the max of the runifrom that should produce a response of TRUE or correct or 1 (as far as R is concerned these TRUE and 1 are the same as is FALSE and 0)

items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])

# We have now successfully administered our first item.

# Should do our first MLE estimation?

# Not quite, unfortunately MLE requires in the bianary case that the student has answered at least one question right and at least one question wrong.

i=1+1

# Instead we will just adjust the ability estimate by the fixed factor (est.jump)
ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump

# Now we administer the second item:
# We will continue this until we get some heterogeneity in the responses
response.v = items$response response.ave = sum(response.v)/length(response.v) while ((response.ave == ceiling(response.ave)) & (num.items >= i) ) { # This condition will no longer be true when at least one of the items is ansered correctly and one of the items answered incorrectly. ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump a=a.base c=c.base b=ability.est[i] p=PL3(true.ability, a,b,c) response = runif(1) < p items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i]) response.v = items$response
response.ave = sum(response.v)/length(response.v)

i=i+1
}

items
true.ability

# Now that we have some heterogeneity of responses we can use the MLE estimator
MLE = function(theta) sum(log((items$response==T)*PL3(theta, items$a, items$b, items$c) +
(items$response==F)*(1-PL3(theta, items$a, items$b, items$c))))

optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))
# Okay, it seems to be working properly  now we will loop through using the above function.

# The only thing we need change is the ability estimate.
while (num.items >= i) {
ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par a=a.base c=c.base b=ability.est[i] p=PL3(true.ability, a,b,c) response = runif(1) < p items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i]) response.v = items$response
response.ave = sum(response.v)/length(response.v)

i=i+1
}

items
(ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1)))
true.ability

# We can see that even in this ideal scenario in which you always have appropriately difficult items with high discriminatory power and low guessing, there is a noticeable amount of error.

plot(0:num.items, ability.est, type="l", main="CAT Estimates Ideally Converge on True Ability",
ylim=c(-3,3), xlab="Number of Items Administered", ylab="Estimated Ability", lwd=2)
abline(h=true.ability, col="red", lwd=2)

## Thursday, November 22, 2012

### Computing Expected Values

Original Code

# It is often times difficult to solve for the expected value of a variable in closed form.

# However, using computers it can be easy to approximate.

# Imagine (for whatever reason) you would like to calculate the expected value of exp(x) where x is distributed as a standard normal distribution.

# Method 1: Randomly draw many draws of the variable and take the mean.
draws = 100000
rvar = exp(rnorm(draws))
mean(rvar)

# Method 2: Draw from the inverse CDF
draws = 100000
CDF = seq(0.0000001,.9999999,length.out=draws)
rvar = exp(qnorm(CDF))
mean(rvar)

# Both methods are likely to produce very similar results.  Method 2 might be preferred because it is not susceptible to the random draw.  However, Method 1 has the strength of not having to specify and upper and lower limit to values entering the inverse CDF.

# Sometimes you might be interested in estimating the expected value of a censored variable.

# Say we are interested in exp(x) where x is still standard normal but missing at 0 and 2 (min = 0 and max = 2).

# This is easy to approximate as well.

draws = 100000
rvar = rnorm(draws)
rvar.cens = rvar[rvar > 0]
mean(rvar.cens)

### Open Source Location For Online Computer Adaptive Test Developers

I just found a very cool online platform that is powerful and flexible that allows the free development and implementation of online computer adaptive testing.  Too bad their tutorials seem a little outdated.  But they use R to handle the adaptive item selection!  I am very excited.

http://www.psychometrics.cam.ac.uk/page/338/concerto-testing-platform.htm

## Wednesday, November 21, 2012

### Estimating Person Characteristics from IRT Data - 3PL Model

Original Code

# One of the basic tasks in item response theory is estimating the ability of a test taker from the responses to a series of items (questions).

# Let's draw the same pool of items that we have used on several previous posts:

# First let's imagine we have a pool of 100 3PL items.
set.seed(101)

npool = 500

pool = as.data.frame(cbind(item=1:npool, a=abs(rnorm(npool)*.25+1.1), b=rnorm(npool), c=abs(rnorm(npool)/7.5+.1)))
summary(pool)

# Drawing on code from a previous post we can calculate several useful functions:

# (http://www.econometricsbysimulation.com/2012/11/matching-item-information.html)

# Each item has a item characteristic curve (ICC) of:
PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))

# Let's imagine that we have a single test taker and that test taker has taken the first 15 items from the pool.

items.count = 15
items.taken = pool[1:items.count,]

# And that the person has a latent theta ability of 1
theta = 1.3

# Let's calculate the cut points for each of the items.
items.cut = PL3(theta, items.taken$a, items.taken$b, items.taken$c) # We can see how the cut point works by graphing plot(0,0, type="n", xlim=c(-3,3),ylim=c(0,1), xlab=~theta, ylab="Probability of Correct Response", yaxs = "i", xaxs = "i" , main="Item Characteristics Curves and Ability Level") for(i in 1:items.count) { lines(seq(-3,3,.1),PL3(seq(-3,3,.1), items.taken$a[i], items.taken$b[i], items.taken$c[i]), lwd=2)
abline(h=items.cut[i], col="blue")
}
abline(v=theta,col="red", lwd=3)

# Now let's draw a uniform draw that we will use to calculate whether each item as passed.

rdraw = runif(items.count)

# Finally, we will calculate item responses
item.responses = 0
item.responses = items.cut > rdraw

###############################################
# Done with Simulation - Time for Estimation

# We want to use the information we know about the items (the item parameters and the responses) in order to estimate a best guess at the true ability of the test taker.

# First we must check if the person got all of the items either correct or incorrect.

sum(item.responses)
# If this is either a 0 or a number equal to the number of items then we cannot esimate an interior maximum without additional assumptions.

# We will attempt to recover our theta value using the r command optim

# First we need to specify the function to optimize over.

MLE = function(theta) sum(log((item.responses==T)*PL3(theta, items.taken$a, items.taken$b, items.taken$c) + (item.responses==F)*(1-PL3(theta, items.taken$a, items.taken$b, items.taken$c))))
# The optimization function takes as its argument the choice variables to be optimized (theta).
# The way the above optimization works is that you specify the probility of each response piecewise.
# If the response is correct, then you count the CDF of theta up to that point as contributing to the probability of observing a correct outcome.  If the response is negative, then you count it as contributing the the probability of a incorrect outcome.  You then choose the theta that produces the greatest total probability.

MLEval = 0
theta.range = seq(-3,3,.1)
for(i in 1:length(theta.range)) MLEval[i] =MLE(theta.range[i])

plot(theta.range, MLEval, type="l", main="Maximim Likelihood function", xlab= ~theta, ylab="Sum of Log Likelihood")
abline(v=theta, col="blue")
# We can visually see that the maximum of the slope will not be at the true value though it will be close.

optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))
abline(v=optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par, col="red") # We can see that we can estimate theta reasonably well with just 15 items from a paper test (red line estimate, blue line true). However, looking at the graph of the ICCs, we can see that for most of the items, the steepest point (where they have the most discriminating power) is at an ability set lower than the test taker's ability. Thus, this test provides the most information about a person who has a lower ability than the person with a theta=1.2. # We can use R's optim function to find the ideal theta that would maximize the information from this test. # Item information is: PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c) # Notice, this is not the best way of defining the test information function since the items are not arguments. test.info = function(theta) sum(PL3.info(theta, items.taken$a, items.taken$b, items.taken$c))

# Construct a vector to hold the test information
info = 0
for(i in 1:length(theta.range)) info[i]=test.info(theta.range[i])

plot(theta.range, info, type="l", main="Information Peaks Slightly Above 0", xlab= ~theta, ylab="Information")
abline(v=theta, col="blue")
# But we want to know about the test taker at theta

optim(0,test.info, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))
# The person this test would be best suited to evaluate would have an ability rating of .19
abline(v=optim(0,test.info, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par, col="red") ## Monday, November 19, 2012 ### t-tests and F-tests and rejection rates Original Code * When I was first learning about t-tests and f-tests I was told that a t-test estimated the probability of falsely rejecting the null for a single estimator. * While the f-test estimated the probability of rejecting the null that the model explained nothing significant. * It was also stated that sometimes the t-tests can fail to reject the null when the f-test will reject the null and this is the result primarily of correlation among explanatory variables. * All of these things I believe are well understood. * However, what I always wanted to know was, if the t-test rejected did that mean that the f-test must also reject? * This seems intuitively to me to be true. * That is, if one part of the model seems to be statistically significant, mustent the whole model be statistically significant? * Now thinking back on the question, I think the answer must be no. * Assuming we are using rejection rates of 10%, the reason I argue is that if the f-test assumptions are met then it should falsely reject the null 10% of the time. * Likewise, if the t-test's assumptions are met it should reject the null 10% of the time. * However, if we are estimating two ts and they are independent then the probability that neither of them reject the null at 10% is 1-(1-.10)^2=19% * Thus if the f-test rejects at 10% then there must be a range for which one or more t-stat can reject but the f-stat will fail to reject. * Let's see if we can see this in action through simulation. cap program drop ft_test program define ft_test, rclass clear set obs 1000 gen x1=rnormal() gen x2=rnormal() gen y=rnormal() reg y x? * Calculate the p-stats for the individual coefficients. * We multiply by 2 because the ttail is initially one sided and we are interested in the two sided alternative. return scalar pt1 = 2 * ttail(e(df_r), abs(_b[x1]/_se[x1])) return scalar pt2 = 2 * ttail(e(df_r), abs(_b[x2]/_se[x2])) * We also want to extract the F stat return scalar pF = Ftail(e(df_m),e(df_r),e(F)) end ft_test ft_test ft_test * Running the simulated regression on the data a few times, I can easily see how the P-stat for the t-tests diverge from the f-stat fairly frequently. simulate pt1=r(pt1) pt2=r(pt2) pF=r(pF), reps(1000): ft_test gen rt1 = (pt1<=.1) gen rt2 = (pt2<=.1) gen rF = (pF<=.1) sum r* * All of the p tests seem to be rejecting at the right level * It might be the case that we always reject the null for the f if the rejection of the null for the t-tests are correlated. pwcorr rt1 rt2, sig * There does not appear to be correlation between the two t-tests rejections. * By now we should already know the answer to the question. * But let's check directely. gen rtF = 0 if rt1 | rt2 replace rtF = 1 if rF == 1 & rtF == 0 sum rtF * Thus the probability of rejecting the f-null given that we have rejected at least one of the t-nulls is only a little above 50%. * It does make sense that the f and t rejections be correlated. * That is, when the individual coefficients seem to be explaining the unknown variance then overall the model seems to be working relatively well. pwcorr rF rt1 rt2, sig * There is one more thing to check. How frequently do we reject the null for the F but not for either of the ts. gen rFt = 0 if rF replace rFt = 1 if rt1 | rt2 & rFt == 0 sum rFt * In this simulation, we only reject the F-stat when at least one of the t-stats rejects. * We could therefore argue that the F-stat is a more conservative test than the t-stats. * However, I do not believe this to be entirely the case. * As mentioned before, I think it is possible for the t-stat to fail to reject when the explanatory variables are correlated when the F-stat does reject. * Let's see if we can simulate this. cap program drop ft_test2 program define ft_test2, rclass clear set obs 1000 gen x1=rnormal() gen x2=rnormal()+x1*3 * This will cause x1 and x2 to be strongly correlated. gen y=rnormal() reg y x? * Calculate the p-stats for the individual coefficients. * We multiply by 2 because the ttail is initially one sided and we are interested in the two sided alternative. return scalar pt1 = 2 * ttail(e(df_r), abs(_b[x1]/_se[x1])) return scalar pt2 = 2 * ttail(e(df_r), abs(_b[x2]/_se[x2])) * We also want to extract the F stat return scalar pF = Ftail(e(df_m),e(df_r),e(F)) end simulate pt1=r(pt1) pt2=r(pt2) pF=r(pF), reps(1000): ft_test2 * Same analysis as previously gen rt1 = (pt1<=.1) gen rt2 = (pt2<=.1) gen rF = (pF<=.1) sum r* pwcorr rt1 rt2, sig * The rate of rejection between ts is highly correlated. gen rtF = 0 if rt1 | rt2 replace rtF = 1 if rF == 1 & rtF == 0 sum rtF * Under this setup, the rejection rate of the null for the F is about 45% of the time when one of the ts is rejected. pwcorr rF rt1 rt2, sig * We can see that now the rejection rates by component is still very strong. * There is one more thing to check. How frequently do we reject the null for the F but not for either of the ts? gen rFt = 0 if rF replace rFt = 1 if rt1 | rt2 & rFt == 0 sum rFt * Now we can see the result as discussed previously. About 25% of the time the f-stat is rejecting the null even though neither t-stat is rejecting the null. * Thus it may be informative to use a F-stat to check for model fit even when the t-stats do not suggest statistical significance of the individual components. * The ultimate result of this simulation is to emphasize for me the need to do tests of model fit. * If, I were to look only at the t-tests in this example then I would falsely assume that the model fits well nearly twice as frequently as if I were to look at the F-stat only. ## Sunday, November 18, 2012 ### Selecting your First Item on a Computer Adaptive Test Original Code # Computer adaptive tests "adapt" to test taker ability by making assessments of the test taker's ability and providing questions that are meant to maximize the amount of information that can be inferred from the test. # We often start by assuming that the ability score of an individual is at the average for the population to begin with. # Once that assumption is made then the adapative test selects an item that maximizes the information at that point. # Let's imagine that the true ability of the student is -1 and we would like to select items that get us from 0 to -1. # Let's see how this works. ####################################################### # Method 1 - Item Information, Fisher Information Criteria # First let's imagine we have a pool of 100 3PL items. set.seed(101) npool = 100 pool = as.data.frame(cbind(item=1:npool, a=abs(rnorm(npool)*.25+1.1), b=rnorm(npool), c=abs(rnorm(npool)/7.5+.1))) summary(pool) # Drawing on code from a previous post we can calculate several useful functions: # (http://www.econometricsbysimulation.com/2012/11/matching-item-information.html) # Each item has a item characteristic curve (ICC) of: PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b))) # and information function defined as: PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c) # We can use the previous post to find the information for any number theta values. # For now we are only interested in the information for our itial "guess" at student ability: theta.estimate=0 # Each item has a item characteristic curve (ICC) of: PL3 = function(theta.estimate,a, b, c) c+(1-c)*exp(a*(theta.estimate-b))/(1+exp(a*(theta.estimate-b))) # and information function defined as: PL3.info = function(theta.estimate, a, b, c) a^2 *(PL3(theta.estimate,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta.estimate,a,b,c))/PL3(theta.estimate,a,b,c) # First I want to calculate the information at each theta for each item. # In this first case theta only equals 0. nthetas = length(theta.estimate) # This matrix will contain the item pool information values pool.info = matrix(NA, nrow=npool, ncol=nthetas) colnames(pool.info) = paste("IT=",theta.estimate,sep="") # These must be calculated for each item but can be for all thetas simultaneously. for(i in 1:npool) pool.info[i,] = PL3.info(theta.estimate, pool[i,2]*1.7, pool[i,3], pool[i,4]) head(pool.info) # Everything appears to be working well. Let's find the max (note this code only works well for one theta). cbind(pool,pool.info)[pool.info==max(pool.info),] # Item 33 with am information of 1.2695 with a=1.35, b=-.14, and c=.007 is the best choice for the first item using the fisher information criteria. ####################################################### # Method 2 - Kullback–Leibler information divergence method # Referencing Eggen of Cito (1999) the KL information divergence for a single item can be expressed as # KL = p(true.theta)*log(p(true.theta)/p(estimated.theta)) + q(true.theta)*log(q(true.theta)/q(estimated.theta)) # where p is the probability of getting that item correct and q is the probability of getting that item wrong. # KL can be thought of as an item selection criteria that is likely to give you the best item to distinguish between your current estimate and the true (if the true was known). # In order to use information divergence we find the expected value of each item. # Computationally this is approximated by inputing possible true.theta and their estimate and finding the average. # Let's first let's define the KL function (drawing on PL3 defined previously) KL = function(theta.true,theta.estimate, a, b, c) { # For the true value p.true = PL3(theta.true,a,b,c) q.true = 1-p.true # For the estimate p.estimate = PL3(theta.estimate,a,b,c) q.estimate = 1-p.estimate # The following line is the value to be returned to the KL function: p.true*log(p.true/p.estimate) + q.true*log(q.true/q.estimate) } # The function is written to only take a single theta.estimate while multiple true theta's should not be a problem. # Now let's specify a simplified discrete range that the true theta can take. theta.true = c(-1,0,1) nthetas = length(theta.true) # This matrix will contain the item pool KL values pool.KL = matrix(NA, nrow=npool, ncol=nthetas) colnames(pool.KL) = paste("Theta=",theta.true,sep="") # These must be calculated for each item but can be for all thetas simultaneously. for(i in 1:npool) pool.KL[i,] = KL(theta.true, theta.estimate, pool[i,2]*1.7, pool[i,3], pool[i,4]) head(pool.KL) # Note that for theta.true = 0 the entire column is equal to zero. # This is because if the true is equal to the estimate than there is no item that maximize the ability to tell the difference between the estimate and the true because there is no difference. # In order to select the best item for a true theta range of -1,0,1 we average across the three values (or sum them). pool.avKL = apply(pool.KL, 1, mean) head(pool.avKL) cbind(pool,pool.info,pool.avKL)[pool.avKL==max(pool.avKL),] # Interestingly, item 33 is once again the item which is selected. # Now let's imagine that rather than picking just three true theta's to search for items over that we instead want to search across the standard normal standarized distribution of abilities theta. This is very easy to do with the code we already have. All we need do is take draw from the normal distribution. theta.true = qnorm(seq(.01,.99,.01)) hist(theta.true) nthetas = length(theta.true) # This matrix will contain the item pool KL values pool.KL = matrix(NA, nrow=npool, ncol=nthetas) colnames(pool.KL) = paste("Theta=",theta.true,sep="") # These must be calculated for each item but can be for all thetas simultaneously. for(i in 1:npool) pool.KL[i,] = KL(theta.true, theta.estimate, pool[i,2]*1.7, pool[i,3], pool[i,4]) pool.avKL = apply(pool.KL, 1, mean) head(pool.avKL) cbind(pool,pool.info,pool.avKL)[pool.avKL==max(pool.avKL),] # Surprisingly, once again item 33 is selected. I am not sure why there is no difference in methods yet. I suspect it is due to the symetric nature of the normal distribution. # We can certainly play with the models at least. I reran the code starting with theta.estimate=2 and the fisher information criteria did select a different item than the KL though the KL selected the same item with both choices of theta. This is probably because the item pool is so small. Given a larger item pool, I suspect that distributional choices become more imporant. # Changing the item pool to 10,000 and theta.estimate=1 I ended up not finding any difference in item choice between only the three values -1,0,1 and the full normal distribution of values. # Let's do one more thing. Let us imagine that we think we know the true theta and we would like to use the KL to select an item that brings us closest to the true. theta.true = -1 nthetas = length(theta.true) pool.KL = matrix(NA, nrow=npool, ncol=nthetas) colnames(pool.KL) = paste("Theta=",theta.true,sep="") # These must be calculated for each item but can be for all thetas simultaneously. for(i in 1:npool) pool.KL[i,] = KL(theta.true, theta.estimate, pool[i,2]*1.7, pool[i,3], pool[i,4]) pool.avKL = apply(pool.KL, 1, mean) head(pool.avKL) cbind(pool,pool.info,pool.avKL)[pool.avKL==max(pool.avKL),] # In that case item 72 is selected with parameters a=1.5, b=-.43, c=.097. # Ths KL difference criteria seems to be working well since it selected an item that is between the true -1 and the estimate 0. ## Saturday, November 17, 2012 ### Posted Code not Working Dear Readers, I have recently tried running some of my own code directly from blogger and found that sometimes the code does not run properly I am very sorry for that and am grateful so many people still find this blog useful. It seems that I should have been more careful about making sure blogger was properly representing my code. I suspect the format of my blog will change soon in order to reduce posting errors. I have looked into what solutions other blogs have. Generally it seems that the authors need to separate out code on other blogs with boxes generated through html tags making the resulting documents some combination of code and commentary. This kind of layout is not what I want. I want my blog to look like my code. Thus I want to be able to take my code directly and paste it into Blogger. However, that does not seem like it is going to work unfortunately. Any suggesting on how to do this best would be greatly appreciated. Francis I think I have figured out a solution. I will post the code directly to blogger. But I will also upload it as a file that can be downloaded and provide the link. If the code does not work directly, the downloaded code should at least work. ## Friday, November 16, 2012 ### Measurement Error Original Code * It is well known that measurement error causes attenuation bias in regression analysis estimators. * This fact appeared in the literature as early at Spearman (1904). * Attenuation bias also known as regression dilution is the phenomenon where coefficient estimates are biased towards zero. * We can understand this phenomona very intuitively by thinking about what measurement error means. * It means we do not have a very good measure of what something is. * Imagine we measure people's weight and height just by watching that person walk by. * We will assume we know the average weight of people and we make sure our average guess is that weight. * However, unless we are trained our guess will probably miss the mark at a large frequency. * Thus, if we want to use our guesses of weight and height as predictors of that person's athletic ability then our estimates will suffer from potentially two problems as a result of our measurement method. * There is the one previously mentioned, attenuation bias caused from our measures not being exact. * How, are we going to identify the effect of 5 more pounds or 3 extra inches on athletic performance if we are incapable of accurately gaging the difference between 5 pounds or 3 inches? * The second potential source of problems is that our errors in measurement might be correlated with our unconsious assessment of the subject's athletic ability. * That is, perhaps subjects that appear more athletic, we will guess as being taller or weighing less. * This second issue is much more problematic than attentuation bias. * It will cause a correlation between our errors and our explanatory variables which causes bias of an unknown form. * In order to understand why attenuation bias exists remember Beta=cov(x,Y)/var(x) and that the OLS coefficient of BetaHat = cov(X,Y)/var(X). * Where the observable X = x + v. * If we assume the error term v is uncorrelated with the outcome variable Y then cov(X,Y)=cov(x,Y) * However, the var(X) = var(x)+var(v) * Thus: BetaHat = cov(X,Y)/var(X) = cov(X,Y)/(var(x)+var(v)) = cov(x,Y)/(var(x)+var(v)) * Therefore: |BetaHat| < |Beta| when var(v)>0 * Let's see this in action! set seed 101 clear set obs 100000 gen true_weight=165+rnormal()*30 gen measurement_error = 20*rnormal() gen weight_observed = true_weight+measurement_error gen u = rnormal()* 5 corr true_weight weight_observed * We can see even with measurement error, our estimate of weight is 82% correlated with the true weight. gen athletic_performance = 10 - .05*true_weight + u * We expect our estimate of weight to be biased by a factor of alpha where alpha is defined as: * alpha*|beta| = |BetaHat| * alpha*|-.05| =|cov(x,Y)/(var(x)+var(v))|= qui corr true_weight athletic_performance, cov di r(cov_12)/(30^2 + 20^2) * = -.03469636 * Thus alpha = 70% reg athletic_performance weight_observed di .05 * .7 * Thus we can see the nature of our bias is very predictable under the assumption that the measurement error is uncorrelated with our outcome variable. ## Thursday, November 15, 2012 ### Random Coefficients with IFGLS and MLE Original Code * In a recent post on Estimating Random Coefficients (http://www.econometricsbysimulation.com/2012/11/estimating-random-coefficients-on-x.html), I proposed the use of Maximum Likelihood Estimation (MLE) as a means of estimating both the heteroskedastic covariance matrix and the variance of the random coefficient matrix. * A blog read Szabolcs Lorincz responded by posting some very nice code showing how to write an Iterative Feasible Generalized Least Squares Estimator that performed very similarly to that of the MLE estimator except that it ran much faster. * On the particular seed choice the IFGLS estimtor outperformed the MLE estimator in both speed and precision of the estimate. * On this post I will evaluate two different estimators. * 1. The original MLE * 2. The IFGLS posted by Szabolcs Lorincz * I will evaluate each of these estimators on four criteria: * 1. Feasiblility of estimates (are the estimates consistent with the model and or are the estimates estimatable - no failure of convergence) * 2. Unbiasedness of estimates * 3. Statistical efficiency of estimates * 4. Computational efficiency of estimates * I will define a single command that generates the data we will use as our test of model effectiveness. cap program drop gen_data program define gen_data * In order to make this more interesting let's allow there to be multiple random coefficients and them to be correlated. clear set obs 1000 gen x1 = rnormal() gen x2 = rnormal()*(2^.5) * Coefficients on x2 should be twice as easy to estimate because the variance in x2 is twice as large. * Now, for the random coefficients. matrix C = (5, 1 , 0 \ 1, 3.5, 0 \ 0, 0, 400) drawnorm v1 v2 u, cov(C) * I am allowing for there to be two random coefficients, b1 and b2 which can be correlated and an error u which is uncorrelated with the random coefficients. * The random coefficients are: gen b1 = 2 + v1 gen b2 = -2 + v2 gen y = b1*x1 + b2*x2 + u * In our analysis we will be interested in estimating the average effect x1 on y which is 2, the average effect of x2 on y which is -2, as well as the variances of the random coefficients 5 and 3.5 as well as the covariance between the coefficients. * In order to do that effectively we will need to think about how to identify the separate parts of the error term. * Think of y = B1x1 + B2x2 + e * Where e = v1x1 + v2x2 + u * The variance of each component is identifiable: * var(e) = x1*var(v1)*x1 + x2*var(v2)*x2 + var(u) + 2*x1*x2*cov(v1,v2) + 2*x1*cov(v1,u) + 2*x2*cov(v2,u) * For simplicity (but not for neccessity, let's assume cov(v1,u)=cov(v2,u)=0 * Thus: var(e) = x1*var(v1)*x1 + x2*var(v2)*x2 + var(u) + 2*x1*x2*cov(v1,v2) * It is useful therefore to define: gen x1_2 = x1^2 gen x2_2 = x2^2 gen x12 = x1*x2 gen varu = 1 end * End the data generating program * Now let's set up the different estimation options: ******************** * 1. The original MLE (from the previous post) cap program drop myNormalReg program define myNormalReg args lnlk xb sigma2 qui replace lnlk' = -ln(sqrt(sigma2') * sqrt(2*_pi)) - ($ML_y-xb')^2/(2*sigma2')
end

* Let's see it in action:
set seed 1211
gen_data  // First let's generate the data

* Now let's attempt to estimate:

* First as a matter of reference our OLS model:
reg y x1 x2

* We can see already our coefficient estimates are not too bad.

* However, we are interested in estimating more than the coefficients.

ml model lf myNormalReg (y = x1 x2) (sigma2: varu x1_2 x2_2 x12, noconstant)
ml maximize

* We can see that our coefficient estimates on the main effect have not improved substantially.

* In addition, in this run, the estimate on x2_2 is negative which is a violation of our model (variance/covariance has to be PD).

********************
* 2. The IFGLS based on code posted by Szabolcs Lorincz (in a comment on the previous post)

* I will write it as a program for easy repeatability:

cap program drop IFGLS1
program define IFGLS1, rclass

capture drop weight
generate weight=1

* Difference needs to be some real value so as not to create problems on the while step.
local difference=1
local i=0  // Counts the number of iterations

* We need to have itial values for some of our locals
local x1 = 0
local x2 = 0
local _cons = 0
local varv1 = 0
local varv2 = 0
local varu = 0
local cov_v1v2 = 0

* Define some parameters to start the program with.
local tolerance=0.0000001  // One of the conditions of termination is that the difference between parameter iterations in value is less than this number
local maxit=100  // An alternative is that there is this many iterations
local quiet="qui"

* This is the condition under which the program continues to iterate, seeking a better solution.
while (difference'>tolerance' & i'<maxit') {

* Add 1 to the iteration count
local i=i'+1

* Run the regression using the inverse weights (itially they are set to equal to 1)
quiet' reg y x1 x2 [aweight=weight]

* This code calculates the difference between the estimates from the previous regression and the one from the most recent.
local dx1=abs(x1'-_b["x1"])
local dx2=abs(x2'-_b["x2"])
local d_cons=abs(_cons'-_b["_cons"])

* Now save the values from the most recent regression
local x1=_b["x1"]
local x2=_b["x2"]
local _cons=_b["_cons"]

* In order to estimate in a second stage the random coefficients we will use the residuals from the regression as an estimate of e.
capture drop e
predict e, res

capture drop e2
generate e2=e^2

* Remember: var(e) =  x1*var(v1)*x1 + x2*var(v2)*x2 + var(u) + 2*x1*x2*cov(v1,v2)
* And since E(e) = 0, var(e)=E(e^2)

quiet' reg e2 varu x1_2 x2_2 x12, noconstant

* As before, let's calculate the difference in parameter estimates
local dvarv1=abs(varv1'-_b["x1_2"])
local dvarv2=abs(varv2'-_b["x2_2"])
local dvaru=abs(varu'-_b["varu"])
local dcov_v1v2=abs(cov_v1v2'-_b["x12"])

* We will save the variance estimates for later use
local varv1=_b["x1_2"]
local varv2=_b["x2_2"]
local varu=_b["varu"]
local cov_v1v2=_b["x12"]

* The difference is calculating the maximum difference in parameter estimates between iterations.
local difference=max(dx1', dx2', d_cons', dvarv1', dvarv2', dvaru', dcov_v1v2')

* We use the predicted variance to calculate our next stage variance matrix
capture drop fit
predict fit, xb

* In the original code the weights are calculated in the following manner.
* This is somewhat confusing to me because I thought the aweights were based on the inverse of the variance not that standard errors.
* Therefore, I will try with both methods as see which one works better.
* Notice that the way this is done is to ignore this round of estimates if the fit is less than zero.
if "1'"=="" quiet' replace weight=1/(fit^.5) if fit>0 & fit!=.
if "1'"!="" quiet' replace weight=1/fit if fit>0 & fit!=.

* This counts how frequently the fit is less than zero.
quietly count if fit<=0 | fit==.

if (r(N)==0) display "FGLS iteration i', max. abs. difference in parameters: " %7.5f = difference'
if (r(N)>0) display "FGLS iteration i', max. abs. difference in parameters: " %7.5f = difference' ", # of obs. with inconsistent weight: " r(N) " (" 100*r(N)/_N "%)"
} // End while loop

* Now that we have our final looping weights, we can run our coefficient estiamtes oen final time.
reg y x1 x2 [aweight=weight]
return scalar bx1 = _b[x1]
return scalar bx2 = _b[x2]

local cov_v1v2=cov_v1v2'/2

matrix Chat_IFGLS1 = (varv1', cov_v1v2', 0 \ cov_v1v2', varv2', 0 \ 0, 0, varu')
matrix list Chat_IFGLS1
return scalar varv1 = varv1'
return scalar varv2 = varv2'
return scalar varu = varu'
return scalar cov_v1v2 = cov_v1v2'

end

IFGLS1
* I think I made a mistake when altering the code because it is not converging as nicely as it was before.

********************
* I will design a simulation that repeatedly generates the data and runs both estimators

cap program drop sim1
program define sim1, rclass

gen_data

timer clear 1
timer on 1

ml model lf myNormalReg (y = x1 x2) (sigma2: varu x1_2 x2_2 x12, noconstant)
ml maximize, iter(100)
* I set the maximum number of iterations to 100 (parrellel to that of the IFGLS)

timer off 1

return scalar MLE_Bx1 = [eq1]_b[x1]
return scalar MLE_Bx2 = [eq1]_b[x2]

return scalar MLE_varv1 = [sigma2]_b[x1_2]
return scalar MLE_varv2 = [sigma2]_b[x2_2]
return scalar MLE_cov_v1v2 = [sigma2]_b[x12]/2

timer list 1
return scalar MLE_time = r(t1)

timer clear 2
timer on 2
IFGLS1
timer off 2

return scalar GLS_Bx1 = _b[x1]
return scalar GLS_Bx2 = _b[x2]

return scalar GLS_varv1 = r(varv1)
return scalar GLS_varv2 = r(varv2)
return scalar GLS_cov_v1v2 = r(cov_v1v2)

timer list 2
return scalar GLS_time = r(t2)

* Sigma2 means that I will not take the square root of sigma when calculating weighting matrix
timer clear 3
timer on 3
IFGLS1 sigma2
timer off 3

return scalar GLS2_Bx1 = _b[x1]
return scalar GLS2_Bx2 = _b[x2]

return scalar GLS2_varv1 = r(varv1)
return scalar GLS2_varv2 = r(varv2)
return scalar GLS2_cov_v1v2 = r(cov_v1v2)

timer list 3
return scalar GLS2_time = r(t3)

end

sim1

* Now let's run the simulation:
simulate MLE_Bx1=r(MLE_Bx1) MLE_Bx2=r(MLE_Bx2) MLE_varv1=r(MLE_varv1) MLE_varv2=r(MLE_varv2) MLE_cov_v1v2=r(MLE_cov_v1v2) MLE_time=r(MLE_time) ///
GLS_Bx1=r(GLS_Bx1) GLS_Bx2=r(GLS_Bx2) GLS_varv1=r(GLS_varv1) GLS_varv2=r(GLS_varv2) GLS_cov_v1v2=r(GLS_cov_v1v2) GLS_time=r(GLS_time) ///
GLS2_Bx1=r(GLS2_Bx1) GLS2_Bx2=r(GLS2_Bx2) GLS2_varv1=r(GLS2_varv1) GLS2_varv2=r(GLS2_varv2) GLS2_cov_v1v2=r(GLS2_cov_v1v2) GLS2_time=r(GLS2_time) , reps(5): sim1

* Remember our targets:
* b1 = 2
* b2 = -2
* matrix C = (5, 1 , 0 \ 1, 3.5, 0 \ 0, 0, 400)
* The estimators seem practically identical at this level.
sum

* These first 5 simulations are just a sample run.  We need to simulate at least 100 times to hope to have power enough to evaluate the difference between estimators.

simulate MLE_Bx1=r(MLE_Bx1) MLE_Bx2=r(MLE_Bx2) MLE_varv1=r(MLE_varv1) MLE_varv2=r(MLE_varv2) MLE_cov_v1v2=r(MLE_cov_v1v2) MLE_time=r(MLE_time) ///
GLS_Bx1=r(GLS_Bx1) GLS_Bx2=r(GLS_Bx2) GLS_varv1=r(GLS_varv1) GLS_varv2=r(GLS_varv2) GLS_cov_v1v2=r(GLS_cov_v1v2) GLS_time=r(GLS_time) ///
GLS2_Bx1=r(GLS2_Bx1) GLS2_Bx2=r(GLS2_Bx2) GLS2_varv1=r(GLS2_varv1) GLS2_varv2=r(GLS2_varv2) GLS2_cov_v1v2=r(GLS2_cov_v1v2) GLS2_time=r(GLS2_time) , reps(100): sim1
sum

* 1. Feasiblility of estimates (are the estimates consistent with the model and or are the estimates estimatable - no failure of convergence)

* First let's test the feasibility of the estimates

gen MLEfeasible = (MLE_varv1>0) * (MLE_varv2>0) * (MLE_varv1*MLE_varv1-MLE_cov_v1v2^2>0)
gen GLSfeasible = (GLS_varv1>0) * (GLS_varv2>0) * (GLS_varv1*GLS_varv1-GLS_cov_v1v2^2>0)
gen GLS2feasible = (GLS2_varv1>0) * (GLS2_varv2>0) * (GLS2_varv1*GLS2_varv1-GLS2_cov_v1v2^2>0)

sum *feasible
ttest MLEfeasible=GLSfeasible
* As far as feasibility goes there is no difference in the feasibility of MLE or GLS.
* Both methods produce PD estimates at a rate around 30%

* 2. Unbiasedness of estimates
ttest MLE_Bx1 = GLS_Bx1
ttest GLS_Bx1 = GLS2_Bx1
ttest MLE_Bx2 = GLS_Bx2
ttest GLS_Bx2 = GLS2_Bx2

* There is no statistical significance in the difference between coefficient estimates except MLE x1 with GLS at the 10% level but given the number of t-tests this result is not convincing.

* 2. Unbiasedness of estimates
ttest MLE_varv1 = GLS_varv1
ttest GLS_varv1 = GLS2_varv1
ttest MLE_varv2 = GLS_varv2
ttest GLS_varv2 = GLS2_varv2
* There does not appear to be a difference in the variance estimates between the GLS and MLE.

* However, between GLS and GLS2 the variances of GLS2 are statistically smaller than GLS1 which makes GLS1 better for var1 and worse for var2

* Perhaps if the simulation was run 1000 or 10000 times we could get a more precise estimate of the results.

* 3. Statistical efficiency of estimates
sum *x?

* Just eyeballing it, there does not seem to be a substantial difference in the variances of the estimates either.

* 4. Computational efficiency of estimates
sum *time
ttest MLE_time = GLS_time
* The MLE estimator runs statistically significantly faster than the GLS (in this set up at least)
ttest GLS_time = GLS2_time
* GLS runs faster than GLS2 which indicates to me that the original weight adjustment was better than GLS2.

*******************
* Overall

* Both estimators seem to work pretty well at generating unbiased estimates.  The MLE estimator seems to be faster but that could be a byproduct of the sample size and simulation characteristics.

* However, the primary criteria was the ability to generate estimates that were feasible soluations to the problem (that is producing a variance matrix that was positive definite)

* All estimators failed around 70% of the time in this simulation where the correlation between random coefficients is only 24% (rho=1/(5^.5*3.5^.5))

* Thus, I am unhappy with both methods and additional methods will need be investigated.

## Wednesday, November 14, 2012

### R-squared

Original Code

* R-Squared

* R-squared and pseudo r-squared is a useful statistics produced by most regression type estimation routines.

* R-squared (R2) is a measure of how much of the variance in y is explained by the model.

* Thus a model with only an intercept has an R2 of 0.

set seed 101
clear
set obs 10000

gen y1=rnormal()

reg y1

* While in the opposite extreme a model which does not have any unexplained variance has an r2 of 1.

reg y1 y1, noconstant

* Technically this regression should not work but Stata does the math and produces the results.

* Let's see how well R2 approximates explainable variation.

gen x=rnormal()
gen u=rnormal()

gen y2 = (1)*x + u
* The variance of the model is equal to 1 (from 1^2 * var(x))
* The variance of the unexplained error is equal to 1 (var(u))
* Thus our true explained variance should be equal to var(x)/(1^2 * var(x)+var(u)) = 1/2

reg y2 x
* Thus we can see our R2 estimate of the explained variance is very close to the true which is .5

* I made enphasis on noting the coefficient on the x.

* That coefficient significantly scales explainable variation.

* Thus:
gen y3 = 2*x + u

* Should have a much larger R2 because model variance = 2^2*varx = 4
* Var(u) = 1
* R2 = 4/(4+1)=.8

reg y3 x

* If we were to add multiple xs the calculation is similar though if there is correlation between the xs then that will factor into the model.

gen x1 = rnormal()
gen x2 = rnormal()

gen y4 = x1 + x2 + u

* R2 = var(x1) + var(x2) / (var(x1) + var(x2) + var(u) = 2/3
reg y4 x1 x2

* If there is correlation between the xs then that can substantially throw off the calculations.

* In the extreme cases corr(x1, x2)=1 then we are back to the same scenario as y3

* y = x1 + x2 + u (if x1~N(0,1) and x2~N(0,1)) then x1=x2

* y = 2*x1 + u

* Thus R2 = .8

* In the other extreme corr(x1,x2)=-1

* Then, given that they are both N~(0,1), x2=-x1

* y = x1 + x2 + u = x1 - x1 + u = u

* Which is the same as y1

* R2 = 0

* R-squared can also be thought of as the square of the correlation between the predicted values and the observed.

reg y4 x1 x2
predict y4hat

corr y4hat y4
* Thus we can see that there is an 81% correlation between yhat and y observed.

* A high correlation would indicate that our model have done well at predicting observable characteristics.

di r(rho)^2

* A brief note on adjusted R2.

* R2 is known to always be larger the more variables are in your model.

gen z1 = rnormal()
gen z2 = rnormal()
gen z3 = rnormal()

reg y4 x? z?

* Thus: the R2 moved from   R-squared     =  0.6610 to
*                           R-squared     =  0.6611

* This factor being known researchers have developed the Adj-R2 which slightly penalizes the R2 for including more variables.

* Thus Adj R-squared =  0.6609

* This might be appropriate given known facts, however it is trivial and almost always worth ignoring.

* I generally don't pay attention to the AR2 and I don't know anybody else who does either.

* A .0001 difference in R2 is so unimportant as to be completely ignorable without significant loss of content.