I would like to build a Shiny app in which all the user need do is paste the code and submit it. But that will be for a future post. Here is the code and the example using my Stata post from July 30th. You can also find the code on github. Please feel free to submit possible solutions to the two technical hurdles I discuss in my code (the inability to find and format numbers and the difficulty of finding and formatting punctuation).

R-Code

# A Stata HTML formatter in R # Load up your Stata do file. txt <- readLines( "https://raw.github.com/EconometricsBySimulation/2013-07-30-Diff-n-Diff/master/Code.do") # First subsititute out all of the < and > which can be misinterpretted as # tags in HTML. txt <- gsub("<","<",txt) txt <- gsub(">",">",txt) # Choose the formatting tags you would like applied to each field type. comment.start <- '<span style="color: #669933">' comment.end <- '</span>' # I would like to auto format all numbers but I have nto yet been able to figure # out how to do this. num.start <- '<span style="color: #990000"><b>' num.end <- '</b></span>' punc.start <- '<span style="color: #0000FF">' punc.end <- '</span>' command1.start <- '<span style="color: #0000CC"><b>' command1.end <- '</b></span>' command2.start <- '<span style="color: #9900FF">' command2.end <- '</span>' command3.start <- '<span style="color: #990033">' command3.end <- '</span>' # I am not sure where exactly I got this stata.commands1 <- unlist(strsplit(readLines( "https://raw.github.com/EconometricsBySimulation/RFormatter/master/Stata/C1.txt") , split=" ")) stata.commands2 <- unlist(strsplit(readLines( "https://raw.github.com/EconometricsBySimulation/RFormatter/master/Stata/C2.txt") , split=" ")) stata.commands3 <- unlist(strsplit(readLines( "https://raw.github.com/EconometricsBySimulation/RFormatter/master/Stata/C3.txt") , split=" ")) punc <- unlist(strsplit(readLines( "https://raw.github.com/EconometricsBySimulation/RFormatter/master/Stata/Punc.txt") , split=" ")) # I want to figure out how to highlight the puncuation as well but I am having trouble # with that. # for (v in punc) txt<- gsub(v, # paste0(punc.start,v,punc.end), txt) # Create a vector to tell R to ignore entire lines. comment <- (1:length(txt))*0 # '*' Star comment recognizer for (i in grep("[:*:]", txt)) { # Break each line to discover is the first symbol which is not a space is a * txt2 <- strsplit(txt[i], split=" ")[[1]] if (txt2[txt2!=""][1]=="*") { txt.rep <- paste(c(comment.start,txt[[i]],comment.end), collapse="") txt[[i]] <- txt.rep comment[i] <- 1 } } # '//' Comment recognizer for (i in (grep("//", txt))) if (comment[i]==0) { txt2 <- strsplit(txt[i], split=" ")[[1]] comment.place <- grep("//", txt2)[1]-1 txt.rep <- paste(c(txt2[1:comment.place], comment.start, txt2[-(1:comment.place)],comment.end), collapse=" ") txt[[i]] <- txt.rep } # Format stata commands that match each list # "\\<",v,"\\>" ensures only entire word matches # are used. for (v in stata.commands1) txt[comment==0]<- gsub(paste0("\\<",v,"\\>"), paste0(command1.start,v,command1.end), txt[comment==0]) for (v in stata.commands2) txt[comment==0]<- gsub(paste0("\\<",v,"\\>"), paste0(command2.start,v,command2.end), txt[comment==0]) for (v in stata.commands3) txt[comment==0]<- gsub(paste0("\\<",v,"\\>"), paste0(command3.start,v,command3.end), txt[comment==0]) # This is my attempt at highlighting all numbers that are not words. # It did not work. # <a href ="http://stackoverflow.com/questions/18160131/replacing-numbers-r-regular-expression">stackoverflow topic</a> # txt <- gsub(".*([[:digit:]]+).*", paste0(num.start,"\\1",num.end), txt) # Add tags to the end and beginning to help control the general format. txt <- c('<pre><span style="font-family: monospace',txt, '\nFormatted By <a href="http://www.econometricsbysimulation.com">EconometricsbySimulation.com</a>', '</span></pre>') # Copy formatted HTML to the clipboard. writeClipboard(paste(txt, collapse="\n"))

Stata code formatting example:

(https://raw.github.com/EconometricsBySimulation/2013-07-30-Diff-n-Diff/master/Code.do)

**clear**
set **obs** 4000
**gen** id = _n
**gen** eta1 = rnormal()
**gen** eta2 = rnormal()
* Generate 5 irrelevant factors that might affect each of the
* different responses on the pretest
**gen** f1 = rnormal()
**gen** f2 = rnormal()
**gen** f3 = rnormal()
**gen** f4 = rnormal()
**gen** f5 = rnormal()
* Now let's apply the treatment
**expand** 2, **gen**(t) // double our data
**gen** treat=0
replace treat=1 if ((id<=_N/4)&(t==1))
* Now let's generate our changes in etas
replace eta1 = eta1 + treat*1 + t*.5
replace eta2 = eta2 + treat*.5 + t*1
* Finally we generate out pre and post test responses
**gen** v1 = f1*.8 + eta1*1 + eta2*.4 // eta1 has more loading on
**gen** v2 = f2*1.5 + eta1*1 + eta2*.3 // the first few questions
**gen** v3 = f3*2 + eta1*1 + eta2*1
**gen** v4 = f4*1 + eta1*.2 + eta2*1 // eta2 has more loading on
**gen** v5 = f5*1 + eta2*1 // the last few questions
* END Simulation
* Begin Estimation
**sem** (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==0
predict L1 L2, latent
**sem** (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==1
predict L12 L22, latent
replace L1 = L12 if t==1
replace L2 = L22 if t==1
* Now let's see if our latent predicted factors are correlated with our true factors.
corr eta1 eta2 L1 L2
* We can see already that we are having problems.
* I am no expert on SEM so I don't really know what is going wrong except
* that eta1 is reasonably highly correlated with L1 and L2 and
* eta2 is less highly correlated with L1 and L2 equally each
* individually, which is not what we want.
* Well too late to stop now. Let's do our diff in diff estimation.
* In this case we can easily accomplish it by generating one more variable.
* Let's do a seemingly unrelated regression form to make a single joint estimator.
sureg (L1 t id treat) (L2 t id treat)
* Now we have estimated the effect of the treatment given a control for the
* time effect and individual differences. Can we be sure of our results?
* Not quite. We are treating L1 and L2 like observed varaibles rather than
* random variables we estimated. We need to adjust out standard errors to
* take this into account. The easiest way though computationally intensive is
* to use a bootstrap routine.
* This is how it is done. Same as above but we will use temporary variables.
cap program drop SEMdnd
program define SEMdnd
tempvar L1 L2 L12 L22
**sem** (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==0
predict `L1' `L2', latent
**sem** (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==1
predict `L12' `L22', latent
replace `L1' = `L12' if t==1
replace `L2' = `L22' if t==1
sureg (`L1' t id treat) (`L2' t id treat)
drop `L1' `L2' `L12' `L22'
**end**
SEMdnd // Looking good
* This should do it though I don't hae the machine time available to wait
* for it to finish.
bs , **rep**(200) cluster(id): SEMdnd
Formatted By EconometricsbySimulation.com

Did you try Highlight software--- It can convert from one format to multiple format types and gives options few color customizations as well: http://www.andre-simon.de/zip/download.html

ReplyDeleteThanks for the suggestion! I had not seen this sweet little program.

Delete