Monday, August 12, 2013

A Stata HTML syntax highlighter in R

So I have been having difficulty getting my Stata code to look the way I want it to look when I post it to my blog.  To alleviate this condition I have written a html encoder in R.  I don't know much about html so it is likely to be a little clunkier in terms of tags than it need be.  It still needs some work but I thought to post what I have so far in case others would like to use the code to format their Stata code or modify it to format any language of their choosing.

I would like to build a Shiny app in which all the user need do is paste the code and submit it.  But that will be for a future post.  Here is the code and the example using my Stata post from July 30th.  You can also find the code on github.  Please feel free to submit possible solutions to the two technical hurdles I discuss in my code (the inability to find and format numbers and the difficulty of finding and formatting punctuation).


# A Stata HTML formatter in R
# Load up your Stata do file.
txt <- readLines(
# First subsititute out all of the < and > which can be misinterpretted as 
# tags in HTML.
txt <- gsub("<","&lt;",txt) 
txt <- gsub(">","&gt;",txt) 
# Choose the formatting tags you would like applied to each field type.
comment.start <- '<span style="color: #669933">'
comment.end   <- '</span>'
# I would like to auto format all numbers but I have nto yet been able to figure
# out how to do this.
num.start <- '<span style="color: #990000"><b>'
num.end   <- '</b></span>'
punc.start <- '<span style="color: #0000FF">'
punc.end   <- '</span>'
command1.start <- '<span style="color: #0000CC"><b>'
command1.end   <- '</b></span>'
command2.start <- '<span style="color: #9900FF">'
command2.end   <- '</span>'
command3.start <- '<span style="color: #990033">'
command3.end   <- '</span>'
# I am not sure where exactly I got this 
stata.commands1 <- unlist(strsplit(readLines(
                            , split=" "))
stata.commands2 <- unlist(strsplit(readLines(
                            , split=" "))
stata.commands3 <- unlist(strsplit(readLines(
                            , split=" "))
punc <- unlist(strsplit(readLines(
"")                          , split=" "))
# I want to figure out how to highlight the puncuation as well but I am having trouble
# with that.
# for (v in punc) txt<-  gsub(v, 
#        paste0(punc.start,v,punc.end), txt)
# Create a vector to tell R to ignore entire lines.
comment <- (1:length(txt))*0
# '*' Star comment recognizer
for (i in grep("[:*:]", txt)) {
  # Break each line to discover is the first symbol which is not a space is a *
  txt2 <- strsplit(txt[i], split=" ")[[1]]
  if (txt2[txt2!=""][1]=="*") {
    txt.rep <- paste(c(comment.start,txt[[i]],comment.end), collapse="")
    txt[[i]] <- txt.rep
    comment[i] <- 1
# '//' Comment recognizer
for (i in (grep("//", txt))) if (comment[i]==0) {
  txt2 <- strsplit(txt[i], split=" ")[[1]] <- grep("//", txt2)[1]-1
  txt.rep <- paste(c(txt2[], comment.start, 
                     txt2[-(],comment.end), collapse=" ")
    txt[[i]] <- txt.rep
# Format stata commands that match each list
# "\\<",v,"\\>" ensures only entire word matches
# are used.
for (v in stata.commands1) txt[comment==0]<-
for (v in stata.commands2) txt[comment==0]<-
for (v in stata.commands3) txt[comment==0]<-
# This is my attempt at highlighting all numbers that are not words.
# It did not work.  
# <a href ="">stackoverflow topic</a>
# txt <- gsub(".*([[:digit:]]+).*", paste0(num.start,"\\1",num.end), txt)
# Add tags to the end and beginning to help control the general format.
txt <- c('<pre><span style="font-family: monospace',txt,
         '\nFormatted By <a href=""></a>',
# Copy formatted HTML to the clipboard.
writeClipboard(paste(txt, collapse="\n"))
Formatted by Pretty R at

Stata code formatting example:


set obs 4000

gen id = _n

gen eta1 = rnormal()
gen eta2 = rnormal()

* Generate 5 irrelevant factors that might affect each of the
* different responses on the pretest
gen f1 = rnormal()
gen f2 = rnormal()
gen f3 = rnormal()
gen f4 = rnormal()
gen f5 = rnormal()

* Now let's apply the treatment
expand 2, gen(t)   // double our data 

gen treat=0
replace treat=1 if ((id<=_N/4)&(t==1))

* Now let's generate our changes in etas
replace eta1 = eta1 + treat*1 + t*.5
replace eta2 = eta2 + treat*.5 + t*1

* Finally we generate out pre and post test responses
gen v1 = f1*.8  + eta1*1  + eta2*.4   // eta1 has more loading on 
gen v2 = f2*1.5 + eta1*1  + eta2*.3   // the first few questions 
gen v3 = f3*2   + eta1*1  + eta2*1  
gen v4 = f4*1   + eta1*.2 + eta2*1   // eta2 has more loading on 
gen v5 = f5*1   +           eta2*1   // the last few questions 

* END Simulation
* Begin Estimation

sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==0
predict L1 L2, latent

sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==1
predict L12 L22, latent

replace  L1 = L12 if t==1
replace  L2 = L22 if t==1

* Now let's see if our latent predicted factors are correlated with our true factors.
corr eta1 eta2 L1 L2

* We can see already that we are having problems.  
* I am no expert on SEM so I don't really know what is going wrong except
* that eta1 is reasonably highly correlated with L1 and L2 and
* eta2 is less highly correlated with L1 and L2 equally each
* individually, which is not what we want.

* Well too late to stop now.  Let's do our diff in diff estimation.
* In this case we can easily accomplish it by generating one more variable.

* Let's do a seemingly unrelated regression form to make a single joint estimator.

sureg (L1 t id treat) (L2 t id treat)

* Now we have estimated the effect of the treatment given a control for the
* time effect and individual differences.  Can we be sure of our results?
* Not quite.  We are treating L1 and L2 like observed varaibles rather than
* random variables we estimated.  We need to adjust out standard errors to
* take this into account.  The easiest way though computationally intensive is
* to use a bootstrap routine.

* This is how it is done.  Same as above but we will use temporary variables.
cap program drop SEMdnd
program define SEMdnd

  tempvar L1 L2 L12 L22
  sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==0
  predict `L1' `L2', latent
  sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==1
  predict `L12' `L22', latent

  replace  `L1' = `L12' if t==1
  replace  `L2' = `L22' if t==1

  sureg (`L1' t id treat) (`L2' t id treat)

  drop `L1' `L2' `L12' `L22'


SEMdnd   // Looking good 

* This should do it though I don't hae the machine time available to wait
* for it to finish.
bs , rep(200) cluster(id): SEMdnd 

Formatted By


  1. Did you try Highlight software--- It can convert from one format to multiple format types and gives options few color customizations as well:

    1. Thanks for the suggestion! I had not seen this sweet little program.