Friday, March 22, 2013

My not so Brief Stata Formatting Guide

* I write this as a short guide, though I do not always stick to it. This post was inspired by a thoughtful discussion on the LinkedIn Stata-Users group.

* The number one rule is: Always, always comment your code!

* See my reasons:

* I prefer using <*> before a comment as the primary method of commenting.
* And <//> before my lines when in Mata.

* I only like using /* and */ when I have a large amount of comments.

* I think it is useful to add comments after commands, like this:
clear // remove data in memory

* Though with long commands I think trailing comments are hard to read.  For example:
twoway (scatter  length gear_ratio) (scatter  foreign mpg_price) (scatter  price mpg) ///
  , title("This is a useless and meaningless graph") // Graphs length against gear_ratio

* -----------------------------SAMPLE DOCUMENT----------------------------
/***** Title of Do File

Description of do file.  This might have several paragraphs for which
I recommend hard breaking the lines since Stata does not have word wrap

*********************** Section 0: Initialization **********************

If your do file is very long and has multiple sections consider including
an index.

1. Parameter declaration
2. Input/clean data, generate temporary data
3. Manipulate variables
4. Generate summary statistics
5. Generate estimates
6. Delete temporary data/variables

You might also consider including a variable glossary at the beginning of your do file.

For example:
cntgdp: Country GDP
cntgdp2: Country GDP demeaned
year: Year
nsrvy: Survey wave number

As for naming variables, I would suggest not letting variable names get longer
than six letters and two numbers.

For example, the variables might be:
dstgdpp: district gross domestic product per capita, nominal currency of that year.
dstgdppcchgyr00: district gross domestic product per capita, change from year 2000.
It might seem like a good idea to name them this way, but such names are really
confusing to read, especially since Stata will start truncating variable names.

I would suggest naming the variables something like this instead:
dstgdp0: district gross domestic product per capita, nominal currency of that year.
dstgdp1: district gross domestic product per capita, change from year 2000.

Define the variables in two places:
the variable glossary at the beginning of your document and the label that you
give each variable.
******************** End Section 0: Initialization ********************/

****************** Section 1: Parameter declaration *********************

* Often you might find it useful to specify globals or locals that help you
* control your analysis when you run your file.

* For example:
global exmin = 1
* When set to 1 minorities will be excluded from the analysis.

global ppp = 0
* When set to 1 purchasing power parity will be used instead of GDP per capita.

* Of course you will need to code up within the analysis what the globals actually do.
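
* As a minimal sketch of how such a switch might be wired into the analysis:
* the auto data has no minority variable, so a hypothetical indicator named
* minority is faked from foreign purely for illustration.

```stata
* Sketch only: "minority" is a hypothetical 0/1 variable, copied from
* foreign so that the example runs on the built-in auto data.
sysuse auto, clear
global exmin = 1
gen minority = foreign
  label var minority "Hypothetical minority indicator (copy of foreign)"

if ($exmin == 1) {
  * The switch is on, so the flagged observations leave the sample.
  drop if minority == 1
}

* Report how many observations remain after applying the switch.
count
```

* Setting exmin to 0 and rerunning leaves the sample untouched, so the choice is
* made once in the parameter section instead of being edited deep in the code.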

* Specify a working directory.  This can be done with the "cd" command.

* Personally I don't think this is sufficient.
* Often I am loading multiple data files from multiple directories.

* I prefer using globals specified in the parameter section.
* This allows users to have slightly or largely different file organization
* yet still be able to run your analysis.

* For example:
* Use globals to specify directories of interest
* Read directory
global rdir = "C:/data_files/my_project/original_data/"
* Save directory
global sdir = "C:/my_project/modified_data/"

**************** End Section 1: Parameter declaration *********************

****************** Section 2: Input/clean data *********************

* When you load data, always load it first, then save a copy of it somewhere else.

* Load original data:
sysuse auto, clear

* Save data to a new directory where it will never accidentally overwrite your original data
save "${sdir}auto.dta", replace

*************** End Section 2: Input/clean data *********************

****************** Section 3: Manipulate variables *********************

* Always give your variables labels when you define them.
gen mpg_price = mpg*price
  label var mpg_price "Miles Per Gallon times Price"
* Use spaces to help denote commands which are secondary.
* Never use tabs instead of spaces
    * because
        * they
            * are
                * hard
                    * to
                        * read
                            * and can
                                * substantially
                                    * decrease
                                        * your
                                            * page space
                                                * Also, your code may
                                                    * look different with different
                                                        * programs.
                                                            * This is very annoying.
                                                                * I stuck a
                                                                    * lot of spaces to
                                                                        * simulate the
                                                                            * Stata editor.
* Always explain why you do things.
drop if foreign == 1
  * We are only interested in domestic cars (for example).

* When doing any kind of looping also use indentation:
forv i = 1(1)10 {
  * With forvalues I always write i = 1(1)10 rather than the equivalent i = 1/10,
  * because the i = 1/10 notation can cause problems when using macros.

  * It is very important to indent.
  if (`i' == 3) {
    * Do something

    * I am displaying filler text when i==3 and only then
    di "Filler"

    * This will display i squared when i==3 (which is obviously 9)
    di `i'^2

  }
  * End if
}
* End forv i loop
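
* As a small illustration of the range notation with macros, here is a sketch in
* which the loop bounds are held in locals. The names first, step, and last are
* hypothetical and not part of the guide above.

```stata
* Hypothetical loop bounds held in locals (names are illustrative only)
local first = 1
local step  = 1
local last  = 5

forv i = `first'(`step')`last' {
  * Show the counter and its square on each pass
  di "i = `i', i squared = " `i'^2
}
* End forv i loop
```

* Locals only exist within a single run, so the definitions and the loop
* need to be executed together.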

* Also, indent when commands go on multiple lines in length.
twoway (scatter  length gear_ratio) (scatter  foreign mpg_price) (scatter  price mpg) ///
  , title("This is a useless and meaningless graph")

* This can make commands much easier to read.

*************** End Section 3: Manipulate variables *********************

 * Also, take a look at some of the comments below.  There have been some very thoughtful contributions by Stata users.


  1. I came here from your LinkedIn announcement.

    I personally do things differently.

    (1) Lines like -clear- do not need comments; they are self-explanatory.
    (2) Generally, the comments should explain WHY you are doing something, not HOW you are doing it (unless you are using a trick that is so weird that it may be difficult to recall). It may be worth commenting the "how" if it is very heavily data dependent, such as when a variable serves multiple purposes, e.g. -svyset [pw=weight], strata(region) // the sample was stratified by region-.
    (3) Some editors, like UltraEdit, use tabs to display (and fold) code effectively, although, for instance, Python convention favors spaces for indentation, and Python chokes on inconsistent mixes of tabs and spaces. So stating to never use tabs is not the best advice.
    (4) Globals are a device of last resort; avoid them, as you would avoid a GOTO (which Stata does not have). There are other ways to transfer parameters between files, like using -args-. If you do use globals, they have to be defined in a separate file (and then cleaned in the same or another clearly obvious file).
    (5) Use an editor that has appropriate syntax highlighting. Otherwise the comments are still not easy to read... as in the plain text guide above :).
    (6) The variable naming convention is odd; you have 32 characters to express the power of good names. Shorter names are easy to mix up, and there is usually little reason not to use more informative names. They may not show nicely in all tabulations and regression results, but you should be using -outreg-, -estout- and similar third-party commands for these purposes anyway.
    (7) Data management and data analysis should be two separate do-files. A good size for a do-file is whatever fits on one screen; if it does not fit, break it down, and collect the calls in the master file.
    (8) On storing a copy of the data somewhere: Stata has a concept of -tempfile- which you can use here without cramming your disk space with files that you won't remember the origin of; Stata will clean up -tempfile-s at exit.

    In the end, my ratio of comment lines to substantive lines is roughly 20-40% comments to 60-80% code. You seem to have more comments than code (in your production code, not only in this guide), and that is not very efficient.

    Overall, I would also suggest that your readers look at references such as Kit Baum's Intro to Stata Programming and J Scott Long's The Workflow of Data Analysis books. These are excellent ways to improve the quality of your Stata code.

    Stas Kolenikov

    1. Thanks for your thorough response,

      You make some good points. I think your point number 7 is particularly helpful and something I did not think to mention, though I strongly agree with it.

      I am still not sure why people fervently oppose the use of globals (4). I have heard this said by various people, but they never give an explanation. Thus, I have shied away, using locals instead, which are annoying because Stata drops them from memory after the command has run its course, compounding the challenges of debugging.

      As for points 1, 2, 5, as well as the comment/code ratio, I don't think we disagree at all. Obviously a public blog in which code is put forward to explain to new or experienced users how to do various tasks will and should have more comments than that of coding for analysis since the primary purpose in coding for an audience is to demonstrate what you are doing.

      I stand by my point that use of tabs is messy and I think your response confirms this. The primary reason is that it is bad form to have your code look different to different users. Tabs can grossly aggravate this problem. Even if some text editors handle tabs well, are you going to tell your coauthors what text editors they need to use? I am probably stating this a little too strongly though. Obviously, if you find that tabs are working for you then I am perfectly fine with that. This is just a conclusion that I came to while working in a large coding group (9 to 10 people).

      Finally, as for tempfiles, this is another common preference (like using locals over globals) which I think makes debugging all the more difficult. What happens if your code is merging multiple files incorrectly and you need to edit it mid-run? Are you going to run the code up to a point, look at your temporary data, then run it up to the next part? If you are going to do this, why not just copy your data to a temporary location, then delete all of the temporary files at the end of your code? This will result in the same number of lines of code as well as the same run speed, yet will avoid the annoying notation of specifying file names as tempfiles.


    2. Great post. For looping, I do prefer to put the comments on the same line as the closing (and/or the opening) brace since it helps make it clear where multiple, nested loops begin/end, particularly when you 'fold'/hide the loop in the Stata do-file editor. E.g.,

      foreach x in one two three {
        if `"`c(os)'"' != "MacOSX" { //only on my windows os!
          forval y = 10(-1)1 {
            di `"`x':`y'"'
          } //close y loop
          foreach z in test1 test2 {
            di "`z'"
          } //close z loop
        } //close if loop
      } //close x loop

  2. A very useful guide. I had not considered the use of an extra space-worth's indent for secondary comments, and will probably adapt it to my own practices.

    Where I disagree with you is in using hard spaces instead of tabs to indent lines. You are right that tabs appear differently to different users, but surely this is a feature rather than a bug. Different people like different numbers of tab stops, and most text editors will at the least allow you to set that as a preference. Personally I find that 8 is too many; 4 works for me (speaking as someone who usually works with my monitor rotated 90°). But I would find it very annoying to read someone else's code and find that the indent size is dictated for me by their use of hard spaces, particularly as tabbed indents are also used for other optional features in many text editors. Similarly I wouldn't wish to inflict my tab-width preferences on anyone else, in much the same way as I wouldn't insist that they use the same text editor, font or colour scheme.

    (Side note: at least it's not quite as extreme as a former colleague of mine who, for some unknown reason, insisted on removing *all* indents from my code, and squashing as many closing braces onto the same line as possible! It wasn't Stata, I might add, hence the code still worked but was a complete pig to read.)

    With respect to globals, I can see they might lead to bad habits, but I don't put them in the same category as GOTO (my personal avoid-like-the-plague command is -cd-). They serve a purpose (but it is not hard to see how they might be misused for inappropriate purposes). I use them mainly for defining working directories. In the main, I keep them in a separate file. Unfortunately (because I haven't found a way for Stata to return a macro containing the directory of the current running do-file), I have to include a global in my main program file in order to tell Stata where to look for the rest of the globals!

    Finally, with tempfiles, I must admit, I prefer using them over permanent "in-progress" datasets, mainly for the reason that it makes my working directory easier to keep tidy.

  3. Thanks for all of the responses. I think I will reconsider how I use tabs and tempfiles.