## Friday, July 6, 2012

### Logit: Logistic regression on a factor variable

* Logistic regression on a factor variable

* A reader recently contacted me with a request: she wants to run logistic regressions to examine whether a person visited a dentist in the last year.

* The sample of the data she sent looked something like the following:

/*                  A12 |      Freq.     Percent        Cum.
------------------------+-----------------------------------
    In the last 4 weeks |      2,028       20.28       20.28
Between 1 and 12 months |      2,036       20.36       40.64
          1-2 years ago |      1,997       19.97       60.61
  More than 2 years ago |      1,963       19.63       80.24
                  Never |      1,976       19.76      100.00
------------------------+-----------------------------------
                  Total |     10,000      100.00          */

* Let us first generate data that looks something like her data.

clear
set obs 10000
gen A12 = int(runiform()*5)+1
label define dental 1 "In the last 4 weeks" ///
2 "Between 1 and 12 months" ///
3 "1-2 years ago" ///
4 "More than 2 years ago" ///
5 "Never"

label values A12  dental

tab A12
* The problem is that the dependent variable is coded as a factor variable with five categories, but logistic regression requires a binary variable.

* First we want to find out which value label is attached to the A12 variable.
desc A12
* However, A12 might not be a factor variable at all.  It could actually be a string variable.

* Let us generate a string duplicate
decode A12, gen(A12b)

tab A12b
* We can see that the two tab outputs are identical except for the order of the rows: A12 is sorted by its numeric codes, A12b alphabetically.
* Since decode only works on labeled numeric variables, we can infer that the original data is in factor form.

* Simply looking at the desc output tells us the same thing.  If the storage type is not a string type and a value label is listed, then A12 must be a factor variable.

* The desc output also tells us that A12 has the value label dental applied to it.

label list
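
* As an optional check, the storage type and value label can also be read programmatically.
* This is only a minimal sketch; the local macro names vtype and vlab are my own choices.
local vtype : type A12
local vlab : value label A12
display "A12 is stored as `vtype' with value label `vlab'"
* confirm exits with an error if A12 is not numeric, so capture lets us inspect the return code instead.
capture confirm numeric variable A12
display _rc
* A return code of 0 means A12 is numeric.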

* Here is a detailed post on how to convert factor variables to dummies:
* http://www.econometricsbysimulation.com/2012/06/convert-factor-variables-dummy-lists.html

* However, it might be a bit of overkill for this problem.  Instead we can manually convert the factor variables as we need to.

* First generate an indicator variable set to 0 (left missing wherever A12 is missing)
gen dental_yr1 = 0 if !missing(A12)
label var dental_yr1 "Went to the dentist in the last year"
replace dental_yr1 = 1 if A12 == 1
* We know from the label list that A12 == 1 is "In the last 4 weeks"
replace dental_yr1 = 1 if A12 == 2
* We know from the label list that A12 == 2 is "Between 1 and 12 months"

sum dental_yr1
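* Everything is looking good: the mean should be close to 0.4, since two of the five equally likely categories count as a visit in the last year.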
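
* As a cross-check, a two-way table should show dental_yr1 == 1 only for A12 values 1 and 2.
tab A12 dental_yr1, missing

* A more compact sketch of the same recode uses inlist().  The name dental_yr1b is hypothetical and used only for comparison.
* The if qualifier leaves the indicator missing wherever A12 is missing.
gen byte dental_yr1b = inlist(A12, 1, 2) if !missing(A12)
* The assert should hold here because the generated A12 has no missing values.
assert dental_yr1b == dental_yr1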

* Now in order to do a logistic regression we need to have some explanatory variables so let's generate some independent ones for now.
gen indepvar1 = rnormal()
gen indepvar2 = rnormal()

* Finally:
logit dental_yr1 indepvar1 indepvar2
* Unsurprisingly, the independent variables are not statistically significant, since they were generated independently of dental_yr1.
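
* Two optional follow-ups, sketched under the assumption that the logit above has just been run.
* Replaying the results with the or option reports odds ratios instead of coefficients.
logit, or
* The test command performs a joint Wald test that both coefficients are zero.
test indepvar1 indepvar2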