Sunday, September 2, 2012

Three ways to Create Unique Identifiers


* Imagine that you have just recieved a data set and there are three variables household_ID, city_ID, state_ID which are characteristics of the observations.

* There is a the possibility that you have multiple observations per combination representing different responses from different surveys.

* You would like to create a unique identifier for each of the potential outcomes

* Let's start with 25,000 observations
clear
set obs 25000

* Household IDs go from 1 to 50
gen household_ID = ceil(runiform()*50)

* State IDs go from 1 to 50
gen state_ID = ceil(runiform()*50)

* City IDs go from 1 to 20
gen city_ID = ceil(runiform()*20)

di "Thus " 50*50*20 " potential outcomes"

* By chance there will be some mutiple outcomes and some outcomes that are never drawn.

* To count the multiple outcomes we can use the duplicates report command

duplicates report household_ID city_ID state_ID

* Let's imagine that we have a couple of variables of interest

gen x1 = rnormal()
gen x2 = rnormal()

* Now let's create our unique identifiers

* The easiest way is by using the group command

egen ID = group(household_ID city_ID state_ID)
* This command will just create a unique identifier of for each combination of household_ID city_ID state_ID in a manner consistent with sorting those variables.

sort household_ID city_ID state_ID
* We can easily see this in the numbering of ID.

sum ID
* I get a max ID of a little less than 20k.  This means that there are a little less than 20k unique household_ID city_ID state_ID combinations.

* We can manually do a similar command as the group function through the following set of nested loops.
local i = 0
gen ID2 = .

 qui levelsof household_ID, clean
 local levelsv = r(levels)
   qui levelsof city_ID, clean
   local levelsvv = r(levels)
    qui levelsof state_ID, clean
    local levelsvvv = r(levels)

foreach v of local levelsv  {
  foreach vv of local levelsvv  {
    foreach vvv of local levelsvvv  {
      qui replace ID2 = `i' if `vvv' == state_ID & `vv' == city_ID  & `v' == household_ID
 local i = `i'+1
}
  }
}
* This method is obviously much slower than the group function and does create uniform numbering when each combination is not at least once represented.

* An alternative way of creating a unique ID would be to create a text variable which is a combination of the other IDs.

* This method, though more complex (and using up more system resources) may be desireable in that it yeilds a unique identifier which is easily identifiable.

gen ID3 = "H" + string(household_ID,"%2.0f" ) + "C" + string(city_ID) + "S" + string(state_ID)

* ID3 is less easy to count unique identifiers with.

* Now the only problem is that some of our data has multiple observations.

* In order to resolve these multiple observations we will decide a rule on how to handle x1 and x2 that reduces our observations to 1 per identifier (thus making it finally unique)

* I am not a huge fan of the collapse command because it is a destructive command but it can save a few steps.
collapse (mean) mean_x1=x1 mean_x2=x2 (median) med_x1=x1 med_x2=x2, by(ID)

sum ID
di "Now max ID (" r(max) ") equals the number of observations (" r(N) ") indicating that we have achieved our goal of creating a unique identifier.

1 comment:

  1. Looks like something those archiving solutions providers have been doing for a while now.

    ReplyDelete