▼
Sunday, September 2, 2012
Three ways to Create Unique Identifiers
* Imagine that you have just recieved a data set and there are three variables household_ID, city_ID, state_ID which are characteristics of the observations.
* There is a the possibility that you have multiple observations per combination representing different responses from different surveys.
* You would like to create a unique identifier for each of the potential outcomes
* Let's start with 25,000 observations
clear
set obs 25000
* Household IDs go from 1 to 50
gen household_ID = ceil(runiform()*50)
* State IDs go from 1 to 50
gen state_ID = ceil(runiform()*50)
* City IDs go from 1 to 20
gen city_ID = ceil(runiform()*20)
di "Thus " 50*50*20 " potential outcomes"
* By chance there will be some mutiple outcomes and some outcomes that are never drawn.
* To count the multiple outcomes we can use the duplicates report command
duplicates report household_ID city_ID state_ID
* Let's imagine that we have a couple of variables of interest
gen x1 = rnormal()
gen x2 = rnormal()
* Now let's create our unique identifiers
* The easiest way is by using the group command
egen ID = group(household_ID city_ID state_ID)
* This command will just create a unique identifier of for each combination of household_ID city_ID state_ID in a manner consistent with sorting those variables.
sort household_ID city_ID state_ID
* We can easily see this in the numbering of ID.
sum ID
* I get a max ID of a little less than 20k. This means that there are a little less than 20k unique household_ID city_ID state_ID combinations.
* We can manually do a similar command as the group function through the following set of nested loops.
local i = 0
gen ID2 = .
qui levelsof household_ID, clean
local levelsv = r(levels)
qui levelsof city_ID, clean
local levelsvv = r(levels)
qui levelsof state_ID, clean
local levelsvvv = r(levels)
foreach v of local levelsv {
foreach vv of local levelsvv {
foreach vvv of local levelsvvv {
qui replace ID2 = `i' if `vvv' == state_ID & `vv' == city_ID & `v' == household_ID
local i = `i'+1
}
}
}
* This method is obviously much slower than the group function and does create uniform numbering when each combination is not at least once represented.
* An alternative way of creating a unique ID would be to create a text variable which is a combination of the other IDs.
* This method, though more complex (and using up more system resources) may be desireable in that it yeilds a unique identifier which is easily identifiable.
gen ID3 = "H" + string(household_ID,"%2.0f" ) + "C" + string(city_ID) + "S" + string(state_ID)
* ID3 is less easy to count unique identifiers with.
* Now the only problem is that some of our data has multiple observations.
* In order to resolve these multiple observations we will decide a rule on how to handle x1 and x2 that reduces our observations to 1 per identifier (thus making it finally unique)
* I am not a huge fan of the collapse command because it is a destructive command but it can save a few steps.
collapse (mean) mean_x1=x1 mean_x2=x2 (median) med_x1=x1 med_x2=x2, by(ID)
sum ID
di "Now max ID (" r(max) ") equals the number of observations (" r(N) ") indicating that we have achieved our goal of creating a unique identifier.
Looks like something those archiving solutions providers have been doing for a while now.
ReplyDelete