Identify unique levels of categorical variable

问题

I have a list of person ids, and the types of medicines they got on specific dates.

I would like to create a variable count whereby I can give the indicator 1 to the first drug that occurs, 2 to the second unique drug and 3 to the third unique drug. When the first drug occurs after the second and third, I want it to still have the indicator 1. Likewise for unique drug 2, it should maintain the value 2 throughout the person's whole medication history, and the same for drug 3.

     +-------------------------------------+
     | p_id      date      agent_~e  count |
     |-------------------------------------|
 38. |  1001   13dec2001   thiazide       1|
 39. |  1001   12apr2002   thiazide       1|
 40. |  1001   15jul2002   thiazide       1|
 41. |  1001   28aug2002        arb       2|
 42. |  1001   26sep2002        CCB       3|
     |-------------------------------------|
 43. |  1001   26sep2002        arb       2|
 44. |  1001   10oct2002        CCB       3|
 45. |  1001   10oct2002   thiazide       1|
 46. |  1001   10oct2002        arb       2|
 47. |  1001   10dec2002        CCB       3|
     |-------------------------------------|
 48. |  1001   10dec2002        arb       2|
     +-------------------------------------+

Because each person has a different set of drugs, I think I need quite a general solution as opposed to something like

gen count = 1 if agent_type == "thiazide".

For example, person two is below and they have a very different drug history to person one above.

         +-------------------------------+
         | p_id    date        agent_t~e |
         |-------------------------------|
    207. |  2001   08jul1999   ace_inhib |
    208. |  2001   02aug1999   ace_inhib |
    209. |  2001   25aug1999   ace_inhib |
    210. |  2001   22oct1999   ace_inhib |
    211. |  2001   18nov1999         CCB |
         |-------------------------------|
    212. |  2001   18nov1999   ace_inhib |
    213. |  2001   14dec1999         CCB |
    214. |  2001   12jan2000         CCB |
    215. |  2001   03feb2000         CCB |
    216. |  2001   03feb2000         arb |
         |-------------------------------|
    217. |  2001   02mar2000         CCB |
         +-------------------------------+

回答1:

"Unique" is a common misnomer here; strictly, it means occurring once only, which is not what you mean at all. "Distinct" is a much better word: for a discussion in Stata context, see here.

Please find out about dataex from SSC to be able to show data examples that can be copied and pasted directly. Yours required some engineering to be made easy to use.

Your problem is already a Stata FAQ found here. It is a good idea to look through the FAQs before posting.

* Example generated by -dataex-. To install: ssc install dataex
clear
input float p_id str8 agent_type float(wanted date)
1001 "thiazide" 1 15322
1001 "thiazide" 1 15442
1001 "thiazide" 1 15536
1001 "arb"      2 15580
1001 "CCB"      3 15609
1001 "arb"      2 15609
1001 "CCB"      3 15623
1001 "thiazide" 1 15623
1001 "arb"      2 15623
1001 "CCB"      3 15684
1001 "arb"      2 15684
2001 "ace_inhi" 1 14433
2001 "ace_inhi" 1 14458
2001 "ace_inhi" 1 14481
2001 "ace_inhi" 1 14539
2001 "CCB"      2 14566
2001 "ace_inhi" 1 14566
2001 "CCB"      2 14592
2001 "CCB"      2 14621
2001 "CCB"      2 14643
2001 "arb"      3 14643
2001 "CCB"      2 14671
end
format date %td 

bysort p_id agent_type (date) : gen firstdate = date[1] 
egen group = group(p_id firstdate agent_type) 
bysort p_id (group date agent_type): gen count = sum(group != group[_n-1])  
assert count == wanted

Note that the code takes care of the possibility that two or more drugs are first used on the same day by the same person.

来源：https://stackoverflow.com/questions/40483213/identify-unique-levels-of-categorical-variable

标签

unique

stata