Stata counting substring

Posted by 旧时模样 on 2019-12-12 01:59:51

Question


My table looks like this:

ID         AQ_ATC     amountATC
.       "A05"                 1
123     "A05AA02"          2525
234     "A05AA02"          2525
991     "A05AD39"           190
.       "C10"                 1
441     "C10AA11"          4330
229     "C10AA22"          3100
.       "C05AA"               1
441     "C05AA03"           130

The count for the full 8-character AQ_ATC codes is already correct. The shorter codes are unique in the table and are substrings of the complete 8-character codes (they represent the first x characters). What I am looking for is the count of the appearances of the shorter codes throughout the entire table. For example in this case the resulting table would be

ID         AQ_ATC     amountATC
.       "A05"              2715       <-- 2525 + 190
123     "A05AA02"          2525
234     "A05AA02"          2525
991     "A05AD39"           190
.       "C10"              7430       <-- 4330 + 3100
441     "C10AA11"          4330
229     "C10AA22"          3100
.       "C05AA"             130       <-- 130
441     "C05AA03"           130

The partial codes do not overlap. By that I mean that if there is a partial code "C05", there won't be another partial code such as "C05A1".

I created the amountATC column using

bysort AQ_ATC: egen amountATC = total(AQ_ATC==AQ_ATC)

I attempted recycling the code that I had received yesterday but failed in doing so. My attempt looks like this:

levelsof AQ_ATC, local(ATCvals) 

quietly foreach y in AQ_ATC { 
local i = 0
quietly foreach x of local ATCvals { 
    if strpos(`y', `"`x'"') == 1{
    local i = `i'+1
    replace amountATC = `i'
    }
 }
}

My idea was to use a counter "i" and increase it by 1 every time an AQ_ATC starts with another AQ_ATC code. Then I write "i" into amountATC, and after I have iterated over the entire table for my AQ_ATC, I will have an "i" value equal to the number of occurrences of the substring. Then I reset "i" to 0 and continue with the next AQ_ATC. At least that's how I intended it to work; what it did in the end was set all amountATC values to 1.

I also attempted looking into different egen-functions such as noccur and moss, but my connection keeps timing out when I attempt to install the packages.


Answer 1:


It seems as if you come from another language and insist on using loops when they are not strictly necessary. Stata does many things without explicit loops, precisely because most commands already apply to all observations.
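As far as I can tell, the loop in the question sets everything to 1 for two reasons. First, the programming command if (as opposed to the if qualifier) evaluates an expression that refers to a variable using only the first observation. Second, replace without an if qualifier changes every observation. The small sketch below illustrates the difference; the variable names code and startsA05 are made up for the example.

```stata
clear
input str7 code
"A05"
"A05AA02"
"C10AA11"
end

* if command: the expression is evaluated once, using code[1] only
if strpos(code, "A05") == 1 {
    display "true, but only because code[1] starts with A05"
}

* if qualifier: the expression is evaluated for every observation
gen byte startsA05 = 0
replace startsA05 = 1 if strpos(code, "A05") == 1

list
```

In the question's loop, foreach y in AQ_ATC also runs only once, with y holding the variable name itself, so the inner test compares the first observation against each level and succeeds for a single level.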

One way is:

clear
set more off

input ///
ID         str15 AQ_ATC      amountATC
.       "A05"                 1
123     "A05AA02"          2525
234     "A05AA02"          2525
991     "A05AD39"           190
.       "C10"                 1
441     "C10AA11"          4330
229     "C10AA22"          3100
.       "C05AA"               1
441     "C05AA03"           130
end

*----- what you want -----

sort AQ_ATC ID
gen grou = sum(missing(ID))

bysort grou AQ_ATC: gen tosum = amountATC if _n == 1 & !missing(ID)
by grou: egen s = total(tosum)

replace amountATC = s if missing(ID)

list, sepby(grou)

Edit

With your edit, the same principles apply. The code below adjusts to your change and is slightly shorter (one line less):

*----- what you want -----

sort AQ_ATC
gen grou = sum(missing(ID))

* (AQ_ATC) keeps the codes in order within each group; bysort grou alone may reorder ties
bysort grou (AQ_ATC): gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1] & !missing(ID)
by grou: replace amountATC = s[_N] if missing(ID)

A more efficient version should be:

<snip>

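* (AQ_ATC) keeps the codes in order within each group; bysort grou alone may reorder ties
bysort grou (AQ_ATC): gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1]
* subtract the placeholder 1 stored in the short-code row
by grou: replace amountATC = s[_N] - 1 if missing(ID)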

Some comments:

  1. sort is a very handy command. If you sort the data by AQ_ATC they are arranged in such a way that the short (sub)strings are placed before corresponding long strings.

  2. The by: prefix is fundamental and very helpful, and you can use it once appropriate groups are defined. I created the groups by taking advantage of the fact that every short (sub)string has a missing ID.

  3. Then (by the groups just defined) you only want to add up one value (observation) per distinct AQ_ATC code. That's what the condition if AQ_ATC != AQ_ATC[_n+1] does: it keeps only the last observation of each run of identical codes.

  4. Finally, replace back into your original variable. I would usually generate a copy and work with that, so my original variable doesn't suffer.
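To make points 1 and 2 concrete, here is a minimal sketch on a deliberately shuffled subset of the example data. It only shows how sorting followed by a running sum of missing(ID) produces the group identifier; it assumes, as in the question, that every full code has its short code in the table.

```stata
clear
input ID str7 AQ_ATC amountATC
.   "C10"        1
441 "C10AA11" 4330
.   "A05"        1
123 "A05AA02" 2525
991 "A05AD39"  190
end

* each short code sorts immediately before the full codes that start with it
sort AQ_ATC

* running count of short codes seen so far; it serves as the group identifier
gen grou = sum(missing(ID))

list, sepby(grou)
```

A full code whose short code is absent from the table would end up in the preceding group, so the approach depends on that assumption.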

An excellent read for the by: prefix is Speaking Stata: How to move step by: step, by Nick Cox.

Edit2

Yet another slightly different way:

*----- what you want -----

sort AQ_ATC
gen grou = sum(missing(ID))

egen t = tag(grou AQ_ATC)
bysort grou: gen s = sum(amountATC * t)

by grou: replace amountATC = s[_N] - 1 if missing(ID)
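
If you prefer to stay closer to the loop in the question, something along these lines might also work. It is only a sketch: it loops over the short codes (identified by a missing ID, as above) and uses strpos() inside an if qualifier rather than the if command. The names first and shortcodes are made up for the example.

```stata
clear
input ID str15 AQ_ATC amountATC
.   "A05"        1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39"  190
.   "C10"        1
441 "C10AA11" 4330
229 "C10AA22" 3100
.   "C05AA"      1
441 "C05AA03"  130
end

* first is a helper: 1 for exactly one observation per distinct code
egen byte first = tag(AQ_ATC)

* collect the short codes, which are the rows with a missing ID
levelsof AQ_ATC if missing(ID), local(shortcodes)

foreach c of local shortcodes {
    * total the counts of the distinct full codes that start with the short code
    quietly summarize amountATC if first & !missing(ID) & strpos(AQ_ATC, "`c'") == 1
    quietly replace amountATC = r(sum) if AQ_ATC == "`c'" & missing(ID)
}

list
```

This loops once per short code instead of once per group, so it does not depend on the sort order, at the cost of one pass over the data for each short code.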


Source: https://stackoverflow.com/questions/27344146/stata-counting-substring
