问题
My table looks like this:
ID AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
The count for the full 8-character AQ_ATC
codes is already correct.
The shorter codes are unique in the table and are substrings of the complete 8-character codes (they represent the first x characters).
What I am looking for is the count of the appearances of the shorter codes throughout the entire table.
For example in this case the resulting table would be
ID AQ_ATC amountATC
. "A05" 2715 <-- 2525 + 190
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 7430 <-- 4330 + 3100
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 130 <-- 130
441 "C05AA03" 130
The partial codes do not overlap, by what I mean that if there is "C05"
there wont be another partial code "C05A1
".
I created the amountATC column using
bysort ATC: egen amountATC = total(AQ_ATC==AQ_ATC)
I attempted recycling the code that I had received yesterday but failed in doing so. My attempt looks like this:
levelsof AQ_ATC, local(ATCvals)
quietly foreach y in AQ_ATC {
local i = 0
quietly foreach x of local ATCvals {
if strpos(`y', `"`x'"') == 1{
local i = `i'+1
replace amountATC = `i'
}
}
}
My idea was to use a counter "i" and increase it by 1 everytime the an AQ_ATC
starts with another AQ_ATC
code. Then I write "i" into amountATC
and after I iterated over the entire table for my AQ_ATC
, I will have an "i"-value that will be equal to the amount of occurences of the substring. Then I reset "i" to 0 and continue with the next AQ_ATC
.
At least thats how I intended for it to work, what it did in the end is set all amountATC
-values to 1.
I also attempted looking into different egen-functions such as noccur and moss, but my connection keeps timing out when I attempt to install the packages.
回答1:
It seems as if you come from another language and you insist in using loops when not strictly necessary. Stata does many things without explicit loops, precisely because commands already apply to all observations.
One way is:
clear
set more off
input ///
ID str15 AQ_ATC amountATC
. "A05" 1
123 "A05AA02" 2525
234 "A05AA02" 2525
991 "A05AD39" 190
. "C10" 1
441 "C10AA11" 4330
229 "C10AA22" 3100
. "C05AA" 1
441 "C05AA03" 130
end
*----- what you want -----
sort AQ_ATC ID
gen grou = sum(missing(ID))
bysort grou AQ_ATC: gen tosum = amountATC if _n == 1 & !missing(ID)
by grou: egen s = total(tosum)
replace amountATC = s if missing(ID)
list, sepby(grou)
Edit
With your edit the same principles apply. Below code that adjusts to your change and slightly changes the code (one line less):
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1] & !missing(ID)
by grou: replace amountATC = s[_N] if missing(ID)
More efficient should be:
<snip>
bysort grou: gen s = sum(amountATC) if AQ_ATC != AQ_ATC[_n+1]
by grou: replace amountATC = s[_N] - 1 if missing(ID)
Some comments:
sort
is a very handy command. If you sort the data byAQ_ATC
they are arranged in such a way that the short (sub)strings are placed before corresponding long strings.The
by:
prefix is fundamental and very helpful, and I noticed you can use it after defining appropriate groups. I created the groups taking advantage of the fact that all short (sub)strings have amissing(ID)
.Then (by the groups just defined) you only want to add up one value (observation) per
amountATC
. That's what the conditionif AQ_ATC != AQ_ATC[_n+1]
does.Finally,
replace
back into your original variable. I would usuallygenerate
a copy and work with that, so my original variable doesn't suffer.
An excellent read for the by:
prefix is Speaking Stata: How to move step by: step, by Nick Cox.
Edit2
Yet another slightly different way:
*----- what you want -----
sort AQ_ATC
gen grou = sum(missing(ID))
egen t = tag(grou AQ_ATC)
bysort grou: gen s = sum(amountATC * t)
by grou: replace amountATC = s[_N] - 1 if missing(ID)
来源:https://stackoverflow.com/questions/27344146/stata-counting-substring