SAS sum observations not in a group, by group

Question

I have a data set:

data have;
   input group $ value;
   datalines;
A 4
A 3
A 2
A 1
B 1
C 1
D 2
D 1
E 1
F 1
G 2
G 1
H 1
;
run;

The first variable is a group identifier, the second a value.

For each observation, I want a new variable "sum" holding the sum of all values in the column, except the values in the group the observation belongs to.

My issue is that I have to do this on nearly 30 million observations, so efficiency matters. I found that using a data step was more efficient than using procs.

The final data set should look like:

data want;
   input group $ value sum;
   datalines;
A 4 11
A 3 11
A 2 11
A 1 11
B 1 20
C 1 20
D 2 18
D 1 18
E 1 20
F 1 20
G 2 18
G 1 18
H 1 20
;
run;

Any idea how to do this, please?

Edit: I don't know if this matters, but the example I gave is a simplified version of my issue. In the real case, I have 2 other group variables, so taking the sum of the whole column and subtracting the sum in the group is not a viable solution.

Answer 1:

The requirement

sum of all values in the column, except for the group the observation is in

indicates two passes of the data must occur:

1.) Compute allsum and each group's group_sum.
    A hash can store each group's sum, computed via a specified suminc: variable and the .ref() method. A plain variable can accumulate allsum.

2.) Compute allsum - group_sum for each row of a group.
    The group_sum is retrieved from the hash and subtracted from allsum.

Example:

data want;
  if 0 then set have; * prep pdv;

  declare hash sums (suminc:'value');
  sums.defineKey('group');
  sums.defineDone();

  do while (not hash_loaded);
    set have end=hash_loaded;
    sums.ref();                * adds value to internal sum of hash data record;
    allsum + value;
  end;

  do while (not last_have);
    set have end=last_have;
    sums.sum(sum:sum);         * retrieve groups sum. Do you hear the Dragnet theme too?;
    sum = allsum - sum;        * subtract from allsum;
    output;
  end;

  stop;
run;
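
The question's edit mentions two additional group variables without saying how they enter the calculation. As a sketch only, assuming the goal is "within each combination of the two extra variables, sum everything except the row's own group", the same two-pass idea can use two hashes with composite keys. The data set have3 and the variables g2 and g3 are hypothetical stand-ins, not names from the question.

```sas
/* Hypothetical sample: g2 and g3 stand in for the two extra group variables */
data have3;
  input group $ g2 $ g3 $ value;
  datalines;
A x p 4
A x p 3
B x p 1
C x p 2
A y q 5
B y q 1
;
run;

data want3;
  if 0 then set have3;                       /* prep pdv */

  declare hash outer_sums (suminc:'value');  /* one sum per (g2, g3) */
  outer_sums.defineKey('g2', 'g3');
  outer_sums.defineDone();

  declare hash group_sums (suminc:'value');  /* one sum per (group, g2, g3) */
  group_sums.defineKey('group', 'g2', 'g3');
  group_sums.defineDone();

  /* pass 1: accumulate both levels of sums */
  do until (loaded);
    set have3 end=loaded;
    outer_sums.ref();
    group_sums.ref();
  end;

  /* pass 2: outer sum minus the sum of the row's own group */
  do until (done);
    set have3 end=done;
    outer_sums.sum(sum:outer_sum);
    group_sums.sum(sum:group_sum);
    sum = outer_sum - group_sum;
    output;
  end;

  stop;
  drop outer_sum group_sum;
run;
```

If the extra variables are meant to combine differently, only the key lists passed to defineKey need to change.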

Answer 2:

What is wrong with a straightforward approach? You need to make two passes no matter what you do.

Like this. I included extra variables so you can see how the values are derived.

proc sql ;
 create table want as
  select a.*,b.grand,sum(value) as total, b.grand - sum(value) as sum
  from have a
     , (select sum(value) as grand from have) b
  group by a.group
 ;
quit;

Results:

Obs    group    value    grand    total    sum

  1      A        3        21       10      11
  2      A        1        21       10      11
  3      A        2        21       10      11
  4      A        4        21       10      11
  5      B        1        21        1      20
  6      C        1        21        1      20
  7      D        2        21        3      18
  8      D        1        21        3      18
  9      E        1        21        1      20
 10      F        1        21        1      20
 11      G        1        21        3      18
 12      G        2        21        3      18
 13      H        1        21        1      20

Note it does not matter what you have as your GROUP BY clause.

Do you really need to output all of the original observations? Why not just output the summary table?

proc sql ;
 create table want as
  select a.group, sum(value) as total, b.grand - sum(value) as sum
  from have a
     , (select sum(value) as grand from have) b
  group by a.group, b.grand
 ;
quit;

Results:

Obs    group    total    sum

 1       A        10      11
 2       B         1      20
 3       C         1      20
 4       D         3      18
 5       E         1      20
 6       F         1      20
 7       G         3      18
 8       H         1      20

Answer 3:

I would break this out into two different segments:

1.) You could start by using PROC SQL to get the sums by the group

2.) Then use some IF/THEN statements to reassign the values by group
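
One possible reading of these two steps, as a rough sketch rather than the answerer's own code: build the per-group sums and the grand total with PROC SQL, then attach them to every row in a data step. A merge replaces hand-written IF/THEN statements here, since those would not scale to many groups. The table names group_sums and grand_total are made up for illustration, and the merge assumes have is already sorted by group (sort it first if not).

```sas
proc sql;
  /* step 1a: one row per group with its subtotal */
  create table group_sums as
  select h.group, sum(h.value) as group_sum
  from have h
  group by h.group
  order by h.group;

  /* step 1b: one-row table holding the grand total */
  create table grand_total as
  select sum(value) as grand
  from have;
quit;

data want;
  if _n_ = 1 then set grand_total;  /* grand is retained on every row */
  merge have group_sums;            /* assumes have is sorted by group */
  by group;
  sum = grand - group_sum;          /* everything outside the row's own group */
  drop grand group_sum;
run;
```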

Source: https://stackoverflow.com/questions/60533524/sas-sum-observations-not-in-a-group-by-group

Tags

dataframe

sas