Populating new variable using vlookup with multiple criteria in another variable

Question

1) A new variable should be created for each unique observation listed in variable sku, which contains repeated values.

2) Each newly created variable should hold the price of its own product at the store/week level, for every observation whose sku value is in the same subcategory (subc) as that product. For example, in eta2,3, the observations in lines 3, 4, and 5 have the same value because they all belong to the same subcategory as sku #3. [eta2,3 indicates sku 3, subc 2.]

3) x indicates that this is the original value for the product/subcategory that is currently being replicated.

4) If an observation doesn't belong to the same subcategory, it should reflect "0".

Orange is the given data. In green are the values from the steps 1, 2, and 3. White cells are step 4.

I am unable to offer a solution of my own, as searching for a way to generate a variable using existing observations hasn't given me results.

I also assume that it must involve some combination of the forvalues, foreach, and levelsof commands?

clear
input units price   sku week    store   subc
3   4.3 1   1   1   1
2   3   2   1   1   1
1   2.5 3   1   1   2
4   12  5   1   1   2
5   12  6   1   1   3
35  4.3 1   1   2   1
23  3   2   1   2   1
12  2.5 3   1   2   2
35  12  5   1   2   2
35  12  6   1   2   3   
3   20  1   2   1   1
2   30  2   2   1   1
4   40  3   2   2   2
1   50  4   2   2   2
9   10  5   2   2   2
2   90  6   2   2   3
end

UPDATE: Based on Nick Cox's feedback, this is the final code that gives the result I have been looking for:

clear
input units price   sku week    store   subc
35  4.3 1   1   1   1
23  3   2   1   1   1
12  2.5 3   1   1   2
10  1   4   1   1   2
35  12  5   1   1   2
35  12  6   1   1   3
35  5.3 1   2   1   1
23  4   2   2   1   1
12  3.5 3   2   1   2
10  2   4   2   1   2
35  13  5   2   1   2
35  13  6   2   1   3
end

egen joint = group(subc sku), label 

bysort store week : gen freq = _N
su freq, meanonly 
local jmax = r(max) 
drop freq

tostring subc sku, replace
gen new = subc + "_"+sku 


su joint, meanonly 
forval j = 1/`r(max)'{     
 local J = new[`j'] 
    gen eta`J' = . 
} 

sort  subc week store sku 
egen joint1 = group(subc week store), label 

gen long id = _n 
su joint1, meanonly  

quietly forval i = 1/`r(max)' { 
   su id if joint1 == `i', meanonly
   local jmin = r(min) 
   local jmax = r(max) 

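   forval j = `jmin'/`jmax' {
       local subc = subc[`j']
       local sku = sku[`j']
       * copy this product's price to every observation in the cell
       replace eta`subc'_`sku' = price[`j'] in `jmin'/`jmax'
       * then set the product's own observation to 0
       replace eta`subc'_`sku' = 0 in `j'/`j'
   }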
}

Answer 1:

I worry on your behalf that, in a dataset of any size, what you ask for would mean many, many extra variables. I also wonder whether you need all of them anyway for whatever you want to do with them.

That aside, this seems to be what you want. Naturally your column headers in your spreadsheet view aren't legal variable names. Disclosure: despite being the original author of levelsof I wouldn't prefer its use here.

clear
input units price   sku week    store   subc
35  4.3 1   1   1   1
23  3   2   1   1   1
12  2.5 3   1   1   2
10  1   4   1   1   2
35  12  5   1   1   2
35  12  6   1   1   3
end

sort subc sku 
* subc identifiers guaranteed to be integers 1 up 
egen subc_id = group(subc), label 

* observation numbers in a variable  
gen long id = _n 

* how many subc? loop over the range 
su subc_id, meanonly 
forval i = 1/`r(max)' { 

   * which subc is this one? look it up using -summarize-
   * assuming that subc is numeric!    
   su subc if subc_id == `i', meanonly  
   local I = r(min) 

   * which observation numbers for this subc? 
   * given the prior sort, they are all contiguous 
   su id if subc_id == `i', meanonly 

   * for each observation in the subc, find out the sku and copy its price 
   * to all observations in that subc  
   forval j = `r(min)'/`r(max)' { 
       local J = sku[`j'] 
       gen eta_`I'_`J' = cond(subc_id == `i', price[`j'], 0) 
   }
}    

list subc eta*, sepby(subc)

     +------------------------------------------------------------------+
     | subc   eta_1_1   eta_1_2   eta_2_3   eta_2_4   eta_2_5   eta_3_6 |
     |------------------------------------------------------------------|
  1. |    1       4.3         3         0         0         0         0 |
  2. |    1       4.3         3         0         0         0         0 |
     |------------------------------------------------------------------|
  3. |    2         0         0       2.5         1        12         0 |
  4. |    2         0         0       2.5         1        12         0 |
  5. |    2         0         0       2.5         1        12         0 |
     |------------------------------------------------------------------|
  6. |    3         0         0         0         0         0        12 |
     +------------------------------------------------------------------+

Notes:

N1. In your example, subc is numbered 1, 2, etc. My extra variable subc_id ensures that to be true even if in your real data the identifiers are not so clean.

N2. The expression

cond(subc_id == `i', price[`j'], 0)

could also be

(subc_id == `i') * price[`j']

N3. It seems possible that a different data structure would be much more efficient.
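Notes N1 and N2 can be checked on a toy dataset. The following is only a sketch: the subc values 10, 25, 40 and the variable names a, b, c, d are made up for illustration. It also shows one case where the two expressions in N2 differ: if the looked-up price is missing, cond() still returns 0 outside the subcategory, whereas the product returns missing.

```stata
clear
input price sku subc
4.3 1 10
3   2 10
2.5 3 25
12  6 40
end

* N1: group() maps arbitrary identifiers to 1, 2, 3, ...
egen subc_id = group(subc), label
assert subc_id == 1 in 1/2
assert subc_id == 2 in 3
assert subc_id == 3 in 4

* N2: the two expressions agree when the price is not missing
gen a = cond(subc_id == 1, price[1], 0)
gen b = (subc_id == 1) * price[1]
assert a == b

* with a missing price they differ outside the subcategory:
* cond() yields 0, the product yields missing
replace price = . in 1
gen c = cond(subc_id == 1, price[1], 0)
gen d = (subc_id == 1) * price[1]
assert c == 0 & missing(d) if subc_id != 1
```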

EDIT: Here is code and results for another data structure.

clear
input units price   sku week    store   subc
35  4.3 1   1   1   1
23  3   2   1   1   1
12  2.5 3   1   1   2
10  1   4   1   1   2
35  12  5   1   1   2
35  12  6   1   1   3
end

sort subc sku 
egen subc_id = group(subc), label 

bysort subc : gen freq = _N
su freq, meanonly 
local jmax = r(max) 
drop freq

forval j = 1/`jmax' { 
    gen eta`j' = . 
    gen which`j' = . 
} 

gen long id = _n 
su subc_id, meanonly  

quietly forval i = 1/`r(max)' { 
   su id if subc_id == `i', meanonly
   local jmin = r(min) 
   local jmax = r(max) 

   local k = 1 
   forval j = `jmin'/`jmax' { 
       replace which`k' = sku[`j'] in `jmin'/`jmax' 
       replace eta`k' = price[`j'] in `jmin'/`jmax' 
       local ++k 
   }
}    

list subc sku *1 *2 *3 , sepby(subc)

     +------------------------------------------------------------+
     | subc   sku   eta1   which1   eta2   which2   eta3   which3 |
     |------------------------------------------------------------|
  1. |    1     1    4.3        1      3        2      .        . |
  2. |    1     2    4.3        1      3        2      .        . |
     |------------------------------------------------------------|
  3. |    2     3    2.5        3      1        4     12        5 |
  4. |    2     4    2.5        3      1        4     12        5 |
  5. |    2     5    2.5        3      1        4     12        5 |
     |------------------------------------------------------------|
  6. |    3     6     12        6      .        .      .        . |
     +------------------------------------------------------------+

Answer 2:

I am adding another answer that tackles combinations of subc and week. Previous discussion establishes that what you are trying to do would add an extra variable for every observation. This can't be a good idea! At best, you might just have many new variables, mostly zeros. At worst, you will run into Stata's limits.

Hence I won't support your endeavour to go further down the same road, but show how the second data structure I discuss in my previous answer can be produced. Indeed, you haven't indicated (a) why you want all these variables, which are just the existing data redistributed; (b) what your strategy is for dealing with them; (c) why rangestat (SSC) or some other program could not remove the need to create them in the first place.

clear
input units price   sku week    store   subc
35  4.3 1   1   1   1
23  3   2   1   1   1
12  2.5 3   1   1   2
10  1   4   1   1   2
35  12  5   1   1   2
35  12  6   1   1   3
35  5.3 1   2   1   1
23  4   2   2   1   1
12  3.5 3   2   1   2
10  2   4   2   1   2
35  13  5   2   1   2
35  13  6   2   1   3
end

sort subc week sku 
egen joint = group(subc week), label 

bysort joint : gen freq = _N
su freq, meanonly 
local jmax = r(max) 
drop freq

forval j = 1/`jmax' { 
    gen eta`j' = . 
    gen which`j' = . 
} 

gen long id = _n 
su joint, meanonly  

quietly forval i = 1/`r(max)' { 
   su id if joint == `i', meanonly
   local jmin = r(min) 
   local jmax = r(max) 

   local k = 1 
   forval j = `jmin'/`jmax' { 
       replace which`k' = sku[`j'] in `jmin'/`jmax' 
       replace eta`k' = price[`j'] in `jmin'/`jmax' 
       local ++k 
   }
}    

list subc week sku *1 *2 *3 , sepby(subc week)

     +-------------------------------------------------------------------+
     | subc   week   sku   eta1   which1   eta2   which2   eta3   which3 |
     |-------------------------------------------------------------------|
  1. |    1      1     1    4.3        1      3        2      .        . |
  2. |    1      1     2    4.3        1      3        2      .        . |
     |-------------------------------------------------------------------|
  3. |    1      2     1    5.3        1      4        2      .        . |
  4. |    1      2     2    5.3        1      4        2      .        . |
     |-------------------------------------------------------------------|
  5. |    2      1     3    2.5        3      1        4     12        5 |
  6. |    2      1     4    2.5        3      1        4     12        5 |
  7. |    2      1     5    2.5        3      1        4     12        5 |
     |-------------------------------------------------------------------|
  8. |    2      2     3    3.5        3      2        4     13        5 |
  9. |    2      2     4    3.5        3      2        4     13        5 |
 10. |    2      2     5    3.5        3      2        4     13        5 |
     |-------------------------------------------------------------------|
 11. |    3      1     6     12        6      .        .      .        . |
     |-------------------------------------------------------------------|
 12. |    3      2     6     13        6      .        .      .        . |
     +-------------------------------------------------------------------+
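If the eventual aim is a summary of the other products' prices within the same store, week and subcategory (an assumption, since the question does not state its purpose), by-group commands can compute it directly without creating any eta variables. A minimal sketch using only built-in commands; the names n_subc, sum_price and mean_other are hypothetical:

```stata
clear
input units price   sku week    store   subc
35  4.3 1   1   1   1
23  3   2   1   1   1
12  2.5 3   1   1   2
10  1   4   1   1   2
35  12  5   1   1   2
35  12  6   1   1   3
end

* number of products and total price within each store/week/subc cell
bysort store week subc : gen n_subc = _N
by store week subc : egen sum_price = total(price)

* mean price of the other products in the same cell;
* left missing when a product is alone in its cell
gen mean_other = (sum_price - price) / (n_subc - 1) if n_subc > 1

list subc sku price mean_other, sepby(subc)
```

This keeps the dataset at three extra variables however many products there are. Whether such a summary is what is needed depends on the analysis that follows.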

Answer 3:

clear
input units price   sku week    store   subc
35  4.3 1   1   1   1
23  3   2   1   1   1
12  2.5 3   1   1   2
10  1   4   1   1   2
35  12  5   1   1   2
35  12  6   1   1   3
35  5.3 1   2   1   1
23  4   2   2   1   1
12  3.5 3   2   1   2
10  2   4   2   1   2
35  13  5   2   1   2
35  13  6   2   1   3
end

egen joint = group(subc sku), label 

bysort store week : gen freq = _N
su freq, meanonly 
local jmax = r(max) 
drop freq

tostring subc sku, replace
gen new = subc + "_"+sku 


su joint, meanonly 
forval j = 1/`r(max)'{     
 local J = new[`j'] 
    gen eta`J' = . 
} 

sort  subc week store sku 
egen joint1 = group(subc week store), label 

gen long id = _n 
su joint1, meanonly  

quietly forval i = 1/`r(max)' { 
   su id if joint1 == `i', meanonly
   local jmin = r(min) 
   local jmax = r(max) 

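   forval j = `jmin'/`jmax' {
       local subc = subc[`j']
       local sku = sku[`j']
       * copy this product's price to every observation in the cell
       replace eta`subc'_`sku' = price[`j'] in `jmin'/`jmax'
       * then set the product's own observation to 0
       replace eta`subc'_`sku' = 0 in `j'/`j'
   }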
}    

list subc sku store week eta*, sepby(subc)

     +---------------------------------------------------------------------------------+
     | store   week   subc   sku   eta1_1   eta1_2   eta2_3   eta2_4   eta2_5   eta3_6 |
     |---------------------------------------------------------------------------------|
  1. |     1      1      1     2      4.3        0        .        .        .        . |
  2. |     1      1      1     1        0        3        .        .        .        . |
     |---------------------------------------------------------------------------------|
  3. |     1      1      2     4        .        .      2.5        0       12        . |
  4. |     1      1      2     3        .        .        0        1       12        . |
  5. |     1      1      2     5        .        .      2.5        1        0        . |
     |---------------------------------------------------------------------------------|
  6. |     1      1      3     6        .        .        .        .        .        0 |
     |---------------------------------------------------------------------------------|
  7. |     1      2      1     2      5.3        0        .        .        .        . |
  8. |     1      2      1     1        0        4        .        .        .        . |
     |---------------------------------------------------------------------------------|
  9. |     1      2      2     3        .        .        0        2       13        . |
 10. |     1      2      2     5        .        .      3.5        2        0        . |
 11. |     1      2      2     4        .        .      3.5        0       13        . |
     |---------------------------------------------------------------------------------|
 12. |     1      2      3     6        .        .        .        .        .        0 |
     +---------------------------------------------------------------------------------+
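One caveat on the loop that creates the eta variables: it reads the names from the first r(max) observations of new, so it works only when those observations happen to contain every distinct subc/sku pair exactly once. A possible alternative, sketched here on made-up data in which the first rows repeat a pair, is to take the distinct values from levelsof instead. The local name pairs is hypothetical.

```stata
clear
input price sku subc
4.3 1   1
5.3 1   1
3   2   1
2.5 3   2
end

tostring subc sku, replace
gen new = subc + "_" + sku

* one variable per distinct subc/sku pair, wherever it occurs in the data
levelsof new, local(pairs) clean
foreach p of local pairs {
    gen eta`p' = .
}

describe eta*
```

With this data the original loop would try to generate eta1_1 twice and stop with an error, while the levelsof version creates eta1_1, eta1_2 and eta2_3 once each.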

Source: https://stackoverflow.com/questions/48818009/populating-new-variable-using-vlookup-with-multiple-criteria-in-another-variable

Tags

stata

lookup