R - multiple criteria search

你。 提交于 2020-06-17 11:05:25

问题


I have the following issue and I hope you may help me: I have a huge database (which I cannot disclose) but is it is structured as follows:

  • 5 million observations
  • 7 variables of which three of interest in this case:
  • Code ID Buy
  • Code ID Sell
  • Date

I would like another variable called new, which takes the value 0 in line i if:

  • there exists an observation k, with Code_IB_Buy_[i]=Code_IB_Buy_[k] and Code_IB_Sell_[i]=Code_IB_Sell_[k] and Date[i] is after Date[k] if not, I would like new[i]=1.

Basically if there was no transaction before between the buyer and the seller, this is the first occurrence in line i so the variable new takes the value 1.

What I have tried until now, is to sort the database (as a data.table) by date and by buyer. Then I use a For Loop, which would work in other languages but here it does not for some reason.

EDIT: A reduced sample would look like this:

    library(data.table)
set.seed(1)
Data <- data.frame(
  Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5),
  Amount <- rnorm(21,mean=100,sd=20),
  Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
  Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V","98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A"),
  Distancewithlastr1 = c(NA,NA,NA,NA,NA,NA,NA,1,1,1,2,2,2,2,1,1,1,1,0,1,2),
LastRelationshipseller = c("98C","98C","98C","98C","98C","98C","98C","96V","96V","96V","96V","96V","96V","96V","94D","94D","94D","94D",NA,"25A","25A")
    )

Data$new<-0

setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
  r <- rowid(Code_ID_Buy, Code_ID_Sell)
  +(r==1L)
}]

Data[Month==1L, new:=0L]

    View(Data)

The data is already ordered since I used (and it works fine) : Data<-Data[order(month),] Data<-Data[order(Code_ID_Buy),]

However, what I would like to do now is to introduce the new column "new" that takes value 1 if and if only R does not find any observation where ID_Buy and ID_Sell are the same before the given date (new customer relationship)

In the example above, line 7,14 and 19 should be marked with 1 but none of the others.

In a second step, I would like to calculate the number of months between the first and second relationship (a column "distance", which would take the value "month of new relationship"-"month of last new relationship" else, that is in line 14, it would take the value "2",5-3, but this is getting ahead of myself).

Is this sufficient information ? Thank you a lot already for your time.

EDIT : @chinsoon12 thank you a lot, you are really good ! this seem to work like that(I actually didn't want NA when new = 0, so I changed your formula by removing new==1L and it does not work because the distance for all lines with new=0 is 1 because it calculates the time not with the last new == 1L but the distance with the last new==0L)

I have also however a small question : I used

Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
    by=.EACHI, i.Month - x.Month]$V1], but is it possible to know which was the Code_ID_Sell for the line which serves as basis (in line "x" from what I understand)  for each i?

回答1:


An option is to order by Month, Code_ID_Buy and Code_ID_Sell and then set the first row in each group of Code_ID_Buy and Code_ID_Sell to 1:

library(data.table)
setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), 
    new := +(rowid(Code_ID_Buy, Code_ID_Sell) == 1L)]

Your requirement basically translates into finding the first combi of Code_ID_Buy and Code_ID_Sell, here is another option using duplicated:

setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell),
    new := !duplicated(.SD), .SDcols=c("Code_ID_Buy", "Code_ID_Sell")]

output:

    Month    Amount Code_ID_Buy Code_ID_Sell new
 1:     1  87.47092        100D          98C   1
 2:     1 103.67287        100D          99C   1
 3:     2  83.28743        100D          98C   0
 4:     2 131.90562        100D          99C   0
 5:     3 106.59016        100D          98C   0
 6:     3  83.59063        100D          99C   0
 7:     3 109.74858        100D          96V   1
 8:     4 114.76649        100D          98C   0
 9:     4 111.51563        100D          99C   0
10:     4  93.89223        100D          96V   0
11:     5 130.23562        100D          98C   0
12:     5 107.79686        100D          99C   0
13:     5  87.57519        100D          96V   0
14:     5  55.70600        100D          94D   1
15:     6 122.49862        100D          98C   0
16:     6  99.10133        100D          99C   0
17:     6  99.67619        100D          96V   0
18:     6 118.87672        100D          94D   0
19:     3 116.42442        102D          25A   1
20:     4 111.87803        102D          25A   0
21:     5 118.37955        102D          25A   0
    Month    Amount Code_ID_Buy Code_ID_Sell new

This solution has 1 for the first few rows as well since there are no rows prior to them with the same Code_ID_Buy or Code_ID_Sell. If it is necessary to remove them, you can use Data[Month==1L, new := 0L]


For the 2nd question, you can use a non-equi join to find rows where Month is before current one and new=1L for the same Code_ID_Buy:

Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
    by=.EACHI, i.Month - x.Month]$V1]

output:

    Month    Amount Code_ID_Buy Code_ID_Sell new distance
 1:     1  87.47092        100D          98C   1       NA
 2:     1 103.67287        100D          99C   1       NA
 3:     2  83.28743        100D          98C   0       NA
 4:     2 131.90562        100D          99C   0       NA
 5:     3 106.59016        100D          98C   0       NA
 6:     3  83.59063        100D          99C   0       NA
 7:     3 109.74858        100D          96V   1        2
 8:     4 114.76649        100D          98C   0       NA
 9:     4 111.51563        100D          99C   0       NA
10:     4  93.89223        100D          96V   0       NA
11:     5 130.23562        100D          98C   0       NA
12:     5 107.79686        100D          99C   0       NA
13:     5  87.57519        100D          96V   0       NA
14:     5  55.70600        100D          94D   1        2
15:     6 122.49862        100D          98C   0       NA
16:     6  99.10133        100D          99C   0       NA
17:     6  99.67619        100D          96V   0       NA
18:     6 118.87672        100D          94D   0       NA
19:     3 116.42442        102D          25A   1       NA
20:     4 111.87803        102D          25A   0       NA
21:     5 118.37955        102D          25A   0       NA
    Month    Amount Code_ID_Buy Code_ID_Sell new distance

for your 3rd qn, you can try:

Data[, dlr := if (k>0L) rleid(Month) - 1L, .(Code_ID_Buy, k=cumsum(new))] 

output:

    Month    Amount Code_ID_Buy Code_ID_Sell Distancewithlastr1 LastRelationshipseller new distance dlr
 1:     1  87.47092        100D          98C                 NA                    98C   0       NA  NA
 2:     1 103.67287        100D          99C                 NA                    98C   0       NA  NA
 3:     2  83.28743        100D          98C                 NA                    98C   0       NA  NA
 4:     2 131.90562        100D          99C                 NA                    98C   0       NA  NA
 5:     3 106.59016        100D          98C                 NA                    98C   0       NA  NA
 6:     3  83.59063        100D          99C                 NA                    98C   0       NA  NA
 7:     3 109.74858        100D          96V                  0                    98C   1       NA   0
 8:     4 114.76649        100D          98C                  1                    96V   0       NA   1
 9:     4 111.51563        100D          99C                  1                    96V   0       NA   1
10:     4  93.89223        100D          96V                  1                    96V   0       NA   1
11:     5 130.23562        100D          98C                  2                    96V   0       NA   2
12:     5 107.79686        100D          99C                  2                    96V   0       NA   2
13:     5  87.57519        100D          96V                  2                    96V   0       NA   2
14:     5  55.70600        100D          94D                  2                    96V   1        2   0
15:     6 122.49862        100D          98C                  1                    94D   0       NA   1
16:     6  99.10133        100D          99C                  1                    94D   0       NA   1
17:     6  99.67619        100D          96V                  1                    94D   0       NA   1
18:     6 118.87672        100D          94D                  1                    94D   0       NA   1
19:     3 116.42442        102D          25A                  0                   <NA>   1       NA   0
20:     4 111.87803        102D          25A                  1                    25A   0       NA   1
21:     5 118.37955        102D          25A                  2                    25A   0       NA   2
    Month    Amount Code_ID_Buy Code_ID_Sell Distancewithlastr1 LastRelationshipseller new distance dlr

p.s.: Welcome to SO. Just to let you know that it might be good ettiquete to post a new question rather than add a new question to a previous question and in addition, you might get others to respond as well.



来源:https://stackoverflow.com/questions/60441583/r-multiple-criteria-search

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!