问题
I have the following issue and I hope you may help me: I have a huge database (which I cannot disclose) but is it is structured as follows:
- 5 million observations
- 7 variables of which three of interest in this case:
- Code ID Buy
- Code ID Sell
- Date
I would like another variable called new
, which takes the value 0
in line i
if:
- there exists an observation
k
, withCode_IB_Buy_[i]=Code_IB_Buy_[k]
andCode_IB_Sell_[i]=Code_IB_Sell_[k]
andDate[i]
is afterDate[k]
if not, I would likenew[i]=1
.
Basically if there was no transaction before between the buyer and the seller, this is the first occurrence in line i
so the variable new
takes the value 1
.
What I have tried until now, is to sort the database (as a data.table) by date and by buyer. Then I use a For Loop, which would work in other languages but here it does not for some reason.
EDIT: A reduced sample would look like this:
library(data.table)
set.seed(1)
Data <- data.frame(
Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5),
Amount <- rnorm(21,mean=100,sd=20),
Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V","98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A"),
Distancewithlastr1 = c(NA,NA,NA,NA,NA,NA,NA,1,1,1,2,2,2,2,1,1,1,1,0,1,2),
LastRelationshipseller = c("98C","98C","98C","98C","98C","98C","98C","96V","96V","96V","96V","96V","96V","96V","94D","94D","94D","94D",NA,"25A","25A")
)
Data$new<-0
setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
r <- rowid(Code_ID_Buy, Code_ID_Sell)
+(r==1L)
}]
Data[Month==1L, new:=0L]
View(Data)
The data is already ordered since I used (and it works fine) : Data<-Data[order(month),] Data<-Data[order(Code_ID_Buy),]
However, what I would like to do now is to introduce the new column "new" that takes value 1 if and if only R does not find any observation where ID_Buy and ID_Sell are the same before the given date (new customer relationship)
In the example above, line 7,14 and 19 should be marked with 1 but none of the others.
In a second step, I would like to calculate the number of months between the first and second relationship (a column "distance", which would take the value "month of new relationship"-"month of last new relationship" else, that is in line 14, it would take the value "2",5-3, but this is getting ahead of myself).
Is this sufficient information ? Thank you a lot already for your time.
EDIT : @chinsoon12 thank you a lot, you are really good ! this seem to work like that(I actually didn't want NA when new = 0, so I changed your formula by removing new==1L and it does not work because the distance for all lines with new=0 is 1 because it calculates the time not with the last new == 1L but the distance with the last new==0L)
I have also however a small question : I used
Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
by=.EACHI, i.Month - x.Month]$V1], but is it possible to know which was the Code_ID_Sell for the line which serves as basis (in line "x" from what I understand) for each i?
回答1:
An option is to order by Month
, Code_ID_Buy
and Code_ID_Sell
and then set the first row in each group of Code_ID_Buy
and Code_ID_Sell
to 1:
library(data.table)
setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell),
new := +(rowid(Code_ID_Buy, Code_ID_Sell) == 1L)]
Your requirement basically translates into finding the first combi of Code_ID_Buy and Code_ID_Sell, here is another option using duplicated
:
setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell),
new := !duplicated(.SD), .SDcols=c("Code_ID_Buy", "Code_ID_Sell")]
output:
Month Amount Code_ID_Buy Code_ID_Sell new
1: 1 87.47092 100D 98C 1
2: 1 103.67287 100D 99C 1
3: 2 83.28743 100D 98C 0
4: 2 131.90562 100D 99C 0
5: 3 106.59016 100D 98C 0
6: 3 83.59063 100D 99C 0
7: 3 109.74858 100D 96V 1
8: 4 114.76649 100D 98C 0
9: 4 111.51563 100D 99C 0
10: 4 93.89223 100D 96V 0
11: 5 130.23562 100D 98C 0
12: 5 107.79686 100D 99C 0
13: 5 87.57519 100D 96V 0
14: 5 55.70600 100D 94D 1
15: 6 122.49862 100D 98C 0
16: 6 99.10133 100D 99C 0
17: 6 99.67619 100D 96V 0
18: 6 118.87672 100D 94D 0
19: 3 116.42442 102D 25A 1
20: 4 111.87803 102D 25A 0
21: 5 118.37955 102D 25A 0
Month Amount Code_ID_Buy Code_ID_Sell new
This solution has 1 for the first few rows as well since there are no rows prior to them with the same Code_ID_Buy or Code_ID_Sell. If it is necessary to remove them, you can use Data[Month==1L, new := 0L]
For the 2nd question, you can use a non-equi join to find rows where Month is before current one and new=1L for the same Code_ID_Buy:
Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
by=.EACHI, i.Month - x.Month]$V1]
output:
Month Amount Code_ID_Buy Code_ID_Sell new distance
1: 1 87.47092 100D 98C 1 NA
2: 1 103.67287 100D 99C 1 NA
3: 2 83.28743 100D 98C 0 NA
4: 2 131.90562 100D 99C 0 NA
5: 3 106.59016 100D 98C 0 NA
6: 3 83.59063 100D 99C 0 NA
7: 3 109.74858 100D 96V 1 2
8: 4 114.76649 100D 98C 0 NA
9: 4 111.51563 100D 99C 0 NA
10: 4 93.89223 100D 96V 0 NA
11: 5 130.23562 100D 98C 0 NA
12: 5 107.79686 100D 99C 0 NA
13: 5 87.57519 100D 96V 0 NA
14: 5 55.70600 100D 94D 1 2
15: 6 122.49862 100D 98C 0 NA
16: 6 99.10133 100D 99C 0 NA
17: 6 99.67619 100D 96V 0 NA
18: 6 118.87672 100D 94D 0 NA
19: 3 116.42442 102D 25A 1 NA
20: 4 111.87803 102D 25A 0 NA
21: 5 118.37955 102D 25A 0 NA
Month Amount Code_ID_Buy Code_ID_Sell new distance
for your 3rd qn, you can try:
Data[, dlr := if (k>0L) rleid(Month) - 1L, .(Code_ID_Buy, k=cumsum(new))]
output:
Month Amount Code_ID_Buy Code_ID_Sell Distancewithlastr1 LastRelationshipseller new distance dlr
1: 1 87.47092 100D 98C NA 98C 0 NA NA
2: 1 103.67287 100D 99C NA 98C 0 NA NA
3: 2 83.28743 100D 98C NA 98C 0 NA NA
4: 2 131.90562 100D 99C NA 98C 0 NA NA
5: 3 106.59016 100D 98C NA 98C 0 NA NA
6: 3 83.59063 100D 99C NA 98C 0 NA NA
7: 3 109.74858 100D 96V 0 98C 1 NA 0
8: 4 114.76649 100D 98C 1 96V 0 NA 1
9: 4 111.51563 100D 99C 1 96V 0 NA 1
10: 4 93.89223 100D 96V 1 96V 0 NA 1
11: 5 130.23562 100D 98C 2 96V 0 NA 2
12: 5 107.79686 100D 99C 2 96V 0 NA 2
13: 5 87.57519 100D 96V 2 96V 0 NA 2
14: 5 55.70600 100D 94D 2 96V 1 2 0
15: 6 122.49862 100D 98C 1 94D 0 NA 1
16: 6 99.10133 100D 99C 1 94D 0 NA 1
17: 6 99.67619 100D 96V 1 94D 0 NA 1
18: 6 118.87672 100D 94D 1 94D 0 NA 1
19: 3 116.42442 102D 25A 0 <NA> 1 NA 0
20: 4 111.87803 102D 25A 1 25A 0 NA 1
21: 5 118.37955 102D 25A 2 25A 0 NA 2
Month Amount Code_ID_Buy Code_ID_Sell Distancewithlastr1 LastRelationshipseller new distance dlr
p.s.: Welcome to SO. Just to let you know that it might be good ettiquete to post a new question rather than add a new question to a previous question and in addition, you might get others to respond as well.
来源:https://stackoverflow.com/questions/60441583/r-multiple-criteria-search