I am trying to find the first/last observation by group. I tried both R and Excel (R was so slow that I tried Excel as well). The excel
In R, we can use dplyr. After grouping by 'Shopper', create the 'Flag' column for the first observation using the logical condition row_number() < 2, and convert the logical to integer if required.
library(dplyr)
df1 %>%
  group_by(Shopper) %>%
  mutate(Flag = as.integer(row_number() < 2))
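Since the question asks for the first and last observation, a base R sketch that flags both may be useful. The data frame here is an assumption: it is rebuilt from the output printed further down in this thread, not taken from the OP's actual data.

```r
# Sample data reconstructed from the printed output in this thread (an assumption)
df1 <- data.frame(
  Day     = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 2L),
  Shopper = c("A", "A", "B", "C", "C", "C", "D", "D"),
  Choice  = c("apple", "apple", "Banana", "apple", "Banana", "apple", "berry", "berry"),
  stringsAsFactors = FALSE
)

# A row is the first of its group if its Shopper has not been seen before,
# and the last of its group if it has not been seen when scanning from the end
is_first <- !duplicated(df1$Shopper)
is_last  <- !duplicated(df1$Shopper, fromLast = TRUE)

df1$Flag <- as.integer(is_first | is_last)
df1$Flag
#[1] 1 1 1 1 0 1 1 1
```

Unlike the row_number() version, this does not need a grouping step, but it does assume the rows of each shopper are already in the desired order.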
If we can use the minimum and maximum 'Day' as the identifier, then use the logical condition based on that.
df1 %>%
  group_by(Shopper) %>%
  mutate(Flag = as.integer(Day %in% range(Day)))
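One caveat with the range() condition: it flags every row whose 'Day' equals the group minimum or maximum, so a shopper with repeated days gets more than one "first" or "last" row. A base R sketch of the same condition using ave(), on made-up data (the vectors below are hypothetical, not the OP's):

```r
# Hypothetical data: shopper "A" has two rows on day 1
Shopper <- c("A", "A", "A", "A", "B", "B")
Day     <- c(1L, 1L, 2L, 3L, 1L, 3L)

# Same condition as Day %in% range(Day), applied within each Shopper group
Flag <- ave(Day, Shopper, FUN = function(d) d %in% range(d))
as.integer(Flag)
#[1] 1 1 0 1 1 1
```

Both day-1 rows of shopper "A" are flagged, which may or may not be what is wanted.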
Or using data.table
library(data.table)
setDT(df1)[, Flag := as.integer(Day %in% range(Day)), by = Shopper]
Or using base R, we can compare the previous 'Shopper' with the current 'Shopper' (assuming that the dataset is already ordered)
i1 <- with(df1, Shopper[-1]!= Shopper[-nrow(df1)])
as.integer(c(TRUE, i1)|c(i1, TRUE))
#[1] 1 1 1 1 0 1 1 1
All these methods should be faster than the for loop in the OP's code.
Based on the updated expected output, if we need to replace the 1st observation with "0" while the others remain the same, either ifelse or replace can be used. Then, using the lead of 'tagging', we create 'tagChoice2'.
df1 %>%
  group_by(Shopper) %>%
  mutate(tagging = ifelse(row_number() == 1, "0", as.character(Choice)),
         tagChoice2 = lead(tagging, default = "0"))
#    Day Shopper Choice tagging tagChoice2
#  <int>   <chr>  <chr>   <chr>      <chr>
#1     1       A  apple       0      apple
#2     2       A  apple   apple          0
#3     1       B Banana       0          0
#4     1       C  apple       0     Banana
#5     2       C Banana  Banana      apple
#6     3       C  apple   apple          0
#7     1       D  berry       0      berry
#8     2       D  berry   berry          0
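For comparison, the same two columns can be sketched in base R, with ave() standing in for the grouped lead(). The vectors are copied from the output above; treating the rows as already ordered by Shopper and Day is an assumption.

```r
Shopper <- c("A", "A", "B", "C", "C", "C", "D", "D")
Choice  <- c("apple", "apple", "Banana", "apple", "Banana", "apple", "berry", "berry")

# Replace the first observation of each shopper with "0"
tagging <- ifelse(!duplicated(Shopper), "0", Choice)

# Shift 'tagging' up by one row within each shopper,
# padding the last row of the group with "0" (like lead(..., default = "0"))
tagChoice2 <- ave(tagging, Shopper, FUN = function(x) c(x[-1], "0"))

data.frame(Shopper, Choice, tagging, tagChoice2)
```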
You can try installing Microsoft R Open as your default R. For math-heavy calculations it can be much faster than base R, because it uses multiple cores while base R uses only one.
I was looking for an answer on finding the first and last value of a column by group in data.table. After looking here and there, and thinking about it, here you go.
To order the rows and number them within each group:
library(data.table)
DT <- data.table(col1 = rep(LETTERS[1:2], each = 4), col2 = c(3,12,5,56,6,678,233,70))
setorder(DT, col1, col2)
DT
   col1 col2
1:    A    3
2:    A    5
3:    A   12
4:    A   56
5:    B    6
6:    B   70
7:    B  233
8:    B  678
DT[, rank := order(col2), by = col1]
DT
   col1 col2 rank
1:    A    3    1
2:    A    5    2
3:    A   12    3
4:    A   56    4
5:    B    6    1
6:    B   70    2
7:    B  233    3
8:    B  678    4
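Note that order() returns the sorting permutation rather than the rank; the two coincide here only because DT was already sorted with setorder(). A small base R sketch on a made-up unsorted vector shows the difference:

```r
x <- c(12, 3, 56, 5)  # hypothetical unsorted values

order(x)  # positions of the elements, listed in sorted order
#[1] 2 4 1 3

rank(x)   # rank of each element, in its original position
#[1] 3 1 4 2
```

On unsorted data, rank() (or data.table's frank()) is probably the safer choice for a column meant to hold ranks.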
To create first and last values by group:
DT[, first_val := col2[1], by = col1]
DT[, last_val := col2[.N], by = col1]
DT
   col1 col2 rank first_val last_val
1:    A    3    1         3       56
2:    A    5    2         3       56
3:    A   12    3         3       56
4:    A   56    4         3       56
5:    B    6    1         6      678
6:    B   70    2         6      678
7:    B  233    3         6      678
8:    B  678    4         6      678
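As a cross-check without data.table, the same per-group first and last values can be sketched in base R with ave(). The vectors repeat the sorted example data; ave() recycles the single value returned for each group across that group's rows.

```r
col1 <- rep(LETTERS[1:2], each = 4)
col2 <- c(3, 5, 12, 56, 6, 70, 233, 678)  # already sorted within each group

# First and last element of col2 within each col1 group
first_val <- ave(col2, col1, FUN = function(x) x[1])
last_val  <- ave(col2, col1, FUN = function(x) x[length(x)])

data.frame(col1, col2, first_val, last_val)
```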
First, assuming the data are sorted by Shopper and then by Day in ascending order, you can add a column indicating the purchase number with
df$Purchase <- unlist(with(df, tapply(Shopper, Shopper, seq_along)))
df
#  Day Shopper Choice Purchase
#1   1       A  apple        1
#2   2       A  apple        2
#3   1       B Banana        1
#4   1       C  apple        1
#5   2       C Banana        2
#6   3       C  apple        3
#7   1       D  berry        1
#8   2       D  berry        2
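An alternative sketch for the purchase number uses ave(), which writes each group's result back to the rows it came from and therefore does not depend on the rows being grouped by Shopper. The second vector is a made-up interleaved example.

```r
Shopper <- c("A", "A", "B", "C", "C", "C", "D", "D")

# Running count of rows within each shopper
Purchase <- ave(seq_along(Shopper), Shopper, FUN = seq_along)
Purchase
#[1] 1 2 1 1 2 3 1 2

# Hypothetical interleaved rows: the counts still line up with the right shopper
Mixed <- c("C", "A", "C", "A")
ave(seq_along(Mixed), Mixed, FUN = seq_along)
#[1] 1 1 2 2
```

The rows still need to be in Day order within each shopper for the count to mean "purchase number".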
Then reshape the data-frame to "wide" format with
df.w <- reshape(df[c('Shopper', 'Choice', 'Purchase')],
                idvar = 'Shopper', v.names = 'Choice', timevar = 'Purchase',
                direction = 'wide')
df.w
#  Shopper Choice.1 Choice.2 Choice.3
#1       A    apple    apple     <NA>
#3       B   Banana     <NA>     <NA>
#4       C    apple   Banana    apple
#7       D    berry    berry     <NA>
Finally you calculate the repurchase matrix of the first two purchases
with(df.w, prop.table(table(First=Choice.1, Second=Choice.2)))
#        Second
#First        apple    Banana     berry
#  apple  0.3333333 0.3333333 0.0000000
#  Banana 0.0000000 0.0000000 0.0000000
#  berry  0.0000000 0.0000000 0.3333333
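To calculate the repurchase matrix of all purchases, start with the repurchase matrices of every two consecutive purchases (this assumes Choice is a factor, so that every table has the same levels and therefore the same dimensions)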
repurchase <- lapply(seq(2, ncol(df.w) - 1),
                     function(i) table(First = df.w[[i]], Second = df.w[[i + 1]]))
repurchase <- simplify2array(repurchase)
repurchase
#, , 1
#
#        Second
#First    apple Banana berry
#  apple      1      1     0
#  Banana     0      0     0
#  berry      0      0     1
#
#, , 2
#
#        Second
#First    apple Banana berry
#  apple      0      0     0
#  Banana     1      0     0
#  berry      0      0     0
then add all matrices to get the "total" repurchase matrix
apply(repurchase, 1:2, sum)
#        Second
#First    apple Banana berry
#  apple      1      1     0
#  Banana     1      0     0
#  berry      0      0     1
(absolute frequencies)
prop.table(apply(repurchase, 1:2, sum))
#        Second
#First    apple Banana berry
#  apple   0.25   0.25  0.00
#  Banana  0.25   0.00  0.00
#  berry   0.00   0.00  0.25
(relative frequencies)
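The total matrix can also be sketched without reshaping, by pairing each purchase with the next one of the same shopper. This assumes the rows are sorted by Shopper and then by Day; the level vector lev is spelled out by hand so the table layout does not depend on the locale's sort order.

```r
Shopper <- c("A", "A", "B", "C", "C", "C", "D", "D")
Choice  <- c("apple", "apple", "Banana", "apple", "Banana", "apple", "berry", "berry")

n <- length(Shopper)
# TRUE where row i and row i + 1 belong to the same shopper
same_shopper <- Shopper[-n] == Shopper[-1]

# Fixed levels so that the table is always 3 x 3
lev <- c("apple", "Banana", "berry")
First  <- factor(Choice[-n][same_shopper], levels = lev)
Second <- factor(Choice[-1][same_shopper], levels = lev)

total <- table(First, Second)
total              # absolute frequencies
prop.table(total)  # relative frequencies
```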