Add column to dataframe depending on specific row values

Question

I have been stuck on this problem for a few days.

Here is an example of my data.frame; I hope a solution for it will also work with my real one.

df <- read.table(text = 'ID    Day Count
    33012   9526    4
    35004   9526    4
    37006   9526    4
    37008   9526    4
    21009   1913    3
    24005   1913    3
    25009   1913    3
    22317   2286    2
    37612   2286    2
    25009   14329   1
    48007   9525    0
    88662   9524    0
    1845    9524    0
    8872    2285    0
    49002   1912    0
    1664    1911    0', header = TRUE)

I need to add a new column (new_col) to my data.frame, with values from 1 to 4. Each new_col value has to cover day x, day x - 1 and day x - 2, where x = 9526, 1913, 2286, 14329 (values from the Day column).

My output should be the following:

   ID    Day Count  new_col
33012   9526    4     1
35004   9526    4     1
37006   9526    4     1
37008   9526    4     1
21009   1913    3     2
24005   1913    3     2
25009   1913    3     2
22317   2286    2     3
37612   2286    2     3
25009   14329   1     4
48007   9525    0     1
88662   9524    0     1
1845    9524    0     1
8872    2285    0     3
49002   1912    0     2
1664    1911    0     2

The data.frame ordered by new_col would then be:

   ID    Day Count  new_col
33012   9526    4     1
35004   9526    4     1
37006   9526    4     1
37008   9526    4     1
48007   9525    0     1
88662   9524    0     1
1845    9524    0     1
21009   1913    3     2
24005   1913    3     2
25009   1913    3     2
49002   1912    0     2
1664    1911    0     2
22317   2286    2     3
37612   2286    2     3
8872    2285    0     3
25009   14329   1     4

My real data.frame is more complex than the example (more columns and more values in the Count column), so please be patient if I update the question.

Any suggestions would be really helpful.

Answer 1:

I'm not sure I totally understand your question, but it seems like you could use cut() to achieve this, as follows:

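# breakpoints: the reference days, sorted ascending
x <- c(1913, 2286, 9526, 14329)
# bin each Day into (previous breakpoint, breakpoint]
df$new_col <- cut(df$Day, c(-Inf, x, Inf))
# renumber the bins by order of first appearance in df
df$new_col <- as.numeric(factor(df$new_col, levels = unique(df$new_col)))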
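
One caveat that may matter for the real data, shown as a small sketch: cut() assigns every day between two consecutive breakpoints to a bin, so a day more than two days before x still ends up in the bin of x. The days vector below is made up for illustration, and labels = FALSE is used only to get plain integer bin codes.

```r
x <- c(1913, 2286, 9526, 14329)

# Hypothetical days: 9523 and 2000 are NOT within two days of any value in x
days <- c(9526, 9524, 9523, 2000)

# labels = FALSE returns the integer index of the interval each day falls in
codes <- cut(days, c(-Inf, x, Inf), labels = FALSE)
print(codes)
# 9523 gets the same code as 9526, and 2000 gets the code of the 2286 bin
```

If days outside the x - 2 .. x window can occur, an explicit range test is probably the safer choice.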

Answer 2:

Here is a non-scalable but easy-to-understand solution using the dplyr package. We can use case_when to recode Day based on which window it falls in:

library(dplyr)
# between(Day, x - 2, x) is TRUE only for day x, x - 1 and x - 2
df %>% mutate(new_col = case_when(between(Day, 9526 - 2, 9526) ~ 1,
                                  between(Day, 1913 - 2, 1913) ~ 2,
                                  between(Day, 2286 - 2, 2286) ~ 3,
                                  between(Day, 14329 - 2, 14329) ~ 4)) %>%
    arrange(new_col)

#       ID   Day Count new_col
# 1  33012  9526     4       1
# 2  35004  9526     4       1
# 3  37006  9526     4       1
# 4  37008  9526     4       1
# 5  48007  9525     0       1
# 6  88662  9524     0       1
# 7   1845  9524     0       1
# 8  21009  1913     3       2
# 9  24005  1913     3       2
# 10 25009  1913     3       2
# 11 49002  1912     0       2
# 12  1664  1911     0       2
# 13 22317  2286     2       3
# 14 37612  2286     2       3
# 15  8872  2285     0       3
# 16 25009 14329     1       4

A more scalable approach is to use foverlaps from the data.table package. We prepare a lookup table of day ranges, then join it back to the original table with a "within" type join, which makes sure each Day falls inside one of the ranges in the lookup table. See ?foverlaps for a fuller explanation.

library(data.table)
# prepare the look up table
x <- c(9526, 1913, 2286, 14329)
dt1 <- data.table(start = x - 2, end = x, new_col = 1:4)
setkey(dt1)
dt1
#    start   end new_col
# 1:  1911  1913       2
# 2:  2284  2286       3
# 3:  9524  9526       1
# 4: 14327 14329       4

# prepare the original table
dt = copy(setDT(df))
dt[, Day2 := Day]

# do a foverlaps
foverlaps(dt, dt1,
          by.x = c("Day", "Day2"), by.y = c("start", "end"),
          type = "within", mult = "all", nomatch = 0L
)[, .(ID, Day, Count, new_col)][order(new_col)]

#       ID   Day Count new_col
# 1  33012  9526     4       1
# 2  35004  9526     4       1
# 3  37006  9526     4       1
# 4  37008  9526     4       1
# 5  48007  9525     0       1
# 6  88662  9524     0       1
# 7   1845  9524     0       1
# 8  21009  1913     3       2
# 9  24005  1913     3       2
# 10 25009  1913     3       2
# 11 49002  1912     0       2
# 12  1664  1911     0       2
# 13 22317  2286     2       3
# 14 37612  2286     2       3
# 15  8872  2285     0       3
# 16 25009 14329     1       4
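
For comparison, the same lookup can be sketched in base R without any packages. This is only an illustration under stated assumptions: assign_window is a hypothetical helper name, the data is a subset of the question's rows, and when windows overlap the first matching entry of x wins (mirroring case_when).

```r
# A subset of the example data from the question
df <- data.frame(
  ID    = c(33012, 21009, 22317, 25009, 48007, 88662, 8872, 49002, 1664),
  Day   = c(9526, 1913, 2286, 14329, 9525, 9524, 2285, 1912, 1911),
  Count = c(4, 3, 2, 1, 0, 0, 0, 0, 0)
)

# Reference days; their position in x defines the new_col value
x <- c(9526, 1913, 2286, 14329)

# Hypothetical helper: for each day, the index of the first window
# [ref - width, ref] that contains it, or NA if no window matches
assign_window <- function(day, ref, width = 2) {
  out <- rep(NA_integer_, length(day))
  # iterate in reverse so that earlier entries of ref overwrite later ones
  for (i in rev(seq_along(ref))) {
    inside <- day >= ref[i] - width & day <= ref[i]
    out[inside] <- i
  }
  out
}

df$new_col <- assign_window(df$Day, x)
df[order(df$new_col), ]
```

Unlike the nomatch = 0L join above, this keeps rows that match no window and gives them NA, which may or may not be what the real data needs.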

Source: https://stackoverflow.com/questions/38660467/add-column-to-dataframe-depending-on-specific-row-values

Tags

dataframe

add