R combining duplicate rows by ID with different column types in a dataframe

问题

I have a dataframe with a column ID as an identifier and some other columns of different types (factors and numerics). It looks like this

df <- data.frame(id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "rot")),
                 gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

Now I want to combine the duplicate IDs. The numeric columns are defined as the mean value of all identical IDs (without the NAs!). The factor columns are combined into one. The NAs can be omitted.

In the end the result should look like this

dfRes <- data.frame(id    = c(1, 2, 3, 4),
                    abst  = c(1, 0, 0, 3),
                    farbe = as.factor(c("keine", "keine", "keine", "rot")),
                    gier  = c(2.5, 0, 0, 3))

I hope there is a way to do the calculations fast, because I have about a million observations. Thanks in advance!

Edit (additions): "farbe" may not be unique. In this case I think the best idea for my data is to have a duplicate row but only with a different "farbe", so there are 2 identical IDs and all the same but different values for "farbe". This should be just very rare case, but a great addition.

I have a lot more numeric and factor columns in my real data. Is it also possible to create a solution, so I don't have to define every single column?

回答1:

EDIT:

Just saw your edit about non-unique factor columns and selecting columns by type. This will work but I will think about a cleaner way to do this and report back (I am sure there is a simple way). If you want to manually specify columns like the original example and you have non-unique factors, just use unlist() with unique() in the same fashion as below. Alternatively, you could consider combining all factor levels on one line using paste() with collapse = "; " or something to that effect. If you want to change the column order for the final data.table, use setcolorder() on the data.table

setDT(df)

# For selecting columns later
num_cols <- sapply(df, is.numeric)
num_cols[names(num_cols) == "id"] <- FALSE
fac_cols <- sapply(df, is.factor)

df[, lapply(.SD, mean, na.rm = T), by = id, .SDcols = num_cols][
  df[, lapply(.SD, function(i) unlist(unique(i[!is.na(i)]))), by = id, .SDcols = fac_cols], on = "id"]

   id abst gier farbe
1:  1    1  2.5 keine
2:  2    0  0.0 keine
3:  3    0  0.0 keine
4:  4    3  3.0  rot2
5:  4    3  3.0   rot

How it works: It joins the numeric column summary

df[, lapply(.SD, mean, na.rm = TRUE), by = id, .SDcols = num_cols]

with the factor column summary

df[, lapply(.SD, function(i) unlist(unique(i[!is.na(i)]))), by = id, .SDcols = fac_cols]

Data for edit:

df <- data.frame(id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4),
                 abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                 farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, "rot2", "rot", "rot")),
                 gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))

Original answer:

Here is one of many data.table solutions. This orders the data.table by the factor column so it can grab the top value while summarizing. I also converted it back to a pure data.frame but you do not have to do that if you do not want to. Hope this helps!

Also, this assumes that farbe will be the same for each id

library(data.table)

setDT(df)

df <- df[order(farbe), .(abst = mean(abst, na.rm = TRUE),
                         farbe = farbe[1],
                         gier = mean(gier, na.rm = TRUE)), by = id]

setDF(df)
df
  id abst farbe gier
1  1    1 keine  2.5
2  2    0 keine  0.0
3  3    0 keine  0.0
4  4    3   rot  3.0

回答2:

A dplyr solution.

library(dplyr)

df %>% 
  group_by(id) %>% 
  summarise(abst = mean(na.omit(abst)), 
            farbe = na.omit(farbe)[1],
            gier = mean(na.omit(gier)))
#> # A tibble: 4 x 4
#>      id  abst farbe  gier
#>   <dbl> <dbl> <fct> <dbl>
#> 1     1     1 keine   2.5
#> 2     2     0 keine   0  
#> 3     3     0 keine   0  
#> 4     4     3 rot     3

^{Created on 2020-05-14 by the reprex package (v0.3.0)}

回答3:

Also a data.table solution:

    library(data.table)
    df <- data.table( # instead of data.frame
                     id    = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4), 
                     abst  = c(0, NA, 2, NA, NA, NA, 0, 0, NA, 2, NA, 3, 4),
                     farbe = as.factor(c("keine", NA, "keine", NA, NA, NA, "keine", "keine", NA, NA, NA, "rot", "rot")),
                     gier  = c(0, NA, 5, NA, NA, NA, 0, 0, NA, 1, NA, 6, 2))


    newdf <- df[,
                .(abst=mean(abst,na.rm=T), # perform mean of abst removing NAs
                  farbe=na.omit(unique(farbe)), # assuming farbe is unique for each ID, extract the unique value after removing the NAs
                  gier=mean(gier,na.rm=T)), # perform mean of gier removing NAs
                by=id] # for each ID

    newdf

       id abst farbe gier
    1:  1    1 keine  2.5
    2:  2    0 keine  0.0
    3:  3    0 keine  0.0
    4:  4    3   rot  3.0

回答4:

additional solution

df %>% 
  group_by(id) %>% 
  fill(farbe, .direction = "updown") %>% 
  group_by(id, farbe) %>% 
  summarise_all(~ mean(., na.rm = T))

来源：https://stackoverflow.com/questions/61798169/r-combining-duplicate-rows-by-id-with-different-column-types-in-a-dataframe

标签

dataframe

data.table

aggregate

tidyverse