How to use Dplyr's Summarize and which() to lookup min/max values

后端 未结 3 2015
我寻月下人不归
我寻月下人不归 2020-12-06 04:33

I have the following data:

Name <- c(\"Sam\", \"Sarah\", \"Jim\", \"Fred\", \"James\", \"Sally\", \"Andrew\", \"John\", \"Mairin\", \"Kate\", \"Sasha\", \         


        
相关标签:
3条回答
  • 2020-12-06 05:16

    Here are some data.table approaches, the first one borrowed from @akrun:

    setDT(data)
    
    # show one, wide format
    data[,c(min=.SD[which.min(Age)],max=.SD[which.max(Age)]),by=Group]
       # Group min.Name min.Age max.Name max.Age
    # 1:     A      Sam      22      Sam      22
    # 2:     B    Sarah      31    James      58
    # 3:     C   Andrew      17    Sally      82
    # 4:     D   Mairin      12      Ray      67
    
    # show all, long format
    data[,{
      mina=min(Age)
      maxa=max(Age)
      rbind(
        data.table(minmax="min",Age=mina,Name=Name[which(Age==mina)]),
        data.table(minmax="max",Age=maxa,Name=Name[which(Age==maxa)])
    )},by=Group]
       # Group minmax Age   Name
    # 1:     A    min  22    Sam
    # 2:     A    max  22    Sam
    # 3:     B    min  31  Sarah
    # 4:     B    min  31    Jim
    # 5:     B    max  58  James
    # 6:     C    min  17 Andrew
    # 7:     C    max  82  Sally
    # 8:     D    min  12 Mairin
    # 9:     D    max  67    Ray    
    

    I think the long format is the best, since it allows you to filter with minmax, but the code is tortured and inefficient.

    Here are some arguably less good ways:

    # show all, wide format (with a list column)
    data[,{
      mina=min(Age)
      maxa=max(Age)
      list(
        minAge=mina,
        maxAge=maxa,
        minNames=list(Name[Age==mina]),
        maxNames=list(Name[Age==maxa]))
    },by=Group]
       # Group minAge maxAge  minNames maxNames
    # 1:     A     22     22       Sam      Sam
    # 2:     B     31     58 Sarah,Jim    James
    # 3:     C     17     82    Andrew    Sally
    # 4:     D     12     67    Mairin      Ray
    
    
    # show all, wide format (with a string column)
    # (just look at @shadow's answer)
    
    0 讨论(0)
  • 2020-12-06 05:19

    I would actually recommend keeping your data in a "long" format. Here's how I would approach this:

    library(dplyr)
    

    Keeping all values when there are ties:

    data %>%
      group_by(Group) %>%
      arrange(Age) %>%  ## optional
      filter(Age %in% range(Age))
    # Source: local data frame [8 x 3]
    # Groups: Group
    # 
    #     Name Age Group
    # 1    Sam  22     A
    # 2  Sarah  31     B
    # 3    Jim  31     B
    # 4  James  58     B
    # 5 Andrew  17     C
    # 6  Sally  82     C
    # 7 Mairin  12     D
    # 8    Ray  67     D
    

    Keeping only one value when there are ties:

    data %>%
      group_by(Group) %>%
      arrange(Age) %>%
      slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
    # Source: local data frame [7 x 3]
    # Groups: Group
    # 
    #     Name Age Group
    # 1    Sam  22     A
    # 2  Sarah  31     B
    # 3  James  58     B
    # 4 Andrew  17     C
    # 5  Sally  82     C
    # 6 Mairin  12     D
    # 7    Ray  67     D
    

    If you really want a "wide" dataset, the basic concept would be to gather and spread the data, using "tidyr":

    library(dplyr)
    library(tidyr)
    
    data %>%
      group_by(Group) %>%
      arrange(Age) %>%
      slice(c(1, n())) %>%
      mutate(minmax = c("min", "max")) %>%
      gather(var, val, Name:Age) %>%
      unite(key, minmax, var) %>%
      spread(key, val)
    # Source: local data frame [4 x 5]
    # 
    #   Group max_Age max_Name min_Age min_Name
    # 1     A      22      Sam      22      Sam
    # 2     B      58    James      31    Sarah
    # 3     C      82    Sally      17   Andrew
    # 4     D      67      Ray      12   Mairin
    

    Though what wide form you would want with ties is unclear.

    0 讨论(0)
  • 2020-12-06 05:23

    You can use which.min and which.max to get the first value.

    data %>% group_by(Group) %>%
      summarize(minAge = min(Age), minAgeName = Name[which.min(Age)], 
                maxAge = max(Age), maxAgeName = Name[which.max(Age)])
    

    To get all values, use e.g. paste with an appropriate collapse argument.

    data %>% group_by(Group) %>%
      summarize(minAge = min(Age), minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "), 
                maxAge = max(Age), maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "))
    
    0 讨论(0)
提交回复
热议问题