Is there an R function for dropping duplicates of index variable based on lowest value in another column? [duplicate]

主宰稳场 submitted on 2019-12-24 07:36:20

Question


I am trying to analyse large data sets of student scores. Some students do retakes, which produces duplicate rows per student, usually with the earlier, lower score in the row above the retake score, which is usually higher. I want to keep only each student's highest score, so that the file has one line per student (I will need to merge it with other files that use the same ids).
The source file looks like this:

STUDID   MATRISUBJ  SUBJSCORE
1032        AfrikaansB  2
1032        isiZuluB    7
1033        IsiXhosaB   6
1034        AfrikaansB  1
1034        EnglishB    4
1034        isiZuluB    3

The result should look like this:

STUDID  MATRISUBJ   SUBJSCORE
1032        isiZuluB    7
1033        IsiXhosaB   6
1034        EnglishB    4

Help, please. I used to do this in SPSS, but I no longer have access to that commercial software, so I am switching to R.

df2[!duplicated(df2[1:1]),]

This gives the first row of each set of duplicates, but I want the row with the highest value. Sometimes a student also tries a different subject to get the required score in languages.
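For reference, a minimal base R sketch of how the duplicated() attempt above could be adapted: sort so that each student's highest score comes first, then drop the later duplicates. The data frame is rebuilt from the sample in the question, and the names sorted and best are only illustrative.

```r
# Sample data copied from the question; df2 is the name used in the attempt above.
df2 <- data.frame(
  STUDID    = c(1032, 1032, 1033, 1034, 1034, 1034),
  MATRISUBJ = c("AfrikaansB", "isiZuluB", "IsiXhosaB",
                "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)

# Sort by student, then by score from highest to lowest.
sorted <- df2[order(df2$STUDID, -df2$SUBJSCORE), ]

# duplicated() flags every row after the first for each STUDID,
# so negating it keeps the top-scoring row per student.
best <- sorted[!duplicated(sorted$STUDID), ]
rownames(best) <- NULL
best
```

If a student has two rows tied for the highest score, this keeps whichever one sorts first.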


Answer 1:


Heyo! The simplest solution would be to use dplyr's top_n() function. Combined with group_by(), it keeps the top n rows of each group, ranked by a numeric column (in your case SUBJSCORE).

The following code will give you what you need :)

  library(tidyverse)

  df %>% 
    group_by(STUDID) %>% 
    top_n(1, SUBJSCORE)
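One caveat: top_n() keeps every row that ties for the top score, so a student with two equal best scores would still appear twice. In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which has a with_ties argument. A minimal sketch, assuming a recent dplyr and the sample data from the question (the name best is only illustrative):

```r
library(dplyr)

# Sample data copied from the question.
df <- data.frame(
  STUDID    = c(1032, 1032, 1033, 1034, 1034, 1034),
  MATRISUBJ = c("AfrikaansB", "isiZuluB", "IsiXhosaB",
                "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)

best <- df %>%
  group_by(STUDID) %>%
  # with_ties = FALSE guarantees exactly one row per student.
  slice_max(SUBJSCORE, n = 1, with_ties = FALSE) %>%
  ungroup()

best
```

Exactly one row per STUDID is what you want if the result is later merged with other files by id.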



Answer 2:


You could use something like:

 library(dplyr)

 df %>%
   group_by(STUDID) %>%              # one group per student
   arrange(desc(SUBJSCORE)) %>%      # highest score first
   slice(1)                          # keep the first row of each group



Answer 3:


I typically do something like this using the tidyverse group of packages:

library(tidyverse)

df <- data.frame(id = c('a','a','a','b','b','c','c','c')
             , score = c(90,92,93,75,87,67,68,73)
             , tesno = c(1,2,3,1,2,1,2,3))

df %>%
  group_by(id) %>%
  arrange(desc(score)) %>%
  filter(row_number() == 1) %>%
  ungroup()



Answer 4:


Here's a short one-line solution, once the data is a data.table:

library(data.table)

data <- data.table(
  STUDID = c(1032, 1032, 1033, 1034, 1034, 1034),
  MATRISUBJ = c("AfrikaansB","isiZuluB", "IsiXhosaB", "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)

data[, .SD[which.max(SUBJSCORE)], by = "STUDID"]
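In the line above, .SD is the subset of rows for each STUDID, and which.max() returns the position of the first maximum, so exactly one row per student is kept. On large tables, a commonly used variant is to collect the row numbers with .I and subset once. A minimal sketch of that variant, using the sample data from the question (idx and best are illustrative names):

```r
library(data.table)

# Sample data copied from the question.
data <- data.table(
  STUDID    = c(1032, 1032, 1033, 1034, 1034, 1034),
  MATRISUBJ = c("AfrikaansB", "isiZuluB", "IsiXhosaB",
                "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)

# .I holds the row numbers of the whole table; within each STUDID group,
# pick the row number of the highest score. The result column is named V1.
idx <- data[, .I[which.max(SUBJSCORE)], by = STUDID]

# Subset the original table once with the collected row numbers.
best <- data[idx$V1]
best
```

Both forms should return the same rows; the .I form avoids building .SD for every group.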



Answer 5:


library(tidyverse)

data <- data.frame(
  STUDID = c(1032, 1032, 1033, 1034, 1034, 1034),
  MATRISUBJ = c("AfrikaansB","isiZuluB", "IsiXhosaB", "AfrikaansB", "EnglishB", "isiZuluB"),
  SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)

# Return the row(s) of x with the highest SUBJSCORE (tied rows are all kept)
srow <- function(x) {
  r <- which(x$SUBJSCORE == max(x$SUBJSCORE))
  x[r,]
}

# Split by student, pick the best row in each piece, then stack the pieces
dd <- data %>% split(.$STUDID) %>% map(srow) %>% bind_rows()
dd

  STUDID MATRISUBJ SUBJSCORE
1   1032  isiZuluB         7
2   1033 IsiXhosaB         6
3   1034  EnglishB         4


Source: https://stackoverflow.com/questions/53964950/is-there-an-r-function-for-dropping-duplicates-of-index-variable-based-on-lowest
