问题
I am trying to analyse large data-sets of student scores. Some students do retakes which produces duplicate scores, usually with the earlier low score placed the row above their retake, usually higher, score. I want to select their highest score, with a file that has only one line per student (which I will need to merge with other files having same ids).
Source file is like this:
STUDID MATRISUBJ SUBJSCORE
1032 AfrikaansB 2
1032 isiZuluB 7
1033 IsiXhosaB 6
1034 AfrikaansB 1
1034 EnglishB 4
1034 isiZuluB 3
result should look like this.
STUDID MATRISUBJ SUBJSCORE
1032 isiZuluB 7
1033 isiXhosaB 6
1034 EnglishB 4
Help, please..I used to do this process in SPS but now can't get access to this commercialised software, so am swapping to R
df2[!duplicated(df2[1:1]),]
gives the first row of the duplicate but I want the one with highest value, and sometimes student tries with another subject to get required score in languages
回答1:
Heyo! The simplest solution would be to use the top_n()
function. This will allow you to choose the top n scores based on a numeric column (in your case SUBJSCORE
)
The following code will give you what you need :)
library(tidyverse)
df %>%
group_by(STUDID) %>%
top_n(1, SUBJSCORE)
回答2:
You could use something like:
df %>%
dplyr::group_by(., STUDID) %>%
dplyr::arrange(.,desc(score) %>%
.[1,]
回答3:
I typically do something like this using the tidyverse group of packages:
library(tidyverse)
df <- data.frame(id = c('a','a','a','b','b','c','c','c')
, score = c(90,92,93,75,87,67,68,73)
, tesno = c(1,2,3,1,2,1,2,3))
df %>% group_by(id) %>% arrange(desc(score)) %>% filter(row_number() == 1) %>% ungroup()
回答4:
Here's a short one-line solution, once the data is a data.table:
library(data.table)
data <- data.table(
STUDID = c(1032, 1032, 1033, 1034, 1034, 1034),
MARISUBJ = c("AfrikaansB","isiZuluB", "IsiXhosaB", "AfrikaansB", "EnglishB", "isiZuluB"),
SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)
data[, .SD[which.max(SUBJSCORE)], by = "STUDID"]
回答5:
library(tidyverse)
data <- data.frame(
STUDID = c(1032, 1032, 1033, 1034, 1034, 1034),
MARISUBJ = c("AfrikaansB","isiZuluB", "IsiXhosaB", "AfrikaansB", "EnglishB", "isiZuluB"),
SUBJSCORE = c(2, 7, 6, 1, 4, 3)
)
srow <- function(x) {
r <- which(x$SUBJSCORE == max(x$SUBJSCORE))
x[r,]
}
dd <- data %>% split(.$STUDID) %>% map(~srow(.)) %>% bind_rows(.$STUDID)
dd
STUDID MARISUBJ SUBJSCORE
1 1032 isiZuluB 7
2 1033 IsiXhosaB 6
3 1034 EnglishB 4
来源:https://stackoverflow.com/questions/53964950/is-there-an-r-function-for-dropping-duplicates-of-index-variable-based-on-lowest