Question
I am trying to separate the "Grade" column into multiple columns, one per subject, each holding the grade for that subject
grade<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",sep=";")
# Rename the column names
names(grade)<-c("Student_ID","Name","Venue","Grade")
head(grade)
# Separate `Grade` into `subject` variables and corresponding `Grade` columns
library(tidyverse)
df<- grade %>% separate(Grade,paste("V",1:7,sep="_"),sep=":")
head(df)
# This still does not separate `subject` and `grade` independently
# Here is what I want it to look like
new_df<-df[c(1:5),c(1:4)]
new_df<-data.frame(new_df, V2=c(1:5)) # the same for V2, V4, V5, V6 and V7, to separate subject and grade
new_df
I have tried using dplyr and stringr, but could not produce the result I expected.
Answer 1:
In my solution below, I have used functions from the tidyverse and rebus packages. The rebus package builds regular expressions piece by piece using human-readable code.
library(tidyverse)
library(rebus)
grade <- read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",
                  sep = ";", stringsAsFactors = FALSE)
grade_new <- grade %>%
  mutate(DIEM_THI2 = str_replace_all(DIEM_THI, pattern = ":" %R% one_or_more(SPC), "-")) %>%
  separate_rows(DIEM_THI2, sep = one_or_more(SPC)) %>%
  separate(DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-") %>%
  spread(SUBJECT, GRADE)
The resulting dataframe looks like the following:
head(grade_new[,5:12])
# Biology Chemitry English Geography History Literature Math Physics
# 1 6.00 6.00 <NA> <NA> <NA> 7.50 4.25 6.80
# 2 5.80 6.00 <NA> <NA> <NA> 6.00 5.75 <NA>
# 3 <NA> <NA> <NA> 8.00 4.50 7.75 2.25 <NA>
# 4 <NA> <NA> <NA> 7.25 7.50 7.75 3.25 <NA>
# 5 <NA> <NA> <NA> 7.75 4.50 8.25 1.75 <NA>
# 6 <NA> 6.60 6.78 <NA> <NA> 7.00 8.75 8.40
The code can be understood as follows:
- All colon+space substrings are replaced with hyphens, i.e. "Math: 4.25 Literature: 7.50" becomes "Math-4.25 Literature-7.50". This is done using the str_replace_all function. Let's call the new variable DIEM_THI2.
- The separate_rows function splits the space-separated column DIEM_THI2 into separate rows, i.e. "Math-4.25" and "Literature-7.50" span two different rows.
- The DIEM_THI2 column is separated into two columns, SUBJECT and GRADE, where the former contains values like "Math" and "Literature" and the latter contains values like "4.25" and "7.50".
- The SUBJECT-GRADE key-value pairs are spread across multiple columns, one column per subject.
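The string handling in the first three steps can be tried on a single value without loading any packages. What follows is a minimal base-R sketch, assuming the "Subject: grade" layout shown above. The sample string and the variable names are made up for illustration, and gsub and strsplit stand in for str_replace_all, separate_rows and separate.

```r
# Illustrative input in the same layout as the DIEM_THI column (made-up sample)
x <- "Math: 4.25 Literature: 7.50"

# Step 1: replace each colon followed by whitespace with a hyphen
x2 <- gsub(":\\s+", "-", x)

# Step 2: split on whitespace so each subject-grade pair is its own element
pairs <- strsplit(x2, "\\s+")[[1]]

# Step 3: split each pair at the hyphen into subject and grade
parts   <- strsplit(pairs, "-", fixed = TRUE)
subject <- vapply(parts, `[`, character(1), 1)
grade   <- as.numeric(vapply(parts, `[`, character(1), 2))

print(data.frame(subject, grade))
```

Note that the hyphen trick relies on subject names containing neither hyphens nor spaces, which holds for the subjects shown here but is an assumption about the rest of the data.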
Answer 2:
Here is one attempt using the tidyverse package. After converting everything to character (i.e. grade[] <- lapply(grade, as.character)) we create a custom function that returns the sorted subject:grade pairs for each StudentID. We then use unnest to make it long, and use separate to split it into two columns, Subject and Grade. Finally we spread to get one column for each subject.
library(tidyverse)
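# This function could definitely be more elegant or even avoided,
# but this is as far as my regex knowledge allows me to go
mysplit <- function(x){
  # Split on colon+whitespace or plain whitespace: subjects and grades alternate
  y <- strsplit(x, ':\\s+|\\s+')[[1]]
  # Re-join each subject (odd positions) with its grade (even positions)
  z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])
  # Return the pairs sorted by subject name
  return(z[order(sub(':.*', '', z))])
}
grade %>%
  mutate(Grade = lapply(Grade, mysplit)) %>%
  unnest(Grade) %>%
  separate(Grade, into = c('Subject', 'Grade'), sep = ': ') %>%
  spread(Subject, Grade)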
This splits it as follows:
...   Biology Chemitry English Geography History Literature Math Physics
... 1    6.00     6.00    <NA>      <NA>    <NA>       7.50 4.25    6.80
... 2    5.80     6.00    <NA>      <NA>    <NA>       6.00 5.75    <NA>
... 3    <NA>     <NA>    <NA>      8.00    4.50       7.75 2.25    <NA>
... 4    <NA>     <NA>    <NA>      7.25    7.50       7.75 3.25    <NA>
... 5    <NA>     <NA>    <NA>      7.75    4.50       8.25 1.75    <NA>
... 6    <NA>     6.60    6.78      <NA>    <NA>       7.00 8.75    8.40
.
.
To better understand the function, it helps to break it down.
Say that x is the following:
x
#[1] "Math: 4.25 Literature: 7.50 Physics: 6.80 Chemitry: 6.00 Biology: 6.00"
Split it at every space, or at every colon followed by a space, to get the following vector:
y <- strsplit(x, ':\\s+|\\s+')[[1]]
y
#[1] "Math" "4.25" "Literature" "7.50" "Physics" "6.80" "Chemitry" "6.00" "Biology" "6.00"
Paste it back together, pairing every first element (i.e. the subjects, y[c(TRUE, FALSE)]) with every second element (i.e. the grades, y[c(FALSE, TRUE)]), using ': ' as the separator:
z <- paste0(y[c(T, F)], ': ', y[c(F, T)])
z
#[1] "Math: 4.25" "Literature: 7.50" "Physics: 6.80" "Chemitry: 6.00" "Biology: 6.00"
Finally, it outputs the vector sorted by subject name (the subject names are extracted with sub(':.*', '', z)):
z[order(sub(':.*', '', z))]
#[1] "Biology: 6.00" "Chemitry: 6.00" "Literature: 7.50" "Math: 4.25" "Physics: 6.80"
As @rosscova pointed out, strings don't need to be sorted, which simplifies this a lot (function is not needed after all), i.e.
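grade %>%
  # Split on whitespace that follows a digit; the lookbehind leaves the digit in place
  mutate(Grade = strsplit(Grade, '(?<=[0-9])\\s+', perl = TRUE)) %>%
  unnest(Grade) %>%
  separate(Grade, into = c('Subject', 'Grade'), sep = ': ') %>%
  spread(Subject, Grade)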
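
The long-then-wide idea can also be tried end to end on a toy data frame, without downloading the CSV. This is only a base-R sketch and not the tidyverse pipeline from the answers: the two sample rows, the Value column name and the lookbehind split pattern are assumptions made for illustration, and reshape stands in for spread.

```r
# Toy data in the same shape as the question (values are made up)
grade <- data.frame(
  Student_ID = c(1, 2),
  Grade = c("Math: 4.25 Literature: 7.50", "Math: 5.75 History: 4.50"),
  stringsAsFactors = FALSE
)

# Split on whitespace that follows a digit, keeping each "Subject: grade" pair whole
pairs <- strsplit(grade$Grade, "(?<=[0-9])\\s+", perl = TRUE)

# Long format: one row per student and subject
long <- data.frame(
  Student_ID = rep(grade$Student_ID, lengths(pairs)),
  Subject    = sub(":.*", "", unlist(pairs)),
  Value      = as.numeric(sub(".*:\\s*", "", unlist(pairs))),
  stringsAsFactors = FALSE
)

# Wide format: one column per subject, NA where a student has no grade
wide <- reshape(long, idvar = "Student_ID", timevar = "Subject", direction = "wide")
names(wide) <- sub("^Value\\.", "", names(wide))
print(wide)
```

Converting the grades with as.numeric in the long step means the subject columns come out numeric rather than character, which is usually what is wanted for later calculations.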
Source: https://stackoverflow.com/questions/44984925/how-to-separate-one-column-to-multiple-column-complex-column