How to separate one column into multiple columns (complex column)

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-06 08:39:35

Question


I am trying to separate the column "Grade" into multiple columns, one per subject, each holding the corresponding grade.

    grade<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",sep=";")

    # Rename the columns

    names(grade)<-c("Student_ID","Name","Venue","Grade")

    head(grade)

    # Separate `Grade` into `subject` variables and corresponding `Grade` columns
    library(tidyverse)


    df<- grade %>% separate(Grade,paste("V",1:7,sep="_"),sep=":")

    head(df)

    # This still does not separate `subject` and `grade` into independent columns

    # Here is what I want it to look like

    new_df<-df[c(1:5),c(1:4)]

    new_df<-data.frame(new_df, V2=c(1:5)) # likewise for V2, 4, 5, 6, 7 to separate subject and grade

    new_df 

I have tried dplyr and stringr, but could not produce the result I expected.
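
To see why splitting on ":" alone falls short, here is a small base-R sketch on one made-up string (the value of `x` is an assumption modelled on a cell of the `Grade` column, not taken from the file). Each subject-grade pair is delimited by whitespace rather than by a colon, so a colon split leaves a grade glued to the next subject name:

```r
# Hypothetical value modelled on one cell of the `Grade` column
x <- "Math: 4.25 Literature: 7.50"

# Split on the colon only, which is what separate(..., sep = ":") does
pieces <- strsplit(x, ":")[[1]]
print(pieces)

# The middle piece still holds a grade and the next subject together,
# so a second split on whitespace is needed to pull them apart
```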


Answer 1:


In my solution below, I use functions from the tidyverse and rebus packages. The rebus package builds regular expressions piece by piece using human-readable code.

 library(tidyverse)
 library(rebus)
 grade<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",
                 sep = ";", stringsAsFactors = FALSE)

 grade_new <- grade %>%
   mutate(DIEM_THI2 = str_replace_all(DIEM_THI, pattern = ":" %R% one_or_more(SPC), "-")) %>%
   separate_rows(DIEM_THI2, sep = one_or_more(SPC)) %>%
   separate(DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-") %>%
   spread(SUBJECT,GRADE)

The resulting dataframe looks like the following:

head(grade_new[,5:12])
#   Biology Chemitry English Geography History Literature Math Physics
# 1    6.00     6.00    <NA>      <NA>    <NA>       7.50 4.25    6.80
# 2    5.80     6.00    <NA>      <NA>    <NA>       6.00 5.75    <NA>
# 3    <NA>     <NA>    <NA>      8.00    4.50       7.75 2.25    <NA>
# 4    <NA>     <NA>    <NA>      7.25    7.50       7.75 3.25    <NA>
# 5    <NA>     <NA>    <NA>      7.75    4.50       8.25 1.75    <NA>
# 6    <NA>     6.60    6.78      <NA>    <NA>       7.00 8.75    8.40

The code can be understood as follows:

  1. All colon-plus-whitespace substrings are replaced with hyphens, i.e. "Math: 4.25 Literature: 7.50" becomes "Math-4.25 Literature-7.50". This is done with the str_replace_all function. Let's call the new variable DIEM_THI2.
  2. The separate_rows function splits the space-separated column DIEM_THI2 into separate rows, i.e. "Math-4.25" and "Literature-7.50" end up on two different rows.
  3. The DIEM_THI2 column is separated into two columns, SUBJECT and GRADE, where the former contains values like "Math" and "Literature" and the latter contains values like "4.25" and "7.50".
  4. The key-value (SUBJECT-GRADE) pairs are spread across multiple columns, one per subject.
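
The four steps can be tried on a tiny data frame. This is only a sketch under stated assumptions: the data frame `toy` and its `ID` column are made up, plain regular expressions stand in for the rebus helpers, and `pivot_wider()` (available in tidyr 1.0.0 and later, where `spread()` is superseded) is swapped in for `spread()`.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy data (hypothetical); the real file has more rows and columns
toy <- data.frame(
  ID = c(1, 2),
  DIEM_THI = c("Math: 4.25 Literature: 7.50", "Math: 5.75 Biology: 5.80"),
  stringsAsFactors = FALSE
)

toy_wide <- toy %>%
  # Step 1: "Math: 4.25" becomes "Math-4.25"
  mutate(DIEM_THI2 = str_replace_all(DIEM_THI, ":\\s+", "-")) %>%
  # Step 2: one row per subject-grade pair
  separate_rows(DIEM_THI2, sep = "\\s+") %>%
  # Step 3: split each pair into two columns
  separate(DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-") %>%
  # Step 4: one column per subject (pivot_wider used here instead of spread)
  pivot_wider(names_from = SUBJECT, values_from = GRADE)

print(toy_wide)
```

The grades come out as character strings; wrapping the subject columns in `as.numeric()` afterwards would be one way to get numbers.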



Answer 2:


Here is one attempt using the tidyverse package. After converting everything to character (i.e. grade[] <- lapply(grade, as.character)), we create a custom function that returns the sorted subject: grade pairs for each Student_ID. We then use unnest to make the data long, and separate to split it into two columns, Subject and Grade. Finally, we spread to get one column per subject.

library(tidyverse)

#This function could definitely be more elegant or even avoided
#  but this is as far as my regex knowledge allows me to go

mysplit <- function(x){
  y <- strsplit(x, ':\\s+|\\s+')[[1]]
  z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])
  return(z[order(sub(':.*', '', z))])
}

grade %>% 
  mutate(Grade = lapply(Grade, mysplit)) %>% 
  unnest(Grade) %>% 
  separate(Grade, into = c('Subject', 'Grade'), sep = ': ') %>% 
  spread(Subject, Grade)

This splits the column as follows:

...     Biology Chemitry English Geography History Literature Math Physics
...   1    6.00     6.00    <NA>      <NA>    <NA>       7.50 4.25    6.80
...   2    5.80     6.00    <NA>      <NA>    <NA>       6.00 5.75    <NA>
...   3    <NA>     <NA>    <NA>      8.00    4.50       7.75 2.25    <NA>
...   4    <NA>     <NA>    <NA>      7.25    7.50       7.75 3.25    <NA>
...   5    <NA>     <NA>    <NA>      7.75    4.50       8.25 1.75    <NA>
...   6    <NA>     6.60    6.78      <NA>    <NA>       7.00 8.75    8.40
.
.

To better understand the function, break it down step by step. Say that x is the following:

x
#[1] "Math:   4.25   Literature:   7.50   Physics:   6.80   Chemitry:   6.00   Biology:   6.00"

Split it on every run of whitespace, or on a colon followed by whitespace, to get the following vector:

y <- strsplit(x, ':\\s+|\\s+')[[1]]
y
 #[1] "Math"       "4.25"       "Literature" "7.50"       "Physics"    "6.80"       "Chemitry"   "6.00"       "Biology"    "6.00"

Paste it back together, pairing the elements at odd positions (the subjects, y[c(TRUE, FALSE)]) with the elements at even positions (the grades, y[c(FALSE, TRUE)]), using ': ' as the separator:

z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])
z
#[1] "Math: 4.25"       "Literature: 7.50" "Physics: 6.80"    "Chemitry: 6.00"   "Biology: 6.00"   

Finally, it returns the vector sorted by subject name, where the subject is extracted with sub(':.*', '', z):

z[order(sub(':.*', '', z))]
#[1] "Biology: 6.00"    "Chemitry: 6.00"   "Literature: 7.50" "Math: 4.25"       "Physics: 6.80"
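
The alternating index works because R recycles a logical index along the whole vector, so c(TRUE, FALSE) keeps positions 1, 3, 5, ... and c(FALSE, TRUE) keeps positions 2, 4, 6, .... A minimal base-R sketch of the same idea, assuming a shortened made-up string and building a named numeric vector (the names `subjects`, `grades` and `named` are illustrative only):

```r
# Shortened, hypothetical version of the string shown above
x <- "Math:   4.25   Literature:   7.50   Physics:   6.80"

# Split on colon-plus-whitespace or on plain whitespace
y <- strsplit(x, ":\\s+|\\s+")[[1]]

# The logical index is recycled: odd positions are subjects, even are grades
subjects <- y[c(TRUE, FALSE)]
grades <- as.numeric(y[c(FALSE, TRUE)])

# Pair them up as a named numeric vector
named <- setNames(grades, subjects)
print(named)
```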

As @rosscova pointed out, the strings don't need to be sorted, which simplifies this a lot (the function is not needed after all), i.e.

grade %>%
  # split on the whitespace that follows a digit; the lookbehind keeps the digit itself
  mutate(Grade = strsplit(Grade, '(?<=[0-9])\\s+', perl = TRUE)) %>%
  unnest(Grade) %>%
  separate(Grade, into = c('Subject', 'Grade'), sep = ':\\s+') %>%
  spread(Subject, Grade)
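
When splitting on the whitespace that follows a grade, it is worth checking the pattern on a single string first. A pattern such as '[0-9]\\s+' consumes the digit it matches, so each grade except the last loses its final digit, whereas a Perl-style lookbehind only asserts that a digit precedes the whitespace. A minimal base-R sketch, assuming a made-up string with the same layout as the `Grade` column:

```r
# Hypothetical string with the same layout as one cell of `Grade`
x <- "Math:   4.25   Literature:   7.50   Physics:   6.80"

# The digit is part of the match, so it is removed along with the spaces
consuming <- strsplit(x, "[0-9]\\s+")[[1]]
print(consuming)

# The lookbehind checks for a digit without consuming it
keeping <- strsplit(x, "(?<=[0-9])\\s+", perl = TRUE)[[1]]
print(keeping)

# Each piece can then be split into subject and grade on colon-plus-whitespace
pairs <- do.call(rbind, strsplit(keeping, ":\\s+"))
print(pairs)
```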


Source: https://stackoverflow.com/questions/44984925/how-to-separate-one-column-to-multiple-column-complex-column
