How to separate one column into multiple columns (complex column)

Question

I am trying to split the "Grade" column into multiple columns, one per subject, each holding that subject's grade.

    grade<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",sep=";")

    # Rename the columns

    names(grade)<-c("Student_ID","Name","Venue","Grade")

    head(grade)

    # Separate `Grade` into `subject` variables and corresponding `Grade` columns
    library(tidyverse)


    df<- grade %>% separate(Grade,paste("V",1:7,sep="_"),sep=":")

    head(df)

    # This still does not separate `subject` and `grade` into independent columns

    # Here is what I want it to look like

    new_df<-df[c(1:5),c(1:4)]

    new_df<-data.frame(new_df, V2=c(1:5)) # the same for V2, 4, 5, 6, 7 to separate subject and grade

    new_df

I tried using dplyr and stringr, but could not produce the result I expected.

Answer 1:

In my solution below, I use functions from the tidyverse and rebus packages. The rebus package builds regular expressions piece by piece from human-readable code.

 library(tidyverse)
 library(rebus)
 grade<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",
                 sep = ";", stringsAsFactors = FALSE)

 grade_new <- grade %>%
   mutate(DIEM_THI2 = str_replace_all(DIEM_THI, pattern = ":" %R% one_or_more(SPC), "-")) %>%
   separate_rows(DIEM_THI2, sep = one_or_more(SPC)) %>%
   separate(DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-") %>%
   spread(SUBJECT,GRADE)

The resulting dataframe looks like the following:

head(grade_new[,5:12])
#   Biology Chemitry English Geography History Literature Math Physics
# 1    6.00     6.00    <NA>      <NA>    <NA>       7.50 4.25    6.80
# 2    5.80     6.00    <NA>      <NA>    <NA>       6.00 5.75    <NA>
# 3    <NA>     <NA>    <NA>      8.00    4.50       7.75 2.25    <NA>
# 4    <NA>     <NA>    <NA>      7.25    7.50       7.75 3.25    <NA>
# 5    <NA>     <NA>    <NA>      7.75    4.50       8.25 1.75    <NA>
# 6    <NA>     6.60    6.78      <NA>    <NA>       7.00 8.75    8.40

The code can be understood as follows:

All colon-plus-whitespace substrings are replaced with hyphens, i.e. "Math: 4.25 Literature: 7.50" becomes "Math-4.25 Literature-7.50". This is done with the str_replace_all function. Let's call the new variable DIEM_THI2.
The separate_rows function splits the space-separated column DIEM_THI2 into separate rows, i.e. "Math-4.25" and "Literature-7.50" end up on two different rows.
The DIEM_THI2 column is separated into two columns, SUBJECT and GRADE, where the former contains values like "Math" and "Literature" and the latter contains values like "4.25" and "7.50".
The key-value (SUBJECT-GRADE) pairs are spread across multiple columns, one column per subject.
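
The four steps above can be traced one at a time on a small example. The sketch below is only an illustration: the `toy` data frame and its values are made up to stand in for the real CSV, and plain regular-expression strings are used in place of the rebus helpers, on the assumption that `":" %R% one_or_more(SPC)` describes a colon followed by one or more whitespace characters.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical stand-in for the real data (values are invented)
toy <- data.frame(
  Student_ID = c(1, 2),
  DIEM_THI   = c("Math: 4.25 Literature: 7.50", "Math: 5.75 Biology: 5.80"),
  stringsAsFactors = FALSE
)

# Step 1: replace each colon plus following whitespace with a hyphen
step1 <- toy %>%
  mutate(DIEM_THI2 = str_replace_all(DIEM_THI, ":\\s+", "-"))

# Step 2: one row per "Subject-Grade" token
step2 <- step1 %>% separate_rows(DIEM_THI2, sep = "\\s+")

# Step 3: split each token into SUBJECT and GRADE
step3 <- step2 %>% separate(DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-")

# Step 4: one column per subject, NA where a student has no grade
step4 <- step3 %>% spread(SUBJECT, GRADE)
```

Printing `step1` through `step4` in turn shows how the data moves from one string per student, to one row per subject, to one column per subject. The grades are still character strings at the end, so a conversion with `as.numeric` would be needed before doing arithmetic on them.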

Answer 2:

Here is one attempt using the tidyverse package. After converting everything to character (i.e. grade[] <- lapply(grade, as.character)), we create a custom function that returns the sorted "subject: grade" strings for each Student_ID. We then use unnest to make the data long, and separate to split it into two columns: Subject and Grade. Finally, we spread to get one column for each subject.

library(tidyverse)

# This function could definitely be more elegant, or even avoided,
# but this is as far as my regex knowledge allows me to go

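mysplit <- function(x){
  # Split on a colon plus whitespace, or on whitespace alone
  y <- strsplit(x, ':\\s+|\\s+')[[1]]
  # Odd positions hold subjects, even positions hold grades
  z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])
  # Sort the "subject: grade" strings by subject name
  return(z[order(sub(':.*', '', z))])
}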

grade %>% 
  mutate(Grade = lapply(Grade, mysplit)) %>% 
  unnest(Grade) %>% 
  separate(Grade, into = c('Subject', 'Grade'), sep = ': ') %>% 
  spread(Subject, Grade)

This splits the column as follows:

...     Biology Chemitry English Geography History Literature Math Physics
...   1    6.00     6.00    <NA>      <NA>    <NA>       7.50 4.25    6.80
...   2    5.80     6.00    <NA>      <NA>    <NA>       6.00 5.75    <NA>
...   3    <NA>     <NA>    <NA>      8.00    4.50       7.75 2.25    <NA>
...   4    <NA>     <NA>    <NA>      7.25    7.50       7.75 3.25    <NA>
...   5    <NA>     <NA>    <NA>      7.75    4.50       8.25 1.75    <NA>
...   6    <NA>     6.60    6.78      <NA>    <NA>       7.00 8.75    8.40
.
.

To better understand the function, it helps to break it down. Say that x is the following:

x
#[1] "Math:   4.25   Literature:   7.50   Physics:   6.80   Chemitry:   6.00   Biology:   6.00"

Split it at every run of whitespace, or at a colon followed by whitespace, to get the following vector:

y <- strsplit(x, ':\\s+|\\s+')[[1]]
y
 #[1] "Math"       "4.25"       "Literature" "7.50"       "Physics"    "6.80"       "Chemitry"   "6.00"       "Biology"    "6.00"

Paste the pieces back together, pairing the odd-positioned elements (i.e. the subjects, y[c(TRUE, FALSE)]) with the even-positioned elements (i.e. the grades, y[c(FALSE, TRUE)]), using ': ' as the separator:

z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])
z
#[1] "Math: 4.25"       "Literature: 7.50" "Physics: 6.80"    "Chemitry: 6.00"   "Biology: 6.00"
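
The alternating selection works because R recycles a logical index that is shorter than the vector it subsets, so `c(TRUE, FALSE)` picks positions 1, 3, 5, ... and `c(FALSE, TRUE)` picks positions 2, 4, 6, .... A minimal base-R sketch on a made-up vector (the names `subjects`, `grades` and `named` are only for illustration):

```r
# Alternating subject/grade tokens, as produced by the strsplit step
y <- c("Math", "4.25", "Literature", "7.50", "Physics", "6.80")

# The two-element logical index is recycled along the whole vector
subjects <- y[c(TRUE, FALSE)]   # elements 1, 3, 5
grades   <- y[c(FALSE, TRUE)]   # elements 2, 4, 6

# Pair them up as a named numeric vector instead of pasted strings
named <- setNames(as.numeric(grades), subjects)
```

This only lines up correctly when the tokens strictly alternate, so a subject name containing a space, or a missing grade, would shift every later pair.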

Finally, the function returns the vector sorted by subject name (extracted with sub(':.*', '', z)):

z[order(sub(':.*', '', z))]
#[1] "Biology: 6.00"    "Chemitry: 6.00"   "Literature: 7.50" "Math: 4.25"       "Physics: 6.80"

As @rosscova pointed out, the strings don't need to be sorted, which simplifies this a lot (the function is not needed after all), i.e.

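# Split on whitespace that follows a digit; the lookbehind keeps the digit in the grade
grade %>% 
  mutate(Grade = strsplit(Grade, '(?<=[0-9])\\s+', perl = TRUE)) %>% 
  unnest(Grade) %>% 
  separate(Grade, into = c('Subject', 'Grade'), sep = ':\\s+') %>% 
  spread(Subject, Grade)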
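
One detail worth checking when splitting on the whitespace after a grade: strsplit drops whatever the pattern matches, so a pattern that includes the digit itself removes that digit from the result. The base-R sketch below, on a made-up string, compares a digit-consuming pattern with a lookbehind that only asserts the digit is there (the lookbehind requires `perl = TRUE`):

```r
x <- "Math: 4.25 Literature: 7.50 Physics: 6.80"

# The matched digit is part of the separator, so it is removed from the grade
consumed <- strsplit(x, "[0-9]\\s+")[[1]]

# The lookbehind checks for a digit without including it in the separator
kept <- strsplit(x, "(?<=[0-9])\\s+", perl = TRUE)[[1]]

consumed
kept
```

In `consumed`, every grade except the last loses its final digit ("4.2" instead of "4.25"), while `kept` preserves each grade in full.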

Source: https://stackoverflow.com/questions/44984925/how-to-separate-one-column-to-multiple-column-complex-column

Tags

string

data-manipulation