Split strings into columns in R where each string has a potentially different number of column entries

前端 未结 3 688
情歌与酒
情歌与酒 2021-01-12 17:43

I\'ve got a data frame that\'s got the following form

pages                         count
[page 1, page 2, page 3]      23
[page 2, page 4]              4
[p         


        
相关标签:
3条回答
  • 2021-01-12 18:05

    My "splitstackshape" package has a function that addresses this kind of problem. The relevant function in this case is concat.split and works as follows (using "myDat" from Ricardo's answer):

    # Get rid of "[" and "]" from your "pages" variable
    myDat$pages <- gsub("\\[|\\]", "", myDat$pages)
    # Specify the source data.frame, the variable that needs to be split up
    #   and whether to drop the original variable or not
    library(splitstackshape)
    concat.split(myDat, "pages", ",", drop = TRUE)
    #   count pages_1 pages_2 pages_3
    # 1    23  page 1  page 2  page 3
    # 2     4  page 2  page 4        
    # 3    12  page 1  page 3  page 4
    
    0 讨论(0)
  • 2021-01-12 18:09

    sample data

    myDat <- read.table(text=
      "pages|count
    [page 1, page 2, page 3]|23
    [page 2, page 4]|4
    [page 1, page 3, page 4]|12", header=TRUE, sep="|") 
    

    We can pull pages out of myDat to work on it.

    # if factors, convert to characters
    pages <- as.character(myDat$page)
    
    # remove brackets.  Note the double-escape's in R
    pages <- gsub("(\\[|\\])", "", pages)
    
    # split on comma
    pages <- strsplit(pages, ",")
    
    # find the largest element
    maxLen <- max(sapply(pages, length))
    
    # fill in any blanks. The t() is to transpose the return from sapply
    pages <- 
    t(sapply(pages, function(x)
          # append to x, NA's.  Note that if (0 == (maxLen - length(x))), then no NA's are appended 
          c(x, rep(NA, maxLen - length(x)))
      ))
    
    # add column names as necessary
    colnames(pages) <- paste(c("First", "Second", "Third"), "Page")
    
    # Put it all back together
    data.frame(pages, Count=myDat$count)
    



    Results

    > data.frame(pages, Count=myDat$count)
      First.Page Second.Page Third.Page Count
    1     page 1      page 2     page 3    23
    2     page 2      page 4       <NA>     4
    3     page 1      page 3     page 4    12
    
    0 讨论(0)
  • 2021-01-12 18:15

    read.table with fill=TRUE can fill them in. The names(DF2)<- line could be omitted if nice column names are not important. No packages are used.

    # test data
    
    Lines <- "pages                         count
    [page 1, page 2, page 3]      23
    [page 2, page 4]              4
    [page 1, page 3, page 4]      12"
    
    # code - replace text=Lines with something like "myfile.dat"
    
    DF <- read.table(text = Lines, skip = 1, sep = "]", as.is = TRUE)
    DF2 <- read.table(text = DF[[1]], sep = ",", fill = TRUE, as.is = TRUE)
    names(DF2) <- paste0(read.table(text = Lines, nrow = 1, as.is = TRUE)[[1]], seq_along(DF2))
    DF2$count <- DF[[2]]
    DF2[[1]] <- sub(".", "", DF2[[1]]) # remove [
    

    which gives this:

    > DF2
      pages1  pages2  pages3 count
    1 page 1  page 2  page 3    23
    2 page 2  page 4             4
    3 page 1  page 3  page 4    12
    

    Note: This gives columns headings of page1, page2, etc. If it were important to have precisely the column headings shown in the question then replace that line with this which uses those headings if there are less than 20 page columns.

     ord <- c('First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh',
     'Eighth', 'Ninth', 'Tenth', 'Eleventh', 'Twelfth', 'Thirteenth',
     'Fourteenth', 'Fiftheenth', 'Sixteenth', 'Seventeenth', 'Eighteenth', 
     'Nineteenth')
    ix <- seq_along(DF2)
    names(DF2) <- if (ncol(DF2) < 20) paste(ord[ix], "Page") else paste("Page", ix)
    
    0 讨论(0)
提交回复
热议问题