I\'ve got a data frame that\'s got the following form
pages count
[page 1, page 2, page 3] 23
[page 2, page 4] 4
[p
My "splitstackshape" package has a function that addresses this kind of problem. The relevant function in this case is concat.split
and works as follows (using "myDat" from Ricardo's answer):
# Get rid of "[" and "]" from your "pages" variable
myDat$pages <- gsub("\\[|\\]", "", myDat$pages)
# Specify the source data.frame, the variable that needs to be split up
# and whether to drop the original variable or not
library(splitstackshape)
concat.split(myDat, "pages", ",", drop = TRUE)
# count pages_1 pages_2 pages_3
# 1 23 page 1 page 2 page 3
# 2 4 page 2 page 4
# 3 12 page 1 page 3 page 4
sample data
myDat <- read.table(text=
"pages|count
[page 1, page 2, page 3]|23
[page 2, page 4]|4
[page 1, page 3, page 4]|12", header=TRUE, sep="|")
We can pull pages
out of myDat
to work on it.
# if factors, convert to characters
pages <- as.character(myDat$page)
# remove brackets. Note the double-escape's in R
pages <- gsub("(\\[|\\])", "", pages)
# split on comma
pages <- strsplit(pages, ",")
# find the largest element
maxLen <- max(sapply(pages, length))
# fill in any blanks. The t() is to transpose the return from sapply
pages <-
t(sapply(pages, function(x)
# append to x, NA's. Note that if (0 == (maxLen - length(x))), then no NA's are appended
c(x, rep(NA, maxLen - length(x)))
))
# add column names as necessary
colnames(pages) <- paste(c("First", "Second", "Third"), "Page")
# Put it all back together
data.frame(pages, Count=myDat$count)
Results
> data.frame(pages, Count=myDat$count)
First.Page Second.Page Third.Page Count
1 page 1 page 2 page 3 23
2 page 2 page 4 <NA> 4
3 page 1 page 3 page 4 12
read.table
with fill=TRUE
can fill them in. The names(DF2)<-
line could be omitted if nice column names are not important. No packages are used.
# test data
Lines <- "pages count
[page 1, page 2, page 3] 23
[page 2, page 4] 4
[page 1, page 3, page 4] 12"
# code - replace text=Lines with something like "myfile.dat"
DF <- read.table(text = Lines, skip = 1, sep = "]", as.is = TRUE)
DF2 <- read.table(text = DF[[1]], sep = ",", fill = TRUE, as.is = TRUE)
names(DF2) <- paste0(read.table(text = Lines, nrow = 1, as.is = TRUE)[[1]], seq_along(DF2))
DF2$count <- DF[[2]]
DF2[[1]] <- sub(".", "", DF2[[1]]) # remove [
which gives this:
> DF2
pages1 pages2 pages3 count
1 page 1 page 2 page 3 23
2 page 2 page 4 4
3 page 1 page 3 page 4 12
Note: This gives columns headings of page1, page2, etc. If it were important to have precisely the column headings shown in the question then replace that line with this which uses those headings if there are less than 20 page columns.
ord <- c('First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh',
'Eighth', 'Ninth', 'Tenth', 'Eleventh', 'Twelfth', 'Thirteenth',
'Fourteenth', 'Fiftheenth', 'Sixteenth', 'Seventeenth', 'Eighteenth',
'Nineteenth')
ix <- seq_along(DF2)
names(DF2) <- if (ncol(DF2) < 20) paste(ord[ix], "Page") else paste("Page", ix)