Question
R beginner here with a probably rather simple question.
I have a dataframe as seen below in the reproducible sample. What I want to do is export it to pdf, so that I always have 4 columns (Country, Fruit, Start, End) per pdf sheet. So in this case I would need a pdf file with 3 pages, one each for Apples, Bananas and Cherries.
In reality I have 50 "Fruits", hence a loop of sorts would be useful, preferably using grid.table or similar due to its nicely formatted output tables.
Country <- c("AUS", "AT", "BE", "CHN", "US")
Fruit <- c(rep("Apples", 5))
Start <- c(1999, 1998, 1987, 1988, 1997)
End <- c(2014, 2014, 2015, 2013, 2014)
Country.1 <- c("AUS", "AT", "BE", "CHN", "US")
Fruit.1 <- c(rep("Bananas", 5))
Start.1 <- c(1998, 1999, 1987, 1988, 1999)
End.1 <- c(2014, 2014, 2014, 2014, 2015)
Country.2 <- c("AUS", "AT", "BE", "CHN", "US")
Fruit.2 <- c(rep("Cherries", 5))
Start.2 <- c(1981, 1988, 1987, 1977, 1999)
End.2 <- c(2014, 2014, 2015, 2013, 2014)
mydf <- data.frame(Country, Fruit, Start, End, Country.1, Fruit.1, Start.1, End.1, Country.2, Fruit.2, Start.2, End.2)
I tried to work with the expression mydf[c(TRUE, seq(FALSE, 4))], and tried to incorporate it in grid.table (from gridExtra) but couldn't make it work. I would hugely appreciate any help.
Furthermore (and not as important), I'd like to ask you to comment on the structure of this data. The way this dataframe is set up I basically have lots of duplicate columns (Country). I doubt that this is the ideal way to work with data in R, and would highly appreciate any comments that would help me improve my R skills in this regard, since I want to get better at handling big (mostly panel) datasets in R.
Edit: I assume I could delete the duplicate Country columns, as they do not change. Edit2: Below is a small sample of data representing my initial data structure. x1-x12 are the "Fruits"; y1-y9 are analogous to the countries of the previous sample.
Fruits <- c("x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12")
y1 <- c(round(runif(12, 1980, 1999), digits = 0))
y2 <- c(round(runif(12, 1980, 1999), digits = 0))
y3 <- c(round(runif(12, 1980, 1999), digits = 0))
y4 <- c(round(runif(12, 1980, 1999), digits = 0))
y5 <- c(round(runif(12, 1980, 1999), digits = 0))
y6 <- c(round(runif(12, 1980, 1999), digits = 0))
y7 <- c(round(runif(12, 1980, 1999), digits = 0))
y8 <- c(round(runif(12, 1980, 1999), digits = 0))
y9 <- c(round(runif(12, 1980, 1999), digits = 0))
startdf <- data.frame(Fruits, y1, y2, y3, y4, y5, y6, y7, y8, y9)
y1 <- c(round(runif(12, 2012, 2015), digits = 0))
y2 <- c(round(runif(12, 2012, 2015), digits = 0))
y3 <- c(round(runif(12, 2012, 2015), digits = 0))
y4 <- c(round(runif(12, 2012, 2015), digits = 0))
y5 <- c(round(runif(12, 2012, 2015), digits = 0))
y6 <- c(round(runif(12, 2012, 2015), digits = 0))
y7 <- c(round(runif(12, 2012, 2015), digits = 0))
y8 <- c(round(runif(12, 2012, 2015), digits = 0))
y9 <- c(round(runif(12, 2012, 2015), digits = 0))
enddf <- data.frame(Fruits, y1, y2, y3, y4, y5, y6, y7, y8, y9)
Answer 1:
First of all, you're overwriting the Start and End columns in your sample, so I've added numbers to them as well and started the numbering at 1:
Country <- c("AUS", "AT", "BE", "CHN", "US")
Fruit.1 <- c(rep("Apples", 5))
Start.1 <- c(1999, 1998, 1987, 1988, 1997)
End.1 <- c(2014, 2014, 2015, 2013, 2014)
Fruit.2 <- c(rep("Bananas", 5))
Start.2 <- c(1998, 1999, 1987, 1988, 1999)
End.2 <- c(2014, 2014, 2014, 2014, 2015)
Fruit.3 <- c(rep("Cherries", 5))
Start.3 <- c(1981, 1988, 1987, 1977, 1999)
End.3 <- c(2014, 2014, 2015, 2013, 2014)
mydf <- data.frame(Country, Fruit.1, Start.1, End.1, Fruit.2, Start.2, End.2, Fruit.3, Start.3, End.3)
Notice how I only put the country column once, at the beginning of the data frame. There is really no need to have duplicated columns in a data frame as you can always pull them out by name, number or logical value.
I've made a simple loop based on a regular expression to print each separate part on a new page:
library(gridExtra)
pdf("data_output.pdf", height=11, width=8.5)
for(i in 1:3) {
  plot.new()
  regex <- paste0("Country|",i)
  tempdf <- mydf[grepl(regex,names(mydf))]
  grid.table(tempdf)
}
dev.off()
The regular expression pulls out all the columns that either have "Country" or (| = OR) the digit "i" in the column name. The plot.new() command ensures that grid.table starts a new page.
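One caveat once there are more than nine fruits: the pattern "Country|1" matches any column name that contains a 1, so it would also pick up Fruit.10, Start.10 and so on. A minimal sketch of an anchored pattern, assuming the columns keep the Name.i naming used above (the tenth fruit below is hypothetical, added only to show the difference):

```r
# Column names in the numbered layout, plus a hypothetical tenth fruit
cols <- c("Country", "Fruit.1", "Start.1", "End.1",
          "Fruit.10", "Start.10", "End.10")
i <- 1

# Unanchored pattern from the loop above: "1" also matches the ".10" columns
loose <- cols[grepl(paste0("Country|", i), cols)]

# Anchored pattern: the name must be exactly "Country" or end in ".<i>"
strict <- cols[grepl(paste0("^Country$|\\.", i, "$"), cols)]

print(loose)
print(strict)
```

Inside the loop, the anchored pattern can be built the same way from the loop variable i.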
ADDON: how I would organize your data.
The above writeup solves your original problem. Really though, the structure of your data lends itself perfectly to making a data frame composed of factors and numbers and using the powerful split-apply-combine framework from either base R or the plyr/dplyr packages.
Your data frame would ideally be reorganized as follows:
Country <- rep(c("AUS", "AT", "BE", "CHN", "US"), 3)
Fruit <- rep(c("Apples", "Bananas", "Cherries") , each = 5)
Start <- c(1999,1998,1987,1988,1997,1998,1999,1987,1988,1999,1981,1988,1987,1977,1999)
End <- c(2014,2014,2015,2013,2014,2014,2014,2014,2014,2015,2014,2014,2015,2013,2014)
mydf <- data.frame(Country, Fruit, Start, End)
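If the data already sits in the wide, numbered layout from the first part of this answer, it does not have to be retyped by hand. A minimal base R sketch, assuming the columns follow the Fruit.i / Start.i / End.i naming used there (the names wide and long are only illustrative):

```r
# Wide layout: one Country column, then Fruit.i / Start.i / End.i per fruit
wide <- data.frame(
  Country = c("AUS", "AT", "BE", "CHN", "US"),
  Fruit.1 = "Apples",   Start.1 = c(1999, 1998, 1987, 1988, 1997),
  End.1   = c(2014, 2014, 2015, 2013, 2014),
  Fruit.2 = "Bananas",  Start.2 = c(1998, 1999, 1987, 1988, 1999),
  End.2   = c(2014, 2014, 2014, 2014, 2015),
  Fruit.3 = "Cherries", Start.3 = c(1981, 1988, 1987, 1977, 1999),
  End.3   = c(2014, 2014, 2015, 2013, 2014),
  stringsAsFactors = FALSE
)

# Pull out one four-column block per fruit index and stack the blocks
long <- do.call(rbind, lapply(1:3, function(i) {
  data.frame(
    Country = wide$Country,
    Fruit   = wide[[paste0("Fruit.", i)]],
    Start   = wide[[paste0("Start.", i)]],
    End     = wide[[paste0("End.", i)]],
    stringsAsFactors = FALSE
  )
}))

print(long)
```

The result has the same four columns and the same row order as the mydf defined above.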
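Then you could either use base R to loop over the different types of fruit,
pdf("data_output.pdf", height=11, width=8.5)
# unique() works whether Fruit is stored as character or as a factor
# (levels() returns NULL for a character column, which is the default since R 4.0)
for(fruit in unique(as.character(mydf$Fruit))) {
  tempdf <- subset(mydf, Fruit == fruit)
  plot.new()
  grid.table(tempdf)
}
dev.off()
or, you could use a lapply call like in Tensibai's answer,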
or, you can use the by function
by(mydf, mydf$Fruit, function(x) {plot.new(); grid.table(x)})
or, you could use the ddply function of the plyr package to accomplish the same,
library(plyr)
ddply(mydf, .(Fruit), function(x) { plot.new(); grid.table(x) })
The advantage of the last two methods is that you can easily add other columns (e.g., Continent) and print more complicated subsets of your data, like Fruit = Apples and Continent = North America, without having to wrap everything in more loops.
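As a rough illustration of that last point, here is a sketch that groups by two columns at once using base R's split. The Continent lookup is a hypothetical addition (it is not in the original data), and the output file name is arbitrary; the PDF step is skipped if gridExtra is not installed.

```r
# Long-format data from above, rebuilt so the sketch runs on its own
mydf <- data.frame(
  Country = rep(c("AUS", "AT", "BE", "CHN", "US"), 3),
  Fruit   = rep(c("Apples", "Bananas", "Cherries"), each = 5),
  Start   = c(1999, 1998, 1987, 1988, 1997, 1998, 1999, 1987, 1988, 1999,
              1981, 1988, 1987, 1977, 1999),
  End     = c(2014, 2014, 2015, 2013, 2014, 2014, 2014, 2014, 2014, 2015,
              2014, 2014, 2015, 2013, 2014),
  stringsAsFactors = FALSE
)

# Hypothetical country-to-continent lookup, for illustration only
continent_of <- c(AUS = "Oceania", AT = "Europe", BE = "Europe",
                  CHN = "Asia", US = "North America")
mydf$Continent <- unname(continent_of[mydf$Country])

# One data frame per Fruit/Continent combination that actually occurs
groups <- split(mydf, list(mydf$Fruit, mydf$Continent), drop = TRUE)

# One PDF page per group
if (requireNamespace("gridExtra", quietly = TRUE)) {
  pdf(file.path(tempdir(), "by_fruit_and_continent.pdf"),
      height = 11, width = 8.5)
  for (g in groups) {
    grid::grid.newpage()                   # start a fresh page
    gridExtra::grid.table(g, rows = NULL)  # rows = NULL hides row names
  }
  dev.off()
}
```

A single subset is then available by name, for example groups[["Apples.North America"]].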
Answer 2:
For your data organization I would go this way:
# Known Fixed list
Country <- c("AUS", "AT", "BE", "CHN", "US")
Fruits <- c("Apples","Bananas","Cherries")
# Build list of variables entry
Start <- list() # Init the list
End <- list() # Init the list
# Fill them up (1 being first fruit in Fruits)
Start[[1]] <- c(1999, 1998, 1987, 1988, 1997)
End[[1]] <- c(2014, 2014, 2015, 2013, 2014)
Start[[2]] <- c(1998, 1999, 1987, 1988, 1999)
End[[2]] <- c(2014, 2014, 2014, 2014, 2015)
Start[[3]] <- c(1981, 1988, 1987, 1977, 1999)
End[[3]] <- c(2014, 2014, 2015, 2013, 2014)
# Build a list of data frames by iterating over the Fruits
mydfs <- lapply(seq_along(Fruits), function(x) {
  data.frame(
    Country,
    Fruit = rep(Fruits[x], length(Start[[x]])),
    Start = Start[[x]],
    End = End[[x]]
  )
})
Which gives:
> mydfs
[[1]]
Country Fruit Start End
1 AUS Apples 1999 2014
2 AT Apples 1998 2014
3 BE Apples 1987 2015
4 CHN Apples 1988 2013
5 US Apples 1997 2014
[[2]]
Country Fruit Start End
1 AUS Bananas 1998 2014
2 AT Bananas 1999 2014
3 BE Bananas 1987 2014
4 CHN Bananas 1988 2014
5 US Bananas 1999 2015
[[3]]
Country Fruit Start End
1 AUS Cherries 1981 2014
2 AT Cherries 1988 2014
3 BE Cherries 1987 2015
4 CHN Cherries 1977 2013
5 US Cherries 1999 2014
And you can access any of this df like this:
> mydfs[[ as.factor(Fruits)[Fruits == "Bananas"] ]]
Country Fruit Start End
1 AUS Bananas 1998 2014
2 AT Bananas 1999 2014
3 BE Bananas 1987 2014
4 CHN Bananas 1988 2014
5 US Bananas 1999 2015
So you can print any of them individually with grid.table(mydfs[[ as.factor(Fruits)[Fruits == "Bananas"] ]]) or all of them through a lapply call as shown by @user28400:
pdf("data_output.pdf", height=11, width=8.5)
lapply( mydfs, function(x) {
  plot.new()
  grid.table(x)
})
dev.off()
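A note on the lookup used above: indexing with as.factor(Fruits) works because R indexes by the factor's integer codes, and the alphabetically sorted levels happen to line up with the order of Fruits here. If the fruits were not in alphabetical order, the codes would point at the wrong list element. A sketch of a less fragile alternative, naming the list elements instead (same data as above, rebuilt so it runs on its own):

```r
Country <- c("AUS", "AT", "BE", "CHN", "US")
Fruits  <- c("Apples", "Bananas", "Cherries")
Start <- list(c(1999, 1998, 1987, 1988, 1997),
              c(1998, 1999, 1987, 1988, 1999),
              c(1981, 1988, 1987, 1977, 1999))
End   <- list(c(2014, 2014, 2015, 2013, 2014),
              c(2014, 2014, 2014, 2014, 2015),
              c(2014, 2014, 2015, 2013, 2014))

mydfs <- lapply(seq_along(Fruits), function(i) {
  data.frame(Country, Fruit = Fruits[i], Start = Start[[i]], End = End[[i]])
})

# Label each list element with its fruit so it can be looked up by name
names(mydfs) <- Fruits

bananas <- mydfs[["Bananas"]]
print(bananas)
```

With named elements, grid.table(mydfs[["Bananas"]]) prints a single fruit regardless of the order of Fruits.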
Updated construction of mydfs, following the question update and comments:
nbYears <- length(startdf[1,-1])
# Build a list of data frames by iterating over the Fruits
mydfs <- lapply(Fruits, # iterate over the Fruits to use their names
  function(x) {
    # logical vector selecting this fruit's row (shortens the subsetting below, not needed for performance)
    lVec <- startdf$Fruits == x
    data.frame(
      Country,
      Fruit = rep(x, nbYears),
      # subset by the logical vector, omit the first column, and flatten the one-row data frame to a vector
      Start = unlist(startdf[lVec,-1]),
      End = unlist(enddf[lVec,-1]) # same as above
    )
  }
)
This assumes that Country has one entry per y column (nbYears entries), so that it lines up with the columns of startdf and enddf.
Source: https://stackoverflow.com/questions/32203753/export-fixed-range-of-columns-from-dataframe-to-pdf-one-slice-per-sheet
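For comparison, the same startdf/enddf pair can also be stacked into a single long data frame, in the spirit of the layout suggested in the other answer. This is only a sketch: the data below are random stand-ins with the same shape as the question's sample, the seed is arbitrary, and the y column names are used in place of real country codes.

```r
set.seed(1)  # arbitrary seed so the random sample is repeatable
Fruits <- paste0("x", 1:12)
ycols  <- paste0("y", 1:9)

# Stand-ins for the question's startdf and enddf (same shape, random years)
startdf <- data.frame(Fruits,
                      matrix(round(runif(12 * 9, 1980, 1999)), nrow = 12,
                             dimnames = list(NULL, ycols)))
enddf   <- data.frame(Fruits,
                      matrix(round(runif(12 * 9, 2012, 2015)), nrow = 12,
                             dimnames = list(NULL, ycols)))

# unlist() walks the y columns one after another, so Fruit cycles fastest
long <- data.frame(
  Fruit   = rep(as.character(startdf$Fruits), times = length(ycols)),
  Country = rep(ycols, each = nrow(startdf)),
  Start   = unlist(startdf[ycols], use.names = FALSE),
  End     = unlist(enddf[ycols], use.names = FALSE),
  stringsAsFactors = FALSE
)

head(long)
```

split(long, long$Fruit) then gives one data frame per fruit, which can be passed to the lapply/grid.table loop above.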