Question
I followed Hadley's thread, "Issue in Loading multiple .csv files into single dataframe in R using rbind", to read multiple CSV files and combine them into one data frame. I also experimented with lapply vs. sapply, as discussed in "Grouping functions (tapply, by, aggregate) and the *apply family".
Here's my first CSV file:
dput(File1)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A",
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L,
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L,
23L, 34L, 45L, 44L), Tax = c(23L, 21L, 22L, 24L, 25L), Location = structure(c(3L,
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name",
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA,
-5L))
Here's my second CSV file:
dput(File2)
structure(list(First.Name = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("A",
"C"), class = "factor"), Last.Name = structure(c(1L, 2L, 2L,
2L, 2L), .Label = c("B", "D"), class = "factor"), Income = c(55L,
55L, 55L, 55L, 55L), Tax = c(24L, 24L, 24L, 24L, 24L), Location = structure(c(3L,
3L, 1L, 4L, 2L), .Label = c("Americas", "AP", "EMEA", "LATAM"
), class = "factor")), .Names = c("First.Name", "Last.Name",
"Income", "Tax", "Location"), class = "data.frame", row.names = c(NA,
-5L))
Here's my code:
# Inline stand-ins for the contents of the two CSV files
dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"
tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)
# Read each connection into a data frame, then stack the data frames row-wise
merged_file <- do.call(rbind, lapply(list(tc1, tc2), read.csv))
While this works beautifully, I wanted to change lapply to sapply. From the above thread, I understand that sapply would turn the factors read from the CSV files into a matrix, but I am unsure why the fields are flipped. For instance, the Income field occupies rows 3 and 8 instead of sitting in a single column.
Here's the code:
tc1 <- textConnection(dat1)
tc2 <- textConnection(dat2)
# change lapply to sapply
merged_file <- do.call(rbind, sapply(list(tc1,tc2), read.csv))
Here's the output:
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    2    1    1    1
 [2,]    1    2    2    2    2
 [3,]   55   23   34   45   44
 [4,]   23   21   22   24   25
 [5,]    3    3    1    4    2
 [6,]    1    2    1    1    1
 [7,]    1    2    2    2    2
 [8,]   55   55   55   55   55
 [9,]   24   24   24   24   24
[10,]    3    3    1    4    2
I'd appreciate any help. I am fairly new to R and not sure what's going on.
Answer 1:
The issue has nothing to do with factors; it's the generic difference between sapply and lapply.
Why does sapply get it so wrong whereas lapply gets it right? Remember that in R, data frames are lists of columns, and each column can have a distinct type.
- lapply returns a list of data frames (each one a list of columns) to rbind, which does the concatenation correctly. It keeps corresponding columns together, so your factors emerge correctly.
- sapply, however, tries to simplify its result into a matrix (and a matrix can only have one type, unlike a data frame)...
- ...which, worse still, comes with an unwanted transpose.
- So by the time sapply and rbind are done, each of your two 5x6 input data frames has become a transposed 6x5 block (columns now correspond to rows), with all data coerced to numeric (garbage!).
- rbind then row-"concatenates" those two garbage 6x5 blocks into one very-garbage 12x5 numeric matrix. Since columns have been transposed into rows, each matrix column now mixes values from different fields and data types, and obviously your factors are messed up.
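The steps above can be checked by inspecting what each function hands to rbind before do.call runs. This is only a sketch: it reuses dat1 and dat2 from the question, but reads them with read.csv(text = ...) instead of textConnection (my substitution, so the snippet can be re-run), and it sets stringsAsFactors = TRUE explicitly because R 4.0.0 and later no longer create factors by default. The helper name read_one is hypothetical. On the R versions I have in mind, sapply's simplified result is a 6x2 matrix of mode list with one whole column per cell, and it is the following rbind that lays those 12 columns out as rows and reduces the factors to their integer codes.

```r
dat1 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,23,EMEA\n2,C,D,23,21,EMEA\n3,A,D,34,22,Americas\n4,A,D,45,24,LATAM\n5,A,D,44,25,AP"
dat2 <- ",First.Name,Last.Name,Income,Tax,Location\n1,A,B,55,24,EMEA\n2,C,D,55,24,EMEA\n3,A,D,55,24,Americas\n4,A,D,55,24,LATAM\n5,A,D,55,24,AP"

# Hypothetical helper: parse one CSV string, forcing factors as in the question
read_one <- function(txt) read.csv(text = txt, stringsAsFactors = TRUE)

via_lapply <- lapply(list(dat1, dat2), read_one)  # list of two data frames
via_sapply <- sapply(list(dat1, dat2), read_one)  # simplified by sapply

# What rbind receives in each case
str(via_lapply, max.level = 1)
dim(via_sapply)     # fields x files
typeof(via_sapply)  # storage mode of the simplified result

good <- do.call(rbind, via_lapply)  # data frame, columns kept together
bad  <- do.call(rbind, via_sapply)  # one matrix row per original column

str(good)
dim(bad)
```

Calling str() or dim() on the intermediate object like this is usually the quickest way to tell whether sapply has simplified something you wanted left as a list.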
Summary: just use lapply.
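For actual files on disk, the lapply pattern might look like the sketch below. The directory, the file names, and the two trimmed-down data frames are assumptions made up for illustration (they borrow the first two rows of each file in the question); only list.files, read.csv, lapply, do.call and rbind carry the technique.

```r
# Hypothetical setup: write two small CSV files into a fresh temporary directory
csv_dir <- tempfile("csv_demo")
dir.create(csv_dir)

file1 <- data.frame(First.Name = c("A", "C"), Last.Name = c("B", "D"),
                    Income = c(55L, 23L), Tax = c(23L, 21L),
                    Location = c("EMEA", "EMEA"))
file2 <- data.frame(First.Name = c("A", "C"), Last.Name = c("B", "D"),
                    Income = c(55L, 55L), Tax = c(24L, 24L),
                    Location = c("EMEA", "EMEA"))
write.csv(file1, file.path(csv_dir, "file1.csv"), row.names = FALSE)
write.csv(file2, file.path(csv_dir, "file2.csv"), row.names = FALSE)

# Collect the paths, read each file into its own data frame, then stack them
paths <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
frames <- lapply(paths, read.csv, stringsAsFactors = TRUE)
merged_file <- do.call(rbind, frames)

str(merged_file)  # still a data frame with one column per field
```

Keeping the intermediate list in its own variable (frames here) makes it easy to check each file's columns before binding, which helps when one file has a different header.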
Source: https://stackoverflow.com/questions/39666755/sapply-vs-lapply-while-reading-files-and-rbinding-them