Combining tab-delimited files into a single file using R

Submitted by 做~自己de王妃 on 2019-11-30 22:00:55

My approach is to read the files into data frames.

See help(read.delim) for the available reading options.
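A minimal sketch of that reading step, assuming the files are tab-delimited with a header row. The temporary file, its contents and the column name Signal are made up purely for illustration:

```r
# Hypothetical example data written to a temporary tab-delimited file
tmp <- tempfile(fileext = ".txt")
writeLines(c("ProbeID\tSignal\tP_value",
             "1\t10.5\t0.01",
             "2\t20.1\t0.20"), tmp)

# read.delim defaults to sep = "\t" and header = TRUE
dataframeA <- read.delim(tmp)
str(dataframeA)
```

read.delim is just read.table with defaults suited to tab-delimited input, so the same extra arguments apply to both.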

Once you have your three data frames, you can merge them two at a time on the shared key column, for example:

total <- merge(dataframeA,dataframeB,by="ProbeID")

See http://www.statmethods.net/management/merging.html for documentation.
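Since merge only accepts two data frames per call, three of them need a chained call. A rough sketch, with made-up data and a hypothetical third frame called dataframeC (the column names ZA, ZB and ZC are also invented):

```r
# Hypothetical data frames sharing a ProbeID key column
dataframeA <- data.frame(ProbeID = 1:3, ZA = c(0.1, 0.2, 0.3))
dataframeB <- data.frame(ProbeID = 1:3, ZB = c(1.1, 1.2, 1.3))
dataframeC <- data.frame(ProbeID = 2:4, ZC = c(2.1, 2.2, 2.3))

# merge() takes two data frames at a time, so chain the calls
total <- merge(dataframeA, dataframeB, by = "ProbeID")
# all = TRUE keeps ProbeIDs that appear in only one input (filled with NA)
total <- merge(total, dataframeC, by = "ProbeID", all = TRUE)
print(total)
```

Without all = TRUE, merge drops any ProbeID that is missing from either side, which may or may not be what you want.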

Read in the files as shown by Richie Cotton, but make sure you pass the appropriate extra arguments in the lapply call. For one, header=TRUE should probably be added, and sep="\t" too if fields in your tab-delimited files can contain spaces.

file.names <- c("file X.txt", "file Y.txt", "file Z.txt")
file.list <- lapply(file.names, read.table, header=TRUE)

Then you'll probably need merge_recurse from the reshape package:

require(reshape)
mynewframe <- merge_recurse(file.list,all.x=TRUE,all.y=TRUE,by="ProbeID")

This will work for any number of data frames, provided it's not a billion of them. For more information on the arguments used, see the help page for merge (?merge).

CORRECTION: in merge_recurse, you have to use all.x and all.y as shown in the code above. You can't just use the shortcut all or you'll get errors.

Small demonstration:

X2 <- data.frame(ProbeID=(2:4),Z2=4:6)
X1 <- data.frame(ProbeID=1:3,Z1=1:3)
X3 <- data.frame(ProbeID=1:3,Z3=7:9)
file.list <- list(X1,X2,X3)
mynewframe <- merge_recurse(file.list,all.x=TRUE,all.y=TRUE,by="ProbeID")
> mynewframe
  ProbeID Z1 Z2 Z3
1       1  1 NA  7
2       2  2  4  8
3       3  3  5  9
4       4 NA  6 NA

Read in your files

filenames <- c("file X.txt", "file Y.txt", "file Z.txt")
data_list <- lapply(filenames, read.table, header = TRUE)

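Combine them into one big data frame. Neither of these first attempts works:

all_data <- do.call(cbind, data_list)  # wrong: binds columns by row position, no ProbeID matching

all_data <- do.call(merge, data_list, by = "ProbeID")  # wrong: merge only handles two data frames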

This is a good lesson in "always concentrate when providing an answer". cbind isn't smart enough to do ID matching, and merge isn't smart enough to handle more than two data frames. Take a look at Joris's answer and use merge_recurse instead. Or forget what you thought you wanted and use my other answer below.
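To see the cbind problem concretely, here is a small sketch using two made-up data frames whose ProbeIDs only partly overlap:

```r
# Hypothetical inputs: same kind of data, but different ProbeIDs per file
X1 <- data.frame(ProbeID = 1:3, Z1 = 1:3)
X2 <- data.frame(ProbeID = 2:4, Z2 = 4:6)

# cbind pastes columns side by side by row position
bound <- cbind(X1, X2)
# Row 1 now pairs ProbeID 1 (from X1) with ProbeID 2 (from X2)
print(bound)

# merge matches rows on the key instead
merged <- merge(X1, X2, by = "ProbeID", all = TRUE)
print(merged)
```

cbind raises no error here, so the misalignment is easy to miss unless you compare the two ProbeID columns.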


Actually, a better idea than having many columns would be to have just four columns: ProbeID, Signal_intensity, P_value and Source_file.

data_list <- lapply(data_list, function(x) {
  colnames(x) <- c("ProbeID", "Signal_intensity", "P_value")
  x
})

all_data <- do.call(rbind, data_list)
all_data$Source_file <- rep(filenames, times = sapply(data_list, nrow))
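A self-contained sketch of the same long-format idea, with made-up data frames standing in for the files (the values and the V1 to V3 column names are invented for illustration):

```r
# Hypothetical stand-ins for the data frames read from the files
filenames <- c("file X.txt", "file Y.txt")
data_list <- list(
  data.frame(V1 = 1:2, V2 = c(10.5, 20.1), V3 = c(0.01, 0.20)),
  data.frame(V1 = 1:3, V2 = c(11.0, 19.8, 5.2), V3 = c(0.02, 0.30, 0.50))
)

# Give every data frame the same column names so rbind can stack them
data_list <- lapply(data_list, function(x) {
  colnames(x) <- c("ProbeID", "Signal_intensity", "P_value")
  x
})

all_data <- do.call(rbind, data_list)
# Label each row with the file it came from
all_data$Source_file <- rep(filenames, times = sapply(data_list, nrow))

# With the long layout, a per-file summary is a single call
per_file <- aggregate(Signal_intensity ~ Source_file, data = all_data, FUN = mean)
print(per_file)
```

This layout also copes with files that have different numbers of rows or different sets of ProbeIDs, since nothing has to line up across files.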

I am going to throw another approach into the mix, which uses Reduce from base R:

Reduce(function(...) merge(..., all = TRUE), file.list)
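As a quick check, here is the Reduce approach applied to the small demonstration data frames from Joris's answer. The explicit by = "ProbeID" is my addition for clarity; merge would pick the shared column by default:

```r
X1 <- data.frame(ProbeID = 1:3, Z1 = 1:3)
X2 <- data.frame(ProbeID = 2:4, Z2 = 4:6)
X3 <- data.frame(ProbeID = 1:3, Z3 = 7:9)
file.list <- list(X1, X2, X3)

# Reduce folds the list left to right: merge(merge(X1, X2), X3)
merged <- Reduce(function(x, y) merge(x, y, by = "ProbeID", all = TRUE),
                 file.list)
print(merged)
```

This needs no extra package, and it should give the same four-row result as the merge_recurse demonstration.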