Combining tab-delimited files into a single file using R

I have several txt files, each with 3 columns, like this:

file 1:

ProbeID X_Signal_intensity X_P-Value   
xxx         2.34          .89
xxx         6.45          .04 
xxx         1.09          .91  
xxx         5.87          .70
.            .            . 
.            .            .
.            .            .

file 2:

ProbeID Y_Signal_intensity Y_P-Value   
xxx         1.4             .92
xxx         2.55            .14 
xxx         4.19            .16  
xxx         3.47            .80
.            .               . 
.            .               .
.            .               .

file 3:

ProbeID Z_Signal_intensity Z_P-Value   
xxx         9.40             .82
xxx         1.55            .04 
xxx         3.19            .56  
xxx         2.47            .90
.            .               . 
.            .               .
.            .               .

In all of the above files the values in the ProbeID column are identical, but the other columns differ. Now I want to combine all of these files, using a for-loop, into a single file like this:

ProbeID X_intensity X_P-Value   Y_intensity Y_P-Value   Z_intensity Z_P-Value     
xxx      2.34          .89       1.4             .92     9.40            .82
xxx      6.45          .04       2.55            .14     1.55            .04
xxx      1.09          .91       4.19            .16     3.19            .56
xxx      5.87          .70       3.47            .80     2.47            .90

Please do help me.

My approach is to read the files into data frames.

See help(read.delim) for the options available when reading delimited files.

After you have your three data frames you can merge them two at a time:

total <- merge(dataframeA,dataframeB,by="ProbeID")
total <- merge(total,dataframeC,by="ProbeID")

See http://www.statmethods.net/management/merging.html for documentation.
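Since the question asks for a for-loop, the pairwise merge above can be repeated over a list of data frames. The following is only a minimal sketch: the data frames are hypothetical in-memory stand-ins for the three files (the ProbeID values p1 to p3 are made up, and the P-Value column names are written the way R would sanitize them on reading), so in practice each one would come from something like read.delim.

```r
# Hypothetical stand-ins for the three input files; real data would be
# read from disk, e.g. with read.delim("file X.txt").
x <- data.frame(ProbeID = c("p1", "p2", "p3"),
                X_Signal_intensity = c(2.34, 6.45, 1.09),
                X_P.Value = c(0.89, 0.04, 0.91))
y <- data.frame(ProbeID = c("p1", "p2", "p3"),
                Y_Signal_intensity = c(1.40, 2.55, 4.19),
                Y_P.Value = c(0.92, 0.14, 0.16))
z <- data.frame(ProbeID = c("p1", "p2", "p3"),
                Z_Signal_intensity = c(9.40, 1.55, 3.19),
                Z_P.Value = c(0.82, 0.04, 0.56))
frames <- list(x, y, z)

# Start from the first data frame and merge each remaining one onto it,
# matching rows on the shared ProbeID column.
total <- frames[[1]]
for (i in seq_along(frames)[-1]) {
  total <- merge(total, frames[[i]], by = "ProbeID")
}

print(total)
```

The loop works for any number of data frames in the list, because merge itself only ever sees two at a time.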

Read in the files as given by Richie Cotton, but make sure you add the appropriate extra arguments in the lapply call. For one, header=TRUE should probably be added.

file.names <- c("file X.txt", "file Y.txt", "file Z.txt")
file.list <- lapply(file.names, read.table, header=TRUE)

Then you'll probably need merge_recurse from the reshape package:

require(reshape)
mynewframe <- merge_recurse(file.list,all.x=TRUE,all.y=TRUE,by="ProbeID")

This will work for any number of data frames, provided it's not a billion of them. For more information on the arguments used, see ?merge.

CORRECTION: in merge_recurse, you have to use all.x and all.y as shown in the code above. You can't just use the shortcut all or you'll get errors.

Small demonstration:

X2 <- data.frame(ProbeID=(2:4),Z2=4:6)
X1 <- data.frame(ProbeID=1:3,Z1=1:3)
X3 <- data.frame(ProbeID=1:3,Z3=7:9)
file.list <- list(X1,X2,X3)
mynewframe <- merge_recurse(file.list,all.x=TRUE,all.y=TRUE,by="ProbeID")
> mynewframe
  ProbeID Z1 Z2 Z3
1       1  1 NA  7
2       2  2  4  8
3       3  3  5  9
4       4 NA  6 NA

Read in your files

filenames <- c("file X.txt", "file Y.txt", "file Z.txt")
data_list <- lapply(filenames, read.table, header = TRUE)

Combine them into one big data frame

~~all_data <- do.call(cbind, data_list)~~

~~all_data <- do.call(merge, data_list, by = "ProbeID")~~

This gives a good lesson to "always concentrate when providing an answer". cbind isn't smart enough to do ID matching, and merge isn't smart enough to handle more than two data frames. Take a look at Joris's answer and use merge_recurse instead. Or forget what you thought you wanted and use my other answer below.
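To illustrate why cbind is the wrong tool here, the sketch below uses two small hypothetical data frames (the names a and b and their values are made up for illustration) whose rows are in different orders. cbind simply glues columns side by side by row position, while merge matches rows on ProbeID.

```r
# Two hypothetical data frames with the same ProbeIDs in a different row order.
a <- data.frame(ProbeID = c("p1", "p2", "p3"), X = c(1, 2, 3),
                stringsAsFactors = FALSE)
b <- data.frame(ProbeID = c("p3", "p1", "p2"), Y = c(30, 10, 20),
                stringsAsFactors = FALSE)

# cbind pairs rows by position: row 1 puts p1 next to p3, and the
# ProbeID column appears twice.
glued <- cbind(a, b)
print(glued)

# merge pairs rows by the value of ProbeID, so each X sits next to the
# Y that belongs to the same probe.
matched <- merge(a, b, by = "ProbeID")
print(matched)
```

Even when the row order happens to agree in every file, cbind still duplicates the ProbeID column, so the merge-based approaches remain the safer choice.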

Actually, a better idea, rather than having many columns, would be to have just 4 columns: ProbeID, Signal_intensity, P_value and Source_file.

data_list <- lapply(data_list, function(x) {
  colnames(x) <- c("ProbeID", "Signal_intensity", "P_value")
  x
})

all_data <- do.call(rbind, data_list)
all_data$Source_file <- rep(filenames, times = sapply(data_list, nrow))
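One reason the 4-column layout can be convenient is that filtering and summarising no longer depend on how many files there were. The sketch below is only an illustration under stated assumptions: the two small data frames stand in for files that have already been read and renamed, and the ProbeID values and the 0.05 cutoff are made up.

```r
# Hypothetical stand-ins for two files that were already read in and
# given the common column names.
filenames <- c("file X.txt", "file Y.txt")
data_list <- list(
  data.frame(ProbeID = c("p1", "p2"),
             Signal_intensity = c(2.34, 6.45),
             P_value = c(0.89, 0.04)),
  data.frame(ProbeID = c("p1", "p2"),
             Signal_intensity = c(1.40, 2.55),
             P_value = c(0.92, 0.14))
)

# Stack the rows and record which file each row came from.
all_data <- do.call(rbind, data_list)
all_data$Source_file <- rep(filenames, times = sapply(data_list, nrow))

# Keep only rows below an (assumed) significance cutoff, across all files.
significant <- subset(all_data, P_value < 0.05)

# Mean signal intensity per source file.
mean_by_file <- aggregate(Signal_intensity ~ Source_file,
                          data = all_data, FUN = mean)

print(significant)
print(mean_by_file)
```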

I am going to throw another approach into the mix, which uses Reduce:

Reduce(function(...) merge(..., all = TRUE), file.list)
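For completeness, here is a rough end-to-end sketch of the Reduce approach that also writes the combined table back out as a tab-delimited file, which is what the question ultimately asks for. Everything specific in it is an assumption made for illustration: the temporary directory, the helper make_file, the file names, the ProbeID values, and the use of _P_Value (rather than P-Value) in the column names to avoid R renaming them on import.

```r
# Create three small tab-delimited files in a temporary directory so the
# example is self-contained. make_file is a hypothetical helper.
tmp_dir <- tempfile("probe_demo_")
dir.create(tmp_dir)

make_file <- function(prefix, intensity, p) {
  df <- data.frame(ProbeID = c("p1", "p2", "p3"), intensity, p)
  names(df)[2:3] <- paste0(prefix, c("_Signal_intensity", "_P_Value"))
  path <- file.path(tmp_dir, paste0("file_", prefix, ".txt"))
  write.table(df, path, sep = "\t", quote = FALSE, row.names = FALSE)
  path
}

file.names <- c(
  make_file("X", c(2.34, 6.45, 1.09), c(0.89, 0.04, 0.91)),
  make_file("Y", c(1.40, 2.55, 4.19), c(0.92, 0.14, 0.16)),
  make_file("Z", c(9.40, 1.55, 3.19), c(0.82, 0.04, 0.56))
)

# read.delim assumes a header row and tab separators by default.
file.list <- lapply(file.names, read.delim)

# Fold the list into one data frame, merging on ProbeID at each step.
combined <- Reduce(function(a, b) merge(a, b, by = "ProbeID", all = TRUE),
                   file.list)

# Write the result as a single tab-delimited file.
out_path <- file.path(tmp_dir, "combined.txt")
write.table(combined, out_path, sep = "\t", quote = FALSE, row.names = FALSE)

print(combined)
```

Naming by = "ProbeID" explicitly is a design choice in this sketch; without it, merge joins on every column name the two data frames share, which here is also just ProbeID.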

Source: https://stackoverflow.com/questions/6942662/combining-tab-delim-files-into-a-single-file-using-r
