Question
I have n matrices in a list, plus an additional matrix containing the values I want to look up in that list.
To build the list of matrices, I use this code:
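setwd("C:\\~\\Documents\\R")
import.multiple.txt.files <- function(pattern = ".txt", header = TRUE)
{
  # use the arguments instead of hard-coded values
  list.1 <- list.files(pattern = pattern)
  list.2 <- list()
  for (i in seq_along(list.1))
  {
    list.2[[i]] <- read.delim(list.1[i], header = header)
  }
  names(list.2) <- list.1
  list.2
}
txt.import <- import.multiple.txt.files()  # assumed call; not shown in the original post
txt.import.matrix <- cbind(txt.import)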
My list looks like this (only an example with n = 2 is shown). The number of rows differs between matrices; here I use just 5 and 6 rows to keep it simple, but my real data has more than 500.
txt.import.matrix[1]
[[1]]
X. RT. Area. m.z.
1 1 1.01 2820.1 358.9777
2 2 1.03 9571.8 368.4238
3 3 2.03 6674.0 284.3294
4 4 2.03 5856.3 922.0094
5 5 3.03 27814.6 261.1299
txt.import.matrix[2]
[[2]]
X. RT. Area. m.z.
1 1 1.01 7820.1 358.9777
2 2 1.06 8271.8 368.4238
3 3 2.03 12674.0 284.3294
4 4 2.03 5856.6 922.0096
5 5 2.03 17814.6 261.1299
6 6 3.65 5546.5 528.6475
I have another matrix of values I want to find in the list of matrices. It was obtained by combining all the matrices from the list into one and removing the duplicates.
reduced.list.pre.filtering
RT. m.z.
1 1.01 358.9777
2 1.07 368.4238
3 2.05 284.3295
4 2.03 922.0092
5 3.03 261.1299
6 3.56 869.4558
I would like to obtain a new matrix that reports, for every matrix in the list, the Area. of the rows whose RT. matches within ± 0.02 and whose m.z. matches within ± 0.0002. The output could look like this:
    RT.     m.z. Area.[1] Area.[2]
1  1.01 358.9777   2820.1   7820.1
2  1.07 368.4238            8271.8
3  2.05 284.3295   6674.0  12674.0
4  2.03 922.0092   5856.3
5  3.03 261.1299  27814.6
6  3.65 528.6475
I only know how to match a single exact value in one matrix. The difficulty here is that the values have to be found in a list of matrices, and within an interval rather than exactly. If you have any suggestions, I would be very grateful.
Answer 1:
This is an alternative approach to Arun's rather elegant answer using data.table. I decided to post it because it contains two additional aspects that are important considerations in your problem:
Floating point comparison: testing whether a floating point value lies in an interval requires accounting for the round-off error in computing the interval. This is the general problem of comparing floating point representations of real numbers, and it applies in R as well. The following implements this comparison in the function in.interval.
Multiple matches: your interval match criterion can result in multiple matches if the intervals overlap. The following assumes that you only want the first match (with respect to increasing rows of each txt.import.matrix matrix). This is implemented in the function match.interval and explained in the notes to follow. Other logic is needed if you want to get something like the average of the areas that match your criterion.
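As a small illustration of the first point, the sketch below compares a naive interval test with the tolerance-based one. The definition of in.interval is copied from this answer; the test values come from the RT. columns in the question, and the variable names naive, tolerant and far are only illustrative.

```r
# Copy of the tolerance-aware comparison used in this answer.
in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  abs(x - center) <= (deviation + tol)
}

# 1.03 and 1.01 are nominally exactly 0.02 apart, but none of these numbers
# is exactly representable in binary floating point, so the computed
# difference ends up slightly larger than the stored value of 0.02.
naive    <- abs(1.03 - 1.01) <= 0.02       # FALSE with IEEE 754 doubles
tolerant <- in.interval(1.03, 1.01, 0.02)  # TRUE: the tolerance absorbs the round-off

print(c(naive = naive, tolerant = tolerant))

# Values clearly outside the window are still rejected.
far <- in.interval(1.07, 1.03, 0.02)
print(far)
```

The default tolerance (the square root of the machine epsilon, about 1.5e-8) is far smaller than either window in the question, so it should not admit rows that are genuinely out of range.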
To find the matching row(s) in a matrix from txt.import.matrix for each row in the matrix reduced.list.pre.filtering, the following code vectorizes the application of the comparison function over the space of all enumerated pairs of rows between reduced.list.pre.filtering and the matrix from txt.import.matrix. Functionally, for this application, this is the same as Arun's solution using data.table's non-equi joins; however, the non-equi join feature is more general, and the data.table implementation is most likely better optimized for both memory usage and speed, even for this application.
in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  return (abs(x-center) <= (deviation + tol))
}
match.interval <- function(r, t) {
  r.rt <- rep(r[,1], each=nrow(t))
  t.rt <- rep(t[,2], times=nrow(r))
  r.mz <- rep(r[,2], each=nrow(t))
  t.mz <- rep(t[,4], times=nrow(r)) ## 1.
  ind <- which(in.interval(r.rt, t.rt, 0.02) &
               in.interval(r.mz, t.mz, 0.0002))
  r.ind <- floor((ind - 1)/nrow(t)) + 1 ## 2.
  dup <- duplicated(r.ind)
  r.ind <- r.ind[!dup]
  t.ind <- ind[!dup] - (r.ind - 1)*nrow(t) ## 3.
  return(cbind(r.ind,t.ind))
}
get.area.matched <- function(r, t) {
  match.ind <- match.interval(r, t)
  area <- rep(NA,nrow(r))
  area[match.ind[,1]] <- t[match.ind[,2], 3] ## 4.
  return(area)
}
res <- cbind(reduced.list.pre.filtering,
             do.call(cbind,lapply(txt.import.matrix,
                                  get.area.matched,
                                  r=reduced.list.pre.filtering))) ## 5.
colnames(res) <- c(colnames(reduced.list.pre.filtering),
                   sapply(seq_len(length(txt.import.matrix)),
                          function(i) {return(paste0("Area.[",i,"]"))})) ## 6.
print(res)
## RT. m.z. Area.[1] Area.[2]
##[1,] 1.01 358.9777 2820.1 7820.1
##[2,] 1.07 368.4238 NA 8271.8
##[3,] 2.05 284.3295 6674.0 12674.0
##[4,] 2.03 922.0092 5856.3 NA
##[5,] 3.03 261.1299 27814.6 NA
##[6,] 3.56 869.4558 NA NA
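To try the code above without the original text files, the inputs can be rebuilt by hand and defined before running it. The numbers below are copied from the tables printed in the question; storing them as data frames (which is what read.delim returns) is an assumption, and with data frames the printed row labels will look slightly different from the matrix-style output shown above.

```r
# Sample data typed in from the tables shown in the question.
# Column order matters: the answer's code addresses columns by position
# (2 = RT., 3 = Area., 4 = m.z.).
txt.import.matrix <- list(
  data.frame(X.    = 1:5,
             RT.   = c(1.01, 1.03, 2.03, 2.03, 3.03),
             Area. = c(2820.1, 9571.8, 6674.0, 5856.3, 27814.6),
             m.z.  = c(358.9777, 368.4238, 284.3294, 922.0094, 261.1299)),
  data.frame(X.    = 1:6,
             RT.   = c(1.01, 1.06, 2.03, 2.03, 2.03, 3.65),
             Area. = c(7820.1, 8271.8, 12674.0, 5856.6, 17814.6, 5546.5),
             m.z.  = c(358.9777, 368.4238, 284.3294, 922.0096, 261.1299, 528.6475))
)

# Target values to look up (columns: 1 = RT., 2 = m.z.).
reduced.list.pre.filtering <- data.frame(
  RT.  = c(1.01, 1.07, 2.05, 2.03, 3.03, 3.56),
  m.z. = c(358.9777, 368.4238, 284.3295, 922.0092, 261.1299, 869.4558)
)

# Quick sanity check on the shapes.
print(sapply(txt.import.matrix, nrow))
print(dim(reduced.list.pre.filtering))
```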
Notes:
1. This part constructs the data that enables vectorizing the comparison function over the space of all enumerated pairs of rows between reduced.list.pre.filtering and the matrix from txt.import.matrix. Four arrays are constructed: the two columns of reduced.list.pre.filtering used in the comparison criterion, replicated in the row dimension of the matrix from txt.import.matrix, and the two columns of that matrix used in the comparison criterion, replicated in the row dimension of reduced.list.pre.filtering. Here, the term array refers to either a 2-D matrix or a 1-D vector. The resulting four arrays are:
   r.rt is the replication of the RT. column of reduced.list.pre.filtering (i.e., r[,1]) in the row dimension of t
   t.rt is the replication of the RT. column of the matrix from txt.import.matrix (i.e., t[,2]) in the row dimension of r
   r.mz is the replication of the m.z. column of reduced.list.pre.filtering (i.e., r[,2]) in the row dimension of t
   t.mz is the replication of the m.z. column of the matrix from txt.import.matrix (i.e., t[,4]) in the row dimension of r
What is important is that the indices for each of these arrays enumerate all pairs of rows in r and t in the same manner. Specifically, viewing these arrays as 2-D matrices of size M by N, where M=nrow(t) and N=nrow(r), the row indices correspond to the rows of t and the column indices correspond to the rows of r. Consequently, the values (over all four arrays) at the i-th row and the j-th column are the values used in the comparison criterion between the j-th row of r and the i-th row of t.
Implementation of this replication process uses the R function rep. For example, in computing r.rt, rep with each=M is used, which has the effect of treating its input r[,1] as a row vector and replicating that row M times to form M rows. The result is such that each column, which corresponds to a row in r, has the RT. value from the corresponding row of r, and that value is the same for all rows (of that column) of r.rt, each of which corresponds to a row in t. This means that in comparing that row in r to any row in t, the value of RT. from that row in r is used. Conversely, in computing t.rt, rep with times=N is used, which has the effect of treating its input as a column vector and replicating that column N times to form N columns. The result is such that each row in t.rt, which corresponds to a row in t, has the RT. value from the corresponding row of t, and that value is the same for all columns (of that row) of t.rt, each of which corresponds to a row in r. This means that in comparing that row in t to any row in r, the value of RT. from that row in t is used. Similarly, the computations of r.mz and t.mz follow using the m.z. column from r and t, respectively.
2. This performs the vectorized comparison, resulting in an M by N logical matrix whose element in the i-th row and the j-th column is TRUE if the j-th row of r matches the criterion with the i-th row of t, and FALSE otherwise. The output of which() is the vector of linear indices into this logical comparison result at which the element is TRUE. We want to convert these linear indices to the row and column indices of the comparison result matrix to refer back to the rows of r and t. The next line extracts the column indices from the linear indices. Note that the variable name is r.ind to denote that these correspond to the rows of r. We extract this first because it is important for detecting multiple matches for a row in r.
3. This part handles possible multiple matches in t for a given row in r. Multiple matches show up as duplicate values in r.ind. As stated above, the logic here only keeps the first match in terms of increasing rows in t. The function duplicated returns a logical vector that flags every element duplicating an earlier one, so removing the flagged elements does what we want. The code first removes them from r.ind, then removes them from ind, and finally computes the row indices of the comparison result matrix, which correspond to the rows of t, using the pruned ind and r.ind. What is returned by match.interval is a matrix whose rows are matched pairs of row indices, with its first column being row indices into r and its second column being row indices into t.
4. The get.area.matched function simply uses the result from match.interval (stored in match.ind) to extract the Area from t for all matches. Note that the returned result is a (column) vector with length equal to the number of rows in r, initialized to NA. In this way, rows in r that have no match in t have a returned Area of NA.
5. This uses lapply to apply the function get.area.matched over the list txt.import.matrix and append the returned matched Area results to reduced.list.pre.filtering as column vectors.
6. Similarly, the appropriate column names are also appended and set in the result res.
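The index bookkeeping described in the notes above can be checked on a toy case. This is a minimal sketch with made-up values: r.val, t.val and the plain equality test are stand-ins for the real columns and for in.interval, not part of the answer. It also shows that base R's which(..., arr.ind = TRUE) yields the same row and column pairs as the manual conversion.

```r
# Toy inputs (hypothetical values): r has N = 2 rows, t has M = 3 rows.
r.val <- c(10, 20)
t.val <- c(10, 20, 30)
M <- length(t.val)
N <- length(r.val)

# Same expansion as in match.interval: both vectors have length M * N.
r.rep <- rep(r.val, each  = M)   # 10 10 10 20 20 20
t.rep <- rep(t.val, times = N)   # 10 20 30 10 20 30

# Viewed as an M x N matrix (R fills matrices column by column),
# entry [i, j] compares row i of t with row j of r.
hits <- matrix(r.rep == t.rep, nrow = M, ncol = N)

# Linear indices of the TRUE entries, then the manual conversion from the answer.
ind   <- which(r.rep == t.rep)
r.ind <- floor((ind - 1) / M) + 1
t.ind <- ind - (r.ind - 1) * M
print(cbind(r.ind, t.ind))

# which(arr.ind = TRUE) returns the (row, col) = (t, r) pairs directly.
pairs <- which(hits, arr.ind = TRUE)
print(pairs)

# duplicated() flags repeats of an earlier value, so !dup keeps first matches.
dup <- duplicated(c(1, 1, 2))    # FALSE TRUE FALSE
print(dup)
```

Using arr.ind = TRUE would be one way to avoid the hand-written index arithmetic; the answer's version is kept here only for comparison.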
Edit: Alternative implementation using the foreach package
In hindsight, a better implementation uses the foreach package for vectorizing the comparison. In this implementation, the foreach and magrittr packages are required:
require("magrittr") ## for %>%
require("foreach")
Then the code in match.interval for vectorizing the comparison
r.rt <- rep(r[,1], each=nrow(t))
t.rt <- rep(t[,2], times=nrow(r))
r.mz <- rep(r[,2], each=nrow(t))
t.mz <- rep(t[,4], times=nrow(r)) # 1.
ind <- which(in.interval(r.rt, t.rt, 0.02) &
in.interval(r.mz, t.mz, 0.0002))
can be replaced by
ind <- foreach(r.row = 1:nrow(r), .combine=cbind) %:%
         foreach(t.row = 1:nrow(t)) %do%
           match.criterion(r.row, t.row, r, t) %>%
       as.logical(.) %>% which(.)
where match.criterion is defined as
match.criterion <- function(r.row, t.row, r, t) {
  return(in.interval(r[r.row,1], t[t.row,2], 0.02) &
         in.interval(r[r.row,2], t[t.row,4], 0.0002))
}
This is easier to parse and reflects what is being performed. Note that what is returned by the nested foreach combined with cbind is again a logical matrix. Finally, the application of the get.area.matched function over the list txt.import.matrix can also be performed using foreach:
res <- foreach(i = 1:length(txt.import.matrix), .combine=cbind) %do%
get.area.matched(reduced.list.pre.filtering, txt.import.matrix[[i]]) %>%
cbind(reduced.list.pre.filtering,.)
The complete code using foreach is as follows:
require("magrittr")
require("foreach")
in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  return (abs(x-center) <= (deviation + tol))
}
match.criterion <- function(r.row, t.row, r, t) {
  return(in.interval(r[r.row,1], t[t.row,2], 0.02) &
         in.interval(r[r.row,2], t[t.row,4], 0.0002))
}
match.interval <- function(r, t) {
  ind <- foreach(r.row = 1:nrow(r), .combine=cbind) %:%
           foreach(t.row = 1:nrow(t)) %do%
             match.criterion(r.row, t.row, r, t) %>%
         as.logical(.) %>% which(.)
  # which returns 1-D indices (column-major, as R stores matrices),
  # convert these to 2-D indices in (row,col)
  r.ind <- floor((ind - 1)/nrow(t)) + 1 ## 2.
  # detect duplicates in r.ind and remove them from ind
  dup <- duplicated(r.ind)
  r.ind <- r.ind[!dup]
  t.ind <- ind[!dup] - (r.ind - 1)*nrow(t) ## 3.
  return(cbind(r.ind,t.ind))
}
get.area.matched <- function(r, t) {
  match.ind <- match.interval(r, t)
  area <- rep(NA,nrow(r))
  area[match.ind[,1]] <- t[match.ind[,2], 3]
  return(area)
}
res <- foreach(i = 1:length(txt.import.matrix), .combine=cbind) %do%
         get.area.matched(reduced.list.pre.filtering, txt.import.matrix[[i]]) %>%
       cbind(reduced.list.pre.filtering,.)
colnames(res) <- c(colnames(reduced.list.pre.filtering),
                   sapply(seq_len(length(txt.import.matrix)),
                          function(i) {return(paste0("Area.[",i,"]"))}))
Hope this helps.
Answer 2:
Using non-equi joins from the current development version of data.table, v1.9.7 (see the installation instructions), which allows non-equi conditions to be provided to joins:
require(data.table) # v1.9.7
names(ll) = c("Area1", "Area2")
A = rbindlist(lapply(ll, as.data.table), idcol = "id") ## (1)
B = as.data.table(mat)
B[, c("RT.minus", "RT.plus") := .(RT.-0.02, RT.+0.02)]
B[, c("m.z.minus", "m.z.plus") := .(m.z.-0.0002, m.z.+0.0002)] ## (2)
ans = A[B, .(id, X., RT. = i.RT., m.z. = i.m.z., Area.),
on = .(RT. >= RT.minus, RT. <= RT.plus,
m.z. >= m.z.minus, m.z. <= m.z.plus)] ## (3)
dcast(ans, RT. + m.z. ~ id) ## (4)
# or dcast(ans, RT. + m.z. ~ id, fill = 0)
# RT. m.z. Area1 Area2
# 1: 1.01 358.9777 2820.1 7820.1
# 2: 1.07 368.4238 NA 8271.8
# 3: 2.03 922.0092 5856.3 NA
# 4: 2.05 284.3295 6674.0 12674.0
# 5: 3.03 261.1299 27814.6 NA
[1] Name the list of matrices (called ll here), convert each of them to a data.table using lapply(), bind them row-wise using rbindlist, and add the names as an extra column (idcol). Call it A.
[2] Convert the second matrix (called mat here) to a data.table as well. Add additional columns corresponding to the ranges/intervals you want to search for (since the on= argument, as we'll see in the next step, can't handle expressions yet). Call it B.
[3] Perform the conditional join/subset. For each row in B, find the matching rows in A corresponding to the condition provided to the on= argument, and extract the columns id, X., RT., m.z. and Area. for those matching row indices.
[4] It's better to leave it at [3]. But if you'd like it as shown in your question, we can reshape it into wide format. fill = 0 would replace NAs in the result with 0.
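For readers without data.table at hand, the logic of steps [2] to [4] can be imitated in base R with a cross join followed by a filter and a reshape. This is only a sketch of what the join computes, not the data.table API, and it will not scale like a real non-equi join because it materializes every pair of rows. The small A and B tables, the tolerance tol, and the names pairs, keep, long and wide are all assumptions made for illustration.

```r
# Small hypothetical inputs in the same layout as the answer's A and B.
A <- data.frame(id    = c("Area1", "Area1", "Area2"),
                RT.   = c(1.01, 1.03, 1.06),
                m.z.  = c(358.9777, 368.4238, 368.4238),
                Area. = c(2820.1, 9571.8, 8271.8))
B <- data.frame(RT. = c(1.01, 1.07), m.z. = c(358.9777, 368.4238))

# Cross join: every row of B paired with every row of A.
# Shared column names get the suffixes .x (from B) and .y (from A).
pairs <- merge(B, A, by = NULL)

# Keep the pairs that fall inside both windows (small tolerance for round-off).
tol  <- sqrt(.Machine$double.eps)
keep <- abs(pairs$RT..x  - pairs$RT..y)  <= 0.02   + tol &
        abs(pairs$m.z..x - pairs$m.z..y) <= 0.0002 + tol
long <- pairs[keep, c("RT..x", "m.z..x", "id", "Area.")]

# Long to wide: one row per target, one Area. column per source matrix.
wide <- reshape(long, idvar = c("RT..x", "m.z..x"), timevar = "id",
                v.names = "Area.", direction = "wide")
print(wide)
```

Unlike the join in [3], this sketch silently drops targets with no match at all; keeping them would need an extra merge back onto B.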
Answer 3:
This is a quick, rough approach that might help, if I understand what you're trying to do.
Unlist the values of each variable across the matrices in the list:
areas <- unlist(lapply(txt.import.matrix, function(x) x$Area.))
rts <- unlist(lapply(txt.import.matrix, function(x) x$RT.))
mzs <- unlist(lapply(txt.import.matrix, function(x) x$m.z.))
Find the indices of the RT and m.z. values that are closest to each value in the third matrix/data frame:
rtmins <- lapply(reduced.list.pre.filtering$RT., function(x) which(abs(rts-x)==min(abs(rts-x))))
mzmins <- lapply(reduced.list.pre.filtering$m.z., function(x) which(abs(mzs-x)==min(abs(mzs-x))))
Use purrr to quickly calculate which indices are in both (i.e. minimum difference for each):
inboth <- purrr::map2(rtmins,mzmins,intersect)
Get corresponding area value:
vals<-lapply(inboth, function(x) areas[x])
Use reshape2 to put into wide format:
vals2 <- reshape2::melt(vals)
vals2$number <- ave(vals2$L1, vals2$L1, FUN = seq_along)
vals.wide <-reshape2::dcast(vals2, L1 ~ number, value.var="value")
cbind(reduced.list.pre.filtering, vals.wide)
# RT. m.z. L1 1 2
#1 1.01 358.9777 1 2820.1 7820.1
#2 1.07 368.4238 2 8271.8 NA
#3 2.05 284.3295 3 6674.0 12674.0
#4 2.03 922.0092 4 5856.3 NA
#5 3.03 261.1299 5 27814.6 NA
This might give you some ideas. It could easily be adapted to check whether the shared minimum differences exceed +/- a given value.
Source: https://stackoverflow.com/questions/38426821/match-with-an-interval-and-extract-values-between-two-matrix-r
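One possible adaptation along those lines is sketched below. It is not the code from this answer: instead of finding the closest values first and checking the distance afterwards, it filters by both windows first and then picks the closest remaining candidate. The function name closest.within, its parameters, and the three pooled sample values are hypothetical.

```r
# Hypothetical pooled values (stand-ins for rts, mzs and areas above).
rts   <- c(1.01, 1.03, 2.03)
mzs   <- c(358.9777, 368.4238, 284.3294)
areas <- c(2820.1, 9571.8, 6674.0)

# Return the area of the closest candidate inside both windows,
# or NA if no candidate lies within the tolerances.
closest.within <- function(rt, mz, rt.tol = 0.02, mz.tol = 0.0002,
                           eps = sqrt(.Machine$double.eps)) {
  ok <- abs(rts - rt) <= rt.tol + eps & abs(mzs - mz) <= mz.tol + eps
  if (!any(ok)) return(NA_real_)
  cand <- which(ok)
  # Among the candidates, take the one with the smallest RT difference.
  areas[cand[which.min(abs(rts[cand] - rt))]]
}

hit  <- closest.within(2.05, 284.3295)   # third candidate is inside both windows
miss <- closest.within(3.56, 869.4558)   # nothing within tolerance
print(c(hit = hit, miss = miss))
```

Returning NA for misses keeps the result the same length as the list of targets, which makes it easy to bind back as a column.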