Question
I have two data frames, X and Y.
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950))
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(3, 16, 180,
15, 585, 800 ),
Stop = c(15, 24, 201,
102, 612, 850))
I want to build a data.frame Z that contains the columns of X plus, for each row of X, the number of rows of Y that fall within that row's range on the same chromosome. For example, three rows of Y on chr1 fall within the range of the first row of X, so that row of Z gets a count of 3.
Z <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950),
Count = c(3, 0, 0, 1, 1, 1))
I would appreciate some help. So far I have only managed to get the count when X has a single row. I suppose I need a condition inside a for loop that iterates over the rows of X, but I don't know how to write it.
What I have tried:
I tried to count the rows of Y that match the criteria when X has only one row:
nrow(Y[Y$Start >= X$Start & Y$Stop <= X$Stop, ])
This worked when X has only 1 row, but not when I tried to put it in a for loop.
Answer 1:
You can do this with the tidyverse packages.
First, I recommend setting stringsAsFactors = FALSE so that V1 is stored as character rather than as a factor (this is already the default from R 4.0.0 onward).
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950), stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(3, 16, 180,
15, 585, 800 ),
Stop = c(15, 24, 201,
102, 612, 850), stringsAsFactors = FALSE)
library(tidyverse)
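# For each row of X, ..1, ..2 and ..3 are that row's V1, Start and Stop;
# inside filter(), V1, Start and Stop refer to the columns of Y.
X %>%
  mutate(count = pmap_int(list(V1, Start, Stop),
                          ~ filter(Y, V1 == ..1, Start >= ..2, Stop <= ..3) %>% nrow()))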
V1 Start Stop count
1 chr1 0 230 3
2 chr1 540 720 0
3 chr1 920 1270 0
4 chr2 0 350 1
5 chr2 582 635 1
6 ch2 715 950 1
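For readers who prefer to stay in base R, the loop described in the question can be sketched as follows. The attempt in the question breaks with several rows because `Y$Start >= X$Start` compares the two columns element by element (row 1 with row 1, row 2 with row 2, and so on) instead of comparing every row of Y against one row of X. This is only a sketch under that reading of the problem; the index `i` and the use of `vapply()` are illustrative choices, not part of the original answer.

```r
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715),
                Stop = c(230, 720, 1270, 350, 635, 950))
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180, 15, 585, 800),
                Stop = c(15, 24, 201, 102, 612, 850))

Z <- X
# Visit one row of X at a time and test every row of Y against it.
Z$Count <- vapply(seq_len(nrow(X)), function(i) {
  same_chr <- as.character(Y$V1) == as.character(X$V1[i])  # also safe if V1 is a factor
  inside <- Y$Start >= X$Start[i] & Y$Stop <= X$Stop[i]    # Y range fully inside X range
  sum(same_chr & inside)                                   # number of TRUE values
}, integer(1))

Z
```

The `as.character()` calls keep the comparison working whether or not `stringsAsFactors = FALSE` was set when the data frames were created.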
Answer 2:
Since you mention that you are a bioinformatician, I would point you to Bioconductor and the GenomicRanges package, which is built specifically for this type of problem.
library(GenomicRanges)
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950))
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
Start = c(3, 16, 180,
15, 585, 800 ),
Stop = c(15, 24, 201,
102, 612, 850))
x <- GRanges(X$V1, ranges = IRanges(X$Start, X$Stop))
y <- GRanges(Y$V1, ranges = IRanges(Y$Start, Y$Stop))
countOverlaps(x, y)
z <- GRanges(x, count = countOverlaps(x, y))
as.data.frame(z)
# seqnames start end width strand count
#1 chr1 0 230 231 * 3
#2 chr1 540 720 181 * 0
#3 chr1 920 1270 351 * 0
#4 chr2 0 350 351 * 1
#5 chr2 582 635 54 * 1
#6 ch2 715 950 236 * 1
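One caveat: by default, countOverlaps() counts any overlap between a range in x and a range in y, including partial overlaps, whereas the question asks for Y ranges that lie entirely inside an X range. For the example data the two definitions give the same counts. If strict containment matters, one possible sketch is to use findOverlaps() with type = "within", taking y as the query and x as the subject, and then to tally the hits per X range. The variable names `hits` and `within_counts` are illustrative, and the use of tabulate() is an assumption about how to summarise the hits rather than part of the original answer.

```r
library(GenomicRanges)

X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715),
                Stop = c(230, 720, 1270, 350, 635, 950))
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180, 15, 585, 800),
                Stop = c(15, 24, 201, 102, 612, 850))

x <- GRanges(X$V1, ranges = IRanges(X$Start, X$Stop))
y <- GRanges(Y$V1, ranges = IRanges(Y$Start, Y$Stop))

# type = "within": the query range (from y) must lie wholly inside the subject range (from x).
hits <- findOverlaps(y, x, type = "within")

# Count how many y ranges landed in each x range; nbins keeps the zero counts.
within_counts <- tabulate(subjectHits(hits), nbins = length(x))
within_counts
```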
Source: https://stackoverflow.com/questions/58559246/r-trying-to-count-the-number-of-currencies-in-one-data-frame-based-on-the-posit