R: Trying to count the number of occurrences in one data frame based on the positions of the other data frame

Submitted by 。_饼干妹妹 on 2020-01-23 01:39:27

Question


I have two data frames, X and Y.

X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950))

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180,
                          15, 585, 800 ),
                Stop = c(15, 24, 201,
                         102, 612, 850))

I want to obtain a data frame Z that contains the rows of X plus, for each row, the number of rows of Y that fall within that row's range on the same chromosome. For example, three rows of Y fall within the range of the first row of X on chr1, so Z has a 3 in that row.

Z <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950),
                Count = c(3, 0, 0, 1, 1, 1))

I would appreciate some help: so far I have only managed to get the count when X has a single row. I suppose I need a conditional inside a for loop that iterates over the rows of X, but I don't know how to write it.

What I have tried:

  1. Tried to calculate the number of rows of "Y" that match the criteria when "X" has only one row:

    nrow(Y[Y$Start >= X$Start & Y$Stop <= X$Stop, ])

This worked when "X" had only one row, but not when I tried to put it in a for loop.


Answer 1:


You can do this with the tidyverse package.

First, I would recommend setting the option stringsAsFactors = FALSE when creating the data frames.

X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950), stringsAsFactors = FALSE)

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180,
                          15, 585, 800 ),
                Stop = c(15, 24, 201,
                         102, 612, 850), stringsAsFactors = FALSE)



library(tidyverse)
X %>%
  mutate(count = pmap_int(
    list(V1, Start, Stop),
    # ..1, ..2, ..3: the current X row's values; bare names are Y's columns
    ~ filter(Y, V1 == ..1, Start >= ..2, Stop <= ..3) %>% nrow()
  ))

    V1 Start Stop count
1 chr1     0  230     3
2 chr1   540  720     0
3 chr1   920 1270     0
4 chr2     0  350     1
5 chr2   582  635     1
6  ch2   715  950     1
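
If you would rather not load a package, the same per-row count can be sketched in base R. The attempt in the question compares Y$Start with the whole X$Start vector element by element, which is why it only works when X has one row; the sketch below instead tests all of Y against one row of X at a time. The helper name count_within is my own, not part of any package, and the data are the example frames from the question.

```r
# Example data from the question (the "ch2" label is kept exactly as posted).
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715),
                Stop = c(230, 720, 1270, 350, 635, 950),
                stringsAsFactors = FALSE)
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180, 15, 585, 800),
                Stop = c(15, 24, 201, 102, 612, 850),
                stringsAsFactors = FALSE)

# Hypothetical helper: for each row i of X, count the rows of Y that are on
# the same chromosome and lie entirely inside [X$Start[i], X$Stop[i]].
count_within <- function(X, Y) {
  vapply(seq_len(nrow(X)), function(i) {
    sum(Y$V1 == X$V1[i] & Y$Start >= X$Start[i] & Y$Stop <= X$Stop[i])
  }, integer(1))
}

Z <- cbind(X, Count = count_within(X, Y))
print(Z)
```

This loops over the rows of X, so it should be fine for small tables; for large ones a dedicated interval package is likely to be faster.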



Answer 2:


Since you mention that you are a bioinformatician, I would point you to Bioconductor and its GenomicRanges package, built specifically for this type of problem.

library(GenomicRanges)
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950))

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "ch2"),
                Start = c(3, 16, 180,
                          15, 585, 800 ),
                Stop = c(15, 24, 201,
                         102, 612, 850))


x <- GRanges(X$V1, ranges = IRanges(X$Start, X$Stop))
y <- GRanges(Y$V1, ranges = IRanges(Y$Start, Y$Stop))

countOverlaps(x, y)
z <- GRanges(x, count = countOverlaps(x, y))
as.data.frame(z)
#  seqnames start  end width strand count
#1     chr1     0  230   231      *     3
#2     chr1   540  720   181      *     0
#3     chr1   920 1270   351      *     0
#4     chr2     0  350   351      *     1
#5     chr2   582  635    54      *     1
#6      ch2   715  950   236      *     1
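
One caveat: countOverlaps() uses type = "any" by default, so it counts every range of y that touches a range of x, not only those fully contained in it. On the posted data the two criteria give the same counts, but they can differ. The base R sketch below illustrates the difference with a made-up Y interval (220 to 300) that straddles the right edge of the first X row; the helper count_rows and this extra interval are my own assumptions for illustration and are not in the question. If strict containment is what you need in GenomicRanges, findOverlaps() accepts type = "within", where the query is the range that must be contained.

```r
X <- data.frame(V1 = c("chr1", "chr1"),
                Start = c(0, 540),
                Stop = c(230, 720),
                stringsAsFactors = FALSE)
# Hypothetical Y: the first interval lies inside X row 1,
# the second straddles the right edge of X row 1.
Y <- data.frame(V1 = c("chr1", "chr1"),
                Start = c(3, 220),
                Stop = c(15, 300),
                stringsAsFactors = FALSE)

# Hypothetical helper: count Y rows per X row under either criterion.
count_rows <- function(X, Y, mode = c("within", "any")) {
  mode <- match.arg(mode)
  vapply(seq_len(nrow(X)), function(i) {
    same_chr <- Y$V1 == X$V1[i]
    hit <- if (mode == "within") {
      # Y interval fully contained in the X interval
      Y$Start >= X$Start[i] & Y$Stop <= X$Stop[i]
    } else {
      # Y interval shares at least one position with the X interval
      Y$Start <= X$Stop[i] & Y$Stop >= X$Start[i]
    }
    sum(same_chr & hit)
  }, integer(1))
}

within_counts <- count_rows(X, Y, "within")
any_counts <- count_rows(X, Y, "any")
print(data.frame(X, within = within_counts, any = any_counts))
```

The straddling interval is counted for the first X row under "any" but not under "within", so it is worth deciding which definition matches your analysis before relying on the default.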


Source: https://stackoverflow.com/questions/58559246/r-trying-to-count-the-number-of-currencies-in-one-data-frame-based-on-the-posit
