Question
I've imported a UCSC alignability track into R using import.bw() (from the rtracklayer package) but am having trouble accessing the values I need.
For example: I want to provide a chromosome and a base and return the value at that position.
My object is called al100:
> al100
RangedData with 21591667 rows and 1 value column across 25 spaces
space ranges | score
<factor> <IRanges> | <numeric>
1 chr1 [10001, 10014] | 0.002777778
2 chr1 [10015, 10015] | 0.333333343
3 chr1 [10016, 10026] | 0.500000000
4 chr1 [10027, 10031] | 1.000000000
I want a function where I specify a chromosome and position and get back the score. This is trivial if I want one or two values, but a loop isn't going to work when I've got 7 million to look up; at 4 to 5 seconds per query, that's about 10 months, which is not an option.
For example, chr1, position 10011 would return the value 0.002777778.
The only method I've found so far is to test whether my position is greater than or equal to the start and less than or equal to the end of a range (where x is a separate object containing a list of chromosomes and positions). Not very good.
score(al100["chr1"])[ start(al100["chr1"]) <= x$POS[1] & end(al100["chr1"]) >= x$POS[1] ]
Answer 1:
For a reproducible example
library(rtracklayer)
example(import.bw)
gffRD
gives
> head(gffRD, 3)
RangedData with 3 rows and 7 value columns across 1 space
space ranges | type source
<factor> <IRanges> | <factor> <factor>
1 Escherichia_coli_K-12_complete_genome [ 337, 2799] | CDS glimmer/tico
2 Escherichia_coli_K-12_complete_genome [2801, 3733] | CDS glimmer/tico
3 Escherichia_coli_K-12_complete_genome [3734, 5020] | CDS glimmer/tico
phase strand note shift score
<factor> <factor> <character> <numeric> <numeric>
1 NA + NA NA 5.347931
2 NA + NA NA 11.448764
3 NA + NA NA 6.230648
Define regions of interest
roi <- GRanges("Escherichia_coli_K-12_complete_genome",
IRanges(c(337, 3734), width=1))
then use findOverlaps to map between gffRD and roi
olaps <- findOverlaps(gffRD,roi)
df <- DataFrame(seqnames=seqnames(roi)[subjectHits(olaps)],
start=start(roi)[subjectHits(olaps)],
Score=score(gffRD)[queryHits(olaps)])
olaps contains information about which queries match which subjects
> olaps
Hits of length 2
queryLength: 14
subjectLength: 2
queryHits subjectHits
<integer> <integer>
1 1 1
2 3 2
The data frame is
> df
DataFrame with 2 rows and 3 columns
seqnames start Score
<Rle> <integer> <numeric>
1 Escherichia_coli_K-12_complete_genome 337 5.347931
2 Escherichia_coli_K-12_complete_genome 3734 6.230648
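The same idea can be applied to the lookup described in the question. What follows is only a minimal sketch under several assumptions: al100 is rebuilt here as a small toy GRanges holding the four ranges printed in the question (RangedData is no longer available in recent Bioconductor releases, where importing a BigWig file yields a GRanges instead), and x is assumed to be a data.frame whose chromosome column is named CHR. Only x$POS appears in the question, so the CHR name is hypothetical.

```r
library(GenomicRanges)

# Toy stand-in for al100, built from the four ranges shown in the question.
al100 <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(10001, 10015, 10016, 10027),
                     end   = c(10014, 10015, 10026, 10031)),
    score = c(0.002777778, 0.333333343, 0.500000000, 1.000000000)
)

# Hypothetical lookup table; the column name CHR is an assumption.
x <- data.frame(CHR = c("chr1", "chr1", "chr1"),
                POS = c(10011, 10015, 20000))

# Represent every position as a width-1 range and overlap them all at once.
query <- GRanges(x$CHR, IRanges(x$POS, width = 1))
hits <- findOverlaps(query, al100)

# Positions that fall outside every range keep NA.
x$score <- NA_real_
x$score[queryHits(hits)] <- al100$score[subjectHits(hits)]
x
```

Because findOverlaps processes all positions in a single vectorised call, there is no per-position loop. The assignment assumes the ranges in al100 do not overlap one another, so each position matches at most one range.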
Source: https://stackoverflow.com/questions/9908716/extracting-values-from-iranges-objects-in-r-bioconductor