How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R?

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-11 12:44:04

Question


How can I extract sequences from a FASTA file for each of the intervals defined in a BED file using R? The reference genome is Gallus gallus (galGal4), which can be installed and loaded with:

source("http://bioconductor.org/biocLite.R")
biocLite("BSgenome.Ggallus.UCSC.galGal4")
library(BSgenome.Ggallus.UCSC.galGal4)

My intervals are stored in a GRanges object produced with the GenomicRanges package:

library("GenomicRanges")

> olaps
GRanges object with 2141 ranges and 0 metadata columns:
         seqnames               ranges strand
            <Rle>            <IRanges>  <Rle>
     [1]    chr14 [ 1665929,  1673673]      *
     [2]    chr14 [ 2587465,  2595209]      *
     [3]    chr14 [ 8143785,  8151529]      *
     [4]    chr14 [ 9779705,  9787449]      *
     [5]    chr14 [10281129, 10288873]      *
     ...      ...                  ...    ...
  [2137]    chr24   [3280553, 3288297]      *
  [2138]    chr24   [3330889, 3338633]      *
  [2139]    chr24   [3005641, 3015321]      *
  [2140]    chr24   [3319273, 3327017]      *
  [2141]    chr24   [5549545, 5557289]      *
  -------
  seqinfo: 31 sequences from an unspecified genome; no seqlengths

I can convert it to a data.table:

olaps <- as.data.table(olaps)

Example to be used:

olaps<-"seqnames    start      end width strand
chr1  1665929  1673673  7745      *
chr1  2587465  2595209  7745      *
chr1  8143785  8151529  7745      *
chr2  9779705  9787449  7745      *
chr2 10281129 10288873  7745      *"
olaps <- read.table(text = olaps, header = TRUE)

Expected outcome: something like this (fasta format):

>SEQUENCE_1
ACTGACTAGCATCGCAT...
>SEQUENCE_2
ACGTAGAGAGGGACATA...
>SEQUENCE_3...

I have tried the rtracklayer package, so far without success:

source("http://bioconductor.org/biocLite.R")
biocLite("rtracklayer")

Answer 1:


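This should do the trick.

First, extract the sequences (olaps must be the GRanges object here, not its data.table or data.frame version):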

seq = BSgenome::getSeq(BSgenome.Ggallus.UCSC.galGal4, olaps)

Then add names to the sequences:

names(seq) = paste0("SEQUENCE_", seq_along(seq))

Finally, write the sequences to a FASTA file:

Biostrings::writeXStringSet(seq, "my.fasta")
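If the intervals are only available as a plain table, as in the question's read.table example, they need to become a GRanges object again before getSeq() can use them. The following is a minimal sketch of the three steps above under that assumption. The names olaps_df, olaps_gr and seqs are hypothetical, and the genome lookup is wrapped in a check because the galGal4 package is a large download that may not be installed.

```r
library(GenomicRanges)

# A shortened copy of the example table from the question (a plain data.frame).
olaps_df <- read.table(text = "seqnames start end width strand
chr1  1665929  1673673  7745 *
chr1  2587465  2595209  7745 *
chr2  9779705  9787449  7745 *", header = TRUE)

# Rebuild a GRanges object; the seqnames, start, end and strand columns
# are picked up by name, and the width column is recomputed from start/end.
olaps_gr <- makeGRangesFromDataFrame(olaps_df)

# Only run the genome lookup if the galGal4 package is installed.
if (requireNamespace("BSgenome.Ggallus.UCSC.galGal4", quietly = TRUE)) {
  genome <- BSgenome.Ggallus.UCSC.galGal4::BSgenome.Ggallus.UCSC.galGal4

  # One DNA sequence per interval, returned as a DNAStringSet.
  seqs <- BSgenome::getSeq(genome, olaps_gr)
  names(seqs) <- paste0("SEQUENCE_", seq_along(seqs))

  # Write the named sequences in FASTA format.
  Biostrings::writeXStringSet(seqs, "my.fasta")
}
```

Intervals with strand "*" are read from the forward strand, so no strand handling is needed for the table shown in the question.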

More details are given in this earlier Bioconductor support thread:

https://support.bioconductor.org/p/77913/#77986
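For the case in the title, where the inputs really are a FASTA file and a BED file on disk rather than a BSgenome package, one possible route is rtracklayer::import() for the BED file and Rsamtools::FaFile with getSeq() for the FASTA file. The sketch below is not part of the original answer. It writes two tiny toy files to temporary paths so it can be run as-is, and the file contents and the variable names (fa_path, bed_path, regions, seqs, out_path) are made up for illustration.

```r
library(Biostrings)
library(Rsamtools)
library(rtracklayer)

# Toy inputs written to temporary files so the sketch is self-contained.
fa_path  <- tempfile(fileext = ".fa")
bed_path <- tempfile(fileext = ".bed")
writeXStringSet(DNAStringSet(c(chr1 = "ACGTACGTAC", chr2 = "TTTTGGGGCC")),
                fa_path)
writeLines(c("chr1\t2\t6", "chr2\t0\t4"), bed_path)

# getSeq() on a FaFile needs a .fai index next to the FASTA file.
indexFa(fa_path)

# import() converts the 0-based, half-open BED coordinates to 1-based
# closed GRanges coordinates (the first BED line becomes chr1:3-6).
regions <- import(bed_path, format = "bed")

# Extract one sequence per interval and name them as in the question.
seqs <- getSeq(FaFile(fa_path), regions)
names(seqs) <- paste0("SEQUENCE_", seq_along(seqs))

# Write the result in FASTA format.
out_path <- tempfile(fileext = ".fasta")
writeXStringSet(seqs, out_path)
```

With real data, fa_path and bed_path would simply point at the existing files. The chromosome names in the BED file have to match the record names in the FASTA file exactly, for example "chr1" versus "1".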



Source: https://stackoverflow.com/questions/35132118/how-can-i-extract-sequences-from-a-fasta-file-for-each-of-the-intervals-defined
