Read FASTA into a dataframe and extract subsequences of FASTA file

慢半拍i 2020-12-14 20:20

I have a small FASTA file of DNA sequences that looks like this:

>NM_000016 700 200 234
ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC

>NM         


        
3 answers
  • 2020-12-14 20:53
    library("Biostrings")
    
    fastaFile <- readDNAStringSet("my.fasta")
    seq_name = names(fastaFile)
    sequence = paste(fastaFile)
    df <- data.frame(seq_name, sequence)
    
  • You should have a look at the Biostrings package.

    library("Biostrings")
    
    s = readDNAStringSet("nm.fasta")
    subseq(s, start=c(1, 2, 3), end=c(3, 6, 5))
    
  • 2020-12-14 21:00

    Inspired by sgibb's answer above, here is my answer to the first question:

    # Read a FASTA file into R as a data frame: 1st column "RefSeqID", 2nd column "seq"
    
    library("Biostrings")
    fasta2dataframe = function(fastaFile) {
      s = readDNAStringSet(fastaFile)
      # Keep only the ID: the regular expression matches the first space and
      # everything after it, and sub() replaces that match with an empty string
      RefSeqID = sub(" .*", "", names(s))
      # as.character() converts every sequence at once, so no loop is needed
      seq = unname(as.character(s))
      data.frame(RefSeqID, seq)
    }
    

    Example:

    mydf = fasta2dataframe("myFastaFile.fasta")
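
For readers who do not have Biostrings installed, the two steps discussed in this thread (building the data frame and cutting out subsequences) can also be sketched in base R. This is only a rough sketch and not taken from any of the answers: the helper name `read_fasta_df`, the temporary file, the second record and the start/end positions are all made up for illustration, and it assumes a well-formed FASTA file in which every header line starts with `>`.

```r
# Hypothetical helper (not part of Biostrings): parse a FASTA file into a
# data frame using base R only.
read_fasta_df <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]                # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)                  # record index of every line
  headers <- sub("^>", "", lines[is_header])
  # Join the sequence lines of each record (a record may span several lines).
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_along(headers)))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  data.frame(
    seq_name = sub(" .*", "", headers),        # ID = text before the first space
    sequence = unname(seqs),
    stringsAsFactors = FALSE
  )
}

# Demo on a temporary file; the second record and the positions are invented.
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">NM_000016 700 200 234",
             "ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC",
             "",
             ">demo_2 made up",
             "ACGTAC", "GTTT"), tmp)
df <- read_fasta_df(tmp)

# substring() is vectorised and, like subseq(), takes 1-based inclusive positions.
df$subsequence <- substring(df$sequence, first = c(1, 2), last = c(3, 6))
print(df[, c("seq_name", "subsequence")])
```

For real data the Biostrings functions shown in the answers are likely the better choice, since they validate the DNA alphabet and handle large files more efficiently than this line-by-line approach.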
    