Using a .fasta file to compute relative content of sequences

Question

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.

Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.

Apparently it's something like this:

>label
sequence
>label
sequence
>label
sequence

My goal is to write a script to open and read the file, which I have gotten the hang of now. I then have to read each sequence, compute the relative amounts of 'G' and 'C' within each sequence, and write the names of the genes and their respective 'G' and 'C' content to a TAB-delimited file.

Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.

I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!

Answer 1:

I advise you to check the links below:

fasta perl on stackoverflow

BioPerl HowTo

A crash course in perl and dna

Answer 2:

I get that it's confusing, but you really should try to limit your question to one concrete problem; see https://stackoverflow.com/faq#questions

I have no idea what a ".fasta" file or 'G' and 'C' are, but it probably doesn't matter.

Generally:

Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
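
As a rough illustration of those four steps in Perl, here is a minimal sketch. It assumes each record is a '>' label line followed by one or more sequence lines, and it reads made-up sample data from an in-memory string so that it runs on its own. The helper name gc_by_gene is invented for this example and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: reads FASTA records from a filehandle and returns
# a list of [label, G fraction, C fraction] array refs in file order.
sub gc_by_gene {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                  # strip newline and trailing whitespace
        next if $line eq '';
        if ($line =~ /^>(.*)/) {             # a label line starts a new record
            push @records, { label => $1, seq => '' };
        }
        elsif (@records) {                   # sequence line: append to current record
            $records[-1]{seq} .= $line;
        }
    }
    my @rows;
    for my $rec (@records) {
        my $dna = $rec->{seq};
        my $len = length $dna;
        next unless $len;                    # skip empty records (no division by zero)
        my $g = ($dna =~ tr/Gg//);           # tr with an empty replacement only counts
        my $c = ($dna =~ tr/Cc//);
        push @rows, [ $rec->{label}, $g / $len, $c / $len ];
    }
    return @rows;
}

# Made-up sample data. With a real file you would instead use something like
# open(my $fh, '<', 'genes.fasta') or die $!;  (the file name is an assumption)
my $sample = ">geneA\nGGCC\nAATT\n>geneB\nATAT\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";
my @rows = gc_by_gene($fh);
close $fh;

printf "%s\tG=%.2f\tC=%.2f\n", @$_ for @rows;
```

Collecting the whole sequence per record before counting means sequences that wrap over several lines are handled the same way as single-line ones.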

A "TAB-delimited" file is a file with columns (think Excel) where each column is separated from the next by the tab ("\t") character, as a quick Google or Stack Overflow search would tell you.
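
For instance, one row per gene could be produced in Perl by joining the column values with "\t". The sketch below uses made-up gene names and values and a hypothetical helper called tsv_line. It prints to standard output; to write a real file you would print to a filehandle opened for writing instead.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of column values into one TAB-delimited line.
sub tsv_line {
    my @columns = @_;
    return join("\t", @columns) . "\n";
}

# Made-up rows: gene name, G content, C content (as fractions of sequence length).
my @rows = (
    [ 'geneA', 0.25, 0.25 ],
    [ 'geneB', 0.10, 0.30 ],
);

print tsv_line('gene', 'G', 'C');            # header row
for my $row (@rows) {
    my ($name, $g, $c) = @$row;
    print tsv_line($name, sprintf('%.3f', $g), sprintf('%.3f', $c));
}
```

A file written this way opens directly in a spreadsheet program, with each tab starting a new column.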

Answer 3:

Here is an approach using the awk utility, which can be run from the command line. Save the following program to a file and run it with awk -f <path> <sequence file>

#lines starting with ">" are labels: report the gene just finished, then start counting a new one
/^>/{
    if (total>0) report()
    name=substr($0,2); g=0; c=0; total=0
    next
    }
#every other line is sequence data: this for-loop goes through all bases in the line
    {
    for (i=1;i<=length($0);i++) {
#for each position encountered, the variable "total" is increased by 1 for total bases
        total++
#take the base at position i in upper case (some bases are lowercase in some fasta files)
        base=toupper(substr($0,i,1))
#this increments the c count by one for every c or C encountered, and the g count
#for every g or G
        if (base=="C")
            c++
        else if (base=="G")
            g++
        }
    }
#this "END-block" reports the last gene in the file
    END{
        if (total>0) report()
    }
#prints the gene name and G, C content in percentage, separated by tabs
function report() {
        print name"\tG content:\t"(100*g/total)"%\tC content:\t"(100*c/total)"%"
    }

Source: https://stackoverflow.com/questions/9716991/using-a-fasta-file-to-compute-relative-content-of-sequences

Tags: perl, Sequence, frequency, fasta