fasta | 易学教程

Biopython SeqIO to Pandas Dataframe

I have a FASTA file that can easily be parsed by SeqIO.parse. I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's way too heavy (two iterations, conversions, etc.):

    from Bio import SeqIO
    import pandas as pd

    # parse sequence fasta file
    identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
    lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]

    # converting lists to pandas Series
    s1 = pd.Series(identifiers, name='ID')
    s2 = pd.Series(lengths, name='length')

    # Gathering Series
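
One way to avoid the second iteration is to walk the parser once and collect both fields per record. The following is a minimal sketch, assuming each record exposes `.id` and `.seq` the way Biopython's `SeqRecord` does; `records_to_rows` and the stand-in `Rec` type are hypothetical names used only so the example runs without Biopython installed.

```python
from collections import namedtuple


def records_to_rows(records):
    """Collect (ID, length) pairs in a single pass over an iterable of records."""
    return [(rec.id, len(rec.seq)) for rec in records]


# Hypothetical stand-in for a parsed record; it only mimics the .id and .seq attributes.
Rec = namedtuple("Rec", ["id", "seq"])

rows = records_to_rows([Rec("seq1", "ACGT"), Rec("seq2", "ACGTAC")])
print(rows)
```

With Biopython and pandas available, the same function should accept `SeqIO.parse("sequence.fasta", "fasta")` directly, and `pd.DataFrame(rows, columns=["ID", "length"])` would build the table from the result in one step.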

chaos game for DNA sequences

I have tried the Mathematica code for making the chaos game for DNA sequences posted at this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html which is like this:

    genome = Import["c:\data\sequence.fasta", "Sequence"];
    genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
    chars = StringCases[genome, "G" | "C" | "T" | "A"];
    f[x_, "A"] := x/2;
    f[x_, "T"] := x/2 + {1/2, 0};
    f[x_, "G"] := x/2 + {1/2, 1/2};
    f[x_, "C"] := x/2 + {0, 1/2};
    pts = FoldList[f, {0.5, 0.5}, chars];
    Graphics[{PointSize[Tiny], Point[pts]}]

The FASTA sequence that I have is just a sequence of letters
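
The iteration in the Mathematica snippet can also be written out in Python, which may make the per-base update easier to follow. This is a sketch only: `chaos_game_points` and `OFFSETS` are hypothetical names, the offsets copy the four `f` definitions quoted above, and characters other than A, C, G, T are skipped (as `StringCases` does in the original).

```python
# Offsets added after halving the current point, copied from the f[x_, base] rules.
OFFSETS = {
    "A": (0.0, 0.0),
    "T": (0.5, 0.0),
    "G": (0.5, 0.5),
    "C": (0.0, 0.5),
}


def chaos_game_points(sequence, start=(0.5, 0.5)):
    """Return the list of points visited, including the starting point."""
    x, y = start
    points = [(x, y)]
    for base in sequence.upper():
        if base not in OFFSETS:
            continue  # ignore N, gaps, whitespace and header characters
        dx, dy = OFFSETS[base]
        x, y = x / 2 + dx, y / 2 + dy
        points.append((x, y))
    return points


points = chaos_game_points("ATGC")
print(points)
```

The resulting list of `(x, y)` pairs can be handed to any scatter-plot routine to draw the picture.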

Sequence length of FASTA file

I have the following FASTA file:

    >header1
    CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
    TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
    >header2
    GGT
    >header3
    TTATGAT

My desired output:

    >header1
    117
    >header2
    3
    >header3
    7
    # 3 sequences, total length 127.

This is my code:

    awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

    >header1
    60
    57
    >header2
    3
    >header3
    7

I need a small modification in order to deal with multiple sequence lines. I also need a way to have the total sequences and total length. Any suggestion will
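
The missing piece is to accumulate the length across all lines that belong to one header, instead of printing a length per line. The question asks for awk; the sketch below shows the same accumulation idea in Python instead, with `fasta_lengths` as a hypothetical helper name.

```python
def fasta_lengths(lines):
    """Return [(header, length), ...], summing every sequence line under a header."""
    totals = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            totals.append([line, 0])  # start a new record
        elif totals:
            totals[-1][1] += len(line)  # extend the current record
    return [(header, length) for header, length in totals]


example = """>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
""".splitlines()

result = fasta_lengths(example)
for header, length in result:
    print(header)
    print(length)

total = sum(length for _, length in result)
print(f"# {len(result)} sequences, total length {total}.")
```

Passing an open file handle instead of `example` keeps memory use flat, because only the running totals are stored.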

Using a .fasta file to compute relative content of sequences

So, being the 'noob' that I am, having been introduced to programming via Perl only recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure whether I'm able to open it or have to work with it 'blindly', so to speak. Anyway, the file contains DNA sequences for three genes, written in this .fasta format. Apparently it's something like this:

    >label
    sequence
    >label
    sequence
    >label
    sequence

My goal is to write a script to open and
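
As a rough illustration of what such a script might compute, the sketch below reads labelled sequences and reports the fraction of each nucleotide per gene. Two assumptions are made here and not confirmed by the excerpt: "relative content" is taken to mean per-base fractions of A, C, G and T, and Python is used instead of Perl. `read_fasta`, `relative_content` and the gene labels are hypothetical.

```python
from collections import Counter


def read_fasta(lines):
    """Map each label (header text without '>') to its concatenated sequence."""
    records = {}
    label = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            label = line[1:]
            records[label] = ""
        elif label is not None:
            records[label] += line
    return records


def relative_content(sequence):
    """Fraction of A, C, G and T among the recognised bases of one sequence."""
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


# Made-up input in the '>label / sequence' layout described above.
genes = read_fasta([">geneA", "AACG", "TT", ">geneB", "GGGC"])
content = {label: relative_content(seq) for label, seq in genes.items()}
print(content)
```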

Reading in file block by block using specified delimiter in python

I have an input_file.fa file like this (FASTA format):

    > header1 description
    data data data
    >header2 description
    more data data data

I want to read in the file one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1:

    > header1 description
    data data data

Of course I could just read in the file like this and split:

    with open("1.fa") as f:
        for block in f.read().split(">"):
            pass

But I want to avoid reading the whole file into memory, because the files are often large. I can read in the file line by line of course:

    with open("input_file.fa") as f:
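
Reading line by line and grouping lines lazily gives block-wise access without loading the whole file. The generator below is a minimal sketch of that idea; `read_blocks` is a hypothetical name, and an in-memory `io.StringIO` stands in for the real file so the example is self-contained.

```python
import io


def read_blocks(handle, delimiter=">"):
    """Yield blocks of lines; a new block begins at each line starting with delimiter."""
    block = []
    for line in handle:
        if line.startswith(delimiter) and block:
            yield "".join(block)  # emit the finished block
            block = []
        block.append(line)
    if block:
        yield "".join(block)  # emit the last block


sample = io.StringIO(
    "> header1 description\n"
    "data data\n"
    "data\n"
    ">header2 description\n"
    "more data\n"
    "data\n"
    "data\n"
)
blocks = list(read_blocks(sample))
for block in blocks:
    print(repr(block))
```

Only the lines of the current block are held in memory at any time, so the same generator can be applied to `open("input_file.fa")` for large files.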

extract sequences from multifasta file by ID in file using awk

I would like to extract sequences from the multifasta file that match the IDs given by a separate list of IDs. FASTA file seq.fasta:

    >7P58X:01332:11636
    TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
    CAAGTCCCTGTTCGGGCGCC
    >7P58X:01334:11605
    TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
    CCTGTTCGGGCGCCACTGCTAG
    >7P58X:01334:11613
    ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
    >7P58X:01334:11635
    TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
    CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
    GAGCG
    >7P58X:01336:11621
    ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
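
The filtering logic amounts to switching a "keep" flag at every header line and copying lines while the flag is set. The question asks for awk; the sketch below shows the same idea in Python. `extract_by_id` is a hypothetical name, the ID list is made up for illustration (the original list is not shown in the excerpt), and the ID is assumed to be the header text after `>` up to the first whitespace.

```python
def extract_by_id(fasta_lines, wanted_ids):
    """Return the header and sequence lines of every record whose ID is wanted."""
    wanted = set(wanted_ids)
    keep = False
    selected = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            # Re-evaluate the flag at each header; an empty header never matches.
            keep = bool(parts) and parts[0] in wanted
        if keep:
            selected.append(line)
    return selected


fasta = [
    ">7P58X:01332:11636",
    "TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT",
    "CAAGTCCCTGTTCGGGCGCC",
    ">7P58X:01334:11605",
    "TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC",
    "CCTGTTCGGGCGCCACTGCTAG",
    ">7P58X:01334:11613",
    "ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC",
]
ids = ["7P58X:01332:11636", "7P58X:01334:11613"]  # illustrative ID list

selected = extract_by_id(fasta, ids)
print("\n".join(selected))
```

Storing the IDs in a set keeps each header lookup constant-time, which matters when the ID list is long.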

Reading .fasta sequences to extract nucleotide data, and then writing to a TabDelimited file

Before I continue, I thought I'd refer readers to my previous problems with Perl, being a beginner to all of this. These were my posts over the past few days, in chronological order:

    How do I average column values from a tab-separated data... (Solved)
    Why do I see no computed results in my output file? (Solved)
    Using a .fasta file to compute relative content of sequences

Now as I've stated above, thanks to help from a few of you, I've managed to figure out the first two queries and I've

Printing a sequence from a fasta file

I often need to find a particular sequence in a fasta file and print it. For those who don't know, fasta is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple: you have a line with the sequence name preceded by a '>', and then all the lines following until the next '>' are the sequence itself. For example:

    >sequence1
    ACTGACTGACTGACTG
    >sequence2
    ACTGACTGACTGACTG
    ACTGACTGACTGACTG
    >sequence3
    ACTGACTGACTGACTG

The way I'm currently getting the sequence I need is

Converting FASTQ to FASTA with SED/AWK

I have data that always comes in blocks of four lines in the following format (called FASTQ):

    @SRR018006.2016 GA2:6:1:20:650 length=36
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
    +SRR018006.2016 GA2:6:1:20:650 length=36
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
    @SRR018006.19405469 GA2:6:100:1793:611 length=36
    ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
    +SRR018006.19405469 GA2:6:100:1793:611 length=36
    7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

Is there a simple sed/awk/bash way to convert them into this format (called
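
Because every record occupies exactly four lines, the conversion reduces to keeping the first two lines of each group and swapping the leading `@` for `>`. The question asks for sed/awk/bash; the sketch below shows the same logic in Python. It assumes strictly four-line records (no wrapped sequences) and that the FASTA header should keep the full FASTQ header text; `fastq_to_fasta` is a hypothetical name.

```python
from itertools import islice


def fastq_to_fasta(lines):
    """Yield FASTA lines (header, sequence) for each four-line FASTQ record."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            break  # end of input; a trailing partial record is ignored
        header, sequence = record[0], record[1]
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        yield ">" + header[1:]
        yield sequence


fastq = [
    "@SRR018006.2016 GA2:6:1:20:650 length=36",
    "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN",
    "+SRR018006.2016 GA2:6:1:20:650 length=36",
    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!",
]
fasta = list(fastq_to_fasta(fastq))
print("\n".join(fasta))
```

Counting lines rather than testing for a leading `@` on every line matters here, since a quality string may itself begin with `@`.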

Perl6 : What is the best way for dealing with very big files?

Last week I decided to give Perl6 a try and started to reimplement one of my programs. I have to say, Perl6 makes object programming so easy, an aspect I found very painful in Perl5. My program has to read and store big files, such as whole genomes (up to 3 Gb and more, see example 1 below) or tabulated data. The first version of the code was written the Perl5 way, iterating line by line ("genome.fa".IO.lines). It was very slow, with an unusable execution time.

    my class fasta {
        has Str $.file is required;
        has %!seq;
        submethod TWEAK() {
            my $id;
            my $s;
            for $!file.IO.lines ->