fasta

Biopython SeqIO to Pandas Dataframe

亡梦爱人 submitted on 2019-12-03 03:46:22
I have a FASTA file that can easily be parsed by SeqIO.parse. I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's way too heavy (two iterations, conversions, etc.):

    from Bio import SeqIO
    import pandas as pd

    # parse sequence fasta file
    identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
    lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]

    # converting lists to pandas Series
    s1 = pd.Series(identifiers, name='ID')
    s2 = pd.Series(lengths, name='length')

    # Gathering Series
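One way to avoid the second iteration is to collect each ID together with its length in a single pass. The sketch below is stdlib-only so that it runs without Biopython. `iter_id_length` and `SAMPLE` are hypothetical names, and it assumes the ID is the first whitespace-delimited token of the header line. With Biopython, the same idea would presumably be one loop over `SeqIO.parse` that reads `seq_record.id` and `len(seq_record.seq)` together.

```python
import io

# Hypothetical in-memory FASTA data standing in for "sequence.fasta".
SAMPLE = """>seq1 first example
ACGTACGT
ACG
>seq2
GGT
"""

def iter_id_length(handle):
    """Yield (id, length) for each FASTA record in one pass over the lines."""
    seq_id, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, length  # emit the record that just ended
            # Assumption: the ID is the first token after '>'.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            length = 0
        elif seq_id is not None:
            length += len(line)  # sequence may span several lines
    if seq_id is not None:
        yield seq_id, length  # emit the last record

rows = list(iter_id_length(io.StringIO(SAMPLE)))
print(rows)
```

If pandas is available, the resulting list of tuples can be passed to `pd.DataFrame(rows, columns=["ID", "length"])`, which skips the intermediate Series objects.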

chaos game for DNA sequences

只愿长相守 submitted on 2019-12-03 03:25:01
I have tried the Mathematica code for making the chaos game for DNA sequences posted at this address: http://facstaff.unca.edu/mcmcclur/blog/GeneCGR.html which is like this:

    genome = Import["c:\data\sequence.fasta", "Sequence"];
    genome = StringReplace[ToString[genome], {"{" -> "", "}" -> ""}];
    chars = StringCases[genome, "G" | "C" | "T" | "A"];
    f[x_, "A"] := x/2;
    f[x_, "T"] := x/2 + {1/2, 0};
    f[x_, "G"] := x/2 + {1/2, 1/2};
    f[x_, "C"] := x/2 + {0, 1/2};
    pts = FoldList[f, {0.5, 0.5}, chars];
    Graphics[{PointSize[Tiny], Point[pts]}]

The FASTA sequence that I have is just a sequence of letters
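For readers who want to see the iteration outside Mathematica, here is a rough Python sketch of the same fold. Each base halves the current point and adds a fixed offset, starting from (0.5, 0.5). The offsets are copied from the `f` rules in the excerpt, while `OFFSETS` and `cgr_points` are hypothetical names. It expects the bare sequence string (no header line) and, like the `StringCases` call, ignores anything other than uppercase A, C, G, T.

```python
# Offsets added after halving, taken from the Mathematica rules in the excerpt.
OFFSETS = {
    "A": (0.0, 0.0),
    "T": (0.5, 0.0),
    "G": (0.5, 0.5),
    "C": (0.0, 0.5),
}

def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the chaos-game points for a sequence, start point included."""
    x, y = start
    points = [(x, y)]
    for base in sequence:
        if base not in OFFSETS:
            continue  # skip N, lowercase letters, newlines, etc.
        dx, dy = OFFSETS[base]
        x, y = x / 2 + dx, y / 2 + dy  # move halfway, then shift by the offset
        points.append((x, y))
    return points

pts = cgr_points("ATGC")
print(pts)
```

The returned list has one more element than the number of recognised bases, mirroring how `FoldList` keeps the initial value. Plotting the points (for instance as a scatter plot) is left out here.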

Sequence length of FASTA file

大憨熊 submitted on 2019-12-03 00:48:41
I have the following FASTA file:

    >header1
    CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
    TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
    >header2
    GGT
    >header3
    TTATGAT

My desired output:

    >header1
    117
    >header2
    3
    >header3
    7
    # 3 sequences, total length 127.

This is my code:

    awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

    >header1
    60
    57
    >header2
    3
    >header3
    7

I need a small modification in order to deal with multiple sequence lines. I also need a way to get the total number of sequences and the total length. Any suggestion will
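The missing piece is to accumulate the length across all lines of a record instead of printing per line, and to keep running totals. The question asks for awk; as an illustration of the logic only, here is a Python sketch. `length_report` and `FASTA_TEXT` are hypothetical names, and the sketch assumes headers are unique, since repeated headers would be merged in the dictionary.

```python
# The example file from the question, held in memory for the demo.
FASTA_TEXT = """>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
"""

def length_report(lines):
    """Return report lines: each header, its summed length, then a totals line."""
    lengths = {}  # header -> summed length (dicts keep insertion order)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)  # add every sequence line of the record
    report = []
    for header, n in lengths.items():
        report.append(header)
        report.append(str(n))
    total = sum(lengths.values())
    report.append(f"# {len(lengths)} sequences, total length {total}.")
    return report

report = length_report(FASTA_TEXT.splitlines())
print("\n".join(report))
```

The same structure carries over to awk: sum `length($0)` into a variable on sequence lines, print it when the next header (or the end of input) arrives, and print the totals in an `END` block.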

Using a .fasta file to compute relative content of sequences

不打扰是莪最后的温柔 submitted on 2019-12-02 08:11:02
Question: So me being the 'noob' that I am, having been introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak. Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format. Apparently it's something like this: >label sequence >label sequence >label sequence My goal is to write a script to open and

Reading in file block by block using specified delimiter in python

放肆的年华 submitted on 2019-12-01 18:36:50
I have an input_file.fa file like this (FASTA format):

    > header1 description
    data data
    data
    >header2 description
    more data
    data
    data

I want to read in the file one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1:

    > header1 description
    data data
    data

Of course I could just read in the file like this and split:

    with open("1.fa") as f:
        for block in f.read().split(">"):
            pass

But I want to avoid reading the whole file into memory, because the files are often large. I can read in the file line by line of course:

    with open("input_file.fa") as f:
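A generator that buffers only the current block is one way to get chunk-wise reading while still iterating line by line. This is a minimal sketch, not necessarily the answer the thread settled on. `iter_blocks` and `SAMPLE` are hypothetical names, and any lines before the first header would come out as a block of their own.

```python
import io

def iter_blocks(handle):
    """Yield one block (header line plus its data lines) at a time."""
    block = []
    for line in handle:
        # A new header closes the block collected so far.
        if line.startswith(">") and block:
            yield "".join(block)
            block = []
        block.append(line)
    if block:
        yield "".join(block)  # last block has no following header

# In-memory stand-in for input_file.fa.
SAMPLE = (
    "> header1 description\n"
    "data data\n"
    "data\n"
    ">header2 description\n"
    "more data\n"
    "data\n"
    "data\n"
)

blocks = list(iter_blocks(io.StringIO(SAMPLE)))
for number, block in enumerate(blocks, start=1):
    print(f"block {number}:")
    print(block, end="")
```

With a real file, `iter_blocks(open("input_file.fa"))` inside a `with` statement would hold only one block in memory at a time, because file objects are themselves iterated lazily.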

extract sequences from multifasta file by ID in file using awk

流过昼夜 submitted on 2019-12-01 15:50:18
Question: I would like to extract the sequences from a multi-FASTA file that match the IDs given in a separate list of IDs. FASTA file seq.fasta:

    >7P58X:01332:11636
    TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT
    CAAGTCCCTGTTCGGGCGCC
    >7P58X:01334:11605
    TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC
    CCTGTTCGGGCGCCACTGCTAG
    >7P58X:01334:11613
    ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC
    >7P58X:01334:11635
    TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT
    CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC
    GAGCG
    >7P58X:01336:11621
    ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
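The usual approach is to load the wanted IDs into a set and then toggle a "keep" flag at every header line. The question asks for awk; the Python sketch below only illustrates that logic. `filter_fasta`, `SEQ_LINES` and `WANTED` are hypothetical names. The format of the ID list is not shown in the excerpt, so the sketch assumes one ID per entry, without the leading `>`.

```python
def filter_fasta(fasta_lines, wanted_ids):
    """Yield the lines of every record whose ID is in wanted_ids, in file order."""
    wanted = set(wanted_ids)  # set gives fast membership tests
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line

# First three records from the excerpt, one list entry per line.
SEQ_LINES = [
    ">7P58X:01332:11636",
    "TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT",
    "CAAGTCCCTGTTCGGGCGCC",
    ">7P58X:01334:11605",
    "TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC",
    "CCTGTTCGGGCGCCACTGCTAG",
    ">7P58X:01334:11613",
    "ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC",
]
WANTED = ["7P58X:01334:11613", "7P58X:01332:11636"]  # hypothetical ID list

selected = list(filter_fasta(SEQ_LINES, WANTED))
print("\n".join(selected))
```

Records come out in the order they appear in the FASTA file, not in the order of the ID list. IDs that never occur in the file are silently ignored.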

Reading .fasta sequences to extract nucleotide data, and then writing to a TabDelimited file

好久不见. submitted on 2019-12-01 09:06:09
Question: Before I continue, I thought I'd refer readers to my previous problems with Perl, as I'm a beginner to all of this. These were my posts over the past few days, in chronological order:

    How do I average column values from a tab-separated data... (Solved)
    Why do I see no computed results in my output file? (Solved)
    Using a .fasta file to compute relative content of sequences

Now as I've stated above, thanks to help from a few of you, I've managed to figure out the first two queries and I've

Printing a sequence from a fasta file

丶灬走出姿态 submitted on 2019-12-01 07:24:49
Question: I often need to find a particular sequence in a fasta file and print it. For those who don't know, fasta is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple: you have a line with the sequence name preceded by a '>', and then all the lines following until the next '>' are the sequence itself. For example:

    >sequence1
    ACTGACTGACTGACTG
    >sequence2
    ACTGACTGACTGACTG
    ACTGACTGACTGACTG
    >sequence3
    ACTGACTGACTGACTG

The way I'm currently getting the sequence I need is

Converting FASTQ to FASTA with SED/AWK

百般思念 submitted on 2019-11-30 12:50:13
Question: I have data that always comes in blocks of four lines in the following format (called FASTQ):

    @SRR018006.2016 GA2:6:1:20:650 length=36
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
    +SRR018006.2016 GA2:6:1:20:650 length=36
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
    @SRR018006.19405469 GA2:6:100:1793:611 length=36
    ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
    +SRR018006.19405469 GA2:6:100:1793:611 length=36
    7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

Is there a simple sed/awk/bash way to convert them into this format (called
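Because the records are strictly four lines each, the conversion can be done by position: keep lines 1 and 2 of every block and swap the leading `@` for `>`. The question asks for sed/awk/bash; the Python sketch below only illustrates that logic. `fastq_to_fasta` and `FASTQ_LINES` are hypothetical names. The target layout is cut off in the excerpt, so the plain "header line, then sequence line" FASTA output is an assumption based on the title.

```python
from itertools import islice

def fastq_to_fasta(lines):
    """Yield FASTA lines from strict 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))  # take the next block of four lines
        if len(record) < 4:
            break  # end of input; an incomplete trailing block is dropped
        header = record[0].rstrip("\n")
        sequence = record[1].rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        yield ">" + header[1:]  # replace the leading '@' with '>'
        yield sequence          # lines 3 and 4 ('+' line, qualities) are skipped

# The two records from the excerpt.
FASTQ_LINES = [
    "@SRR018006.2016 GA2:6:1:20:650 length=36",
    "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN",
    "+SRR018006.2016 GA2:6:1:20:650 length=36",
    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!",
    "@SRR018006.19405469 GA2:6:100:1793:611 length=36",
    "ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC",
    "+SRR018006.19405469 GA2:6:100:1793:611 length=36",
    "7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/",
]

fasta = list(fastq_to_fasta(FASTQ_LINES))
print("\n".join(fasta))
```

Counting lines rather than matching on `@` matters here, because a quality string may itself contain or start with `@`, as the second record's quality line shows.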

Perl6 : What is the best way for dealing with very big files?

随声附和 submitted on 2019-11-30 11:26:21
Last week I decided to give Perl6 a try and started to reimplement one of my programs. I have to say, Perl6 makes object programming so easy, an aspect that was very painful for me in Perl5. My program has to read and store big files, such as whole genomes (up to 3 Gb and more, see example 1 below) or tabulated data. The first version of the code was written the Perl5 way, iterating line by line ("genome.fa".IO.lines). It was very slow and unusable in terms of execution time.

    my class fasta {
        has Str $.file is required;
        has %!seq;
        submethod TWEAK() {
            my $id;
            my $s;
            for $!file.IO.lines ->