fasta

Parsing a fasta file using a generator (Python)

大城市里の小女人 posted on 2019-12-27 17:44:43
Question: I am trying to parse a large fasta file and I am encountering out-of-memory errors. Some suggestions to improve the data handling would be appreciated. Currently the program correctly prints out the names; however, partway through the file I get a MemoryError. Here is the generator:
    def readFastaEntry( fp ):
        name = ""
        seq = ""
        for line in fp:
            if line.startswith( ">" ):
                tmp = []
                tmp.append( name )
                tmp.append( seq )
                name = line
                seq = ""
                yield tmp
            else:
                seq = seq.join( line )
and here is the
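One likely cause of the MemoryError is `seq = seq.join( line )`: `str.join` inserts the current `seq` between every character of `line`, so the string grows multiplicatively with each sequence line. The generator as posted also yields an empty record first and never yields the last one. A minimal sketch of a streaming alternative follows; the name `read_fasta_entries` and the toy input are illustrative assumptions, not part of the original question.

```python
import io

def read_fasta_entries(fp):
    """Yield (name, sequence) pairs one record at a time."""
    name = None      # header of the record being collected
    chunks = []      # sequence lines of that record
    for line in fp:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        elif line:
            chunks.append(line)
    # Emit the final record, which has no following header to trigger it.
    if name is not None:
        yield name, "".join(chunks)

# Toy input standing in for a real file handle (hypothetical data).
demo = io.StringIO(">seq1\nACGT\nTTGA\n>seq2\nGGC\n")
records = list(read_fasta_entries(demo))
```

Because only one record is held at a time, memory use is bounded by the longest single sequence rather than by the file size.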

Use AWK to search through fasta file, given a second file containing sequence names

浪子不回头ぞ posted on 2019-12-25 07:27:17
Question: I have two files. One is a fasta file containing multiple fasta sequences, while the other file lists the names of the candidate sequences I want to search for (file example below).
seq.fasta
    >Clone_18
    GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
    >Clone_23
    GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
    >Clone_27-1
    GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
    >Clone_27-2
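The question asks for AWK. The same filtering idea can be sketched in Python: load the wanted names into a set, then toggle a flag at each header. The function name `select_records`, the exact-match rule on the header text, and the shortened toy sequences are all assumptions made for illustration.

```python
def select_records(fasta_lines, wanted_names):
    """Yield the lines of every record whose header name is in wanted_names."""
    wanted = set(wanted_names)
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the name is the whole header text after ">".
            keep = line[1:].strip() in wanted
        if keep:
            yield line

# Toy data: names taken from the question, sequences shortened (hypothetical).
fasta_lines = [
    ">Clone_18\n", "GTTACG\n",
    ">Clone_23\n", "GTTACGGG\n",
    ">Clone_27-1\n", "GTTAC\n",
]
selected = list(select_records(fasta_lines, ["Clone_18", "Clone_27-1"]))
```

The set lookup keeps the cost per header constant, so the names file can be large without slowing the scan.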

How to translate a FASTA sequence from dict/ how to make function output a string?

瘦欲@ posted on 2019-12-25 06:44:52
Question: Firstly, I can't use BioPython :( I need to take a bunch of DNA sequences from a FASTA file and translate them to protein sequences. The FASTA file looks like this:
    >some info
    ACCGGGCTAAA
    >other info
    ACCGCCAATTT
So I can create a function that outputs only the DNA sequence, but when I try to translate it I get the following error: "TypeError: object of type '_io.TextIOWrapper' has no len()". I have no idea how to resolve this. Any help is immensely appreciated! Also I am taking my first Python
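The quoted TypeError typically appears when `len()` is called on the open file object itself rather than on a string read from it. A minimal sketch of translating a DNA string without Biopython is shown below, assuming the standard genetic code and reading frame 0. The names `CODON_TABLE` and `translate` are illustrative, stop codons are written as `*`, and unrecognised codons as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a DNA string in frame 0; trailing partial codons are dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks an unknown codon
        for i in range(0, usable, 3)
    )

# The two sequences from the question's example file.
protein_a = translate("ACCGGGCTAAA")
protein_b = translate("ACCGCCAATTT")
```

The function takes and returns plain strings, so it has to be given the sequence text read from the file, not the file handle.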

How to define the range 'r(n,)' using a variable in match() function with awk

两盒软妹~` posted on 2019-12-25 01:39:06
Question: I am processing text files with thousands of records per file. Each record is made up of two lines: a header that starts with ">", followed by a line with a long string of the characters "-AGTCNR". Here is what a simple file looks like:
    >ACML500-12|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_-2
    ----TAAGATTTTGACTTCTTCCCCCATCATCAAGAAGAATTGT-------
    >ACRJP458-10|Lep|gs-NA|sp-NA|subsp-NA|co-Buru|site-NA|lat_N
    -----------TCCCTTTAATACTAGGAGCCCCTGACATAGCCTTTCCTAAATAAT-----
    >ASILO303-17|Dip|gs-Par|sp

Why is python writing out in Chinese characters?

不问归期 posted on 2019-12-24 10:57:39
Question: This is my first question on Stack Overflow, so I want to apologise first if my question is not formatted correctly. I am not particularly experienced with coding, but am trying to solve a specific problem with my work. I am trying to replace the headers of a large fasta file (used for aligning DNA sequences). I have a txt file containing the fasta alignment (alignment.txt), which has contents like this:
    >418035201_b1_168_m12_gag__Assembly_8
    ATGGGTGCGAGAGCGTCAGTATTAAGTGGGGGAAA......

Sort fasta by sequence size

不羁的心 posted on 2019-12-23 20:47:19
Question: I currently want to sort a huge fasta file (more than 10**8 lines and sequences) by sequence size. Fasta is a well-defined format used in biology to store sequences (nucleotide or protein):
    >id1
    sequence 1 # could span several lines
    >id2
    sequence 2
    ...
I have run a tool that gives me, in tsv format, the identifier, the length, and the position in bytes of the identifier. For now, what I am doing is sorting this file by the length column, then I parse this file and use seek to retrieve the corresponding
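The index-then-seek approach described above can be sketched as follows. The asker already has the (identifier, length, byte offset) table from an external tool; `build_index` here only recreates such a table so the example is self-contained. Both function names and the toy data are assumptions.

```python
import io

def build_index(handle):
    """Return (identifier, sequence_length, byte_offset) for each record."""
    index = []
    name, length, start = None, 0, 0
    offset = 0
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            if name is not None:
                index.append((name, length, start))
            name, length, start = line[1:].strip().decode(), 0, offset
        else:
            length += len(line.strip())
        offset += len(line)  # byte offset of the next line
    if name is not None:
        index.append((name, length, start))
    return index

def write_sorted(handle, index, out):
    """Copy records to out, shortest sequence first, seeking to each offset."""
    for _name, _length, start in sorted(index, key=lambda rec: rec[1]):
        handle.seek(start)
        out.write(handle.readline())  # header line
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break  # reached the next record
            out.write(line)

# Toy binary "file" (hypothetical data).
source = io.BytesIO(b">id1\nACGTAC\nGT\n>id2\nAC\n>id3\nACGT\n")
index = build_index(source)
result = io.BytesIO()
write_sorted(source, index, result)
```

Only the index is sorted in memory; the sequences themselves are streamed from disk one record at a time. Opening the file in binary mode keeps the byte offsets exact.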

How to extract short sequence using window with specific step size?

穿精又带淫゛_ posted on 2019-12-23 02:29:52
Question: The code below extracts short subsequences from every sequence with a window size of 4. How can I shift the window by a step size of 2 and extract 4 base pairs each time? Example code:
    from Bio import SeqIO
    with open("testA_out.fasta","w") as f:
        for seq_record in SeqIO.parse("testA.fasta", "fasta"):
            i = 0
            while ((i+4) < len(seq_record.seq)) :
                f.write(">" + str(seq_record.id) + "\n")
                f.write(str(seq_record.seq[i:i+4]) + "\n")
                i += 2
Example input of testA.fasta:
    >human1
    ACCCGATTT
Example output of testA_out:
    >human1
    ACCC
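The windowing itself does not depend on Biopython. It can be sketched on a plain string, with window size and step as parameters. The helper name `windows` is an assumption for illustration.

```python
def windows(seq, size=4, step=2):
    """Yield every full-length window of `size` characters, moving by `step`."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]

# The example sequence from the question.
pieces = list(windows("ACCCGATTT"))
```

Using `range` with a step avoids the manual counter and guarantees that only complete windows are produced. In the Biopython loop above, the same function could be applied to `str(seq_record.seq)`.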

How to use matchPattern() to find certain codons in a file with many sequences (.fasta) in R

喜你入骨 posted on 2019-12-21 06:27:14
Question: I have a file (mydata.txt) that contains many exon sequences in fasta format. I want to find the start ('atg') and stop ('taa', 'tga', 'tag') codons for each DNA sequence (considering the frame). I tried using matchPattern (a function from the Biostrings R package) to find these codons. As an example, mydata.txt could be:
    >a
    atgaatgctaaccccaccgagtaa
    >b
    atgctaaccactgtcatcaatgcctaa
    >c
    atggcatgatgccgagaggccagaataggctaa
    >d
    atggtgatagctaacgtatgctag
    >e
    atgccatgcgaggagccggctgccattgactag
file=read
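The question is about R and Biostrings. "Considering the frame" can be illustrated in Python: find the first start codon, then step through the sequence three bases at a time until a stop codon appears. This is only a sketch of that idea, not the Biostrings approach; `find_orf` and its 0-based return positions are assumptions.

```python
STOP_CODONS = {"taa", "tga", "tag"}

def find_orf(seq, start_codon="atg"):
    """Return (start_pos, stop_pos) as 0-based indices, or None if no start.

    stop_pos is the index of the first in-frame stop codon after the start,
    or None when the frame runs off the end of the sequence.
    """
    seq = seq.lower()
    begin = seq.find(start_codon)
    if begin == -1:
        return None
    # Step codon by codon so only in-frame triplets are examined.
    for pos in range(begin + 3, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return begin, pos
    return begin, None

# Sequences "a" and "c" from the question's example file.
orf_a = find_orf("atgaatgctaaccccaccgagtaa")
orf_c = find_orf("atggcatgatgccgagaggccagaataggctaa")
```

A plain pattern search would also report out-of-frame hits, such as the `taa` inside `gctaac` in sequence "a". Stepping by three from the start codon skips those.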

Sequence length of FASTA file

邮差的信 posted on 2019-12-20 10:51:31
Question: I have the following FASTA file:
    >header1
    CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
    TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
    >header2
    GGT
    >header3
    TTATGAT
My desired output:
    >header1
    117
    >header2
    3
    >header3
    7
    # 3 sequences, total length 127.
This is my code:
    awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa
The output I get with this code is:
    >header1
    60
    57
    >header2
    3
    >header3
    7
I need a small modification in order to deal with
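The awk one-liner prints a length for every sequence line. The desired output needs those lengths accumulated per header and printed once. The question asks for awk; the accumulation logic is sketched here in Python. The function name `sequence_lengths` is an assumption, and the header1 sequence is shortened toy data.

```python
def sequence_lengths(lines):
    """Map each header line to the total length of its sequence lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line
            lengths[current] = 0
        elif current is not None:
            # Add this line's length to the running total for the record.
            lengths[current] += len(line)
    return lengths

# header1 is shortened toy data; header2 and header3 match the question.
lines = [">header1", "ACGTAC", "GT", ">header2", "GGT", ">header3", "TTATGAT"]
lengths = sequence_lengths(lines)
total = sum(lengths.values())
```

The same pattern carries over to awk: keep a running sum, print it when the next header (or the end of input) is reached, and reset it.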