fasta

Read FASTA into a dataframe and extract subsequences of FASTA file

社会主义新天地 submitted on 2019-11-30 06:56:47
Question: I have a small FASTA file of DNA sequences which looks like this:

    >NM_000016 700 200 234
    ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
    >NM_000775 700 124 236
    CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
    >NM_003820 700 111 222
    ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA

Questions:
1) How can I read this FASTA file into R as a data frame where each row is a sequence record, the first column is the RefSeq ID, and the second column is the sequence?
2) How to extract subsequence at (start,

Perl6 : What is the best way for dealing with very big files?

时光毁灭记忆、已成空白 submitted on 2019-11-29 17:05:18
Question: Last week I decided to give Perl6 a try and started to reimplement one of my programs. I have to say, Perl6 makes object-oriented programming very easy, an aspect I found very painful in Perl5. My program has to read and store big files, such as whole genomes (up to 3 Gb and more; see example 1 below) or tabulated data. The first version of the code was written the Perl5 way, iterating line by line ("genome.fa".IO.lines). It was very slow, to the point of an unusable execution time. my class fasta

Read FASTA into a dataframe and extract subsequences of FASTA file

时间秒杀一切 submitted on 2019-11-28 23:28:36
I have a small FASTA file of DNA sequences which looks like this:

    >NM_000016 700 200 234
    ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
    >NM_000775 700 124 236
    CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
    >NM_003820 700 111 222
    ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA

Questions:
1) How can I read this FASTA file into R as a data frame where each row is a sequence record, the first column is the RefSeq ID, and the second column is the sequence?
2) How can I extract the subsequence at a given (start, end) location?

    NM_000016 1 3 #"ACA"
    NM_000775 2 6 #"TAACC"
    NM_003820 3 5 #"TTC"

You should have a look
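The question asks for R; the sketch below shows the same two steps in Python with only the standard library, as an illustration rather than the answer the thread goes on to give. It assumes each record is a `>` header line followed by one or more sequence lines, that the ID is the first whitespace-separated token of the header, and that (start, end) are 1-based and inclusive, which matches the expected outputs listed in the question. The names `read_fasta` and `subsequence` are hypothetical.

```python
import io

def read_fasta(handle):
    """Return a list of (record_id, sequence) rows from a FASTA file handle."""
    rows = []
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                rows.append((record_id, "".join(chunks)))
            # Assumption: the ID is the first token after ">".
            record_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    # Flush the last record, which has no following header.
    if record_id is not None:
        rows.append((record_id, "".join(chunks)))
    return rows

def subsequence(seq, start, end):
    """Extract positions start..end, 1-based and inclusive."""
    return seq[start - 1:end]

example = io.StringIO(
    ">NM_000016 700 200 234\n"
    "ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC\n"
    ">NM_000775 700 124 236\n"
    "CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG\n"
)
table = dict(read_fasta(example))
print(subsequence(table["NM_000016"], 1, 3))  # ACA
print(subsequence(table["NM_000775"], 2, 6))  # TAACC
```

The list of `(id, sequence)` tuples plays the role of the two-column data frame; it can be handed to a table library if one is available.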

Remove line breaks in a FASTA file

混江龙づ霸主 submitted on 2019-11-28 07:03:57
I have a FASTA file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:

    >accession1
    ATGGCCCATG
    GGATCCTAGC
    >accession2
    GATATCCATG
    AAACGGCTTA

I'd like to convert it into this:

    >accession1
    ATGGCCCATGGGATCCTAGC
    >accession2
    GATATCCATGAAACGGCTTA

I found a potential solution on this site, which looks like this:

    cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta

However, this places an extra line break between each entry, so the file looks like this:
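One way to get the desired output without the extra blank lines is to buffer the sequence lines of each record and emit them joined when the next header (or the end of the input) is reached. The following is a minimal Python sketch of that idea, not the awk fix from the thread; the function name `unwrap_fasta` is hypothetical, and it assumes header lines start with `>` and that blank lines can be dropped.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line."""
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Emit the previous record's sequence before the new header.
            if chunks:
                yield "".join(chunks)
                chunks = []
            yield line
        elif line:
            chunks.append(line)
    # Emit the final record's sequence.
    if chunks:
        yield "".join(chunks)

wrapped = [
    ">accession1", "ATGGCCCATG", "GGATCCTAGC",
    ">accession2", "GATATCCATG", "AAACGGCTTA",
]
for out_line in unwrap_fasta(wrapped):
    print(out_line)
```

Because the function is a generator over lines, it can also be fed an open file object, so only one record's sequence is held in memory at a time.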

How to randomly extract FASTA sequences using Python?

 ̄綄美尐妖づ submitted on 2019-11-28 02:11:16
I have the following sequences in FASTA format, each with a sequence header and its nucleotides. How can I randomly extract sequences? For example, I would like to randomly select 2 sequences out of the total. The tools I have found extract by percentage, not by number of sequences. Can anyone help me?

A.fasta

    >chr1:1310706-1310726
    GACGGTTTCCGGTTAGTGGAA
    >chr1:901959-901979
    GAGGGCTTTCTGGAGAAGGAG
    >chr1:983001-983021
    GTCCGCTTGCGGGACCTGGGG
    >chr1:984333-984353
    CTGGAATTCCGGGCGCTGGAG
    >chr1:1154147-1154167
    GAGATCGTCCGGGACCTGGGT

Expected Output

    >chr1
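Selecting a fixed number of records, rather than a percentage, can be sketched with `random.sample` from the Python standard library, which draws `k` distinct items without replacement. The sketch below assumes the records have already been read into a list of `(header, sequence)` pairs; the function name `sample_records` and the optional `seed` parameter are illustrative choices, not part of the original question.

```python
import random

def sample_records(records, k, seed=None):
    """Return k distinct records chosen uniformly at random."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(records, k)

records = [
    (">chr1:1310706-1310726", "GACGGTTTCCGGTTAGTGGAA"),
    (">chr1:901959-901979", "GAGGGCTTTCTGGAGAAGGAG"),
    (">chr1:983001-983021", "GTCCGCTTGCGGGACCTGGGG"),
    (">chr1:984333-984353", "CTGGAATTCCGGGCGCTGGAG"),
    (">chr1:1154147-1154167", "GAGATCGTCCGGGACCTGGGT"),
]

for header, seq in sample_records(records, 2):
    print(header)
    print(seq)
```

`random.sample` raises `ValueError` if `k` is larger than the number of records, which is usually the behaviour one wants here.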

parsing a fasta file using a generator ( python )

匆匆过客 submitted on 2019-11-26 22:54:02
I am trying to parse a large FASTA file and I am encountering out-of-memory errors. Some suggestions to improve the data handling would be appreciated. Currently the program correctly prints out the names, but partway through the file I get a MemoryError. Here is the generator:

    def readFastaEntry( fp ):
        name = ""
        seq = ""
        for line in fp:
            if line.startswith( ">" ):
                tmp = []
                tmp.append( name )
                tmp.append( seq )
                name = line
                seq = ""
                yield tmp
            else:
                seq = seq.join( line )

And here is the caller stub; more will be added after this part works:

    fp = open( sys.argv[1], 'r' )
    for seq in readFastaEntry
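A likely contributor to the MemoryError is `seq = seq.join( line )`: `str.join` inserts `seq` between every character of `line`, so the string grows multiplicatively with each sequence line instead of being appended to. The posted generator also yields an empty entry before the first record and never yields the last one. The following is a minimal sketch of a generator that avoids these issues, assuming headers start with `>`; the name `read_fasta_entries` is hypothetical and this is not the answer given in the thread.

```python
import io

def read_fasta_entries(fp):
    """Yield (name, sequence) pairs one record at a time."""
    name, chunks = None, []
    for line in fp:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Yield the previous record, skipping the empty state before the first header.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            # Collect pieces in a list and join once per record.
            chunks.append(line)
    # Yield the last record, which has no following header.
    if name is not None:
        yield name, "".join(chunks)

demo = io.StringIO(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
for entry_name, entry_seq in read_fasta_entries(demo):
    print(entry_name, len(entry_seq))
```

Only one record is held in memory at a time, so memory use is bounded by the longest single sequence rather than by the size of the file.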

How to randomly extract FASTA sequences using Python?

こ雲淡風輕ζ submitted on 2019-11-26 22:10:14
Question: I have the following sequences in FASTA format, each with a sequence header and its nucleotides. How can I randomly extract sequences? For example, I would like to randomly select 2 sequences out of the total. The tools I have found extract by percentage, not by number of sequences. Can anyone help me?

A.fasta

    >chr1:1310706-1310726
    GACGGTTTCCGGTTAGTGGAA
    >chr1:901959-901979
    GAGGGCTTTCTGGAGAAGGAG
    >chr1:983001-983021
    GTCCGCTTGCGGGACCTGGGG
    >chr1