biopython

Biopython SeqIO to Pandas Dataframe

亡梦爱人 submitted on 2019-12-03 03:46:22
I have a FASTA file that can easily be parsed by SeqIO.parse. I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's way too heavy (two iterations, conversions, etc.):

    from Bio import SeqIO
    import pandas as pd

    # parse sequence fasta file
    identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
    lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]

    # converting lists to pandas Series
    s1 = pd.Series(identifiers, name='ID')
    s2 = pd.Series(lengths, name='length')

    # Gathering Series
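
One way to avoid parsing the file twice is to collect (ID, length) pairs in a single pass and build the DataFrame from those rows. The following is a minimal sketch, assuming Biopython and pandas are installed; `id_length_rows` and `fasta_to_dataframe` are hypothetical helper names, not part of either library.

```python
def id_length_rows(records):
    """Return (id, sequence length) tuples from one pass over the records.

    `records` is any iterable of objects with `.id` and `.seq`,
    such as the iterator returned by SeqIO.parse.
    """
    return [(record.id, len(record.seq)) for record in records]


def fasta_to_dataframe(path):
    """Build a two-column DataFrame (ID, length) from a FASTA file."""
    # Imported here so that id_length_rows stays usable without these packages.
    from Bio import SeqIO
    import pandas as pd

    rows = id_length_rows(SeqIO.parse(path, "fasta"))
    return pd.DataFrame(rows, columns=["ID", "length"])
```

Calling `fasta_to_dataframe("sequence.fasta")` reads the file once; no intermediate Series objects are needed.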

Traceback in Smith-Waterman algorithm with affine gap penalty

≡放荡痞女 submitted on 2019-12-03 03:06:00
I'm trying to implement the Smith-Waterman algorithm for local sequence alignment using the affine gap penalty function. I think I understand how to initialize and compute the matrices required for calculating alignment scores, but I am clueless as to how to then trace back to find the alignment. To generate the three matrices required, I have the following code:

    for j in range(1, len2):
        for i in range(1, len1):
            fxOpen = F[i][j-1] + gap
            xExtend = Ix[i][j-1] + extend
            Ix[i][j] = max(fxOpen, xExtend)
            fyOpen = F[i-1][j] + gap
            yExtend = Iy[i-1][j] + extend
            Iy[i][j] = max(fyOpen, yExtend)
            matchScore = (F[i-1]
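
The traceback for a three-matrix formulation can be done by remembering which matrix the current cell belongs to and re-checking which recurrence term produced its value. The sketch below is one self-contained way to do this, not the code from the question: the matrix names (`M`, `Ix`, `Iy`), the scoring values, and the convention that `gap_open` is the cost of the first gap position and `gap_extend` the cost of each further one are all assumptions made for illustration.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Local alignment with affine gaps; returns (score, aligned_a, aligned_b)."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; Ix: a[i-1] aligned to a gap; Iy: b[j-1] aligned to a gap.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    Ix = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Iy = [[neg_inf] * (m + 1) for _ in range(n + 1)]

    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            Ix[i][j] = max(M[i - 1][j] + gap_open, Ix[i - 1][j] + gap_extend)
            Iy[i][j] = max(M[i][j - 1] + gap_open, Iy[i][j - 1] + gap_extend)
            diag = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            M[i][j] = max(0, diag + s)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Traceback: start at the best cell of M and walk back until a zero in M.
    i, j = best_pos
    state = "M"
    out_a, out_b = [], []
    while i > 0 and j > 0:
        if state == "M":
            if M[i][j] == 0:
                break
            s = match if a[i - 1] == b[j - 1] else mismatch
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            prev = M[i][j] - s
            # Decide which matrix supplied the diagonal predecessor.
            if prev == M[i - 1][j - 1]:
                state = "M"
            elif prev == Ix[i - 1][j - 1]:
                state = "Ix"
            else:
                state = "Iy"
            i, j = i - 1, j - 1
        elif state == "Ix":
            out_a.append(a[i - 1])
            out_b.append("-")
            # The gap was opened here if the value came from M; otherwise keep extending.
            if Ix[i][j] == M[i - 1][j] + gap_open:
                state = "M"
            i -= 1
        else:  # state == "Iy"
            out_a.append("-")
            out_b.append(b[j - 1])
            if Iy[i][j] == M[i][j - 1] + gap_open:
                state = "M"
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The key design choice is that the traceback carries a state variable rather than storing pointers: with integer scores, comparing a cell against each candidate predecessor is enough to recover the path. With non-integer scores, storing explicit back-pointers during the fill step is safer than testing for equality.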

How to call module written with argparse in IPython notebook

南笙酒味 submitted on 2019-12-03 01:20:12
I am trying to pass BioPython sequences to Ilya Stepanov's implementation of Ukkonen's suffix tree algorithm in the IPython notebook environment. I am stumbling on the argparse component, which I have never had to deal with directly before. How can I use this without rewriting main()? By the by, this writeup of Ukkonen's algorithm is fantastic.

I've had a similar problem before, but using optparse instead of argparse. You don't need to change anything in the original script; just assign a new list to sys.argv, like so:

    if __name__ == "__main__":
        from Bio import SeqIO
        path = '/path/to
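
The sys.argv approach works because `ArgumentParser.parse_args()` reads `sys.argv[1:]` when called without arguments. The sketch below illustrates the idea with a stand-in `main()`; both `main` and `call_with_args` are hypothetical names written for this example, not part of the suffix tree script.

```python
import argparse
import sys


def main():
    """Stand-in for a script entry point that parses its own arguments."""
    parser = argparse.ArgumentParser(description="hypothetical example CLI")
    parser.add_argument("text")
    parser.add_argument("--upper", action="store_true")
    args = parser.parse_args()  # no argument given, so sys.argv[1:] is used
    return args.text.upper() if args.upper else args.text


def call_with_args(func, argv):
    """Call func() with sys.argv temporarily replaced, then restore it."""
    saved = sys.argv
    sys.argv = ["prog"] + list(argv)  # element 0 is the program name
    try:
        return func()
    finally:
        sys.argv = saved
```

In a notebook cell, `call_with_args(main, ["acgt", "--upper"])` then behaves like running the script from the command line with those arguments. Restoring `sys.argv` afterwards matters in a notebook, since the kernel's own arguments are stored there.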

blast against genomes in biopython

折月煮酒 submitted on 2019-12-02 19:46:06
Question:

    from Bio.Blast import NCBIXML
    from Bio.Blast import NCBIWWW

    result_handle = NCBIWWW.qblast(
        "blastn", "nr",
        "CACTTATTTAGTTAGCTTGCAACCCTGGATTTTTGTTTACTGGAGAGGCC",
        entrez_query='"Beutenbergia cavernae DSM 12333" [Organism]')
    blast_records = NCBIXML.parse(result_handle)
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                print(hsp.query[0:75] + '...')
                print(hsp.match[0:75] + '...')
                print(hsp.sbjct[0:75] + '...')

This does not give me an output,

How do I set the PYTHONPATH on Cygwin?

妖精的绣舞 submitted on 2019-12-01 19:09:53
Question: In the Biopython installation instructions, it says that if Biopython doesn't work I'm supposed to do this:

    export PYTHONPATH = $PYTHONPATH':/directory/where/you/put/Biopython'

I tried doing that in Cygwin from the ~ directory using the name of the Biopython directory (or everything of it past the ~ directory), but when I tested it by going into the Python interpreter and typing in "from Bio.Seq import Seq", it said the module doesn't exist. How do I make it so that I don't have to be in the
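
As a way to check what the interpreter actually searches, the module path can be inspected and extended from inside Python. This is a minimal sketch of a runtime alternative to PYTHONPATH, not the answer from the original thread; `add_to_sys_path` is a hypothetical helper name, and the directory passed in is assumed to be the one that contains the `Bio` package folder.

```python
import os
import sys


def add_to_sys_path(directory):
    """Put `directory` at the front of the module search path if it is missing."""
    path = os.path.abspath(os.path.expanduser(directory))
    if path not in sys.path:
        sys.path.insert(0, path)
    return path
```

Printing `sys.path` before and after the call shows whether the directory was already being picked up from the environment.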

How can I merge overlapping strings in python?

☆樱花仙子☆ submitted on 2019-12-01 05:44:57
Question: I have some strings: ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']. These strings partially overlap each other. If you manually overlapped them you would get: SGALWDVPSPV. I want a way to go from the list of overlapping strings to the final compressed string in Python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now are either brute force or involve getting more complicated by using
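
For a list that is already in overlap order, as in the example above, a simple approach is to append each string after dropping the longest prefix that matches the end of the result so far. This is a minimal sketch under that ordering assumption; `merge_overlapping` is a hypothetical name, and unordered fragments would need a different method.

```python
def merge_overlapping(strings):
    """Merge strings where each one overlaps the end of the previous result."""
    if not strings:
        return ""
    merged = strings[0]
    for s in strings[1:]:
        # Try the longest possible overlap first and shrink until one fits.
        for k in range(min(len(merged), len(s)), 0, -1):
            if merged.endswith(s[:k]):
                break
        else:
            k = 0  # no overlap at all: plain concatenation
        merged += s[k:]
    return merged


fragments = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
print(merge_overlapping(fragments))  # SGALWDVPSPV
```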

Installation of biopython - python 3.3 not found in registry

故事扮演 submitted on 2019-12-01 02:22:11
Question: I am trying to install Biopython to run with Python 3.3 on a Windows 7 computer. I have downloaded the Biopython executable biopython-1.61.win32-py3.3-beta.exe. When I attempt to run the executable, however, I get the message "Python version 3.3 is required, which is not found in the registry." Python version 3.3 is present on my computer. I have been running programs through it for a month or two. It was installed from the file python-3.3.0.amd64.msi, and is located in the Program Files (x86)

How to extract chains from a PDB file?

≯℡__Kan透↙ submitted on 2019-12-01 00:49:41
I would like to extract chains from PDB files. I have a file named pdb.txt which contains PDB IDs as shown below. The first four characters represent the PDB ID and the last character is the chain ID.

    1B68A
    1BZ4B
    4FUTA

I would like to:

1) read the file line by line
2) download the atomic coordinates of each chain from the corresponding PDB files
3) save the output to a folder

I used the following script to extract chains, but this code prints only the A chains from the PDB files:

    for i in 1B68 1BZ4 4FUT
    do
        wget -c "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=pdb&compression=NO&structureId="
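
The three steps listed above can be sketched in Python with Bio.PDB instead of wget. This is a sketch under assumptions, not the approach from the original thread: it assumes Biopython is installed, that the entries in pdb.txt are exactly four ID characters followed by one chain character, and that the structures are still available for download in PDB format. The names `parse_entry`, `save_chain`, `extract_all`, and the output folder `chains` are hypothetical.

```python
import os


def parse_entry(entry):
    """Split an entry such as '1B68A' into ('1B68', 'A')."""
    entry = entry.strip()
    return entry[:4], entry[4:]


def save_chain(pdb_id, chain_id, out_dir="chains"):
    """Download one structure and write only the requested chain to out_dir."""
    # Imported here so that parse_entry can be used without Biopython.
    from Bio.PDB import PDBIO, PDBList, PDBParser, Select

    class ChainSelect(Select):
        """Keep only the chain whose ID matches chain_id."""

        def accept_chain(self, chain):
            return chain.get_id() == chain_id

    os.makedirs(out_dir, exist_ok=True)
    path = PDBList().retrieve_pdb_file(pdb_id, pdir=out_dir, file_format="pdb")
    structure = PDBParser(QUIET=True).get_structure(pdb_id, path)

    writer = PDBIO()
    writer.set_structure(structure)
    out_path = os.path.join(out_dir, f"{pdb_id}_{chain_id}.pdb")
    writer.save(out_path, select=ChainSelect())
    return out_path


def extract_all(list_file="pdb.txt", out_dir="chains"):
    """Read the list file line by line and save each requested chain."""
    with open(list_file) as handle:
        for line in handle:
            if line.strip():
                pdb_id, chain_id = parse_entry(line)
                save_chain(pdb_id, chain_id, out_dir=out_dir)
```

Reading the chain ID from each line, rather than hard-coding the IDs in the loop, is what lets each entry select its own chain.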

Phylo BioPython building trees

蓝咒 submitted on 2019-11-30 20:52:33
I am trying to build a tree with Biopython's Phylo module. What I've done so far is this image: each name has a four-digit number followed by - and a number; this second number refers to the number of times that sequence is represented. That means that for 1578 - 22, the node should represent 22 sequences. The file with the aligned sequences: file. The file with the distances to build a tree: file. So now I know how to change the size of each node. Each node has a different size, which is easy to do with an array of the different values:

    fh = open(MEDIA_ROOT + "groupsnp.txt")
    list_size = {}
    for line in fh:
        if '>' in line:

Extract sequences from a FASTA file based on entries in a separate file

拥有回忆 submitted on 2019-11-30 16:30:52
I have two files. File 1: a FASTA file with gene sequences, formatted like this example:

    >PITG_00002 | Phytophthora infestans T30-4 conserved hypothetical protein (426 nt)
    ATGCATCGCTCGGGTTCCGCACGGAAAGCCCAAGGTCTGGGATTACGGGGTGGTGGTCGG
    TTACACTTGGAATAACCTCGCAAATTCAGAATCTCTACAGGCTACGTTCGCGGATGGAAC
    >PITG_00003 | Phytophthora infestans T30-4 protein kinase (297 nt)
    ATGACGGCTGGGGTCGGTACGCCCTACTGGATCGCACCGGAGATTCTTGAAGGCAAACGG
    TACACTGAGCAAGCGGATATTTACTCGTTCGGAGTGGTTTTATCCGAGCTGGACACGTGC
    AAGATGCCGTTCTCTGACGTCGTTACGGCAGAGGGAAAGAAACCCAAACCAGTTCAGATC
    >PITG_00004 | Phytophthora infestans T30-4 protein kinase
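
A common pattern for this kind of task is to load the wanted IDs into a set and keep only the FASTA records whose ID is in it. The sketch below assumes Biopython is installed and that the second file lists one ID per line (for example PITG_00002), which is a guess since its format is not shown here; `select_records` and `extract_to_fasta` are hypothetical names.

```python
def select_records(records, wanted_ids):
    """Return the records whose .id is in wanted_ids, in file order."""
    wanted = set(wanted_ids)  # set membership keeps each lookup cheap
    return [record for record in records if record.id in wanted]


def extract_to_fasta(fasta_path, ids_path, out_path):
    """Write the matching records to out_path; returns the number written."""
    # Imported here so that select_records can be used without Biopython.
    from Bio import SeqIO

    with open(ids_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}
    selected = select_records(SeqIO.parse(fasta_path, "fasta"), wanted)
    return SeqIO.write(selected, out_path, "fasta")
```

For headers like the ones above, SeqIO uses the text before the first whitespace as the record ID, so `>PITG_00002 | ...` yields the ID `PITG_00002`.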