bioinformatics

Longest repeated (k times) substring

白昼怎懂夜的黑 submitted on 2019-12-05 14:27:55
Question: I know this is a somewhat beaten topic, but I have reached the limit of help I can get from what's already been answered. This is for the Rosalind project problem LREP. I'm trying to find the longest k-peated substring in a string and I've been provided the suffix tree, which is nice. I know that I need to annotate the suffix table with the number of descendant leaves from each node, then find nodes with >=k descendants, and finally find the deepest of those nodes. Theory-wise I'm set. I've
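
For comparison, here is a minimal sketch of the same task that does not use the suffix tree at all. It swaps in a simpler technique: a binary search over substring length, with a hash count of all substrings of that length. This works because if a substring of length L occurs at least k times, so does its prefix of length L-1. The function name `longest_k_repeat` is hypothetical, overlapping occurrences are counted, and the approach is only suitable for short inputs.

```python
from collections import Counter


def longest_k_repeat(text: str, k: int) -> str:
    """Return a longest substring of `text` that occurs at least k times.

    Overlapping occurrences count. Returns "" if nothing qualifies.
    Not the suffix-tree method: a binary search over candidate lengths.
    """

    def find(length: int):
        # Count every substring of this length, then return the first
        # one (by position) that reaches k occurrences, or None.
        last_start = len(text) - length
        counts = Counter(text[i:i + length] for i in range(last_start + 1))
        for i in range(last_start + 1):
            candidate = text[i:i + length]
            if counts[candidate] >= k:
                return candidate
        return None

    best = ""
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        found = find(mid)
        if found is not None:
            best, lo = found, mid   # a repeat of this length exists; try longer
        else:
            hi = mid - 1            # too long; try shorter
    return best
```

With the suffix tree, the equivalent answer is the string spelled out on the path to the deepest node that has at least k descendant leaves.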

How do I decide which way to backtrack in the Smith–Waterman algorithm?

倾然丶 夕夏残阳落幕 submitted on 2019-12-05 13:39:41
I am trying to implement local sequence alignment in Python using the Smith–Waterman algorithm. Here's what I have so far. It gets as far as building the similarity matrix:

    import sys, string
    from numpy import *

    f1=open(sys.argv[1], 'r')
    seq1=f1.readline()
    f1.close()
    seq1=string.strip(seq1)

    f2=open(sys.argv[2], 'r')
    seq2=f2.readline()
    f2.close()
    seq2=string.strip(seq2)

    a,b =len(seq1),len(seq2)
    penalty=-1; point=2;

    #generation of matrix for local alignment
    p=zeros((a+1,b+1))

    # table calculation and matrix generation
    for i in range(1,a+1):
        for j in range(1,b+1):
            vertical_score =p[i-1][j]
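
One common way to decide the backtracking direction is to start at the highest-scoring cell and, at each step, recompute which neighbour (diagonal, up, or left) could have produced the current score, stopping at the first zero. The sketch below illustrates that idea in plain Python. It is not the asker's code: the function name and the mismatch score of -1 are assumptions (the snippet above only shows `point=2` and `penalty=-1`), and ties are broken in favour of the diagonal.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Local alignment sketch: returns (best_score, aligned1, aligned2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix, remembering the first cell with the maximum score.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,  # diagonal
                              score[i - 1][j] + gap,      # up
                              score[i][j - 1] + gap)      # left
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: follow whichever move reproduces the current score.
    i, j = best_pos
    out1, out2 = [], []
    while score[i][j] > 0:
        sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1]); out2.append("-")
            i -= 1
        else:  # only the left move remains
            out1.append("-"); out2.append(seq2[j - 1])
            j -= 1

    return best, "".join(reversed(out1)), "".join(reversed(out2))
```

An alternative design stores a pointer matrix while filling, which avoids recomputing the three candidates during traceback at the cost of extra memory.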

AWK: extract lines if column in file 1 falls within a range declared in two columns in other file

孤街醉人 submitted on 2019-12-05 08:00:51
Question: Currently I'm struggling with an AWK problem that I haven't been able to solve yet. I have one huge file (30GB) of genomic data that holds a list of positions (declared in col 1 and 2) and a second list that holds a number of ranges (declared in col 3, 4 and 5). I want to extract all lines in the first file where the position falls within a range declared in the second file. As the position is only unique within a certain chromosome (chr), first it has to be tested if the chr's are
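
The lookup logic can be sketched in Python rather than AWK. This is only an illustration under stated assumptions: the function names are hypothetical, the first file is whitespace-separated with the chromosome in column 1 and the position in column 2, the ranges have already been parsed into `(chrom, start, end)` tuples, and range ends are inclusive. The ranges are loaded into memory and merged per chromosome, then the large file is streamed line by line with a binary search per position.

```python
import bisect
from collections import defaultdict


def build_index(ranges):
    """Map each chromosome to (starts, ends) of its merged, sorted ranges."""
    by_chrom = defaultdict(list)
    for chrom, start, end in ranges:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlaps the previous range: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def in_range(index, chrom, pos):
    """True if pos falls inside any range on the same chromosome."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect.bisect_right(starts, pos) - 1  # last range starting <= pos
    return i >= 0 and pos <= ends[i]


def filter_lines(lines, index):
    """Yield the lines whose (column 1, column 2) position is in a range."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            if in_range(index, fields[0], int(fields[1])):
                yield line
```

Because `filter_lines` is a generator, passing it an open file object keeps memory use independent of the size of the 30GB file; only the range index is held in memory.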

'StringCut' to the left or right of a defined position using Mathematica

两盒软妹~` submitted on 2019-12-05 02:42:08
On reading this question, I thought the following problem would be simple using StringSplit. Given the following string, I want to 'cut' it to the left of every "D" such that: I get a List of fragments (with sequence unchanged); StringJoin@fragments gives back the original string (but it does not matter if I have to reorder the fragments to obtain this). That is, sequence within each fragment is important, and I do not want to lose any characters. (The example I am interested in is a protein sequence (string) where each character represents an amino acid in one-letter code. I want to obtain

A more complex version of “How can I tell if a string repeats itself in Python?”

若如初见. submitted on 2019-12-05 01:38:23
I was reading this post and I wonder if someone can find a way to catch repetitive motifs in a more complex string. For example, find all the repetitive motifs in:

    string = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

Here are the repetitive motifs:

    'AAAC ACGTACGT AATTCC GTGTGT CCCC TATACGTATACG TTT'

So, the output should be something like this:

    output = {'ACGT': {'repeat': 2, 'region': (5,13)}, 'GT': {'repeat': 3, 'region': (19,24)}, 'TATACG': {'repeat': 2, 'region': (29,40)}}

This example comes from a typical biological phenomenon, the microsatellite, which is present in DNA. UPDATE 1:

SeqIO.parse on a fasta.gz

…衆ロ難τιáo~ submitted on 2019-12-05 01:28:09
New to coding. New to Python/Biopython; this is my first question online, ever. How do I open a compressed fasta.gz file to extract info and perform calculations in my function? Here is a simplified example of what I'm trying to do (I've tried different ways), and what the error is. The gzip command I'm using doesn't seem to work.

    with gzip.open("practicezip.fasta.gz", "r") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            print(record.id)

    Traceback (most recent call last):
      File "<ipython-input-192-a94ad3309a16>", line 2, in <module>
        for record in SeqIO.parse(handle, "fasta"):
      File "C:
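
The traceback above is cut off, so the exact error is not visible here. A likely cause, stated as an assumption, is that in Python 3 `gzip.open(path, "r")` returns a binary handle that yields bytes, while a FASTA parser expects text. Opening with mode `"rt"` gives a text handle instead. The sketch below shows this with a small hand-rolled header reader (the hypothetical `fasta_ids`) so that it runs without Biopython; the same text-mode handle is what one would pass to `SeqIO.parse(handle, "fasta")`.

```python
import gzip


def fasta_ids(path):
    """Yield the ID (first word after '>') of each record in a .fasta.gz file."""
    # "rt" = read as text; plain "r" would yield bytes in Python 3.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                yield fields[0] if fields else ""
```

For example, `for record_id in fasta_ids("practicezip.fasta.gz"): print(record_id)` would print one ID per record.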

Extract sample data from VCF files

时间秒杀一切 submitted on 2019-12-05 00:41:20
I have a large Variant Call Format (VCF) file (> 4GB) which has data for several samples. I have browsed Google and Stack Overflow, and tried the VariantAnnotation package in R, to somehow extract data only for a particular sample, but have not found any information on how to do that in R. Has anybody tried anything like that, or does anyone know of another package that would enable this?

In VariantAnnotation use a ScanVcfParam to specify the data that you'd like to extract. Using the sample VCF file included with the package:

    library(VariantAnnotation)
    vcfFile = system.file(package=

Python find longest ORF in DNA sequence

僤鯓⒐⒋嵵緔 submitted on 2019-12-04 21:22:36
Can someone show me a straightforward solution for how to calculate the longest open reading frame (ORF) in a DNA sequence? ATG is the start codon (i.e., the beginning of an ORF) and TAG, TGA, and TAA are stop codons (i.e., the end of an ORF). Here's some code that produces errors (and uses an external module called BioPython):

    import sys
    from Bio import SeqIO

    currentCid = ''
    buffer = []

    for record in SeqIO.parse(open(sys.argv[1]),"fasta"):
        cid = str(record.description).split('.')[0][1:]
        if currentCid == '':
            currentCid = cid
        else:
            if cid != currentCid:
                buffer.sort(key = lambda x : len(x[1]))
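
A minimal sketch of one straightforward approach, using only the codon definitions given in the question: walk each of the three forward reading frames codon by codon, open an ORF at the first ATG, and close it at the next in-frame stop codon. The function name is hypothetical, and several choices are assumptions rather than part of the question: only the forward strand is scanned, the returned ORF includes its stop codon, and an ATG with no in-frame stop is ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna: str) -> str:
    """Return the longest ATG...stop ORF on the forward strand, or ""."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best
```

To cover the reverse strand as well, the same function could be called on the reverse complement of the sequence and the longer of the two results kept.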

How do I generate all possible Newick Tree permutations for a set of species given an outgroup?

对着背影说爱祢 submitted on 2019-12-04 20:40:54
Question: How do I generate all possible Newick Tree permutations for a set of species given an outgroup? For those who don't know what Newick tree format is, a good description is available at: https://en.wikipedia.org/wiki/Newick_format I want to create all possible Newick Tree permutations for a set of species given an outgroup. The trees I expect to process will most likely have 4, 5, or 6 leaf nodes. Both "soft" and "hard" polytomies are allowed. https://en.wikipedia.org/wiki/Polytomy

How can I get taxonomic rank names from taxid?

喜夏-厌秋 submitted on 2019-12-04 15:20:58
This question is related to: How to get taxonomic specific ids for kingdom, phylum, class, order, family, genus and species from taxid? The solution given there works, but I would like to have the names for each taxonomic ID at the defined ranks. I have found this in ete3, which can do the job:

    names = ncbi.get_taxid_translator(lineage)
    print [names[taxid] for taxid in lineage]

but, not being a Python programmer, I am failing to incorporate this into the code given in the link above. Here is what I have tried:

    import csv
    from ete3 import NCBITaxa

    ncbi = NCBITaxa()

    def get_desired_ranks(taxid, desired