bioinformatics

Counting genetic mutations in a dictionary using Python

让人想犯罪 __ submitted on 2019-12-13 00:40:30
Question: I have data in this format:
    >abc12
    ATCGACAG
    >def34
    ACCGACG
    etc.
I have stored each gene in a dictionary, with the lines beginning with > as keys. So the dictionary is something like {'abc12':'ATCGACAG', etc.} Now I want to compare the genes by counting the number of A's, T's, C's, and G's at each site. The only thing I can come up with is to break the dictionary into lists for each nucleotide site and use zip() with a counter. Is this the best way, and if so, how do I
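The zip()-plus-counter idea from the question can be sketched roughly as follows. This is a minimal illustration, not an answer from the original thread: the function name `count_bases_per_site` is made up here, and `zip_longest` is used on the assumption that sequences of unequal length (as in the sample data) should still have their trailing sites counted.

```python
from collections import Counter
from itertools import zip_longest


def count_bases_per_site(genes):
    """Return one Counter per site (column) across all sequences."""
    # zip_longest pads shorter sequences with None so no site is dropped;
    # the None padding is skipped when counting.
    columns = zip_longest(*genes.values())
    return [Counter(base for base in column if base is not None)
            for column in columns]


genes = {"abc12": "ATCGACAG", "def34": "ACCGACG"}
site_counts = count_bases_per_site(genes)
for site, counts in enumerate(site_counts, start=1):
    print(site, dict(counts))
```

If all sequences are known to be the same length, plain zip() would do the same job.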

How to make a variable by extracting specific line?

≡放荡痞女 submitted on 2019-12-12 20:14:12
Question: I have data like the following, with SNP names (rs number or c#_pos#) listed under gene names (e.g. ABCB9). In SNPs named c#_pos000000, # ranges from 1 to 22 (the chromosome number). ABCB9 rs11057374 rs7138100 c22_pos41422393 rs12309481 END ABCC10 rs1214748 END HDAC9 rs928578 rs10883039 END HCN2 rs12428035 rs9561933 c2_pos102345 rs3848077 rs3099362 END Using this data, I want to produce output like the following: rs11057374 ABCB9 rs7138100 ABCB9 c22_pos41422393 ABCB9 rs12309481 ABCB9 rs1214748 ABCC10 rs928578
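One way to read the requested transformation is as a small state machine: the first token of each block is the gene name, every following token is a SNP, and END closes the block. The sketch below is in Python purely for illustration (the question does not say which language is wanted). The helper name `snp_gene_pairs` is hypothetical, and it assumes the input can be split on whitespace into tokens.

```python
def snp_gene_pairs(tokens):
    """Yield (snp, gene) pairs from a token stream of gene blocks ending in END."""
    gene = None
    for token in tokens:
        if token == "END":
            gene = None          # the next token starts a new gene block
        elif gene is None:
            gene = token         # first token of a block is the gene name
        else:
            yield token, gene    # every other token is a SNP of that gene


data = """ABCB9 rs11057374 rs7138100 c22_pos41422393 rs12309481 END
ABCC10 rs1214748 END"""
pairs = list(snp_gene_pairs(data.split()))
for snp, gene in pairs:
    print(snp, gene)
```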

Use edit distance on arrays in Perl

旧街凉风 submitted on 2019-12-12 16:28:56
Question: I am attempting to compute the edit distance between two arrays. I have tried using Text::Levenshtein.
    #!/usr/bin/perl -w
    use strict;
    use Text::Levenshtein qw(distance);
    my @words = qw(four foo bar);
    my @list = qw(foo fear);
    my @distances = distance(@list, @words);
    print "@distances\n";
    #results: 3 2 0 3
However, I want the results to appear as follows: 2 0 3 2 3 2 That is, compare the first element of @list against every element of @words, then do the same for the rest of the elements of @list. I
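The desired result is every element of @list compared against every element of @words, which amounts to a nested loop over the two arrays. Here is a sketch of that pairing in Python with a hand-written Levenshtein function, so it does not depend on any particular module's calling convention. The names `levenshtein`, `queries`, and `distance_rows` are illustrative only.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


words = ["four", "foo", "bar"]
queries = ["foo", "fear"]
# One row per query word, one column per word it is compared against.
distance_rows = [[levenshtein(q, w) for w in words] for q in queries]
for row in distance_rows:
    print(*row)
# prints:
# 2 0 3
# 2 3 2
```

In Perl the same shape would come from looping over @list and calling distance once per element.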

Oscillating processing speed in a Python script using pysam.TabixFile to annotate reads

寵の児 submitted on 2019-12-12 11:35:09
Question: The initial question: I'm writing a bioinformatics script in Python (3.5) that parses a large (sorted and indexed) BAM file representing sequencing reads aligned to a genome, associates genomic information ("annotations") with these reads, and counts the types of annotations encountered. I'm measuring the speed at which my script processes aligned reads (over batches of 1000 reads), and I obtain the following speed variations: What could explain this pattern? My intuition would make me bet on

Generating Synthetic DNA Sequence with Substitution Rate

纵饮孤独 submitted on 2019-12-12 08:07:40
Question: Given these inputs:
    my $init_seq = "AAAAAAAAAA"; # length 10 bp
    my $sub_rate = 0.003;
    my $nof_tags = 1000;
    my @dna = qw( A C G T );
I want to generate one thousand length-10 tags, where the substitution rate for each position in a tag is 0.003, yielding output like:
    AAAAAAAAAA
    AATAACAAAA
    .....
    AAGGAAAAGA # 1000th tag
Is there a compact way to do it in Perl? I am stuck on the logic of this script; this is the core:
    #!/usr/bin/perl
    my $init_seq = "AAAAAAAAAA"; # length 10 bp
    my $sub_rate = 0.003;
    my $nof_tags = 1000
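The core logic can be sketched as one random draw per position: with probability sub_rate the base is replaced, otherwise it is kept. The version below is in Python rather than Perl and rests on two assumptions not stated in the question: every tag is derived independently from the initial sequence, and a substitution always picks a base different from the current one. The function name `mutate` is illustrative.

```python
import random


def mutate(seq, sub_rate, alphabet="ACGT", rng=random):
    """Return a copy of seq with each position substituted with probability sub_rate."""
    out = []
    for base in seq:
        if rng.random() < sub_rate:
            # Assumption: a substitution always changes the base.
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)


init_seq = "AAAAAAAAAA"
rng = random.Random(42)  # fixed seed only so the sketch is reproducible
tags = [mutate(init_seq, 0.003, rng=rng) for _ in range(1000)]
print(tags[:3])
```

With a rate of 0.003 and 10 positions, most tags are expected to come out identical to the initial sequence.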

Parsing a text file with multiple columns

旧街凉风 submitted on 2019-12-12 06:28:54
Question: I am attempting to extract each of the 11 columns in the following file: http://bioinfo.mc.vanderbilt.edu/TSGene/Human_716_TSGs.txt ...into a list of scalars for a beginning-level college bioinformatics project. My effort (see below) works, but not perfectly, since the amount of whitespace varies between columns (see the top of the file for details).
    use strict;
    use warnings;
    open FH, '<', 'tsg.txt' or die $!;
    my $data = do {local $/; <FH>};
    close FH or die $!;
    my($id, $sym,
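When whitespace varies between columns, a common cause is that the file is delimited by a single character (often a tab) and some fields contain spaces or are empty. The sketch below shows that approach in Python. It is only an assumption that this file is tab-delimited, and the sample rows are invented placeholders, not data from the file. The function name `read_columns` is illustrative.

```python
import io


def read_columns(lines, delimiter="\t"):
    """Split delimited lines and return them transposed into per-column lists."""
    rows = [line.rstrip("\n").split(delimiter) for line in lines if line.strip()]
    width = max(len(row) for row in rows)
    # Pad short rows so missing trailing fields do not shift the columns.
    padded = [row + [""] * (width - len(row)) for row in rows]
    return [list(column) for column in zip(*padded)]


# Hypothetical three-column sample; the real file is said to have 11 columns.
sample = io.StringIO("id1\tSYM1\tfirst example gene\nid2\tSYM2\t\n")
ids, symbols, descriptions = read_columns(sample)
print(symbols)
```

Splitting on the delimiter rather than on runs of whitespace keeps multi-word fields intact and preserves empty fields.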

XTC file reading error

倾然丶 夕夏残阳落幕 submitted on 2019-12-12 04:20:04
Question:
    #include "xdrfile/xdrfile_xtc.h"
    #include "xdrfile/xdrfile.h"
    #include <stdio.h>
    int main()
    {
        int nat;
        int step;
        float time;
        float prec;
        int status;
        matrix box;
        rvec k[3];
        XDRFILE* xfp = xdrfile_open("test2.xtc", "r");
        status = read_xtc(xfp, nat, &step, &time, box, k, &prec);
        xdrfile_close(xfp);
        return 0;
    }
I tried to run this code, which uses the xtc library to read a trajectory frame from GROMACS, and I am getting a segmentation fault. Can you please help?
Answer 1: Looking at this code Second parameter nat

Nucleotide separator in pairwise sequence alignment with Biopython

情到浓时终转凉″ submitted on 2019-12-12 03:24:16
Question: I have RNA sequences that contain various modified nucleotides and residues, for example N79, 8XU, SDG, and I. I want to pairwise-align them using Biopython's pairwise2.align.localms. Is it possible to pass the input as a list rather than a string, so that these modified bases are accounted for accurately? What is the correct technique?
Answer 1: Biopython's pairwise2 module works on strings of letters, which can be anything, for example:
    >>> from Bio import pairwise2
    >>> from Bio
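To show why treating each residue as one list element matters, here is a small self-contained sketch of local (Smith-Waterman) scoring over token lists. It does not use Biopython, and it is simpler than pairwise2.align.localms: it uses one linear gap penalty instead of separate open and extend penalties, and it returns only the best score, not the alignment. The function name, scoring values, and sample tokens are all illustrative assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score between two token sequences."""
    best = 0
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            # Tokens are compared as whole units, so "8XU" is one residue.
            diagonal = previous[j - 1] + (match if token_a == token_b else mismatch)
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


seq1 = ["A", "N79", "G", "8XU", "C"]
seq2 = ["G", "8XU", "C", "I"]
print(local_alignment_score(seq1, seq2))  # → 6
```

Because comparison happens per list element, a multi-character code such as N79 can never be partially matched against single-letter bases.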

Search for motifs with degenerate positions

元气小坏坏 submitted on 2019-12-12 02:53:36
Question: I have a 15-mer nucleotide motif that contains degenerate nucleotide codes. Example: ATNTTRTCNGGHGCN. I would like to search a set of sequences for occurrences of this motif. However, my other sequences are exact sequences, i.e. they have no ambiguity. I have tried looping over the sequences to search for it, but I have not been able to do non-exact searches. The code I use is modeled after the code in the Biopython cookbook.
    for pos,seq in m.instances.search(test_seq):
        print pos, seq
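One approach that needs no special library is to expand each IUPAC ambiguity code into a regular-expression character class and search the exact sequences with that pattern. The sketch below uses only Python's standard library. The helper names are illustrative, the test sequence is made up, and only the forward strand is searched.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Compile a degenerate motif into a regex that also finds overlapping hits."""
    body = "".join(f"[{IUPAC[code]}]" for code in motif.upper())
    # The lookahead consumes no characters, so overlapping occurrences are reported.
    return re.compile(f"(?=({body}))")


def find_motif(motif, sequence):
    """Return (0-based position, matched text) for every occurrence."""
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


hits = find_motif("ATNTTRTCNGGHGCN", "CCATGTTATCAGGAGCTTT")
print(hits)
```

To cover the reverse strand as well, the same search could be run on the reverse complement of each sequence.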

Selecting highest count of element except when…

半城伤御伤魂 submitted on 2019-12-12 02:47:10
Question: I have been working on a Perl script that analyzes and counts the same letters across different lines. I have stored the counts in a hash, but I am having trouble excluding the "-" character from the hash's output. I tried using delete and next if, but the - count still appears in the output. So with this input: @extract = ------------------------------------------------------------------MGG--------------------------------------------------------------
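The simplest way to keep the gap character out of the results is to skip it while counting, so it never enters the hash in the first place. The sketch below shows that in Python, assuming the goal is the most frequent non-gap letter at each position across equal-length lines. The function name and the sample rows are hypothetical.

```python
from collections import Counter


def most_common_per_column(rows, skip="-"):
    """For each column, return (letter, count) of the top non-gap letter, or None."""
    result = []
    for column in zip(*rows):
        # Gap characters are filtered out before counting.
        counts = Counter(ch for ch in column if ch != skip)
        result.append(counts.most_common(1)[0] if counts else None)
    return result


rows = ["--MGG-", "--MAG-", "-KMGG-"]  # hypothetical aligned lines
consensus = most_common_per_column(rows)
print(consensus)
```

In Perl the equivalent is a `next if $char eq '-';` placed inside the loop that increments the hash, before the increment.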