bioinformatics

Regex Protein Digestion

别等时光非礼了梦想. Submitted on 2019-12-09 16:16:12
Question: So, I'm digesting a protein sequence with an enzyme (for your curiosity, Asp-N) which cleaves before the amino acids coded by B or D in a single-letter coded sequence. My actual analysis uses String#scan for the captures. I'm trying to figure out why the following regular expression doesn't digest it correctly... (\w*?)(?=[BD])|(.*\b) where the second alternative (.*\b) exists to capture the end of the sequence. For: MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN This should give something like: [MTM,
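The cleavage rule described in the excerpt (cut before every B or D) can be sketched in Python as follows. This is an illustrative alternative, not the Ruby String#scan call from the question; the name `digest_asp_n` and the pattern are assumptions made for this sketch. Instead of a lookahead, the pattern matches each fragment directly: either a B or D followed by non-cleavage residues, or the leading run before the first cleavage site.

```python
import re

# Either a fragment that starts at a cleavage residue (B or D), or the
# leading fragment that precedes the first cleavage residue.
FRAGMENT = re.compile(r"[BD][^BD]*|[^BD]+")


def digest_asp_n(sequence):
    """Split a one-letter protein sequence before every B or D (hypothetical helper)."""
    return FRAGMENT.findall(sequence)


sequence = "MTMDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"
fragments = digest_asp_n(sequence)
print(fragments)
```

Because every residue belongs to exactly one fragment, joining the fragments gives back the input sequence, which is a cheap sanity check for any digestion pattern.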

Remove item from list based on the next item in same list

拈花ヽ惹草 Submitted on 2019-12-09 14:06:13
Question: I just started learning Python and I have a sorted list of protein sequences (59,000 sequences in total), some of which overlap. I have made a toy list here as an example: ABCDE ABCDEFG ABCDEFGH ABCDEFGHIJKLMNO CEST DBTSFDE DBTSFDEO EOEUDNBNUW EOEUDNBNUWD EAEUDNBNUW FEOEUDNBNUW FG FGH I would like to remove the shorter overlapping sequences and keep only the longest one, so the desired output would look like this: ABCDEFGHIJKLMNO CEST DBTSFDEO EAEUDNBNUW FEOEUDNBNUWD FGH How can I do it? My code looks
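One way to read "overlap" here is that a sequence is redundant when the next entry in the sorted list starts with it. A minimal sketch under that assumption follows; `drop_prefixes` is a hypothetical name, and the question's own code is cut off in the excerpt, so this is not a correction of it. On the toy list this rule keeps EOEUDNBNUWD and FEOEUDNBNUW, which is not identical to the output quoted in the excerpt, so the prefix reading should be treated as a guess.

```python
def drop_prefixes(sorted_seqs):
    """Keep a sequence only if the next entry does not extend it.

    Assumes the input is sorted so that a sequence is immediately followed
    by any longer sequence that starts with it.
    """
    seqs = list(sorted_seqs)
    kept = []
    for current, following in zip(seqs, seqs[1:] + [None]):
        if following is None or not following.startswith(current):
            kept.append(current)
    return kept


print(drop_prefixes(["ABCDE", "ABCDEFG", "CEST", "FG", "FGH"]))
```

The comparison only looks one entry ahead, so the pass is linear in the number of sequences, which matters little for 59,000 entries but keeps the code simple.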

Create an “index” for each element of a group with data.table

主宰稳场 Submitted on 2019-12-09 09:06:55
Question: My data is grouped by the IDs in V6 and ordered by position (V1:V3):
dt
   V1   V2      V3      V4 V5 V6
1: chr1 3054233 3054733 .  +  ENSMUSG00000090025
2: chr1 3102016 3102125 .  +  ENSMUSG00000064842
3: chr1 3205901 3207317 .  -  ENSMUSG00000051951
4: chr1 3206523 3207317 .  -  ENSMUSG00000051951
5: chr1 3213439 3215632 .  -  ENSMUSG00000051951
6: chr1 3213609 3216344 .  -  ENSMUSG00000051951
7: chr1 3214482 3216968 .  -  ENSMUSG00000051951
8: chr1 3421702 3421901 .  -  ENSMUSG00000051951
9: chr1 3466587 3466687 .  +
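The question asks for a data.table solution, and its exact requirements are cut off in the excerpt. Purely as a language-neutral illustration of what a per-group "index" could mean, the sketch below numbers each row within its group in order of appearance. The function name `index_within_group` is hypothetical, and the interpretation (a running 1-based counter per V6 ID) is an assumption drawn from the title alone.

```python
from collections import defaultdict


def index_within_group(group_ids):
    """Return a 1-based running counter for each element within its group."""
    counts = defaultdict(int)
    indices = []
    for gid in group_ids:
        counts[gid] += 1
        indices.append(counts[gid])
    return indices


# V6 values of the first eight rows shown in the excerpt.
v6 = (["ENSMUSG00000090025", "ENSMUSG00000064842"]
      + ["ENSMUSG00000051951"] * 6)
print(index_within_group(v6))
```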

Organizing the output of my shell script into tables within the text file

丶灬走出姿态 Submitted on 2019-12-09 07:39:34
Question: I am working with a Unix shell script that does genome construction then creates a phylogeny. Depending on the genome assembler you use, the final output (the phylogeny) may change. I wish to compare the effects of using various genome assemblers. I have developed some metrics to compare them on, but I need help organizing them so I can run useful analyses. I would like to import my data into Excel in columns. This is the script I am using to output data: echo "Enter the size (Mb or Gb) of

Is there a function that can calculate a score for aligned sequences given the alignment parameters?

我与影子孤独终老i Submitted on 2019-12-09 06:20:40
Question: I am trying to score already-aligned sequences. Let's say seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE' seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE' with the given parameters substitution matrix: blosum62, gap open penalty: -5, gap extension penalty: -1. I did look through the Biopython cookbook, but all I can get is the substitution matrix blosum62, and I feel that someone must have already implemented this kind of library. So can anyone suggest any libraries or shortest code that can
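Scoring an existing alignment column by column can be sketched without any library, as below. The conventions are assumptions, since the excerpt does not state them: a gap run of length n costs `gap_open + (n - 1) * gap_extend`, columns where both sequences have a gap are skipped, and the substitution score comes from a caller-supplied function. The toy `match_mismatch` scorer is a stand-in only; a real BLOSUM62 lookup would be passed in its place.

```python
def score_alignment(seq_a, seq_b, subst, gap_open=-5, gap_extend=-1):
    """Score two already-aligned sequences with affine gap penalties.

    subst(x, y) returns the substitution score for residues x and y.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    gap_side = None  # 'a' or 'b' while a gap run is open in that sequence
    for x, y in zip(seq_a, seq_b):
        if x == "-" and y == "-":
            continue  # assumption: gap-gap columns carry no score
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # First gap column of a run pays the open penalty, later ones extend.
            total += gap_extend if gap_side == side else gap_open
            gap_side = side
        else:
            total += subst(x, y)
            gap_side = None
    return total


def match_mismatch(x, y):
    """Toy scorer used only for this example (not BLOSUM62)."""
    return 1 if x == y else -1


print(score_alignment("AC--GT", "ACTTGA", match_mismatch))
```

Gap conventions differ between tools (some charge open plus extend for the first gap column), so the result is only comparable with another program if the same convention is used.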

How do I change this to “idiomatic” Perl?

江枫思渺然 Submitted on 2019-12-09 06:07:31
Question: I am beginning to delve deeper into Perl, but am having trouble writing "Perl-ly" code instead of writing C in Perl. How can I change the following code to use more Perl idioms, and how should I go about learning the idioms? Just an explanation of what it is doing: This routine is part of a module that aligns DNA or amino acid sequences (using Needleman-Wunsch, if you care about such things). It creates two 2d arrays, one to store a score for each position in the two sequences, and one to keep

Only call function if PyMOL running

跟風遠走 Submitted on 2019-12-08 19:05:29
I have a script that performs some calculations on a protein. When it's finished, a method imports the pymol module, and uses the pymol.cmd API to display results in a PyMOL session. The process is something akin to the following: def display_results(results, protein_fn): import pymol pymol.cmd.load(protein_fn) pymol.cmd.alter(...) ... protein_fn = "1abc.ent" results = analyze_protein(protein_fn) display_results(results, protein_fn) However, my script doesn't necessarily need to display the results in PyMOL, and I'd like this to only be done if PyMOL is installed and running. It's easy to
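The "installed" half of the condition can be checked with the standard library before attempting the import, as in the sketch below. The helper names `module_available` and `display_results_if_possible` are hypothetical. Detecting whether a PyMOL session is actually running is not covered here, because the excerpt is cut off before that part and no PyMOL call for it is assumed.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def display_results_if_possible(results, protein_fn):
    """Show results in PyMOL only when the pymol module is installed."""
    if not module_available("pymol"):
        return False
    import pymol  # deferred so the script still works without PyMOL
    pymol.cmd.load(protein_fn)
    # ... apply `results` here, e.g. with pymol.cmd.alter(...) as in the question
    return True


print(module_available("json"))
```

Returning a boolean lets the caller decide whether to fall back to plain text output when PyMOL is absent.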

Converting NMR ascii file to peak list

人盡茶涼 Submitted on 2019-12-08 15:17:16
Question: I have some Bruker NMR spectra that I am using to create a program as part of a project. My program needs to work on the actual spectrum, so I converted the 1r files of the Bruker NMR spectra to ASCII. For Carnitine, this is what the ASCII file looks like (this is not the complete list; the complete list runs into thousands of lines, and this is only a snapshot): -0.807434 -23644 -0.807067 -22980 -0.806701 -22967 -0.806334 -24513 -0.805967 -27609 -0.805601 -31145 -0.805234 -33951 -0.804867 -35553
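Assuming the ASCII file holds alternating chemical shift and intensity values as in the snapshot, a very simple peak list can be sketched by keeping local maxima above a threshold. The function names and the `min_intensity` parameter are hypothetical, and real spectra usually need baseline and noise handling that this sketch omits.

```python
def parse_spectrum(text):
    """Parse whitespace-separated 'shift intensity' pairs into (float, float) tuples."""
    values = [float(token) for token in text.split()]
    if len(values) % 2:
        raise ValueError("expected an even number of values")
    return list(zip(values[0::2], values[1::2]))


def pick_peaks(points, min_intensity=0.0):
    """Return points that are local intensity maxima at or above min_intensity."""
    peaks = []
    for i in range(1, len(points) - 1):
        prev_y, y, next_y = points[i - 1][1], points[i][1], points[i + 1][1]
        if y > prev_y and y >= next_y and y >= min_intensity:
            peaks.append(points[i])
    return peaks


# Synthetic data for illustration only, not taken from the spectrum above.
demo = parse_spectrum("0.1 5 0.2 9 0.3 4 0.4 7 0.5 12 0.6 3")
print(pick_peaks(demo))
```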

Write a Perl script that takes in a fasta and reverses all the sequences (without BioPerl)?

眉间皱痕 Submitted on 2019-12-08 13:09:55
Question: I don't know if this is just a quirk with Strawberry Perl, but I can't seem to get it to run. I just need to take a fasta and reverse every sequence in it. -The problem- I have a multifasta file: >seq1 ABCDEFG >seq2 HIJKLMN and the expected output is: >REVseq1 GFEDCBA >REVseq2 NMLKJIH The script is here: $NUM_COL = 80; ## set the column width of output file $infile = shift; ## grab input sequence file name from command line $outfile = "test1.txt"; ## name output file, prepend with “REV” open
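The task itself (reverse each record, prefix the header with REV, wrap at 80 columns) can be sketched in Python as below. This is not a fix of the Perl script, which is cut off in the excerpt; `read_fasta` and `reverse_fasta` are hypothetical names, and the input is assumed to be an iterable of lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reverse_fasta(lines, width=80):
    """Return output lines with each sequence reversed and 'REV' prefixed to its header."""
    out = []
    for header, seq in read_fasta(lines):
        rev = seq[::-1]
        out.append(">REV" + header)
        out.extend(rev[i:i + width] for i in range(0, len(rev), width))
    return out


print("\n".join(reverse_fasta([">seq1", "ABCDEFG", ">seq2", "HIJKLMN"])))
```

Joining the wrapped lines of a record before reversing is the important step; reversing line by line would scramble multi-line sequences.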

compare multiple hashes for common keys merge values

生来就可爱ヽ(ⅴ<●) Submitted on 2019-12-08 13:04:29
I have a working bit of code here where I am comparing the keys of six hashes to find the ones that are common to all of them. I then combine the values from each hash into one value in a new hash. What I would like to do is make this scalable. I would like to be able to easily go from comparing 3 hashes to 100 without having to go back into my code and alter it. Any thoughts on how I would achieve this? The rest of the code already works well for different input amounts, but this is the one part that has me stuck. my $comparison = List::Compare->new([keys %{$posHashes[0]}],
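The scaling idea (intersect the key sets of however many hashes there are, then gather the values) can be sketched in Python as follows. It illustrates the approach rather than the Perl List::Compare call, which is cut off in the excerpt. `merge_common` is a hypothetical name, and collecting the values into a list is an assumption, since the excerpt does not say how the values are combined.

```python
def merge_common(dicts):
    """Map every key present in all dicts to the list of its values, in input order."""
    if not dicts:
        return {}
    # Iterating a dict yields its keys, so they can be passed to intersection directly.
    common = set(dicts[0]).intersection(*dicts[1:])
    return {key: [d[key] for d in dicts] for key in common}


hashes = [{"a": 1, "b": 2}, {"a": 3, "c": 4}, {"a": 5, "b": 6}]
print(merge_common(hashes))
```

Because the function takes a list of any length, going from 3 inputs to 100 needs no code change.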