bioinformatics

Save complete web page (incl css, images) using python/selenium

空扰寡人 submitted on 2020-04-29 07:20:20
Question: I am using Python/Selenium to submit genetic sequences to an online database, and I want to save the full page of results I get back. Below is the code that gets me to the results I want:
from selenium import webdriver
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA
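One possible approach, sketched here under assumptions: Selenium's `driver.page_source` returns only the HTML, so the images, stylesheets, and scripts it references have to be located and downloaded separately. The helper below (the names `AssetCollector` and `collect_assets` are made up for illustration) uses only the standard library to list the absolute URLs of those assets. The download step and the rewriting of links to local copies are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect URLs of images, scripts, and stylesheets referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            # Resolve relative paths against the page URL.
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower()
            if "stylesheet" in rel:
                self.assets.append(urljoin(self.base_url, attrs["href"]))


def collect_assets(html, base_url):
    """Return absolute asset URLs in document order."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    return parser.assets


# Hypothetical page; with Selenium, `html` would come from driver.page_source
# and `base_url` from driver.current_url.
sample_html = """
<html><head><link rel="stylesheet" href="/css/main.css"></head>
<body><img src="img/logo.png"><script src="https://example.org/app.js"></script></body></html>
"""
assets = collect_assets(sample_html, "https://example.org/results/page.html")
print(assets)
```

Each collected URL could then be fetched and written next to the saved HTML file. Assets loaded from inside CSS files (fonts, background images) would not be found by this sketch.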

How to select lines of file based on multiple conditions of another file in R?

雨燕双飞 submitted on 2020-04-17 21:45:40
Question: I have two genetic datasets. I filter file 1 based on a column in file 2. However, I also need to account for a second column in file 2, and I'm not sure how to do this. The condition for extracting a row from file 1 is that its chromosome position must be either more than 5000 larger or more than 5000 smaller than any chromosome position for variants on the same chromosome in file 2. For example, my data looks like:
File 1:
Variant     Chromosome  Chromosome Position
Variant1    2           14000
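The question asks for R; what follows is only a Python sketch of the filtering logic, which could be ported. It assumes the condition means a file 1 row is kept when its position is more than 5000 away from every file 2 position on the same chromosome. The function name `filter_by_distance` and all data except the `Variant1` row are hypothetical.

```python
from collections import defaultdict


def filter_by_distance(file1_rows, file2_rows, min_gap=5000):
    """Keep file1 rows farther than min_gap from all same-chromosome file2 positions."""
    # Group file 2 positions by chromosome so each row is compared only
    # against variants on its own chromosome.
    positions = defaultdict(list)
    for _name, chrom, pos in file2_rows:
        positions[chrom].append(pos)

    kept = []
    for name, chrom, pos in file1_rows:
        others = positions.get(chrom, [])
        if all(abs(pos - other) > min_gap for other in others):
            kept.append((name, chrom, pos))
    return kept


# Rows are (variant, chromosome, position); only Variant1 comes from the question.
file1 = [("Variant1", 2, 14000), ("Variant2", 2, 30000), ("Variant3", 3, 500)]
file2 = [("VariantA", 2, 12000), ("VariantB", 3, 90000)]
kept = filter_by_distance(file1, file2)
print(kept)
```

Here `Variant1` is dropped because it lies only 2000 positions from `VariantA` on chromosome 2. A row on a chromosome that does not appear in file 2 at all is kept, which may or may not be the intended behaviour.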

How to select lines of file based on multiple conditions of another file in R?

若如初见. submitted on 2020-04-17 21:45:22
Question: I have two genetic datasets. I filter file 1 based on a column in file 2. However, I also need to account for a second column in file 2, and I'm not sure how to do this. The condition for extracting a row from file 1 is that its chromosome position must be either more than 5000 larger or more than 5000 smaller than any chromosome position for variants on the same chromosome in file 2. For example, my data looks like:
File 1:
Variant     Chromosome  Chromosome Position
Variant1    2           14000

Snakemake - dynamically derive the targets from input files

纵饮孤独 submitted on 2020-03-23 04:01:57
Question: I have a large number of input files organized like this:
data/
├── set1/
│   ├── file1_R1.fq.gz
│   ├── file1_R2.fq.gz
│   ├── file2_R1.fq.gz
│   ├── file2_R2.fq.gz
|   :
│   └── fileX_R2.fq.gz
├── another_set/
│   ├── asdf1_R1.fq.gz
│   ├── asdf1_R2.fq.gz
│   ├── asdf2_R1.fq.gz
│   ├── asdf2_R2.fq.gz
|   :
│   └── asdfX_R2.fq.gz
:
└── many_more_sets/
    ├── zxcv1_R1.fq.gz
    ├── zxcv1_R2.fq.gz
    :
    └── zxcvX_R2.fq.gz
If you are familiar with bioinformatics, these are of course FASTQ files from paired-end sequencing runs.
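As a rough illustration of deriving targets from a layout like the one above, the sketch below parses `data/<set>/<sample>_R1.fq.gz` style paths with a regular expression and builds one target per sample that has both reads. The target pattern `results/{set}/{sample}.bam` and the function name `derive_targets` are assumptions made for the example, not taken from the question. Inside a Snakefile, Snakemake's `glob_wildcards` plays a similar role of extracting wildcard values from existing files.

```python
import re
from collections import defaultdict

# Matches data/<set>/<sample>_R1.fq.gz and data/<set>/<sample>_R2.fq.gz
FASTQ_RE = re.compile(r"^data/(?P<set>[^/]+)/(?P<sample>.+)_R(?P<read>[12])\.fq\.gz$")


def derive_targets(paths, target_pattern="results/{set}/{sample}.bam"):
    """Return sorted targets for every (set, sample) that has both R1 and R2."""
    reads = defaultdict(set)
    for path in paths:
        m = FASTQ_RE.match(path)
        if m:
            reads[(m["set"], m["sample"])].add(m["read"])
    return sorted(
        target_pattern.format(set=set_name, sample=sample)
        for (set_name, sample), found in reads.items()
        if found == {"1", "2"}  # skip samples missing one mate
    )


# Hypothetical listing; in practice this could come from glob.glob or pathlib.
paths = [
    "data/set1/file1_R1.fq.gz",
    "data/set1/file1_R2.fq.gz",
    "data/set1/file2_R1.fq.gz",
    "data/another_set/asdf1_R1.fq.gz",
    "data/another_set/asdf1_R2.fq.gz",
]
targets = derive_targets(paths)
print(targets)
```

The resulting list could be handed to the `input` of a top-level `rule all`. Samples with only one mate, like `file2` in this listing, are skipped.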

Consolidate similar patterns into single consensus pattern

拥有回忆 submitted on 2020-01-24 14:15:07
Question: In my previous post I did not state the question clearly, so I would like to start a new topic here. I have the following items:
- a sorted list of 59,000 protein patterns (ranging from 3 characters, e.g. "FFK", to 152 characters long);
- some long protein sequences, aka my reference.
I am going to match these patterns against my reference and find the locations where the matches occur. (My friend helped write a script for that.)
import sys
import re
from itertools import chain, izip
#
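The script in the question is cut off (and `izip` exists only in Python 2), so the following is not a reconstruction of it. It is a minimal Python 3 sketch of the matching step alone, assuming each pattern is a literal string and that overlapping occurrences should be reported. The function name and the sample data other than "FFK" are hypothetical.

```python
import re


def find_pattern_locations(patterns, reference):
    """Map each pattern to the 0-based start positions where it occurs in reference."""
    locations = {}
    for pattern in patterns:
        # A zero-width lookahead lets overlapping occurrences be reported too.
        regex = re.compile("(?=" + re.escape(pattern) + ")")
        starts = [m.start() for m in regex.finditer(reference)]
        if starts:
            locations[pattern] = starts
    return locations


reference = "MFFKFFKAAFFKQ"  # hypothetical reference sequence
patterns = ["FFK", "KAA", "WWW"]
hits = find_pattern_locations(patterns, reference)
print(hits)
```

With 59,000 patterns and long references, one scan per pattern may be slow. A multi-pattern method such as Aho-Corasick would be a natural alternative, but that is beyond this sketch.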

How to remove duplicates in list of objects without __hash__

蹲街弑〆低调 submitted on 2020-01-24 04:10:26
Question: I have a list of custom objects from which I want to remove the duplicates. Normally, you would do this by defining both __eq__ and __hash__ for your objects and then taking the set of the list of objects. I have defined __eq__, but I can't figure out a good way to implement __hash__ such that it returns the same value for objects that are equal. More specifically, I have a class that is derived from the Tree class from the ete3 toolkit. I have defined two objects to be equal if their
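When only `__eq__` is available, one fallback is a quadratic scan that compares each item against the ones already kept. The sketch below shows that idea with a made-up `Shape` class standing in for the ete3-derived class, whose equality rule is cut off in the question.

```python
def unique_by_eq(items):
    """Remove duplicates using only ==, preserving first-seen order (O(n^2))."""
    unique = []
    for item in items:
        # `in` on a list compares by identity and ==, so no hashing is needed.
        if item not in unique:
            unique.append(item)
    return unique


class Shape:
    """Hypothetical class that defines __eq__ but has no __hash__."""

    def __init__(self, sides):
        self.sides = sides

    def __eq__(self, other):
        return isinstance(other, Shape) and self.sides == other.sides


shapes = [Shape(3), Shape(4), Shape(3)]
deduplicated = unique_by_eq(shapes)
print([s.sides for s in deduplicated])
```

This costs one comparison per pair in the worst case, which is fine for short lists. For large lists a consistent `__hash__`, even a coarse one that only narrows down the candidates, would be needed to do better.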

How can I count the frequency of letters

柔情痞子 submitted on 2020-01-22 16:54:21
Question: I have data like this:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
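A minimal sketch of counting letters per record in FASTA-formatted text, assuming each record starts with a `>` header line followed by one or more sequence lines. The function name `letter_frequencies` and the short demo records are made up for the example.

```python
from collections import Counter


def letter_frequencies(fasta_text):
    """Return {header: Counter of residues} for each record in FASTA text."""
    counts = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            counts[header] = Counter()
        elif header is not None:
            # Sequence lines may be wrapped, so accumulate across lines.
            counts[header].update(line)
    return counts


demo = ">seq1 demo\nRNDD\nDT\n>seq2\nAAC\n"  # hypothetical records
freqs = letter_frequencies(demo)
print(freqs["seq1 demo"].most_common(1))
```

`Counter.most_common()` then lists the residues from most to least frequent, and dividing each count by the sequence length gives relative frequencies.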

How to find a query that you can find with FASTA but not with BLAST and vice versa?

岁酱吖の submitted on 2020-01-17 03:53:26
Question: I need to find a sequence (or sequences) that gives results (hits) in FASTA but not in BLAST, or vice versa, and I am kind of lost. What should I look for when searching for such sequences?
Answer 1: When you say find a sequence by BLAST or FASTA, I assume you mean find a hit in the database? I think FASTA might be better at finding alignments between dissimilar sequences than BLAST, but BLAST is better at aligning similar sequences.
Source: https://stackoverflow.com/questions/26491285/how-to-find-a

randomForest Error: NA not permitted in predictors (but no NAs in data)

你离开我真会死。 submitted on 2020-01-16 19:16:08
Question: I am attempting to run the 'genie3' algorithm (ref: http://homepages.inf.ed.ac.uk/vhuynht/software.html) in R, which uses the 'randomForest' method. I am running into the following error:
> weight.matrix<-get.weight.matrix(tmpLog2FC, input.idx=1:4551)
Starting RF computations with 1000 trees/target gene, and 67 candidate input genes/tree node
Computing gene 1/11805
Error in randomForest.default(x, y, mtry = mtry, ntree = nb.trees, importance = TRUE, : NA not

Sequence Alignment Algorithm with a group of characters instead of one character

徘徊边缘 submitted on 2020-01-14 13:28:09
Question: Summary: I begin with some details about alignment algorithms and ask my question at the end. If you already know about alignment algorithms, skip the beginning. Suppose we have two strings like:
ACCGAATCGA
ACCGGTATTAAC
There are algorithms, such as Smith-Waterman or Needleman–Wunsch, that align these two sequences and create a matrix. Take a look at the result in the following section:
Smith-Waterman Matrix
§   §   A   C   C   G   A   A   T   C   G   A
§   0   0   0   0   0   0   0   0   0   0   0
A   0   4   0   0   0   4   4   0   0   0   4
C   0   0  13   9   4
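For readers who want to see how such a matrix is filled, here is a small sketch of the standard single-character Smith-Waterman recurrence with a linear gap penalty. The scores used (`match=2`, `mismatch=-1`, `gap=-1`) are assumptions for illustration. The matrix in the question clearly uses a different scoring scheme, so the numbers will not match it.

```python
def smith_waterman_matrix(a, b, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment score matrix; columns follow a, rows follow b."""
    rows, cols = len(b) + 1, len(a) + 1
    H = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if b[i - 1] == a[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in a
            left = H[i][j - 1] + gap    # gap in b
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, up, left)
    return H


H = smith_waterman_matrix("ACCGAATCGA", "ACCGGTATTAAC")
best = max(max(row) for row in H)
print(best)
```

Tracing back from the highest-scoring cell until a zero is reached recovers the best local alignment. Scoring groups of characters instead of single characters, which the title asks about, would require changing how `diag` is computed and is not attempted here.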