bioinformatics | 易学教程

Save complete web page (incl css, images) using python/selenium

Question: I am using Python/Selenium to submit genetic sequences to an online database, and I want to save the full page of results I get back. Below is the code that gets me to the results I want:

from selenium import webdriver
URL = 'https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome'
SEQUENCE = 'CCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACAGCTCAAACACAAAGTTACCTAAACTATAGAAGGACA

How to select lines of file based on multiple conditions of another file in R?

Question: I have 2 genetic datasets. I filter file 1 based on a column in file 2. However, I also need to account for a second column in file 2 and I'm not sure how to do this. The condition for extracting a row from file 1 is that its chromosome position must be more than 5000 larger or more than 5000 smaller than every chromosome position of the variants on the same chromosome in file 2. For example, my data looks like:

File 1:
Variant    Chromosome    Chromosome Position
Variant1   2             14000
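The filtering condition described above can be sketched in plain Python (used here instead of R only to keep the examples on this page in one language). This is a minimal sketch, not the asker's method: the function name `filter_far_variants`, the tuple layout `(variant, chromosome, position)`, and all rows except `Variant1` are assumptions made for illustration. It also assumes that a file 1 row is kept when file 2 has no variants on that chromosome.

```python
from collections import defaultdict


def filter_far_variants(file1_rows, file2_rows, min_distance=5000):
    """Keep file 1 rows lying more than min_distance away from every
    file 2 position on the same chromosome."""
    # Group the file 2 positions by chromosome for quick lookup.
    positions_by_chrom = defaultdict(list)
    for _name, chrom, pos in file2_rows:
        positions_by_chrom[chrom].append(pos)

    kept = []
    for name, chrom, pos in file1_rows:
        others = positions_by_chrom.get(chrom, [])
        # all() over an empty list is True, so rows on chromosomes that
        # are absent from file 2 are kept (an assumption of this sketch).
        if all(abs(pos - other) > min_distance for other in others):
            kept.append((name, chrom, pos))
    return kept


# Hypothetical rows; only Variant1 (chromosome 2, position 14000) is from the question.
file1 = [("Variant1", 2, 14000), ("VariantA", 2, 30000), ("VariantB", 3, 500)]
file2 = [("VariantX", 2, 12000), ("VariantY", 3, 90000)]

result = filter_far_variants(file1, file2)
print(result)
```

With these made-up rows, `Variant1` is dropped because it lies only 2000 positions from `VariantX` on chromosome 2, while the other two rows are kept.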

Snakemake - dynamically derive the targets from input files

Question: I have a large number of input files organized like this:

data/
├── set1/
│   ├── file1_R1.fq.gz
│   ├── file1_R2.fq.gz
│   ├── file2_R1.fq.gz
│   ├── file2_R2.fq.gz
│   :
│   └── fileX_R2.fq.gz
├── another_set/
│   ├── asdf1_R1.fq.gz
│   ├── asdf1_R2.fq.gz
│   ├── asdf2_R1.fq.gz
│   ├── asdf2_R2.fq.gz
│   :
│   └── asdfX_R2.fq.gz
:
└── many_more_sets/
    ├── zxcv1_R1.fq.gz
    ├── zxcv1_R2.fq.gz
    :
    └── zxcvX_R2.fq.gz

If you are familiar with bioinformatics, these are of course FASTQ files from paired-end sequencing runs.
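One way to derive target names from a layout like the one above is to extract the set and sample name from each file path. The following is a plain-Python sketch of that idea, not a Snakefile and not the accepted answer: the function `derive_targets`, the `results` output directory, and the `.bam` suffix are hypothetical choices for illustration. Inside a real Snakefile, Snakemake's `glob_wildcards` helper is commonly used for the same purpose.

```python
import re
from pathlib import PurePosixPath

# Matches names such as "file1_R1.fq.gz" and captures the sample part.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_R[12]\.fq\.gz$")


def derive_targets(paths, out_dir="results", suffix=".bam"):
    """Return one target path per (set, sample) pair found in paths.

    out_dir and suffix are hypothetical; adapt them to the real workflow.
    """
    samples = set()
    for raw in paths:
        path = PurePosixPath(raw)
        match = PAIR_PATTERN.match(path.name)
        if match:
            # The parent directory name is the set, e.g. "set1".
            samples.add((path.parent.name, match.group("sample")))
    return [f"{out_dir}/{set_name}/{sample}{suffix}"
            for set_name, sample in sorted(samples)]


inputs = [
    "data/set1/file1_R1.fq.gz", "data/set1/file1_R2.fq.gz",
    "data/set1/file2_R1.fq.gz", "data/set1/file2_R2.fq.gz",
    "data/another_set/asdf1_R1.fq.gz", "data/another_set/asdf1_R2.fq.gz",
]
targets = derive_targets(inputs)
print(targets)
```

Both reads of a pair collapse into a single sample because the names are collected in a set before the targets are built.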

Consolidate similar patterns into single consensus pattern

Question: In the previous post, I did not state my question clearly, so I would like to start a new topic here. I have the following items: a sorted list of 59,000 protein patterns (ranging from 3 characters, "FFK", to 152 characters long); some long protein sequences, aka my reference. I am going to match these patterns against my reference and find the location where each match is found. (My friend helped write a script for that.)

import sys
import re
from itertools import chain, izip
#
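The matching step described above (locating each pattern in the reference) can be sketched as follows. This is an illustrative Python 3 sketch, not the friend's script, which is cut off in the excerpt: the function `find_pattern_locations` and the example reference string are hypothetical. Note that `itertools.izip` exists only in Python 2; in Python 3 the built-in `zip` plays that role.

```python
import re


def find_pattern_locations(patterns, reference):
    """Map each pattern to the 0-based start positions where it occurs
    in reference, including overlapping occurrences."""
    hits = {}
    for pattern in patterns:
        # A zero-width lookahead lets finditer report overlapping matches;
        # re.escape treats the pattern as literal text.
        regex = re.compile(f"(?={re.escape(pattern)})")
        starts = [m.start() for m in regex.finditer(reference)]
        if starts:
            hits[pattern] = starts
    return hits


# Hypothetical reference; "FFK" is the only pattern quoted in the question.
reference = "MFFKAFFKFFKQ"
locations = find_pattern_locations(["FFK", "KAF", "XYZ"], reference)
print(locations)
```

Patterns without any match are left out of the result, so the dictionary only lists patterns that actually occur in the reference.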

How to remove duplicates in list of objects without __hash__

Question: I have a list of custom objects from which I want to remove the duplicates. Normally, you would do this by defining both __eq__ and __hash__ for your objects and then taking the set of the list of objects. I have defined __eq__, but I can't figure out a good way to implement __hash__ such that it returns the same value for objects that are equal. More specifically, I have a class that is derived from the Tree class from the ete3 toolkit. I have defined two objects to be equal if their
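When only __eq__ is available, duplicates can still be removed by comparing each object against the ones already kept. The sketch below shows that fallback; it is one possible approach, not necessarily the answer given on the original page. The class `Topology` is a hypothetical stand-in for the ete3 Tree subclass, and its equality rule is invented for illustration. The approach needs a quadratic number of comparisons, so it suits short lists better than long ones.

```python
def unique_by_eq(items):
    """Return items without duplicates, using only ==, keeping first occurrences."""
    unique = []
    for item in items:
        # Compare against every object kept so far: O(n^2) comparisons in total.
        if not any(item == seen for seen in unique):
            unique.append(item)
    return unique


class Topology:
    """Hypothetical stand-in for the Tree subclass in the question."""

    def __init__(self, label):
        self.label = label

    # Defining __eq__ without __hash__ makes instances unhashable in Python 3.
    def __eq__(self, other):
        return (isinstance(other, Topology)
                and self.label.lower() == other.label.lower())


trees = [Topology("ab"), Topology("AB"), Topology("cd"), Topology("Cd")]
deduped = unique_by_eq(trees)
print([t.label for t in deduped])
```

Calling `set(trees)` on these objects raises `TypeError`, which is why the list-based comparison is used here.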

How can I count the frequency of letters

Question: I have data like this:

>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
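Counting letters per record in FASTA-formatted text like the data above can be sketched with `collections.Counter`. This is a minimal sketch under assumptions: the function `letter_frequencies` is hypothetical, records are keyed by the first word of their header line, and the example records are shortened or made up rather than copied in full from the question.

```python
from collections import Counter


def letter_frequencies(fasta_text):
    """Return {record_id: Counter of residues} for FASTA-formatted text."""
    counts = {}
    record_id = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record identifier.
            record_id = line[1:].split()[0]
            counts[record_id] = Counter()
        elif record_id is not None:
            # Sequence lines may wrap, so keep adding to the same counter.
            counts[record_id].update(line)
    return counts


# Shortened, hypothetical records for illustration.
example = ">seq1 hypothetical record\nRNDDDDTSVC\nLGTR\n>seq2\nAAC\n"
freqs = letter_frequencies(example)
print(freqs["seq1"].most_common(3))
```

For a real file, the text can be read with `open(path).read()` and passed to the same function.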

How to find a query that you can find with FASTA but not with BLAST and vice versa?

Question: I need to find a sequence or sequences that should give results (hits) in FASTA but not in BLAST, or vice versa, and I am kinda lost. What should I look for while searching for such sequences?

Answer 1: When you say find a sequence by BLAST or FASTA, I assume you mean find a hit in the database? I think FASTA might be better at finding alignments between dissimilar sequences than BLAST, but BLAST is better at aligning similar sequences.

Source: https://stackoverflow.com/questions/26491285/how-to-find-a

randomForest Error: NA not permitted in predictors (but no NAs in data)

Question: So I am attempting to run the 'genie3' algorithm (ref: http://homepages.inf.ed.ac.uk/vhuynht/software.html) in R, which uses the 'randomForest' method. I am running into the following error:

> weight.matrix<-get.weight.matrix(tmpLog2FC, input.idx=1:4551)
Starting RF computations with 1000 trees/target gene, and 67 candidate input genes/tree node
Computing gene 1/11805
Error in randomForest.default(x, y, mtry = mtry, ntree = nb.trees, importance = TRUE, : NA not

Sequence Alignment Algorithm with a group of characters instead of one character

Question: Summary: I begin with some details about alignment algorithms and ask my question at the end. If you already know about alignment algorithms, skip the beginning. Consider two strings like:

ACCGAATCGA
ACCGGTATTAAC

There are algorithms, such as Smith-Waterman or Needleman–Wunsch, that align these two sequences and create a matrix. Take a look at the result in the following section:

Smith-Waterman Matrix

§    §   A   C   C   G   A   A   T   C   G   A
§    0   0   0   0   0   0   0   0   0   0   0
A    0   4   0   0   0   4   4   0   0   0   4
C    0   0  13   9   4
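For readers who want to see how such a matrix is filled, here is a minimal Smith-Waterman sketch in Python. It uses a simple scoring scheme (match 2, mismatch -1, linear gap -1) that is an assumption of this sketch, so its numbers will not reproduce the matrix in the question, which evidently uses different, unstated scores. The function name `smith_waterman_matrix` is hypothetical.

```python
def smith_waterman_matrix(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman score matrix for two sequences.

    Columns follow seq_a and rows follow seq_b; row 0 and column 0
    stay at zero. Returns the matrix and its best local score.
    """
    rows, cols = len(seq_b) + 1, len(seq_a) + 1
    scores = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_b[i - 1] == seq_a[j - 1]
            diagonal = scores[i - 1][j - 1] + (match if same else mismatch)
            up = scores[i - 1][j] + gap      # gap in seq_a
            left = scores[i][j - 1] + gap    # gap in seq_b
            # Local alignment: scores never drop below zero.
            scores[i][j] = max(0, diagonal, up, left)
            best = max(best, scores[i][j])
    return scores, best


matrix, best_score = smith_waterman_matrix("ACCGAATCGA", "ACCGGTATTAAC")
for row in matrix:
    print(" ".join(f"{value:2d}" for value in row))
print("best local score:", best_score)
```

Clamping each cell at zero is what distinguishes this from Needleman–Wunsch, which allows negative scores and aligns the sequences end to end.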