bioinformatics

running BLAST (bl2seq) without creating sequence files

喜夏-厌秋 submitted on 2019-12-07 05:30:13
Question: I have a script that performs BLAST queries (bl2seq). The script works like this:

    Get sequence a, sequence b
    Write sequence a to filea
    Write sequence b to fileb
    Run command 'bl2seq -i filea -j fileb -n blastn'
    Get output from STDOUT, parse
    Repeat 20 million times

The program bl2seq does not support piping. Is there any way to do this and avoid writing to and reading from the hard drive? I'm using Python, by the way.

Answer 1: How do you know bl2seq does not support piping? By the way, pipes are an OS feature, not
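One way to avoid on-disk files on a Unix-like system is to hand the program two anonymous pipes and refer to them by their /dev/fd paths. The sketch below is only an illustration: `run_on_pipes` and `build_argv` are hypothetical names, and it assumes the target program reads its inputs sequentially (without seeking), which has not been verified for bl2seq.

```python
import os
import subprocess
import threading


def _write_and_close(fd, text):
    """Write text to a pipe's write end, then close it so the reader sees EOF."""
    with os.fdopen(fd, "w") as handle:
        handle.write(text)


def run_on_pipes(build_argv, fasta_a, fasta_b):
    """Run a command whose two input 'files' are anonymous pipes.

    build_argv(path_a, path_b) must return the argument list; the paths
    are /dev/fd/N entries that only exist inside the child process.
    """
    read_a, write_a = os.pipe()
    read_b, write_b = os.pipe()
    writers = [
        threading.Thread(target=_write_and_close, args=(write_a, fasta_a)),
        threading.Thread(target=_write_and_close, args=(write_b, fasta_b)),
    ]
    for thread in writers:
        thread.start()
    try:
        argv = build_argv(f"/dev/fd/{read_a}", f"/dev/fd/{read_b}")
        # pass_fds keeps only the two read ends open in the child.
        proc = subprocess.run(argv, pass_fds=(read_a, read_b),
                              stdout=subprocess.PIPE, text=True, check=True)
    finally:
        os.close(read_a)
        os.close(read_b)
        for thread in writers:
            thread.join()
    return proc.stdout


# Hypothetical usage, with the flags exactly as quoted in the question
# (not verified here):
# out = run_on_pipes(
#     lambda a, b: ["bl2seq", "-i", a, "-j", b, "-n", "blastn"],
#     ">a\nACGT\n", ">b\nACGA\n")
```

If the program does need seekable input, writing the temporary files to a RAM-backed directory is a possible fallback, though that still goes through the file API.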

How to fix 'String index out of range' error

感情迁移 submitted on 2019-12-07 01:03:27
I am trying to write code that replaces repeating symbols in a string with the symbol and the number of its repeats (for example, "aaaaggggtt" --> "a4g4t2"), but I'm getting a "string index out of range" error.

    seq = input()
    i = 0
    j = 1
    v = 1
    while j<=len(seq)-1:
        if seq[i] == seq[j]:
            v += 1
            i += 1
            j += 1
        elif seq[i] != seq[j]:
            seq.replace(seq[i-v:j], seq[i] + str(v))
            v = 1
            i += 1
            j += 1
    print(seq)

    line 6, in <module>
        if seq[i] == seq[j]:
    IndexError: string index out of range

UPD: After changing len(seq) to len(seq)-1 there is no more string index error, but the code still doesn't work. Input: aaaaggggtt Output
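One detail worth noting about the code above: Python strings are immutable, so `seq.replace(...)` returns a new string and leaves `seq` unchanged unless the result is assigned. A minimal sketch of the same task that avoids index bookkeeping altogether, using `itertools.groupby` from the standard library (the function name `run_length_encode` is just illustrative):

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse each run of identical characters into the character plus its run length."""
    return "".join(
        f"{char}{sum(1 for _ in run)}"  # count the items in this run
        for char, run in groupby(seq)   # groupby yields consecutive equal characters
    )


print(run_length_encode("aaaaggggtt"))  # a4g4t2
```

Because `groupby` only groups consecutive items, "aabaa" becomes "a2b1a2" rather than merging the two runs of "a".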

Extract sample data from VCF files

时间秒杀一切 submitted on 2019-12-06 19:50:17
Question: I have a large Variant Call Format (VCF) file (> 4 GB) that contains data for several samples. I have searched Google and Stack Overflow, and tried the VariantAnnotation package in R, to extract the data for one particular sample only, but have not found any information on how to do that in R. Has anybody tried anything like that, or does anyone know of another package that would enable this?

Answer 1: In VariantAnnotation use a ScanVcfParam to specify the data that you'd like to extract. Using the sample
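Independently of the R route in the answer, the layout of a VCF file (tab-separated, nine fixed columns from CHROM to FORMAT, then one column per sample) makes it possible to stream the file and keep a single sample column. The following is a rough sketch of that idea rather than a validated tool; `extract_sample` and the file names in the usage comment are hypothetical.

```python
def extract_sample(vcf_lines, sample):
    """Yield VCF lines reduced to the nine fixed columns plus one sample column."""
    keep = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Columns 0-8 are CHROM..FORMAT; sample columns start at index 9.
            # list.index raises ValueError if the sample is not present.
            keep = list(range(9)) + [fields.index(sample, 9)]
        elif keep is None:
            raise ValueError("data line seen before the #CHROM header")
        yield "\t".join(fields[i] for i in keep)


# Hypothetical usage; lines are processed one at a time, so the whole
# file never has to fit in memory:
# with open("all_samples.vcf") as src, open("one_sample.vcf", "w") as dst:
#     for out_line in extract_sample(src, "SAMPLE_NAME"):
#         dst.write(out_line + "\n")
```

This sketch does not recompute INFO fields such as allele counts for the reduced sample set; dedicated VCF tools handle that.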

merge two data.frame with condition in R

折月煮酒 submitted on 2019-12-06 15:48:53
I would like to compare two data sets, df1 and df2, so that each unique value of df2$ID is added as a new column in df1, holding the df2$Xp value for each gene whose coordinates in df1 overlap the coordinates in df2:

    df1 <- read.table(text="
    Gene     chr  Start  End
    Gm12724  4    1000   1105
    Zfhx2    4    1254   1369
    Usp17lc  7    5004   5412
    Lingo1   7    5698   5789
    Sart3    7    5987   6041
    Olfr978  4    1452   1564
    ", header=T)

    df2 <- read.table(text="
    ID     chr  Start  End   Xp
    S8411  4    989    1258  0.312
    S8411  4    1300   1800  0.144
    S8411  7    5641   6874  0.136
    S8413  4    1307   1360  -1.999
    ", header=T)

Expected output:

    df3 <- read
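The core of this task is an interval-overlap test on matching chromosomes. The sketch below illustrates that logic in plain Python on the data from the question. It assumes closed intervals (two ranges overlap when each starts no later than the other ends), and because the expected output is cut off above, it simply collects every overlapping Xp value per ID instead of guessing how duplicates should be resolved. The names `overlaps` and `annotate` are hypothetical.

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True when two closed intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a


def annotate(genes, segments):
    """Map each gene to {segment ID: [Xp of every overlapping segment]}."""
    result = {}
    for gene, chrom, start, end in genes:
        hits = {}
        for seg_id, seg_chrom, seg_start, seg_end, xp in segments:
            if chrom == seg_chrom and overlaps(start, end, seg_start, seg_end):
                hits.setdefault(seg_id, []).append(xp)
        result[gene] = hits
    return result


df1 = [("Gm12724", 4, 1000, 1105), ("Zfhx2", 4, 1254, 1369),
       ("Usp17lc", 7, 5004, 5412), ("Lingo1", 7, 5698, 5789),
       ("Sart3", 7, 5987, 6041), ("Olfr978", 4, 1452, 1564)]
df2 = [("S8411", 4, 989, 1258, 0.312), ("S8411", 4, 1300, 1800, 0.144),
       ("S8411", 7, 5641, 6874, 0.136), ("S8413", 4, 1307, 1360, -1.999)]

for gene, hits in annotate(df1, df2).items():
    print(gene, hits)
```

Note that Zfhx2 overlaps two S8411 segments, so a one-value-per-cell table needs an explicit rule for that case.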

Python find longest ORF in DNA sequence

荒凉一梦 submitted on 2019-12-06 15:33:33
Question: Can someone show me a straightforward way to find the longest open reading frame (ORF) in a DNA sequence? ATG is the start codon (i.e., the beginning of an ORF) and TAG, TGA, and TAA are stop codons (i.e., the end of an ORF). Here's some code that produces errors (and uses an external module called BioPython):

    import sys
    from Bio import SeqIO
    currentCid = ''
    buffer = []
    for record in SeqIO.parse(open(sys.argv[1]),"fasta"):
        cid = str(record.description).split('.')[0][1:]
        if
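Using the start and stop codons stated in the question, a longest-ORF search can be sketched without any external module. This version makes several assumptions that the question does not settle: it scans only the three forward reading frames (not the reverse complement), it counts the stop codon as part of the ORF, and it ignores an ATG that never reaches an in-frame stop. The name `longest_orf` is illustrative.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def longest_orf(dna):
    """Return the longest ATG...stop stretch found in the three forward frames."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:pos + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


print(longest_orf("CCATGAAATAGGATGCCCGGGTTTTGA"))  # ATGCCCGGGTTTTGA
```

To cover the opposite strand as well, the same function could be applied to the reverse complement of the sequence.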

How to set a for-loop in R

天大地大妈咪最大 submitted on 2019-12-06 12:18:19
Question: I am a biologist and have little programming knowledge. I have a series of files (FASTA format) to which I need to apply an R package. Each file's contents are as follows:

    FILE_1.FASTA
    >>TTBK2_Hsap ,(CK1/TTBK)
    MSGGGEQLDILSVGILVKERWKVLRKIGGGGFGEIYDALDMLTRENVALKVESAQQPKQVLKMEVAVLKKLQGKDHVCRFIGCGRNDRFNYVVMQLQGRNLADLRRSQSRGTFT

    FILE_2.FASTA
    >>TTBK2_Hsap ,(CK1/TTBK)
    MSGGGEQLDILSVGILVKERWKVLRKIGGGGFGEIYDALDMLTRENVALKVESAQQPKQVLKMEVAVLKKLQGKDHVCRFIGCGRNDRFNYVVMQLQGRNLADLRRSQSRGTFT

and the package

How can I get taxonomic rank names from taxid?

痞子三分冷 submitted on 2019-12-06 12:08:01
Question: This question is related to: How to get taxonomic specific ids for kingdom, phylum, class, order, family, genus and species from taxid? The solution given there works, but I would like to have the names for the taxonomic IDs at defined ranks. I have found this in ete3, which can do the job:

    names = ncbi.get_taxid_translator(lineage)
    print [names[taxid] for taxid in lineage]

but, not being a Python programmer, I am failing to incorporate this into the code given in the link above. Here is what I
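A possible way to combine the pieces is to look up the lineage, the rank of each taxid, and the name of each taxid, then keep only the ranks of interest. The sketch below relies on three `NCBITaxa` methods from ete3 (`get_lineage`, `get_rank`, `get_taxid_translator`); the helper name `names_for_ranks` is hypothetical, and since the linked code is not shown here, this is not a drop-in patch for it.

```python
DESIRED_RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def names_for_ranks(ncbi, taxid, ranks=DESIRED_RANKS):
    """Return {rank: scientific name} for one taxid; missing ranks map to None."""
    lineage = ncbi.get_lineage(taxid)              # list of taxids from root to taxid
    rank_of = ncbi.get_rank(lineage)               # {taxid: rank}
    name_of = ncbi.get_taxid_translator(lineage)   # {taxid: name}
    found = {rank_of[t]: name_of.get(t) for t in lineage if t in rank_of}
    return {rank: found.get(rank) for rank in ranks}


# Hypothetical usage (needs ete3 and its local copy of the NCBI taxonomy):
# from ete3 import NCBITaxa
# ncbi = NCBITaxa()
# print(names_for_ranks(ncbi, 9606))
```

Passing the `ncbi` object in as a parameter keeps the helper independent of how and where the taxonomy database is set up. The rank labels are those listed in the question; the labels actually present depend on the taxonomy version in use.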

Find, replace, and increment at each occurrence of string

北城余情 submitted on 2019-12-06 11:59:04
Question: I'm relatively new to scripting and apologize in advance for this painfully simple problem. I believe I've searched pretty thoroughly, but apparently no other answers or cookbooks have been explicit enough for me to understand (like here, where I still couldn't get it). I have a file that is made up of strings of letters (DNA, if you care), one string per line. Above each string I've inserted another line to identify the underlying string. For those of you who are bioinformaticians, I'm trying to
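The question is cut off before the exact goal is stated, so the following is only a guess at the pattern named in the title: replace every occurrence of a fixed string with that string followed by a running number. The identifier text ">seq" and the function name `number_occurrences` are hypothetical placeholders.

```python
import re
from itertools import count


def number_occurrences(text, target, start=1):
    """Append an incrementing number to each occurrence of target in text."""
    counter = count(start)
    # re.sub calls the function once per match, left to right,
    # so each match receives the next number from the counter.
    return re.sub(re.escape(target),
                  lambda match: f"{match.group(0)}{next(counter)}",
                  text)


print(number_occurrences(">seq\nACGT\n>seq\nTTGA\n", ">seq"))
```

`re.escape` ensures the target is matched literally even if it contains characters that are special in regular expressions.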

Is it possible to install bioconductor package 'rain' in R Jupyter notebook?

与世无争的帅哥 submitted on 2019-12-06 10:41:30
I want to install the Bioconductor rain package for R in a Jupyter notebook. I am not able to install this package by following the instructions given on the website linked above. In an R Jupyter notebook:

    source("https://bioconductor.org/biocLite.R")
    biocLite("rain")

I get the following error:

    Warning message:
    In install.packages(pkgs = doing, lib = lib, ...): installation of package ‘gmp’ had non-zero exit status
    Warning message:
    In install.packages(pkgs = doing, lib = lib, ...): installation of package ‘rain’ had non-zero exit status

I was able to install a different Bioconductor

Grouping ecological data in R

前提是你 submitted on 2019-12-06 08:36:49
Question: I'm looking at some ecological (diet) data and trying to work out how to group it by predator. I would like to extract the data so that I can look at the weights of the individual prey of each species for each predator, i.e. work out the mean weight of each species eaten by, e.g., Predator 117. I've put a sample of my data below.

      Predator  PreySpecies  PreyWeight
    1 114       10           4.2035496
    2 114       10           1.6307026
    3 115       1            407.7279775
    4 115       1            255.5430495
    5 117       10           4.2503708
    6 117       10           3.6268814
    7 117       10           6
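The grouping described here amounts to a mean per (Predator, PreySpecies) pair. A small sketch of that computation in plain Python, using only the six complete rows of the sample (row 7 is cut off above and is left out); the function name `mean_weight_by_group` is illustrative:

```python
from collections import defaultdict


def mean_weight_by_group(rows):
    """Return {(predator, prey_species): mean prey weight}."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of weights, row count]
    for predator, species, weight in rows:
        entry = totals[(predator, species)]
        entry[0] += weight
        entry[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


rows = [
    (114, 10, 4.2035496), (114, 10, 1.6307026),
    (115, 1, 407.7279775), (115, 1, 255.5430495),
    (117, 10, 4.2503708), (117, 10, 3.6268814),
]

for (predator, species), mean in sorted(mean_weight_by_group(rows).items()):
    print(predator, species, round(mean, 4))
```

Keying on the pair rather than on the predator alone keeps the species of one predator separate, which is what "mean weight of each species eaten by Predator 117" calls for.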