bioinformatics

Snakemake: Error when trying to generate multiple output files

£可爱£侵袭症+ posted on 2019-12-08 13:01:48
Question: I'm writing a Snakemake pipeline that takes publicly available SRA files, converts them to FASTQ files, then runs them through alignment, peak calling and LD score regression. I'm having an issue in the rule called SRA2fastq below, in which I use parallel-fastq-dump to convert SRA files to paired-end FASTQ files. This rule generates two outputs for each SRA file, SRRXXXXXXX_1 and SRRXXXXXXX_2. Here is my config file:

samples:
  fullard2018_NpfcATAC_1: SRR5367824
  fullard2018_NpfcATAC_2: SRR5367798

Comparing one column value to all columns in Linux environment

丶灬走出姿态 posted on 2019-12-08 09:45:39
Question: So I have two files: one VCF that looks like

88 Chr1 25 C - 3 2 1 1
88 Chr1 88 A T 7 2 1 1
88 Chr1 92 A C 16 4 1 1

and another with genes that looks like

GENEID  Start END
GENE_ID 11    155
GENE_ID 165   999

I want a script that checks whether a position (3rd column of the VCF file) falls within the range given by the second and third columns of the second file, and prints it if so. What I did so far was to join the files and run

awk '{if (3>$12 && $3< $13) print }' > out

What I did only compares current
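The range check described above can be sketched in Python without joining the files first. This is a minimal sketch, not the asker's method: the function name `positions_in_genes` is made up for illustration, the sample rows are copied from the excerpt, and the bounds are treated as inclusive, which is an assumption (the awk attempt uses strict comparisons).

```python
def positions_in_genes(variants, genes):
    """Yield (variant, gene) pairs where the variant position lies inside the gene range."""
    for variant in variants:
        pos = int(variant[2])  # 3rd column of the VCF-like file
        for gene in genes:
            start, end = int(gene[1]), int(gene[2])  # 2nd and 3rd columns of the gene file
            if start <= pos <= end:  # inclusive bounds: an assumption
                yield variant, gene


# Sample rows copied from the question excerpt (gene header line omitted).
vcf_text = """88 Chr1 25 C - 3 2 1 1
88 Chr1 88 A T 7 2 1 1
88 Chr1 92 A C 16 4 1 1"""
gene_text = """GENE_ID 11 155
GENE_ID 165 999"""

variants = [line.split() for line in vcf_text.splitlines()]
genes = [line.split() for line in gene_text.splitlines()]

hits = list(positions_in_genes(variants, genes))
for variant, gene in hits:
    print(" ".join(variant), "|", " ".join(gene))
```

Every variant is compared against every gene row, which is what joining the files line by line does not give you. For large inputs a sorted or interval-based lookup would likely be needed instead of the nested loop.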

(BioPython) How do I stop MemoryError: Out of Memory exception?

我是研究僧i posted on 2019-12-08 09:41:42
Question: I have a program that takes a pair of very large multiple-sequence files (>77,000 sequences each, averaging about 1000 bp long), calculates the alignment score between each pair of corresponding elements, and writes that number into an output file (which I will load into Excel later). My code works for small multiple-sequence files, but my large master file throws the following traceback after analyzing the 16th pair.

Traceback (most recent call last): File "C:\Users\Harry\Documents

Unable to parse just sequences from FASTA file

二次信任 posted on 2019-12-08 07:04:22
Question: How can I remove IDs like '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n' from sequences? I have this code:

    with open('sequence.fasta', 'r') as f:
        while True:
            line1 = f.readline()
            line2 = f.readline()
            line3 = f.readline()
            if not line3:
                break
            fct([line1[i:i+100] for i in range(0, len(line1), 100)])
            fct([line2[i:i+100] for i in range(0, len(line2), 100)])
            fct([line3[i:i+100] for i in range(0, len(line3), 100)])

Output: ['>gi|2765658|emb|Z78533.1|CIZ78533 C
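One way to keep only the sequences is to skip every line that starts with '>' and join the remaining lines of each record. The sketch below is an illustration under assumptions: `read_sequences` is a made-up helper name, and the sequence letters and the second header in the example are invented, since the excerpt does not show the file contents.

```python
import io


def read_sequences(handle):
    """Return the sequences from a FASTA handle, dropping the '>' header lines."""
    sequences = []
    current = []  # lines of the record being read
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A header starts a new record: store the previous one, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


# Made-up sequence data; only the first header comes from the question.
example = io.StringIO(
    ">gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n"
    "ACGTACGT\n"
    "TTGA\n"
    ">second_record (made-up header)\n"
    "GGCC\n"
)
sequences = read_sequences(example)
print(sequences)
```

With a real file, `open('sequence.fasta')` would replace the `io.StringIO` object. If Biopython is available, `SeqIO.parse(handle, "fasta")` yields records whose `.seq` attribute holds the sequence without the header.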

merge two data.frame with condition in R

[亡魂溺海] posted on 2019-12-08 06:51:17
Question: I would like to compare two data sets, df1 and df2, in such a way that the unique values in df2$ID are added as a new column in df1, and the df2$Xp value is assigned to each gene if the coordinates of df1 overlap with the coordinates of df2:

    df1 <- read.table(text="
    Gene chr Start End
    Gm12724 4 1000 1105
    Zfhx2 4 1254 1369
    Usp17lc 7 5004 5412
    Lingo1 7 5698 5789
    Sart3 7 5987 6041
    Olfr978 4 1452 1564
    ", header=T)
    df2 <- read.table(text="
    ID chr Start End Xp
    S8411 4 989 1258 0.312
    S8411 4 1300

Only call function if PyMOL running

百般思念 posted on 2019-12-08 05:40:05
Question: I have a script that performs some calculations on a protein. When it's finished, a method imports the pymol module and uses the pymol.cmd API to display results in a PyMOL session. The process is something akin to the following:

    def display_results(results, protein_fn):
        import pymol
        pymol.cmd.load(protein_fn)
        pymol.cmd.alter(...)
        ...

    protein_fn = "1abc.ent"
    results = analyze_protein(protein_fn)
    display_results(results, protein_fn)

However, my script doesn't necessarily need to display the

How to fix 'String index out of range' error

情到浓时终转凉″ posted on 2019-12-08 05:09:49
Question: I am trying to write code that replaces repeating symbols in a string with the symbol and the number of its repeats (like this: "aaaaggggtt" --> "a4g4t2"). But I'm getting a "string index out of range" error.

    seq = input()
    i = 0
    j = 1
    v = 1
    while j<=len(seq)-1:
        if seq[i] == seq[j]:
            v += 1
            i += 1
            j += 1
        elif seq[i] != seq[j]:
            seq.replace(seq[i-v:j], seq[i] + str(v))
            v = 1
            i += 1
            j += 1
    print(seq)

line 6, in if seq[i] == seq[j]: IndexError: string index out of range

UPD: After changing len(seq) to
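The transformation described above is run-length encoding. Below is a minimal sketch using `itertools.groupby` from the standard library. It is a swapped-in technique that avoids manual index bookkeeping, not a repair of the loop in the question, and the function name `run_length_encode` is made up for illustration. Note that Python strings are immutable, so `seq.replace(...)` returns a new string and leaves `seq` unchanged unless the result is assigned.

```python
from itertools import groupby


def run_length_encode(seq):
    """Replace each run of identical symbols with the symbol followed by the run length."""
    # groupby yields (symbol, iterator over the consecutive run of that symbol)
    return "".join(f"{symbol}{len(list(run))}" for symbol, run in groupby(seq))


encoded = run_length_encode("aaaaggggtt")
print(encoded)  # a4g4t2
```

Because no indices are compared, there is no position at which the loop can step past the end of the string.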

Is it possible to install bioconductor package 'rain' in R Jupyter notebook?

左心房为你撑大大i posted on 2019-12-08 00:51:58
Question: I want to install the Bioconductor rain package for R in a Jupyter notebook. I am not able to install this package following the instructions given on the website linked above. In an R Jupyter notebook:

    source("https://bioconductor.org/biocLite.R")
    biocLite("rain")

I get the following error:

Warning message: In install.packages(pkgs = doing, lib = lib, ...): installation of package ‘gmp’ had non-zero exit status
Warning message: In install.packages(pkgs = doing, lib = lib, ...):

R: How to change the column names in a data frame based on a specification

一曲冷凌霜 posted on 2019-12-07 13:32:18
Question: I have a data frame; the start of it is below:

                   SM_H1455 SM_V1456 SM_K1457 SM_X1461 SM_K1462
ENSG00000000419.8       290      270      314      364      240
ENSG00000000457.8       252      230      242      220      106
ENSG00000000460.11      154      158      162      136       64
ENSG00000000938.7     20106    18664    19764    15640    19024
ENSG00000000971.11       30       10        4        2       10

Note that there are many more columns and rows. Here's what I want to do: I want to change the names of the columns. The most important information in a column's name, e.g. SM_H1455, is the 4th character of the

How do I decide which way to backtrack in the Smith–Waterman algorithm?

倖福魔咒の posted on 2019-12-07 09:35:31
Question: I am trying to implement local sequence alignment in Python using the Smith–Waterman algorithm. Here's what I have so far. It gets as far as building the similarity matrix:

    import sys, string
    from numpy import *

    f1=open(sys.argv[1], 'r')
    seq1=f1.readline()
    f1.close()
    seq1=string.strip(seq1)

    f2=open(sys.argv[2], 'r')
    seq2=f2.readline()
    f2.close()
    seq2=string.strip(seq2)

    a,b =len(seq1),len(seq2)
    penalty=-1; point=2;

    #generation of matrix for local alignment
    p=zeros((a+1,b+1))
    # table
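A common way to choose the backtracking direction is to start at the highest-scoring cell and, at each step, move to the neighbour (diagonal, up, or left) whose score plus the step cost reproduces the current cell, stopping when a cell with score 0 is reached. The sketch below is an illustration under assumptions, not the asker's code. It uses plain lists instead of numpy, reuses `point=2` and `penalty=-1` from the excerpt, assumes the same linear penalty for mismatches and gaps, and the function name `smith_waterman` is made up.

```python
def smith_waterman(seq1, seq2, point=2, penalty=-1):
    """Return (best score, aligned seq1 fragment, aligned seq2 fragment)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; local alignment never lets a cell drop below 0.
    for i in range(1, rows):
        for j in range(1, cols):
            step = point if seq1[i - 1] == seq2[j - 1] else penalty
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + step,   # match or mismatch
                score[i - 1][j] + penalty,    # gap in seq2
                score[i][j - 1] + penalty,    # gap in seq1
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback: follow the neighbour that explains the current score.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = point if seq1[i - 1] == seq2[j - 1] else penalty
        if score[i][j] == score[i - 1][j - 1] + step:
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + penalty:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


result = smith_waterman("TTACGTTT", "GGACGGG")
print(result)
```

When several neighbours explain the same score, this sketch prefers the diagonal, then up, then left. That tie-breaking order is a design choice; other orders give equally valid alignments with the same score.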