bioinformatics

Python: Multiple Consensus sequences

拟墨画扇 posted on 2019-12-11 04:42:35
Question: Starting from a list of DNA sequences, I need to return all possible consensus sequences (a consensus being the sequence built from the most frequent nucleotide at each position). If several nucleotides tie for the highest frequency at a position, I need every combination of the tied nucleotides. I also need to return the profile matrix (a matrix with the frequency of each nucleotide at each position). This is my code so far (but it returns only one consensus
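The tie-handling described in the question can be sketched as follows. This is a minimal illustration, not the asker's code (which is cut off above): it assumes equal-length, uppercase sequences over A/C/G/T, and the function names `profile_matrix` and `all_consensus` are hypothetical.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"  # assumption: uppercase DNA alphabet only


def profile_matrix(seqs):
    """Return {nucleotide: [count at position 0, count at position 1, ...]}."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    # zip(*seqs) yields one tuple of characters per alignment column
    columns = [Counter(col) for col in zip(*seqs)]
    return {nt: [col[nt] for col in columns] for nt in NUCLEOTIDES}


def all_consensus(seqs):
    """Return every consensus sequence, expanding ties at each position."""
    profile = profile_matrix(seqs)
    choices = []
    for i in range(len(seqs[0])):
        best = max(profile[nt][i] for nt in NUCLEOTIDES)
        # keep every nucleotide that reaches the top count in this column
        choices.append([nt for nt in NUCLEOTIDES if profile[nt][i] == best])
    # Cartesian product over the per-position candidates
    return ["".join(combo) for combo in product(*choices)]


seqs = ["ATGC", "ATGG", "TTCC", "AACG"]
print(profile_matrix(seqs))
print(all_consensus(seqs))  # positions 3 and 4 each have a C/G tie
```

Collecting the candidates per position first and only then taking the product keeps the tie logic separate from the counting, so the same profile can be reused for both outputs.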

Python script that should translate 1000 DNA sequences to proteins using 1152 different codon tables doesn't work

怎甘沉沦 posted on 2019-12-11 04:33:05
Question: I'm working on a bioinformatics project for my diploma thesis. I have written a Python script that should translate a list of 1000 DNA sequences (strings) into proteins using 1152 different codon tables (genetic codes). The codon tables are stored in a list of dictionaries, which were produced by shuffling the keys and values (codons and amino acids). I want the script to translate the 1000 sequences in 1152 ways in a single run, in the IPython console or just in Python 3.6 IDLE. This script
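Translating every sequence with every table can be sketched as a nested loop over tables and sequences. The example below is only an illustration under stated assumptions: the five-codon table is a toy stand-in for a full 64-codon table, the shuffling in `shuffled_tables` is a guess at how the asker's tables were built, and both function names are hypothetical.

```python
import random


def translate(seq, codon_table):
    """Translate codon by codon from position 0; trailing 1-2 bases are ignored."""
    usable = len(seq) - len(seq) % 3
    return "".join(codon_table[seq[i:i + 3]] for i in range(0, usable, 3))


def shuffled_tables(base_table, n, seed=0):
    """Build n tables by reassigning the amino acids to the codons at random."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    codons = list(base_table)
    tables = []
    for _ in range(n):
        amino_acids = list(base_table.values())
        rng.shuffle(amino_acids)
        tables.append(dict(zip(codons, amino_acids)))
    return tables


# Toy table (assumption): a real run would use all 64 codons.
base = {"ATG": "M", "AAA": "K", "CCA": "P", "AAG": "K", "TAA": "*"}
sequences = ["ATGAAACCAAAG", "ATGCCATAA"]
tables = [base] + shuffled_tables(base, 3)

# results[t][s] is sequence s translated with table t
results = [[translate(seq, table) for seq in sequences] for table in tables]
print(results[0])
```

With 1000 sequences and 1152 tables the same comprehension produces a 1152 x 1000 grid of protein strings in one pass.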

Can a Snakemake input rule be defined with different paths/wildcards?

江枫思渺然 posted on 2019-12-11 04:08:54
Question: I want to know whether one can define an input rule that depends on different wildcards. To elaborate, I am running this Snakemake pipeline on different fastq files using qsub, which submits each job to a different node:
1. fastqc on original fastq - no downstream dependency on other jobs
2. adapter/quality trimming to generate trimmed fastq
3. fastqc_after on trimmed fastq (output from step 2) and no downstream dependency
4. star-rsem pipeline on trimmed fastq (output from step 2 above)
5. rsem and

Convert text file to plink PED and MAP format

和自甴很熟 posted on 2019-12-11 03:45:59
Question: I have the following data (a small part of it), in a file named "short2_pre_snp_tumor.txt":

    rs987435 C G 1 1 1 0 2
    rs345783 C G 0 0 1 0 0
    rs955894 G T 1 1 2 2 1
    rs6088791 A G 1 2 0 0 1
    rs11180435 C T 1 0 1 1 1
    rs17571465 A T 1 2 2 2 2
    rs17011450 C T 2 2 2 2 2
    rs6919430 A C 2 1 2 2 2
    rs2342723 C T 0 2 0 0 0
    rs11992567 C T 2 2 2 2 2

and I need to generate the PED and MAP files using Python, because R is really slow on large datasets. I have the following code in R:

    tm <- proc.time()
    d<-read.table("short2_pre_snp
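One way to approach the conversion in Python is sketched below. It rests on several assumptions the snippet does not confirm: each genotype code counts copies of the second listed allele (0 = homozygous first allele, 1 = heterozygous, 2 = homozygous second allele), chromosome and position are unknown and written as 0, and the family/individual IDs are placeholders. The function name `to_ped_map` is hypothetical. Only the general PED layout (family ID, individual ID, father, mother, sex, phenotype, then two alleles per SNP) and MAP layout (chromosome, SNP ID, genetic distance, position) follow the PLINK text formats.

```python
def to_ped_map(lines):
    """Turn 'rsid allele1 allele2 code code ...' rows into PED and MAP rows."""
    snp_ids, per_snp = [], []
    for line in lines:
        rsid, a1, a2, *codes = line.split()
        # ASSUMPTION: the code counts copies of the second allele
        pairs = {"0": (a1, a1), "1": (a1, a2), "2": (a2, a2)}
        snp_ids.append(rsid)
        per_snp.append([pairs[code] for code in codes])

    # MAP: chromosome, SNP id, genetic distance, bp position (0 = unknown here)
    map_rows = [f"0 {rsid} 0 0" for rsid in snp_ids]

    # PED: one row per sample, so transpose SNP-major data to sample-major
    ped_rows = []
    for i, genotypes in enumerate(zip(*per_snp), start=1):
        alleles = " ".join(f"{x} {y}" for x, y in genotypes)
        # placeholder IDs; parents 0, sex 0 (unknown), phenotype -9 (missing)
        ped_rows.append(f"FAM{i} IND{i} 0 0 0 -9 {alleles}")
    return ped_rows, map_rows


rows = ["rs987435 C G 1 1 1 0 2", "rs345783 C G 0 0 1 0 0"]
ped, map_ = to_ped_map(rows)
print("\n".join(ped))
print("\n".join(map_))
```

The transpose with `zip(*per_snp)` is the key step: the input is one SNP per row, while PED expects one individual per row.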

count the number of a certain triplet in a file (DNA codon analysis)

99封情书 posted on 2019-12-11 03:22:55
Question: This question is about DNA codon analysis. To put it simply, let's say I have a file like this: atgaaaccaaag... and I want to count the number of 'aaa' triplets in it. Importantly, the triplets are read from the very beginning (that is, atg, aaa, cca, aag, ...), so the result for this example should be 1 'aaa' rather than 2. Is there a Python or shell-script method to do this? Thanks!
Answer 1: First read in the file:

    with open("some.txt") as f:
        file_data = f.read()

then
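The in-frame counting the question asks for can be sketched as below. This is an independent illustration rather than the continuation of the truncated answer; the function name `count_codon` is hypothetical, and it assumes the reading frame starts at the first character once whitespace is removed.

```python
def count_codon(seq, codon="aaa"):
    """Count occurrences of codon in reading frame 0 (positions 0, 3, 6, ...)."""
    seq = "".join(seq.split())  # drop newlines and other whitespace
    return sum(1 for i in range(0, len(seq) - 2, 3) if seq[i:i + 3] == codon)


data = "atgaaaccaaag"
print(count_codon(data))   # in-frame matches only
print(data.count("aaa"))   # plain substring count, which ignores the frame
```

Stepping through the string in strides of three is what separates this from `str.count`, which would also match the out-of-frame 'aaa' spanning cca/aag.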

How to make the bash script work with one command after another?

走远了吗. posted on 2019-12-11 01:08:35
Question: I have a bash script like the one below. First it takes sorted.bam files as input and uses the "stringtie" tool to produce a GTF for each sample. The path of each sample GTF is then written to mergelist.txt, and "stringtie merge" is run on them to get "stringtie_merged.gtf". I have 40 sorted.bam files in total.

    for sample in /path/*.sorted.bam
    do
    dir="/pathto/hisat2_output"
    dir2="/pathto/folder"
    base=`basename $sample '.sorted.bam'`
    "stringtie -p 8 -G gencode.v27.primary_assembly.annotation_nochr.gtf

Error in creating a volcano plot in MATLAB

末鹿安然 posted on 2019-12-10 23:54:09
Question: I am a complete newbie to MATLAB, and my first task is to create a volcano plot. I have been using the documentation to understand it and get started. I tried to run it on dummy values:

    a=[1 2 3]
    b=[4.6 2.7 4.5]
    c=[0.05 0.33 0.45]

And then I ran:

    SigStructure = mavolcanoplot(a, b, c)

My understanding is that a represents the gene expression values for condition 1, b those for condition 2, and c is the list of p-values for the 3 data points in a and b. However, running this code gives

R: How to Parallelize multi-panel plotting with lattice in R 3.2.1?

十年热恋 posted on 2019-12-10 22:18:48
Question: I am new to R programming and want to know how to plot 12 trellis objects made with the lattice package in parallel. Basically, after a lot of pre-processing steps, I have the following commands:

    plot(adhd_plot, split = c(1,1,4,3)) #plot adhd trellis object at 1,1 in a grid of 4 by 3 i.e 4 COLUMNS x 3 ROWS
    plot(bpd_plot, split = c(2,1,4,3), newpage = F) #plot bpd trellis object in 2nd Column in a grid of 4colx3row
    plot(bmi_plot, split = c(3,1,4,3), newpage = F)
    plot(dbp_plot, split = c(4

How to merge two pandas dataframes (or transfer values) by comparing ranges of values

点点圈 posted on 2019-12-10 20:50:03
Question: In the following data:

data01 =

    contig  start   end     haplotype_block
    2       5207    5867    1856
    2       155667  155670  2816
    2       67910   68022   2
    2       68464   68483   3
    2       525     775     132
    2       118938  119559  1157

data02 =

    contig  start  last  feature  gene_id            gene_name  transcript_id
    2       5262   5496  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       5579   5750  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       5856   6032  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       6115   6198  exon     scaffold_200003.1  CP5        scaffold_200003.1
    2       916    1201  exon     scaffold_200001
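Since the question is cut off before it states the matching rule, the sketch below assumes one plausible reading: a feature in data02 belongs to a haplotype block in data01 when the two lie on the same contig and their ranges overlap. It uses only the complete rows shown above, drops the gene_id and transcript_id columns for brevity, and relies on the real pandas calls `DataFrame.merge` and boolean indexing.

```python
import pandas as pd

data01 = pd.DataFrame({
    "contig": [2, 2, 2],
    "start": [5207, 155667, 525],
    "end": [5867, 155670, 775],
    "haplotype_block": [1856, 2816, 132],
})
data02 = pd.DataFrame({
    "contig": [2, 2, 2, 2],
    "start": [5262, 5579, 5856, 6115],
    "last": [5496, 5750, 6032, 6198],
    "feature": ["exon"] * 4,
    "gene_name": ["CP5"] * 4,
})

# Pair every block with every feature on the same contig; the shared column
# name "start" gets the suffixes, "end" and "last" keep their names.
pairs = data01.merge(data02, on="contig", suffixes=("_block", "_feature"))

# ASSUMED rule: keep pairs whose ranges overlap
# (feature starts before the block ends, and ends after the block starts).
overlaps = pairs[
    (pairs["start_feature"] <= pairs["end"])
    & (pairs["last"] >= pairs["start_block"])
]
print(overlaps[["haplotype_block", "start_feature", "last", "gene_name"]])
```

The merge builds every block/feature pair per contig, which is fine for small tables but grows quickly; for large inputs an interval-based lookup would be a better fit.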

Snakemake: rule for using many inputs for one output with multiple sub-groups

核能气质少年 posted on 2019-12-10 19:25:33
Question: I have a working pipeline that I use for downloading, aligning, and performing variant calling on public sequencing data. The problem is that it currently works only on a per-sample basis (i.e. with each individual sequencing experiment treated as a sample). It doesn't work if I want to perform variant calling on a group of experiments (such as biological and/or technical replicates of a sample). I've tried to solve it, but I couldn't get it to work. Here's a simplification of the alignment rule: rule