bioinformatics

Can I use the K-means algorithm on a string?

泪湿孤枕 submitted on 2019-12-02 20:48:16
I am working on a Python project where I study RNA structure evolution (represented as a string, for example "(((...)))", where the parentheses represent base pairs). The point is that I have an ideal structure and a population that evolves towards that ideal structure. I have implemented everything, but I would like to add a feature where I can get the "number of buckets", i.e. the k most representative structures in the population at each generation. I was thinking of using the k-means algorithm, but I am not sure how to use it with strings. I found scipy.cluster.vq but I don't know how to
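Not taken from the thread, but one common workaround sketched below: classical k-means needs numeric vectors, so equal-length dot-bracket strings can be one-hot encoded per position and then clustered with scipy.cluster.vq.kmeans2 (an alternative is k-medoids on an edit-distance matrix). The encode helper and the toy population are made up for illustration.

```python
# Hedged sketch: one-hot encode equal-length dot-bracket strings, then run k-means.
import numpy as np
from scipy.cluster.vq import kmeans2

ALPHABET = "()."  # dot-bracket symbols

def encode(structure):
    """Flatten a one-hot encoding of each position into a numeric vector."""
    vec = np.zeros((len(structure), len(ALPHABET)))
    for i, ch in enumerate(structure):
        vec[i, ALPHABET.index(ch)] = 1.0
    return vec.ravel()

population = ["(((...)))", "((.....))", "(((...)))", "(.......)"]
data = np.array([encode(s) for s in population])

k = 2  # number of "buckets"
centroids, labels = kmeans2(data, k, minit="++")
print(labels)  # cluster assignment of each structure; the member closest to each
               # centroid can serve as that bucket's representative structure
```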

How to use a list in a Snakemake tabular configuration to describe sequencing units for a bioinformatics pipeline

橙三吉。 submitted on 2019-12-02 04:29:00
Question: How do I use a list in a Snakemake tabular config? I use a Snakemake tabular configuration (mapping with BWA mem) to describe my sequencing units (libraries sequenced on separate lanes). At the next stage of the analysis I have to merge the sequencing units (mapped .bam files) and obtain merged .bam files (one for each sample). At the moment I use a YAML config to describe which units belong to which samples, but I would like to use the tabular config for this purpose; I am not clear how to write and recall a list
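A hedged sketch of one way to do it (not necessarily the accepted answer): read the tabular sheet with pandas inside the Snakefile and derive the per-sample list of units from it. The file name units.tsv, its columns (sample, unit), and the samtools merge step are assumptions for illustration.

```python
# Snakefile sketch: units.tsv is assumed to have columns "sample" and "unit".
import pandas as pd

units = pd.read_csv("units.tsv", sep="\t", dtype=str)

def units_of(sample):
    """List of sequencing units belonging to one sample."""
    return units.loc[units["sample"] == sample, "unit"].tolist()

rule merge_units:
    input:
        lambda wc: expand("mapped/{sample}.{unit}.bam",
                          sample=wc.sample, unit=units_of(wc.sample))
    output:
        "merged/{sample}.bam"
    shell:
        "samtools merge {output} {input}"
```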

Find point-to-range overlaps

不羁岁月 submitted on 2019-12-02 03:05:29
Question: I have a dataframe df1: df1 <- read.table(text=" Chr06 79641 Chr06 82862 Chr06 387314 Chr06 656098 Chr06 678491 Chr06 1018696", header=FALSE, stringsAsFactors=FALSE). I would like to check whether each row in df1 falls within a range in df2: column 2 of df2 is the start of a range and column 3 is the end of a range, and there is no overlap between ranges (between rows). The data in df2 are sorted by column 1 and column 2. I wrote a loop for this, but I am not happy with it because it runs for a long time if I have a few thousand rows in df1. I would like to find a more efficient way to do this job (better no
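For illustration only (in Python/pandas rather than the R asked about, and not taken from the thread): the same point-in-range test can be expressed without an explicit loop by building an IntervalIndex over the non-overlapping ranges and asking, for each position, which interval contains it. The toy data below are made up, and matching on the chromosome column is omitted for brevity.

```python
import pandas as pd

# Toy versions of df1 (points) and df2 (non-overlapping ranges).
df1 = pd.DataFrame({"chrom": ["Chr06"] * 3, "pos": [79641, 387314, 1018696]})
df2 = pd.DataFrame({"chrom": ["Chr06", "Chr06"],
                    "start": [70000, 380000],
                    "end":   [90000, 400000]})

ranges = pd.IntervalIndex.from_arrays(df2["start"], df2["end"], closed="both")
df1["range_idx"] = ranges.get_indexer(df1["pos"])  # -1 means not inside any range
print(df1)
```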

Bash: replace part of filename

雨燕双飞 submitted on 2019-12-02 01:31:58
I have a command I want to run on all of the files in a folder, and the command's syntax looks like this: tophat -o <output_file> <input_file>. What I would like is a script that loops over all the files in an arbitrary folder and uses each input file name to build a similar, but different, output file name. The file names look like this (input name, desired output name): path/to/sample1.fastq -> path/to/sample1.bam, path/to/sample2.fastq -> path/to/sample2.bam. Getting the input to work seems simple enough: for f in *.fastq do tophat -o <output_file> $f done. I tried using output=${f,.fastq,
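The question is about bash, but as an illustration of the same rename-and-run logic, here is a Python sketch: each output name is the input name with the .fastq suffix swapped for .bam, following the tophat -o <output_file> <input_file> form given in the question. The working directory "." is an assumption.

```python
from pathlib import Path
import subprocess

for fastq in Path(".").glob("*.fastq"):
    bam = fastq.with_suffix(".bam")  # path/to/sample1.fastq -> path/to/sample1.bam
    subprocess.run(["tophat", "-o", str(bam), str(fastq)], check=True)
```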

scanf_s warning? Skips user inputs (topics: Runge-Kutta, epidemic simulation)

狂风中的少年 submitted on 2019-12-01 23:22:44
This is my first post and I have to admit, I am terrible at programming. I am that guy in the class who works his tail off but can never seem to grasp programming as well as the rest of my classmates, so please be nice; I will try to explain my problem below. I have the following code (comments removed), but when I run it I get a warning similar to the one listed below. Also, when I run the program, the first user-entered value is accepted, but then all of a sudden it jumps to the end of the program, not allowing me to input the values for the other variables (e.g. the variable "beta"). I

Reading a file block by block using a specified delimiter in Python

放肆的年华 submitted on 2019-12-01 18:36:50
I have an input_file.fa file like this (FASTA format): >header1 description data data data >header2 description more data data data. I want to read the file in one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1: >header1 description data data data. Of course I could just read the whole file in like this and split it: with open("1.fa") as f: for block in f.read().split(">"): pass. But I want to avoid reading the whole file into memory, because the files are often large. I can of course read the file line by line: with open("input_file.fa") as f:
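A minimal sketch of the streaming approach (assuming plain FASTA where every header line starts with ">"): a generator that accumulates lines until the next header and yields one block at a time, so the whole file never sits in memory.

```python
def fasta_blocks(path):
    """Yield (header, sequence) pairs one block at a time."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

for header, sequence in fasta_blocks("input_file.fa"):
    print(header, len(sequence))
```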

Extract sequences from a multi-FASTA file by IDs listed in a file, using awk

流过昼夜 submitted on 2019-12-01 15:50:18
Question: I would like to extract, from a multi-FASTA file, the sequences whose IDs match a separate list of IDs. FASTA file seq.fasta: >7P58X:01332:11636 TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT CAAGTCCCTGTTCGGGCGCC >7P58X:01334:11605 TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC CCTGTTCGGGCGCCACTGCTAG >7P58X:01334:11613 ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC >7P58X:01334:11635 TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC GAGCG >7P58X:01336:11621 ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
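Not the awk one-liner asked for, but the same filtering sketched in Python for comparison: load the ID list into a set, stream seq.fasta, and write out only the records whose header is in the set. The file names ids.txt and filtered.fasta are assumptions.

```python
def load_ids(path):
    """Read one ID per line; a leading ">" is tolerated."""
    with open(path) as handle:
        return {line.strip().lstrip(">") for line in handle if line.strip()}

def filter_fasta(fasta_path, ids, out_path):
    keep = False
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                keep = line[1:].strip() in ids  # header ID decides the whole record
            if keep:
                dst.write(line)

filter_fasta("seq.fasta", load_ids("ids.txt"), "filtered.fasta")
```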

Pandas Convert 'NA' to NaN

半世苍凉 submitted on 2019-12-01 15:16:44
I just picked up Pandas to do some data analysis work for my biology research. It turns out one of the proteins I'm analyzing is called 'NA'. I have a matrix with pairwise 'HA, M1, M2, NA, NP...' as the column headers, and the same as the row headers (for the biologists who might read this, I'm working with influenza). When I import the data into Pandas directly from a CSV file, it reads the row headers as 'HA, M1, M2...' and then NA gets read as NaN. Is there any way to stop this? The column headers are fine: 'HA, M1, M2, NA, NP, etc.' Turn off NaN detection this way: pd.read_csv(filename,
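A small sketch of the fix hinted at above: tell read_csv not to apply its default NA markers (keep_default_na=False) and, if needed, list explicitly which strings should still count as missing via na_values. The toy CSV is made up for illustration.

```python
import io
import pandas as pd

csv = io.StringIO("protein,HA,M1,NA\nHA,0,1,2\nM1,1,0,3\nNA,2,3,0\n")  # toy data

df = pd.read_csv(csv, index_col=0, keep_default_na=False, na_values=[""])
print(df.index.tolist())  # ['HA', 'M1', 'NA'] - the protein 'NA' survives as a string
```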
