bioinformatics

Can I use the K-means algorithm on a string?

泪湿孤枕 submitted on 2019-12-02 20:48:16
I am working on a Python project where I study RNA structure evolution (represented as a string, for example "(((...)))", where the parentheses represent base pairs). The point is that I have an ideal structure and a population that evolves towards that ideal structure. I have implemented everything, but I would like to add a feature where I can get the "number of buckets", i.e. the k most representative structures in the population at each generation. I was thinking of using the k-means algorithm, but I am not sure how to use it with strings. I found scipy.cluster.vq but I don't know how to
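Not taken from the thread, but one common workaround sketched below: classical k-means needs numeric vectors, so equal-length dot-bracket strings can be one-hot encoded per position and then clustered with scipy.cluster.vq.kmeans2 (an alternative is k-medoids on an edit-distance matrix). The encode helper and the toy population are made up for illustration.

```python
# Hedged sketch: one-hot encode equal-length dot-bracket strings, then run k-means.
import numpy as np
from scipy.cluster.vq import kmeans2

ALPHABET = "()."  # dot-bracket symbols

def encode(structure):
    """Flatten a one-hot encoding of each position into a numeric vector."""
    vec = np.zeros((len(structure), len(ALPHABET)))
    for i, ch in enumerate(structure):
        vec[i, ALPHABET.index(ch)] = 1.0
    return vec.ravel()

population = ["(((...)))", "((.....))", "(((...)))", "(.......)"]
data = np.array([encode(s) for s in population])

k = 2  # number of "buckets"
centroids, labels = kmeans2(data, k, minit="++")
print(labels)  # cluster assignment of each structure; the member closest to each
               # centroid can serve as that bucket's representative structure
```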

How to use a list in a Snakemake tabular configuration to describe sequencing units for a bioinformatics pipeline

橙三吉。 submitted on 2019-12-02 04:29:00
Question: How do I use a list in a Snakemake tabular config? I use a Snakemake tabular configuration (mapping with BWA mem) to describe my sequencing units (libraries sequenced on separate lanes). At the next stage of the analysis I have to merge the sequencing units (mapped .bam files) and obtain merged .bam files (one for each sample). At the moment I use a YAML config to describe which units belong to which samples, but I would like to use the tabular config for this purpose; I am not clear how to write and recall a list
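A hedged sketch of one way to do it (not necessarily the accepted answer): read the tabular sheet with pandas inside the Snakefile and derive the per-sample list of units from it. The file name units.tsv, its columns (sample, unit), and the samtools merge step are assumptions for illustration.

```python
# Snakefile sketch: units.tsv is assumed to have columns "sample" and "unit".
import pandas as pd

units = pd.read_csv("units.tsv", sep="\t", dtype=str)

def units_of(sample):
    """List of sequencing units belonging to one sample."""
    return units.loc[units["sample"] == sample, "unit"].tolist()

rule merge_units:
    input:
        lambda wc: expand("mapped/{sample}.{unit}.bam",
                          sample=wc.sample, unit=units_of(wc.sample))
    output:
        "merged/{sample}.bam"
    shell:
        "samtools merge {output} {input}"
```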

Find point-to-range overlaps

不羁岁月 submitted on 2019-12-02 03:05:29
Question: I have a dataframe df1: df1 <- read.table(text=" Chr06 79641 Chr06 82862 Chr06 387314 Chr06 656098 Chr06 678491 Chr06 1018696", header=FALSE, stringsAsFactors=FALSE). I would like to check whether each row in df1 falls within a range in df2: column 2 of df2 is the start of a range and column 3 is the end of a range, and there is no overlap between ranges (between rows). The data in df2 are sorted by column 1 and column 2. I wrote a loop for this, but I am not happy with it because it runs for a long time if I have a few thousand rows in df1. I would like to find a more efficient way to do this job (better no
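For illustration only (in Python/pandas rather than the R asked about, and not taken from the thread): the same point-in-range test can be expressed without an explicit loop by building an IntervalIndex over the non-overlapping ranges and asking, for each position, which interval contains it. The toy data below are made up, and matching on the chromosome column is omitted for brevity.

```python
import pandas as pd

# Toy versions of df1 (points) and df2 (non-overlapping ranges).
df1 = pd.DataFrame({"chrom": ["Chr06"] * 3, "pos": [79641, 387314, 1018696]})
df2 = pd.DataFrame({"chrom": ["Chr06", "Chr06"],
                    "start": [70000, 380000],
                    "end":   [90000, 400000]})

ranges = pd.IntervalIndex.from_arrays(df2["start"], df2["end"], closed="both")
df1["range_idx"] = ranges.get_indexer(df1["pos"])  # -1 means not inside any range
print(df1)
```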

Bash: replace part of filename

雨燕双飞 submitted on 2019-12-02 01:31:58
I have a command I want to run on all of the files in a folder, and the command's syntax looks like this: tophat -o <output_file> <input_file>. What I would like is a script that loops over all the files in an arbitrary folder and uses each input file name to build a similar, but different, output file name. The file names look like this (input name, desired output name): path/to/sample1.fastq -> path/to/sample1.bam, path/to/sample2.fastq -> path/to/sample2.bam. Getting the input to work seems simple enough: for f in *.fastq do tophat -o <output_file> $f done. I tried using output=${f,.fastq,
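The question is about bash, but as an illustration of the same rename-and-run logic, here is a Python sketch: each output name is the input name with the .fastq suffix swapped for .bam, following the tophat -o <output_file> <input_file> form given in the question. The working directory "." is an assumption.

```python
from pathlib import Path
import subprocess

for fastq in Path(".").glob("*.fastq"):
    bam = fastq.with_suffix(".bam")  # path/to/sample1.fastq -> path/to/sample1.bam
    subprocess.run(["tophat", "-o", str(bam), str(fastq)], check=True)
```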

scanf_s warning? Skips user inputs (topics: Runge-Kutta, epidemic simulation)

狂风中的少年 submitted on 2019-12-01 23:22:44
This is my first post and I have to admit, I am terrible at programming. I am that guy in the class who works his tail off but can never seem to grasp programming as well as the rest of my classmates, so please be nice; I will try to explain my problem below. I have the following code (comments removed), but when I run it I get a warning similar to the one listed below. Also, when I run the program, the first user-entered value is accepted, but then all of a sudden it jumps to the end of the program, not allowing me to input the values for the other variables (e.g. the variable "beta"). I

Reading a file block by block using a specified delimiter in Python

放肆的年华 submitted on 2019-12-01 18:36:50
I have an input_file.fa file like this (FASTA format): >header1 description data data data >header2 description more data data data. I want to read the file in one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1: >header1 description data data data. Of course I could just read the whole file in like this and split it: with open("1.fa") as f: for block in f.read().split(">"): pass. But I want to avoid reading the whole file into memory, because the files are often large. I can of course read the file line by line: with open("input_file.fa") as f:
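A minimal sketch of the streaming approach (assuming plain FASTA where every header line starts with ">"): a generator that accumulates lines until the next header and yields one block at a time, so the whole file never sits in memory.

```python
def fasta_blocks(path):
    """Yield (header, sequence) pairs one block at a time."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

for header, sequence in fasta_blocks("input_file.fa"):
    print(header, len(sequence))
```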

Extract sequences from a multi-FASTA file by IDs listed in a file, using awk

流过昼夜 submitted on 2019-12-01 15:50:18
Question: I would like to extract, from a multi-FASTA file, the sequences whose IDs match a separate list of IDs. FASTA file seq.fasta: >7P58X:01332:11636 TTCAGCAAGCCGAGTCCTGCGTCGTTACTTCGCTT CAAGTCCCTGTTCGGGCGCC >7P58X:01334:11605 TTCAGCAAGCCGAGTCCTGCGTCGAGAGTTCAAGTC CCTGTTCGGGCGCCACTGCTAG >7P58X:01334:11613 ACGAGTGCGTCAGACCCTTTTAGTCAGTGTGGAAAC >7P58X:01334:11635 TTCAGCAAGCCGAGTCCTGCGTCGAGAGATCGCTTT CAAGTCCCTGTTCGGGCGCCACTGCGGGTCTGTGTC GAGCG >7P58X:01336:11621 ACGCTCGACACAGACCTTTAGTCAGTGTGGAAATCT
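Not the awk one-liner asked for, but the same filtering sketched in Python for comparison: load the ID list into a set, stream seq.fasta, and write out only the records whose header is in the set. The file names ids.txt and filtered.fasta are assumptions.

```python
def load_ids(path):
    """Read one ID per line; a leading ">" is tolerated."""
    with open(path) as handle:
        return {line.strip().lstrip(">") for line in handle if line.strip()}

def filter_fasta(fasta_path, ids, out_path):
    keep = False
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                keep = line[1:].strip() in ids  # header ID decides the whole record
            if keep:
                dst.write(line)

filter_fasta("seq.fasta", load_ids("ids.txt"), "filtered.fasta")
```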

Pandas Convert 'NA' to NaN

半世苍凉 submitted on 2019-12-01 15:16:44
I just picked up Pandas to do some data analysis work for my biology research. It turns out one of the proteins I'm analyzing is called 'NA'. I have a matrix with pairwise 'HA, M1, M2, NA, NP...' as the column headers, and the same as the row headers (for the biologists who might read this, I'm working with influenza). When I import the data into Pandas directly from a CSV file, it reads the row headers as 'HA, M1, M2...' and then NA gets read as NaN. Is there any way to stop this? The column headers are fine: 'HA, M1, M2, NA, NP, etc.' Turn off NaN detection this way: pd.read_csv(filename,
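A small sketch of the fix hinted at above: tell read_csv not to apply its default NA markers (keep_default_na=False) and, if needed, list explicitly which strings should still count as missing via na_values. The toy CSV is made up for illustration.

```python
import io
import pandas as pd

csv = io.StringIO("protein,HA,M1,NA\nHA,0,1,2\nM1,1,0,3\nNA,2,3,0\n")  # toy data

df = pd.read_csv(csv, index_col=0, keep_default_na=False, na_values=[""])
print(df.index.tolist())  # ['HA', 'M1', 'NA'] - the protein 'NA' survives as a string
```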
