bioinformatics

Find length of overlap in strings [closed]

心不动则不痛 posted on 2021-02-16 13:40:33
Question: Do you know of any ready-to-use method to obtain the overlap of two strings and its length? It has to be in R, maybe something from stringr? I was looking here, unfortunately without success.

    str1 <- 'ABCDE'
    str2 <- 'CDEFG'
    str_overlap(str1, str2)
    'CDE'
    str_overlap
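The question asks for R, so the following is only a language-neutral illustration of the idea, written in Python. It is a minimal sketch that assumes "overlap" means the longest suffix of the first string that is also a prefix of the second, which matches the 'ABCDE' / 'CDEFG' example. The name `str_overlap` is borrowed from the question; it is not an existing stringr function.

```python
def str_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


overlap = str_overlap("ABCDE", "CDEFG")
print(overlap, len(overlap))  # CDE 3
```

The same loop translates directly to R with `substr()` and `endsWith()`. For long strings, a failure-function (KMP-style) approach would avoid the repeated comparisons.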

AttributeError: 'list' object has no attribute 'SeqRecord' - while trying to slice multiple sequences with Biopython>SeqIO from fasta file

不羁岁月 posted on 2021-02-11 17:52:31
Question: I am trying to generate N- and C-terminal slices of varying length (1, 2, 3, 4, 5, 6, 7). But before I get there, I am having problems just reading in my FASTA files. I was following the 'Random subsequences' section of the tutorial at https://biopython.org/wiki/SeqIO. But in this case there is only one sequence, so maybe that is where I went wrong. The code with example sequences and my errors. Any help would be much appreciated. I am clearly out of my depth. It looks like there are a lot of similar problems
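The asker's code is not shown in this excerpt, so the cause of the AttributeError cannot be confirmed here. In Biopython, `SeqIO.parse` yields one `SeqRecord` per FASTA entry, and wrapping it in `list()` gives a plain list whose elements are the records, so attributes such as `.seq` live on each element rather than on the list. The sketch below shows the per-record slicing idea in plain Python, without Biopython; the parser, the function names, and the two sequences are all made up for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def terminal_slices(sequence, lengths=range(1, 8)):
    """Return {length: (N-terminal slice, C-terminal slice)} for each length."""
    return {n: (sequence[:n], sequence[-n:]) for n in lengths if n <= len(sequence)}


# Made-up example records, not the asker's data.
example = StringIO(">seq1\nMKTAYIAKQR\n>seq2\nGSHMLEDPVA\n")
for identifier, sequence in read_fasta(example):
    n_term, c_term = terminal_slices(sequence)[3]
    print(identifier, n_term, c_term)
```

With Biopython the loop body would stay the same, with `str(record.seq)` in place of `sequence` for each record from `SeqIO.parse(path, "fasta")`.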

AttributeError: 'list' object has no attribute 'SeqRecord' - while trying to slice multiple sequences with Biopython>SeqIO from fasta file

女生的网名这么多〃 posted on 2021-02-11 17:51:04
Question: I am trying to generate N- and C-terminal slices of varying length (1, 2, 3, 4, 5, 6, 7). But before I get there, I am having problems just reading in my FASTA files. I was following the 'Random subsequences' section of the tutorial at https://biopython.org/wiki/SeqIO. But in this case there is only one sequence, so maybe that is where I went wrong. The code with example sequences and my errors. Any help would be much appreciated. I am clearly out of my depth. It looks like there are a lot of similar problems

Getting P-Values of Zero in Cox Regression: R

﹥>﹥吖頭↗ posted on 2021-02-11 13:35:45
Question: I am a student conducting a gene expression survival analysis in R. I have the expression data for 249 patients, and I am using 6,000 genes as well as their event-free survival times and vital status as response variables. When I tried to run the Cox regression on my dataset, I got extremely strange results (p-values of 0.00 and strange hazard ratios). I have checked over my code multiple times, but I am not able to catch my mistake (when I tried earlier with just one gene, it worked fine, but

Normalizing columns in R according to a formula

折月煮酒 posted on 2021-02-11 03:21:18
Question: Let's say I have a data frame of 1000 rows and 3 columns (t0, t4 and t8). Each column represents a time point (0 hours, 4 hours and 8 hours). The data are gene expression values (numeric, float):

    row.name             t0       t4       t8
    ENSG00000000419.8    1780.00  1837.00  1011.00
    ENSG00000000457.9     859.00   348.39   179.00
    ENSG00000000460.12   1333.00   899.00   508.00

I need to normalize the data according to a known result. I know that the average half-life of all rows (genes) should be 10 hours. So I need to find the
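The excerpt breaks off before the normalization formula, so the sketch below does not attempt it. It only illustrates how a per-gene half-life could be estimated from the three time points, under the assumption (not stated in the question) that expression decays exponentially, so that ln(expression) falls linearly with time. The function name `half_life` is hypothetical, and Python is used here although the question is about R.

```python
import math


def half_life(times, values):
    """Estimate a half-life from a least-squares line through ln(value) vs. time."""
    logs = [math.log(v) for v in values]
    t_mean = sum(times) / len(times)
    y_mean = sum(logs) / len(logs)
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs)) / sum(
        (t - t_mean) ** 2 for t in times
    )
    # For decay the slope is negative and equals -ln(2) / half-life.
    return -math.log(2) / slope


times = [0, 4, 8]  # hours, from the column names t0, t4, t8
row = [1333.00, 899.00, 508.00]  # ENSG00000000460.12 from the question
print(round(half_life(times, row), 2))
```

A row whose expression does not change gives a slope of zero and raises `ZeroDivisionError`, and a row that increases gives a negative "half-life", so real data would need those cases handled.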

Query genes within regions

删除回忆录丶 posted on 2021-02-10 16:15:23
Question: I want to retrieve the genes that are present within a series of regions. Say I have a BED file with query positions such as:

    1  2665697   4665777   MIR201
    1  10391435  12391516  MIR500
    1  15106831  17106911  MIR122
    1  23436535  25436616  MIR234
    1  23436575  25436656  MIR488

I would like to get the genes that fall within those regions. I have tried using biomaRt and bedtools intersect, but the output I get is a list of genes corresponding to all the regions, not one by one, as the desired output I would
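The desired output is cut off in this excerpt, so the following is only a sketch of one reading of it: a separate gene list for each query region. The regions are copied from the question; the gene coordinates and names are invented for illustration, and `genes_by_region` is a hypothetical helper, not a biomaRt or bedtools function.

```python
# Query regions copied from the question: (chromosome, start, end, name).
regions = [
    ("1", 2665697, 4665777, "MIR201"),
    ("1", 10391435, 12391516, "MIR500"),
    ("1", 15106831, 17106911, "MIR122"),
    ("1", 23436535, 25436616, "MIR234"),
    ("1", 23436575, 25436656, "MIR488"),
]

# Hypothetical gene annotations, invented purely for illustration.
genes = [
    ("1", 3000000, 3005000, "GENE_A"),
    ("1", 11000000, 11020000, "GENE_B"),
    ("1", 24000000, 24010000, "GENE_C"),
    ("2", 3000000, 3005000, "GENE_D"),
]


def genes_by_region(regions, genes):
    """Map each region name to the genes lying entirely inside that region."""
    result = {}
    for chrom, start, end, region_name in regions:
        result[region_name] = [
            gene_name
            for g_chrom, g_start, g_end, gene_name in genes
            if g_chrom == chrom and start <= g_start and g_end <= end
        ]
    return result


for region_name, hits in genes_by_region(regions, genes).items():
    print(region_name, ",".join(hits) or "-")
```

The nested loop is quadratic, which is fine for a handful of regions; for genome-wide annotation an interval tree or sorted sweep would scale better.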

BioPython AlignIO ValueError says strings must be same length?

安稳与你 posted on 2021-02-10 11:30:46
Question: Input FASTA-format text file: http://www.jcvi.org/cgi-bin/tigrfams/DownloadFile.cgi?file=/opt/www/www_tmp/tigrfams/fa_alignment_PF00205.txt

    #!/usr/bin/python
    from Bio import AlignIO
    seq_file = open('/path/to/fa_alignment_PF00205.txt')
    alignment = AlignIO.read(seq_file, "fasta")

Error:

    ValueError: Sequences must all be the same length

The input sequences shouldn't have to be the same length, since on ClustalOmega you can align sequences of differing lengths. This also doesn't work...gets the
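For context, `Bio.AlignIO` reads files that already contain an alignment, where every row is padded with gap characters to the same length, whereas `Bio.SeqIO` reads unaligned records of any length. The sketch below is a plain-Python check, with made-up sequences and hypothetical function names, of whether a set of sequences could be treated as an alignment at all. It does not inspect the asker's file.

```python
def sequence_lengths(sequences):
    """Return the sorted set of distinct sequence lengths."""
    return sorted({len(seq) for seq in sequences})


def looks_aligned(sequences):
    """Return True when every sequence has the same length (or the input is empty)."""
    return len(sequence_lengths(sequences)) <= 1


# Made-up sequences for illustration.
unaligned = ["MKTAYIAK", "MKTAY", "MKT"]
aligned = ["MKTAYIAK", "MKTAY---", "MKT-----"]

print(looks_aligned(unaligned), sequence_lengths(unaligned))  # False [3, 5, 8]
print(looks_aligned(aligned), sequence_lengths(aligned))      # True [8]
```

If the check fails, the records are unaligned input for a tool such as Clustal Omega rather than its output, and reading them with `SeqIO.parse` would be the usual route.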

BioPython AlignIO ValueError says strings must be same length?

北城余情 posted on 2021-02-10 11:28:17
Question: Input FASTA-format text file: http://www.jcvi.org/cgi-bin/tigrfams/DownloadFile.cgi?file=/opt/www/www_tmp/tigrfams/fa_alignment_PF00205.txt

    #!/usr/bin/python
    from Bio import AlignIO
    seq_file = open('/path/to/fa_alignment_PF00205.txt')
    alignment = AlignIO.read(seq_file, "fasta")

Error:

    ValueError: Sequences must all be the same length

The input sequences shouldn't have to be the same length, since on ClustalOmega you can align sequences of differing lengths. This also doesn't work...gets the

Optimising my script which lookups into a big compressed file

六月ゝ 毕业季﹏ posted on 2021-02-10 05:56:29
Question: I'm here again! I would like to optimise my bash script in order to lower the time spent on each loop. Basically, what it does is:

- get a piece of information from a TSV
- use that information to look up a line in a file with awk
- print the line and export it

My issues are:
1) The files are 60 GB compressed files: I need software to uncompress them (I'm actually trying to uncompress one now, not sure I'll have enough space).
2) It takes a long time to look into them anyway.

My ideas to improve it:
0) as said, if
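The asker's script and ideas are cut off here, so this is not their approach. It is a hedged sketch of one common alternative to running one awk lookup per TSV row: load all lookup values into a set first, then stream the compressed file once. It assumes the big file is gzip-compressed and tab-separated, neither of which the excerpt confirms. Both function names and the `column` parameter are hypothetical.

```python
import gzip


def load_keys(tsv_path, column=0):
    """Collect the lookup values from one column of a tab-separated file."""
    with open(tsv_path, encoding="utf-8") as handle:
        return {
            line.rstrip("\n").split("\t")[column] for line in handle if line.strip()
        }


def matching_lines(gz_path, keys, column=0):
    """Stream a gzip-compressed, tab-separated file once and yield the lines
    whose chosen column is one of `keys`."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > column and fields[column] in keys:
                yield line.rstrip("\n")


# Usage (paths are placeholders):
# keys = load_keys("queries.tsv")
# for hit in matching_lines("big_file.tsv.gz", keys):
#     print(hit)
```

Reading through `gzip.open` decompresses on the fly, so the 60 GB file never has to be unpacked to disk, and the cost becomes one pass over the file regardless of how many lookups there are.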

What causes Python error 'bad escape \C'?

浪尽此生 posted on 2021-02-04 07:35:26
Question: I just wrote a function that will look at a text file and count all of the instances of True and False in the text file. Here is my file:

    ATOM 43 CA LYS A 5 14.038 15.691 37.608 1.00 15.15 C True
    ATOM 52 CA CYS A 6 16.184 12.782 38.807 1.00 16.72 C True
    ATOM 58 CA GLU A 7 17.496 12.053 35.319 1.00 14.06 C False
    ATOM 67 CA VAL A 8 18.375 15.721 34.871 1.00 12.27 C True
    ATOM 74 CA PHE A 9 20.066 15.836 38.288 1.00 12.13 C False
    ATOM 85 CA GLN A 10 22.355 12.978 37.249 1.00 12.54 C False

And here
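The asker's function is not included in this excerpt, so the actual cause cannot be confirmed. In general, Python's `re` module raises "bad escape" when a pattern contains a backslash followed by a letter that is not a recognised escape, such as `\C`. The sketch below first reproduces that error in isolation and then counts the True/False column without any regular expression; `count_flags` is a hypothetical name, and the sample lines are the ones shown in the question.

```python
import re
from collections import Counter

# A backslash followed by an unrecognised letter is rejected as a pattern.
try:
    re.compile("\\C")
except re.error as exc:
    print("re.error:", exc)


def count_flags(lines):
    """Count lines whose last whitespace-separated field is 'True' or 'False'."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields and fields[-1] in ("True", "False"):
            counts[fields[-1]] += 1
    return counts


sample = """\
ATOM 43 CA LYS A 5 14.038 15.691 37.608 1.00 15.15 C True
ATOM 52 CA CYS A 6 16.184 12.782 38.807 1.00 16.72 C True
ATOM 58 CA GLU A 7 17.496 12.053 35.319 1.00 14.06 C False
ATOM 67 CA VAL A 8 18.375 15.721 34.871 1.00 12.27 C True
ATOM 74 CA PHE A 9 20.066 15.836 38.288 1.00 12.13 C False
ATOM 85 CA GLN A 10 22.355 12.978 37.249 1.00 12.54 C False
""".splitlines()

counts = count_flags(sample)
print(counts["True"], counts["False"])  # 3 3
```

If literal text such as a file path does have to go into a pattern, passing it through `re.escape()` first avoids the unrecognised-escape error.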