fastq

bash: /bin/ls: Argument list too long

走远了吗. Posted on 2020-04-10 07:43:16
Question: I need to make a list of a large number of files (40,000 files) like the ones below:

    ERR001268_1_100.fastq  ERR001268_2_156.fastq  ERR001753_2_78.fastq
    ERR001268_1_101.fastq  ERR001268_2_157.fastq  ERR001753_2_79.fastq
    ERR001268_1_102.fastq  ERR001268_2_158.fastq  ERR001753_2_7.fastq
    ERR001268_1_103.fastq  ERR001268_2_159.fastq  ERR001753_2_80.fastq

My command is:

    ls ERR*_1_*.fastq |sed 's/\.fastq//g'|sort -n > masterlist

However, the error is:

    bash: /bin/ls: Argument list too long

How can I solve this problem?
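One possible way around the error, as a minimal sketch rather than a definitive answer: the limit applies to the argument list handed to an external program such as /bin/ls, so letting the shell iterate over the glob itself and printing names with the builtin printf avoids it. The function name make_masterlist is hypothetical, and plain sort is used in place of sort -n because the names start with letters.

```shell
#!/bin/bash
# Print every name matching ERR*_1_*.fastq in a directory, without the
# .fastq suffix, sorted. The glob is expanded inside the shell's own for
# loop and printed by the builtin printf, so no long argument list is
# ever passed to an external command.
make_masterlist() {
    local dir="$1"
    (
        cd "$dir" || exit 1
        for f in ERR*_1_*.fastq; do
            [ -e "$f" ] || continue    # glob matched nothing
            printf '%s\n' "$f"
        done | sed 's/\.fastq$//' | sort
    )
}
```

It could be called as `make_masterlist . > masterlist` from the directory that holds the files.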

grep issues when using two files - I've tried everything

感情迁移 Posted on 2019-12-25 18:31:56
Question: I have two files (recode and reads) that were created and saved with nano, and I want to compare the contents of recode against reads and extract the lines in reads that overlap. I have been trying to write a while loop with that logic in mind, but without success so far. The output does not match the pattern specified in the while loop with grep/recode. The script was supposed to read each line in recode.txt, compare it to reads.fastq, and extract each matching line plus one line before
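The part of the task that is visible in the excerpt (match each line of recode.txt against reads.fastq and keep the matching line plus the line before it) might not need a loop at all. The sketch below is an assumption about what is wanted, since the question is cut off; extract_matches is a hypothetical name, and the patterns are treated as fixed strings rather than regular expressions.

```shell
#!/bin/bash
# Print each line of the reads file that contains any pattern from the
# pattern file, preceded by one line of context.
#   -F    treat every pattern as a fixed string
#   -f    read the patterns, one per line, from a file
#   -B 1  also print the line before each match
extract_matches() {
    local patterns="$1" reads="$2"
    grep -F -B 1 -f "$patterns" "$reads"
}
```

Usage would be `extract_matches recode.txt reads.fastq`. A blank line in the pattern file matches every line of the input, so it is worth checking recode.txt for empty lines first.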

Bash: replace part of filename

风格不统一 Posted on 2019-12-20 03:25:30
Question: I have a command I want to run on all of the files in a folder, and the command's syntax looks like this:

    tophat -o <output_file> <input_file>

What I would like is a script that loops over all the files in an arbitrary folder and also uses the input file names to create similar, but different, output file names. The file names look like this:

    input name               desired output name
    path/to/sample1.fastq    path/to/sample1.bam
    path/to/sample2.fastq    path/to/sample2.bam

Getting the input to work

Grep that tolerates mismatches to subset .fastq

心不动则不痛 Posted on 2019-12-12 01:27:21
Question: I am working with bash on a Linux cluster. I am trying to extract reads from a .fastq file if they contain a match to a queried sequence. Below is an example .fastq file containing three reads.

    $ cat example.fastq
    @SRR1111111.1 1/1
    CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
    +
    AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
    @SRR1111111.2 2/1
    CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
    +
    AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE

Loop to concatenate multiple pairs of files with almost the same name in UNIX

给你一囗甜甜゛ Posted on 2019-12-11 10:08:59
Question: I have a very basic question, but I can't find a solution. I have multiple files in the same directory and I would like to concatenate each pair of files. The names are:

    Sample1_R1_L001.fastq
    Sample1_R2_L001.fastq
    Sample2_R1_L001.fastq
    Sample2_R2_L001.fastq
    Sample3_R1_L001.fastq
    Sample3_R2_L001.fastq
    (etc...)

The result I want is to concatenate by sample, such as

    cat Sample1_R1_L001.fastq Sample1_R2_L001.fastq > Sample1_concat.fastq

I tried this loop:

    find . -name " _R?_ "|while read file; do
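A loop over the R1 files alone may be simpler than find here, because the R2 name and the output name can both be derived from the R1 name. The following is a sketch under the assumption that every file follows the `<sample>_R1_L001.fastq` / `<sample>_R2_L001.fastq` naming shown in the question; concat_pairs is a hypothetical name.

```shell
#!/bin/bash
# For each <sample>_R1_L001.fastq in a directory, concatenate it with the
# matching <sample>_R2_L001.fastq into <sample>_concat.fastq.
concat_pairs() {
    local dir="$1" r1 r2 sample
    for r1 in "$dir"/*_R1_L001.fastq; do
        [ -e "$r1" ] || continue             # glob matched nothing
        sample=${r1%_R1_L001.fastq}          # strip the R1 suffix
        r2=${sample}_R2_L001.fastq
        if [ -f "$r2" ]; then
            cat "$r1" "$r2" > "${sample}_concat.fastq"
        else
            echo "skipping $r1: no matching R2 file" >&2
        fi
    done
}
```

Running `concat_pairs .` in the data directory would write one `_concat.fastq` file per sample and report any R1 file that has no partner.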

Bash script to concatenate text files with specific substrings in filenames

感情迁移 Posted on 2019-12-11 03:25:27
Question: Within a certain directory I have many directories containing a bunch of text files. I'm trying to write a script that concatenates only those files in each directory that have the string 'R1' in their filename into one file within that specific directory, and those that have 'R2' into another. This is what I wrote, but it's not working.

    #!/bin/bash
    for f in */*.fastq; do
        if grep 'R1' $f ; then
            cat "$f" >> R1.fastq
        fi
        if grep 'R2' $f ; then
            cat "$f" >> R2.fastq
        fi
    done

I get no errors and the
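Two things stand out in the script above: `grep 'R1' $f` searches the contents of each file rather than its name, and `R1.fastq` is written to the current directory instead of the file's own directory. A possible rework, offered only as a sketch, tests the file name with a case pattern and writes next to the inputs. The function name merge_by_read is hypothetical.

```shell
#!/bin/bash
# In every subdirectory of a top-level directory, append each *.fastq file
# whose NAME contains R1 to <subdir>/R1.fastq, and each whose name contains
# R2 to <subdir>/R2.fastq.
merge_by_read() {
    local top="$1" dir f
    for dir in "$top"/*/; do
        [ -d "$dir" ] || continue
        for f in "$dir"*.fastq; do
            [ -e "$f" ] || continue
            case ${f##*/} in                    # look at the file name only
                R1.fastq|R2.fastq) ;;           # skip outputs of an earlier run
                *R1*) cat "$f" >> "${dir}R1.fastq" ;;
                *R2*) cat "$f" >> "${dir}R2.fastq" ;;
            esac
        done
    done
}
```

Because the outputs are opened in append mode, running `merge_by_read .` a second time would add the same reads again, so old R1.fastq and R2.fastq files should be removed before a re-run.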

Peek into stream of Popen pipeline in Python

旧城冷巷雨未停 Posted on 2019-12-10 23:17:10
Question: Background: Python 2.6.6 on Linux. First part of a DNA sequence analysis pipeline. I want to read a possibly gzipped file from mounted remote storage (LAN). If it is gzipped, gunzip it to a stream (i.e. using gunzip FILENAME -c), and if the first character of the stream (file) is "@", route that entire stream into a filtering program that takes input on standard input; otherwise just pipe it directly to a file on local disk. I'd like to minimize the number of file reads/seeks from remote

How do I use parallel programming/multi threading in my bash script?

流过昼夜 Posted on 2019-12-08 15:46:59
Question: This is my script:

    #!/bin/bash
    #script to loop through directories to merge fastq files
    sourcedir=/path/to/source
    destdir=/path/to/dest
    for f in $sourcedir/*
    do
        fbase=$(basename "$f")
        echo "Inside $fbase"
        zcat $f/*R1*.fastq.gz | gzip > $destdir/"$fbase"_R1.fastq.gz
        zcat $f/*R2*.fastq.gz | gzip > $destdir/"$fbase"_R2.fastq.gz
    done

Here there are about 30 sub-directories in the directory 'source'. Each sub-directory has certain R1 .fastq.gz files and R2 .fastq.gz that I want to merge into one
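Since each sub-directory is processed independently, one simple way to run them concurrently is to start each directory's work as a background job with `&` and then `wait` for all of them. This is a sketch, not a tuned solution: merge_dir and merge_all are hypothetical names, `gzip -dc` stands in for zcat, and all jobs are launched at once with no cap on how many run together.

```shell
#!/bin/bash
# Merge the R1 and R2 .fastq.gz files of one sub-directory into two
# files in the destination directory.
merge_dir() {
    local f="$1" destdir="$2" fbase
    fbase=$(basename "$f")
    gzip -dc "$f"/*R1*.fastq.gz | gzip > "$destdir/${fbase}_R1.fastq.gz"
    gzip -dc "$f"/*R2*.fastq.gz | gzip > "$destdir/${fbase}_R2.fastq.gz"
}

# Start one background job per sub-directory, then wait for all of them.
merge_all() {
    local sourcedir="$1" destdir="$2" f
    for f in "$sourcedir"/*/; do
        [ -d "$f" ] || continue
        merge_dir "${f%/}" "$destdir" &    # & runs this call in the background
    done
    wait                                   # returns once every job has finished
}
```

It would be invoked as `merge_all /path/to/source /path/to/dest`. With around 30 sub-directories this starts around 30 jobs at the same time, which may or may not be faster depending on CPU count and disk throughput; a tool that limits concurrency could be substituted if that is a concern.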

Bash: replace part of filename

雨燕双飞 Posted on 2019-12-02 01:31:58
I have a command I want to run on all of the files in a folder, and the command's syntax looks like this:

    tophat -o <output_file> <input_file>

What I would like is a script that loops over all the files in an arbitrary folder and also uses the input file names to create similar, but different, output file names. The file names look like this:

    input name               desired output name
    path/to/sample1.fastq    path/to/sample1.bam
    path/to/sample2.fastq    path/to/sample2.bam

Getting the input to work seems simple enough:

    for f in *.fastq
    do
        tophat -o <output_file> $f
    done

I tried using output=${f,.fastq,
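For the renaming step, suffix removal with `${f%.fastq}` is one option: it strips a trailing `.fastq` from the value of `f`, after which `.bam` can be appended. The sketch below uses the tophat syntax exactly as given in the question; bam_name and for_each_fastq are hypothetical names, and the loop only prints the commands (a dry run) rather than executing them.

```shell
#!/bin/bash
# Derive the output name: drop a trailing .fastq and append .bam.
bam_name() {
    printf '%s\n' "${1%.fastq}.bam"
}

# Print the tophat command for every .fastq file in a directory.
# Remove the leading "echo" to run the commands instead of printing them.
for_each_fastq() {
    local dir="$1" f
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue            # glob matched nothing
        echo tophat -o "$(bam_name "$f")" "$f"
    done
}
```

For example, `bam_name path/to/sample1.fastq` prints `path/to/sample1.bam`, and `for_each_fastq path/to` lists one tophat command per input file.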

Converting FASTQ to FASTA with SED/AWK

百般思念 Posted on 2019-11-30 12:50:13
Question: I have data that always comes in blocks of four lines in the following format (called FASTQ):

    @SRR018006.2016 GA2:6:1:20:650 length=36
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
    +SRR018006.2016 GA2:6:1:20:650 length=36
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
    @SRR018006.19405469 GA2:6:100:1793:611 length=36
    ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
    +SRR018006.19405469 GA2:6:100:1793:611 length=36
    7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/

Is there a simple sed/awk/bash way to convert them into this format (called
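Because the records always span exactly four lines, the conversion can be driven by line position alone. The sketch below assumes the target is the usual FASTA layout (a `>` header line followed by the sequence line), which the excerpt does not show in full; fastq_to_fasta is a hypothetical name.

```shell
#!/bin/bash
# Convert strict four-line FASTQ records to FASTA.
# Line 1 of each record: replace the leading "@" with ">".
# Line 2 of each record: print the sequence unchanged.
# Lines 3 and 4 (the "+" line and the qualities) are dropped.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }' "$@"
}
```

It reads a file argument or standard input, for example `fastq_to_fasta reads.fastq > reads.fasta`. Selecting by line number rather than by a leading `@` matters because a quality line can itself begin with `@`.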