awk

Parse the large test files using awk

Submitted by 梦想与她 on 2020-01-15 12:22:28
Question: I am looking to parse a space-delimited input text file using awk. The column code can have more than one row for each group. I would greatly appreciate any help with this. Input file:

TR 1
Action Success/Failure
8.1.1.1 RunOne 80 48
8.1.1.2 RunTwo 80 49
8.1.1.3 RunThree 100 100
8.1.1.4 RunFour 20 19
8.1.1.5 RunFive 20 20
Action Time 16:47:42
Action2 Success/Failure
8.1.2.1 RunSix 80 49
8.1.2.2 RunSeven 80 80
8.1.2.3 RunEight 80 80
Action2 Time 03:26:31
TR 2
Action Success/Failure
8.1.1.1
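
The question above does not show the desired output, so the following is only a minimal sketch, assuming the goal is to label every result row with the TR number and action name it falls under; the file name tests.txt and the output layout are assumptions:

    awk '
        $1 == "TR"              { tr = $2; next }         # remember the current TR group number
        $2 == "Success/Failure" { action = $1; next }     # remember the current action header
        $2 == "Time"            { next }                  # skip the per-action timing lines
        NF == 4                 { print tr, action, $0 }  # result rows: code, name, expected, actual
    ' tests.txt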

Parsing GenBank file

Submitted by 生来就可爱ヽ(ⅴ<●) on 2020-01-15 10:58:05
Question: Basically, a GenBank file consists of gene entries (announced by 'gene'), each followed by its corresponding 'CDS' entry (only one per gene), like the two I show here below. I would like to get locus_tag vs product in a tab-delimited two-column file. 'gene' and 'CDS' are always preceded and followed by spaces. If this task can be easily performed using an already available tool, please let me know. Input file:

gene        complement(8972..9094)
            /locus_tag="HAPS_0004"
            /db_xref="GeneID:7278619"
CDS         complement
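
A minimal awk sketch for the stated goal, assuming each /locus_tag="..." and /product="..." value sits on a single line (long product descriptions that wrap across lines are not handled); the file name sequence.gb is hypothetical:

    awk -F'"' '
        /\/locus_tag=/ { tag = $2 }            # remember the most recent locus_tag value
        /\/product=/   { print tag "\t" $2 }   # pair it with the product line of the CDS entry
    ' sequence.gb > locus_product.tsv

Splitting on double quotes with -F'"' makes the quoted value land in field $2, so no regex capture is needed.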

Extract data between two tags

Submitted by 佐手、 on 2020-01-15 10:34:21
Question: After searching and reading extensively, I managed to get half of the work done. Here is the string:

<td class='bold vmiddle'> Owner CIDR: </td><td><span class='jtruncate-text'><a href="http://3.abcdef.com/ip-3/encoded/czovL215aXAubXMvdmlldy9pcF9hZGRyZXNzZXMvNDIuMjI0LjAuMA%3D%3D">42.224.0.0</a>/12</span></td>

I need to extract the 42.224.0.0 and the /12 to make 42.224.0.0/12. So far I have managed to get 42.224.0.0 by using: sed -n 's/^.*<a.href="[^"]*">\([^<]*\).*/\1/p' but I'm at a loss how to
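
One way to capture the prefix length as well, sketched under the assumption that the input always has the same <a ...>address</a>/nn shape as the sample above (the file name page.html is hypothetical; for anything less regular, an HTML-aware parser would be safer):

    sed -n 's|.*<a[^>]*>\([^<]*\)</a>\(/[0-9][0-9]*\).*|\1\2|p' page.html

With the sample string this prints 42.224.0.0/12: the first group keeps the link text and the second keeps the /12 that follows the closing </a>.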

How to sort out duplicates from a massive list using sort, uniq or awk?

Submitted by 依然范特西╮ on 2020-01-15 10:33:28
Question: I have a 12 GB file of combined hash lists. I need to find the duplicates in it, but I've been having some issues. Some 920 (already uniq'd) lists were merged using cat *.txt > _uniq_combined.txt, resulting in a huge list of hashes. Once merged, the final list WILL contain duplicates. I thought I had it figured out with awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt && say finished ya jabroni. awk '!seen[$0]++' _uniq_combined.txt > _AWK_duplicates.txt results in a file with a size of
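
Worth noting: awk '!seen[$0]++' prints the first occurrence of every line, i.e. it de-duplicates the list rather than reporting the duplicates, and its seen array for a 12 GB input can easily outgrow available memory. A sketch of a sort-based alternative that prints only the lines occurring more than once (the output file name and the use of /tmp for sort's temporary files are assumptions):

    LC_ALL=C sort -T /tmp _uniq_combined.txt | uniq -d > duplicates.txt

uniq -d keeps one copy of each repeated line; uniq -dc would also show how many times each hash occurs. LC_ALL=C makes sort compare raw bytes, which is usually faster for hash strings.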

Is it possible to have different behavior for first and second input files to awk?

Submitted by 风格不统一 on 2020-01-15 07:40:10
Question: For example, suppose I run the following command:

gawk -f AppendMapping.awk Reference.tsv TrueInput.tsv

Assume the names of the files WILL change. While iterating through the first file, I want to create a mapping: map[$16]=$18. While iterating through the second file, I want to use the mapping: print $1, map[$2]. What's the best way to achieve this behavior (i.e., different behavior for each input file)?

Answer 1: As you probably know, NR stores the current line number; as you may or may not know, it's
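
Answer 1 is pointing at the standard NR==FNR idiom: NR counts records across all inputs while FNR restarts at 1 for each file, so the two are equal only while the first file is being read. A sketch of what AppendMapping.awk could look like, built from the two statements given in the question (it assumes the first file is non-empty and that whitespace-separated fields are acceptable; pass -F'\t' if the TSV fields may contain spaces):

    # AppendMapping.awk -- sketch of the two-file idiom
    NR == FNR { map[$16] = $18; next }   # true only for the first file: build the mapping
              { print $1, map[$2] }      # second file: FNR has reset but NR has not

Run it exactly as in the question: gawk -f AppendMapping.awk Reference.tsv TrueInput.tsv. In gawk specifically, ARGIND == 1 is an alternative test that stays correct even if the first file happens to be empty.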

Combining two very large files ignoring the first sentence

Submitted by 耗尽温柔 on 2020-01-15 07:16:40
Question: I want to combine two giant files, each a few hundred megabytes, into a single file while ignoring the first line. I wanted to use awk as I thought it would be the most efficient way. The way I'm doing it only ignores the first line of the second file. Any idea how to make this work, or is there a faster way to do it?

awk 'FNR!=NR && FNR==1 {next} 1' 'FNR!=NR && FNR==1 {next} 2' s_mep_{1,2}.out >> s_mep.out

Answer 1: $ awk 'FNR>1' file{1,2} > file_12

Answer 2: With sed: (sed '1d' file_1 ; sed '1d' file_2) > new
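
A note on why the attempt behaves as observed: awk takes only the first quoted string as its program, so the second quoted string is treated as an input file name rather than code; and even the intended single-program form 'FNR!=NR && FNR==1 {next} 1' would skip only the second file's header, because FNR equals NR for the whole first file. Answer 1's FNR>1 drops the first line of every file, since FNR restarts per file. If the intent is instead to keep the first file's header and drop only the second file's (an assumption about the desired result), a small variant would be:

    awk 'NR == 1 || FNR > 1' s_mep_1.out s_mep_2.out > s_mep.out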