file-processing

Reading input from text file into array of structures in C

江枫思渺然 submitted on 2019-12-19 09:14:06
Question: My structure definition is: typedef struct { int taxid; int geneid; char goid[20]; char evidence[4]; char qualifier[20]; char goterm[50]; char pubmed; char category[20]; } gene2go; I have a tab-separated text file called "gene2go.txt". Each line of this file contains taxID, geneID, goID, evidence, qualifier, goterm, pubmed and category information. Each line of the file will be kept in a structure. When the program is run, it will first read the content of the input file into an array of
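A minimal sketch of the intended read-into-records step, in Python for illustration (field layout taken from the struct definition; note that in the C version, char pubmed holds only a single character, which is likely too small for a PubMed ID):

    import csv
    from dataclasses import dataclass

    @dataclass
    class Gene2Go:
        taxid: int
        geneid: int
        goid: str
        evidence: str
        qualifier: str
        goterm: str
        pubmed: str
        category: str

    # Read every tab-separated line of gene2go.txt into a list of records.
    records = []
    with open("gene2go.txt", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            taxid, geneid, goid, evidence, qualifier, goterm, pubmed, category = row
            records.append(Gene2Go(int(taxid), int(geneid), goid, evidence,
                                   qualifier, goterm, pubmed, category))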

Randomly Pick Lines From a File Without Slurping It With Unix

放肆的年华 submitted on 2019-12-17 17:24:44
Question: I have a file of 10^7 lines, and I want to randomly choose 1/100 of its lines. This is the AWK code I have, but it slurps the entire file content beforehand. My PC's memory cannot handle such a slurp. Is there another approach? awk 'BEGIN{srand()} !/^$/{ a[c++]=$0} END { for ( i=1;i<=c ;i++ ) { num=int(rand() * c) if ( a[num] ) { print a[num] delete a[num] d++ } if ( d == c/100 ) break } }' file Answer 1: if you have that many lines, are you sure you want exactly 1% or a statistical
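The direction the answer hints at, sketched in Python: decide per line with probability 1/100, so memory use is constant and nothing is slurped (the sample size is then only statistically 1%, binomially distributed around it):

    import random
    import sys

    # Stream the file; keep each line independently with probability 1/100.
    with open(sys.argv[1]) as f:
        for line in f:
            if random.random() < 0.01:
                sys.stdout.write(line)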

BASH - remove line if first column content appears in another file

旧街凉风 submitted on 2019-12-13 08:26:29
Question: I have two files. File A looks like: a 1 a 2 a 3 b 4 c 5 and file B has the content: a b For everything that appears in file B and also appears in column 1 of file A, I would like to remove those lines. So the expected output for file A should be: c 5 Any help is greatly appreciated! Answer 1: GNU Awk: awk 'ARGIND == 1 { del[$0]++ } ARGIND == 2 && !del[$1]' B A When processing the first file (ARGIND is 1), enter $0 (each entire line) into an associative array del by incrementing its
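The same two-pass idea as the awk answer, sketched in Python (file names A and B as in the question):

    # Pass 1: collect the keys to delete from file B.
    with open("B") as fb:
        delete = {line.strip() for line in fb}

    # Pass 2: print only the lines of A whose first column is not in that set.
    with open("A") as fa:
        for line in fa:
            fields = line.split()
            if fields and fields[0] not in delete:
                print(line, end="")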

Node.js: Read very large file (~10GB), process line by line, then write to another file

自作多情 submitted on 2019-12-12 13:12:42
Question: I have a 10 GB log file in a particular format. I want to process this file line by line and then write the output to another file after applying some transformations. I am using Node for this operation. This method works, but it takes a very long time: I was able to do this within 30-45 minutes in Java, but in Node it takes more than 160 minutes to do the same job. Following is the initiation code, which reads each line from the input: var path = '.
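The general streaming pattern involved, sketched in Python (paths and the transformation are placeholders; the asker's Node code is cut off above):

    def transform(line: str) -> str:
        # Placeholder for the actual transformation.
        return line.upper()

    # Read one buffered line at a time and write the transformed output.
    with open("input.log") as src, open("output.log", "w") as dst:
        for line in src:
            dst.write(transform(line))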

How to format an output in Python?

丶灬走出姿态 submitted on 2019-12-12 10:12:33
Question: I am having difficulty formatting some output in Python. My code is here: keys = ['(Lag)=(\d+\.?\d*)','\t','(Autocorrelation Index): (\d+\.?\d*)', '(Autocorrelation Index): (\d+\.?\d*)', '(Semivariance): (\d+\.?\d*)'] import re string1 = ''.join(open("dummy.txt").readlines()) found = [] for key in keys: found.extend(re.findall(key, string1)) for result in found: print '%s = %s' % (result[0],result[1]) raw_input() So far, I am getting this output: Lag = 1 Lag = 2 Lag = 3 Autocorrelation Index
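For reference, each keyed pattern here has two capture groups, so re.findall returns (name, value) tuples that the final loop formats; a self-contained sketch with a made-up input string:

    import re

    text = "Lag=1\nLag=2\nSemivariance: 0.75\n"  # hypothetical sample input
    keys = [r'(Lag)=(\d+\.?\d*)', r'(Semivariance): (\d+\.?\d*)']
    found = []
    for key in keys:
        found.extend(re.findall(key, text))  # list of (name, value) tuples
    for name, value in found:
        print('%s = %s' % (name, value))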

How can I get exactly n random lines from a file with Perl?

孤街醉人 submitted on 2019-12-12 08:54:00
Question: Following up on this question, I need to get exactly n lines at random out of a file (or stdin). This would be similar to head or tail, except I want some from the middle. Now, other than looping over the file with the solutions to the linked question, what's the best way to get exactly n lines in one run? For reference, I tried this: #!/usr/bin/perl -w use strict; my $ratio = shift; print $ratio, "\n"; while (<>) { print if ((int rand $ratio) == 1); } where $ratio is the rough percentage of
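The usual one-pass answer to "exactly n at random" is reservoir sampling (Algorithm R); a sketch in Python:

    import random
    import sys

    def reservoir_sample(lines, n):
        # Keep a "reservoir" of n items; every input item ends up in it
        # with equal probability n/total, in a single pass.
        sample = []
        for i, line in enumerate(lines):
            if i < n:
                sample.append(line)
            else:
                j = random.randrange(i + 1)
                if j < n:
                    sample[j] = line
        return sample

    if __name__ == "__main__":
        n = int(sys.argv[1])
        sys.stdout.writelines(reservoir_sample(sys.stdin, n))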

Error while using '<file>.readlines()' function

荒凉一梦 submitted on 2019-12-12 05:56:14
Question: The goal was to import the infile, read it, and print only two lines into the outfile. This is the code I had in IDLE: def main(): infile = open('names.py', "r") outfile = open('orgnames.py', "w") for i in range (2): line = ("names.py".readlines()) print (line[:-1], infile = outfile) infile.close() outfile.close() main() This is the error message I keep getting: Traceback (most recent call last): File "C:/Python33/studentnames6.py", line 11, in <module> main() File "C:/Python33/studentnames6
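For reference, a corrected sketch of what the code appears to intend: readlines must be called on the file object, not on the string 'names.py', and print redirects output with the keyword file=, not infile=:

    def main():
        with open('names.py') as infile, open('orgnames.py', 'w') as outfile:
            lines = infile.readlines()          # read from the file object
            for line in lines[:2]:              # only the first two lines
                print(line.rstrip('\n'), file=outfile)  # keyword is 'file'

    main()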

Perl file processing on SHIFT_JIS encoded Japanese files

寵の児 submitted on 2019-12-12 03:53:44
Question: I have a set of SHIFT_JIS (Japanese) encoded csv files from Windows, which I am trying to process on a Linux server running Perl v5.10.1, using regular expressions to make string replacements. Here is my requirement: I want the Perl script's regular expressions to be human readable (at least to a Japanese person), i.e. like this: s/北/0/g; instead of littered with hex codes: s/\x{4eba}/0/g; Right now, I am editing the Perl script in Notepad++ on Windows, and pasting in the string I need to
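The underlying recipe, decode on read, operate on real text, encode on write, sketched in Python (file names assumed; the Perl analogue is use utf8 for literal patterns plus an :encoding(shift_jis) layer on the filehandles):

    import re

    # Decode SHIFT_JIS into text so the pattern can be written literally.
    with open("input.csv", encoding="shift_jis") as src:
        text = src.read()

    text = re.sub("北", "0", text)  # human readable, as in s/北/0/g

    with open("output.csv", "w", encoding="shift_jis") as dst:
        dst.write(text)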

Split 10 billion line file into 5,000 files by column value in Perl or Python

耗尽温柔 submitted on 2019-12-11 12:17:26
Question: I have a 10 billion line tab-delimited file that I want to split into 5,000 sub-files, based on a column (the first column). How can I do this efficiently in Perl or Python? This has been asked here before, but all the approaches either open a file for each row read or put all the data in memory. Answer 1: This program will do as you ask. It expects the input file as a parameter on the command line, and writes output files whose names are taken from the first column of the input file's records. It keeps a
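A Python sketch of the handle-cache idea the answer describes: keep a bounded pool of open output files and reopen in append mode after eviction, rather than opening a file per row (the cache size here is an assumption):

    import sys
    from collections import OrderedDict

    MAX_OPEN = 500  # assumed cap, kept under the OS open-file limit

    handles = OrderedDict()  # LRU cache of output files, keyed by column 1

    def handle_for(key):
        if key in handles:
            handles.move_to_end(key)        # mark as most recently used
        else:
            if len(handles) >= MAX_OPEN:
                _, oldest = handles.popitem(last=False)
                oldest.close()
            handles[key] = open(key + ".txt", "a")  # append: may reopen later
        return handles[key]

    with open(sys.argv[1]) as src:
        for line in src:
            handle_for(line.split("\t", 1)[0]).write(line)

    for fh in handles.values():
        fh.close()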

Performance issue with MultiResourcePartitioner in Spring Batch

我的梦境 submitted on 2019-12-11 06:58:22
Question: I have a Spring Batch project that reads a huge zip file containing more than 100,000 xml files. I am using MultiResourcePartitioner, and I have a memory issue: my batch fails with java.lang.OutOfMemoryError: GC overhead limit exceeded. It seems as if all the xml files are loaded in memory and not garbage-collected after processing. Is there a performant way to do this? Thanks. Source: https://stackoverflow.com/questions/38793243/performance-issue-with-multiresourcepartitioner-in-spring-batch