file-processing

Reading input from text file into array of structures in C

江枫思渺然 submitted on 2019-12-19 09:14:06
Question: My structure definition is: typedef struct { int taxid; int geneid; char goid[20]; char evidence[4]; char qualifier[20]; char goterm[50]; char pubmed; char category[20]; } gene2go; I have a tab-separated text file called "gene2go.txt". Each line of this file contains taxID, geneID, goID, evidence, qualifier, goterm, pubmed and category information. Each line of the file will be kept in a structure. When the program is run, it will first read the content of the input file into an array of
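A minimal sketch of the intended read-into-records step, in Python for illustration (field layout taken from the struct definition; note that in the C version, char pubmed holds only a single character, which is likely too small for a PubMed ID):

    import csv
    from dataclasses import dataclass

    @dataclass
    class Gene2Go:
        taxid: int
        geneid: int
        goid: str
        evidence: str
        qualifier: str
        goterm: str
        pubmed: str
        category: str

    # Read every tab-separated line of gene2go.txt into a list of records.
    records = []
    with open("gene2go.txt", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            taxid, geneid, goid, evidence, qualifier, goterm, pubmed, category = row
            records.append(Gene2Go(int(taxid), int(geneid), goid, evidence,
                                   qualifier, goterm, pubmed, category))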

Randomly Pick Lines From a File Without Slurping It With Unix

放肆的年华 submitted on 2019-12-17 17:24:44
Question: I have a file of 10^7 lines, and I want to randomly choose 1/100 of its lines. This is the AWK code I have, but it slurps the entire file content beforehand. My PC's memory cannot handle such a slurp. Is there another approach? awk 'BEGIN{srand()} !/^$/{ a[c++]=$0} END { for ( i=1;i<=c ;i++ ) { num=int(rand() * c) if ( a[num] ) { print a[num] delete a[num] d++ } if ( d == c/100 ) break } }' file Answer 1: if you have that many lines, are you sure you want exactly 1% or a statistical
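The direction the answer hints at, sketched in Python: decide per line with probability 1/100, so memory use is constant and nothing is slurped (the sample size is then only statistically 1%, binomially distributed around it):

    import random
    import sys

    # Stream the file; keep each line independently with probability 1/100.
    with open(sys.argv[1]) as f:
        for line in f:
            if random.random() < 0.01:
                sys.stdout.write(line)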

BASH - remove line if first column content appears in another file

旧街凉风 submitted on 2019-12-13 08:26:29
Question: I have two files. File A looks like: a 1 a 2 a 3 b 4 c 5 and file B has the content: a b For everything that appears in file B and also appears in column 1 of file A, I would like to remove those lines. So the expected output for file A should be: c 5 Any help is greatly appreciated! Answer 1: GNU Awk: awk 'ARGIND == 1 { del[$0]++ } ARGIND == 2 && !del[$1]' B A When processing the first file (ARGIND is 1), enter $0 (each entire line) into an associative array del by incrementing its
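The same two-pass idea as the awk answer, sketched in Python (file names A and B as in the question):

    # Pass 1: collect the keys to delete from file B.
    with open("B") as fb:
        delete = {line.strip() for line in fb}

    # Pass 2: print only the lines of A whose first column is not in that set.
    with open("A") as fa:
        for line in fa:
            fields = line.split()
            if fields and fields[0] not in delete:
                print(line, end="")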

Node.js: Read very large file (~10GB), process line by line, then write to another file

自作多情 submitted on 2019-12-12 13:12:42
Question: I have a 10 GB log file in a particular format. I want to process this file line by line and then write the output to another file after applying some transformations. I am using Node for this operation. This method works, but it takes a very long time: I was able to do this within 30-45 minutes in Java, but in Node it takes more than 160 minutes to do the same job. Following is the initiation code, which reads each line from the input: var path = '.
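The general streaming pattern involved, sketched in Python (paths and the transformation are placeholders; the asker's Node code is cut off above):

    def transform(line: str) -> str:
        # Placeholder for the actual transformation.
        return line.upper()

    # Read one buffered line at a time and write the transformed output.
    with open("input.log") as src, open("output.log", "w") as dst:
        for line in src:
            dst.write(transform(line))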

How to format an output in Python?

丶灬走出姿态 submitted on 2019-12-12 10:12:33
Question: I am having difficulty formatting some output in Python. My code is here: keys = ['(Lag)=(\d+\.?\d*)','\t','(Autocorrelation Index): (\d+\.?\d*)', '(Autocorrelation Index): (\d+\.?\d*)', '(Semivariance): (\d+\.?\d*)'] import re string1 = ''.join(open("dummy.txt").readlines()) found = [] for key in keys: found.extend(re.findall(key, string1)) for result in found: print '%s = %s' % (result[0],result[1]) raw_input() So far, I am getting this output: Lag = 1 Lag = 2 Lag = 3 Autocorrelation Index
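For reference, each keyed pattern here has two capture groups, so re.findall returns (name, value) tuples that the final loop formats; a self-contained sketch with a made-up input string:

    import re

    text = "Lag=1\nLag=2\nSemivariance: 0.75\n"  # hypothetical sample input
    keys = [r'(Lag)=(\d+\.?\d*)', r'(Semivariance): (\d+\.?\d*)']
    found = []
    for key in keys:
        found.extend(re.findall(key, text))  # list of (name, value) tuples
    for name, value in found:
        print('%s = %s' % (name, value))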

How can I get exactly n random lines from a file with Perl?

孤街醉人 submitted on 2019-12-12 08:54:00
Question: Following up on this question, I need to get exactly n lines at random out of a file (or stdin). This would be similar to head or tail, except I want some from the middle. Now, other than looping over the file with the solutions to the linked question, what's the best way to get exactly n lines in one run? For reference, I tried this: #!/usr/bin/perl -w use strict; my $ratio = shift; print $ratio, "\n"; while (<>) { print if ((int rand $ratio) == 1); } where $ratio is the rough percentage of
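The usual one-pass answer to "exactly n at random" is reservoir sampling (Algorithm R); a sketch in Python:

    import random
    import sys

    def reservoir_sample(lines, n):
        # Keep a "reservoir" of n items; every input item ends up in it
        # with equal probability n/total, in a single pass.
        sample = []
        for i, line in enumerate(lines):
            if i < n:
                sample.append(line)
            else:
                j = random.randrange(i + 1)
                if j < n:
                    sample[j] = line
        return sample

    if __name__ == "__main__":
        n = int(sys.argv[1])
        sys.stdout.writelines(reservoir_sample(sys.stdin, n))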

Error while using '<file>.readlines()' function

荒凉一梦 submitted on 2019-12-12 05:56:14
Question: The goal was to import the infile, read it, and print only two lines into the outfile. This is the code I had in IDLE: def main(): infile = open('names.py', "r") outfile = open('orgnames.py', "w") for i in range (2): line = ("names.py".readlines()) print (line[:-1], infile = outfile) infile.close() outfile.close() main() This is the error message I keep getting: Traceback (most recent call last): File "C:/Python33/studentnames6.py", line 11, in <module> main() File "C:/Python33/studentnames6
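For reference, a corrected sketch of what the code appears to intend: readlines must be called on the file object, not on the string 'names.py', and print redirects output with the keyword file=, not infile=:

    def main():
        with open('names.py') as infile, open('orgnames.py', 'w') as outfile:
            lines = infile.readlines()          # read from the file object
            for line in lines[:2]:              # only the first two lines
                print(line.rstrip('\n'), file=outfile)  # keyword is 'file'

    main()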

Perl file processing on SHIFT_JIS encoded Japanese files

寵の児 submitted on 2019-12-12 03:53:44
Question: I have a set of SHIFT_JIS (Japanese) encoded csv files from Windows, which I am trying to process on a Linux server running Perl v5.10.1, using regular expressions to make string replacements. Here is my requirement: I want the Perl script's regular expressions to be human readable (at least to a Japanese person), i.e. like this: s/北/0/g; instead of littered with hex codes: s/\x{4eba}/0/g; Right now, I am editing the Perl script in Notepad++ on Windows, and pasting in the string I need to
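The underlying recipe, decode on read, operate on real text, encode on write, sketched in Python (file names assumed; the Perl analogue is use utf8 for literal patterns plus an :encoding(shift_jis) layer on the filehandles):

    import re

    # Decode SHIFT_JIS into text so the pattern can be written literally.
    with open("input.csv", encoding="shift_jis") as src:
        text = src.read()

    text = re.sub("北", "0", text)  # human readable, as in s/北/0/g

    with open("output.csv", "w", encoding="shift_jis") as dst:
        dst.write(text)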

Split 10 billion line file into 5,000 files by column value in Perl or Python

耗尽温柔 submitted on 2019-12-11 12:17:26
Question: I have a 10 billion line tab-delimited file that I want to split into 5,000 sub-files, based on a column (the first column). How can I do this efficiently in Perl or Python? This has been asked here before, but all the approaches either open a file for each row read or put all the data in memory. Answer 1: This program will do as you ask. It expects the input file as a parameter on the command line, and writes output files whose names are taken from the first column of the input file's records. It keeps a
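A Python sketch of the handle-cache idea the answer describes: keep a bounded pool of open output files and reopen in append mode after eviction, rather than opening a file per row (the cache size here is an assumption):

    import sys
    from collections import OrderedDict

    MAX_OPEN = 500  # assumed cap, kept under the OS open-file limit

    handles = OrderedDict()  # LRU cache of output files, keyed by column 1

    def handle_for(key):
        if key in handles:
            handles.move_to_end(key)        # mark as most recently used
        else:
            if len(handles) >= MAX_OPEN:
                _, oldest = handles.popitem(last=False)
                oldest.close()
            handles[key] = open(key + ".txt", "a")  # append: may reopen later
        return handles[key]

    with open(sys.argv[1]) as src:
        for line in src:
            handle_for(line.split("\t", 1)[0]).write(line)

    for fh in handles.values():
        fh.close()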

Performance issue with MultiResourcePartitioner in Spring Batch

我的梦境 submitted on 2019-12-11 06:58:22
Question: I have a Spring Batch project that reads a huge zip file containing more than 100,000 xml files. I am using MultiResourcePartitioner, and I have a memory issue: my batch fails with java.lang.OutOfMemoryError: GC overhead limit exceeded. It seems as if all the xml files are loaded in memory and not garbage-collected after processing. Is there a performant way to do this? Thanks. Source: https://stackoverflow.com/questions/38793243/performance-issue-with-multiresourcepartitioner-in-spring-batch