I have a file with 10^7 lines, from which I want to choose 1/100 of the lines at random. This is the AWK code I have, but it slurps all the file content beforehand. My PC
In this case, reservoir sampling to get exactly k values is trivial enough with awk
that I'm surprised no solution has suggested that yet. I had to solve the same problem and I wrote the following awk
program for sampling:
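    BEGIN {
        srand();    # seed from the clock; otherwise every run picks the same lines
    }
    # Fill the reservoir with the first k lines (slots 0..k-1).
    NR <= k {
        reservoir[NR - 1] = $0;
        next;
    }
    # Every later line replaces a random slot with probability k/NR.
    {
        i = int(NR * rand());
        if (i < k) {
            reservoir[i] = $0;
        }
    }
    END {
        for (i in reservoir) {
            print reservoir[i];
        }
    }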
Working out what k is has to be done separately, for example by passing awk -v "k=$(( $(wc -l < FILE) / 100 ))", which sets k to one hundredth of the line count, rounded down.
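
To tie the two steps together, here is a minimal sketch that counts the lines, derives k, and runs the sampler in one go. The function name sample_lines and its optional second argument are my own illustrative choices, and it assumes a POSIX shell and awk:

```shell
# sample_lines FILE [DENOM]: print 1/DENOM of FILE's lines (default 1/100).
# The function name and arguments are hypothetical, chosen for this sketch.
sample_lines() {
    file=$1
    denom=${2:-100}
    # Integer division rounds down, so k is a whole number of lines.
    k=$(( $(wc -l < "$file") / denom ))
    awk -v k="$k" '
        BEGIN { srand() }                         # seed from the clock
        NR <= k { reservoir[NR - 1] = $0; next }  # fill slots 0..k-1
        {
            i = int(NR * rand())                  # uniform in 0..NR-1
            if (i < k) reservoir[i] = $0          # keep with probability k/NR
        }
        END { for (i in reservoir) print reservoir[i] }
    ' "$file"
}

# Example: a generated 1000-line file gives k = 10, so 10 lines come out.
demo=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$demo"
sample_lines "$demo"
rm -f "$demo"
```

This reads the file twice (once for wc, once for awk) but never holds more than k lines in memory. The sampled lines come out in whatever order awk iterates the array, not in file order.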