I have a file with 10^7 lines, from which I want to select 1/100 of the lines at random. This is the AWK code I have, but it slurps the whole file content beforehand. My PC
If the aim is just to avoid memory exhaustion, and the file is a regular file, there is no need to implement reservoir sampling. The number of lines in the file can be known if you make two passes over the file: one to get the number of lines (as with wc -l), and one to select the sample:
file=/some/file
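awk -v percent=0.01 -v n="$(wc -l < "$file")" '
# n = lines still to read, p = lines still to pick
BEGIN {srand(); p = int(n * percent)}
# keep the current line with probability p/n, then update both counters
rand() * n-- < p {p--; print}' < "$file"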
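
As a rough check of the approach above: each line is kept with probability (lines still to pick) / (lines still to read), so once those two counters are equal every remaining line is kept, and the sample should always contain exactly int(n * percent) lines. The sketch below tries this on a generated file; the 1000-line input from seq and the temporary file are assumptions made only for illustration, not part of the original setup.

```shell
# Hypothetical test input: a temporary file holding the numbers 1..1000
tmp=$(mktemp)
seq 1 1000 > "$tmp"

# Same two-pass sampling as above, applied to the temporary file
sample=$(awk -v percent=0.01 -v n="$(wc -l < "$tmp")" '
BEGIN {srand(); p = int(n * percent)}
rand() * n-- < p {p--; print}' < "$tmp")

# Count the selected lines; $(( )) strips the padding some wc versions add
count=$(printf '%s\n' "$sample" | wc -l)
echo "$((count))"   # prints 10

rm -f "$tmp"
```

Which lines are selected changes between runs, but the count does not. Note that srand() with no argument seeds from the time of day, so two runs within the same second may produce the same sample.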