Randomly Pick Lines From a File Without Slurping It With Unix

忘了有多久 2020-12-07 11:40

I have a file with 10^7 lines, from which I want to pick 1/100 of the lines at random. This is the AWK code I have, but it slurps the whole file content beforehand. My PC

10 answers
  •  攒了一身酷
    2020-12-07 11:44

    In this case, reservoir sampling to get exactly k values is trivial enough with awk that I'm surprised no solution has suggested that yet. I had to solve the same problem and I wrote the following awk program for sampling:

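    BEGIN {
        srand();    # seed from the time of day; without this every run picks the same lines
    }
    # Fill the reservoir with the first k lines (slots 1..k).
    NR <= k {
        reservoir[NR] = $0;
    }
    # After that, line NR replaces a random slot with probability k/NR.
    NR > k {
        i = int(NR * rand()) + 1;    # uniform integer in 1..NR
        if (i <= k) {
            reservoir[i] = $0;
        }
    }
    END {
        # for-in iteration order is unspecified, so the output is not in file order.
        for (i in reservoir) {
            print reservoir[i];
        }
    }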
    

    The value of k has to be computed separately, because the line count is not known until the file has been read once. For example, count the lines first and pass the result in as a plain integer: awk -v k="$(( $(wc -l < FILE) / 100 ))"
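
    Putting the two steps together might look like the sketch below. The function name sample_lines and its two arguments are made up for illustration. It assumes a POSIX shell and awk, and it reads the file twice (once to count lines, once to sample), while holding only k lines in memory.

```shell
#!/bin/sh
# Hypothetical helper: sample_lines FILE DENOMINATOR
# Prints floor(N / DENOMINATOR) lines of FILE, chosen uniformly at random,
# where N is the number of lines in FILE.
sample_lines() {
    file=$1
    denom=$2

    # First pass: count the lines to derive k.
    # Reading from stdin makes wc print only the number, without the file name.
    k=$(( $(wc -l < "$file") / denom ))

    # Second pass: reservoir sampling with a reservoir of k slots (1..k).
    awk -v k="$k" '
        BEGIN { srand() }                      # seed from the time of day
        NR <= k { reservoir[NR] = $0; next }   # fill the reservoir first
        {
            i = int(NR * rand()) + 1           # uniform integer in 1..NR
            if (i <= k) reservoir[i] = $0      # replace a slot with probability k/NR
        }
        END { for (i = 1; i <= k; i++) print reservoir[i] }
    ' "$file"
}

# Example call (FILE is a placeholder, as in the answer above):
#   sample_lines FILE 100
```

    Because srand() seeds from the clock with one-second resolution in many awk implementations, two runs within the same second may select the same lines.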
