Randomly Pick Lines From a File Without Slurping It With Unix

忘了有多久 2020-12-07 11:40

I have a file with 10^7 lines, and I want to choose 1/100 of its lines at random. This is the AWK code I have, but it slurps the whole file beforehand. My PC

10 answers
  •  时光说笑
    2020-12-07 12:01

    I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.

    This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version because it does unnecessary comparisons.)
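
    A short sketch of why the sample is uniform, assuming the replacement rule used in the code below (record number j > n picks a slot uniformly from 1..j and replaces it only if the slot is at most n). Here N stands for the total number of records and k for the number of one particular record with k > n; both symbols are introduced only for this derivation:

    \[
    \Pr[\text{record } k \text{ is in the final sample}]
      = \frac{n}{k} \prod_{j=k+1}^{N} \left(1 - \frac{n}{j} \cdot \frac{1}{n}\right)
      = \frac{n}{k} \prod_{j=k+1}^{N} \frac{j-1}{j}
      = \frac{n}{k} \cdot \frac{k}{N}
      = \frac{n}{N}.
    \]

    For one of the first n records, the leading factor is 1 and the product starts at j = n+1, which again telescopes to n/N. Every record ends up in the sample with the same probability, and N is never needed while reading.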

    Original thread: Submitted for your review -- random sampling in awk.

    # Waterman's Algorithm R for random sampling
    # by way of Knuth's The Art of Computer Programming, volume 2
    
    BEGIN {
        if (!n) {
            print "Usage: sample.awk -v n=[size]"
            exit
        }
        t = n
        srand()
    }
    
    NR <= n {
        pool[NR] = $0
        places[NR] = NR
        next
    }
    
    NR > n {
        t++                       # t tracks the number of records seen so far
        M = int(rand()*t) + 1     # uniform integer in 1..t
        if (M <= n) {             # true with probability n/t
            READ_NEXT_RECORD(M)
        }
    }
    
    END {
        if (NR < n) {
            print "sample.awk: Not enough records for sample" \
                > "/dev/stderr"
            exit
        }
        # gawk needs a numeric sort function
        # since it doesn't have one, zero-pad and sort alphabetically
        pad = length(NR)
        for (i in pool) {
            new_index = sprintf("%0" pad "d", i)
            newpool[new_index] = pool[i]
        }
        x = asorti(newpool, ordered)
        for (i = 1; i <= x; i++)
            print newpool[ordered[i]]
    }
    
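    # Replace the record held in slot idx with the current record.
    # rec is declared as an extra parameter so that it stays local.
    function READ_NEXT_RECORD(idx,    rec) {
        rec = places[idx]       # record number currently occupying slot idx
        delete pool[rec]
        pool[NR] = $0
        places[idx] = NR
    }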
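
For comparison, here is a minimal sketch of the same reservoir idea when input order does not need to be preserved, which is where most of the length above comes from. It is not the script from the answer: the shell function name `reservoir` is made up for illustration, and it should work with any POSIX awk, not only Gawk.

```shell
# reservoir N [FILE...]: print N lines chosen uniformly at random.
# Hypothetical helper; output order is slot order, not input order.
reservoir() {
    n=$1
    shift
    awk -v n="$n" '
        BEGIN { srand() }
        # Fill the reservoir with the first n records.
        NR <= n { pool[NR] = $0; next }
        # Afterwards, pick a slot uniformly in 1..NR and replace it
        # only if it falls inside the reservoir (probability n/NR).
        {
            m = int(rand() * NR) + 1
            if (m <= n) pool[m] = $0
        }
        # Print what is held; fewer than n lines if the input was short.
        END { for (i = 1; i <= n && i <= NR; i++) print pool[i] }
    ' "$@"
}

# Example: five random lines from a generated 1000-line input.
seq 1 1000 | reservoir 5
```

For the case in the question, 1/100 of 10^7 lines would be n=100000, for example `reservoir 100000 bigfile.txt > sample.txt` (both file names are placeholders). Only the n sampled lines are held in memory, never the whole file.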
    
