I need to use Python to take N lines from a large txt file. These files are basically tab-delimited tables. My task has the following constraints:
Untested (and requires reading the file twice):
import random

N = 5000
with open('file.in') as fin:
    line_count = sum(1 for i in fin)
    fin.seek(0)
    to_take = set(random.sample(xrange(line_count), N))
    for lineno, line in enumerate(fin):
        if lineno in to_take:
            pass  # use it
However, since you mention that the lines are "roughly" the same size, you could use os.path.getsize and divide it by the average line length (whether already known, or sniffed from the first N lines of the file), and use that to estimate line_count - it'd be close enough for a random sample.
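An untested sketch of that estimate (the helper name and the 100-line sniff are my own assumptions, not from the answer):

from itertools import islice
import os

def estimate_line_count(path, sample_lines=100):
    # Sniff the average line length from the first few lines, then
    # divide the total file size by it to approximate the line count.
    with open(path) as f:
        lengths = [len(line) for line in islice(f, sample_lines)]
    if not lengths:
        return 0
    avg_len = sum(lengths) / float(len(lengths))
    return int(os.path.getsize(path) / avg_len)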
You could also mmap the file and use a combination of the file size, the average line length, your best guess at the number of lines, and a random line number to 'seek' to a spot, then just search backwards or forwards for the next start of a line. (Since mmap lets you treat the file like a string, you'll be able to use .index with an offset, or use re if you really wanted to.)
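A rough sketch of that mmap idea (untested; assumes a non-empty file, function name is illustrative). Keep in mind that picking a random byte offset favours lines that follow longer lines:

import mmap, os, random

def random_line_via_mmap(path):
    # Jump to a random byte offset and return the line containing it,
    # by scanning back to the previous newline and forward to the next.
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            pos = random.randint(0, size - 1)
            start = mm.rfind(b'\n', 0, pos) + 1   # 0 when no earlier newline
            end = mm.find(b'\n', start)
            if end == -1:
                end = size
            return mm[start:end]
        finally:
            mm.close()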
If you need a uniform sample of N lines in your file, you need to know the exact number of lines to pick from; seeking to random offsets doesn't give you that, and longer lines skew the results in favour of the lines directly following the longest lines.
Luckily, you only need to read your file once to pick those N lines. You basically pick your N first lines (in random order), then randomly replace picked lines with new ones with a diminishing probability based on the number of lines read.
For N == 1, the chance that line n (counting from zero) replaces the previous random pick is the chance that randint(0, n) < 1, so the second line has a 50% chance of being picked, the third a 33.33% chance, and so on. For larger N, replace one of the already-picked lines in your set at random as more lines are read, with the same distribution.
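For N == 1 that boils down to a couple of lines (a minimal sketch, not taken from the answer below):

import random

def random_line(fileobj):
    # Single-item reservoir sampling: the line at index i replaces the
    # current pick with probability 1/(i + 1), keeping the pick uniform.
    pick = None
    for i, line in enumerate(fileobj):
        if random.randint(0, i) < 1:
            pick = line
    return pick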
In "Python random lines from subfolders", Blkknght wrote a very helpful function for picking a random sample of size N from an iterable:
import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v)  # add first n items in random order
            else:
                results[r] = v  # at a decreasing rate, replace random items

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results
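For instance, it works on any iterable, not just files (the comment only describes the shape of the result; the actual picks vary per run):

print random_sample(3, xrange(10))  # three numbers from 0-9, in random order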
This is trivial to combine with your requirements to preserve a set of headers:
from itertools import islice

with open(options.input) as input:
    with open(options.output, 'w') as output:
        # Handling of header lines
        # Use islice to avoid buffer issues with .readline()
        for line in islice(input, int(options.header)):
            output.write(line)

        # Pick a random sample
        for line in random_sample(int(args[0]), input):
            output.write(line)
This will read your whole file in a single pass, pick a uniform random sample, and write it out to the output file. It thus has Θ(L) complexity, with L being the number of lines in the file.
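The options and args used above come from the script's own command-line parsing, which isn't shown; a possible optparse setup could look like this (option names are assumptions):

from optparse import OptionParser

parser = OptionParser(usage="usage: %prog [options] SAMPLE_SIZE")
parser.add_option('-i', '--input', help="input file to sample from")
parser.add_option('-o', '--output', help="file to write the sample to")
parser.add_option('--header', default=0, help="number of header lines to copy verbatim")
options, args = parser.parse_args()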
There is only one way of avoiding a sequential read of the whole file up to the last line you are sampling, and I am surprised that none of the answers so far has mentioned it:
Seek to an arbitrary location inside the file and read some bytes; since you have a typical line length, as you said, 3 or 4 times that value should do. Then split the chunk you read on the newline character ("\n") and pick the second field - that is a complete line at a random position.
Also, in order to seek into the file consistently, it should be opened in "binary read" mode, so the conversion of end-of-line markers has to be taken care of manually.
This technique can't tell you the number of the line that was read, so instead you keep the selected line's offset in the file to avoid picking the same line twice:
#! /usr/bin/python
# coding: utf-8

import random, os

CHUNK_SIZE = 1000
PATH = "/var/log/cron"

def pick_next_random_line(file, offset):
    file.seek(offset)
    chunk = file.read(CHUNK_SIZE)
    lines = chunk.split(os.linesep)
    # Make some provision in case you had not read at least one full line here
    line_offset = offset + len(os.linesep) + chunk.find(os.linesep)
    return line_offset, lines[1]

def get_n_random_lines(path, n=5):
    length = os.stat(path).st_size
    results = []
    result_offsets = set()
    with open(path, "rb") as input:  # binary mode, as discussed above
        for x in range(n):
            while True:
                offset, line = pick_next_random_line(input, random.randint(0, length - CHUNK_SIZE))
                if offset not in result_offsets:
                    result_offsets.add(offset)
                    results.append(line)
                    break
    return results

if __name__ == "__main__":
    print get_n_random_lines(PATH)
I believe it would be faster to randomly choose N line numbers, and then go over the file once, line by line, taking the lines whose number is in your list. Currently you have to seek to a random place for each random number, so it's O(N*M) where M is the size of the file. What I suggest is O(M).
Also, use a set() for your usedPositions variable - lookup will be faster, and since you need to handle up to 10^6 used positions, lookup time is not irrelevant. And use xrange instead of range in a for loop; allocating a full list of integers doesn't seem necessary.