I need to use Python to take N lines from a large txt file. These files are basically tab-delimited tables. My task has the following constraints:
Untested (and requires reading the file twice):
import random

N = 5000
with open('file.in') as fin:
    line_count = sum(1 for i in fin)
    fin.seek(0)
    to_take = set(random.sample(xrange(line_count), N))
    for lineno, line in enumerate(fin):
        if lineno in to_take:
            pass  # use it
However, since you mention that the lines are "roughly" the same size, you could use os.path.getsize and divide it by the average line length (whether already known, or sniffed from the first N lines of the file), and use that to estimate line_count - it'd be close enough for a random sample.
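An untested sketch of that estimate (the helper name and the 100-line sniff are my own assumptions, not from the answer):

from itertools import islice
import os

def estimate_line_count(path, sample_lines=100):
    # Sniff the average line length from the first few lines, then
    # divide the total file size by it to approximate the line count.
    with open(path) as f:
        lengths = [len(line) for line in islice(f, sample_lines)]
    if not lengths:
        return 0
    avg_len = sum(lengths) / float(len(lengths))
    return int(os.path.getsize(path) / avg_len)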
You could also mmap the file and use a combination of the file size, the average line length, your best guess at the number of lines, and a random line number to 'seek' to a spot, then just search backwards or forwards for the next start of a line. (Since mmap lets you treat the file like a string, you'll be able to use .index with an offset, or use re if you really wanted to.)
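A rough sketch of that mmap idea (untested; assumes a non-empty file, function name is illustrative). Keep in mind that picking a random byte offset favours lines that follow longer lines:

import mmap, os, random

def random_line_via_mmap(path):
    # Jump to a random byte offset and return the line containing it,
    # by scanning back to the previous newline and forward to the next.
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            pos = random.randint(0, size - 1)
            start = mm.rfind(b'\n', 0, pos) + 1   # 0 when no earlier newline
            end = mm.find(b'\n', start)
            if end == -1:
                end = size
            return mm[start:end]
        finally:
            mm.close()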
If you need a uniform sample of N lines in your file, you need to know the exact number of lines to pick from; seeking to random offsets doesn't give you that, and longer lines skew the results in favour of the lines directly following the longest lines.
Luckily, you only need to read your file once to pick those N lines. You basically pick your N first lines (in random order), then randomly replace picked lines with new ones with a diminishing probability based on the number of lines read.
For N == 1, the chance that line n (counting from zero) replaces the previous random pick is the chance that randint(0, n) < 1, so the second line has a 50% chance of being picked, the third a 33.33% chance, and so on. For larger N, replace one of the already-picked lines in your set at random as more lines are read, with the same distribution.
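For N == 1 that boils down to a couple of lines (a minimal sketch, not taken from the answer below):

import random

def random_line(fileobj):
    # Single-item reservoir sampling: the line at index i replaces the
    # current pick with probability 1/(i + 1), keeping the pick uniform.
    pick = None
    for i, line in enumerate(fileobj):
        if random.randint(0, i) < 1:
            pick = line
    return pick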
In "Python random lines from subfolders", Blkknght wrote a very helpful function for picking a random sample of size N from an iterable:
import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v)  # add first n items in random order
            else:
                results[r] = v  # at a decreasing rate, replace random items

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results
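For instance, it works on any iterable, not just files (the comment only describes the shape of the result; the actual picks vary per run):

print random_sample(3, xrange(10))  # three numbers from 0-9, in random order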
This is trivial to combine with your requirements to preserve a set of headers:
from itertools import islice

with open(options.input) as input:
    with open(options.output, 'w') as output:
        # Handling of header lines
        # Use islice to avoid buffer issues with .readline()
        for line in islice(input, int(options.header)):
            output.write(line)

        # Pick a random sample
        for line in random_sample(int(args[0]), input):
            output.write(line)
This will read your whole file in a single pass, pick a uniform random sample, and write it out to the output file. It thus has Θ(L) complexity, with L being the number of lines in the file.
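The options and args used above come from the script's own command-line parsing, which isn't shown; a possible optparse setup could look like this (option names are assumptions):

from optparse import OptionParser

parser = OptionParser(usage="usage: %prog [options] SAMPLE_SIZE")
parser.add_option('-i', '--input', help="input file to sample from")
parser.add_option('-o', '--output', help="file to write the sample to")
parser.add_option('--header', default=0, help="number of header lines to copy verbatim")
options, args = parser.parse_args()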
There is only one way of avoiding a sequential read of the whole file up to the last line you are sampling, and I am surprised that none of the answers so far has mentioned it:
Seek to an arbitrary location inside the file and read some bytes; since you have a typical line length, as you said, 3 or 4 times that value should do. Then split the chunk you read on the newline character ("\n") and pick the second field - that is a complete line at a random position.
Also, in order to seek into the file consistently, it should be opened in "binary read" mode, so the conversion of end-of-line markers has to be taken care of manually.
This technique can't tell you the number of the line that was read, so instead you keep the selected line's offset in the file to avoid picking the same line twice:
#! /usr/bin/python
# coding: utf-8

import random, os

CHUNK_SIZE = 1000
PATH = "/var/log/cron"

def pick_next_random_line(file, offset):
    file.seek(offset)
    chunk = file.read(CHUNK_SIZE)
    lines = chunk.split(os.linesep)
    # Make some provision in case you had not read at least one full line here
    line_offset = offset + len(os.linesep) + chunk.find(os.linesep)
    return line_offset, lines[1]

def get_n_random_lines(path, n=5):
    length = os.stat(path).st_size
    results = []
    result_offsets = set()
    with open(path, "rb") as input:  # binary mode, as discussed above
        for x in range(n):
            while True:
                offset, line = pick_next_random_line(input, random.randint(0, length - CHUNK_SIZE))
                if offset not in result_offsets:
                    result_offsets.add(offset)
                    results.append(line)
                    break
    return results

if __name__ == "__main__":
    print get_n_random_lines(PATH)
I believe it would be faster to randomly choose N line numbers, and then go over the file once, line by line, taking the lines whose number is in your list. Currently you have to seek to a random place for each random number, so it's O(N*M) where M is the size of the file. What I suggest is O(M).
Also, use a set() for your usedPositions variable - lookup will be faster, and since you need to handle up to 10^6 used positions, lookup time is not irrelevant. And use xrange instead of range in a for loop; allocating a full list of integers doesn't seem necessary.