Python random lines from subfolders

前端未结

关注

 3  569

太阳男子 2020-12-09 22:30

I have many tasks in .txt files in multiple sub folders. I am trying to pick up a total 10 tasks randomly from these folders, their contained files and finally a text line w

3条回答

予麋鹿 (楼主)

2020-12-09 23:26

Here's a simple solution that makes just one pass through the files per sample. If you know exactly how many items you will be sampling from the files, it is probably optimal.

First off is the sample function. This uses the same algorithm that @NedBatchelder linked to in a comment on an earlier answer (though the Perl code shown there only selected a single line, rather than several). It selects values from of an iterable of lines, and only requires the currently selected lines to be kept in memory at any given time (plus the next candidate line). It raises a ValueError if the iterable has fewer values than the requested sample size.

import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v) # add first n items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results

edit: In another question, user @DzinX noticed that the use of insert in this code makes the performance bad (O(N^2)) if you're sampling a very large number of values. His improved version which avoids that issue is here. /edit

Now we just need to make a suitable iterable of items for our function to sample from. Here's how I'd do it using a generator. This code will only keep one file open at a time, and it does not need more than one line in memory at a time. The optional exclude parameter, if present, should be a set containing lines that have been selected on a previous run (and so should not be yielded again).

import os

def lines_generator(base_folder, exclude = None):
    for dirpath, dirs, files in os.walk(base_folder):
        for filename in files:
            if filename.endswith(".txt"):
                fullPath = os.path.join(dirpath, filename)
                with open(fullPath) as f:
                     for line in f:
                         cleanLine = line.strip()
                         if exclude is None or cleanLine not in exclude:
                             yield cleanLine

Now, we just need a wrapper function to tie those two pieces together (and manage a set of seen lines). It can return a single sample of size n or a list of count samples, taking advantage of the fact that a slice from a random sample is also a random sample.

_seen = set()

def get_sample(n, count = None):
    base_folder = r"C:\Tasks"
    if count is None:
        sample = random_sample(n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return sample
    else:
        sample = random_sample(count * n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return [sample[i * n:(i + 1) * n] for i in range(count)]

Here's how it can be used:

def main():
    s1 = get_sample(10)
    print("Sample1:", *s1, sep="\n")

    s2, s3 = get_sample(10,2) # get two samples with only one read of the files
    print("\nSample2:", *s2, sep="\n")
    print("\nSample3:", *s3, sep="\n")

    s4 = get_sample(5000) # this will probably raise a ValueError!

0 讨论(0)

查看其它3个回答