Python random lines from subfolders

太阳男子 2020-12-09 22:30

I have many tasks in .txt files in multiple subfolders. I am trying to pick a total of 10 tasks at random from these folders, their contained files, and finally a text line w

3 answers
  •  予麋鹿 (original poster)
    2020-12-09 23:26

    Here's a simple solution that makes just one pass through the files per sample. If you know exactly how many items you will be sampling from the files, it is probably optimal.

    First off is the sample function. This uses the same algorithm (reservoir sampling) that @NedBatchelder linked to in a comment on an earlier answer (though the Perl code shown there only selected a single line, rather than several). It selects values from an iterable of lines, and only requires the currently selected lines to be kept in memory at any given time (plus the next candidate line). It raises a ValueError if the iterable has fewer values than the requested sample size.

    import random
    
    def random_sample(n, items):
        results = []
    
        for i, v in enumerate(items):
            r = random.randint(0, i)
            if r < n:
                if i < n:
                    results.insert(r, v) # add first n items in random order
                else:
                    results[r] = v # at a decreasing rate, replace random items
    
        if len(results) < n:
            raise ValueError("Sample larger than population.")
    
        return results
    

    Edit: In another question, user @DzinX noticed that the use of insert in this code makes performance bad (O(n^2) in the sample size n) if you're sampling a very large number of values, because each insert into a list can shift every element after it. His improved version, which avoids that issue, is here. /Edit
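
One way to avoid the repeated insert calls is to append the first n items and shuffle them once before the replacement phase starts. The sketch below follows that idea. The name random_sample_fast is made up here, and the code is not necessarily identical to the version @DzinX posted.

```python
import random
from itertools import islice


def random_sample_fast(n, items):
    """Reservoir-sample n values from an iterable in a single pass."""
    iterator = iter(items)

    # Fill the reservoir with the first n items using O(1) appends.
    results = list(islice(iterator, n))
    if len(results) < n:
        raise ValueError("Sample larger than population.")

    # A single shuffle puts the first n items in random order.
    random.shuffle(results)

    # Item i (0-based) overwrites a random slot with probability n / (i + 1).
    for i, v in enumerate(iterator, n):
        r = random.randint(0, i)
        if r < n:
            results[r] = v

    return results
```

It takes the same arguments as random_sample, so it should work as a drop-in replacement when the requested sample is large.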

    Now we just need to make a suitable iterable of items for our function to sample from. Here's how I'd do it using a generator. This code will only keep one file open at a time, and it does not need more than one line in memory at a time. The optional exclude parameter, if present, should be a set containing lines that have been selected on a previous run (and so should not be yielded again).

    import os
    
    def lines_generator(base_folder, exclude=None):
        for dirpath, dirs, files in os.walk(base_folder):
            for filename in files:
                if filename.endswith(".txt"):
                    fullPath = os.path.join(dirpath, filename)
                    with open(fullPath) as f:
                        for line in f:
                            cleanLine = line.strip()
                            if exclude is None or cleanLine not in exclude:
                                yield cleanLine
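
For comparison, roughly the same traversal can be written with pathlib. This is only a sketch and not part of the original answer: the name iter_txt_lines, the explicit UTF-8 encoding, and the decision to skip blank lines are all assumptions.

```python
from pathlib import Path


def iter_txt_lines(base_folder, exclude=None):
    """Yield stripped, non-blank lines from every .txt file under base_folder."""
    # rglob descends into subfolders; sorting only makes the order repeatable.
    for path in sorted(Path(base_folder).rglob("*.txt")):
        if not path.is_file():
            continue
        # Assumption: the task files are UTF-8 text.
        with path.open(encoding="utf-8") as f:
            for line in f:
                clean_line = line.strip()
                if not clean_line:
                    continue  # assumption: blank lines are not tasks
                if exclude is None or clean_line not in exclude:
                    yield clean_line
```

Skipping blank lines keeps an empty string from being drawn as a "task", which the generator above would otherwise allow. Sorting builds the full list of paths before any file is read, but only one file is open at a time.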
    

    Now, we just need a wrapper function to tie those two pieces together (and manage a set of seen lines). It can return a single sample of size n or a list of count samples, taking advantage of the fact that a slice from a random sample is also a random sample.

    _seen = set()
    
    def get_sample(n, count = None):
        base_folder = r"C:\Tasks"
        if count is None:
            sample = random_sample(n, lines_generator(base_folder, _seen))
            _seen.update(sample)
            return sample
        else:
            sample = random_sample(count * n, lines_generator(base_folder, _seen))
            _seen.update(sample)
            return [sample[i * n:(i + 1) * n] for i in range(count)]
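
The slicing step relies on the sample being in random order, which the standard library's random.sample also guarantees for its own results. A minimal sketch of the same chunking, using random.sample on an in-memory list as a stand-in for the file-based sampler (the helper name split_sample and the fake task list are made up for illustration):

```python
import random


def split_sample(population, n, count):
    """Draw count disjoint groups of n items with a single sampling call."""
    # One draw of count * n items without replacement...
    sample = random.sample(population, count * n)
    # ...cut into consecutive slices, each of which is itself a random sample.
    return [sample[i * n:(i + 1) * n] for i in range(count)]


tasks = [f"task {k}" for k in range(100)]
first, second = split_sample(tasks, 10, 2)
print(len(first), len(second))       # 10 10
print(set(first) & set(second))      # set(): the groups never overlap
```

Because all groups come from one draw without replacement, no task can appear in two groups, which is the same property get_sample gets from reading the files only once.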
    

    Here's how it can be used:

    def main():
        s1 = get_sample(10)
        print("Sample1:", *s1, sep="\n")
    
        s2, s3 = get_sample(10,2) # get two samples with only one read of the files
        print("\nSample2:", *s2, sep="\n")
        print("\nSample3:", *s3, sep="\n")
    
        s4 = get_sample(5000) # this will probably raise a ValueError!

    if __name__ == "__main__":
        main()
    
