Why does takewhile() skip the first line?

Question

I have a file like this:

1
2
3
TAB
1
2
3
TAB

I want to read the lines between TAB as blocks.

import itertools

def block_generator(file):
    with open(file) as lines:
        for line in lines:
            block = list(itertools.takewhile(lambda x: x.rstrip('\n') != '\t',
                                             lines))
            yield block

I want to use it as such:

blocks = block_generator(myfile)
for block in blocks:
    do_something(block)

The blocks I get all start with the second line, like [2,3] [2,3]. Why?

Answer 1:

Here is another approach, using itertools.groupby():

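from itertools import groupby

def block_generator(filename):
    with open(filename) as lines:
        # The key is False for a separator line ("\t\n") and True otherwise,
        # so groupby alternates between content runs and separator runs.
        for pred, block in groupby(lines, "\t\n".__ne__):
            if pred:
                # Copy the group; it is invalidated once groupby advances.
                yield list(block)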
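
A quick way to try the groupby() approach without a file on disk is sketched here. It assumes in-memory sample data via io.StringIO; blocks_from and sample are illustrative names, not part of the original answer. Each group is copied into a list because a groupby() sub-iterator stops being valid once the outer iterator moves on.

```python
import io
from itertools import groupby

def blocks_from(lines):
    # Key is True for content lines and False for the "\t\n" separator line.
    for is_content, group in groupby(lines, "\t\n".__ne__):
        if is_content:
            # Copy the group before groupby advances past it.
            yield list(group)

# Same layout as the file in the question: two blocks, each ended by a TAB line.
sample = io.StringIO("1\n2\n3\n\t\n1\n2\n3\n\t\n")
result = list(blocks_from(sample))
print(result)
```

Consecutive separator lines fall into a single False group here, so they do not produce empty blocks or end the iteration early.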

Answer 2:

Here you go, tested code. It loops with while True: and lets itertools.takewhile() do all the work on lines. Once the input is exhausted, takewhile() returns an iterator that yields nothing, which list() turns into an empty list, so a simple if not block: test detects the empty list and breaks out of the loop.

import itertools

def not_tabline(line):
    return '\t' != line.rstrip('\n')

def block_generator(file):
    with open(file) as lines:
        while True:
            block = list(itertools.takewhile(not_tabline, lines))
            if not block:
                break
            yield block

for block in block_generator("test.txt"):
    print("BLOCK:")
    print(block)

As noted in a comment below, this has one flaw: if the input has two lines in a row containing just the tab character, the second one produces an empty block and the loop stops without reading the rest of the input. I cannot think of a clean way to handle this with takewhile(); it's really unfortunate that the iterator you get back from itertools.takewhile() uses StopIteration both as the marker for the end of a group and as what you get at end-of-file. To make it worse, I cannot find any way to ask a file iterator object whether it has reached end-of-file. And lines.tell() does not help either: when I tried to use it to check on our progress, it already reported end-of-file after the first group, most likely because the file iterator reads ahead into an internal buffer rather than because takewhile() consumed the whole file.

I suggest using the itertools.groupby() solution. It's cleaner.
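
If empty blocks between consecutive tab lines need to be preserved rather than merged away, a plain loop avoids the ambiguity entirely. The following is only a sketch under that assumption; iter_blocks and its separator parameter are hypothetical names, and it swaps takewhile() for manual accumulation.

```python
import io

def iter_blocks(lines, separator="\t"):
    # Accumulate lines until a separator line is seen, then emit the block.
    block = []
    for line in lines:
        if line.rstrip("\n") == separator:
            yield block  # may be empty when two separators are adjacent
            block = []
        else:
            block.append(line)
    # Emit any trailing lines that were not followed by a separator.
    if block:
        yield block

# Two tab lines in a row: the empty block in the middle is kept, not lost.
sample = io.StringIO("1\n2\n\t\n\t\n3\n")
result = list(iter_blocks(sample))
print(result)
```

Because the loop itself decides when a block ends, end-of-file and end-of-block are no longer signalled the same way.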

Answer 3:

I think the problem is that you are taking lines in your lambda function rather than line. What is your expected output?

Answer 4:

itertools.takewhile implicitly iterates over the lines of the file in order to grab chunks, but so does for line in lines:, and both draw from the same file iterator. Each time through the loop, one line is grabbed and thrown away (since there is no code that uses line), and then takewhile groups the lines that follow it into a block.
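
The effect can be reproduced in a few lines. This sketch assumes in-memory data via io.StringIO in place of the real file, and records the line taken by the for loop next to the block that takewhile() builds; the blocks list is an illustrative addition.

```python
import io
import itertools

# Same layout as the file in the question.
lines = io.StringIO("1\n2\n3\n\t\n1\n2\n3\n\t\n")

blocks = []
for line in lines:
    # The for loop has already consumed `line` from the shared iterator,
    # so takewhile only sees the lines that come after it.
    block = list(itertools.takewhile(lambda x: x.rstrip("\n") != "\t", lines))
    blocks.append((line, block))

print(blocks)
```

Each recorded pair shows the "1" line sitting in line while the block holds only the "2" and "3" lines, which matches the [2,3] [2,3] result in the question.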

Source: https://stackoverflow.com/questions/7278327/why-does-takewhile-skip-the-first-line

Tags

python

generator