What is the most efficient way to get first and last line of a text file?

Question

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order, so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in Python?

Note: These files are relatively large in length, about 1-2 million lines each and I have to do this for several hundred files.

Answer 1:

See the docs for the io module.

with open(fname, 'rb') as fh:
    first = next(fh).decode()

    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()

The value 1024 here stands in for the expected line length; I chose it only as an example. If you have an estimate of the average line length, you could use twice that value. Note that the seek fails if the file is shorter than the offset.

Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:

for line in fh:
    pass
last = line

For this loop you don't need the binary flag; you could just use open(fname).

ETA: Since you have many files to work on, you could take a sample of a couple of dozen files using random.sample and run this code on them to determine the length of the last line, starting with a deliberately large position shift (say, 1 MB). This will help you estimate the value for the full run.
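The sampling idea could be sketched roughly as follows. The helper names last_line_length and estimate_shift, the default sample size of 24, and the 1 MB window are illustrative assumptions, not part of the original answer; the sketch also assumes each sampled file's last line fits inside the window.

```python
import os
import random

def last_line_length(path, window=1024 * 1024):
    """Return the byte length of the last line, reading at most `window` bytes."""
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Absolute seek, clamped at 0, so small files do not raise an error.
        fh.seek(max(size - window, 0))
        lines = fh.readlines()
    return len(lines[-1]) if lines else 0

def estimate_shift(paths, sample_size=24, window=1024 * 1024):
    """Sample some files and return the longest last line seen, in bytes."""
    sample = random.sample(paths, min(sample_size, len(paths)))
    return max((last_line_length(p, window) for p in sample), default=0)
```

The returned value, with some safety margin, could then replace the 1024 in the seek call for the full run.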

Answer 2:

You could open the file for reading and read the first line using the builtin readline(), then seek to the end of file and step backwards until you find the line's preceding EOL and read the last line from there.

import os

with open(file, "rb") as f:
    first = f.readline()        # Read the first line.
    f.seek(-2, os.SEEK_END)     # Jump to the second last byte.
    while f.read(1) != b"\n":   # Until EOL is found...
        f.seek(-2, os.SEEK_CUR) # ...jump back the read byte plus one more.
    last = f.readline()         # Read last line.

Jumping to the second-to-last byte instead of the last one prevents the loop from exiting immediately on a trailing EOL. While stepping backwards you also need to move two bytes at a time, since reading a byte to check for EOL pushes the position forward by one.

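When using seek, the signature is seek(offset, whence=0), where whence specifies what the offset is relative to. The quote below from docs.python.org describes text-mode streams; the restrictions on offset that it lists are the reason the file above is opened in binary mode, where nonzero offsets relative to the current position or the end are allowed: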

SEEK_SET or 0 = seek from the start of the stream (the default); offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour.

SEEK_CUR or 1 = “seek” to the current position; offset must be zero, which is a no-operation (all other values are unsupported).

SEEK_END or 2 = seek to the end of the stream; offset must be zero (all other values are unsupported).
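For a binary stream, the position arithmetic behind the backward scan can be traced on a small in-memory example. This is only an illustrative sketch: io.BytesIO and the sample data stand in for a real file opened with "rb".

```python
import io
import os

data = b"first\nmiddle\nlast\n"   # 18 bytes in total
f = io.BytesIO(data)              # Stands in for open(fname, "rb")

f.seek(-2, os.SEEK_END)           # Position 16: the byte before the trailing newline
pos_after_seek = f.tell()
byte = f.read(1)                  # Reads b"t" and advances to position 17
f.seek(-2, os.SEEK_CUR)           # Back over the byte just read, plus one more
pos_after_step = f.tell()         # Position 15
```

Each read(1) followed by seek(-2, os.SEEK_CUR) therefore moves the position back by exactly one byte overall.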

Running it through timeit 10k times on a 200kB file with 6k lines took 1.62s, versus 6.92s for the for-loop below that was suggested earlier. On a 1.3GB file, still with 6k lines, a hundred runs took 8.93 vs 86.95.

with open(file, "rb") as f:
    first = f.readline()     # Read the first line.
    for last in f: pass      # Loop through the whole file reading it all.

Answer 3:

Here's a modified version of SilentGhost's answer that will do what you want.

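with open(fname, 'rb') as fh:
    first = next(fh)
    fh.seek(0, 2)
    size = fh.tell()                  # File size, so we never seek before the start.
    offs = 100
    while True:
        fh.seek(-min(offs, size), 2)  # Read a window at the end of the file.
        lines = fh.readlines()
        if len(lines) > 1 or offs >= size:
            last = lines[-1]
            break
        offs *= 2                     # Window too small: double it and retry.
    print(first)
    print(last)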

No need for an upper bound for line length here.

Answer 4:

Can you use Unix commands? I think head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] to get the last, but that may take too much memory.

Answer 5:

This is my solution, which is also compatible with Python 3. It handles edge cases as well, but it lacks UTF-16 support:

import os

def tail(filepath):
    """
    @author Marco Sulla (marcosullaroma@gmail.com)
    @date May 31, 2016
    """

    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""

                for pos in range(start_pos, -1, -1):
                    f.seek(pos)

                    char = f.read(1)

                    if char == b"\n":
                        break
                else:
                    f.seek(0)  # No newline found: the file is a single line.

        return f.readline()

It's inspired by Trasp's answer and AnotherParker's comment.

Answer 6:

First open the file in read mode. Then use the readlines() method to read it line by line; all the lines are stored in a list. Now you can index the list to get the first and last lines of the file.

    with open('file.txt', 'rb') as a:
        lines = a.readlines()
    if lines:
        first_line = lines[0]
        last_line = lines[-1]

Answer 7:

w = open('file.txt', 'r')
first = w.readline()
print('first line is : ', first)
x = first  # Fall back to the first line if the file has only one line.
for line in w:
    x = line
print('last line is : ', x)
w.close()

The for loop runs through the lines and x gets the last line on the final iteration.

Answer 8:

Nobody mentioned using reversed:

with open(file, "r") as f:
    r = reversed(f.readlines())
    last_line_of_file = next(r)

Answer 9:

with open("myfile.txt") as f:
    lines = f.readlines()
    first_row = lines[0]
    print(first_row)
    last_row = lines[-1]
    print(last_row)

Answer 10:

Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. It may be useful to handle this case if you repeatedly want to read the last line of a file that is continuously being updated. Without this, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.

def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()      # Read the first line.
        f.seek(-2, 2)             # Jump to the second last byte.
        while f.read(1) != b"\n": # Until EOL is found...
            try:
                f.seek(-2, 1)     # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()       # Read last line.
    return last

Answer 11:

Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek back some amount from SEEK_END, find the second-to-last line ending, and then read the last line from there.
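A minimal sketch of this approach with the low-level os functions might look like the following. The function name last_line_lseek and the max_line_len parameter are illustrative assumptions; the result is only correct if no line is longer than max_line_len bytes.

```python
import os

def last_line_lseek(path, max_line_len=1024):
    """Return the last line as bytes, assuming no line exceeds max_line_len."""
    # O_BINARY only exists on Windows; elsewhere the flag falls back to 0.
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        size = os.lseek(fd, 0, os.SEEK_END)     # File size in bytes
        window = min(size, 2 * max_line_len)    # Never seek before the start
        os.lseek(fd, -window, os.SEEK_END)
        tail = b""
        while len(tail) < window:               # os.read may return fewer bytes
            chunk = os.read(fd, window - len(tail))
            if not chunk:
                break
            tail += chunk
    finally:
        os.close(fd)
    # Skip a trailing newline, then look for the line ending before it.
    end = len(tail) - 1 if tail.endswith(b"\n") else len(tail)
    start = tail.rfind(b"\n", 0, end) + 1
    return tail[start:]
```

Reading twice the assumed maximum line length leaves room for the trailing newline and guarantees that the previous line ending falls inside the window when the assumption holds.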

Answer 12:

def last_line(filename):
    with open(filename, "rb") as f:  # Binary mode is needed for the seek from the end to work.
        first = f.readline()
        if f.read(1) == b"":         # Nothing after the first line: only one line in the file.
            return first
        f.seek(-2, 2)                # Jump to the second last byte.
        while f.read(1) != b"\n":    # Until EOL is found...
            f.seek(-2, 1)            # ...jump back the read byte plus one more.
        last = f.readline()          # Read last line.
        return last

This is a modified version of the earlier answers that handles the case where there is only one line in the file.
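As a variation on the same idea, the backward scan can read blocks instead of single bytes, which also covers empty and single-line files without special cases. This is a sketch only: the name first_and_last and the 4096-byte chunk size are assumptions, and it has not been benchmarked against the versions above.

```python
import os

def first_and_last(path, chunk_size=4096):
    """Return (first_line, last_line) as bytes; both are b"" for an empty file."""
    with open(path, "rb") as f:
        first = f.readline()
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
            # Stop once a newline other than the final byte is in the buffer.
            if tail.rfind(b"\n", 0, len(tail) - 1) != -1:
                break
    # Skip a trailing newline, then cut after the line ending before it.
    end = len(tail) - 1 if tail.endswith(b"\n") else len(tail)
    start = tail.rfind(b"\n", 0, end) + 1
    return first, tail[start:]
```

For a file containing one time stamp per line, first and last then hold the earliest and latest entries, and .decode() turns them into text.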

Source: https://stackoverflow.com/questions/3346430/what-is-the-most-efficient-way-to-get-first-and-last-line-of-a-text-file

Tags

python

file

seek