Question:
I need to read a large file, line by line. Let's say the file is more than 5GB and I need to read each line, but obviously I do not want to use readlines()
because it will create a very large list in memory.
How will the code below work in this case? Does xreadlines
itself read one line at a time into memory? Is the generator expression needed?
f = (line for line in open("log.txt").xreadlines())  # how much is loaded in memory?
f.next()
Plus, what can I do to read this in reverse order, just like the Linux tail command?
I found:
http://code.google.com/p/pytailer/
and
"python head, tail and backward read by lines of a text file"
Both worked very well!
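For the reverse-order part of the question, neither of those projects is reproduced here, but the general technique is to read fixed-size chunks backwards from the end of the file and split them into lines. A minimal sketch, assuming a UTF-8 text file (the function name reverse_lines and the chunk size are mine, not pytailer's):

import os

def reverse_lines(path, chunk_size=8192, encoding="utf-8"):
    # Yield the lines of a text file from last to first, reading fixed-size
    # chunks backwards from the end so the whole file never sits in memory.
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        leftover = b""
        first = True
        while position > 0:
            read_size = min(chunk_size, position)
            position -= read_size
            f.seek(position)
            chunk = f.read(read_size) + leftover
            if first:
                first = False
                if chunk.endswith(b"\n"):
                    chunk = chunk[:-1]      # ignore the newline that terminates the file
            lines = chunk.split(b"\n")
            leftover = lines.pop(0)         # possibly a partial line; keep it for the next chunk
            for line in reversed(lines):
                yield line.decode(encoding)
        yield leftover.decode(encoding)     # the first line of the file

To get a tail-like "last N lines", take only the first N items from the generator, e.g. itertools.islice(reverse_lines("log.txt"), 10).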
Answer 1:
I provided this answer because Keith's, while succinct, doesn't close the file explicitly:
with open("log.txt") as infile: for line in infile: do_something_with(line)
Answer 2:
All you need to do is use the file object as an iterator.
for line in open("log.txt"):
    do_something_with(line)
Even better is to use a context manager, available in recent Python versions:
with open("log.txt") as fileobject: for line in fileobject: do_something_with(line)
This will automatically close the file as well.
Answer 3:
You are better off using an iterator instead. Relevant: http://docs.python.org/library/fileinput.html
From the docs:
import fileinput

for line in fileinput.input("filename"):
    process(line)
This will avoid copying the whole file into memory at once.
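As a side note (not in the original answer), in Python 3.2+ the object returned by fileinput.input can also be used as a context manager, so the underlying file is closed automatically. A minimal sketch, reusing the same placeholder process function from above:

import fileinput

# Sketch only: "filename" and process() are placeholders from the answer above.
with fileinput.input(files=("filename",)) as f:
    for line in f:
        process(line)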
Answer 4:
An old-school approach:
fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
Answer 5:
I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So I recreated the cp command using line-by-line reading and writing. It's CRAZY FAST.
#!/usr/bin/env python3.6
import sys

with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)
Answer 6:
How about this? Divide your file into chunks and then read it chunk by chunk: when you read a file, your operating system caches the data ahead of your position, and reading strictly line by line does not make efficient use of that cached information.
Instead, divide the file into chunks, load each whole chunk into memory, and then do your processing.
import os

def chunks(fh, size=1024):
    # Yield (start, length) pairs describing chunks of the file, ending each
    # chunk at a line boundary; only the offsets are kept, not the data itself.
    while True:
        startat = fh.tell()
        print(startat)                      # the file object's current position from the start
        fh.seek(size, 1)                    # jump `size` bytes forward from the current position
        data = fh.readline()                # read on to the end of the current line
        yield startat, fh.tell() - startat  # doesn't store the whole file in memory
        if not data:
            break

if os.path.isfile(fname):
    try:
        fh = open(fname, 'rb')
    except IOError as e:                    # e.g. permission denied
        print("I/O error({0}): {1}".format(e.errno, e.strerror))
    except Exception as e1:                 # handle other exceptions such as attribute errors
        print("Unexpected error: {0}".format(e1))
    for ele in chunks(fh):
        fh.seek(ele[0])                     # startat
        data = fh.read(ele[1])              # length of the chunk
        print(data)
Answer 7:
Thank you! I have recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b'", which I guess means it was in binary format. Using "decode(utf-8)" changed it to ASCII.
Then I had to remove a "=\n" in the middle of each line.
Then I split the lines at the newline character.
import binascii

b_data = fh.read(ele[1])                          # one chunk of data, still as bytes
a_data = binascii.b2a_qp(b_data).decode('utf-8')  # chunk converted to quoted-printable ASCII text
data_chunk = a_data.replace('=\n', '').strip()    # remove the soft line breaks that b2a_qp inserts
data_list = data_chunk.split('\n')                # list containing the lines in this chunk
# print(data_list, '\n')
# time.sleep(1)
for j in range(len(data_list)):                   # iterate through data_list to get each line
    i += 1
    line_of_data = data_list[j]
    print(line_of_data)
This is the code starting just above the print(data) line in Arohi's code.
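As an aside (not part of this answer), the b'' prefix and the decode/replace steps only appear because the file was opened in binary mode ('rb'). If you do not need the byte offsets from the chunking code, a minimal sketch that opens the file in text mode with an explicit encoding, assuming the file is UTF-8, gives you str lines directly:

# Hypothetical simplification: text mode with an explicit encoding, so no
# manual decode() or '=\n' cleanup is needed (byte-offset chunking is lost).
with open(fname, 'r', encoding='utf-8') as fh_text:
    for line in fh_text:
        print(line.rstrip('\n'))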
Answer 8:
The blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.
dask.dataframe takes care of chunking internally, supports many parallelisable operations and allows you to export slices back to pandas easily for in-memory operations.
import dask.dataframe as dd

df = dd.read_csv('filename.csv')

df.head(10)   # return first 10 rows
df.tail(10)   # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return the mean
df.groupby(df.my_field).value.mean().compute()

# slice by column
df[df.my_field == 'XYZ'].compute()
Answer 9:
Please try this:
with open('filename', 'r', buffering=100000) as f:
    for line in f:
        print(line)
Answer 10:
f = open('filename', 'r').read()
f1 = f.split('\n')
for i in range(len(f1)):
    do_something_with(f1[i])
Hope this helps.