I have a large text file that I need to parse into a pipe-delimited text file using Python. The file looks like this (basically):
product/productId: D7SDF9S9
review/userId: asdf9uas0d8u9f
review/score: 5.0
review/some text here
product/productId: D39F99
review/userId: fasd9fasd9f9f
review/score: 4.1
review/some text here
Each record is separated by two newline characters (\n\n). I have written a parser below.
import re

with open("largefile.txt", "r") as myfile:
    fullstr = myfile.read()

allsplits = re.split("\n\n", fullstr)

articles = []

for i, s in enumerate(allsplits[0:]):
    splits = re.split("\n.*?: ", s)
    productId = splits[0]
    userId = splits[1]
    profileName = splits[2]
    helpfulness = splits[3]
    rating = splits[4]
    time = splits[5]
    summary = splits[6]
    text = splits[7]
    fw = open(outnamename, 'w')
    fw.write(productId + "|" + userId + "|" + profileName + "|" + helpfulness + "|" + rating + "|" + time + "|" + summary + "|" + text + "\n")
return
The problem is that the file I am reading in is so large that I run out of memory before it can complete. I suspect it's bombing out at the allsplits = re.split("\n\n", fullstr) line.
Can someone let me know of a way to just read in one record at a time, parse it, write it to a file, and then move to the next record?
Don't read the whole file into memory in one go; produce records by making use of those double newlines instead. Write the data with the csv module for ease of writing out your pipe-delimited records.

The following code reads the input file one line at a time, and writes out a CSV row per record as you go along. It never holds more than one line in memory, plus the one record being constructed.
import csv

fields = ('productId', 'userId', 'profileName', 'helpfulness', 'rating', 'time', 'summary', 'text')

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.DictWriter(fw, fields, delimiter='|')
    record = {}
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = {}
            continue
        field, value = line.split(': ', 1)
        record[field.partition('/')[-1].strip()] = value.strip()

    if record:
        # handle last record
        writer.writerow(record)
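Note that because each row is collected in a dictionary, csv.DictWriter writes the columns in the order given by fields no matter what order the lines appear in within a record, and its restval parameter can supply a default value for any field that happens to be missing from a particular record.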
This code assumes that the text before each colon has the form category/key, so product/productId, review/userId, etc. The part after the slash is used for the CSV column names; the fields list at the top reflects these keys.
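To see what those calls do, here is a quick illustrative snippet (the strings are simply copied from the example records above):

# how one line is broken into a column key and a value
line = 'review/userId: asdf9uas0d8u9f\n'
field, value = line.split(': ', 1)        # -> 'review/userId', 'asdf9uas0d8u9f\n'
print(field.partition('/')[-1].strip())   # -> userId   (the part after the slash)
print(value.strip())                      # -> asdf9uas0d8u9f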
Alternatively, you can remove that fields list and use a csv.writer instead, gathering the record values in a list:
import csv

with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.writer(fw, delimiter='|')
    record = []
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = []
            continue
        field, value = line.split(': ', 1)
        record.append(value.strip())

    if record:
        # handle last record
        writer.writerow(record)
This version requires that all of the record fields are present and that they appear in the input in a fixed order, since the values are written out in the order they are encountered.
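If you can't guarantee that, one possible safeguard (just a sketch of my own, not part of the answer above; the count of 8 matches the eight fields in your original parser) is a small helper that only writes complete records:

def write_if_complete(writer, record, expected=8):
    # only write rows that have every expected field; silently skip anything malformed
    if len(record) == expected:
        writer.writerow(record)

You would then call write_if_complete(writer, record) in the two places where writer.writerow(record) is called.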
Use "readline()" to read the fields of a record one by one. Or you can use read(n) to read "n" bytes.
Don't read the whole file into memory at once; instead, iterate over it line by line, and use Python's csv module to parse the records:
import csv

with open('hugeinputfile.txt', 'rb') as infile, open('outputfile.txt', 'wb') as outfile:
    writer = csv.writer(outfile, delimiter='|')
    for record in csv.reader(infile, delimiter='\n', lineterminator='\n\n'):
        values = [item.split(':')[-1].strip() for item in record[:-1]] + [record[-1]]
        writer.writerow(values)
A couple of things to note here:

- Use with to open files. Why? Because using with ensures that the file is closed (close() is called for you), even if an exception interrupts the script.

Thus:
with open('myfile.txt') as f:
    do_stuff_to_file(f)

is equivalent to:

f = open('myfile.txt')
try:
    do_stuff_to_file(f)
finally:
    f.close()
To be continued... (I'm out of time ATM)
Source: https://stackoverflow.com/questions/21653738/parsing-large-9gb-file-using-python