Parsing a large (9GB) file using Python


Don't read the whole file into memory in one go; produce records by making use of the blank lines that separate them. Write the data out with the csv module for ease of producing pipe-delimited rows.

The following code reads the input file one line at a time and writes out a CSV row per record as it goes. It never holds more than one line in memory, plus the one record currently being constructed.

import csv

fields = ('productId', 'userId', 'profileName', 'helpfulness', 'rating', 'time', 'summary', 'text')

with open("largefile.txt", "r") as myfile, open(outnamename,'w', newline='') as fw:
    writer = csv.DictWriter(fw, fields, delimiter='|')

    record = {}
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = {}
            continue

        field, value = line.split(': ', 1)
        # 'product/productId' becomes column 'productId', etc.
        record[field.partition('/')[-1].strip()] = value.strip()

    if record:
        # handle last record
        writer.writerow(record)

This code does assume that the file contains text before a colon of the form category/key, such as product/productId, review/userId, etc. The part after the slash is used for the CSV columns; the fields list at the top reflects these keys.
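For example, a single record in the input might look like this (the values here are hypothetical, but the category/key layout matches the description above):

product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
review/rating: 5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of these and found them all to be of good quality.

If you also want a header row in the output, csv.DictWriter provides writer.writeheader() for exactly that.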

Alternatively, you can remove that fields list and use a csv.writer instead, gathering the record values in a list:

import csv

with open("largefile.txt", "r") as myfile, open(outnamename,'wb') as fw:
    writer = csv.writer(fw, delimiter='|')

    record = []
    for line in myfile:
        if not line.strip() and record:
            # empty line is the end of a record
            writer.writerow(record)
            record = []
            continue

        # the key before ': ' is ignored here; only the value is kept
        value = line.split(': ', 1)[1]
        record.append(value.strip())

    if record:
        # handle last record
        writer.writerow(record)

This version requires that all fields are present in every record and appear in the file in a fixed order.

Use "readline()" to read the fields of a record one by one. Or you can use read(n) to read "n" bytes.

Don't read the whole file into memory at once. Instead, iterate over it line by line, group the lines into records yourself, and use Python's csv module to write the records out:

import csv

def records(infile):
    # group lines into records; a blank line marks the end of a record
    record = []
    for line in infile:
        if line.strip():
            record.append(line.strip())
        elif record:
            yield record
            record = []
    if record:
        yield record  # a final record may lack a trailing blank line

with open('hugeinputfile.txt') as infile, open('outputfile.txt', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='|')
    for record in records(infile):
        writer.writerow([item.split(':', 1)[-1].strip() for item in record])
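With the hypothetical record shown earlier, any of these approaches would emit a pipe-delimited row along the lines of:

B001E4KFG0|A3SGXH7AUHU8GW|delmartian|1/1|5.0|1303862400|Good Quality Dog Food|I have bought several of these and found them all to be of good quality.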

A couple of things to note here:

  • Use with to open files. Why? Because using with ensures that the file is close()d, even if an exception interrupts the script.

Thus:

with open('myfile.txt') as f:
    do_stuff_to_file(f)

is equivalent to:

f = open('myfile.txt')
try:
    do_stuff_to_file(f)
finally:
    f.close()

To be continued... (I'm out of time ATM)
