BeautifulSoup MemoryError When Opening Several Files in Directory

时光毁灭记忆、已成空白 submitted on 2019-12-31 01:51:17

Question


Context: Every week, I receive a list of lab results in the form of an HTML file. Each week there are about 3,000 results, and each set of results has between two and four tables associated with it. For each result/trial, I only care about some standard information that is stored in one of these tables. That table can be uniquely identified because the first cell of its first row always contains the text "Lab Results".

Problem: The following code works fine when I process one file at a time. That is, instead of looping over the directory, I point get_data = open() at a specific file. However, I want to grab the data from the past few years and would rather not handle each file individually, so I used the glob module and a for loop to cycle through every file in the directory. The issue is that I hit a MemoryError by the time I reach the third file in the directory.

The Question: Is there a way to clear/reset the memory between files? That way I could cycle through every file in the directory instead of pasting in each file name individually. As you can see in the code below, I tried clearing the variables with del, but that did not work.

Thank you.

from bs4 import BeautifulSoup
import glob
import gc

for FileName in glob.glob("\\Research Results\\*"):

    get_data = open(FileName,'r').read()

    soup = BeautifulSoup(get_data)

    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)
                out_file.close()

    del get_data
    del soup
    del tables
    gc.collect()

print ("done")

Answer 1:


I'm a beginner programmer and I faced the same problem. I did three things that seemed to solve it:

  1. Call garbage collection (gc.collect()) at the beginning of each iteration as well.
  2. Move the parsing into a function, so the global variables become local variables and are released when the function returns.
  3. Use soup.decompose().

I think the second change is probably what solved it (once the function returns, its local references are dropped and the whole parse tree can be garbage-collected), but I didn't have time to verify that and I don't want to touch working code.

For this code, the solution would look something like this:

from bs4 import BeautifulSoup
import glob
import gc

def parser(file):
    gc.collect()

    # read the file inside a with-block so the handle is closed
    # (the original called .close() on the string returned by read())
    with open(file, 'r') as in_file:
        get_data = in_file.read()

    # specify a parser explicitly to avoid bs4's "no parser was explicitly specified" warning
    soup = BeautifulSoup(get_data, 'html.parser')
    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)

    soup.decompose()
    gc.collect()
    return None


for filename in glob.glob("\\Research Results\\*"):
    parser(filename)

print ("done")


Source: https://stackoverflow.com/questions/29904161/beautifulsoup-memoryerror-when-opening-several-files-in-directory
