Parsing HTML files in the same directory in Python

Submitted by 。_饼干妹妹 on 2020-07-10 10:32:49

Question


I have written the following code to parse HTML files:

from bs4 import BeautifulSoup
import re
import os
from os.path import join

for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            with open(thefile, 'r') as f:
                contents = f.read()
                soup = BeautifulSoup(contents, 'lxml')
                Initialtext = soup.get_text()
                MediumText = Initialtext.lower().split()

                # keep only tokens made of letters/underscores (no digits or punctuation)
                clean_tokens = [t for t in MediumText
                                if re.match(r'[^\W\d]*$', t)]

                removementWords = ['here', 'than']

                FinalResult = set()
                for token in clean_tokens:
                    if token not in removementWords:
                        FinalResult.add(token)

I am struggling with two problems:

1) The results are stored in a separate collection for each file, while I need a single list with the results from all files;

2) As a result, I cannot remove the duplicates that occur across different files.

How can I handle them?


Answer 1:


I think I found the problem: FinalResult is re-created for every file, so each file's results overwrite the previous ones. Here is your code with a small change.

from bs4 import BeautifulSoup
import re
import os
from os.path import join

# Define the set here, before the loops, so that it collects the results from all files.
FinalResult = set()

for (dirname, dirs, files) in os.walk('.'):
    for filename in files:
        if filename.endswith('.html'):
            thefile = os.path.join(dirname, filename)
            with open(thefile, 'r') as f:
                contents = f.read()
                soup = BeautifulSoup(contents, 'lxml')
                Initialtext = soup.get_text()
                MediumText = Initialtext.lower().split()

                # keep only tokens made of letters/underscores (no digits or punctuation)
                clean_tokens = [t for t in MediumText
                                if re.match(r'[^\W\d]*$', t)]

                removementWords = ['here', 'than']

                # FinalResult = set()  <- wrong place: this would reset the set for every file
                for token in clean_tokens:
                    if token not in removementWords:
                        FinalResult.add(token)
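
For reference, the same idea can be wrapped in a function. This is only a minimal sketch, not the asker's exact code: the names collect_tokens, TextExtractor, STOP_WORDS and TOKEN_PATTERN are made up for illustration, the files are assumed to be UTF-8, and the standard library's HTMLParser stands in for BeautifulSoup so that the example runs without extra packages (with BeautifulSoup you would keep soup.get_text() as above). The point it shows is that the set is created once, before the loops, and only updated inside them.

```python
import os
import re
from html.parser import HTMLParser

STOP_WORDS = {'here', 'than'}                # hypothetical name for removementWords
TOKEN_PATTERN = re.compile(r'[^\W\d]*$')     # same pattern as in the question


class TextExtractor(HTMLParser):
    """Rough stand-in for soup.get_text(): collects the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def collect_tokens(root='.'):
    """Return one set with the cleaned tokens of every .html file under root."""
    unique_tokens = set()                    # created once, shared by all files
    for dirname, _dirs, files in os.walk(root):
        for filename in files:
            if not filename.endswith('.html'):
                continue
            path = os.path.join(dirname, filename)
            # assumption: the files are UTF-8 encoded
            with open(path, 'r', encoding='utf-8') as f:
                extractor = TextExtractor()
                extractor.feed(f.read())
            tokens = ''.join(extractor.parts).lower().split()
            # a set drops duplicates automatically, also across files
            unique_tokens.update(
                t for t in tokens
                if TOKEN_PATTERN.match(t) and t not in STOP_WORDS
            )
    return unique_tokens
```

If you need a list at the end, sorted(collect_tokens('.')) gives the unique tokens in alphabetical order.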


Source: https://stackoverflow.com/questions/62187151/the-parsing-of-html-files-at-the-same-directory-in-the-python
