Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 Answers
  •  春和景丽
    2020-11-22 04:40

    Here's the code I use on a regular basis.

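    import urllib.error
    import urllib.request

    from bs4 import BeautifulSoup


    def processText(webpage):
        """Return the text of each <p> on a page as a list of strings.

        `webpage` is a regex match object whose matched text is the URL.
        """
        # Empty list to store the processed text
        proc_text = []

        try:
            # Fetch the page; webpage.group() is the matched URL string
            with urllib.request.urlopen(webpage.group()) as news_open:
                news_soup = BeautifulSoup(news_open, "lxml")

            # string=True keeps only <p> tags whose .string is set
            # (string= is the current name of the older text= argument)
            news_para = news_soup.find_all("p", string=True)

            for item in news_para:
                # Split on whitespace and rejoin to collapse extra spaces
                para_text = " ".join(item.text.split())

                # Collect the cleaned paragraphs in a list
                proc_text.append(para_text)

        except urllib.error.HTTPError:
            # Skip pages that return an HTTP error status
            pass

        return proc_text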
    

    I hope that helps.
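
For trying the same idea without a network call or third-party packages, here is a rough sketch that swaps BeautifulSoup for the standard library's `html.parser`. It is not equivalent to the function above: it also keeps paragraphs that contain nested tags, and the names `ParagraphTextParser` and `extract_paragraphs` are made up for this illustration. It still only covers `<p>` elements, so it will not fully match what a browser copy-paste gives.

```python
from html.parser import HTMLParser


class ParagraphTextParser(HTMLParser):
    """Collect the whitespace-normalized text of each <p> element."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._chunks = None    # text pieces of the open <p>, or None
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            # A new <p> implicitly ends any paragraph still open
            self._flush()
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        # Keep text only while inside a <p> and outside script/style
        if self._chunks is not None and not self._skip_depth:
            self._chunks.append(data)

    def _flush(self):
        if self._chunks is not None:
            # Split on whitespace and rejoin to collapse extra spaces
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
        self._chunks = None

    def close(self):
        super().close()
        # Keep a final paragraph that was never closed
        self._flush()


def extract_paragraphs(html):
    """Return the cleaned text of every non-empty <p> in an HTML string."""
    parser = ParagraphTextParser()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Title</h1><p>Hello,\n   <b>world</b>!</p>"
    "<script>var x = 1;</script>"
    "<p>  Second   paragraph &amp; more. </p><p>   </p></body></html>"
)
print(extract_paragraphs(sample))
```

The whitespace handling is the same split-and-join trick as in the answer's code; only the parsing layer differs.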
