Extract News article content from stored .html pages
问题 I am reading text from html files and doing some analysis. These .html files are news articles. Code: html = open(filepath,'r').read() raw = nltk.clean_html(html) raw.unidecode(item.decode('utf8')) Now I just want the article content and not the rest of the text like advertisements, headings etc. How can I do so relatively accurately in python? I know some tools like Jsoup(a java api) and bolier but I want to do so in python. I could find some techniques using bs4 but there limited to one