Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  •  不要未来只要你来
    2020-11-22 04:16

    Another non-Python solution is LibreOffice:

    soffice --headless --invisible --convert-to txt input1.html
    

    I prefer this over the alternatives because every HTML paragraph is converted into a single line of text (no line breaks inside a paragraph), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I wanted. Besides, LibreOffice can convert between all sorts of formats...
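Since the question asks for Python, the same command can be driven from a script. This is only a minimal sketch: the helper names `build_soffice_command` and `html_to_text` are hypothetical. It assumes `soffice` is on the PATH, that the output file takes the input's base name with a `.txt` suffix, and that the output can be read as UTF-8 (which may not hold on every system). The `--outdir` option is added here to control where the result lands.

```python
import subprocess
from pathlib import Path


def build_soffice_command(html_path, out_dir="."):
    """Build the argument list for a headless LibreOffice HTML-to-text conversion."""
    return [
        "soffice",
        "--headless",
        "--invisible",
        "--convert-to", "txt",
        "--outdir", str(out_dir),  # added here; the original command writes to the current directory
        str(html_path),
    ]


def html_to_text(html_path, out_dir="."):
    """Run the conversion and return the extracted text.

    Assumptions: `soffice` is on PATH, and the output file is named
    <input stem>.txt inside out_dir.
    """
    subprocess.run(build_soffice_command(html_path, out_dir), check=True)
    txt_path = Path(out_dir) / (Path(html_path).stem + ".txt")
    # The output encoding is an assumption; undecodable bytes are replaced.
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling `html_to_text("input1.html")` would then return the converted text as a string, provided LibreOffice is installed.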
