Extracting text from HTML file using Python


一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers

不要未来只要你来 (OP)

2020-11-22 04:16
Another non-Python solution is LibreOffice:
```
soffice --headless --invisible --convert-to txt input1.html
```
I prefer this over the alternatives because every HTML paragraph is converted into a single line of text with no internal line breaks, which is what I was looking for. Other methods require post-processing. Lynx produces nice output, but not exactly what I wanted. LibreOffice can also convert between all sorts of other formats...
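Since the question asks for Python, here is a minimal sketch of driving the same command from a script. It assumes `soffice` is on your `PATH` and that LibreOffice names the output after the input file with a `.txt` extension. The helper names `build_soffice_command` and `html_to_text` are made up for illustration, and the output encoding may differ on your system.

```python
import shutil
import subprocess
from pathlib import Path


def build_soffice_command(html_path, out_dir, executable="soffice"):
    """Return the argument list for a headless HTML-to-text conversion."""
    return [
        executable,
        "--headless",
        "--invisible",
        "--convert-to", "txt",
        "--outdir", str(out_dir),  # directory where the .txt file is written
        str(html_path),
    ]


def html_to_text(html_path, out_dir="."):
    """Run LibreOffice on html_path and return the converted text."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice was not found on PATH")
    subprocess.run(build_soffice_command(html_path, out_dir), check=True)
    # Assumption: the output file is <input stem>.txt inside out_dir.
    out_file = Path(out_dir) / (Path(html_path).stem + ".txt")
    # Assumption: UTF-8 output; undecodable bytes are replaced, not raised.
    return out_file.read_text(encoding="utf-8", errors="replace")
```

Calling `html_to_text("input1.html")` should then return the same text the command line version writes to `input1.txt`. Keeping the command construction in its own function makes it easy to inspect or adjust the flags without launching LibreOffice.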