I am a beginner with Linux. Would you please help me convert an HTML page to a text file? The text file should have any images and links from the webpage removed. I want to use
I used python-boilerpipe and it works very well, so far...
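For reference, typical usage looks roughly like this (a minimal sketch, assuming the python-boilerpipe package and its Java dependency are installed; the extractor name, URL and output file name are just examples):

from boilerpipe.extract import Extractor

# Extract the main text content of a page, dropping navigation, images and links.
extractor = Extractor(extractor='ArticleExtractor', url='http://www.example.com/')
text = extractor.getText()

with open('page.txt', 'w', encoding='utf-8') as f:
    f.write(text)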
On OS X you can use the command-line tool textutil to batch-convert HTML files to txt format:
textutil -convert txt *.html
The easiest way is to use a text-mode browser and its dump option (the dump is, in short, the plain-text version of the rendered HTML).
For a remote file:
lynx --dump www.google.com > file.txt
links -dump www.google.com
For a local file:
lynx --dump ./1.html > file.txt
links -dump ./1.html
Bash script to recursively convert HTML pages to text files. I applied it to the httpd-manual; it makes grep -Rhi 'LoadModule ssl' /usr/share/httpd/manual_dump -A 10 work conveniently.
#!/bin/bash
# Adapted from ewwink: recursive HTML-to-text dump.
# Dumps the /usr/share/httpd manual (up to 4 levels deep) into ./manual_dump,
# keeping the directory layout so files like index.html do not overwrite each other.
# Put this script in /usr/share/httpd for it to work (after installing the httpd-manual rpm).
for file in ./manual/*{,/*,/*/*,/*/*/*}.html
do
    new=${file#./manual/}    # path relative to ./manual
    new=${new%.html}         # drop the .html extension
    mkdir -p "./manual_dump/$(dirname "$new")"
    lynx --dump "$file" > "./manual_dump/${new}.txt"
done
Using sed
sed -e 's/<[^>]*>//g' foo.html
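That regex-based approach works for simple pages, but it does not decode entities like &amp; and can miss tags that span multiple lines. If Python is available, a short script using the standard library's html.parser handles both (a minimal sketch; the file name foo.html is just an example):

#!/usr/bin/env python3
# Strip tags from an HTML file and print the remaining text.
# Character/entity references are decoded automatically
# (convert_charrefs is True by default in Python 3).
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

parser = TextExtractor()
with open("foo.html", encoding="utf-8") as f:
    parser.feed(f.read())
print("".join(parser.parts))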
I think links is the most common tool for this. Check man links and search for "plain text" or similar; -dump is my guess, so search for that too. The package ships with most distributions.