I am a beginner with Linux. Would you please help me convert an HTML page to a text file? The text file should have any images and links from the webpage removed. I want to use
I used python-boilerpipe and it works very well, so far...
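For reference, typical usage looks roughly like this (a minimal sketch, assuming the python-boilerpipe package and its Java dependency are installed; the extractor name, URL and output file name are just examples):

from boilerpipe.extract import Extractor

# Extract the main text content of a page, dropping navigation, images and links.
extractor = Extractor(extractor='ArticleExtractor', url='http://www.example.com/')
text = extractor.getText()

with open('page.txt', 'w', encoding='utf-8') as f:
    f.write(text)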
On OS X you can use the command-line tool textutil to batch-convert HTML files to txt format:
textutil -convert txt *.html
The easiest way is to use a text-mode browser and its dump option (the dump is, in short, the plain-text version of the rendered HTML).
For a remote file:
lynx --dump www.google.com > file.txt
links -dump www.google.com
For a local file:
lynx --dump ./1.html > file.txt
links -dump ./1.html
Bash script to recursively convert HTML pages to text files. I applied it to the httpd-manual; it makes grep -Rhi 'LoadModule ssl' /usr/share/httpd/manual_dump -A 10 work conveniently.
#!/bin/bash
# Adapted from ewwink: recursive HTML-to-text dump.
# Dumps the /usr/share/httpd manual (up to 4 levels deep) into ./manual_dump,
# keeping the directory layout so files like index.html do not overwrite each other.
# Put this script in /usr/share/httpd for it to work (after installing the httpd-manual rpm).
for file in ./manual/*{,/*,/*/*,/*/*/*}.html
do
    new=${file#./manual/}    # path relative to ./manual
    new=${new%.html}         # drop the .html extension
    mkdir -p "./manual_dump/$(dirname "$new")"
    lynx --dump "$file" > "./manual_dump/${new}.txt"
done
Using sed
sed -e 's/<[^>]*>//g' foo.html
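That regex-based approach works for simple pages, but it does not decode entities like &amp; and can miss tags that span multiple lines. If Python is available, a short script using the standard library's html.parser handles both (a minimal sketch; the file name foo.html is just an example):

#!/usr/bin/env python3
# Strip tags from an HTML file and print the remaining text.
# Character/entity references are decoded automatically
# (convert_charrefs is True by default in Python 3).
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

parser = TextExtractor()
with open("foo.html", encoding="utf-8") as f:
    parser.feed(f.read())
print("".join(parser.parts))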
I think links is the most common tool for this. Check man links and search for "plain text" or similar; -dump is my guess, so search for that too. The package ships with most distributions.