I am a beginner with Linux. Could you please help me convert an HTML page to a text file? The text file should leave out any images and links from the web page. I want to use
Batch mode for local .htm and .html files (lynx required):
#!/bin/sh
# h2t: convert all .htm and .html files in the current directory to text
for file in *.htm *.html
do
    # If a pattern matches nothing, the shell leaves it unexpanded; skip it.
    [ -f "$file" ] || continue
    # Strip the extension and append .txt (page.html -> page.txt).
    new="${file%.*}.txt"
    lynx -dump "$file" > "$new"
done
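Since the question asks for links to be left out, the same loop can pass lynx's -nolist option, which suppresses the list of link URLs that -dump otherwise appends to the output. The following is a minimal sketch, not a definitive script: the function name h2t_dir and its optional directory argument are assumptions made for illustration.

```shell
# h2t_dir (hypothetical name): convert every .htm/.html file in a directory
# to a .txt file beside it, omitting the link list from the lynx dump.
h2t_dir() {
    dir="${1:-.}"    # default to the current directory
    for file in "$dir"/*.htm "$dir"/*.html
    do
        # An unmatched pattern is left as literal text; skip it.
        [ -f "$file" ] || continue
        # -dump prints the rendered page to stdout;
        # -nolist leaves out the list of link URLs.
        lynx -dump -nolist "$file" > "${file%.*}.txt"
    done
}
```

For example, h2t_dir ~/pages would write one .txt file next to each HTML file in ~/pages.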
You can also run html2text.py from the command line.
Usage: html2text.py [(filename|url) [encoding]]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --ignore-links        don't include any formatting for links
  --ignore-images       don't include any formatting for images
  -g, --google-doc      convert an html-exported Google Document
  -d, --dash-unordered-list
                        use a dash rather than a star for unordered list items
  -b BODY_WIDTH, --body-width=BODY_WIDTH
                        number of characters per output line, 0 for no wrap
  -i LIST_INDENT, --google-list-indent=LIST_INDENT
                        number of pixels Google indents nested lists
  -s, --hide-strikethrough
                        hide strike-through text. only relevant when -g is
                        specified as well
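For the case in the question (no images, no links), the two --ignore options from the help text above can be combined. This is only a sketch: the wrapper name strip_page and the H2T variable are assumptions added for illustration, and html2text.py is assumed to be on your PATH.

```shell
# H2T (hypothetical variable): the command used to run html2text.py.
H2T="${H2T:-html2text.py}"

# strip_page (hypothetical name): print one HTML file as text without
# link or image formatting, and without hard line wrapping.
strip_page() {
    "$H2T" --ignore-links --ignore-images --body-width=0 "$1"
}
```

Usage would look like: strip_page page.html > page.txt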
You could install Node.js and then globally install the html-to-text module:
npm install -g html-to-text
Then use it like this:
html-to-text < stuff.html > stuff.txt
On Ubuntu/Debian, html2text is a good choice: http://linux.die.net/man/1/html2text
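A possible invocation, going by the options documented in the linked man page (-o for the output file, -nobs to avoid backspace sequences for bold and underline when writing to a file). The function name h2t_debian and the HTML2TEXT variable are assumptions for illustration, and option names may differ in other html2text versions.

```shell
# HTML2TEXT (hypothetical variable): the html2text command to run.
HTML2TEXT="${HTML2TEXT:-html2text}"

# h2t_debian (hypothetical name): convert one HTML file to a .txt file
# with the same base name.
h2t_debian() {
    # -nobs: render bold/underline without backspace sequences
    # -o:    write the result to the named file instead of stdout
    "$HTML2TEXT" -nobs -o "${1%.*}.txt" "$1"
}
```

For example, h2t_debian page.html would write page.txt.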