Bash command to convert an HTML page to a text file

醉梦人生 2020-12-09 07:38

I am a beginner with Linux. Could you please help me convert an HTML page to a text file? The text file should omit any images and links from the webpage. I want to use

10 answers
  • 2020-12-09 08:32

    Batch mode for local .htm and .html files (requires lynx):

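    #!/bin/sh
    # h2t: convert all .htm and .html files in the current directory to text.
    # Globbing (instead of parsing `ls`) and quoting keep file names with
    # spaces intact.

    for file in *.htm *.html
    do
        # Skip a pattern that matched no files
        [ -e "$file" ] || continue
        # Strip the extension and write the rendered text next to the source
        lynx -dump "$file" > "${file%.*}.txt"
    done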
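
    Since the question asks for output without links, here is a minimal single-file sketch that adds lynx's -nolist option, which suppresses the numbered link list that -dump otherwise appends. The function names txt_name and html_to_txt are made up for this example and are not part of lynx.

```shell
#!/bin/sh
# Sketch only: txt_name and html_to_txt are hypothetical helper names.

# Derive the output name by replacing the last extension: page.html -> page.txt
txt_name() {
    printf '%s.txt\n' "${1%.*}"
}

# Render one HTML file as plain text.
# -dump   writes the formatted page to stdout
# -nolist omits the list of link targets at the end of the dump
html_to_txt() {
    lynx -dump -nolist "$1" > "$(txt_name "$1")"
}

# Usage (assumes lynx is installed):
#   html_to_txt page.html
```

    Keeping the name derivation in its own function makes it easy to change where the .txt files are written.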
    
  • 2020-12-09 08:34

    You can use html2text.py on the command line.

    Usage: html2text.py [(filename|url) [encoding]]

    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      --ignore-links        don't include any formatting for links
      --ignore-images       don't include any formatting for images
      -g, --google-doc      convert an html-exported Google Document
      -d, --dash-unordered-list
                            use a dash rather than a star for unordered list items
      -b BODY_WIDTH, --body-width=BODY_WIDTH
                            number of characters per output line, 0 for no wrap
      -i LIST_INDENT, --google-list-indent=LIST_INDENT
                            number of pixels Google indents nested lists
      -s, --hide-strikethrough
                            hide strike-through text. only relevent when -g is
                            specified as well
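
    As a rough illustration of how the options listed above combine for the question's use case (no links, no images), the sketch below wraps one call in a function. The name strip_html and the file names are placeholders, and it assumes html2text.py is executable and on your PATH.

```shell
#!/bin/sh
# Sketch only: strip_html is a hypothetical wrapper name.
# Assumes html2text.py is on PATH and accepts the options shown in its
# usage text above.
strip_html() {
    src=$1
    out=${src%.*}.txt    # page.html -> page.txt
    # --ignore-links / --ignore-images drop link and image formatting;
    # --body-width=0 disables line wrapping
    html2text.py --ignore-links --ignore-images --body-width=0 "$src" > "$out"
}

# Usage:
#   strip_html page.html
```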
    
  • 2020-12-09 08:40

    You could install Node.js and then globally install the html-to-text module:

    npm install -g html-to-text
    

    Then use it like this:

    html-to-text < stuff.html > stuff.txt
    
  • 2020-12-09 08:40

    On Ubuntu/Debian, html2text is a good choice: http://linux.die.net/man/1/html2text
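
    A minimal sketch of how that might look, assuming the packaged html2text behaves as described in the linked man page (the -o and -nobs options are taken from there; newer package versions may differ). The helper name to_text is made up for this example.

```shell
#!/bin/sh
# Install once on Debian/Ubuntu:
#   sudo apt-get install html2text

# Sketch only: to_text is a hypothetical helper name.
to_text() {
    # -nobs avoids backspace sequences used for bold/underline rendering
    # -o    writes to the named file instead of stdout
    html2text -nobs -o "${1%.*}.txt" "$1"
}

# Usage:
#   to_text page.html    # writes page.txt
```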
