Bash command to convert an HTML page to a text file

醉梦人生 2020-12-09 07:38

I am a beginner with Linux. Could you please help me convert an HTML page to a text file? The text file should leave out any images and links from the web page. I want to use

10 answers
  • 2020-12-09 08:16

    I used python-boilerpipe and it works very well, so far...

  • 2020-12-09 08:17

    On macOS (OS X) you can use the command-line tool textutil to batch-convert HTML files to plain text:

    textutil -convert txt *.html
    
  • 2020-12-09 08:21

    The easiest way is to use a text browser's dump mode, which prints the rendered page as plain text (in short, the text version of the viewable HTML).

    remote file

    lynx --dump www.google.com > file.txt
    links -dump www.google.com
    

    local file

    lynx --dump ./1.html > file.txt
    links -dump ./1.html
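
    The question also asks for links to be left out. By default lynx appends a numbered "References" list of link targets to the dump; its -nolist option suppresses that list. Below is a minimal sketch that wraps the two options. The name html2txt is only an illustration and not part of lynx, and images may still show up as alt text or bracketed placeholders.

```shell
# html2txt is a hypothetical helper name, not a lynx command.
# Usage: html2txt INPUT > OUTPUT   (INPUT is a local file or a URL)
html2txt() {
    # -dump   : print the rendered page to stdout and exit
    # -nolist : omit the numbered list of link targets at the end
    lynx -dump -nolist "$1"
}

# Example (assumes ./1.html exists):
# html2txt ./1.html > file.txt
```

    Writing to stdout keeps the helper usable in pipelines, for instance with grep.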
    
  • 2020-12-09 08:21

    A Bash script to recursively convert HTML pages to text files, applied here to httpd-manual. It makes searches such as grep -Rhi 'LoadModule ssl' /usr/share/httpd/manual_dump -A 10 convenient.

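    #!/bin/bash
    # Adapted from ewwink: recursive HTML-to-text dump.
    # Dumps the /usr/share/httpd manual (up to 4 directory levels) into ./manual_dump,
    # keeping the same directory layout, with one .txt file per .html file.
    # Put this script in /usr/share/httpd for it to work (after installing the httpd-manual rpm).
    # Brace expansion is a bash feature, hence the bash shebang.

    shopt -s nullglob   # drop glob patterns that match nothing
    for file in ./manual/*{,/*,/*/*,/*/*/*}.html
    do
        rel=${file#./manual/}                  # path relative to ./manual
        out=./manual_dump/${rel%.html}.txt     # same path with a .txt extension
        mkdir -p "$(dirname "$out")"           # create the target directory
        lynx --dump "$file" > "$out"
    done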
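
    The brace pattern above stops at four directory levels. A rough sketch of the same idea with find, which walks any depth and mirrors the source layout in the output directory, might look like this. The name dump_tree and its two arguments are assumptions made for illustration, and the loop assumes file names contain no newlines.

```shell
# dump_tree is a hypothetical helper name.
# Usage: dump_tree SRC_DIR DEST_DIR
# The function body runs in a subshell, so its variables do not leak.
dump_tree() (
    src=${1%/}      # strip one trailing slash, if any
    dest=${2%/}
    find "$src" -type f -name '*.html' | while IFS= read -r file
    do
        rel=${file#"$src"/}               # path relative to SRC_DIR
        out=$dest/${rel%.html}.txt        # same layout, .txt extension
        mkdir -p "$(dirname "$out")"
        lynx -dump "$file" > "$out"
    done
)

# Example, run from /usr/share/httpd:
# dump_tree ./manual ./manual_dump
```

    Keeping the relative path in the output name avoids collisions between files that share a base name, such as index.html in different directories.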
    
  • 2020-12-09 08:25

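    Using sed to strip the tags (this leaves HTML entities, tags that span several lines, and script/style contents untouched):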

    sed -e 's/<[^>]*>//g' foo.html
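
    The one-liner above leaves entities such as &amp;amp; in the output and keeps the lines that become empty. A slightly extended sketch, still plain sed, decodes a handful of common entities and drops blank lines. The name strip_tags is made up for this example, and the approach still cannot handle tags that span lines or the contents of script and style elements.

```shell
# strip_tags is a hypothetical helper name.
# Usage: strip_tags FILE...   or   some_command | strip_tags
strip_tags() {
    # Remove tags, decode a few common entities, delete blank lines.
    # &amp; is decoded last so that "&amp;lt;" ends up as "&lt;", not "<".
    sed -e 's/<[^>]*>//g' \
        -e 's/&nbsp;/ /g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g' \
        -e '/^[[:space:]]*$/d' "$@"
}

# Example:
# strip_tags foo.html > foo.txt
```

    For anything beyond simple pages, a text browser's dump mode is usually the more reliable choice.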
    
  • 2020-12-09 08:28

    I think links is the most common tool for this. Check man links and search for "plain text" or similar. My guess is the -dump option, so search for that too. The software ships with most distributions.
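
    Since which text browser is installed varies by system, one option is to try several in turn. The sketch below assumes that lynx, links, and w3m all accept -dump, and uses whichever is found first. The name html_dump is invented for this example.

```shell
# html_dump is a hypothetical helper name.
# Prints the text rendering of a local file or URL using the first
# text browser that is available.
html_dump() {
    for tool in lynx links w3m
    do
        if command -v "$tool" > /dev/null 2>&1
        then
            "$tool" -dump "$1"
            return
        fi
    done
    echo "html_dump: none of lynx, links, w3m is installed" >&2
    return 1
}

# Example:
# html_dump ./1.html > file.txt
```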
