Crawl website using wget and limit total number of crawled links


You can't: wget has no option to limit the total number of crawled links, so if you want this you would have to write a tool yourself.

You could fetch the main file, parse the links manually, and fetch them one by one with a limit of 100 items. But it's not something that wget supports.
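A rough shell sketch of that manual approach, assuming GNU wget and grep are available; the URL, the crude href pattern, and the limit of 100 are placeholders:

    # fetch the main page, pull out absolute links, keep the first 100, fetch those
    wget -q -O index.html http://example.com
    grep -o 'href="http[^"]*"' index.html \
        | sed 's/^href="//; s/"$//' \
        | head -n 100 \
        | wget -q -i -          # "-i -" makes wget read the URL list from stdin

This only looks at links found on the main page; for a deeper crawl you would have to repeat the parse-and-fetch step yourself.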

You could also take a look at HTTrack for website crawling; it has quite a few extra options for this: http://www.httrack.com/

Olivier Delouya
  1. Create a fifo file (mknod /tmp/httpipe p)
  2. Do a fork:
    • in the child, run wget --spider -r -l 1 --output-file=/tmp/httpipe http://myurl
    • in the parent, read /tmp/httpipe line by line
    • parse the output with =~ m{^\-\-\d\d:\d\d:\d\d\-\- http://$self->{http_server}:$self->{tcport}/(.*)$} and print $1
    • count the lines; after 100 lines just close the file, which will break the pipe (a minimal shell sketch of this recipe follows below)
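A minimal shell sketch of the same idea, assuming http://myurl as a placeholder and a limit of 100; the original recipe uses a fork and a Perl-style regex, here the child is a background job and the URL extraction is a plain grep:

    #!/bin/bash
    # wget's log goes into a fifo; the reader stops after 100 lines
    mknod /tmp/httpipe p                     # or: mkfifo /tmp/httpipe

    # "child": wget --spider writes its log to the fifo
    wget --spider -r -l 1 --output-file=/tmp/httpipe http://myurl &

    # "parent": read the log line by line, print the URLs, stop after 100
    count=0
    while IFS= read -r line; do
        url=$(grep -o 'http://[^ ]*' <<< "$line") || continue
        echo "$url"
        count=$((count + 1))
        [ "$count" -ge 100 ] && break        # closing the fifo breaks the pipe and stops wget
    done < /tmp/httpipe

    rm /tmp/httpipe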