Spider a Website and Return URLs Only

Asked by 遥遥无期 on 2020-11-29 16:31

I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget's --spider option, but when I pipe its output through grep I can't seem to find the right filter to make it work.

3 Answers

一向 (OP) · answered 2020-11-29 17:26

    The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.

    # wget logs each request to stderr as "--<timestamp>--  <URL>"; after
    # redirecting stderr to stdout, keep those lines and take the URL (field 3).
    wget --spider --force-html -r -l2 "$url" 2>&1 \
      | grep '^--' | awk '{ print $3 }' \
      | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
      > urls.m3u
    

    This gives me a list of the content URIs (resources that aren't images, CSS, or JS files) that were spidered. From there, I can send the URIs off to a third-party tool for processing to meet my needs.
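    For illustration only (curl here is just a stand-in for whatever third-party tool actually does the processing), the hand-off could be as simple as a loop over the list:

    # Hypothetical downstream step: print "<HTTP status> <URI>" for each spidered URI.
    while IFS= read -r uri; do
      printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "$uri")" "$uri"
    done < urls.m3u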

    The output still needs to be streamlined slightly (it contains duplicates, as shown above), but it's almost there and I haven't had to do any parsing myself.
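    A minimal way to trim those duplicates (assuming the urls.m3u produced above) is to keep only the first occurrence of each line, preserving the order in which the URIs were discovered:

    # Drop repeated URIs but keep the order in which wget found them.
    awk '!seen[$0]++' urls.m3u > urls.dedup.m3u && mv urls.dedup.m3u urls.m3u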
