Spider a Website and Return URLs Only

Asked by 遥遥无期 on 2020-11-29 16:31

I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget's --spider option, but when I pipe its output through grep I can't seem to find the right filter to make it work.

3 Answers

一向 (OP) · answered 2020-11-29 17:26

    The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.

    # wget logs each request to stderr as "--<timestamp>--  <URL>"; after
    # redirecting stderr to stdout, keep those lines and take the URL (field 3).
    wget --spider --force-html -r -l2 "$url" 2>&1 \
      | grep '^--' | awk '{ print $3 }' \
      | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
      > urls.m3u
    

    This gives me a list of the content URIs (resources that aren't images, CSS, or JS files) that were spidered. From there, I can send the URIs off to a third-party tool for processing to meet my needs.
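    For illustration only (curl here is just a stand-in for whatever third-party tool actually does the processing), the hand-off could be as simple as a loop over the list:

    # Hypothetical downstream step: print "<HTTP status> <URI>" for each spidered URI.
    while IFS= read -r uri; do
      printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "$uri")" "$uri"
    done < urls.m3u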

    The output still needs to be streamlined slightly (it contains duplicates, as shown above), but it's almost there and I haven't had to do any parsing myself.
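    A minimal way to trim those duplicates (assuming the urls.m3u produced above) is to keep only the first occurrence of each line, preserving the order in which the URIs were discovered:

    # Drop repeated URIs but keep the order in which wget found them.
    awk '!seen[$0]++' urls.m3u > urls.dedup.m3u && mv urls.dedup.m3u urls.m3u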
