Question
I used wget to 'download' a site.
wget -r http://www.xyz.com
i) It returned a .css file, a .js file, index.php, and one image, img1.jpg.
ii) However, more images exist under xyz.com: typing www.xyz.com/Img2.jpg into the browser returns an image.
iii) But index.php references only a single image, img1.jpg.
iv) The site also has a robots.txt file that contains Disallow:
What change should be made to the command line so that it retrieves everything under xyz.com that is not referenced in index.php but is stored statically in the directory?
Answer 1:
Not possible. How would wget know about other files in the directory unless something links to them?
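For reference, a minimal sketch of the broadest recursive invocation (these are standard wget options; they only widen what wget fetches among resources that are actually linked and cannot discover an unlinked file such as Img2.jpg):

# recursive download, plus page requisites (images, CSS, JS referenced by the pages),
# do not ascend to the parent directory, and ignore robots.txt rules
wget -r -p -np -e robots=off http://www.xyz.com

Without a link (or a server-generated directory listing) pointing at Img2.jpg, no combination of flags will retrieve it; you would have to request its URL explicitly.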
Source: https://stackoverflow.com/questions/6520321/web-crawling-and-robots-txt