Why does WWW::Mechanize GET certain pages but not others?

Posted by 断了今生、忘了曾经 on 2020-01-04 04:21:11

Question


I'm new to Perl and HTML. I'm trying to use $mech->get($url) to fetch the periodic table from http://en.wikipedia.org/wiki/Periodic_table, but it keeps failing with an error message like this:

Error GETing http://en.wikipedia.org/wiki/Periodic_table: Forbidden at PeriodicTable.pl line 13

But $mech->get($url) works fine if $url is http://search.cpan.org/.

Any help will be much appreciated!


Here is my code:

#!/usr/bin/perl -w

use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new( autocheck => 1 );

$mech = WWW::Mechanize->new();

my $table_url = "http://en.wikipedia.org/wiki/Periodic_table/";

$mech->get( $table_url );

Answer 1:


It's because Wikipedia denies access to some programs based on the User-Agent header supplied with the request.

You can make your script appear as a 'normal' web browser by setting the agent string after creating the object and before calling get(), for example:

$mech->agent( 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8' );

That worked for me with the URL in your posting. Shorter strings will probably work too.
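
For reference, a complete version of the script with the fix applied might look like the sketch below. Instead of the long literal string it uses agent_alias(), a WWW::Mechanize convenience method that selects one of a few built-in browser strings ('Mac Safari' here). The timeout and the eval wrapper are my own additions, and I have not checked which strings Wikipedia accepts today, so treat it as a starting point rather than a guaranteed fix.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 1 makes get() die on HTTP errors such as 403 Forbidden.
# timeout => 10 is an arbitrary choice for this sketch.
my $mech = WWW::Mechanize->new( autocheck => 1, timeout => 10 );

# Swap the default agent string for a browser-like one before any request.
$mech->agent_alias('Mac Safari');

# Note: no trailing slash after the article name.
my $table_url = 'http://en.wikipedia.org/wiki/Periodic_table';

# Catch the exception from autocheck so we can report it ourselves.
my $ok = eval { $mech->get($table_url); 1 };

if ($ok) {
    print "Fetched: ", ( $mech->title // '(no title)' ), "\n";
}
else {
    print "Request failed: $@";
}
```

If you would rather keep the literal string from above, replace the agent_alias() call with $mech->agent('...'); the rest stays the same.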

(I think you should also remove the trailing slash from the URL.)

WWW::Mechanize is a subclass of LWP::UserAgent; see the LWP::UserAgent documentation for more information, including on the agent() method.

You should use this workaround sparingly, though. Wikipedia explicitly denies access to some spiders in its robots.txt file, and the default user agent for LWP::UserAgent (which starts with libwww) is on that list.
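
To see what your script announces itself as by default, you can print the agent strings without making any request. This is a minimal sketch; the exact strings and version numbers depend on the module versions you have installed.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use WWW::Mechanize;

# agent() with no argument returns the current User-Agent string.
# WWW::Mechanize inherits the method from LWP::UserAgent.
my $lwp_default  = LWP::UserAgent->new->agent;
my $mech_default = WWW::Mechanize->new->agent;

print "LWP::UserAgent default: $lwp_default\n";
print "WWW::Mechanize default: $mech_default\n";
```

Comparing these strings with the entries in the site's robots.txt tells you whether your client is one of the agents being turned away.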




Answer 2:


When you have these sorts of problems, you need to watch the HTTP transactions so you can see what the web server is sending back to you. In this case, you'd see that Mech connects and gets a response, but the response is Wikipedia refusing to serve your bot (the Forbidden status in your error message). I like HTTP Scoop on the Mac.
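
If you don't have a separate sniffer handy, you can also watch the traffic from inside the script. The sketch below uses the add_handler() method that WWW::Mechanize inherits from LWP::UserAgent to dump every request and response. Turning autocheck off and setting a timeout are my own choices for the example, so that a 403 can be inspected instead of killing the script.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 so an error status is returned to us instead of dying.
my $mech = WWW::Mechanize->new( autocheck => 0, timeout => 10 );

# Dump each outgoing request (headers and any body) just before it is sent.
# Returning an empty list tells LWP to carry on with the request as normal.
$mech->add_handler( request_send => sub { shift->dump; return } );

# Dump each response once it has been fully received.
$mech->add_handler( response_done => sub { shift->dump; return } );

my $response = $mech->get('http://en.wikipedia.org/wiki/Periodic_table');

# The status line is the quickest way to see how the server answered.
print "Status: ", $response->status_line, "\n";
```

The request dump shows the User-Agent header your script actually sent, and the response dump shows the status and headers the server returned.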



Source: https://stackoverflow.com/questions/3690671/why-does-wwwmechanize-get-certain-pages-but-not-others
