Question
Hi, I have a page I'd like to fetch and parse with cURL. Here is the situation:
When I go to http://register.metsad.ee/avalik/info_teatis.php?too_id=2942704201
it redirects me to [ register.metsad.ee/avalik/info_teatis.php?too_id=2942704201 ]
which is the same address, just without http:// or www. The code I use to fetch it is:
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$src = 'http://register.metsad.ee/avalik/info_teatis.php?too_id=2942704201';
$c = get_data($src);
echo $c;
As a result I get a blank white page. I also tried the Simple_Html_Dom parser, like this:
echo file_get_html($src)->plaintext;
But I still get a blank white page. When I try to parse the URL without http://, I get this error:
Warning: file_get_contents(register.metsad.ee/avalik/info_teatis.php?too_id=2942704201) [function.file-get-contents]: failed to open stream: Result too large in C:\xampp\htdocs\Trash\metsakontroll\system\c_simple_html_dom.php on line 70
cURL still gives a white screen, with no effect. When I tried to address the page as if it were a folder, like this:
http://www.metsad.ee/register/avalik/info_teatis.php?too_id=2942704201 then the server says Not Found.
I have searched everywhere. Any ideas how to read that page via cURL or Simple_Html_Dom?
Answer 1:
There is some kind of protection on the register.metsad.ee side. They return an empty response unless the User-Agent
header is set.
Failed call (empty response):
feedbee@server:~$ telnet register.metsad.ee 80
Trying 213.184.43.115...
Connected to register.metsad.ee.
Escape character is '^]'.
GET /avalik/info_teatis.php?too_id=2942704201 HTTP/1.1
Host: register.metsad.ee
HTTP/1.1 200 OK
Date: Thu, 13 Dec 2012 20:07:11 GMT
Server: Apache
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Successful call (HTML page returned):
feedbee@server:~$ telnet register.metsad.ee 80
GET http://register.metsad.ee/avalik/info_teatis.php?too_id=2942704201 HTTP/1.1
Host: register.metsad.ee
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
HTTP/1.1 200 OK
Date: Thu, 13 Dec 2012 20:13:07 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: SNS=a0e425c2aec17c38be3716b366f75749; path=/
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
762
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
...
So you need to add the following line to your code:
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0");
This is just an example; any other user agent string should work as well.
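Putting it together, here is a rough sketch of how the question's get_data() might look with the User-Agent option added. The error check, the DEFAULT_USER_AGENT constant, and the http_context_options() and get_data_stream() helpers are my own additions for illustration (they are not part of the original code). The second variant is for the file_get_contents() route and assumes the server only looks at the User-Agent header, as the telnet sessions above suggest.

```php
<?php
// Example browser-like User-Agent, copied from the successful telnet call.
const DEFAULT_USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0';

// The question's get_data() plus CURLOPT_USERAGENT and a basic error check.
function get_data($url, $userAgent = DEFAULT_USER_AGENT) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    $data = curl_exec($ch);
    if ($data === false) {
        // Surface transport errors instead of silently echoing nothing.
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("cURL request failed: $error");
    }
    curl_close($ch);
    return $data;
}

// Hypothetical helper: HTTP stream-context options carrying the User-Agent.
function http_context_options($userAgent = DEFAULT_USER_AGENT) {
    return array('http' => array('user_agent' => $userAgent, 'timeout' => 5));
}

// Hypothetical helper: the same fetch through file_get_contents().
// The URL must include http://, otherwise PHP treats it as a local file path.
function get_data_stream($url, $userAgent = DEFAULT_USER_AGENT) {
    $context = stream_context_create(http_context_options($userAgent));
    $data = file_get_contents($url, false, $context);
    if ($data === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $data;
}
```

With this in place, echo get_data($src); should print the page instead of a blank screen. If you prefer Simple_Html_Dom, the string returned by either function can be passed to str_get_html() rather than calling file_get_html() on the URL directly.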
Source: https://stackoverflow.com/questions/13867430/curl-a-domain-without-http-www