How does a web crawler work?

Submitted by 人走茶凉 on 2019-12-12 02:15:47

Question


Using some basic website scraping, I am trying to prepare a price-comparison database that will make searching easier for users. Now, I have several questions:

Should I use file_get_contents() or curl to get the contents of the required web page?

$link = "http://xyz.com";
$res55 = curl_init($link);
curl_setopt($res55, CURLOPT_RETURNTRANSFER, true); // return the response body instead of printing it
curl_setopt($res55, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
$result = curl_exec($res55);
curl_close($res55); // free the handle when done

Further, every time I crawl a web page, I fetch a lot of links to visit next. This may take a long time (days if you crawl big websites like eBay). In that case, my PHP code will time out. What is the best way to automate this? Is there a way to prevent PHP from timing out by making changes on the server, or is there another solution?


Answer 1:


So, in that case my PHP code will time out and it won't continue that long.

Are you doing this in the code that's driving your web page? That is, when someone makes a request, are you crawling right then and there to build the response? If so, then yes there is definitely a better way.

If you have a list of the sites you need to crawl, you can set up a scheduled job (using cron for example) to run a command-line application (not a web page) to crawl the sites. At that point you should parse out the data you're looking for and store it in a database. Your site would then just need to point to that database.
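
A minimal sketch of that setup, assuming a hypothetical crawl.php script run by cron, a placeholder prices table, and placeholder database credentials (adapt the parsing and schema to your sites):

#!/usr/bin/env php
<?php
// crawl.php — run by cron rather than from a web request, e.g.:
//   0 3 * * * /usr/bin/php /path/to/crawl.php
// The site list, DSN, credentials, and table layout below are placeholders.

// Placeholder extraction logic: adapt the parsing to the sites you scrape.
function parsePrices($html)
{
    $items = array();
    if (preg_match_all('/data-name="([^"]+)"[^>]*data-price="([\d.]+)"/', $html, $m, PREG_SET_ORDER)) {
        foreach ($m as $match) {
            $items[] = array('product' => $match[1], 'price' => (float) $match[2]);
        }
    }
    return $items;
}

$sites = array('http://xyz.com/products');
$pdo   = new PDO('mysql:host=localhost;dbname=prices', 'user', 'pass');

foreach ($sites as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    if ($html === false) {
        continue; // skip sites that fail; a real crawler would log this
    }

    foreach (parsePrices($html) as $item) {
        $stmt = $pdo->prepare('INSERT INTO prices (product, price, source_url) VALUES (?, ?, ?)');
        $stmt->execute(array($item['product'], $item['price'], $url));
    }
}

Your web pages then only read from the prices table, so no external requests happen while a user is waiting for a response.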

This is an improvement for two reasons:

  1. Performance
  2. Code Design

Performance: In a request/response system like a web site, you want to minimize I/O bottlenecks. The response should take as little time as possible. So you want to avoid in-line work wherever possible. By offloading this process to something outside the context of the website and using a local database, you turn a series of external service calls (slow) to a single local database call (much faster).

Code Design: Separation of concerns. This setup modularizes your code a little bit more. You have one module which is in charge of fetching the data and another which is in charge of displaying the data. Neither of them should ever need to know or care about how the other accomplishes its tasks. So if you ever need to replace one (such as finding a better scraping method) you won't also need to change the other.




Answer 2:


curl is the good option here; file_get_contents is for reading files on your own server.

You can set the timeout in curl to 0 for an unlimited timeout. You will have to raise the timeout in Apache as well.
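
For example, with the handle from the question (a minimal sketch):

// 0 tells cURL to wait indefinitely for the transfer and for the connection.
curl_setopt($res55, CURLOPT_TIMEOUT, 0);
curl_setopt($res55, CURLOPT_CONNECTTIMEOUT, 0);

On the Apache side, the limit is the Timeout directive in httpd.conf (see Answer 5 below).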




Answer 3:


I recommend curl for reading website contents.

To avoid the PHP script timing out, you can use set_time_limit. The advantage is that you can call it before each server connection, because every call restarts the timeout countdown. Passing 0 as the parameter removes the time limit entirely.
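
A minimal sketch of that pattern, with a hypothetical $urls list standing in for the links your crawler has collected:

// $urls is a placeholder for the links gathered so far.
foreach ($urls as $url) {
    set_time_limit(30); // restart the countdown: 30 more seconds for this page

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // ... parse and store $html ...
}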

Alternatively, you can set the timeout via the max_execution_time setting in php.ini, but note that this applies to all PHP scripts rather than just the crawler.

http://php.net/manual/en/function.set-time-limit.php




Answer 4:


I'd opt for cURL, since you get much more flexibility and can enable compression and HTTP keep-alive with it.

But why reinvent the wheel? Check out PHPCrawl. It uses sockets (fsockopen) to download URLs, supports multiple crawler processes at once (on Linux), and has a lot of crawling options that probably meet all of your needs. It takes care of timeouts for you as well and ships with good examples for basic crawlers; a basic usage sketch follows below.
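
For reference, basic PHPCrawl usage follows roughly the pattern below (a sketch based on the library's quick-start example; verify the include path, class, and method names against the version you install):

// Assumes the PHPCrawl library has been downloaded and is available locally.
include("libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    // Called once for every document the crawler downloads.
    function handleDocumentInfo($DocInfo)
    {
        echo $DocInfo->url . "\n";
        // Parse $DocInfo->content and store what you need here.
    }
}

$crawler = new MyCrawler();
$crawler->setURL("http://xyz.com");
$crawler->addContentTypeReceiveRule("#text/html#"); // only receive HTML pages
$crawler->go(); // start crawling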




Answer 5:


You could reinvent the wheel here, but why not look at a framework like PHPCrawl or Sphider? (although the latter may not be exactly what you're looking for)

Per the documentation, file_get_contents works best for reading files on the server, so I strongly suggest you use curl instead. As for fixing any timeout issues, set_time_limit is the option you want. set_time_limit(0) should prevent your script from timing out.

You'll want to set the timeout in Apache as well, however. Edit your httpd.conf and change the line that reads Timeout to Timeout 0 for an infinite timeout.
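
That is, in httpd.conf (the exact location of the directive varies by installation):

# httpd.conf
Timeout 0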



Source: https://stackoverflow.com/questions/11834414/how-does-a-web-crawler-work
