multi-thread, multi-curl crawler in PHP

Submitted by 让人想犯罪 on 2019-12-04 04:57:47
rdlowrey

DISCLAIMER: This answer links an open-source project with which I'm involved. There. You've been warned.

The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.

Limiting the number of concurrent connections is easily accomplished. Consider:

<?php

use Artax\Client, Artax\Response;

require dirname(__DIR__) . '/autoload.php';

$client = new Client;

// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);

$requests = array(
    'so-home'    => 'http://stackoverflow.com',
    'so-php'     => 'http://stackoverflow.com/questions/tagged/php',
    'so-python'  => 'http://stackoverflow.com/questions/tagged/python',
    'so-http'    => 'http://stackoverflow.com/questions/tagged/http',
    'so-html'    => 'http://stackoverflow.com/questions/tagged/html',
    'so-css'     => 'http://stackoverflow.com/questions/tagged/css',
    'so-js'      => 'http://stackoverflow.com/questions/tagged/javascript'
);

$onResponse = function($requestKey, Response $r) {
    echo $requestKey, ' :: ', $r->getStatus();
};

$onError = function($requestKey, Exception $e) {
    echo $requestKey, ' :: ', $e->getMessage();
};

$client->requestMulti($requests, $onResponse, $onError);

IMPORTANT: In the above example the Client::requestMulti method makes all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client opens new connections for the first two requests and subsequently reuses those same sockets for the remaining requests, queuing each request until one of the two sockets becomes available.
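That queue-then-reuse behaviour can be illustrated with a small stand-alone sketch. This is hypothetical illustration code, not Artax internals: a pool with a limit of 2 "opens" connections for the first two requests, queues any further URLs, and hands a freed socket to the next queued URL when a request completes.

```php
<?php

// Illustration only: a bounded pool that queues work past its limit,
// mirroring the per-host connection behaviour described above.
class BoundedPool {
    private $limit;
    private $active = 0;
    private $queue  = array();
    private $log    = array();

    public function __construct($limit) {
        $this->limit = $limit;
    }

    public function request($url) {
        if ($this->active < $this->limit) {
            // Below the limit: open a new connection immediately
            $this->active++;
            $this->log[] = "open: $url";
        } else {
            // At the limit: wait for a socket to free up
            $this->queue[] = $url;
            $this->log[] = "queued: $url";
        }
    }

    public function complete() {
        if ($this->queue) {
            // A finished request frees its socket for the next queued URL
            $next = array_shift($this->queue);
            $this->log[] = "reuse socket for: $next";
        } else {
            $this->active--;
        }
    }

    public function log() {
        return $this->log;
    }
}
```

With a limit of 2, the third request is queued until `complete()` frees one of the two sockets, at which point it is dispatched on the reused connection.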

You could try something like this. It's untested, but it should convey the idea:

$request_pool = array();

function CreateHandle($url) {
    $handle = curl_init($url);

    // Required so curl_multi_getcontent() returns the body
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);

    // set any other curl options here

    return $handle;
}

function Process($data) {
    global $request_pool;

    // do something with data

    // Queue a newly discovered URL for crawling
    array_push($request_pool, CreateHandle($some_new_url));
}

function RunMulti() {
    global $request_pool;

    $multi_handle = curl_multi_init();

    $active_request_pool = array();

    $running = 0;
    $active_request_count = 0;
    $active_request_max = 10; // adjust as necessary
    do {
        // Top up the active set from the waiting pool
        while (($active_request_count < $active_request_max) && (count($request_pool) > 0)) {
            $request = array_shift($request_pool);
            curl_multi_add_handle($multi_handle, $request);
            $active_request_pool[(int)$request] = $request;

            $active_request_count++;
        }

        // Drive the transfers; repeat while curl asks to be called again
        do {
            $status = curl_multi_exec($multi_handle, $running);
        } while ($status === CURLM_CALL_MULTI_PERFORM);

        // Wait for socket activity to avoid busy-looping
        if (curl_multi_select($multi_handle) === -1) {
            usleep(100000);
        }

        // Harvest completed transfers
        while ($info = curl_multi_info_read($multi_handle)) {
            $curl_handle = $info['handle'];
            call_user_func('Process', curl_multi_getcontent($curl_handle));
            curl_multi_remove_handle($multi_handle, $curl_handle);
            unset($active_request_pool[(int)$curl_handle]);
            curl_close($curl_handle);
            $active_request_count--;
        }

        // Re-check the pool directly: Process() may have queued new URLs
    } while ($active_request_count > 0 || count($request_pool) > 0);

    curl_multi_close($multi_handle);
}

You should look for a more robust solution to your problem. RabbitMQ is a very good solution that I have used. There is also Gearman, but the choice is yours.

I prefer RabbitMQ.

I will share the code I used to collect email addresses from a certain website. You can modify it to fit your needs. There were some problems with relative URLs there, and I do not use cURL here.

<?php
error_reporting(E_ALL);
$home   = 'http://kharkov-reklama.com.ua/jborudovanie/';
$writer = new RWriter('C:\parser_13-09-2012_05.txt');
set_time_limit(0);
ini_set('memory_limit', '512M');

function scan_page($home, $full_url, &$writer) {

    static $done = array();
    $done[] = $full_url;

    // Scan only internal links. Do not scan all the internet!))
    if (strpos($full_url, $home) === false) {
        return false;
    }
    $html = @file_get_contents($full_url);
    if (empty($html) || (strpos($html, '<body') === false && strpos($html, '<BODY') === false)) {
        return false;
    }

    echo $full_url . '<br />';

    preg_match_all('/([A-Za-z0-9_\-]+\.)*[A-Za-z0-9_\-]+@([A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]\.)+[A-Za-z]{2,4}/', $html, $emails);

    // $emails[0] holds the full matches; the other elements are capture groups
    if (!empty($emails[0])) {
        foreach ($emails[0] as $email) {
            if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
                $writer->write($email);
            }
        }
    }

    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);
    if (is_array($matches)) {
        foreach($matches as $match) {
            if (!empty($match[2]) && is_scalar($match[2])) {
                $url = $match[2];
                if (!filter_var($url, FILTER_VALIDATE_URL)) {
                    $url = $home . $url;
                }
                if (!in_array($url, $done)) {
                    scan_page($home, $url, $writer);
                }
            }
        }
    }
}

class RWriter {
    private $_fh = null;

    private $_written = array();

    public function __construct($fname) {
        $this->_fh = fopen($fname, 'w+');
    }

    public function write($line) {
        if (in_array($line, $this->_written)) {
            return;
        }
        $this->_written[] = $line;
        echo $line . '<br />';
        fwrite($this->_fh, "{$line}\r\n");
    }

    public function __destruct() {
        fclose($this->_fh);
    }
}

scan_page($home, 'http://kharkov-reklama.com.ua/jborudovanie/', $writer);
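The relative-URL problem mentioned above comes from the `$url = $home . $url;` line: prefixing `$home` only works for links relative to the start directory. A parse_url-based resolver handles root-relative and directory-relative links too. This is a minimal sketch; the function name resolve_url is my own, not part of the original script:

```php
<?php

// Resolve an href found in a page against the URL of that page.
// Covers absolute, root-relative ("/x"), and directory-relative ("x") links.
function resolve_url($base, $href) {
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;                          // already absolute
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if ($href !== '' && $href[0] === '/') {
        return $root . $href;                  // root-relative
    }
    // Strip the last path segment to get the base directory
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = preg_replace('~/[^/]*$~', '/', $path);
    return $root . $dir . $href;               // directory-relative
}
```

Inside `scan_page()`, the `if (!filter_var($url, FILTER_VALIDATE_URL))` branch could then call `resolve_url($full_url, $url)` instead of concatenating `$home . $url`.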