goutte

Setting proxy in Goutte

我的未来我决定 submitted on 2021-02-16 08:56:16
Question: I've tried following Guzzle's docs to set a proxy, but it's not working. The official GitHub page for Goutte is pretty dead, so I can't find anything there. Does anyone know how to set a proxy? This is what I've tried: $client = new Client(); $client->setHeader('User-Agent', $user_agent); $crawler = $client->request('GET', $request, ['proxy' => $proxy]); Answer 1: You are thinking along the right lines, but in Goutte\Client::doRequest(), when the Guzzle request is created: $guzzleRequest = $this->getClient()->createRequest( $request-
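
One way to apply a proxy to every request, sketched under the assumption of Goutte 3.x with Guzzle 6 installed via Composer: BrowserKit's request() takes form parameters as its third argument, not Guzzle options, so the proxy is configured on the Guzzle client that Goutte wraps. The helper name makeProxiedClient, the proxy address, and the user agent are illustrative, not taken from the question.

```php
<?php
// Assumption: composer require fabpot/goutte:^3.2 (which pulls in Guzzle 6).
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

/**
 * Hypothetical helper: build a Goutte client whose requests all go through $proxy.
 */
function makeProxiedClient(string $proxy, string $userAgent): Client
{
    // Options passed to the Guzzle constructor become defaults for every request.
    $guzzle = new GuzzleClient(['proxy' => $proxy]);

    $client = new Client();
    $client->setClient($guzzle); // hand the configured Guzzle client to Goutte
    $client->setHeader('User-Agent', $userAgent);

    return $client;
}

// Illustrative values only; nothing is sent until request() is called.
$client = makeProxiedClient('http://127.0.0.1:8080', 'MyCrawler/1.0');
// $crawler = $client->request('GET', 'https://example.com/');
```

Older Goutte releases built on Guzzle 4 or 5 expect the option under a 'defaults' key instead, so check which Guzzle version is installed before copying this.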

How to unit test a web scraping service php unit

こ雲淡風輕ζ submitted on 2021-01-29 22:56:28
Question: I am currently developing a project in PHP + Laravel that needs to scrape data from two different websites. I am using the Goutte scraping library. I have 10 integration tests, in which I use the Crawler object that Goutte's Client provides to get the specific data I want to scrape from each website. The tests work just fine (I even used the Infection library for mutation testing)... But the thing is that I think there could be a way to unit test all the functions (therefore, the tests would
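
A common way to unit test extraction logic without touching the network is to keep the parsing in a function that accepts a Crawler and to feed it fixture HTML. A minimal sketch, assuming symfony/dom-crawler and symfony/css-selector are installed via Composer (Goutte depends on both); the function name, selector, and markup are hypothetical and not from the question.

```php
<?php
// Assumption: dependencies are installed with Composer in the current directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical extraction function: returns the text of every h2.headline node.
 * It only depends on a Crawler, so it can be tested with a static HTML string.
 */
function extractHeadlines(Crawler $crawler): array
{
    return $crawler->filter('h2.headline')->each(
        function (Crawler $node): string {
            return trim($node->text());
        }
    );
}

// Fixture markup standing in for a saved copy of the scraped page.
$fixture = '<html><body>'
    . '<h2 class="headline">First</h2>'
    . '<h2 class="headline">Second</h2>'
    . '<h2>Other</h2>'
    . '</body></html>';

$headlines = extractHeadlines(new Crawler($fixture));
```

The integration tests can then stay limited to checking that the live pages still match the fixtures.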

Can Goutte/Guzzle be forced into UTF-8 mode?

时光毁灭记忆、已成空白 submitted on 2019-12-29 04:53:41
Question: I'm scraping a UTF-8 site using Goutte, which internally uses Guzzle. The site declares UTF-8 in a meta tag, thus: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> However, the Content-Type header is: Content-Type: text/html and not: Content-Type: text/html; charset=utf-8 Thus, when I scrape, Goutte does not detect that the page is UTF-8 and reads the data incorrectly. The remote site is not under my control, so I can't fix the problem there! Here's a set of scripts to
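
One possible workaround, sketched under the assumption that the raw response body is available as a string and that symfony/dom-crawler and symfony/css-selector are installed: build the Crawler yourself with addHtmlContent(), whose second argument names the charset explicitly instead of deriving it from the Content-Type header. The helper name and sample markup are hypothetical.

```php
<?php
// Assumption: dependencies are installed with Composer in the current directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: parse $html as UTF-8 regardless of what the
 * response headers declared.
 */
function crawlerFromUtf8(string $html, ?string $uri = null): Crawler
{
    $crawler = new Crawler(null, $uri);
    $crawler->addHtmlContent($html, 'UTF-8'); // charset is stated, not guessed

    return $crawler;
}

// Sample body containing a non-ASCII character, encoded as UTF-8.
$html = '<html><head><title>t</title></head><body><p>café</p></body></html>';
$text = crawlerFromUtf8($html)->filter('p')->text();
```

With Goutte, the body would come from the client's last response rather than a literal string; how to obtain it depends on the Goutte version in use.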

Login and submit form with web-crawler

别说谁变了你拦得住时间么 submitted on 2019-12-25 08:59:47
Question: In my web crawler I pass and submit data like this: $client = new Client(); $crawler = $client->request('GET', 'link'); $form = $crawler->filter('.default')->form(); $crawler = $client->submit($form, array( 'login'=>'ud', 'password'=>'pw' )); But if I use var_dump($crawler); I see that I never get data from the website after login, because it redirects me and var_dump shows data from the page where I submitted the form. After login, I want to move to the new link to submit a form $client-

How to crawl with php Goutte and Guzzle if data is loaded by Javascript?

南楼画角 submitted on 2019-12-19 02:43:04
Question: Many times when crawling we run into problems where content that is rendered on the page is generated with JavaScript, and therefore scrapy is unable to crawl it (e.g. AJAX requests, jQuery). Answer 1: You want to have a look at PhantomJS. There is this PHP implementation: http://jonnnnyw.github.io/php-phantomjs/ if you need to have it working with PHP, of course. You could read the page and then feed the contents to Guzzle, in order to use the nice functions that Guzzle gives you (like search for

Mink/Goutte How to check checkbox without attribute in Goutte?

牧云@^-^@ submitted on 2019-12-18 09:48:22
Question: I apologize in advance, but I am very much a beginner. I am trying to check a checkbox that has no id or name. <span class="ps-align-left"> <input type="checkbox" value="43899" style="background-color: rgb(252, 252, 252);"/> 43899 </span> I figured out how to do it with Selenium2Driver, using the "find" function like this: public function checkOption() { $this->getSession()->getPage()->find('css', '.ps-align-left>input')->check(); } And it works fine, but when I try to run the test with the headless browser Goutte I get

InvalidArgumentException - The current node list is empty.

南楼画角 submitted on 2019-12-12 03:46:38
Question: I am using the Goutte scraper to scrape data, and I'm getting an error like InvalidArgumentException - The current node list is empty. Below is the code I'm using: $string = $crawler->filter('div#links.results')->html(); if ( empty( $string ) ) return false; $dom = new \DOMDocument; $state = libxml_use_internal_errors(true); $dom->loadHTML($string); libxml_use_internal_errors($state); $xp = new \DOMXPath($dom); $divNodeList = $xp->query('//div[contains(@class, "results_links_deep")] [contains(
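
html() throws this exception when the preceding filter() matched no nodes, so the empty($string) check in the snippet is never reached. A minimal sketch of guarding with count() first, assuming symfony/dom-crawler and symfony/css-selector are installed via Composer; the helper name and sample markup are hypothetical.

```php
<?php
// Assumption: dependencies are installed with Composer in the current directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: return the inner HTML of div#links.results,
 * or null when the selector matches nothing.
 */
function resultsHtmlOrNull(Crawler $crawler): ?string
{
    $results = $crawler->filter('div#links.results');

    // Check the node list before calling html(), which throws on an empty list.
    if ($results->count() === 0) {
        return null;
    }

    return $results->html();
}

// One page with the expected container, one without it.
$found = resultsHtmlOrNull(new Crawler(
    '<html><body><div id="links" class="results"><p>hit</p></div></body></html>'
));
$missing = resultsHtmlOrNull(new Crawler('<html><body><p>none</p></body></html>'));
```

If the selector never matches on the live page, the container may be added by JavaScript after load, which Goutte does not execute.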

How to click a link which is created javascript with Goutte

*爱你&永不变心* submitted on 2019-12-11 07:12:47
Question: Does anyone know how I can click a link that is set up by JavaScript with Goutte? For example, the link looks like this: <a onclick="document.getElementById('Detail').click()" href="#a" id="Link01" name="Link01">get detail</a> Answer 1: Goutte cannot execute JavaScript. Selenium or CasperJS is a better choice. Source: https://stackoverflow.com/questions/27587679/how-to-click-a-link-which-is-created-javascript-with-goutte