web-crawler

Is there a hashing algorithm that is tolerant of minor differences?

Submitted by 心不动则不痛 on 2019-12-21 03:57:10

Question: I'm doing some web crawling where I look for certain terms in web pages, find their locations on the page, and cache them for later use. I'd like to check each page periodically for major changes. Something like MD5 can be foiled by simply putting the current date and time on the page. Are there any hashing algorithms that tolerate this kind of minor change? Answer 1: A common way to measure document similarity is shingling, which is somewhat more involved than hashing.
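The answer's shingling idea can be sketched in a few lines: split the page text into overlapping k-word windows ("shingles") and compare the resulting sets with Jaccard similarity, so a changed timestamp only perturbs a few shingles instead of flipping the whole hash. This is a minimal illustration, not a production similarity pipeline (real systems typically hash the shingles further, e.g. with MinHash):

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

With this, a page that only swaps one word still scores close to 1.0, whereas an MD5 of the raw bytes would change completely.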

Websites that are particularly challenging to crawl and scrape? [closed]

Submitted by 我只是一个虾纸丫 on 2019-12-21 03:48:51

Question: I'm interested in public-facing sites (nothing behind a login or authentication) that have things like: heavy use of internal 301 and 302 redirects; anti-scraping measures (but not banning crawlers via robots.txt); non-semantic or invalid markup; content loaded via AJAX through onclick handlers or infinite scrolling.

How to generate graphical sitemap of large website [closed]

Submitted by China☆狼群 on 2019-12-21 02:43:20

Question: I would like to generate a graphical sitemap for my website. There are two stages, as far as I can tell: crawl the website and analyse the link relationships to extract the tree structure, then generate a visually pleasing render of that tree. Does anyone have advice or experience with achieving this, or know of existing…
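For the second (rendering) stage, one low-effort route is to emit Graphviz DOT text from the crawled link edges and let the `dot` tool do the layout. The sketch below assumes you already have an edge list from the crawl stage; the example edges are hypothetical:

```python
def to_dot(edges):
    """Render a list of (parent, child) link edges as a Graphviz DOT digraph."""
    lines = ["digraph sitemap {"]
    for src, dst in sorted(edges):
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical crawl output: root page links to two children.
print(to_dot([("/", "/about"), ("/", "/blog")]))
```

Piping the output through `dot -Tsvg` would then produce the visual render.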

Symfony2 Functional Testing - Click on elements with jQuery interaction

Submitted by 与世无争的帅哥 on 2019-12-20 20:41:07

Question: I'm writing functional tests for an application built with Symfony2 (2.1), and I'm stuck on a problem. Some parts of the site load when the user clicks a link or another element, but those actions are performed with jQuery and $.post calls. How can I get the Symfony2 crawler to make these calls? When I do something like this: $link = $crawler->filter('ul.line_menu a')->eq(1)->link(); $crawler = $client->click($link); the crawler gets the "href" of the "a" element and launches…

Is there a list of known web crawlers? [closed]

Submitted by 喜你入骨 on 2019-12-20 17:59:40

Question: I'm trying to get accurate download counts for some files on a web server. Looking at the user agents, some are clearly bots or web crawlers, but for many I can't tell whether they are crawlers or not, and they account for a large share of the downloads, so it's important for me to know. Is there somewhere a list of…
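Short of a maintained list, a rough first pass is a substring heuristic over the user-agent string. The marker list below is an illustrative assumption, not an authoritative registry, and it will miss any crawler that spoofs a browser user agent:

```python
# Common substrings found in crawler user agents (assumed, non-exhaustive).
BOT_MARKERS = ("bot", "crawler", "spider", "slurp", "curl", "wget")

def looks_like_bot(user_agent):
    """Heuristic: flag user agents containing common crawler markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)
```

A heuristic like this can triage the obvious cases before cross-checking the ambiguous ones against a published crawler list.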

Which web crawler for extracting and parsing data from about a thousand web sites

Submitted by [亡魂溺海] on 2019-12-20 16:50:53

Question: I'm trying to crawl about a thousand web sites, from which I'm interested in the HTML content only. I then transform the HTML into XML so it can be parsed with XPath to extract the specific content I want. I've been using the Heritrix 2.0 crawler for a few months, but I ran into huge performance, memory, and stability problems (Heritrix crashes about every day, and no attempts to limit memory usage via JVM parameters were successful). From your experience in the field, which crawler would…
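The HTML-to-XML-to-XPath step the asker describes can be sketched with the standard library alone when the markup is already well-formed XHTML. Note that `xml.etree.ElementTree` supports only a limited XPath subset, so real-world tag-soup HTML usually needs a tolerant parser (lxml or similar) in front of it:

```python
import xml.etree.ElementTree as ET

def extract_links(xhtml):
    """Parse well-formed XHTML and return every <a> href via a limited XPath."""
    root = ET.fromstring(xhtml)
    return [a.get("href") for a in root.findall(".//a")]
```

The same `findall` pattern generalizes to any element path, which covers the "extract the specific content" part once the HTML has been cleaned into XML.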

Is it possible to plug a JavaScript engine with Ruby and Nokogiri?

Submitted by 孤人 on 2019-12-20 15:34:09

Question: I'm writing an application to crawl some websites and scrape data from them, using Ruby, Curl, and Nokogiri. In most cases it's straightforward: I only need to ping a URL and parse the HTML data, and the setup works perfectly fine. However, on some sites the retrieved data depends on user input via radio buttons. That invokes some JavaScript which fetches more data from the server, and the generated URL and posted data are determined by JavaScript code. Is it…

How to crawl foursquare check-in data?

Submitted by 北城余情 on 2019-12-20 15:10:33

Question: Is it possible to crawl check-in data from foursquare in a greedy way (even if I'm not friends with all the users), just like crawling publicly available Twitter messages? If you have any experience or suggestions, please share. Thanks. Answer 1: If you have publicly available tweets containing foursquare links, you can resolve the foursquare short links (4sq.com/XXXXXX) by making a HEAD request. The HEAD request will return a URL with a check-in ID and a signature. You can use those…
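The resolution step in the answer can be done with any HTTP client that issues a HEAD request without following redirects (for example `requests.head(url, allow_redirects=False)`) and reads the `Location` header. Splitting that header into a check-in ID and signature might then look like the sketch below; the exact URL layout (`/checkin/<id>?s=<sig>`) is an assumption for illustration, not documented foursquare behavior:

```python
from urllib.parse import urlparse, parse_qs

def parse_checkin_url(location):
    """Split a resolved check-in URL into its ID and signature.

    Assumes a layout like https://foursquare.com/user/checkin/<id>?s=<sig>;
    that path shape is a guess for illustration, not a documented format.
    """
    parsed = urlparse(location)
    checkin_id = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    signature = parse_qs(parsed.query).get("s", [None])[0]
    return checkin_id, signature
```

Feeding the `Location` header from the HEAD response into this function yields the pair the answer refers to.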

How can I safely check is node empty or not? (Symfony 2 Crawler)

Submitted by 冷暖自知 on 2019-12-20 11:18:52

Question: When I try to take some nonexistent content from a page I get this error: The current node list is empty. 500 Internal Server Error - InvalidArgumentException. How can I safely check whether this content exists or not? Here are some examples that do not work: if($crawler->filter('.PropertyBody')->eq(2)->text()){ // bla bla } if(!empty($crawler->filter('.PropertyBody')->eq(2)->text())){ // bla bla } if(($crawler->filter('.PropertyBody')->eq(2)->text()) != null){ // bla bla } THANKS, I helped myself with:…

Get past request limit in crawling a web site

Submitted by 折月煮酒 on 2019-12-20 10:01:36

Question: I'm working on a web crawler that indexes sites that don't want to be indexed. My first attempt: I wrote a C# crawler that goes through each and every page and downloads them. This resulted in my IP being blocked by their servers within 10 minutes. I moved it to Amazon EC2 and wrote a distributed Python script that runs about 50 instances, which keeps me just under their threshold for booting me. That also costs about $1900 a month... I moved back to my initial idea and put it behind a shortened…