web-crawler

C# library similar to HtmlUnit

試著忘記壹切 submitted on 2019-12-19 04:24:30
Question: I need to write a standalone application which will "browse" an external resource. Is there a library in C# which automatically handles cookies and supports JavaScript (though JS is not required, I believe)? The main goal is to keep the session alive and submit forms, so I can get through a multistep registration process or "browse" a web site after login. I reviewed Html Agility Pack, but it doesn't seem to contain the functionality I need - form submission or cookie support. Thanks, Artem. Answer 1: Look at Data
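
For illustration only (shown in Python with the requests library rather than C#, since the mechanics are language-neutral): "keeping the session alive and submitting forms" boils down to a cookie jar that persists across requests plus a form POST. The URLs and form field names below are hypothetical.

    import requests

    session = requests.Session()          # keeps cookies between requests

    # Step 1: log in by submitting the login form (hypothetical URL and fields).
    session.post("https://example.com/login",
                 data={"user": "artem", "password": "secret"})

    # Step 2: later requests reuse the same session cookies automatically,
    # so the logged-in state survives while "browsing" further pages.
    profile = session.get("https://example.com/profile")
    print(profile.status_code)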

PHP: crawl a website which is using Cloudflare

最后都变了- submitted on 2019-12-19 04:14:53
Question: I want to crawl some specific values (e.g. news text) from a website (which is not my own). file_get_contents() is not working, probably blocked by php.ini. So I tried to do it with curl; the problem is that all I get is the redirection text from Cloudflare. My crawler should do something like: go to the page -> wait through the 5-second Cloudflare redirect -> curl the page. Any ideas on how to crawl the page after the Cloudflare waiting time? (in PHP) Edit: I have tried a lot of things, but the problem is still the same.. more

Nutch API advice

喜欢而已 submitted on 2019-12-19 02:49:26
Question: I'm working on a project where I need a mature crawler to do some work, and I'm evaluating Nutch for this purpose. My current needs are relatively straightforward: I need a crawler that is able to save the data to disk, and I need it to be able to recrawl only the updated resources of a site and skip the parts that are already crawled. Does anyone have any experience working with the Nutch code directly in Java, not via the command line? I would like to start simple: create a crawler (or

How to crawl with PHP Goutte and Guzzle if data is loaded by JavaScript?

南楼画角 submitted on 2019-12-19 02:43:04
Question: Many times when crawling we run into problems where content that is rendered on the page is generated with JavaScript, and therefore scrapy is unable to crawl it (e.g. AJAX requests, jQuery). Answer 1: You want to have a look at PhantomJS. There is this PHP implementation: http://jonnnnyw.github.io/php-phantomjs/ if you need it working with PHP, of course. You could read the page and then feed the contents to Guzzle, in order to use the nice functions that Guzzle gives you (like search for
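
A sketch of the same idea on a different stack (Python + Selenium instead of PHP + PhantomJS), purely to illustrate "let a browser engine execute the JavaScript first, then hand the rendered HTML to your parser". The URL is a placeholder and a local Chrome/chromedriver install is assumed.

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")            # run the browser without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/js-rendered-page")
        rendered_html = driver.page_source        # HTML *after* the AJAX content loaded
    finally:
        driver.quit()

    # rendered_html can now be fed to whatever DOM crawler / parser you prefer.
    print(len(rendered_html))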

How do I get the destination URL of a shortened URL using Ruby?

六月ゝ 毕业季﹏ submitted on 2019-12-18 19:03:56
Question: How do I take this URL http://t.co/yjgxz5Y and get the destination URL, which is http://nickstraffictricks.com/4856_how-to-rank-1-in-google/ ?

Answer 1:

    require 'net/http'
    require 'uri'
    Net::HTTP.get_response(URI.parse('http://t.co/yjgxz5Y'))['location']
    # => "http://nickstraffictricks.com/4856_how-to-rank-1-in-google/"

Answer 2: I've used open-uri for this, because it's nice and simple. It will retrieve the page, but will also follow multiple redirects:

    require 'open-uri'
    final_uri = ''
    open('http://t.co

How do I crawl an infinite-scrolling page?

走远了吗. submitted on 2019-12-18 17:04:32
Question: I'm trying to build something that crawls the content from a page with infinite scroll. However, I can't get the stuff from below the first 'break'. How do I do this? Answer 1: Infinite scrolling is almost always implemented in JavaScript using AJAX or a related technology. As such, it is not enough for your web crawler to get the HTML and parse it; it must download and execute the JavaScript, or at least scan it for the AJAX calls. Doing a full JavaScript execution is probably best (i.e., will be most
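
A common alternative to full JavaScript execution is to find the AJAX endpoint the page calls (visible in the browser's network inspector) and page through it directly. A minimal Python sketch, assuming a hypothetical JSON endpoint with a "page" parameter and an "items" array:

    import requests

    ENDPOINT = "https://example.com/api/feed"   # hypothetical AJAX endpoint
    session = requests.Session()

    page = 1
    while True:
        resp = session.get(ENDPOINT, params={"page": page}, timeout=10)
        resp.raise_for_status()
        items = resp.json().get("items", [])
        if not items:            # no more results -> stop paging
            break
        for item in items:
            print(item)          # replace with real extraction/storage
        page += 1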

Recrawl URL with Nutch just for updated sites

社会主义新天地 submitted on 2019-12-18 15:52:43
Question: I crawled one URL with Nutch 2.1, and now I want to re-crawl pages after they get updated. How can I do this? How can I know that a page is updated? Answer 1: Simply put, you can't. You need to recrawl the page to check whether it has been updated. So, according to your needs, prioritize the pages/domains and recrawl them within a time period. For that you need a job scheduler such as Quartz. You need to write a function that compares the pages. However, Nutch originally saves the pages as index files. In other
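
As a rough illustration of the "compare the pages" step (generic Python, not Nutch's API): hash the freshly fetched body and compare it with the digest stored from the previous crawl. Fetching and storage are deliberately simplified here.

    import hashlib
    import urllib.request

    def content_digest(url: str) -> str:
        """Fetch the page body and return a SHA-256 hex digest of it."""
        with urllib.request.urlopen(url, timeout=10) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    previous_digests = {}   # url -> digest saved from the last crawl (e.g. loaded from disk)

    def has_changed(url: str) -> bool:
        """True if the page content differs from the last crawl (or was never seen)."""
        digest = content_digest(url)
        changed = previous_digests.get(url) != digest
        previous_digests[url] = digest
        return changed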

Web mining or scraping or crawling? What tool/library should I use? [closed]

最后都变了- submitted on 2019-12-18 14:02:45
Question: [Closed as off-topic; not accepting answers.] I want to crawl and save some webpages as HTML. Say, crawl hundreds of popular websites and simply save their front pages and "About" pages. I've looked into many questions, but didn't find an answer to this in either the web-crawling or web-scraping questions. What library or tool should I use to build the
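
Whichever library ends up being chosen, the core of the task is small. A minimal Python sketch that fetches each site's front page and writes the raw HTML to disk; the site list and output directory are placeholders, and there is no link following or politeness delay.

    import pathlib
    import urllib.request

    SITES = ["https://example.com", "https://example.org"]   # placeholder list of sites
    OUT = pathlib.Path("pages")
    OUT.mkdir(exist_ok=True)

    for url in SITES:
        req = urllib.request.Request(url, headers={"User-Agent": "simple-crawler"})
        with urllib.request.urlopen(req, timeout=15) as resp:
            html = resp.read()
        # derive a flat filename from the URL and save the raw bytes
        filename = url.split("//", 1)[1].replace("/", "_") + ".html"
        (OUT / filename).write_bytes(html)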

Chrome DevTools: Save specific requests in Network Tab

£可爱£侵袭症+ submitted on 2019-12-18 11:55:30
Question: Can I save just specific requests from the Chrome DevTools Network tab? It would be very useful to me, since our company uses web crawling to fetch info from extranets. The most I can do is record (with the rec button) all the requests made to reach a specific piece of info, and if I want to save the desired requests/responses in a file to analyze them later, all I can do is save it all as a .har file, which saves EVERYTHING, including every resource (images, CSS, JS, etc.), filling the file
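
Since DevTools only exports the whole capture, one workaround is to export the .har file anyway and filter it afterwards. A small Python sketch that keeps only the entries whose request URL contains a given substring (the file names and the "api" filter are placeholders); HAR files are plain JSON with the requests under log.entries.

    import json

    def filter_har(in_path: str, out_path: str, url_substring: str) -> None:
        with open(in_path, encoding="utf-8") as f:
            har = json.load(f)
        entries = har["log"]["entries"]
        # keep only entries whose request URL contains the given substring
        har["log"]["entries"] = [e for e in entries
                                 if url_substring in e["request"]["url"]]
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(har, f, indent=2)

    filter_har("full_capture.har", "filtered.har", "api")   # example usage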