web-crawler

mechanize._mechanize.FormNotFoundError: no form matching name 'q'

喜夏-厌秋 submitted on 2019-11-28 10:41:09
Question: Can anyone help me get this form selection correct? Trying to crawl Google, I get the error: mechanize._mechanize.FormNotFoundError: no form matching name 'q'. Unusual, since I have seen several other tutorials using it. P.S. I don't plan to SLAM Google with requests; I just hope to use an automatic selector to take the effort out of finding academic citation PDFs from time to time. The form listing shows: <f GET http://www.google.com.tw/search application/x-www-form-urlencoded <HiddenControl(ie=Big5)
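The usual cause of this error is that 'q' is the name of the text control inside Google's search form, not the name of the form itself, so the form has to be selected another way, for example by index. A minimal sketch assuming mechanize and that the search form is the first form on the page:

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)                      # Google's robots.txt otherwise blocks mechanize
br.addheaders = [("User-Agent", "Mozilla/5.0")]  # the default mechanize UA is frequently rejected

br.open("http://www.google.com/")
br.select_form(nr=0)                 # select the (unnamed) form by index, not by name
br["q"] = "academic citation pdf"    # 'q' is a control inside that form
response = br.submit()
print(response.geturl())
```

Whether Google still serves a plain HTML form to scripted clients is a separate question; the sketch is only meant to show selecting the form by index and filling the 'q' control.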

How to follow all links in CasperJS?

谁都会走 submitted on 2019-11-28 10:23:00
I'm having trouble clicking all JavaScript-based links in a DOM and saving the output. The links have the form <a id="html" href="javascript:void(0);" onclick="goToHtml();">HTML</a>. The following code works great: var casper = require('casper').create(); var fs = require('fs'); var firstUrl = 'http://www.testurl.com/test.html'; var css_selector = '#jan_html'; casper.start(firstUrl); casper.thenClick(css_selector, function(){ console.log("whoop"); }); casper.waitFor(function check() { return this.getCurrentUrl() != firstUrl; }, function then() { console.log(this.getCurrentUrl()); var file_title

Mass Downloading of Webpages C#

坚强是说给别人听的谎言 submitted on 2019-11-28 10:18:36
My application requires that I download a large number of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts. for (int i = 1; i<=pages; i++) { string page_specific_link = baseurl + "&page=" + i.ToString(); try { WebClient client = new WebClient(); var pagesource = client.DownloadString(page_specific_link); client.Dispose(); sourcelist.Add(pagesource); } catch (Exception) { } } The way you approach this problem is going to depend very much on how many pages you want

Scrapy not crawling subsequent pages in order

好久不见. submitted on 2019-11-28 08:50:04
Question: I am writing a crawler to get the names of items from a website. The website has 25 items per page and multiple pages (200 for some item types). Here is the code: from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import HtmlXPathSelector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from lonelyplanet.items import LonelyplanetItem class LonelyplanetSpider(CrawlSpider): name = "lonelyplanetItemName_spider" allowed_domains = ["lonelyplanet.com"]
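For context, Scrapy schedules requests asynchronously, so responses for later pages can arrive before earlier ones; the pages are fetched concurrently rather than in sequence. Two common workarounds are throttling to one request at a time or tagging each item with its page number and sorting downstream. A hedged sketch (not the asker's lonelyplanet spider; the URLs and CSS selectors are placeholders):

```python
import scrapy

class OrderedListingSpider(scrapy.Spider):
    """Sketch only: URLs and field names are hypothetical."""
    name = "ordered_listing"
    # Process one request at a time so pages come back in the order requested.
    custom_settings = {"CONCURRENT_REQUESTS": 1}

    def start_requests(self):
        for page in range(1, 201):
            url = "http://www.example.com/items?page=%d" % page
            # Carry the page number along so items can also be sorted downstream.
            yield scrapy.Request(url, callback=self.parse_page, meta={"page": page})

    def parse_page(self, response):
        for name in response.css("div.item h2::text").getall():
            yield {"page": response.meta["page"], "name": name.strip()}
```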

Nutch regex-urlfilter syntax

折月煮酒 submitted on 2019-11-28 08:30:29
Question: I am running Nutch v. 1.6 and it is crawling specific sites correctly, but I can't seem to get the syntax right in the file NUTCH_ROOT/conf/regex-urlfilter.txt. The site I want to crawl has a URL similar to this: http://www.example.com/foo.cfm On that page there are numerous links that match the following pattern: http://www.example.com/foo.cfm/Bar_-_Foo/Extra/EX/20817/ID=6976 I want to crawl links that match the second example above as well. In my regex-urlfilter.txt I have the following: +
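regex-urlfilter.txt is evaluated top to bottom and the first matching rule (+ accept, - reject) wins; the stock file also ships a rule that rejects URLs containing characters such as '=' as probable query strings, which would drop the .../ID=6976 links before any custom + rule is reached. A hedged sketch of how the relevant lines might be arranged (exact stock rules vary by Nutch release):

```
# Stock rule that skips URLs containing probable query characters; it rejects
# '=' and would filter out .../ID=6976 unless relaxed or moved below the
# accept rule:
# -[?*!@=]

# Accept the landing page and everything beneath it (the pattern is an
# unanchored-at-end prefix match, so it also covers .../foo.cfm/.../ID=6976):
+^http://www\.example\.com/foo\.cfm

# Reject everything else (the stock file ends with "+." to accept instead):
-.
```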

Get Scrapy crawler output/results in script file function

时光毁灭记忆、已成空白 submitted on 2019-11-28 07:57:40
Question: I am using a script file to run a spider within a Scrapy project, and the spider is logging the crawler output/results. But I want to use the spider output/results in that script file, in some function. I do not want to save the output/results in any file or DB. Here is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script from twisted.internet import reactor from scrapy.crawler import CrawlerRunner from scrapy.utils.log import configure_logging from scrapy.utils
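One way to keep the results in memory is to attach a callback to the crawler's item_scraped signal and collect each item into a plain Python list, then use that list after reactor.run() returns. A minimal sketch built on the CrawlerRunner pattern from the page linked above; MySpider and process_results are placeholders for your own spider class and function:

```python
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy import signals

from myproject.spiders.example import MySpider  # placeholder: your own spider class

configure_logging()
runner = CrawlerRunner()
items = []

def collect_item(item, response, spider):
    # Called once per scraped item; stash it instead of writing to a file/DB.
    items.append(item)

crawler = runner.create_crawler(MySpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)

d = runner.crawl(crawler)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # blocks until the crawl finishes

process_results(items)  # placeholder function that consumes the in-memory items
```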

Python Package For Multi-Threaded Spider w/ Proxy Support?

强颜欢笑 submitted on 2019-11-28 07:53:57
Instead of just using urllib, does anyone know of the most efficient package for fast, multithreaded downloading of URLs that can operate through HTTP proxies? I know of a few, such as Twisted, Scrapy, libcurl, etc., but I don't know enough about them to make a decision, or even whether they can use proxies. Anyone know of the best one for my purposes? Thanks! It's simple to implement this in Python. The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy, ftp_proxy or gopher_proxy environment variables to a URL
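As a sketch of the do-it-yourself route with only the Python 3 standard library: a ThreadPoolExecutor provides the concurrency, and a ProxyHandler is the programmatic equivalent of the http_proxy environment variable mentioned above. The proxy address and URLs are placeholders:

```python
import concurrent.futures
import urllib.request

PROXY = "http://127.0.0.1:8080"  # placeholder proxy address
URLS = ["http://example.com/page%d" % i for i in range(1, 6)]

# Build an opener that routes HTTP/HTTPS traffic through the proxy,
# the programmatic equivalent of setting the http_proxy environment variable.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
)

def fetch(url):
    with opener.open(url, timeout=30) as resp:
        return url, resp.read()

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for url, body in pool.map(fetch, URLS):
        print(url, len(body))
```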

Send Post Request in Scrapy

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-28 06:51:25
I am trying to crawl the latest reviews from the Google Play store, and to get them I need to make a POST request. With Postman it works and I get the desired response, but a POST request in the terminal gives me a server error. For example, for this page https://play.google.com/store/apps/details?id=com.supercell.boombeach the command curl -H "Content-Type: application/json" -X POST -d '{"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum":'0'}' https://play.google.com/store/getreviews gives a server error, and Scrapy just ignores this line: frmdata = {"id": "com.supercell.boombeach",
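In Scrapy the usual way to send a form-encoded POST is FormRequest, and every value in formdata must be a string ("0", not 0). A hedged sketch of how the request might be issued from inside a spider; the field names are taken from the question, the callback is a placeholder:

```python
import scrapy

class PlayReviewsSpider(scrapy.Spider):
    name = "play_reviews"

    def start_requests(self):
        # All formdata values must be strings, not ints.
        frmdata = {
            "id": "com.supercell.boombeach",
            "reviewType": "0",
            "reviewSortOrder": "0",
            "pageNum": "0",
        }
        yield scrapy.FormRequest(
            "https://play.google.com/store/getreviews",
            formdata=frmdata,
            callback=self.parse_reviews,
        )

    def parse_reviews(self, response):
        self.logger.info("got %d bytes", len(response.body))
```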

getting Forbidden by robots.txt: scrapy

孤街浪徒 submitted on 2019-11-28 06:43:23
While crawling a website like https://www.netflix.com, I am getting: Forbidden by robots.txt: https://www.netflix.com/> ERROR: No response downloaded for: https://www.netflix.com/ In the new version (Scrapy 1.1), released 2016-05-11, the crawler first downloads robots.txt before crawling. To change this behavior, set ROBOTSTXT_OBEY in your settings.py: ROBOTSTXT_OBEY=False Here are the release notes. The first thing you need to ensure is that you change your user agent in the request; otherwise the default user agent will be blocked for sure. Source: https://stackoverflow.com/questions/37274835/getting
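Both fixes live in the project's settings.py. A minimal sketch (the browser user agent string is only an example):

```python
# settings.py (excerpt)

# Do not fetch/obey robots.txt before crawling
# (enabled by default in projects created with Scrapy >= 1.1).
ROBOTSTXT_OBEY = False

# Identify as a regular browser; the default "Scrapy/x.y" UA is often blocked.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36"
)
```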

Using one Scrapy spider for several websites

若如初见. submitted on 2019-11-28 06:02:32
I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes -- this will instead be configurable in a GUI. How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow. WARNING: This answer was for Scrapy v0.7; the spider manager API has changed a lot since then. Override the default SpiderManager class, load your custom rules from a database or
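With current Scrapy versions, a lighter-weight route than replacing the spider manager is a single generic spider that builds its allowed_domains, start_urls and rules in __init__ from a file the GUI writes. A hedged sketch; the JSON layout and field names are assumptions:

```python
import json

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ConfigurableSpider(CrawlSpider):
    """Reads allowed domains, start URLs, and URL regexes from a JSON file."""
    name = "configurable"

    def __init__(self, config_file="spider_config.json", *args, **kwargs):
        with open(config_file) as f:
            cfg = json.load(f)  # e.g. written by the GUI
        self.allowed_domains = cfg["allowed_domains"]
        self.start_urls = cfg["start_urls"]
        # Rules must be set before CrawlSpider.__init__ compiles them.
        self.rules = (
            Rule(LinkExtractor(allow=cfg["url_regexes"]),
                 callback="parse_item", follow=True),
        )
        super().__init__(*args, **kwargs)

    def parse_item(self, response):
        yield {"url": response.url}
```

It could then be launched with something like scrapy crawl configurable -a config_file=myconfig.json, one run per configuration.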