web-crawler

Nutch crawling not working for a particular URL

谁都会走 submitted on 2019-12-23 05:41:42
Question: I am using Apache Nutch for crawling. When I crawl the page http://www.google.co.in, it fetches the page correctly and produces results. But when I add one parameter to that URL, it does not fetch any results for http://www.google.co.in/search?q=bill+gates. The crawl output:

    solrUrl is not set, indexing will be skipped...
    crawl started in: crawl
    rootUrlDir = urls
    threads = 10
    depth = 3
    solrUrl=null
    topN = 100
    Injector: starting at 2013-05-27 08:01:57
    Injector: crawlDb: crawl/crawldb
    Injector:
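Two things commonly cause exactly this in Nutch 1.x of that era: Google's robots.txt disallows /search, and the stock conf/regex-urlfilter.txt drops any URL containing query-string characters before it is ever fetched. A hedged first check is that filter line:

    # conf/regex-urlfilter.txt (stock Nutch 1.x rule)
    # skip URLs containing certain characters as probable queries, etc.
    -[?*!@=]

Removing `?` and `=` from that character class (or adding an explicit allow rule above it) lets query-string URLs through; the robots.txt restriction on /search would still block this particular Google URL.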

How to write scraped data into a CSV file in Scrapy?

泪湿孤枕 submitted on 2019-12-23 05:33:06
Question: I am trying to scrape a website by extracting its sub-links and their titles, and then saving the extracted titles and their associated links into a CSV file. When I run the following code, the CSV file is created but it is empty. Any help? My Spider.py file looks like this:

    from scrapy import cmdline
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor

    class HyperLinksSpider(CrawlSpider):
        name = "linksSpy"
        allowed_domains = ["some_website"]
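The usual culprit for a created-but-empty CSV is a callback that never yields items (or a feed export that was never enabled). A minimal sketch under current Scrapy, where scrapy.contrib is gone and CrawlSpider/LinkExtractor live under scrapy.spiders and scrapy.linkextractors; the domain and URLs are placeholders:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class HyperlinksSpider(CrawlSpider):
        name = "linksSpy"
        allowed_domains = ["example.com"]          # placeholder
        start_urls = ["https://example.com/"]      # placeholder

        rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

        def parse_item(self, response):
            # Each yielded dict becomes one CSV row via the feed exporter.
            yield {
                "title": response.css("title::text").get(),
                "url": response.url,
            }

Running it with `scrapy crawl linksSpy -o links.csv` lets the built-in feed exporter create and fill the file; no hand-rolled CSV code is needed.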

Scrapy - access data while crawling and randomly change user agent

我与影子孤独终老i submitted on 2019-12-23 05:26:12
Question: Is it possible to access the data while Scrapy is crawling? I have a script that finds a specific keyword and writes the keyword to a .csv file, along with the link where it was found. However, I have to wait for Scrapy to finish crawling, and only when that is done does the data actually appear in the .csv file. I am also trying to change my user agent randomly, but it is not working. If I'm not allowed two questions in one, I will post this as a separate question.

    #!/usr/bin/env python
    # -*- coding:
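A hedged sketch of both pieces, assuming a standard Scrapy project and items with "keyword" and "link" fields (those field names, the file name, and the module paths below are assumptions): an item pipeline that appends and flushes each row the moment the item is scraped, and a downloader middleware that assigns a random User-Agent per request.

    import csv
    import random

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]


    class CsvWritePipeline:
        def open_spider(self, spider):
            self.file = open("keywords.csv", "a", newline="")
            self.writer = csv.writer(self.file)

        def process_item(self, item, spider):
            self.writer.writerow([item["keyword"], item["link"]])
            self.file.flush()  # row is on disk while the crawl is still running
            return item

        def close_spider(self, spider):
            self.file.close()


    class RandomUserAgentMiddleware:
        def process_request(self, request, spider):
            # Returning None lets Scrapy continue handling the request.
            request.headers["User-Agent"] = random.choice(USER_AGENTS)

Both are enabled in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.CsvWritePipeline': 300} and DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomUserAgentMiddleware': 400} (the 'myproject' paths are placeholders).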

Pass params as variables to evaluate in CasperJS and log into a site

一曲冷凌霜 submitted on 2019-12-23 05:24:12
Question: I'm writing a Python script that passes a username and password as params to my CasperJS script, described below. But I don't know why I receive the error:

    CasperError: casper.test property is only available using the `casperjs test` command
    C:/casperjs/modules/casper.js:179

Can someone help me with this issue? CasperJS.py:

    import os
    import subprocess

    # PATH to files
    casperjs = 'c:\casperjs\bin\casperjs.exe'
    app_root = os.path.dirname(os.path.realpath(__file__))
    script = os.path.join(app_root,
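The quoted error is usually unrelated to parameter passing: it means the Casper script references casper.test, which only exists when the script is run via the `casperjs test` command, so either drop that reference or run it with `casperjs test`. The credentials themselves can be passed as CLI options and read inside the Casper script with casper.cli.get('username'). A minimal sketch of the Python side (the script name and credentials are placeholders); note the raw string, since the '\b' in a plain 'c:\casperjs\bin' literal is a backspace escape that silently corrupts the path:

    import subprocess

    # Raw string avoids '\b' (backspace) and similar escapes in the Windows path.
    casperjs = r"c:\casperjs\bin\casperjs.exe"
    script = "login.js"  # placeholder for the actual Casper script

    proc = subprocess.Popen(
        [casperjs, script, "--username=alice", "--password=secret"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    print(out.decode())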

In R, crawling with rvest: failing to get the text in an HTML tag using the html_text function

谁都会走 submitted on 2019-12-23 05:01:53
Question:

    url <- "http://news.chosun.com/svc/content_view/content_view.html?contid=1999080570392"
    hh = read_html(GET(url), encoding = "EUC-KR")
    #guess_encoding(hh)
    html_text(html_node(hh, 'div.par'))
    #html_text(html_nodes(hh, xpath='//*[@id="news_body_id"]/div[2]/div[3]'))

I'm trying to crawl news data (just for practice) using rvest in R. When I tried it on the page above, I failed to fetch the text from the page. (The XPath doesn't work either.) I do not think I failed to find the link that

WebDriverException: Message: chrome not reachable

∥☆過路亽.° submitted on 2019-12-23 04:17:17
Question: I use Selenium with Python. When I run my crawler I get this error:

    WebDriverException: Message: chrome not reachable
    (Driver info: chromedriver=2.9.248304, platform=Linux 3.16.0-4-amd64 x86_64)

I read this question, downloaded chromedriver (the binary), and copied it to /usr/bin. I tried

    driver = webdriver.Chrome('/usr/bin/chromedriver')

but I get the same error.

Answer 1: In your protractor.configuration file, if you have the following:

    capabilities: {
        'browserName': 'chrome',
        'chromeOptions': {
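"chrome not reachable" typically means chromedriver started but could not talk to a Chrome browser: Chrome itself is missing, its version does not match the driver, or the server has no display. A minimal sketch, assuming Selenium 4 on a headless Linux box (paths and URL are placeholders):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    opts = Options()
    opts.add_argument("--headless")    # no X display on a typical server
    opts.add_argument("--no-sandbox")  # often needed when running as root

    driver = webdriver.Chrome(service=Service("/usr/bin/chromedriver"), options=opts)
    try:
        driver.get("https://example.com")
        print(driver.title)
    finally:
        driver.quit()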

How to approach a Google Groups discussions crawler

被刻印的时光 ゝ submitted on 2019-12-23 03:20:25
Question: As an exercise in RSS, I would like to be able to search through pretty much all Unix discussions in this group: comp.unix.shell. I know enough Python and understand basic RSS, but I am stuck on... how do I grab all messages between particular dates, or at least all messages between the Nth most recent and the Mth most recent? High-level descriptions and pseudo-code are welcome. Thank you! EDIT: I would like to be able to go back more than 100 messages, but do not want something like parsing 10 messages at a time, such as
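For the date-window part, a minimal sketch using feedparser. The feed URL follows the pattern Google Groups exposed at the time and is an assumption; a single feed is also capped at a fixed number of entries, so going back further than that means walking the group's paginated HTML archive instead:

    import datetime
    import feedparser

    # Assumed/historical feed URL pattern for the group; verify before relying on it.
    FEED_URL = "http://groups.google.com/group/comp.unix.shell/feed/rss_v2_0_msgs.xml"

    start = datetime.datetime(2010, 1, 1)
    end = datetime.datetime(2010, 6, 30)

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        if not getattr(entry, "published_parsed", None):
            continue  # some entries lack a parsed date
        published = datetime.datetime(*entry.published_parsed[:6])
        if start <= published <= end:
            print(published, entry.title, entry.link)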

JSP page import problem: class file placed in a package inside WEB-INF/classes

梦想的初衷 submitted on 2019-12-23 03:19:30
Question: I have a web application, crawler_GUI, which has another Java project, jspider, in its build path (I use Eclipse Galileo). The GUI uses the jspider project as its backend. See http://i45.tinypic.com/avmszn.jpg for the structure. The JSP creates an instance of the jspider object. At first I didn't have the classes in the WEB-INF/classes folder, and I rectified that error. Now it seems to work and no errors are shown, but none of the tasks are carried out. Here's the code: The JSP <%@

Scrapy, Celery, and multiple spiders

∥☆過路亽.° submitted on 2019-12-23 02:55:09
Question: I'm using Scrapy and I'm trying to use Celery to manage multiple spiders on one machine. The problem I have (a bit difficult to explain) is that the spiders get multiplied: if my first spider starts and I then start a second spider, the first spider executes twice. See my code here:

ProcessJob.py

    class ProcessJob():
        def processJob(self, job):
            # update job
            mysql = MysqlConnector.Mysql()
            db = mysql.getConnection()
            cur = db.cursor()
            job.status = 1
            update = "UPDATE job SET status=1 WHERE id=
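Spiders "multiplying" is the classic symptom of reusing one Twisted reactor across jobs in a long-lived worker: each new job schedules another crawl on a CrawlerProcess that still remembers the previous one. A hedged workaround, assuming the spider is registered in a normal Scrapy project, is to give every job its own child process and therefore a fresh reactor:

    # A minimal sketch: each job runs its crawl in a separate child process, so
    # the reactor is created and torn down per job. Inside a Celery worker,
    # billiard.Process is usually needed instead of multiprocessing, because
    # Celery's daemonized worker processes cannot fork via multiprocessing.
    from multiprocessing import Process  # or: from billiard import Process

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings


    def _crawl(spider_name):
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_name)
        process.start()  # blocks until this one crawl finishes


    def run_job(spider_name):
        p = Process(target=_crawl, args=(spider_name,))
        p.start()
        p.join()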

Google crawling with cookies

冷暖自知 submitted on 2019-12-23 02:53:40
Question: The content of my site depends on cookies in the request, and when Google's crawler bot visits my site it doesn't index much content, because it doesn't have the specific cookies in each of its requests. Is it possible to set up some rule so that when the crawler bot is crawling my site it uses the specific cookies?

Answer 1: Googlebot does not honor cookies, on purpose: it has to "see" what anybody else will see on your website, the "smallest common denominator" if you will; otherwise search results
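Given that answer, the workable approach is server-side: treat the cookie as optional and serve full default content when it is absent, so a cookieless client (Googlebot included) still receives indexable pages. A minimal sketch; Flask and the cookie name are illustrative assumptions, not from the original:

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/")
    def index():
        # Fall back to complete default content when no cookie is present,
        # rather than an empty shell that a cookieless crawler cannot index.
        region = request.cookies.get("region", "default")
        return f"Full page content for region: {region}"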