web-crawler

Scrapy: start crawling after login

≯℡__Kan透↙ submitted on 2019-12-04 06:06:06
Question: Disclaimer: the site I am crawling is a corporate intranet and I modified the URL a bit for corporate privacy. I managed to log into the site but I have failed to crawl it. Starting from the start_url https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf (this site directs you to a similar page with a more complex URL, i.e. https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument {unid=ADE682E34FC59D274825770B0037D278}) for every page including the
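A minimal sketch of the usual approach, assuming the intranet presents an HTML login form on the start page; the form field names and credentials below are placeholders, not taken from the question:

```python
import scrapy

class KmsSpider(scrapy.Spider):
    name = "kmssqkr"
    start_urls = ["https://kmssqkr.sarg/LotusQuickr/dept/Main.nsf"]

    def parse(self, response):
        # Submit the login form found on the start page; the field names
        # are hypothetical and must match the intranet's actual form.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"Username": "myuser", "Password": "mypass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy keeps the session cookies after a successful login,
        # so the authenticated pages can now be followed and crawled.
        for href in response.css("a::attr(href)").extract():
            yield response.follow(href, callback=self.after_login)
```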

Why do I have different document counts in status and index?

荒凉一梦 submitted on 2019-12-04 05:55:48
Question: So I'm following the StormCrawler-Elasticsearch tutorial and playing around with it. When Kibana is used to search, I've noticed that the number of hits for the index named 'status' is far greater than for 'index'. Example: on the top left, you can see that there are 846 hits for the 'status' index; I assume that means it has crawled through 846 pages. With the 'index' index, it shows only 31 hits. I understand that functionally, index and status are different, as status is just responsible
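The difference is expected: the 'status' index holds every URL StormCrawler has discovered (DISCOVERED, FETCHED, ERROR, ...), while 'index' only receives documents that were actually fetched, parsed and indexed. One way to see the breakdown is to aggregate 'status' by its status field; a sketch assuming a default local Elasticsearch on port 9200 and StormCrawler's default index and field names:

```python
import json
import requests

# Count URLs in the 'status' index per crawl status; only the FETCHED ones
# can ever show up as documents in the 'index' index.
query = {"size": 0, "aggs": {"by_status": {"terms": {"field": "status"}}}}
resp = requests.get(
    "http://localhost:9200/status/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
print(json.dumps(resp.json()["aggregations"]["by_status"]["buckets"], indent=2))
```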

MySQL server has gone away during crawling in Perl

独自空忆成欢 submitted on 2019-12-04 05:26:55
Question: I use the WWW::Mechanize library to get the content of URLs and save their data into MySQL tables. But when a page's content is too large, it gives this error message: DBD::mysql::st execute failed: MySQL server has gone away at F:\crawling\perl_tests\swc2.pl line 481. For example, it throws this error when I try to extract the content of this page: https://www.e-conomic.com/secure/api1/EconomicWebService.asmx?wsdl I added this code as well, but it still does not work: $connection->{max_allowed
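For what it's worth, this error on large inserts is usually cured on the server rather than on the client handle, by raising max_allowed_packet; a sketch of the server-side change (the 64M value is only illustrative):

```sql
-- Raise the packet limit for the running server; new connections pick it up.
SET GLOBAL max_allowed_packet = 64 * 1024 * 1024;

-- Or make it permanent in my.cnf / my.ini under the [mysqld] section:
--   max_allowed_packet = 64M
```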

connection refused error when running Nutch 2

隐身守侯 submitted on 2019-12-04 05:18:18
I am trying to run the Nutch 2 crawler on my system but I get the following error:

Exception in thread "main" org.apache.gora.util.GoraException: java.io.IOException: java.sql.SQLTransientConnectionException: java.net.ConnectException: Connection refused
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
    at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
    at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:69)
    at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
    at org.apache.nutch.crawl.Crawler
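The exception usually means the Gora SQL backing store configured for Nutch 2 is not reachable, i.e. nothing is listening at the JDBC URL, so the database has to be started (or the storage switched to HBase) before injecting. A sketch of the relevant conf/gora.properties entries, assuming the stock Nutch 2.x SQL-store defaults; the values are illustrative:

```properties
# conf/gora.properties - default Gora SQL store for Nutch 2.x.
# "Connection refused" means no HSQLDB/MySQL server answers at this URL.
gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest
gora.sqlstore.jdbc.user=sa
gora.sqlstore.jdbc.password=
```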

Splinter or Selenium: Can we get current html page after clicking a button?

有些话、适合烂在心里 submitted on 2019-12-04 05:07:17
I'm trying to crawl the website "http://everydayhealth.com". However, I found that the page is dynamically rendered: when I click the "More" button, new news items are shown. However, using splinter to click the button doesn't make "browser.html" automatically change to the current HTML content. Is there a way to get the newest HTML source, using either splinter or selenium? My code in splinter is as follows: import requests from bs4 import BeautifulSoup from splinter import Browser browser = Browser() browser.visit('http://everydayhealth.com') browser.click_link_by_text("More")
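A minimal sketch of one way around this, assuming the extra items arrive via AJAX and simply need time to load before browser.html reflects them:

```python
import time
from splinter import Browser

browser = Browser()  # defaults to Firefox; the matching webdriver must be installed
browser.visit("http://everydayhealth.com")
browser.click_link_by_text("More")

# Give the AJAX-loaded items a moment to arrive; a fixed sleep is crude but
# simple (an explicit wait/poll on a new element would be more robust).
time.sleep(2)

html_after_click = browser.html  # re-reads the live DOM, including the new items
print(len(html_after_click))
```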

Empty .json file

笑着哭i submitted on 2019-12-04 05:04:38
Question: I have written this short spider code to extract titles from the Hacker News front page (http://news.ycombinator.com/). import scrapy class HackerItem(scrapy.Item): #declaring the item hackertitle = scrapy.Field() class HackerSpider(scrapy.Spider): name = 'hackernewscrawler' allowed_domains = ['news.ycombinator.com'] # website we chose start_urls = ['http://news.ycombinator.com/'] def parse(self, response): sel = scrapy.Selector(response) #selector to help us extract the titles item=HackerItem()
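The usual cause of an empty output file is a parse() that never yields or returns items. A minimal working sketch along the lines of the question's spider; the CSS selector targets Hacker News markup and may need adjusting if the page layout changes:

```python
import scrapy

class HackerItem(scrapy.Item):
    hackertitle = scrapy.Field()

class HackerSpider(scrapy.Spider):
    name = "hackernewscrawler"
    allowed_domains = ["news.ycombinator.com"]
    start_urls = ["https://news.ycombinator.com/"]

    def parse(self, response):
        # Yield one item per story title; without the yield, the exported
        # .json file stays empty even though the page is fetched.
        for title in response.css("span.titleline > a::text").extract():
            item = HackerItem()
            item["hackertitle"] = title
            yield item
```

Running it with scrapy crawl hackernewscrawler -o titles.json should then produce a non-empty file.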

How to specify parameters on a Request using scrapy

夙愿已清 submitted on 2019-12-04 04:36:23
Question: How do I pass parameters to a request on a URL like this: site.com/search/?action=search&description=My Search here&e_author= How do I put the arguments in the structure of a Spider Request, something like this example: req = Request(url="site.com/", parameters={x=1, y=2, z=3}) Answer 1: Pass your GET parameters inside the URL itself: return Request(url="https://yoursite.com/search/?action=search&description=MySearchhere&e_author=") You should probably define your parameters in a dictionary and
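Scrapy's Request has no parameters argument; the GET parameters are encoded into the URL itself, typically built from a dictionary with urlencode, roughly like this:

```python
from urllib.parse import urlencode
from scrapy import Request

# Parameter values are taken from the question's example URL.
params = {
    "action": "search",
    "description": "My Search here",
    "e_author": "",
}
# urlencode also takes care of the spaces in the description value.
url = "https://site.com/search/?" + urlencode(params)
req = Request(url=url)
```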

Nested Selectors in Scrapy

ぐ巨炮叔叔 submitted on 2019-12-04 04:11:29
Question: I have trouble getting nested Selectors to work as described in the Scrapy documentation (http://doc.scrapy.org/en/latest/topics/selectors.html). Here's what I got: sel = Selector(response) level3fields = sel.xpath('//ul/something/*') for element in level3fields: site = element.xpath('/span').extract() When I print out "element" in the loop I get <Selector xpath='stuff seen above' data=u'<span class="something">text</span>'> Now I have two problems: firstly, within the element, there
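The likely culprit is the leading '/' in element.xpath('/span'), which makes the expression absolute (relative to the document root) instead of relative to the current element. A minimal sketch of the relative form, keeping the question's placeholder XPath:

```python
import scrapy

class NestedSpider(scrapy.Spider):
    name = "nested"
    start_urls = ["http://example.com/"]  # placeholder

    def parse(self, response):
        # './span' (or './/span') searches inside the current element;
        # '/span' would look for a <span> at the document root and match nothing.
        for element in response.xpath("//ul/something/*"):
            yield {"site": element.xpath("./span/text()").extract()}
```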

Going from Ruby to Python : Crawlers [closed]

我们两清 submitted on 2019-12-04 01:49:09
Question: As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 7 years ago. I've started learning Python over the past couple of days. I want to know the equivalent way of writing crawlers in Python. So in Ruby I
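The Ruby snippet in the question is cut off, so the mapping below is only a guess, but the usual Python counterpart to a Nokogiri/Mechanize-style crawler is requests plus BeautifulSoup (or Scrapy for a full crawling framework):

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page, parse it, and list its links - the basic crawl step.
response = requests.get("https://example.com/")
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href]"):
    print(link["href"], link.get_text(strip=True))
```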

Is it possible for Scrapy to get plain text from raw HTML data?

前提是你 submitted on 2019-12-03 23:30:28
For example: scrapy shell http://scrapy.org/ content = hxs.select('//*[@id="content"]').extract()[0] print content Then, I get the following raw HTML code: <div id="content"> <h2>Welcome to Scrapy</h2> <h3>What is Scrapy?</h3> <p>Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.</p> <h3>Features</h3> <dl> <dt>Simple</dt> <dt> </dt> <dd>Scrapy was designed with simplicity in mind, by providing the features
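Two common approaches, sketched in the old HtmlXPathSelector style used in the question (response.xpath works the same way in current Scrapy): select the text nodes directly, or strip the tags afterwards with w3lib, which ships with Scrapy.

```python
# Run inside `scrapy shell http://scrapy.org/`, where `hxs` is the page
# selector from the question.
from w3lib.html import remove_tags

# Option 1: select the text nodes under #content and join them.
text = " ".join(
    t.strip()
    for t in hxs.select('//*[@id="content"]//text()').extract()
    if t.strip()
)

# Option 2: extract the raw HTML and strip the tags afterwards.
html = hxs.select('//*[@id="content"]').extract()[0]
text = remove_tags(html)

print(text)
```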