web-crawler

Re-crawling websites fast

亡梦爱人 submitted on 2019-12-03 20:46:58
I am developing a system that has to track the content of a few portals and check for changes every night (for example, download and index new sites that have been added during the day). The content of these portals will be indexed for searching. The problem is re-crawling these portals - the first crawl of a portal takes very long (examples of portals: www.onet.pl, www.bankier.pl, www.gazeta.pl) and I want to re-crawl them faster (as fast as possible), for example by checking the date of modification, but when I used wget to download www.bankier.pl it complained that there is no last…
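A minimal sketch of one way to speed up re-crawls (my own illustration, not from the question): issue a conditional GET with If-Modified-Since and, for servers that do not send a Last-Modified header at all (which seems to be the wget complaint here), fall back to hashing the response body. The URL and stored values below are placeholders.

import hashlib
import requests

def page_changed(url, last_modified=None, last_hash=None):
    # Ask the server to answer 304 Not Modified if nothing changed since the last crawl.
    headers = {"If-Modified-Since": last_modified} if last_modified else {}
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return False, last_modified, last_hash
    # No usable Last-Modified support: compare a hash of the body instead.
    body_hash = hashlib.sha256(resp.content).hexdigest()
    return body_hash != last_hash, resp.headers.get("Last-Modified"), body_hash

# Example usage (placeholder URL and stored state):
# changed, modified, digest = page_changed("https://www.bankier.pl/", stored_modified, stored_digest)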

Callback for redirected requests Scrapy

痴心易碎 submitted on 2019-12-03 20:22:14
I am trying to scrape using the Scrapy framework. Some requests are redirected, but the callback function set in start_requests is not called for these redirected URL requests, although it works fine for the non-redirected ones. I have the following code in the start_requests function: for user in users: yield scrapy.Request(url=userBaseUrl+str(user['userId']),cookies=cookies,headers=headers,dont_filter=True,callback=self.parse_p) But this self.parse_p is called only for the non-302 requests. I guess you get a callback for the final page (after the redirect). Redirects are taken care of by the…
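A hedged sketch (my own illustration, not the asker's project): if the callback must fire on the 302 response itself rather than the page it redirects to, the redirect middleware can be bypassed per request via meta. The spider name, base URL and user IDs below are placeholders.

import scrapy

class UserSpider(scrapy.Spider):
    name = "users"  # hypothetical spider name

    def start_requests(self):
        for user_id in (1, 2, 3):  # placeholder user IDs
            yield scrapy.Request(
                url="https://example.com/user/" + str(user_id),  # placeholder base URL
                meta={"dont_redirect": True, "handle_httpstatus_list": [301, 302]},
                dont_filter=True,
                callback=self.parse_p,
            )

    def parse_p(self, response):
        # Now called for 301/302 responses as well; the redirect target is in the Location header.
        self.logger.info("%s -> %s", response.status, response.headers.get("Location"))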

How do I create rules for a CrawlSpider using Scrapy

╄→尐↘猪︶ㄣ submitted on 2019-12-03 20:07:56
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from manga.items import MangaItem

class MangaHere(BaseSpider):
    name = "mangah"
    allowed_domains = ["mangahere.com"]
    start_urls = ["http://www.mangahere.com/seinen/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li/div')
        items = []
        for site in sites:
            rating = site.select("p/span/text()").extract()
            if rating > 4.5:
                item = MangaItem()
                item["title"] = site.select("div/a/text()").extract()
                item["desc"] = site.select("p[2]/text()").extract()
                item["link"] = site.select("div/a/…
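The excerpt above cuts off, but to the question in the title - here is a minimal sketch of a CrawlSpider with rules (my own illustration, using the current Scrapy import paths; the allow= pattern for the pagination links is an assumption, not taken from the site):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MangaHereCrawl(CrawlSpider):
    name = "mangah_crawl"  # hypothetical name
    allowed_domains = ["mangahere.com"]
    start_urls = ["http://www.mangahere.com/seinen/"]

    rules = (
        # Follow every link matching the (assumed) pattern and hand each page to parse_item.
        Rule(LinkExtractor(allow=r"/seinen/"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        for site in response.xpath("//ul/li/div"):
            yield {
                "title": site.xpath("div/a/text()").get(),
                "link": site.xpath("div/a/@href").get(),
            }

Note that a CrawlSpider must not override parse(), since the rules machinery uses it internally; put the extraction logic in a differently named callback such as parse_item.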

How to use Goutte

限于喜欢 submitted on 2019-12-03 17:28:51
Question: Issue: I cannot fully understand the Goutte web scraper. Request: Can someone please help me understand, or provide code to help me better understand, how to use the Goutte web scraper? I have read over the README.md. I am looking for more information than what it provides, such as what options are available in Goutte and how to write those options, or, when you are looking at forms, do you search for the name= or the id= of the form? Webpage layout attempting to be scraped: Step 1: The webpage…
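A hedged sketch in PHP, Goutte's own language (the URL, button label and field names below are placeholders): Goutte locates a form via its submit button's visible text, id or name, and fills fields by their name= attribute rather than their id=.

require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();
// Fetch the page; $crawler is a Symfony DomCrawler over the response HTML.
$crawler = $client->request('GET', 'https://example.com/login');

// selectButton() matches the submit button; the array keys passed to
// submit() are the form fields' name= attributes.
$form = $crawler->selectButton('Log in')->form();
$crawler = $client->submit($form, ['username' => 'me', 'password' => 'secret']);

// After submitting, filter() takes CSS selectors to pull out content.
$crawler->filter('h1')->each(function ($node) {
    echo $node->text(), "\n";
});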

What is the easiest way to run Python scripts on a cloud server?

眉间皱痕 submitted on 2019-12-03 17:13:00
Question: I have a web-crawling Python script that takes hours to complete and is infeasible to run in its entirety on my local machine. Is there a convenient way to deploy this to a simple web server? The script basically downloads webpages into text files. How would this best be accomplished? Thanks! Answer 1: Since you said that performance is a problem and you are doing web scraping, the first thing to try is the Scrapy framework - it is a very fast and easy-to-use web-scraping framework. The scrapyd tool would…
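A hedged sketch of the scrapyd route the answer starts to describe (the host, project and spider names are placeholders): scrapyd runs as a small service on the server (port 6800 by default) and crawls are scheduled over its HTTP API.

import requests

# Schedule a crawl on a scrapyd instance running on the remote server.
resp = requests.post(
    "http://my-server.example.com:6800/schedule.json",     # placeholder host
    data={"project": "myproject", "spider": "myspider"},    # placeholder names
)
print(resp.json())   # e.g. {"status": "ok", "jobid": "..."}

# Job status can later be checked via the listjobs.json endpoint:
# requests.get("http://my-server.example.com:6800/listjobs.json",
#              params={"project": "myproject"}).json()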

How to best develop web crawlers

耗尽温柔 submitted on 2019-12-03 17:07:51
I am used to creating crawlers to compile information, and when I come to a website with info I need, I start a new crawler specific to that site, using shell scripts most of the time and sometimes PHP. The way I do it is with a simple for loop to iterate over the page list, wget to download the pages, and sed, tr, awk or other utilities to clean each page and grab the specific info I need. The whole process takes some time depending on the site, and more to download all the pages. And I often run into an AJAX site that complicates everything. I was wondering if there are better ways to do that, faster ways, or even…
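For comparison, a hedged sketch of the same loop in Python (the URL pattern and selector are placeholders): requests plus an HTML parser replaces the wget + sed/tr/awk pipeline, and parsing the DOM survives markup changes better than regex-style text cleaning.

import requests
from bs4 import BeautifulSoup

for page in range(1, 11):                          # placeholder page list
    url = f"https://example.com/list?page={page}"  # placeholder URL pattern
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.select("div.item"):            # placeholder selector
        print(row.get_text(" ", strip=True))

AJAX-heavy sites usually need either the underlying JSON endpoints (visible in the browser's network tab) or a headless browser such as Selenium.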

Web Crawler - Ignore Robots.txt file?

风格不统一 submitted on 2019-12-03 16:32:08
Question: Some servers have a robots.txt file in order to stop web crawlers from crawling their websites. Is there a way to make a web crawler ignore the robots.txt file? I am using Mechanize for Python. Answer 1: The documentation for mechanize has this sample code:

br = mechanize.Browser()
....
# Ignore robots.txt. Do not do this without thought and consideration.
br.set_handle_robots(False)

That does exactly what you want. Answer 2: This looks like what you need: from mechanize import Browser br =…
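A slightly fuller sketch of the same idea (the target URL and User-Agent string are placeholders): besides disabling the robots.txt handler, many sites also reject mechanize's default User-Agent, so it is commonly overridden as well.

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)                      # do not fetch or obey robots.txt
br.addheaders = [("User-Agent", "Mozilla/5.0")]  # placeholder UA string
html = br.open("https://example.com/").read()    # placeholder URL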

Scraping text in h3 and div tags using BeautifulSoup, Python

天大地大妈咪最大 submitted on 2019-12-03 16:02:13
I have no experience with Python, BeautifulSoup, Selenium, etc., but I'm eager to scrape data from a website and store it as a CSV file. A single sample of the data I need is coded as follows (a single row of data):

<div class="box effect">
<div class="row">
<div class="col-lg-10">
<h3>HEADING</h3>
<div><i class="fa user"></i> NAME</div>
<div><i class="fa phone"></i> MOBILE</div>
<div><i class="fa mobile-phone fa-2"></i> NUMBER</div>
<div><i class="fa address"></i> XYZ_ADDRESS</div>
<div class="space"> </div>
<div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www…
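A hedged sketch (the element classes are taken from the snippet above; the URL and CSV column names are my own guesses): pull the heading and the labelled fields out of each "box effect" block and write them as one CSV row.

import csv
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/listing").text   # placeholder URL
soup = BeautifulSoup(html, "html.parser")

with open("records.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["heading", "name", "phone", "mobile", "address"])
    for box in soup.select("div.box.effect"):
        heading = box.h3.get_text(strip=True) if box.h3 else ""
        # Each data field sits in a <div> that contains an <i> icon element.
        fields = [d.get_text(strip=True)
                  for d in box.select("div.col-lg-10 > div") if d.find("i")]
        writer.writerow([heading] + fields[:4])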

Ban robots from website [closed]

坚强是说给别人听的谎言 submitted on 2019-12-03 15:35:59
My website is often down because a spider is accessing too many resources. This is what the hosting provider told me. They told me to ban these IP addresses: 46.229.164.98, 46.229.164.100, 46.229.164.101. But I have no idea how to do this. I've googled a bit and I've now added these lines to .htaccess in the root:

# allow all except those indicated here
<Files *>
order allow,deny
allow from all
deny from 46.229.164.98
deny from 46.229.164.100
deny from 46.229.164.101
</Files>

Is this 100% correct? What could I do? Please help me - I really don't have any idea what I should do. Sharky: based on these…
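As a hedged side note (not from the thread): the order/allow/deny directives above are Apache 2.2 syntax and only work on Apache 2.4 if mod_access_compat is enabled; the equivalent 2.4 form uses Require:

<RequireAll>
    Require all granted
    Require not ip 46.229.164.98
    Require not ip 46.229.164.100
    Require not ip 46.229.164.101
</RequireAll>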

Adding URL parameter to Nutch/Solr index and search results

…衆ロ難τιáo~ submitted on 2019-12-03 15:04:31
Question: I can't find any hint on how to set up Nutch so that it does NOT filter/remove my URL parameters. I want to crawl and index some pages where lots of content is hidden behind the same base URL (like /news.jsp?id=1, /news.jsp?id=2, /news.jsp?id=3 and so on). The regex-normalize.xml only removes redundant stuff from the URL (like session IDs and a trailing ?), and the regex-urlfilter.txt seems to have a wildcard for my host (+^http://$myHost/). The crawling works fine so far. Any ideas? Cheers, mana. EDIT: A part of…
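A hedged pointer (based on the stock Nutch configuration, not on the asker's files): the default conf/regex-urlfilter.txt contains a rule that drops any URL with query characters, which would silently discard the ?id=... pages regardless of the host wildcard.

# Stock rule in conf/regex-urlfilter.txt:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# To keep parameterised URLs such as /news.jsp?id=1, remove '?' and '='
# from the character class (or delete the rule entirely):
-[*!@]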