web-crawler

Deal with AJAX block in web crawler or create inputs manually

喜欢而已 submitted on 2019-12-24 07:49:10
Question: Based on Alvin Bunk's article (link to article), I want to create a web crawler that logs in to a website and then submits a form. My problem is that the website has an AJAX block which, after an empty link is clicked, generates a few inputs that I need to fill, so I need to either click that empty link somehow or insert the inputs manually. I changed the code below in a lot of ways to try to make it work, but I got stuck on the visit function: I get Uncaught Error: Call to a member function visit() on
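The excerpt is cut off above, and the question's own code is PHP (per the Alvin Bunk article), but the underlying problem — inputs that only exist after a JavaScript-triggered click — generally cannot be solved by an HTML-only crawler; it needs a browser driver. A minimal sketch of the idea in Python with Selenium (a different stack than the question's; the URL and the "#ajax-link"/"dynamic_field" selectors are hypothetical):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://example.com/login")  # hypothetical target site

# Click the "empty" link that triggers the AJAX request.
driver.find_element(By.CSS_SELECTOR, "#ajax-link").click()

# Wait until the dynamically generated input actually exists in the DOM
# before trying to fill it; the name is an assumption.
field = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, "dynamic_field"))
)
field.send_keys("value")
driver.quit()
```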

How to get all pages from the whole website using Python?

只谈情不闲聊 submitted on 2019-12-24 07:10:09
Question: I am trying to make a tool that gets every link from a website. For example, I need to get all question pages from Stack Overflow. I tried using scrapy:

```python
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://stackoverflow.com/questions/']

    def parse(self, response):
        le = LinkExtractor()
        for link in le.extract_links(response):
            url_lnk = link.url
            print(url_lnk)
```

Here I got only questions from the start page. What do I need to do to get all 'question' links? Time doesn't matter, I just
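The excerpt stops mid-sentence, but the usual cause of "only the start page" with CrawlSpider is overriding parse, which CrawlSpider reserves for its own link-following. A minimal sketch of the standard fix — a Rule with follow=True — where the allow pattern is my assumption about Stack Overflow's question URLs:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuestionSpider(CrawlSpider):
    name = 'questions'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions/']

    # follow=True keeps the spider crawling beyond the start page;
    # the allow pattern (an assumption) restricts it to question URLs.
    rules = (
        Rule(LinkExtractor(allow=r'/questions/\d+'),
             callback='parse_question', follow=True),
    )

    def parse_question(self, response):
        yield {'url': response.url}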

How to disable Rails sessions for web crawlers?

孤街醉人 submitted on 2019-12-24 06:41:57
Question: It used to be that a line like this in the application controller would disable sessions entirely for a request:

```ruby
session :off, :if => Proc.new {|req| req.user_agent =~ BOT_REGEX}
```

With Rails 3.x, this is either deprecated or no longer works. I realize that the new concept is that sessions are lazy-loaded, but the execution flow through the app uses/checks sessions even if the client is a web bot. So is there some new mechanism that could be used to disable sessions on a per-request basis?

Answer 1: There

Close a scrapy spider when a condition is met and return the output object

半腔热情 submitted on 2019-12-24 06:33:43
Question: I have made a spider to get reviews from a page like this here using scrapy. I want product reviews only up to a certain date (2 July 2016 in this case). I want to close the spider as soon as a review's date is earlier than the given date, and return the items list. The spider works well, but my problem is that I am not able to close it when the condition is met; if I raise an exception, the spider closes without returning anything. Please suggest the best way to close the spider manually.
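A common pattern for this (a sketch, not the asker's code; the URL, selectors, and date format are assumptions) is to yield qualifying items first and only then raise CloseSpider. Items yielded before the raise have already gone through the pipeline, so they are not lost; the exception only stops further crawling:

```python
import scrapy
from scrapy.exceptions import CloseSpider
from datetime import datetime

CUTOFF = datetime(2016, 7, 2)

class ReviewSpider(scrapy.Spider):
    name = 'reviews'
    start_urls = ['https://example.com/product/reviews']  # hypothetical URL

    def parse(self, response):
        for review in response.css('div.review'):  # hypothetical selector
            date = datetime.strptime(
                review.css('span.date::text').get(), '%d %B %Y')
            if date < CUTOFF:
                # Everything yielded so far is already collected;
                # CloseSpider only cancels the remaining crawl.
                raise CloseSpider('reached review cutoff date')
            yield {'date': date.isoformat(),
                   'text': review.css('p.body::text').get()}
```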

Scraping the source code using VBA macros

帅比萌擦擦* submitted on 2019-12-24 03:12:12
Question: I need to scrape the price values from a price-comparison website (product link: https://www.toppreise.ch/prod_488002.html), but I am not able to. See the highlighted price in the image that I want to capture. Please help me crawl this page. PS: toppreise.ch is not accessible in many countries, so use a VPN. I am using the code below:

```vba
Private Sub SiteInfo_Click()
    Dim strhtml
    On Error Resume Next
    ThisWorkbook.Sheets("Data Mining").Activate
    Sheets("Data Mining").Range("B1").Select
```
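The VBA snippet above is cut off. As a language-neutral way to check that the price is reachable in the raw HTML at all, here is a minimal Python sketch (requests + BeautifulSoup rather than VBA; the "span.price" selector is a guess at toppreise.ch's markup, not verified, and the page may be geo-blocked as the question notes):

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.toppreise.ch/prod_488002.html', timeout=10)
soup = BeautifulSoup(resp.text, 'html.parser')

# Hypothetical selector; inspect the page to find the real price element.
price = soup.select_one('span.price')
print(price.get_text(strip=True) if price else 'price element not found')
```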

Scraping text in h3 and p tags using BeautifulSoup in Python

血红的双手。 submitted on 2019-12-24 01:39:08
Question: I have experience with Python and BeautifulSoup, and I want to scrape data from a website and store it as a CSV file. A single sample of the data I need (one row of data) is coded as follows:

```html
...body and not nested divs...
<h3 class="college">
  <span class="num">1.</span>
  <a href="https://www.stanford.edu/">Stanford University</a>
</h3>
<div class="he-mod" data-block="paragraph-9"></div>
<p class="school-location">Stanford, CA</p>
...body and not nested divs...
<h3 id="MIT" class="college">
  <span
```
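Given the structure shown — each h3.college followed, after a spacer div, by a sibling p.school-location — one way to pair names with locations and write a CSV. A sketch assuming the truncated sample continues in the same pattern and that the page is saved locally as colleges.html (a hypothetical filename):

```python
import csv
from bs4 import BeautifulSoup

with open('colleges.html', encoding='utf-8') as f:  # hypothetical local copy
    soup = BeautifulSoup(f, 'html.parser')

rows = []
for h3 in soup.find_all('h3', class_='college'):
    name = h3.find('a').get_text(strip=True)
    # The location is the next <p class="school-location"> sibling;
    # find_next_sibling skips the spacer <div> between them.
    loc = h3.find_next_sibling('p', class_='school-location')
    rows.append([name, loc.get_text(strip=True) if loc else ''])

with open('colleges.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['college', 'location'])
    writer.writerows(rows)
```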

Why doesn't Nutch seem to know about “Last-Modified”?

不羁的心 submitted on 2019-12-24 01:25:52
Question: I set up Nutch with a db.fetch.interval.default of 60000 so that I can crawl every day. If I don't, it won't even look at my site when I crawl the next day. But when I do crawl the next day, every page that it fetched yesterday gets fetched again with a 200 response code, indicating that it's not sending the previous day's date in the "If-Modified-Since" header. Shouldn't it skip fetching pages that haven't changed? Is there a way to make it do that? I noticed a ProtocolStatus.NOT_MODIFIED in Fetcher.java,

Click on buttons with images

*爱你&永不变心* submitted on 2019-12-24 01:12:16
Question: I'm trying to crawl this page: http://www.1800contractor.com/d.HI.html I made this script:

```python
from selenium import webdriver

URL = "http://www.1800contractor.com/d.GA.html"
zip_codes = ['30324']

driver = webdriver.Firefox()
driver.get(URL)
zip_codes = ['30324']

text_box = driver.find_element_by_xpath('//*[@id="zip"]')
text_box.send_keys(zip_codes[0])
button = driver.find_element_by_xpath('//*[@id="xmdListingsSearchForm"]')
button.click()
```

Basically I need to put a zip code in the search box: zip
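The question is cut off, but the snippet as shown clicks the <form> element itself ("xmdListingsSearchForm"), which does nothing. The usual fix is to submit the form (or click its actual submit button). A sketch of that correction in current Selenium 4 syntax — the site's present markup may differ from what the question shows:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.1800contractor.com/d.GA.html")

text_box = driver.find_element(By.ID, "zip")
text_box.send_keys("30324")

# Instead of clicking the <form> element, submit the form;
# submit() on any field inside the form triggers submission.
text_box.submit()
```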

How to read .ARC files from the Heritrix crawler using Python?

故事扮演 submitted on 2019-12-24 00:47:26
Question: I looked at the Heritrix documentation website, and it lists a Python .ARC file reader. However, the link is a 404 when I click on it: http://crawler.archive.org/articles/developer_manual/arcs.html Does anyone know of another Heritrix ARC reader that uses Python? (I asked this question before, but closed it due to inaccuracy.)

Answer 1: Nothing a little Googling can't find: http://archive-access.cvs.sourceforge.net/viewvc/archive-access/archive-access/projects/hedaern/

Source: https:/
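Both links above are dated. A maintained alternative today (my suggestion, not from the original answer) is the warcio library, which iterates ARC files with the same API it uses for WARC. A minimal sketch, with the filename hypothetical:

```python
# pip install warcio
from warcio.archiveiterator import ArchiveIterator

with open('crawl.arc.gz', 'rb') as stream:  # hypothetical Heritrix output
    for record in ArchiveIterator(stream):
        # ARC records carry the original URL in their record headers.
        print(record.rec_headers.get_header('uri'))
```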

Asynchronous web crawling in F#, something wrong?

人走茶凉 submitted on 2019-12-24 00:36:18
Question: Not quite sure if it is OK to do this, but my question is: is there something wrong with my code? It doesn't go as fast as I would like, and since I am using lots of async workflows, maybe I am doing something wrong. The goal here is to build something that can crawl 20,000 pages in less than an hour.

```fsharp
open System
open System.Text
open System.Net
open System.IO
open System.Text.RegularExpressions
open System.Collections.Generic
open System.ComponentModel
open Microsoft.FSharp
open System
```
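The F# code above is cut off at the opens, so there is nothing concrete to diagnose, but the stated goal (20,000 pages in under an hour) usually comes down to bounding concurrency rather than spawning unlimited async work. A sketch of that pattern in Python with asyncio and aiohttp (a different stack than the question's F#; the concurrency limit is an assumption to tune):

```python
import asyncio
import aiohttp

CONCURRENCY = 50  # assumption: tune to what the target hosts tolerate

async def fetch(session, sem, url):
    # The semaphore caps in-flight requests; unbounded concurrency
    # usually hurts throughput instead of helping it.
    async with sem:
        try:
            timeout = aiohttp.ClientTimeout(total=30)
            async with session.get(url, timeout=timeout) as resp:
                body = await resp.read()
                return url, resp.status, len(body)
        except Exception as exc:
            return url, None, str(exc)

async def crawl(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

# Usage: results = asyncio.run(crawl(list_of_20000_urls))
```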