web-crawler

What are some good Ruby-based web crawlers? [closed]

穿精又带淫゛_ submitted on 2019-11-27 17:00:06
I am looking at writing my own, but I am wondering if there are any good web crawlers out there that are written in Ruby. Short of a full-blown web crawler, any gems that might be helpful in building one would be useful. I know this part of the question is touched upon in a couple of places, but a list of gems applicable to building a web crawler would be a great resource as well.

Felipe Lima: I am building wombat, a Ruby DSL to crawl web pages and extract content. Check it out on GitHub: https://github.com/felipecsl/wombat. It is still in an early stage but is already functional with

Crawl a site that has infinite scrolling using Python

若如初见. submitted on 2019-11-27 16:56:54
Question: I have been doing research, and so far the Python package I plan on using is Scrapy. Now I am trying to find out a good way to build a scraper with Scrapy to crawl a site with infinite scrolling. After digging around, I found that there is a package called Selenium, and it has a Python module. I have a feeling someone has already combined Scrapy and Selenium to scrape a site with infinite scrolling. It would be great if someone could point me to an example.
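A common pattern is to let Selenium drive a real browser until the page stops growing, then hand the rendered HTML to Scrapy's selectors. A minimal sketch along those lines (the scroll limits and the selector in the comment are placeholders, and it assumes a local browser driver such as geckodriver is installed):

    import time

    from selenium import webdriver

    def fetch_infinite_scroll(url, pause=2.0, max_rounds=10):
        """Scroll to the bottom repeatedly with Selenium, then return the rendered HTML."""
        driver = webdriver.Firefox()  # or webdriver.Chrome(); assumes the driver binary is installed
        driver.get(url)
        last_height = driver.execute_script("return document.body.scrollHeight")
        for _ in range(max_rounds):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(pause)  # crude wait for the next batch of items to load
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break  # page stopped growing
            last_height = new_height
        html = driver.page_source
        driver.quit()
        return html

    # The rendered HTML can then be parsed with Scrapy's selectors, e.g.:
    #   from scrapy.selector import Selector
    #   items = Selector(text=fetch_infinite_scroll("http://example.com/feed")).css(".item::text").getall()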

How do web spiders differ from Wget's spider?

最后都变了- submitted on 2019-11-27 16:51:33
Question: The following passage in Wget's manual caught my eye:

    wget --spider --force-html -i bookmarks.html

    This feature needs much more work for Wget to get close to the functionality of real web spiders.

I find the following lines of code relevant for the spider option in Wget:

    src/ftp.c
    780:  /* If we're in spider mode, don't really retrieve anything.  The
    784:  if (opt.spider)
    889:  if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
    1227: if (!opt.spider)
    1239: if (!opt.spider)
    1268:
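In short, --spider makes Wget check that the listed URLs exist without saving any content, whereas a real spider also parses each page and follows the links it finds. As a rough illustration of the existence check only (not Wget's actual implementation), in Python with the requests package:

    import requests

    def check_link(url):
        """Roughly a --spider-style check for one URL: verify it responds without saving the body."""
        try:
            r = requests.head(url, allow_redirects=True, timeout=10)
            if r.status_code == 405:  # some servers reject HEAD; retry with GET and discard the body
                r = requests.get(url, stream=True, timeout=10)
            return r.status_code < 400
        except requests.RequestException:
            return False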

Designing a web crawler

戏子无情 submitted on 2019-11-27 16:36:27
I have come across an interview question: "If you were designing a web crawler, how would you avoid getting into infinite loops?" and I am trying to answer it. How does it all begin? Say Google started with some hub pages, hundreds of them (how these hub pages were found in the first place is a different sub-question). As Google follows links from a page and so on, does it keep a hash table to make sure it doesn't follow previously visited pages? What if the same page has two names (URLs), as happens these days with URL shorteners, etc.? I have taken Google
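To make the loop-avoidance part concrete, a minimal breadth-first crawler can keep a set of normalized URLs it has already fetched. A sketch in Python, assuming the requests and beautifulsoup4 packages; it ignores robots.txt and politeness, and URL aliases such as shorteners would additionally need redirect resolution or content hashing on top of this:

    from collections import deque
    from urllib.parse import urljoin, urldefrag

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed_urls, max_pages=100):
        """Breadth-first crawl that avoids loops by tracking already-visited URLs in a set."""
        seen = set()
        queue = deque(seed_urls)
        while queue and len(seen) < max_pages:
            url, _ = urldefrag(queue.popleft())  # drop #fragments so the same page isn't queued twice
            if url in seen:
                continue
            seen.add(url)
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                queue.append(urljoin(url, a["href"]))
        return seen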

Scrapy LinkExtractor duplicating(?)

半世苍凉 submitted on 2019-11-27 15:49:39
I have the crawler implemented as below. It is working, and it goes through the sites allowed by the link extractor. Basically, what I am trying to do is extract information from different places on the page:

- href and text() under the class 'news' (if it exists)
- image URL under the class 'think block' (if it exists)

I have three problems with my Scrapy spider:

1) Duplicating LinkExtractor. It seems to duplicate processed pages. (I checked against the export file and found that the same ~.img appeared many times, which is hardly possible.) And the fact is, for every page in the website,
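Without the full spider it is hard to pin down, but Scrapy's scheduler already filters duplicate requests by default, so repeated output usually means the same item is being yielded from every page that links to it rather than the LinkExtractor revisiting pages. For reference, a hedged sketch of the usual CrawlSpider shape (the domain, URL pattern, and CSS classes below are placeholders, not the asker's real ones):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class NewsSpider(CrawlSpider):
        name = "news"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/"]

        # Only matching pages reach parse_item; everything else is just followed for links.
        rules = (
            Rule(LinkExtractor(allow=r"/articles/"), callback="parse_item", follow=True),
        )

        def parse_item(self, response):
            for block in response.css(".news"):
                yield {
                    "href": block.css("a::attr(href)").get(),
                    "text": block.css("a::text").get(),
                }
            for src in response.css(".think.block img::attr(src)").getall():
                yield {"image": response.urljoin(src)}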

Crawling an HTML page using PHP?

青春壹個敷衍的年華 submitted on 2019-11-27 15:47:25
This website lists over 250 courses in one list. I want to get the name of each course and insert it into my MySQL database using PHP. The courses are listed like this:

    <td> computer science</td>
    <td> media studies</td>
    …

Is there a way to do that in PHP, instead of a mad data-entry nightmare?

Regular expressions work well:

    $page = // get the page
    $page = preg_split("/\n/", $page);
    foreach ($page as $text) {
        $matches = array();
        preg_match("/^<td>(.*)<\/td>$/", $text, $matches);
        // insert $matches[1] into the database
    }

See the documentation for preg_match. You can use this HTML
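For comparison, the same extraction sketched in Python with an HTML parser instead of a regular expression (the URL below is a placeholder for the course-listing page, and it assumes the requests and beautifulsoup4 packages):

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("http://example.edu/courses").text  # placeholder URL
    soup = BeautifulSoup(html, "html.parser")
    courses = [td.get_text(strip=True) for td in soup.find_all("td")]
    # `courses` now holds strings like "computer science", ready to be inserted into the database.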

Submit form with no submit button in rvest

被刻印的时光 ゝ submitted on 2019-11-27 15:36:39
I'm trying to write a crawler to download some information, similar to this Stack Overflow post. The answer is useful for creating the filled-in form, but I'm struggling to find a way to submit the form when a submit button is not part of the form. Here is an example:

    session <- html_session("www.chase.com")
    form <- html_form(session)[[3]]
    filledform <- set_values(form, `user_name` = user_name, `usr_password` = usr_password)
    session <- submit_form(session, filledform)

At this point, I receive this error:

    Error in names(submits)[[1]] : subscript out of bounds

How can I make this form submit? Here

How can I bring Google-like recrawling into my application (web or console)?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-27 15:20:57
Question: How can I bring Google-like recrawling into my application (web or console)? I need only those pages to be recrawled which were updated after a particular date. The LastModified header in the System.Net.WebResponse gives only the current date of the server. For example, if I downloaded a page with HttpWebRequest on 27 January 2012 and check the header for the LastModified date, it shows the current time of the server when the page was served. In this case it is 27 January 2012 only. Can
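One language-agnostic option is a conditional GET with an If-Modified-Since header: a well-behaved server answers 304 Not Modified when the page has not changed. As noted above, many dynamic pages report the serving time as Last-Modified, so comparing a hash of the body between crawls is the usual fallback. A sketch in Python (rather than .NET), assuming the requests package:

    import hashlib

    import requests

    def fetch_if_changed(url, last_http_date, last_body_hash=None):
        """Return the new page text if it changed since the last crawl, else None."""
        r = requests.get(url, headers={"If-Modified-Since": last_http_date}, timeout=10)
        if r.status_code == 304:
            return None  # server confirmed the page is unchanged
        body_hash = hashlib.sha256(r.content).hexdigest()
        if last_body_hash is not None and body_hash == last_body_hash:
            return None  # Last-Modified was unreliable, but the content itself is identical
        return r.text

    # Example: fetch_if_changed("http://example.com/page", "Fri, 27 Jan 2012 00:00:00 GMT")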

Python Scrapy on offline (local) data

与世无争的帅哥 submitted on 2019-11-27 14:46:04
Question: I have a 270 MB dataset (10,000 HTML files) on my computer. Can I use Scrapy to crawl this dataset locally? How?

Answer 1:

SimpleHTTP Server Hosting

If you truly want to host it locally and use Scrapy, you could serve it by navigating to the directory it's stored in and running SimpleHTTPServer (port 8000 shown below):

    python -m SimpleHTTPServer 8000

Then just point Scrapy at 127.0.0.1:8000:

    $ scrapy crawl 127.0.0.1:8000

file://

An alternative is to just have Scrapy point to the set of files directly
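To flesh out the file:// alternative a little: Scrapy will fetch local files when the start URLs use the file scheme with absolute paths. A minimal sketch (the dataset directory below is a placeholder):

    import glob

    import scrapy

    class LocalHtmlSpider(scrapy.Spider):
        name = "local_html"
        # Build file:// start URLs from the locally stored dataset.
        start_urls = ["file://" + path for path in glob.glob("/data/html_dump/*.html")]

        def parse(self, response):
            # Extract whatever is needed from each stored page, e.g. the <title>.
            yield {"file": response.url, "title": response.css("title::text").get()}

Saved to a file, it can be run with scrapy runspider <file>.py -o items.json, or with scrapy crawl local_html inside a project.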

Creating a generic Scrapy spider

二次信任 submitted on 2019-11-27 14:41:06
My question is really how to do the same thing as a previous question, but in Scrapy 0.14: Using one Scrapy spider for several websites. Basically, I have a GUI that takes parameters like domain, keywords, tag names, etc., and I want to create a generic spider to crawl those domains for those keywords in those tags. I've read conflicting things, based on older versions of Scrapy, about either overriding the spider manager class or dynamically creating a spider. Which method is preferred, and how do I implement and invoke the proper solution? Thanks in advance. Here is the code that I want to make
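As one possibility (sketched against a current Scrapy API rather than 0.14, where BaseSpider and the manager classes differed), spider arguments let a GUI pass the domain, keywords, and tag names per run; the argument names and selectors below are illustrative, not the asker's code:

    import scrapy

    class GenericSpider(scrapy.Spider):
        name = "generic"

        def __init__(self, domain=None, keywords="", tags="p,h1,h2", *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.allowed_domains = [domain] if domain else []
            self.start_urls = [f"http://{domain}/"] if domain else []
            self.keywords = [k for k in keywords.lower().split(",") if k]
            self.tags = [t for t in tags.split(",") if t]

        def parse(self, response):
            # Yield text from the requested tags that mentions any of the keywords.
            for tag in self.tags:
                for text in response.css(f"{tag} ::text").getall():
                    if any(k in text.lower() for k in self.keywords):
                        yield {"url": response.url, "tag": tag, "text": text.strip()}
            # Keep crawling within the allowed domain.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)

The GUI would then launch it per site with something like scrapy crawl generic -a domain=example.com -a keywords=python,crawler -a tags=h1,p.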