web-crawler

Facebook fanpage crawler

巧了我就是萌 submitted on 2019-12-08 07:36:56
Question: I would like to write a Facebook fan page crawler which collects the following information: 1) fan page name, 2) fan count, 3) feed. I know I can use the Graph API to get this, but I want to write a script that runs once a day, fetches all this data, and dumps it into my SQL database. Is there a better way to do this? Any help is appreciated.

Answer 1: I think it's against the Facebook TOS. Not long ago I read a blog where the writer created some type of spider to collect data about Facebook pages
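A minimal sketch of the daily fetch the asker describes, using the Graph API route; the page id and access token are placeholders, and fan_count/feed are the documented page field names (worth verifying against the current API version):

    import requests  # third-party HTTP library

    PAGE_ID = "your_page_id"      # placeholder
    ACCESS_TOKEN = "your_token"   # placeholder

    # One GET returns the page name, fan count and recent feed in a single payload.
    resp = requests.get(
        "https://graph.facebook.com/%s" % PAGE_ID,
        params={"fields": "name,fan_count,feed", "access_token": ACCESS_TOKEN},
    )
    data = resp.json()
    print(data["name"], data["fan_count"])
    for post in data.get("feed", {}).get("data", []):
        print(post.get("message", ""))  # dump into the SQL database here instead

Running it once a day is then just a cron entry (e.g. 0 3 * * * python fetch_page.py), which keeps the script itself free of scheduling logic.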

JSP Page Crawler that extracts all input parameters

这一生的挚爱 submitted on 2019-12-08 07:16:48
Question: Do you happen to know of an open-source Java component that can scan a set of dynamic pages (JSPs) and extract all the input parameters from them? Of course, a crawler can only crawl the static output and not the dynamic code, but my idea here is to extend one to crawl a web server including all the server-side code. Naturally, I am assuming that the tool would have full access to the crawled web server, not rely on any hacks. The idea is to build a static analyzer that
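For the rendered-output half of the problem (the server-side half is what the static analyzer would add), a small jsoup sketch can already pull every named input parameter out of a fetched page; the URL below is a placeholder for a page served by the JSP container:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class InputParamExtractor {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; point this at the crawled web server.
            Document doc = Jsoup.connect("http://localhost:8080/app/form.jsp").get();
            // Every named form field is a candidate input parameter.
            for (Element field : doc.select("input[name], select[name], textarea[name]")) {
                System.out.println(field.tagName() + " -> " + field.attr("name"));
            }
        }
    }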

Crawling Issue with Apache Nutch 1.12

随声附和 submitted on 2019-12-08 07:11:03
Question: I am new to crawling. I was using https://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website to perform crawling with Nutch 1.12. I did the setup using Cygwin on Windows. The bin/nutch command runs fine, but to crawl I made the following changes. This is my conf/nutch-site.xml file:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
      <property>
        <name>http.agent.name<
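The file is cut off above, but for comparison the tutorial's minimal nutch-site.xml needs only the http.agent.name property set; the value is an arbitrary crawler identifier, so the one below is a placeholder:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <!-- placeholder; any descriptive agent name works -->
        <value>MyNutchSpider</value>
      </property>
    </configuration>

Nutch refuses to fetch anything until this property is non-empty, which is the most common first-crawl stumbling block.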

How to retrieve the ajax data whose loading requires a mouse click with PhantomJS or other tools

一世执手 submitted on 2019-12-08 07:10:39
Question: I'm using PhantomJS to retrieve this page: Target Page Link. The contents I need are under the "行政公告" (administrative announcements) and "就業徵才公告" (job recruitment announcements) tabs. Because this page is written in Chinese, if you cannot find the tabs you can use your browser's "find" function to locate them. Because the contents under the "行政公告" tab are loaded by default, I can easily use the script below to retrieve the page:

    var page = require('webpage').create();
    var url = 'http://sa.ttu.edu.tw/bin/home
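For the second tab, the usual PhantomJS pattern is to simulate the click inside page.evaluate and read page.content after a short delay for the ajax response; a sketch, with a placeholder URL and a hypothetical selector that would need to be taken from the real page:

    var page = require('webpage').create();
    var url = 'http://example.com/target-page'; // placeholder for the page in question

    page.open(url, function (status) {
        if (status !== 'success') { phantom.exit(1); }
        // Simulate the mouse click that triggers the ajax load of the second tab.
        page.evaluate(function () {
            var tab = document.querySelector('a[title="就業徵才公告"]'); // hypothetical selector
            if (tab) { tab.click(); }
        });
        // Give the ajax request time to finish before reading the DOM.
        window.setTimeout(function () {
            console.log(page.content);
            phantom.exit();
        }, 2000);
    });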

Getting web page after calling DownloadStringAsync()?

三世轮回 submitted on 2019-12-08 06:56:58
Question: I don't know enough about VB.Net yet to use the richer HttpWebRequest class, so I figured I'd use the simpler WebClient class to download web pages asynchronously (to avoid freezing the UI). However, how can the asynchronous event handler actually return the web page to the calling routine?

    Imports System.Net

    Public Class Form1
        Private Shared Sub DownloadStringCallback2(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
            If e.Cancelled = False AndAlso e.Error Is Nothing Then
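With this pattern the handler cannot "return" the page to the caller; the page arrives in e.Result inside the handler, and the caller's job is only to wire the handler up and start the download. A minimal sketch (the URL is a placeholder):

    Private Sub StartDownload()
        Dim client As New WebClient()
        ' The completed event fires once the page has been downloaded.
        AddHandler client.DownloadStringCompleted, AddressOf DownloadStringCallback2
        client.DownloadStringAsync(New Uri("http://example.com/")) ' placeholder URL
    End Sub

    Private Shared Sub DownloadStringCallback2(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
        If Not e.Cancelled AndAlso e.Error Is Nothing Then
            Dim pageHtml As String = e.Result ' the downloaded page lives here
            ' Process pageHtml here (or raise your own event) rather than returning it.
        End If
    End Sub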

JSON not working in Scrapy when calling spider through a Python script?

狂风中的少年 submitted on 2019-12-08 06:56:30
Question: I call my spider through a Python script, which is as follows:

    import os
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')

    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import CrawlerSettings
    from scrapy.xlib.pydispatch import dispatcher
    from spiders.image import aqaqspider

    def stop_reactor():
        reactor.stop()

    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = aqaqspider
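A likely cause: when a spider is started from a script, the -o flag of scrapy crawl is never in play, so the JSON feed exporter has to be configured through settings instead. A sketch against the same era's API as the snippet above (attribute names are from the old CrawlerSettings and should be checked against the installed Scrapy version):

    settings = CrawlerSettings()
    settings.overrides['FEED_FORMAT'] = 'json'   # what "-o items.json" would have implied
    settings.overrides['FEED_URI'] = 'items.json'

    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()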

Scrape multiple URLs with Scrapy

强颜欢笑 submitted on 2019-12-08 06:51:01
Question: How can I scrape multiple URLs with Scrapy? Am I forced to make multiple crawlers?

    class TravelSpider(BaseSpider):
        name = "speedy"
        allowed_domains = ["example.com"]
        # The two comprehensions must be two lists joined with +;
        # putting both inside one list literal, as originally written, is a SyntaxError.
        start_urls = (["http://example.com/category/top/page-%d/" % i for i in xrange(4)]
                      + ["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)])

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            items = []
            item = TravelItem()
            item['url'] = hxs.select('//a[@class="out"]/@href').extract()
            out = "\n".join(str(e) for e in
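Separately from the syntax fix above, an alternative sketch avoids the long start_urls literal altogether by overriding start_requests (part of the BaseSpider-era API) and yielding the requests lazily:

    def start_requests(self):
        # Generate one request per page, for both URL families, in a single spider.
        for i in xrange(4):
            yield self.make_requests_from_url("http://example.com/category/top/page-%d/" % i)
        for i in xrange(55):
            yield self.make_requests_from_url("http://example.com/superurl/top/page-%d/" % i)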

How to find “equivalent” texts?

空扰寡人 submitted on 2019-12-08 06:47:56
Question: I want to find (not generate) 2 text strings such that, after removing all non-letters and upcasing, one string can be translated to the other by simple substitution. The motivation for this comes from a project I know of that is testing methods for attacking cyphers via probability distributions. I'd like to find a large, coherent plain text that, once encrypted with a simple substitution cypher, can be decrypted to something else that is also coherent. This ends up as 2 parts, find the
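The equivalence test itself is mechanical: after stripping non-letters and upcasing, two texts are related by a simple (bijective) substitution exactly when they share the same first-occurrence pattern. A sketch of that check:

    def pattern(text):
        """Canonical first-occurrence pattern, e.g. 'HELLO' -> (0, 1, 2, 2, 3)."""
        letters = [c for c in text.upper() if c.isalpha()]
        first_seen = {}
        return tuple(first_seen.setdefault(c, len(first_seen)) for c in letters)

    def substitution_equivalent(a, b):
        # Equal patterns <=> some bijective letter substitution maps one text to the other.
        return pattern(a) == pattern(b)

    # substitution_equivalent("abc ab", "XYZ-XY!") -> True

Finding two large coherent texts with matching patterns is of course the hard part the question is really about; the check above only makes candidates verifiable.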

Advice on crawling web site content

。_饼干妹妹 submitted on 2019-12-08 05:33:44
Question: I was trying to crawl some website content using a jsoup and Java combination, saving the relevant details to my database and repeating the same activity daily. But here is the deal: when I open the website in a browser I get the rendered HTML (with all the element tags there). The JavaScript part, when I test it, works just fine (the one I'm supposed to use to extract the correct data). But when I do a parse/get with jsoup (from a Java class), only the initial HTML is downloaded for parsing.
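jsoup fetches the raw HTML and never executes the page's JavaScript, so the ajax-built parts are simply absent. One common workaround is to let a headless browser such as HtmlUnit render the page first and hand the result to jsoup; a sketch, assuming HtmlUnit 2.x (the URL is a placeholder):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class RenderedFetch {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage("http://example.com/"); // placeholder URL
            webClient.waitForBackgroundJavaScript(5000); // let the ajax calls finish
            // Feed the rendered DOM to jsoup so the existing extraction code still works.
            Document doc = Jsoup.parse(page.asXml());
            System.out.println(doc.title());
            webClient.close();
        }
    }

Alternatively, watching the browser's network tab often reveals the underlying ajax endpoint, which jsoup (or any HTTP client) can then fetch directly.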

PHP - Is there a safe way to perform deep recursion?

允我心安 submitted on 2019-12-08 05:13:49
Question: I'm talking about performing deep recursion for around 5+ minutes, something a crawler might do in order to extract URL links and sub-URL links of pages. It seems that deep recursion in PHP is not realistic, e.g.:

    getInfo("www.example.com");

    function getInfo($link){
        // file_get_content() is not a PHP function; with the Simple HTML DOM library
        // (which the ->find() calls imply) the fetch would be file_get_html().
        $content = file_get_html($link);
        if($con = $content->find('.subCategories', 0)){
            echo "go deeper<br>";
            getInfo($con->find('a', 0)->href);
        } else {
            echo "reached deepest<br>";
        }
    }

Answer 1: Doing something like
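One common way to sidestep deep recursion entirely is to rewrite it as a loop over an explicit queue, so depth is bounded by memory rather than the call stack; a sketch, assuming the same Simple HTML DOM helper as above:

    <?php
    // Iterative version: an explicit queue replaces the call stack.
    require_once 'simple_html_dom.php'; // assumed library providing file_get_html()

    $queue = array('http://www.example.com'); // placeholder start URL
    $visited = array();

    while (!empty($queue)) {
        $link = array_shift($queue);
        if (isset($visited[$link])) {
            continue; // skip pages we have already seen
        }
        $visited[$link] = true;

        $content = file_get_html($link);
        if (!$content) {
            continue;
        }
        if ($con = $content->find('.subCategories', 0)) {
            echo "go deeper<br>";
            $queue[] = $con->find('a', 0)->href; // enqueue instead of recursing
        } else {
            echo "reached deepest<br>";
        }
    }

The $visited map also prevents the infinite loops that link cycles would otherwise cause, a problem the recursive version has as well.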