web-crawler

Facebook fanpage crawler

巧了我就是萌 submitted on 2019-12-08 07:36:56
Question: I would like to write a Facebook fan page crawler which collects the following information: 1) fan page name, 2) fan count, 3) feed. I know I can use the Graph API to get this, but I want to write a script that runs once a day, fetches all this data, and dumps it into my SQL database. Is there a better way to do this? Any help is appreciated.

Answer 1: I think it's against the Facebook TOS. Not long ago I read a blog where the writer created some type of spider to collect data about Facebook pages
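A minimal sketch of the daily fetch the asker describes, using the Graph API route; the page id and access token are placeholders, and fan_count/feed are the documented page field names (worth verifying against the current API version):

    import requests  # third-party HTTP library

    PAGE_ID = "your_page_id"      # placeholder
    ACCESS_TOKEN = "your_token"   # placeholder

    # One GET returns the page name, fan count and recent feed in a single payload.
    resp = requests.get(
        "https://graph.facebook.com/%s" % PAGE_ID,
        params={"fields": "name,fan_count,feed", "access_token": ACCESS_TOKEN},
    )
    data = resp.json()
    print(data["name"], data["fan_count"])
    for post in data.get("feed", {}).get("data", []):
        print(post.get("message", ""))  # dump into the SQL database here instead

Running it once a day is then just a cron entry (e.g. 0 3 * * * python fetch_page.py), which keeps the script itself free of scheduling logic.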

JSP Page Crawler that extracts all input parameters

这一生的挚爱 submitted on 2019-12-08 07:16:48
Question: Do you happen to know of an open-source Java component that can scan a set of dynamic pages (JSPs) and extract all the input parameters from them? Of course, a crawler can only crawl the static output and not the dynamic code, but my idea here is to extend one to crawl a web server including all the server-side code. Naturally, I am assuming that the tool would have full access to the crawled web server, not rely on any hacks. The idea is to build a static analyzer that
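For the rendered-output half of the problem (the server-side half is what the static analyzer would add), a small jsoup sketch can already pull every named input parameter out of a fetched page; the URL below is a placeholder for a page served by the JSP container:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class InputParamExtractor {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; point this at the crawled web server.
            Document doc = Jsoup.connect("http://localhost:8080/app/form.jsp").get();
            // Every named form field is a candidate input parameter.
            for (Element field : doc.select("input[name], select[name], textarea[name]")) {
                System.out.println(field.tagName() + " -> " + field.attr("name"));
            }
        }
    }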

Crawling Issue with Apache Nutch 1.12

随声附和 submitted on 2019-12-08 07:11:03
Question: I am new to crawling. I was using https://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website to perform crawling with Nutch 1.12. I did the setup using Cygwin on Windows. The bin/nutch command runs fine, but to crawl I made the following changes. This is my conf/nutch-site.xml file:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
      <property>
        <name>http.agent.name<
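The file is cut off above, but for comparison the tutorial's minimal nutch-site.xml needs only the http.agent.name property set; the value is an arbitrary crawler identifier, so the one below is a placeholder:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>http.agent.name</name>
        <!-- placeholder; any descriptive agent name works -->
        <value>MyNutchSpider</value>
      </property>
    </configuration>

Nutch refuses to fetch anything until this property is non-empty, which is the most common first-crawl stumbling block.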

How to retrieve the ajax data whose loading requires a mouse click with PhantomJS or other tools

一世执手 submitted on 2019-12-08 07:10:39
Question: I'm using PhantomJS to retrieve this page: Target Page Link. The contents I need are under the "行政公告" (administrative announcements) and "就業徵才公告" (job recruitment announcements) tabs. Because this page is written in Chinese, if you cannot find the tabs you can use your browser's "find" function to locate them. Because the contents under the "行政公告" tab are loaded by default, I can easily use the script below to retrieve the page:

    var page = require('webpage').create();
    var url = 'http://sa.ttu.edu.tw/bin/home
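For the second tab, the usual PhantomJS pattern is to simulate the click inside page.evaluate and read page.content after a short delay for the ajax response; a sketch, with a placeholder URL and a hypothetical selector that would need to be taken from the real page:

    var page = require('webpage').create();
    var url = 'http://example.com/target-page'; // placeholder for the page in question

    page.open(url, function (status) {
        if (status !== 'success') { phantom.exit(1); }
        // Simulate the mouse click that triggers the ajax load of the second tab.
        page.evaluate(function () {
            var tab = document.querySelector('a[title="就業徵才公告"]'); // hypothetical selector
            if (tab) { tab.click(); }
        });
        // Give the ajax request time to finish before reading the DOM.
        window.setTimeout(function () {
            console.log(page.content);
            phantom.exit();
        }, 2000);
    });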

Getting web page after calling DownloadStringAsync()?

三世轮回 submitted on 2019-12-08 06:56:58
Question: I don't know enough about VB.Net yet to use the richer HttpWebRequest class, so I figured I'd use the simpler WebClient class to download web pages asynchronously (to avoid freezing the UI). However, how can the asynchronous event handler actually return the web page to the calling routine?

    Imports System.Net

    Public Class Form1
        Private Shared Sub DownloadStringCallback2(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
            If e.Cancelled = False AndAlso e.Error Is Nothing Then
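With this pattern the handler cannot "return" the page to the caller; the page arrives in e.Result inside the handler, and the caller's job is only to wire the handler up and start the download. A minimal sketch (the URL is a placeholder):

    Private Sub StartDownload()
        Dim client As New WebClient()
        ' The completed event fires once the page has been downloaded.
        AddHandler client.DownloadStringCompleted, AddressOf DownloadStringCallback2
        client.DownloadStringAsync(New Uri("http://example.com/")) ' placeholder URL
    End Sub

    Private Shared Sub DownloadStringCallback2(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
        If Not e.Cancelled AndAlso e.Error Is Nothing Then
            Dim pageHtml As String = e.Result ' the downloaded page lives here
            ' Process pageHtml here (or raise your own event) rather than returning it.
        End If
    End Sub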

JSON not working in Scrapy when calling spider through a Python script?

狂风中的少年 submitted on 2019-12-08 06:56:30
Question: I call my spider through a Python script, which is as follows:

    import os
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')

    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import CrawlerSettings
    from scrapy.xlib.pydispatch import dispatcher
    from spiders.image import aqaqspider

    def stop_reactor():
        reactor.stop()

    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = aqaqspider
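A likely cause: when a spider is started from a script, the -o flag of scrapy crawl is never in play, so the JSON feed exporter has to be configured through settings instead. A sketch against the same era's API as the snippet above (attribute names are from the old CrawlerSettings and should be checked against the installed Scrapy version):

    settings = CrawlerSettings()
    settings.overrides['FEED_FORMAT'] = 'json'   # what "-o items.json" would have implied
    settings.overrides['FEED_URI'] = 'items.json'

    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()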

Scrape multiple URLs with Scrapy

强颜欢笑 submitted on 2019-12-08 06:51:01
Question: How can I scrape multiple URLs with Scrapy? Am I forced to make multiple crawlers?

    class TravelSpider(BaseSpider):
        name = "speedy"
        allowed_domains = ["example.com"]
        # The two comprehensions must be two lists joined with +;
        # putting both inside one list literal, as originally written, is a SyntaxError.
        start_urls = (["http://example.com/category/top/page-%d/" % i for i in xrange(4)]
                      + ["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)])

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            items = []
            item = TravelItem()
            item['url'] = hxs.select('//a[@class="out"]/@href').extract()
            out = "\n".join(str(e) for e in
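Separately from the syntax fix above, an alternative sketch avoids the long start_urls literal altogether by overriding start_requests (part of the BaseSpider-era API) and yielding the requests lazily:

    def start_requests(self):
        # Generate one request per page, for both URL families, in a single spider.
        for i in xrange(4):
            yield self.make_requests_from_url("http://example.com/category/top/page-%d/" % i)
        for i in xrange(55):
            yield self.make_requests_from_url("http://example.com/superurl/top/page-%d/" % i)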

How to find “equivalent” texts?

空扰寡人 submitted on 2019-12-08 06:47:56
Question: I want to find (not generate) 2 text strings such that, after removing all non-letters and upcasing, one string can be translated to the other by simple substitution. The motivation for this comes from a project I know of that is testing methods for attacking cyphers via probability distributions. I'd like to find a large, coherent plain text that, once encrypted with a simple substitution cypher, can be decrypted to something else that is also coherent. This ends up as 2 parts, find the
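The equivalence test itself is mechanical: after stripping non-letters and upcasing, two texts are related by a simple (bijective) substitution exactly when they share the same first-occurrence pattern. A sketch of that check:

    def pattern(text):
        """Canonical first-occurrence pattern, e.g. 'HELLO' -> (0, 1, 2, 2, 3)."""
        letters = [c for c in text.upper() if c.isalpha()]
        first_seen = {}
        return tuple(first_seen.setdefault(c, len(first_seen)) for c in letters)

    def substitution_equivalent(a, b):
        # Equal patterns <=> some bijective letter substitution maps one text to the other.
        return pattern(a) == pattern(b)

    # substitution_equivalent("abc ab", "XYZ-XY!") -> True

Finding two large coherent texts with matching patterns is of course the hard part the question is really about; the check above only makes candidates verifiable.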

Advice on crawling web site content

。_饼干妹妹 submitted on 2019-12-08 05:33:44
Question: I was trying to crawl some website content using a jsoup and Java combination, saving the relevant details to my database and repeating the same activity daily. But here is the deal: when I open the website in a browser I get the rendered HTML (with all the element tags there). The JavaScript part, when I test it, works just fine (the one I'm supposed to use to extract the correct data). But when I do a parse/get with jsoup (from a Java class), only the initial HTML is downloaded for parsing.
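jsoup fetches the raw HTML and never executes the page's JavaScript, so the ajax-built parts are simply absent. One common workaround is to let a headless browser such as HtmlUnit render the page first and hand the result to jsoup; a sketch, assuming HtmlUnit 2.x (the URL is a placeholder):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class RenderedFetch {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage("http://example.com/"); // placeholder URL
            webClient.waitForBackgroundJavaScript(5000); // let the ajax calls finish
            // Feed the rendered DOM to jsoup so the existing extraction code still works.
            Document doc = Jsoup.parse(page.asXml());
            System.out.println(doc.title());
            webClient.close();
        }
    }

Alternatively, watching the browser's network tab often reveals the underlying ajax endpoint, which jsoup (or any HTTP client) can then fetch directly.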

PHP - Is there a safe way to perform deep recursion?

允我心安 submitted on 2019-12-08 05:13:49
Question: I'm talking about performing deep recursion for around 5+ minutes, something a crawler might do in order to extract URL links and sub-URL links of pages. It seems that deep recursion in PHP is not realistic, e.g.:

    getInfo("www.example.com");

    function getInfo($link){
        // file_get_content() is not a PHP function; with the Simple HTML DOM library
        // (which the ->find() calls imply) the fetch would be file_get_html().
        $content = file_get_html($link);
        if($con = $content->find('.subCategories', 0)){
            echo "go deeper<br>";
            getInfo($con->find('a', 0)->href);
        } else {
            echo "reached deepest<br>";
        }
    }

Answer 1: Doing something like
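One common way to sidestep deep recursion entirely is to rewrite it as a loop over an explicit queue, so depth is bounded by memory rather than the call stack; a sketch, assuming the same Simple HTML DOM helper as above:

    <?php
    // Iterative version: an explicit queue replaces the call stack.
    require_once 'simple_html_dom.php'; // assumed library providing file_get_html()

    $queue = array('http://www.example.com'); // placeholder start URL
    $visited = array();

    while (!empty($queue)) {
        $link = array_shift($queue);
        if (isset($visited[$link])) {
            continue; // skip pages we have already seen
        }
        $visited[$link] = true;

        $content = file_get_html($link);
        if (!$content) {
            continue;
        }
        if ($con = $content->find('.subCategories', 0)) {
            echo "go deeper<br>";
            $queue[] = $con->find('a', 0)->href; // enqueue instead of recursing
        } else {
            echo "reached deepest<br>";
        }
    }

The $visited map also prevents the infinite loops that link cycles would otherwise cause, a problem the recursive version has as well.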