web-crawler

Symfony2 Crawler - Use UTF-8 with XPath

寵の児 submitted on 2019-12-04 18:08:18
I am using the Symfony2 Crawler bundle to run XPath queries. Everything works fine except the encoding: I would like to use UTF-8 and the Crawler is somehow not using it. I noticed this because the non-breaking spaces (&nbsp;) are converted to garbled "Â " characters, which is a known issue: UTF-8 Encoding Issue. My question is: how can I force the Symfony Crawler to use UTF-8 encoding? Here is the code I am using: $dom_input = new \DOMDocument("1.0","UTF-8"); $dom_input->encoding = "UTF-8"; $dom_input->formatOutput = true; $dom_input->loadHTMLFile($myFile); $crawler = new Crawler($dom_input); $paragraphs = $crawler->filterXPath(

Android GUI crawler

醉酒当歌 submitted on 2019-12-04 17:04:39
Does anyone know a good tool for crawling the GUI of an Android app? I found this one but couldn't figure out how to run it... Personally, I don't think it would be too hard to make a simple GUI crawler using MonkeyRunner and AndroidViewClient. Also, you may want to look into uiautomator and UI Testing. Good is a relative term; I have not used Robotium, but it is mentioned in these circles a lot. EDIT - Added an example based on a comment request. Using MonkeyRunner and AndroidViewClient you can build a hierarchy of views. I think AndroidViewClient has a built-in mechanism to do this, but I wrote my own.
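A minimal sketch of dumping the view hierarchy with AndroidViewClient (Python); exact method names can vary slightly between AndroidViewClient versions, so treat this as an outline rather than the answerer's own script:

# Dump and print the current view hierarchy of the foreground app.
# Requires AndroidViewClient and a device/emulator reachable via adb.
from com.dtmilano.android.viewclient import ViewClient

device, serialno = ViewClient.connectToDeviceOrExit()
vc = ViewClient(device, serialno)

views = vc.dump()   # capture the view hierarchy as a list of View objects
vc.traverse()       # print the hierarchy, one view per line

# A simple crawler would walk this list and touch() interesting views.
for view in views:
    print(view.getClass(), view.getText())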

Logic for Implementing a Dynamic Web Scraper in C#

∥☆過路亽.° submitted on 2019-12-04 17:03:39
I am looking to develop a web scraper in C# Windows Forms. What I am trying to accomplish is as follows: Get the URL from the user. Load the web page in the IE UI control (embedded browser) in WinForms. Allow the user to select a piece of text (contiguous, small, not exceeding 50 chars) from the loaded web page. When the user wishes to persist the location (the HTML DOM location), it has to be persisted into the DB, so that the user may use that location to fetch the data at that location during his subsequent visits. Assume that the loaded website is a price-listing site and the quoted rate keeps on
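The core idea, independent of C#, is to derive a stable XPath for the selected node, store that string, and re-apply it on later visits. A minimal sketch in Python with lxml (the same pattern should be possible in C# with any XPath-capable HTML parser):

# Persist a DOM location as an XPath string and re-apply it later.
from lxml import html

page = html.fromstring("<html><body><div id='prices'><span>42.50</span></div></body></html>")
selected = page.xpath("//div[@id='prices']/span")[0]   # pretend the user selected this node

# This is the string to store in the DB alongside the URL.
stored_xpath = page.getroottree().getpath(selected)
print(stored_xpath)   # e.g. /html/body/div/span

# On a subsequent visit, reload the page and re-apply the stored XPath.
fresh = html.fromstring("<html><body><div id='prices'><span>43.10</span></div></body></html>")
print(fresh.xpath(stored_xpath)[0].text)   # 43.10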

Comatose web crawler in R (w/ rvest)

霸气de小男生 submitted on 2019-12-04 16:58:42
I recently discovered the rvest package in R and decided to try out some web scraping. I wrote a small web crawler as a function so I could pipe it to clean up the results, etc. With a small URL list (e.g. 1-100) the function works fine; however, when a larger list is used, the function hangs at some point. It seems like one of the commands is waiting for a response but does not seem to get one, and it does not produce an error. urlscrape<-function(url_list) { library(rvest) library(dplyr) assets<-NA price<-NA description<-NA city<-NA length(url_list)->n pb <- txtProgressBar(min = 0, max = n, style =
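A hang without an error usually points to a request with no timeout waiting on an unresponsive server. The usual remedy is a per-request timeout plus error handling so the loop moves on; a minimal sketch of that pattern, shown in Python for illustration (the same idea applies inside the R function):

# Crawl a list of URLs, giving up on any single request after `timeout` seconds.
import requests

def crawl(url_list, timeout=10):
    results = {}
    for url in url_list:
        try:
            resp = requests.get(url, timeout=timeout)  # raise instead of hanging forever
            resp.raise_for_status()
            results[url] = resp.text
        except requests.RequestException as exc:
            results[url] = None                        # record the failure and keep going
            print(f"skipped {url}: {exc}")
    return results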

Extracting Fetched Web Pages from Nutch in a Map Reduce Friendly Format

自古美人都是妖i submitted on 2019-12-04 16:52:26
After a Nutch crawl in distributed (deploy) mode as follows: bin/nutch crawl s3n://..... -depth 10 -topN 50000 -dir /crawl -threads 20 I need to extract each URL fetched along with its content in a map-reduce-friendly format. Using the readseg command below, the contents are fetched, but the output format doesn't lend itself to map-reduce processing. bin/nutch readseg -dump /crawl/segments/* /output -nogenerate -noparse -noparsedata -noparsetext Ideally the output should be in this format: http://abc.com/1 content of http://abc.com/1 http://abc.com/2 content of http://abc.com/2 Any suggestions
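One low-tech option is to post-process the text produced by readseg -dump into one tab-separated record per URL. A rough sketch, assuming the dump marks records with "Recno::", "URL::" and "Content::" prefixes; the exact field labels and dump file name vary between Nutch versions, so adjust them to match your output:

# Convert a readseg text dump into "URL<TAB>content" lines.
def dump_to_tsv(dump_path, out_path):
    records = []                            # list of [url, [content lines]]
    in_content = False
    with open(dump_path, encoding="utf-8", errors="replace") as src:
        for line in src:
            if line.startswith("URL::"):
                records.append([line.split("::", 1)[1].strip(), []])
                in_content = False
            elif line.startswith("Content::"):
                in_content = True
            elif line.startswith("Recno::"):
                in_content = False          # a new record header is starting
            elif in_content and records:
                records[-1][1].append(line.strip())
    with open(out_path, "w", encoding="utf-8") as dst:
        for url, content in records:
            # join content lines so each record stays on a single line
            dst.write(url + "\t" + " ".join(content).strip() + "\n")

dump_to_tsv("/output/dump", "fetched.tsv")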

Creating a static copy of a web page on UNIX commandline / shell script

☆樱花仙子☆ submitted on 2019-12-04 15:47:00
I need to create a static copy of a web page (with all media resources such as CSS, images and JS included) from a shell script. This copy should be openable offline in any browser. Some browsers have a similar feature (Save As... Web Page, complete) which creates a folder for a page and rewrites external resources as relative static resources in that folder. What's a way to accomplish and automate this on the Linux command line for a given URL? You can use wget like this: wget --recursive --convert-links --domains=example.org http://www.example.org This command will recursively download any page
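For a single page, wget's --page-requisites (-p) flag together with --convert-links covers exactly this "Web Page, complete" behaviour. If you prefer to script the same logic yourself, the steps are: fetch the HTML, download every referenced resource, rewrite the references to local paths, and save everything in one folder. A minimal sketch of that idea in Python (requests and BeautifulSoup assumed installed; it only handles img/script/link tags and ignores resources referenced from CSS):

# Save a single page plus its directly referenced resources for offline viewing.
import os
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def save_page(url, out_dir="page_copy"):
    os.makedirs(out_dir, exist_ok=True)
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

    # Tags whose attributes point at external resources (images, JS, CSS).
    for tag_name, attr in (("img", "src"), ("script", "src"), ("link", "href")):
        for tag in soup.find_all(tag_name):
            ref = tag.get(attr)
            if not ref:
                continue
            resource_url = urljoin(url, ref)
            filename = os.path.basename(urlparse(resource_url).path) or "resource"
            try:
                data = requests.get(resource_url, timeout=30).content
            except requests.RequestException:
                continue                      # skip resources that fail to download
            with open(os.path.join(out_dir, filename), "wb") as fh:
                fh.write(data)
            tag[attr] = filename              # rewrite the reference to the local copy

    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as fh:
        fh.write(str(soup))

save_page("http://www.example.org")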

Nutch crawl: no error, but no results

本小妞迷上赌 submitted on 2019-12-04 15:46:03
I am trying to crawl some URLs with Nutch 2.1 as follows: bin/nutch crawl urls -dir crawl -depth 3 -topN 5 http://wiki.apache.org/nutch/NutchTutorial There is no error, but the folders below are not created: crawl/crawldb crawl/linkdb crawl/segments Can anyone help me? I have not been able to resolve this for two days. Thanks a lot! The output is as follows: FetcherJob: threads: 10 FetcherJob: parsing: false FetcherJob: resuming: false FetcherJob : timelimit set for : -1 Using queue mode : byHost Fetcher: threads: 10 QueueFeeder finished: total 0 records. Hit by time limit :0 -finishing thread

Wikipedia text download

早过忘川 submitted on 2019-12-04 15:35:29
Question: I am looking to download the full Wikipedia text for my college project. Do I have to write my own spider to download this, or is there a public dataset of Wikipedia available online? To give you some overview of my project, I want to find the interesting words in a few articles I am interested in. To find these interesting words, I am planning to apply tf/idf, calculating a term frequency for each word and picking the ones with high scores. But to calculate the tf, I need to know the
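Full Wikipedia database dumps are published at dumps.wikimedia.org, so no spider is needed for the raw text. Once the article text is extracted, the tf/idf step itself is small; a minimal sketch in plain Python, where whitespace tokenisation and three toy documents stand in for real article text:

# Score words by tf-idf: term frequency in a document times log inverse document frequency.
import math
from collections import Counter

def tf_idf(docs):
    doc_words = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(docs)
    df = Counter()                           # in how many documents each word appears
    for words in doc_words:
        df.update(words.keys())
    scores = []
    for words in doc_words:
        total = sum(words.values())
        scores.append({w: (c / total) * math.log(n_docs / df[w]) for w, c in words.items()})
    return scores

articles = ["the cat sat on the mat", "the dog chased the cat", "quantum field theory"]
for ranked in tf_idf(articles):
    print(sorted(ranked, key=ranked.get, reverse=True)[:3])   # top 3 "interesting" words per article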

Python Scrapy - Login Authentication Issue

烈酒焚心 submitted on 2019-12-04 15:33:31
Question: I have just started using Scrapy and I am facing a few problems with logging in. I am trying to scrape items from the website www.instacart.com, but I am running into issues with the login. The following is the code: import scrapy from scrapy.loader import ItemLoader from project.items import ProjectItem from scrapy.http import Request from scrapy import optional_features optional_features.remove('boto') class FirstSpider(scrapy.Spider): name = "first" allowed_domains = ["https://instacart.com"]
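Two things usually matter here: allowed_domains should contain bare domain names (e.g. "instacart.com"), not full URLs, and the login is normally done with FormRequest.from_response so Scrapy picks up the hidden form fields (CSRF tokens and the like). A minimal sketch of that pattern; the login URL and form field names below are placeholders, not Instacart's real ones:

# A generic Scrapy login spider using FormRequest.from_response.
import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = "login_example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/login"]

    def parse(self, response):
        # Fill in and submit the login form found on the page.
        return FormRequest.from_response(
            response,
            formdata={"email": "user@example.com", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        if b"Log out" in response.body:            # crude success check
            yield scrapy.Request("https://www.example.com/items", callback=self.parse_items)
        else:
            self.logger.error("Login failed")

    def parse_items(self, response):
        for title in response.css("h2::text").extract():
            yield {"title": title}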

iPhone: How to download a full website?

て烟熏妆下的殇ゞ submitted on 2019-12-04 15:26:46
What approach do you recommend for downloading a website (one HTML site with all included images) to the iPhone? The question is how to crawl all those tiny bits (JavaScript, images, CSS) and save them locally. It's not about the concrete implementation (I know how to use NSURLRequest and such; I'm looking for a crawl/spider approach). Jailbreaks won't work, since this is intended for an official (App Store) app. Regards, Stefan Downloading? Or getting the HTML source of the site and displaying it with a UIWebView? If the latter, you could simply do this: NSString *data = [[NSString alloc]