web-crawler

Symfony2 Crawler - Use UTF-8 with XPath

寵の児 submitted on 2019-12-04 18:08:18
I am using the Symfony2 Crawler bundle to run XPath queries. Everything works fine except the encoding: I would like to use UTF-8 and the Crawler is somehow not using it. I noticed this because the non-breaking spaces (&nbsp;) are converted to garbled "Â " characters, which is a known issue: UTF-8 Encoding Issue. My question is: how can I force the Symfony Crawler to use UTF-8 encoding? Here is the code I am using: $dom_input = new \DOMDocument("1.0","UTF-8"); $dom_input->encoding = "UTF-8"; $dom_input->formatOutput = true; $dom_input->loadHTMLFile($myFile); $crawler = new Crawler($dom_input); $paragraphs = $crawler->filterXPath(

Android GUI crawler

醉酒当歌 submitted on 2019-12-04 17:04:39
Does anyone know a good tool for crawling the GUI of an Android app? I found this one but couldn't figure out how to run it... Personally, I don't think it would be too hard to make a simple GUI crawler using MonkeyRunner and AndroidViewClient. Also, you may want to look into uiautomator and UI Testing. Good is a relative term; I have not used Robotium, but it is mentioned in these circles a lot. EDIT - Added an example based on a comment request. Using MonkeyRunner and AndroidViewClient you can build a hierarchy of views. I think AndroidViewClient has a built-in mechanism to do this, but I wrote my own.
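A minimal sketch of dumping the view hierarchy with AndroidViewClient (Python); exact method names can vary slightly between AndroidViewClient versions, so treat this as an outline rather than the answerer's own script:

# Dump and print the current view hierarchy of the foreground app.
# Requires AndroidViewClient and a device/emulator reachable via adb.
from com.dtmilano.android.viewclient import ViewClient

device, serialno = ViewClient.connectToDeviceOrExit()
vc = ViewClient(device, serialno)

views = vc.dump()   # capture the view hierarchy as a list of View objects
vc.traverse()       # print the hierarchy, one view per line

# A simple crawler would walk this list and touch() interesting views.
for view in views:
    print(view.getClass(), view.getText())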

Logic for Implementing a Dynamic Web Scraper in C#

∥☆過路亽.° submitted on 2019-12-04 17:03:39
I am looking to develop a web scraper in C# Windows Forms. What I am trying to accomplish is as follows: Get the URL from the user. Load the web page in the IE UI control (embedded browser) in WinForms. Allow the user to select a piece of text (contiguous, small, not exceeding 50 chars) from the loaded web page. When the user wishes to persist the location (the HTML DOM location), it has to be persisted into the DB, so that the user may use that location to fetch the data at that location during his subsequent visits. Assume that the loaded website is a price-listing site and the quoted rate keeps on
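The core idea, independent of C#, is to derive a stable XPath for the selected node, store that string, and re-apply it on later visits. A minimal sketch in Python with lxml (the same pattern should be possible in C# with any XPath-capable HTML parser):

# Persist a DOM location as an XPath string and re-apply it later.
from lxml import html

page = html.fromstring("<html><body><div id='prices'><span>42.50</span></div></body></html>")
selected = page.xpath("//div[@id='prices']/span")[0]   # pretend the user selected this node

# This is the string to store in the DB alongside the URL.
stored_xpath = page.getroottree().getpath(selected)
print(stored_xpath)   # e.g. /html/body/div/span

# On a subsequent visit, reload the page and re-apply the stored XPath.
fresh = html.fromstring("<html><body><div id='prices'><span>43.10</span></div></body></html>")
print(fresh.xpath(stored_xpath)[0].text)   # 43.10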

Comatose web crawler in R (w/ rvest)

霸气de小男生 submitted on 2019-12-04 16:58:42
I recently discovered the rvest package in R and decided to try out some web scraping. I wrote a small web crawler as a function so I could pipe it to clean up the results, etc. With a small URL list (e.g. 1-100) the function works fine; however, when a larger list is used, the function hangs at some point. It seems like one of the commands is waiting for a response but does not seem to get one, and it does not produce an error. urlscrape<-function(url_list) { library(rvest) library(dplyr) assets<-NA price<-NA description<-NA city<-NA length(url_list)->n pb <- txtProgressBar(min = 0, max = n, style =
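A hang without an error usually points to a request with no timeout waiting on an unresponsive server. The usual remedy is a per-request timeout plus error handling so the loop moves on; a minimal sketch of that pattern, shown in Python for illustration (the same idea applies inside the R function):

# Crawl a list of URLs, giving up on any single request after `timeout` seconds.
import requests

def crawl(url_list, timeout=10):
    results = {}
    for url in url_list:
        try:
            resp = requests.get(url, timeout=timeout)  # raise instead of hanging forever
            resp.raise_for_status()
            results[url] = resp.text
        except requests.RequestException as exc:
            results[url] = None                        # record the failure and keep going
            print(f"skipped {url}: {exc}")
    return results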

Extracting Fetched Web Pages from Nutch in a Map Reduce Friendly Format

自古美人都是妖i submitted on 2019-12-04 16:52:26
After a Nutch crawl in distributed (deploy) mode as follows: bin/nutch crawl s3n://..... -depth 10 -topN 50000 -dir /crawl -threads 20 I need to extract each URL fetched along with its content in a map-reduce-friendly format. Using the readseg command below, the contents are fetched, but the output format doesn't lend itself to map-reduce processing. bin/nutch readseg -dump /crawl/segments/* /output -nogenerate -noparse -noparsedata -noparsetext Ideally the output should be in this format: http://abc.com/1 content of http://abc.com/1 http://abc.com/2 content of http://abc.com/2 Any suggestions
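One low-tech option is to post-process the text produced by readseg -dump into one tab-separated record per URL. A rough sketch, assuming the dump marks records with "Recno::", "URL::" and "Content::" prefixes; the exact field labels and dump file name vary between Nutch versions, so adjust them to match your output:

# Convert a readseg text dump into "URL<TAB>content" lines.
def dump_to_tsv(dump_path, out_path):
    records = []                            # list of [url, [content lines]]
    in_content = False
    with open(dump_path, encoding="utf-8", errors="replace") as src:
        for line in src:
            if line.startswith("URL::"):
                records.append([line.split("::", 1)[1].strip(), []])
                in_content = False
            elif line.startswith("Content::"):
                in_content = True
            elif line.startswith("Recno::"):
                in_content = False          # a new record header is starting
            elif in_content and records:
                records[-1][1].append(line.strip())
    with open(out_path, "w", encoding="utf-8") as dst:
        for url, content in records:
            # join content lines so each record stays on a single line
            dst.write(url + "\t" + " ".join(content).strip() + "\n")

dump_to_tsv("/output/dump", "fetched.tsv")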

Creating a static copy of a web page on UNIX commandline / shell script

☆樱花仙子☆ submitted on 2019-12-04 15:47:00
I need to create a static copy of a web page (with all media resources such as CSS, images and JS included) from a shell script. This copy should be openable offline in any browser. Some browsers have a similar feature (Save As... Web Page, complete) which creates a folder for a page and rewrites external resources as relative static resources in that folder. What's a way to accomplish and automate this on the Linux command line for a given URL? You can use wget like this: wget --recursive --convert-links --domains=example.org http://www.example.org This command will recursively download any page
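For a single page, wget's --page-requisites (-p) flag together with --convert-links covers exactly this "Web Page, complete" behaviour. If you prefer to script the same logic yourself, the steps are: fetch the HTML, download every referenced resource, rewrite the references to local paths, and save everything in one folder. A minimal sketch of that idea in Python (requests and BeautifulSoup assumed installed; it only handles img/script/link tags and ignores resources referenced from CSS):

# Save a single page plus its directly referenced resources for offline viewing.
import os
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def save_page(url, out_dir="page_copy"):
    os.makedirs(out_dir, exist_ok=True)
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

    # Tags whose attributes point at external resources (images, JS, CSS).
    for tag_name, attr in (("img", "src"), ("script", "src"), ("link", "href")):
        for tag in soup.find_all(tag_name):
            ref = tag.get(attr)
            if not ref:
                continue
            resource_url = urljoin(url, ref)
            filename = os.path.basename(urlparse(resource_url).path) or "resource"
            try:
                data = requests.get(resource_url, timeout=30).content
            except requests.RequestException:
                continue                      # skip resources that fail to download
            with open(os.path.join(out_dir, filename), "wb") as fh:
                fh.write(data)
            tag[attr] = filename              # rewrite the reference to the local copy

    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as fh:
        fh.write(str(soup))

save_page("http://www.example.org")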

Nutch crawl: no error, but no results

本小妞迷上赌 submitted on 2019-12-04 15:46:03
I am trying to crawl some URLs with Nutch 2.1 as follows: bin/nutch crawl urls -dir crawl -depth 3 -topN 5 http://wiki.apache.org/nutch/NutchTutorial There is no error, but the folders below are not created: crawl/crawldb crawl/linkdb crawl/segments Can anyone help me? I have not been able to resolve this for two days. Thanks a lot! The output is as follows: FetcherJob: threads: 10 FetcherJob: parsing: false FetcherJob: resuming: false FetcherJob : timelimit set for : -1 Using queue mode : byHost Fetcher: threads: 10 QueueFeeder finished: total 0 records. Hit by time limit :0 -finishing thread

Wikipedia text download

早过忘川 submitted on 2019-12-04 15:35:29
Question: I am looking to download the full Wikipedia text for my college project. Do I have to write my own spider to download this, or is there a public dataset of Wikipedia available online? To give you some overview of my project, I want to find the interesting words in a few articles I am interested in. To find these interesting words, I am planning to apply tf/idf, calculating a term frequency for each word and picking the ones with high scores. But to calculate the tf, I need to know the
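Full Wikipedia database dumps are published at dumps.wikimedia.org, so no spider is needed for the raw text. Once the article text is extracted, the tf/idf step itself is small; a minimal sketch in plain Python, where whitespace tokenisation and three toy documents stand in for real article text:

# Score words by tf-idf: term frequency in a document times log inverse document frequency.
import math
from collections import Counter

def tf_idf(docs):
    doc_words = [Counter(doc.lower().split()) for doc in docs]
    n_docs = len(docs)
    df = Counter()                           # in how many documents each word appears
    for words in doc_words:
        df.update(words.keys())
    scores = []
    for words in doc_words:
        total = sum(words.values())
        scores.append({w: (c / total) * math.log(n_docs / df[w]) for w, c in words.items()})
    return scores

articles = ["the cat sat on the mat", "the dog chased the cat", "quantum field theory"]
for ranked in tf_idf(articles):
    print(sorted(ranked, key=ranked.get, reverse=True)[:3])   # top 3 "interesting" words per article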

Python Scrapy - Login Authentication Issue

烈酒焚心 submitted on 2019-12-04 15:33:31
Question: I have just started using Scrapy and I am facing a few problems with logging in. I am trying to scrape items from the website www.instacart.com, but I am running into issues with the login. The following is the code: import scrapy from scrapy.loader import ItemLoader from project.items import ProjectItem from scrapy.http import Request from scrapy import optional_features optional_features.remove('boto') class FirstSpider(scrapy.Spider): name = "first" allowed_domains = ["https://instacart.com"]
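Two things usually matter here: allowed_domains should contain bare domain names (e.g. "instacart.com"), not full URLs, and the login is normally done with FormRequest.from_response so Scrapy picks up the hidden form fields (CSRF tokens and the like). A minimal sketch of that pattern; the login URL and form field names below are placeholders, not Instacart's real ones:

# A generic Scrapy login spider using FormRequest.from_response.
import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = "login_example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/login"]

    def parse(self, response):
        # Fill in and submit the login form found on the page.
        return FormRequest.from_response(
            response,
            formdata={"email": "user@example.com", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        if b"Log out" in response.body:            # crude success check
            yield scrapy.Request("https://www.example.com/items", callback=self.parse_items)
        else:
            self.logger.error("Login failed")

    def parse_items(self, response):
        for title in response.css("h2::text").extract():
            yield {"title": title}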

iPhone: How to download a full website?

て烟熏妆下的殇ゞ submitted on 2019-12-04 15:26:46
What approach do you recommend for downloading a website (one HTML site with all included images) to the iPhone? The question is how to crawl all those tiny bits (JavaScript, images, CSS) and save them locally. It's not about the concrete implementation (I know how to use NSURLRequest and such; I'm looking for a crawl/spider approach). Jailbreaks won't work, since this is intended for an official (App Store) app. Regards, Stefan Downloading? Or getting the HTML source of the site and displaying it with a UIWebView? If the latter, you could simply do this: NSString *data = [[NSString alloc]