screen-scraping

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2

≡放荡痞女 submitted on 2019-12-06 05:39:41
Question: I am using BeautifulSoup 3.1.0.1 with Python 2.5.2 and trying to parse a web page in French. However, as soon as I call findAll, I get the following error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1146: ordinal not in range(128). Below is the code I am currently running:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    page = urllib2.urlopen("http://fr.encarta.msn.com/encyclopedia_761561798/Paris.html")
    soup = BeautifulSoup(page, fromEncoding="latin1")
    r = …
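A minimal sketch of one common workaround, assuming the page really is Latin-1: decode the raw bytes yourself and hand BeautifulSoup a unicode string, then encode explicitly only when printing, so the implicit ascii codec is never invoked. The h1 selector is an illustrative assumption, not from the original code.

```python
# Sketch (assumptions noted above): keep everything unicode inside the soup,
# encode only at the output boundary.
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://fr.encarta.msn.com/encyclopedia_761561798/Paris.html"
raw = urllib2.urlopen(url).read()
soup = BeautifulSoup(raw.decode("latin1"))      # pass unicode in; no encoding guess needed

for tag in soup.findAll("h1"):                  # illustrative tag name
    if tag.string:                              # NavigableString is a unicode subclass
        print tag.string.encode("utf-8")        # encode explicitly when printing
```

If the error really is raised inside findAll itself rather than on output, a frequently suggested fallback at the time was to stay on the BeautifulSoup 3.0.x series, whose parser was more tolerant on Python 2.5.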

Python Urllib UrlOpen Read

梦想与她 submitted on 2019-12-06 04:36:08
Say I am retrieving a list of URLs from a server using Python's urllib2 library. I noticed that it takes about 5 seconds to get one page, so it would take a long time to finish all the pages I want to collect. Thinking about those 5 seconds, most of the time is consumed on the server side, so I am wondering whether I could start using the threading library. Say 5 threads in this case; the average time per page could then drop dramatically, maybe to 1 or 2 seconds (it might make the server a bit busy). How could I optimize the number of threads so I get a legitimate speed and not …
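A minimal sketch of the threaded version, assuming Python 2.6+ and urllib2 as in the question; the URL list, timeout, and pool size of 5 are placeholders to tune against the real server.

```python
# Sketch: overlap the slow server responses with a fixed-size pool of
# worker threads instead of fetching pages one after another.
import urllib2
from multiprocessing.dummy import Pool as ThreadPool   # thread-backed Pool

def fetch(url):
    try:
        return url, urllib2.urlopen(url, timeout=10).read()
    except Exception:
        return url, None                                # record per-URL failures as None

urls = ["http://example.com/page%d" % i for i in range(1, 21)]  # placeholder list

pool = ThreadPool(5)             # 5 workers, as suggested in the question
results = pool.map(fetch, urls)  # blocks until every URL is fetched
pool.close()
pool.join()
```

Since the work is I/O-bound, threads help despite the GIL; a practical way to pick the pool size is to time a fixed batch at a few different sizes and stop growing the pool once total time (or the server's error rate) stops improving.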

Scrapy + Splash + ScrapyJS

倖福魔咒の submitted on 2019-12-06 02:19:24
Question: I am using Splash 2.0.2 + Scrapy 1.0.5 + ScrapyJS 0.1.1 and I am still not able to render JavaScript that requires a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf I am still getting the page without the phone number rendered:

    class OlxSpider(scrapy.Spider):
        name = "olx"
        rotate_user_agent = True
        allowed_domains = ["olx.pt"]
        start_urls = [
            "https://olx.pt/imoveis/"
        ]

        def parse(self, response):
            script = """
            function main …
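A minimal sketch of the usual shape of a click-then-render setup, assuming the scrapyjs/Splash middleware is already configured: a Lua script sent to Splash's execute endpoint clicks the element and returns the rendered HTML. The CSS selector, wait times, and spider name are assumptions for illustration only.

```python
# Sketch: perform the click inside Splash, then let Scrapy parse the
# rendered HTML that the Lua script returns.
import scrapy

LUA_CLICK = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(2)
    -- assumed selector for the "show phone number" element
    splash:runjs("document.querySelector('span.spoiler').click()")
    splash:wait(2)
    return splash:html()
end
"""

class OlxPhoneSketchSpider(scrapy.Spider):
    name = "olx_phone_sketch"      # hypothetical spider name
    start_urls = [
        "https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html",
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse_ad,
                meta={"splash": {"endpoint": "execute",
                                 "args": {"lua_source": LUA_CLICK}}},
            )

    def parse_ad(self, response):
        # with the "execute" endpoint, the response body is whatever the
        # Lua script returned, here the post-click HTML
        self.logger.info("rendered %d bytes", len(response.body))
```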

How to handle a failed DNS lookup in Scrapy

落爺英雄遲暮 submitted on 2019-12-05 23:27:44
Question: I am looking to handle DNS errors when scraping domains with Scrapy. Here's the error that I am seeing: ERROR: Error downloading <GET http://domain.com>: DNS lookup failed: address 'domain.com' not found: [Errno 8] nodename nor servname provided, or not known. How could I be notified when I get an error like this, so that I can handle it myself rather than Scrapy just logging the error and moving on? Answer 1: Use errback along with callback: Request(url, callback=your_callback, errback=your_errorback) and …
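A minimal sketch of how such an errback might separate DNS failures from other download errors, assuming a stock Twisted-based Scrapy install; the spider and domain names are placeholders.

```python
# Sketch: route download failures to an errback and inspect the wrapped
# Twisted failure to react to DNS problems specifically.
import scrapy
from twisted.internet.error import DNSLookupError

class DnsAwareSpider(scrapy.Spider):
    name = "dns_aware_sketch"                    # hypothetical name
    start_urls = ["http://domain.com"]           # placeholder domain

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

    def parse(self, response):
        self.logger.info("fetched %s", response.url)

    def on_error(self, failure):
        if failure.check(DNSLookupError):
            request = failure.request
            self.logger.warning("DNS lookup failed for %s", request.url)
            # handle it here: record the domain, retry with a fallback, etc.
        else:
            self.logger.error(repr(failure))
```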

Screen scraping in Clojure

给你一囗甜甜゛ submitted on 2019-12-05 19:38:54
I googled, but I can't find a satisfactory answer. This SO question is related but rather old, as well as the exact opposite of what I am looking for: a way to do screen-scraping using XPath, not CSS selectors. I've used Enlive for some basic screen-scraping, but sometimes one needs the power of XPath selectors. So here it is: Is there any equivalent to Nokogiri or lxml for Clojure (Java)? What is the state of the "pure Java Nokogiri"? Is there any way to use that library from Clojure? Are there any better alternatives than this hack? Answer: There are a couple of possibilities here. Several of these require semi-well …

Image scraping in Ruby

自作多情 submitted on 2019-12-05 19:27:53
How do I scrape an image present at a particular URL using Nokogiri? If there are better options than Nokogiri, please suggest them. The CSS selector for the image is .profilePic img

Answer (Phrogz): If it is just an <img> with a URL:

    PAGE = "http://site.com/page.html"

    require 'nokogiri'
    require 'open-uri'

    html = Nokogiri.HTML(open(PAGE))
    src  = html.at('.profilePic img')['src']

    File.open("foo.png", "wb") do |f|
      f.write(open(src).read)
    end

If you need to turn a relative image path into an absolute one, see https://stackoverflow.com/a/4864170/405017. The lazy way is to use Mechanize, as it will figure out the URLs and filenames …

Arranging coordinates into clockwise order

淺唱寂寞╮ submitted on 2019-12-05 11:58:02
I have 9 screen coordinates, each representing one of 9 positions. Starting from the top-right, I want that position to be the 1st position, and the following clockwise coordinates to represent the 2nd, 3rd, 4th and so on, up until the 9th, which would be the top-left coordinate. Would anybody here be able to come up with some mathematical means of determining which of the 9 coordinates is in which position? They're all relative to each other, and will always be THAT relative to each other. Example coordinates (x, y) could be:

    X    Y
    663  382
    543  454
    303  454
    183  382
    418  459
    543  209
    303  209
    653  …
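A minimal sketch of one mathematical approach, assuming screen coordinates (y grows downward): compute the centroid, measure each point's angle around it, and sort by that angle so the order runs clockwise starting from the top-right. The cut line placed straight up from the centroid, and the sample list below (copied from the truncated table), are assumptions; a 9th point sitting at the centre would need to be set aside before sorting.

```python
# Sketch: clockwise ordering by angle around the centroid.
import math

points = [(663, 382), (543, 454), (303, 454), (183, 382),
          (418, 459), (543, 209), (303, 209)]          # partial sample from the question

cx = sum(x for x, _ in points) / float(len(points))
cy = sum(y for _, y in points) / float(len(points))

def clockwise_from_top_right(p):
    x, y = p
    theta = math.atan2(y - cy, x - cx)              # screen coords: increasing angle runs clockwise
    return (theta + math.pi / 2) % (2 * math.pi)    # cut straight up, so top-right sorts first

ordered = sorted(points, key=clockwise_from_top_right)
for rank, (x, y) in enumerate(ordered, start=1):
    print rank, x, y
```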

Does Ruby's 'open_uri' reliably close sockets after read or on fail?

限于喜欢 submitted on 2019-12-05 11:01:48
I have been using open-uri to pull down an FTP path as a data source for some time, but I suddenly found that I'm getting a nearly continual "530 Sorry, the maximum number of allowed clients (95) are already connected." I am not sure if my code is faulty or if someone else accessing the server is at fault, and unfortunately there's seemingly no way for me to know for sure. Essentially I am reading FTP URIs with:

    def self.read_uri(uri)
      begin
        uri = open(uri).read
        uri == "Error" ? nil : uri
      rescue OpenURI::HTTPError
        nil
      end
    end

I'm guessing that I need to add some additional …

Are there any free .NET OCR libraries that will perform OCR on an application window directly?

a 夏天 submitted on 2019-12-05 10:01:55
I am looking for a free .NET OCR library that can perform OCR on a given application window, or even on an image in memory (I can take a snapshot of the application window myself). I have looked at tessnet2 and MODI, but both require an image located on disk. I need to use OCR because the application I am trying to write a script for does some wacky stuff that cannot be read using the Windows API, so I need to scrape data from the screen. I have tested both tessnet2 and MODI and they can both mostly read the text, but because this has to run in an environment that will not be able to write to …

Alternatives to Selenium/Webdriver for filling in fields when scraping headlessly with Python?

旧城冷巷雨未停 submitted on 2019-12-05 09:36:55
Question: With Python 2.7 I'm scraping with urllib2 and, when some XPath is needed, lxml as well. It's fast, and because I rarely have to navigate around the sites, this combination works well. On occasion, though, usually when I reach a page that will only display some valuable data after a short form is filled in and a submit button is clicked (example), the scraping-only approach with urllib2 is not sufficient. Each time such a page is encountered, I could invoke selenium.webdriver to refetch the …
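A minimal sketch of one browser-free alternative for simple forms, assuming the form submits a plain POST: replay the submission with urllib2 and parse the response with lxml. The form action URL, field names, and XPath are placeholders; the real values must be read from the page's <form> markup.

```python
# Sketch: fill in the form by replaying its POST request directly,
# without driving a browser.
import urllib
import urllib2
from lxml import html

form_action = "http://example.com/search"                    # assumed <form action=...>
payload = urllib.urlencode({"query": "some value",           # assumed field names
                            "submit": "Go"})

request = urllib2.Request(form_action, data=payload)         # providing data makes it a POST
response = urllib2.urlopen(request, timeout=10)

tree = html.fromstring(response.read())
rows = tree.xpath("//table//tr")                             # illustrative XPath on the result page
print len(rows)
```

When the form is guarded by cookies or hidden tokens, the same idea still works but usually needs a cookie-aware opener and a first GET to harvest the hidden fields; a headless browser only becomes necessary once real JavaScript execution is required.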