screen-scraping | 易学教程

Does Ruby's 'open_uri' reliably close sockets after read or on fail?

Question: I have been using open_uri to pull down an FTP path as a data source for some time, but I have suddenly started getting a nearly continual "530 Sorry, the maximum number of allowed clients (95) are already connected." I am not sure whether my code is faulty or whether someone else is accessing the server, and unfortunately there seems to be no way for me to know for sure who is at fault. Essentially, I am reading FTP URIs with: def self.read_uri(uri) begin uri = open(uri).read uri == "Error

Select all <p>'s from a Node's children using HTMLAgilityPack

Question: I am using the following code to fetch an HTML page, make the URLs absolute, and then mark the links rel="nofollow" and make them open in a new window/tab. My issue is with adding the attributes to the <a> elements. string url = "http://www.mysite.com/"; string strResult = ""; HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url); HttpWebResponse response = (HttpWebResponse)request.GetResponse(); if ((request.HaveResponse) && (response.StatusCode == HttpStatusCode.OK)) { using

Are there any free .NET OCR libraries that will perform OCR on an application window directly?

Question: I am looking for a free .NET OCR library that can perform OCR on a given application window, or even on an image in memory (I can take a snapshot of the application window myself). I have looked at tessnet2 and MODI, but both require an image located on disk. I need OCR because the application I am trying to script does some wacky stuff that cannot be read using the Windows API, so I need to scrape data from the screen. I have tested both tessnet2 and MODI and they both can

Screen scraping a Datepicker with Scrapy and Selenium on mouse hover

So I need to scrape a page like this, for example, and I am using Scrapy + Selenium to interact with a date-picker calendar. I realized that if a certain date is available, a price shows in the tooltip, and if it's not available, nothing happens when you hover over it. What's the code to get the price that appears dynamically when you hover over an available day, and how do I know whether it's available just from the hover? It is not that straightforward how to approach the problem because of the dynamic nature of the page - you have to use waits here and there and it's tricky to catch the

Where is the memory leak? How to timeout threads during multiprocessing in python?

Question: It is unclear how to properly time out workers of joblib's Parallel in Python. Others have had similar questions here, here, here and here. In my example I am using a pool of 50 joblib workers with the threading backend. Parallel call (threading): output = Parallel(n_jobs=50, backend = 'threading') (delayed(get_output)(INPUT) for INPUT in list) Here, Parallel hangs without errors as soon as len(list) <= n_jobs but only when n_jobs => -1 . In order to circumvent this issue, people give
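
For comparison, a per-task timeout can be sketched with the standard library's concurrent.futures instead of joblib. This is not the asker's code and not a joblib feature: get_output here is a hypothetical stand-in that just sleeps, and run_with_timeout, n_workers and timeout are names made up for the illustration. Note that a timed-out thread is not killed; the caller only stops waiting for it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def get_output(delay):
    # Hypothetical stand-in for the question's get_output():
    # it sleeps for `delay` seconds and returns the delay.
    time.sleep(delay)
    return delay


def run_with_timeout(inputs, n_workers=4, timeout=0.2):
    """Run get_output on every input in a thread pool.

    Each result is awaited for at most `timeout` seconds; a task that
    takes longer contributes None instead of blocking the caller.
    """
    results = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(get_output, item) for item in inputs]
        for future in futures:
            try:
                results.append(future.result(timeout=timeout))
            except FutureTimeout:
                # The worker thread keeps running in the background;
                # we only give up waiting for its result.
                results.append(None)
    return results


if __name__ == "__main__":
    # The middle task sleeps longer than the timeout.
    print(run_with_timeout([0.0, 0.5, 0.0]))
```

Leaving the with block still waits for the slow thread to finish, so this pattern bounds how long each result is awaited, not how long the pool takes to shut down.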

How to Programmatically Log in to a Website

Question: I don't know how to programmatically log in to this site. I've searched through Stack Overflow and found this, but I still don't know what to put into URL or URI. Answer 1: When I just type in username 'abc' and password 'def' and hit the button, I get the following POST data: next=apps%2Flinks%2F&why=pw&email=abc&password=def&fw_human= So that leads me to believe that if you just use that POST data and replace it with the appropriate information, you can simulate a manual login. So from the stack overflow
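
As a rough illustration of the answer's point, the captured POST body can be rebuilt from its individual fields with Python's standard library. The field names and values come from the POST data quoted above; build_login_body is a hypothetical helper name, and the login URL is left out because the excerpt does not give it.

```python
from urllib.parse import urlencode


def build_login_body(email, password):
    """Rebuild the form-encoded body seen in the captured login request."""
    fields = {
        "next": "apps/links/",  # appears as apps%2Flinks%2F once encoded
        "why": "pw",
        "email": email,
        "password": password,
        "fw_human": "",         # sent empty in the captured request
    }
    # urlencode percent-encodes each value and joins the pairs with "&".
    return urlencode(fields)


if __name__ == "__main__":
    print(build_login_body("abc", "def"))
```

The resulting string would then be sent as the body of a POST request to the site's login URL, with a cookie-aware client so that the session cookie from the response is kept for later requests.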

Get Mechanize to handle cookies from an arbitrary POST (to log into a website programmatically)

I want to log into https://www.t-mobile.com/ programmatically. My first idea was to use Mechanize to submit the login form. However, it turns out that this isn't even a real form. Instead, when you click "Log in", some JavaScript grabs the values of the fields, creates a new form dynamically, and submits it. "Log in" button HTML: <button onclick="handleLogin(); return false;" class="btnBlue" id="myTMobile-login"><span>Log in</span></button> The handleLogin() function: function handleLogin() { if (ValidateMsisdnPassword())

Jsoup posting modified Document

I'm trying to create a web scraper for my upcoming Android app. For that, I need to use a simple search form on a website, fill it out, and send my input back to the server. As described in the Jsoup-Cookbook, I fetched the page I needed from the server and changed the values. Now I just need to post my modified document back to the server and scrape the resulting page. As far as I've seen in the Jsoup API, there is no way to post something back except with the .data-Attribute in Jsoup.connection, which is unfortunately not able to fill out text fields by their id. Any ideas or workarounds,

Android/Java: Simulate a click on this webpage

Question: Last year I made an Android application that scraped information from my train company in Belgium (the application is BETrains: http://www.cyrket.com/p/android/tof.cv.mpp/). This application was really cool and allowed users to talk with other people on the train (I run a messaging server), and the conversations were also on Twitter: http://twitter.com/betrains Everybody in Belgium loved it. The company tried to stop us from using their data and had some users' websites shut down, but their

Scrape and generate RSS feed

Question: I use Simple HTML DOM to scrape a page for the latest news, and then generate an RSS feed using this PHP class. This is what I have now: <?php // This is a minimum example of using the class include("FeedWriter.php"); include('simple_html_dom.php'); $html = file_get_html('http://www.website.com'); foreach($html->find('td[width="380"] p table') as $article) { $item['title'] = $article->find('span.title', 0)->innertext; $item['description'] = $article->find('.ingress', 0)->innertext; $item['link']
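
The second half of that task, turning scraped title/description/link values into a feed, can be sketched independently of the PHP FeedWriter class. The following is a minimal Python version using only the standard library; it is not the asker's code, and build_rss, the channel values and the sample item are all made up for the illustration.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, items):
    """Return an RSS 2.0 document as a string.

    `items` is a list of dicts with "title", "description" and "link"
    keys, mirroring the $item fields filled in by the scraper.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    # Assumption: the channel title doubles as the channel description.
    ET.SubElement(channel, "description").text = channel_title
    for item in items:
        node = ET.SubElement(channel, "item")
        for key in ("title", "description", "link"):
            # ElementTree escapes &, < and > in text when serializing.
            ET.SubElement(node, key).text = item[key]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    sample = [{
        "title": "Example headline",
        "description": "Short summary & teaser",
        "link": "http://www.website.com/news/1",
    }]
    print(build_rss("Latest news", "http://www.website.com", sample))
```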