screen-scraping

Extracting/Scraping text from a href inside p inside div

杀马特。学长 韩版系。学妹 submitted on 2019-12-23 04:22:05
Question: I am using Beautiful Soup (bs4) and Python. I currently have this structure: <div class="class1"> <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a> <p class="specialties"><a href="/location/abcd">ab cd</a></p> <p class="doc-clinic-name"> <a class="light_grey link" href="/clinic/fff">f ff</a> </p> </div> <div class="class2"> <p class="locality"> <a class="link grey" href="/location/doctors/ccc">c cc</a> </p> <p class="fees">INR 999</p> <div class="timings"> <p><span class=
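A minimal Beautiful Soup sketch for pulling the anchor text out of this kind of structure; the class names are taken from the excerpt above, but the surrounding page and exact markup are assumptions:

```python
from bs4 import BeautifulSoup  # bs4

html = """
<div class="class1">
  <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
  <p class="specialties"><a href="/location/abcd">ab cd</a></p>
  <p class="doc-clinic-name"><a class="light_grey link" href="/clinic/fff">f ff</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div", class_="class1"):
    name = div.find("a", class_="name").get_text(strip=True)            # "Dr. XX XXXX"
    specialty = div.select_one("p.specialties a").get_text(strip=True)  # "ab cd"
    clinic = div.select_one("p.doc-clinic-name a").get_text(strip=True) # "f ff"
    print(name, specialty, clinic)
```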

scraping/simulate browsing help

百般思念 submitted on 2019-12-23 03:21:35
Question: I want to make a program that will simulate a user browsing a site and clicking on links. Cookies and JavaScript have to be enabled. I've successfully done this in Python, but I want to write it in a compilable language (Python IDEs don't cut it). The links on the site are generated with JavaScript and are dynamic. With Python I used PAMIE (a third-party module that uses win32com) to launch an instance of Internet Explorer, scrape the generated HTML for the links, then navigate to one of them.
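For context, what PAMIE does under the hood is drive Internet Explorer over COM. A rough sketch of that approach in plain pywin32 (Windows only; the URL is a placeholder, and the details are an approximation of the IE automation interface rather than PAMIE's own API):

```python
import time
import win32com.client  # pywin32

ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = True
ie.Navigate("http://example.com")           # placeholder URL

# Wait for the page (and its JavaScript) to finish loading.
while ie.Busy or ie.ReadyState != 4:        # 4 == READYSTATE_COMPLETE
    time.sleep(0.5)

# The live DOM now reflects the JS-generated links.
links = [a.href for a in ie.Document.getElementsByTagName("a")]
print(links)
```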

Download web page with images and stylesheets and (optionally) E-mailing it

时光总嘲笑我的痴心妄想 submitted on 2019-12-22 13:41:31
Question: I need to make snapshots of web pages programmatically using PHP and get them into an HTML e-mail. I tried wget --page-requisites. It downloads everything all right, but it doesn't change the HTML page's source code to point to the downloaded files rather than the online originals. Also, that HTML is of course a long way from being displayed properly in an HTML e-mail. I am interested to know whether there are ready-made solutions for this. I would already be happy with a solution that takes
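For the link-rewriting part specifically, wget's --convert-links flag (alongside --page-requisites) rewrites the saved HTML to reference the downloaded local copies. A minimal sketch invoking it from Python via subprocess (the PHP side and the e-mail packaging are out of scope here; the URL and output directory are placeholders):

```python
import subprocess

url = "http://example.com/page"   # placeholder
subprocess.check_call([
    "wget",
    "--page-requisites",    # also fetch images, CSS, scripts, etc.
    "--convert-links",      # rewrite the saved HTML to point at the local copies
    "--adjust-extension",   # save with .html/.css extensions where needed
    "--directory-prefix", "snapshot",
    url,
])
```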

When web scraping with Node.js, can I run all JavaScripts on the page? (i.e., simulate a real browser?)

谁说胖子不能爱 submitted on 2019-12-22 05:59:25
Question: I'm trying to do some web scraping with Node.js. Using jsdom, it is easy to load up the DOM and inject JavaScript into it. I want to go one step further: run all the JavaScript linked to from the web page and then inspect the resulting DOM, including visual properties (height, width, etc.) of elements. Thus far, I get NaN when I try to inspect the dimensions of DOM elements with jsdom. Is this possible? It strikes me that there are two distinct challenges: running all the JS on the web page

How to read a pixel off of the screen?

て烟熏妆下的殇ゞ submitted on 2019-12-22 05:34:11
Question: I am trying to make a simple bot for a web game, so I would like to be able to read the color of a pixel on the screen. I've done this on Windows with GetPixel(), but I can't seem to figure it out on OS X. I've been looking online and came across glReadPixels. When I made a simple command-line tool in Xcode, I put in the following code; however, I cannot seem to make it work. I keep getting an EXC_BAD_ACCESS error from this: GLfloat r; glReadPixels(0, 0, 1, 1, GL_RED, GL_FLOAT, &r); I thought the
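As an aside, if the goal is simply "read a screen pixel for a bot" rather than glReadPixels specifically, one cross-platform route from Python is Pillow's ImageGrab. This is a sketch of that alternative, not an answer to the OpenGL question; the coordinates are placeholders:

```python
from PIL import ImageGrab  # Pillow; grabs the screen on macOS and Windows

x, y = 100, 200                                          # screen coordinates (placeholders)
shot = ImageGrab.grab(bbox=(x, y, x + 1, y + 1))         # capture a 1x1 region
r, g, b = shot.getpixel((0, 0))[:3]                      # ignore alpha if present
print(r, g, b)
```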

Download html of URL with Python - but with javascript enabled

限于喜欢 submitted on 2019-12-22 01:31:34
Question: I am trying to download this page so that I can scrape the search results. However, when I download the page and try to process it with BeautifulSoup, I find that parts of the page (for example, the search results) aren't included, as the site has detected that JavaScript is not enabled. Is there a way to download the HTML of a URL with JavaScript enabled in Python? Answer 1: @kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one already written. There
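For reference, the "full browser with QtWebKit and PyQt4" route the answer mentions (and recommends not building yourself) is usually a small render class along these lines. A sketch assuming PyQt4 with the QtWebKit module is installed; the URL is a placeholder:

```python
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL in a headless QtWebKit page and keep the post-JavaScript HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()                        # blocks until _load_finished calls quit()

    def _load_finished(self, ok):
        self.html = self.mainFrame().toHtml()   # DOM serialized after scripts have run
        self.app.quit()

html = Render("http://example.com/search?q=test").html   # placeholder URL
```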

downloading morningstar webpages for screenscraping

限于喜欢 submitted on 2019-12-21 23:28:42
Question: I'd like to be able to screen-scrape Morningstar web pages. Morningstar provides information about a mutual fund that I routinely look up but haven't been able to find elsewhere, i.e. total return compared against benchmark, total return compared against peers, and percentile ranking. Here's an example: morningstar example. As a prelude to screen scraping, I need to be able to download the web page with the desired content. Unfortunately, when I try using Java SE 6 or wget to retrieve the above example link
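The download step on its own is just an HTTP GET; one common first thing to try when wget or a plain Java URLConnection comes back empty is sending browser-like headers. A sketch in Python (the URL is a placeholder for the example link above, and whether headers are actually what blocks the request here is an assumption):

```python
import urllib.request

url = "http://example.com/fund-page"   # placeholder: substitute the Morningstar page URL
req = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0"},  # some sites refuse requests from default client UAs
)
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")
print(len(html), "characters downloaded")
```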

Raw HTML vs. DOM scraping in python using mechanize and beautiful soup

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-21 20:29:25
Question: I am attempting to write a program that, as an example, will scrape the top price off of this web page: http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults First, I am easily able to retrieve the HTML by doing the following: from urllib import urlopen from BeautifulSoup import BeautifulSoup import mechanize webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults' br = mechanize.Browser() data = br.open(webpage).get_data() soup = BeautifulSoup(data)
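The retrieval step from the excerpt, laid out as a runnable sketch (BeautifulSoup 3 and mechanize imports as in the original; the unused urlopen import is dropped):

```python
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the excerpt
import mechanize

webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'

br = mechanize.Browser()
data = br.open(webpage).get_data()   # raw HTML as served, before any JavaScript runs
soup = BeautifulSoup(data)

# Note: the price table on this page is typically filled in by JavaScript after load,
# so it may not be present anywhere in this raw HTML.
```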

Screen scraping: regular expressions or XQuery expressions?

与世无争的帅哥 submitted on 2019-12-21 13:16:03
Question: I was answering some quiz questions for an interview, and one question was about how I would do screen scraping: that is, picking content out of a web page, assuming you don't have a better-structured way to query the information directly (e.g. a web service). My solution was to use an XQuery expression. The expression was fairly long because the content I needed was pretty deep in the HTML hierarchy; I had to search up through the ancestors a fair way before I found an element with an id
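The same "anchor on a nearby id instead of spelling out the whole path" idea, sketched in Python with lxml and XPath rather than XQuery (the filename, id, and class name are hypothetical):

```python
from lxml import html

# Parse a saved copy of the page (hypothetical filename).
tree = html.parse("page.html")

# Anchor on an element with a known id and drill down from there,
# rather than writing out the full path from the document root.
prices = tree.xpath('//*[@id="results"]//td[@class="price"]/text()')
print(prices)
```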