screen-scraping

Extracting/Scraping text from a href inside p inside div

杀马特。学长 韩版系。学妹 submitted on 2019-12-23 04:22:05
Question: I am using Beautiful Soup (bs4) and Python. I currently have this structure: <div class="class1"> <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a> <p class="specialties"><a href="/location/abcd">ab cd</a></p> <p class="doc-clinic-name"> <a class="light_grey link" href="/clinic/fff">f ff</a> </p> </div> <div class="class2"> <p class="locality"> <a class="link grey" href="/location/doctors/ccc">c cc</a> </p> <p class="fees">INR 999</p> <div class="timings"> <p><span class=
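A minimal Beautiful Soup sketch for pulling the anchor text out of this kind of structure; the class names are taken from the excerpt above, but the surrounding page and exact markup are assumptions:

```python
from bs4 import BeautifulSoup  # bs4

html = """
<div class="class1">
  <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
  <p class="specialties"><a href="/location/abcd">ab cd</a></p>
  <p class="doc-clinic-name"><a class="light_grey link" href="/clinic/fff">f ff</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div", class_="class1"):
    name = div.find("a", class_="name").get_text(strip=True)            # "Dr. XX XXXX"
    specialty = div.select_one("p.specialties a").get_text(strip=True)  # "ab cd"
    clinic = div.select_one("p.doc-clinic-name a").get_text(strip=True) # "f ff"
    print(name, specialty, clinic)
```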

scraping/simulate browsing help

百般思念 submitted on 2019-12-23 03:21:35
Question: I want to make a program that will simulate a user browsing a site and clicking on links. Cookies and JavaScript have to be enabled. I've successfully done this in Python, but I want to write it in a compilable language (Python IDEs don't cut it). The links on the site are generated with JavaScript and are dynamic. With Python I used PAMIE (a third-party module that uses win32com) to launch an instance of Internet Explorer, scrape the generated HTML for the links, then navigate to one of them.
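For context, what PAMIE does under the hood is drive Internet Explorer over COM. A rough sketch of that approach in plain pywin32 (Windows only; the URL is a placeholder, and the details are an approximation of the IE automation interface rather than PAMIE's own API):

```python
import time
import win32com.client  # pywin32

ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = True
ie.Navigate("http://example.com")           # placeholder URL

# Wait for the page (and its JavaScript) to finish loading.
while ie.Busy or ie.ReadyState != 4:        # 4 == READYSTATE_COMPLETE
    time.sleep(0.5)

# The live DOM now reflects the JS-generated links.
links = [a.href for a in ie.Document.getElementsByTagName("a")]
print(links)
```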

Download web page with images and stylesheets and (optionally) E-mailing it

时光总嘲笑我的痴心妄想 submitted on 2019-12-22 13:41:31
Question: I need to make snapshots of web pages programmatically using PHP and get them into an HTML e-mail. I tried wget --page-requisites. It downloads everything all right, but it doesn't change the HTML page's source code to point to the downloaded files rather than the online originals. Also, that HTML is of course a long way from being displayed properly in an HTML e-mail. I am interested to know whether there are ready-made solutions for this. I would already be happy with a solution that takes
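For the link-rewriting part specifically, wget's --convert-links flag (alongside --page-requisites) rewrites the saved HTML to reference the downloaded local copies. A minimal sketch invoking it from Python via subprocess (the PHP side and the e-mail packaging are out of scope here; the URL and output directory are placeholders):

```python
import subprocess

url = "http://example.com/page"   # placeholder
subprocess.check_call([
    "wget",
    "--page-requisites",    # also fetch images, CSS, scripts, etc.
    "--convert-links",      # rewrite the saved HTML to point at the local copies
    "--adjust-extension",   # save with .html/.css extensions where needed
    "--directory-prefix", "snapshot",
    url,
])
```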

When web scraping with Node.js, can I run all JavaScripts on the page? (i.e., simulate a real browser?)

谁说胖子不能爱 submitted on 2019-12-22 05:59:25
Question: I'm trying to do some web scraping with Node.js. Using jsdom, it is easy to load up the DOM and inject JavaScript into it. I want to go one step further: run all the JavaScript linked to from the web page and then inspect the resulting DOM, including visual properties (height, width, etc.) of elements. Thus far, I get NaN when I try to inspect the dimensions of DOM elements with jsdom. Is this possible? It strikes me that there are two distinct challenges: running all the JS on the web page

How to read a pixel off of the screen?

て烟熏妆下的殇ゞ submitted on 2019-12-22 05:34:11
Question: I am trying to make a simple bot for a web game, so I would like to be able to read the color of a pixel on the screen. I've done this on Windows with GetPixel(), but I can't seem to figure it out on OS X. I've been looking online and came across glReadPixels. When I made a simple command-line tool in Xcode, I put in the following code; however, I cannot seem to make it work. I keep getting an EXC_BAD_ACCESS error from this: GLfloat r; glReadPixels(0, 0, 1, 1, GL_RED, GL_FLOAT, &r); I thought the
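As an aside, if the goal is simply "read a screen pixel for a bot" rather than glReadPixels specifically, one cross-platform route from Python is Pillow's ImageGrab. This is a sketch of that alternative, not an answer to the OpenGL question; the coordinates are placeholders:

```python
from PIL import ImageGrab  # Pillow; grabs the screen on macOS and Windows

x, y = 100, 200                                          # screen coordinates (placeholders)
shot = ImageGrab.grab(bbox=(x, y, x + 1, y + 1))         # capture a 1x1 region
r, g, b = shot.getpixel((0, 0))[:3]                      # ignore alpha if present
print(r, g, b)
```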

Download html of URL with Python - but with javascript enabled

限于喜欢 submitted on 2019-12-22 01:31:34
Question: I am trying to download this page so that I can scrape the search results. However, when I download the page and try to process it with BeautifulSoup, I find that parts of the page (for example, the search results) aren't included, as the site has detected that JavaScript is not enabled. Is there a way to download the HTML of a URL with JavaScript enabled in Python? Answer 1: @kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one already written. There
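For reference, the "full browser with QtWebKit and PyQt4" route the answer mentions (and recommends not building yourself) is usually a small render class along these lines. A sketch assuming PyQt4 with the QtWebKit module is installed; the URL is a placeholder:

```python
import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL in a headless QtWebKit page and keep the post-JavaScript HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()                        # blocks until _load_finished calls quit()

    def _load_finished(self, ok):
        self.html = self.mainFrame().toHtml()   # DOM serialized after scripts have run
        self.app.quit()

html = Render("http://example.com/search?q=test").html   # placeholder URL
```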

downloading morningstar webpages for screenscraping

限于喜欢 submitted on 2019-12-21 23:28:42
Question: I'd like to be able to screen-scrape Morningstar web pages. Morningstar provides information about a mutual fund that I routinely look up but haven't been able to find elsewhere, i.e. total return compared against benchmark, total return compared against peers, and percentile ranking. Here's an example: morningstar example. As a prelude to screen scraping, I need to be able to download the web page with the desired content. Unfortunately, when I try using Java SE 6 or wget to retrieve the above example link
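The download step on its own is just an HTTP GET; one common first thing to try when wget or a plain Java URLConnection comes back empty is sending browser-like headers. A sketch in Python (the URL is a placeholder for the example link above, and whether headers are actually what blocks the request here is an assumption):

```python
import urllib.request

url = "http://example.com/fund-page"   # placeholder: substitute the Morningstar page URL
req = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0"},  # some sites refuse requests from default client UAs
)
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")
print(len(html), "characters downloaded")
```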

Raw HTML vs. DOM scraping in python using mechanize and beautiful soup

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-21 20:29:25
Question: I am attempting to write a program that, as an example, will scrape the top price off of this web page: http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults First, I am easily able to retrieve the HTML by doing the following: from urllib import urlopen from BeautifulSoup import BeautifulSoup import mechanize webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults' br = mechanize.Browser() data = br.open(webpage).get_data() soup = BeautifulSoup(data)
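The retrieval step from the excerpt, laid out as a runnable sketch (BeautifulSoup 3 and mechanize imports as in the original; the unused urlopen import is dropped):

```python
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the excerpt
import mechanize

webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'

br = mechanize.Browser()
data = br.open(webpage).get_data()   # raw HTML as served, before any JavaScript runs
soup = BeautifulSoup(data)

# Note: the price table on this page is typically filled in by JavaScript after load,
# so it may not be present anywhere in this raw HTML.
```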

Screen scraping: regular expressions or XQuery expressions?

与世无争的帅哥 submitted on 2019-12-21 13:16:03
Question: I was answering some quiz questions for an interview, and one question was about how I would do screen scraping: that is, picking content out of a web page, assuming you don't have a better-structured way to query the information directly (e.g. a web service). My solution was to use an XQuery expression. The expression was fairly long because the content I needed was pretty deep in the HTML hierarchy; I had to search up through the ancestors a fair way before I found an element with an id
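The same "anchor on a nearby id instead of spelling out the whole path" idea, sketched in Python with lxml and XPath rather than XQuery (the filename, id, and class name are hypothetical):

```python
from lxml import html

# Parse a saved copy of the page (hypothetical filename).
tree = html.parse("page.html")

# Anchor on an element with a known id and drill down from there,
# rather than writing out the full path from the document root.
prices = tree.xpath('//*[@id="results"]//td[@class="price"]/text()')
print(prices)
```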