screen-scraping

Selenium: Not able to understand XPath

北战南征 submitted on 2019-12-04 22:17:51
I have some HTML like this:

    <h4 class="box_header clearfix">
      <span>
        <a rel="dialog" href="http://www.google.com/?q=word">Search</a>
      </span>
      <small>
        <span>
          <a rel="dialog" href="http://www.google.com/?q=word">Search</a>
        </span>
    </h4>

I am trying to get the href here in Java using Selenium. I have tried the following:

    selenium.getText("xpath=/descendant::h4[@class='box_header clearfix']/");
    selenium.getAttribute("xpath=/descendant::h4[@class='box_header clearfix']/");

But neither of these works. It keeps complaining that my XPath is invalid. Can someone tell me what mistake I am making? You should
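One likely culprit is the trailing slash at the end of each locator: an XPath expression cannot end with a bare /, so Selenium rejects it. A minimal sketch of the idea, written against Selenium WebDriver's Python bindings rather than the asker's Selenium RC Java API, with a placeholder URL:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://example.com/page-with-the-h4")  # placeholder URL

    # No trailing slash, and target the <a> directly instead of the <h4>.
    link = driver.find_element(
        By.XPATH, "//h4[@class='box_header clearfix']//a[@rel='dialog']")
    print(link.get_attribute("href"))

    driver.quit()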

How to capture part of a screen

孤街醉人 submitted on 2019-12-04 22:08:53
Question: I am using the Win32 PrintWindow function to capture a screen to a Bitmap object. If I only want to capture a region of the window, how can I crop the image in memory? Here is the code I'm using to capture the entire window:

    [System.Runtime.InteropServices.DllImport(strUSER32DLL, CharSet = CharSet.Auto, SetLastError = true)]
    public static extern int PrintWindow(IntPtr hWnd, IntPtr hBltDC, uint iFlags);

    public enum enPrintWindowFlags : uint
    {
        /// <summary>
        ///
        /// </summary>
        PW_ALL =
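The question is about C#/Win32, but the underlying idea of cropping a captured image in memory can be sketched in Python with Pillow (a different stack from the PrintWindow code above; the capture call and the region coordinates are placeholders):

    from PIL import ImageGrab  # Pillow; ImageGrab works on Windows and macOS

    # Capture the whole screen, then crop a sub-region entirely in memory.
    full = ImageGrab.grab()
    left, top, width, height = 100, 100, 400, 300   # placeholder region
    region = full.crop((left, top, left + width, top + height))
    region.save("region.png")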

Download HTML of URL with Python - but with JavaScript enabled

天涯浪子 submitted on 2019-12-04 21:58:46
I am trying to download this page so that I can scrape the search results. However, when I download the page and try to process it with BeautifulSoup, I find that parts of the page (for example, the search results) aren't included, because the site has detected that JavaScript is not enabled. Is there a way to download the HTML of a URL with JavaScript enabled in Python? @kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one already written. There's the PhantomJS (C++) project, or PyPhantomJS (Python). Basically, the Python one is QtWebKit driven from Python.
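A common way to get JavaScript-rendered HTML without writing a browser yourself is to drive a real one and read back the rendered source; a minimal sketch with Selenium and headless Chrome (an alternative to the QtWebKit approach described above, with a placeholder URL):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")             # render without a visible window
    driver = webdriver.Chrome(options=options)

    driver.get("http://example.com/search?q=foo")  # placeholder URL
    html = driver.page_source                      # HTML after JavaScript has run
    driver.quit()

    # html can now be handed to BeautifulSoup for the actual scraping.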

BeautifulSoup get value in table

北城以北 submitted on 2019-12-04 19:17:24
I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)". What I have works, but it is really ugly and surely not the best approach, so I am looking for a better way. Here is what I have:

    soup = BeautifulSoup(url_opener.open(url))
    x = soup('table', text = re.compile("Owner Name"))
    print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

The relevant HTML is:

    <td valign="top">
      <table border="1" cellpadding="1" cellspacing="0" align="right">
        <tbody><tr class="tableheaders">
          <td>Owner Name(s)</td>
        </tr>
        <tr>
          <td
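A somewhat less fragile approach is to find the header cell by its text and then step to the data row explicitly; a sketch using BeautifulSoup 4 syntax (the question's code is BeautifulSoup 3), assuming the table layout shown above:

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")   # html fetched elsewhere

    # Locate the <td> labelled "Owner Name(s)", then take the first cell
    # of the following table row.
    header = soup.find("td", string=re.compile("Owner Name"))
    owner = header.find_parent("tr").find_next_sibling("tr").find("td")
    print("And the owner is", owner.get_text(strip=True))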

Android/Java: Simulate a click on this webpage

血红的双手。 submitted on 2019-12-04 19:07:55
Last year I made an Android application that scraped the information from my train company in Belgium (the application is BETrains: http://www.cyrket.com/p/android/tof.cv.mpp/ ). This application was really cool and allowed users to talk with other people on the train (a messaging server is run by me), and the conversations were also on Twitter: http://twitter.com/betrains Everybody in Belgium loved it. The company tried to prevent us from using their data and had some users' websites shut down, but some lawyers took action against the company, and in the end we have no more problems and the websites are open

Excluding unwanted results of findAll using BeautifulSoup

旧街凉风 submitted on 2019-12-04 18:47:18
Question: Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:

    <p class="review_comment">

So, using simple code as follows,

    content = page.read()
    soup = BeautifulSoup(content)
    results = soup.find_all("p", "review_comment")

I am happily parsing the text that is living here:

    <p class="review_comment"> This place is terrible!</p>

The bad news is that every 30 or so times the soup.find_all gets a match, it also matches and grabs something that I really don't want, which
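The excerpt cuts off before describing what the unwanted matches look like, but assuming they can be recognised by something in their own markup (here a purely hypothetical extra "hidden" class), a function matcher passed to find_all can exclude them up front:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(content, "html.parser")   # content read elsewhere

    def wanted(tag):
        # Keep <p class="review_comment"> but drop a hypothetical unwanted
        # variant that also carries a "hidden" class.
        classes = tag.get("class", [])
        return tag.name == "p" and "review_comment" in classes and "hidden" not in classes

    results = soup.find_all(wanted)
    for p in results:
        print(p.get_text(strip=True))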

How do I scrape data from a page that loads specific data after the main page load?

只谈情不闲聊 submitted on 2019-12-04 17:23:35
I have been using Ruby and Nokogiri to pull data from a URL similar to this one from the hollister website: http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358 My script looks like this right now:

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'

    page = Nokogiri::HTML(open("http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358"))
    puts page.css("h3[data-property=GLB_ORDERNUMBERSYMBOL]")[0].text

My problem
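If the order details are filled in by a later XHR request, open-uri only ever sees the initial HTML, so Nokogiri has nothing to select; the usual options are to drive a real browser or to call the data endpoint directly. A rough Python sketch of the second option (the endpoint URL, parameters, and response shape are hypothetical, the kind of thing you would discover in the browser's network tab):

    import requests

    # Hypothetical endpoint spotted in the browser's network tab; the real
    # site may use a different URL, parameters, or response format.
    resp = requests.get(
        "http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetailData",
        params={"orderNumber": "1316358", "storeId": "10251"},
    )
    data = resp.json()
    print(data.get("orderNumber"))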

How to select some URLs with BeautifulSoup?

只谈情不闲聊 submitted on 2019-12-04 16:29:52
I want to scrape the following information, except the last row and the class="Region" row: ...

    <td>7</td>
    <td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
    <td bgcolor="" align="left">New York</td>
    <td bgcolor="" align="left" class="Region">N/A</td>
    <td bgcolor="" align="left">1,863</td>
    <td bgcolor="" align="left">565</td>
    <td bgcolor="" align="left">1,133</td>
    <td bgcolor="" align="left">$160,000</td>
    <td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td></tr><tr class=
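One way is to walk the rows, drop the last one by slicing, and skip any cell that carries class="Region"; a sketch assuming the markup above and that the unwanted final row really is the last <tr> in the table:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")   # html fetched elsewhere

    rows = soup.find_all("tr")[:-1]             # drop the last row
    for row in rows:
        cells = [td for td in row.find_all("td")
                 if "Region" not in td.get("class", [])]
        print([td.get_text(strip=True) for td in cells])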

Following a link using Nokogiri for scraping

人盡茶涼 submitted on 2019-12-04 16:09:15
Is there a method to follow a link using Nokogiri for scraping? I know I can extract the href and open it, but I thought I saw a method to do this using Hpricot and was wondering if there was something like that in Nokogiri. dbyrne: Here is an excellent screen scraping guide for using Ruby, Nokogiri, Hpricot, and Firebug. Personally I am a big fan of using Mechanize, which is a headless browser, for screen scraping. You can use Mechanize to navigate links and fill out forms, and it will handle all the tricky stuff like cookies. Source: https://stackoverflow.com/questions/2807500/following-a-link
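For comparison, the same pattern exists in Python; a minimal sketch with the Python mechanize library (the URL and link text are placeholders) that follows a link while keeping cookies across requests:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)             # skip robots.txt for this sketch
    br.open("http://example.com/start")     # placeholder URL

    # Follow the first link whose text matches; cookies and relative URLs
    # are handled for you, much like the Ruby library mentioned above.
    resp = br.follow_link(text_regex=r"Next")
    print(resp.read()[:200])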

How to post ASP.NET login form using PHP/cURL?

六月ゝ 毕业季﹏ submitted on 2019-12-04 16:05:38
I need to create a tool that will post an ASP.NET login form using PHP so that I can gather details from the user's summary page that is displayed after they are logged in. Because the site uses ASP.NET and the form has __VIEWSTATE and __EVENTVALIDATION hidden fields, as I understand it, I must get those values first, then submit them in the POST to the login form for this to work. I am new to PHP. The script that I have created should do the following:

1) GET the login form and grab __VIEWSTATE and __EVENTVALIDATION
2) POST to the login form with appropriate post data.
3) GET the summary.htm
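The three steps in the question look roughly like this in Python with requests and BeautifulSoup (the PHP/cURL version follows the same flow; the URLs, form field names, and credentials below are placeholders, and only __VIEWSTATE and __EVENTVALIDATION come from the question):

    import requests
    from bs4 import BeautifulSoup

    LOGIN_URL = "https://example.com/login.aspx"       # placeholder URLs
    SUMMARY_URL = "https://example.com/summary.aspx"

    session = requests.Session()

    # 1) GET the login form and grab the ASP.NET hidden fields.
    soup = BeautifulSoup(session.get(LOGIN_URL).text, "html.parser")
    payload = {
        "__VIEWSTATE": soup.find("input", {"name": "__VIEWSTATE"})["value"],
        "__EVENTVALIDATION": soup.find("input", {"name": "__EVENTVALIDATION"})["value"],
        "txtUsername": "user",                         # placeholder field names/credentials
        "txtPassword": "secret",
        "btnLogin": "Log In",
    }

    # 2) POST the form; the Session object keeps the auth cookie.
    session.post(LOGIN_URL, data=payload)

    # 3) GET the summary page now that the session is logged in.
    print(session.get(SUMMARY_URL).text[:500])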