screen-scraping

Selenium: Not able to understand XPath

北战南征 submitted on 2019-12-04 22:17:51
I have some HTML like this:

    <h4 class="box_header clearfix">
      <span>
        <a rel="dialog" href="http://www.google.com/?q=word">Search</a>
      </span>
      <small>
        <span>
          <a rel="dialog" href="http://www.google.com/?q=word">Search</a>
        </span>
    </h4>

I am trying to get the href here in Java using Selenium. I have tried the following:

    selenium.getText("xpath=/descendant::h4[@class='box_header clearfix']/");
    selenium.getAttribute("xpath=/descendant::h4[@class='box_header clearfix']/");

But neither of these works. It keeps complaining that my XPath is invalid. Can someone tell me what mistake I am making? You should
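One likely culprit is the trailing slash at the end of each locator: an XPath expression cannot end with a bare /, so Selenium rejects it. A minimal sketch of the idea, written against Selenium WebDriver's Python bindings rather than the asker's Selenium RC Java API, with a placeholder URL:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://example.com/page-with-the-h4")  # placeholder URL

    # No trailing slash, and target the <a> directly instead of the <h4>.
    link = driver.find_element(
        By.XPATH, "//h4[@class='box_header clearfix']//a[@rel='dialog']")
    print(link.get_attribute("href"))

    driver.quit()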

How to capture part of a screen

孤街醉人 submitted on 2019-12-04 22:08:53
Question: I am using the Win32 PrintWindow function to capture a screen to a Bitmap object. If I only want to capture a region of the window, how can I crop the image in memory? Here is the code I'm using to capture the entire window:

    [System.Runtime.InteropServices.DllImport(strUSER32DLL, CharSet = CharSet.Auto, SetLastError = true)]
    public static extern int PrintWindow(IntPtr hWnd, IntPtr hBltDC, uint iFlags);

    public enum enPrintWindowFlags : uint
    {
        /// <summary>
        ///
        /// </summary>
        PW_ALL =
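The question is about C#/Win32, but the underlying idea of cropping a captured image in memory can be sketched in Python with Pillow (a different stack from the PrintWindow code above; the capture call and the region coordinates are placeholders):

    from PIL import ImageGrab  # Pillow; ImageGrab works on Windows and macOS

    # Capture the whole screen, then crop a sub-region entirely in memory.
    full = ImageGrab.grab()
    left, top, width, height = 100, 100, 400, 300   # placeholder region
    region = full.crop((left, top, left + width, top + height))
    region.save("region.png")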

Download HTML of URL with Python - but with JavaScript enabled

天涯浪子 submitted on 2019-12-04 21:58:46
I am trying to download this page so that I can scrape the search results. However, when I download the page and try to process it with BeautifulSoup, I find that parts of the page (for example, the search results) aren't included, because the site has detected that JavaScript is not enabled. Is there a way to download the HTML of a URL with JavaScript enabled in Python? @kstruct: My preferred way, instead of writing a full browser with QtWebKit and PyQt4, is to use one already written. There's the PhantomJS (C++) project, or PyPhantomJS (Python). Basically, the Python one is QtWebKit driven from Python.
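A common way to get JavaScript-rendered HTML without writing a browser yourself is to drive a real one and read back the rendered source; a minimal sketch with Selenium and headless Chrome (an alternative to the QtWebKit approach described above, with a placeholder URL):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")             # render without a visible window
    driver = webdriver.Chrome(options=options)

    driver.get("http://example.com/search?q=foo")  # placeholder URL
    html = driver.page_source                      # HTML after JavaScript has run
    driver.quit()

    # html can now be handed to BeautifulSoup for the actual scraping.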

BeautifulSoup get value in table

北城以北 submitted on 2019-12-04 19:17:24
I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)". What I have works, but it is really ugly and surely not the best approach, so I am looking for a better way. Here is what I have:

    soup = BeautifulSoup(url_opener.open(url))
    x = soup('table', text = re.compile("Owner Name"))
    print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

The relevant HTML is:

    <td valign="top">
      <table border="1" cellpadding="1" cellspacing="0" align="right">
        <tbody><tr class="tableheaders">
          <td>Owner Name(s)</td>
        </tr>
        <tr>
          <td
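A somewhat less fragile approach is to find the header cell by its text and then step to the data row explicitly; a sketch using BeautifulSoup 4 syntax (the question's code is BeautifulSoup 3), assuming the table layout shown above:

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")   # html fetched elsewhere

    # Locate the <td> labelled "Owner Name(s)", then take the first cell
    # of the following table row.
    header = soup.find("td", string=re.compile("Owner Name"))
    owner = header.find_parent("tr").find_next_sibling("tr").find("td")
    print("And the owner is", owner.get_text(strip=True))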

Android/Java: Simulate a click on this webpage

血红的双手。 submitted on 2019-12-04 19:07:55
Last year I made an Android application that scraped the information from my train company in Belgium (the application is BETrains: http://www.cyrket.com/p/android/tof.cv.mpp/ ). This application was really cool and allowed users to talk with other people on the train (a messaging server is run by me), and the conversations were also on Twitter: http://twitter.com/betrains Everybody in Belgium loved it. The company tried to prevent us from using their data and had some users' websites shut down, but some lawyers took action against the company, and in the end we have no more problems and the websites are open

Excluding unwanted results of findAll using BeautifulSoup

旧街凉风 submitted on 2019-12-04 18:47:18
Question: Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:

    <p class="review_comment">

So, using simple code as follows,

    content = page.read()
    soup = BeautifulSoup(content)
    results = soup.find_all("p", "review_comment")

I am happily parsing the text that is living here:

    <p class="review_comment"> This place is terrible!</p>

The bad news is that every 30 or so times the soup.find_all gets a match, it also matches and grabs something that I really don't want, which
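The excerpt cuts off before describing what the unwanted matches look like, but assuming they can be recognised by something in their own markup (here a purely hypothetical extra "hidden" class), a function matcher passed to find_all can exclude them up front:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(content, "html.parser")   # content read elsewhere

    def wanted(tag):
        # Keep <p class="review_comment"> but drop a hypothetical unwanted
        # variant that also carries a "hidden" class.
        classes = tag.get("class", [])
        return tag.name == "p" and "review_comment" in classes and "hidden" not in classes

    results = soup.find_all(wanted)
    for p in results:
        print(p.get_text(strip=True))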

How do I scrape data from a page that loads specific data after the main page load?

只谈情不闲聊 submitted on 2019-12-04 17:23:35
I have been using Ruby and Nokogiri to pull data from a URL similar to this one from the hollister website: http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358 My script looks like this right now:

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'

    page = Nokogiri::HTML(open("http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetail?storeId=10251&catalogId=10201&langId=-1&URL=TrackDetailView&orderNumber=1316358"))
    puts page.css("h3[data-property=GLB_ORDERNUMBERSYMBOL]")[0].text

My problem
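If the order details are filled in by a later XHR request, open-uri only ever sees the initial HTML, so Nokogiri has nothing to select; the usual options are to drive a real browser or to call the data endpoint directly. A rough Python sketch of the second option (the endpoint URL, parameters, and response shape are hypothetical, the kind of thing you would discover in the browser's network tab):

    import requests

    # Hypothetical endpoint spotted in the browser's network tab; the real
    # site may use a different URL, parameters, or response format.
    resp = requests.get(
        "http://www.hollisterco.com/webapp/wcs/stores/servlet/TrackDetailData",
        params={"orderNumber": "1316358", "storeId": "10251"},
    )
    data = resp.json()
    print(data.get("orderNumber"))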

How to select some URLs with BeautifulSoup?

只谈情不闲聊 submitted on 2019-12-04 16:29:52
I want to scrape the following information, except the last row and the class="Region" row: ...

    <td>7</td>
    <td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td>
    <td bgcolor="" align="left">New York</td>
    <td bgcolor="" align="left" class="Region">N/A</td>
    <td bgcolor="" align="left">1,863</td>
    <td bgcolor="" align="left">565</td>
    <td bgcolor="" align="left">1,133</td>
    <td bgcolor="" align="left">$160,000</td>
    <td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td></tr><tr class=
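One way is to walk the rows, drop the last one by slicing, and skip any cell that carries class="Region"; a sketch assuming the markup above and that the unwanted final row really is the last <tr> in the table:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")   # html fetched elsewhere

    rows = soup.find_all("tr")[:-1]             # drop the last row
    for row in rows:
        cells = [td for td in row.find_all("td")
                 if "Region" not in td.get("class", [])]
        print([td.get_text(strip=True) for td in cells])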

Following a link using Nokogiri for scraping

人盡茶涼 submitted on 2019-12-04 16:09:15
Is there a method to follow a link using Nokogiri for scraping? I know I can extract the href and open it, but I thought I saw a method to do this using Hpricot and was wondering if there was something like that in Nokogiri. dbyrne: Here is an excellent screen scraping guide for using Ruby, Nokogiri, Hpricot, and Firebug. Personally I am a big fan of using Mechanize, which is a headless browser, for screen scraping. You can use Mechanize to navigate links and fill out forms, and it will handle all the tricky stuff like cookies. Source: https://stackoverflow.com/questions/2807500/following-a-link
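For comparison, the same pattern exists in Python; a minimal sketch with the Python mechanize library (the URL and link text are placeholders) that follows a link while keeping cookies across requests:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)             # skip robots.txt for this sketch
    br.open("http://example.com/start")     # placeholder URL

    # Follow the first link whose text matches; cookies and relative URLs
    # are handled for you, much like the Ruby library mentioned above.
    resp = br.follow_link(text_regex=r"Next")
    print(resp.read()[:200])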

How to post ASP.NET login form using PHP/cURL?

六月ゝ 毕业季﹏ submitted on 2019-12-04 16:05:38
I need to create a tool that will post an ASP.NET login form using PHP so that I can gather details from the user's summary page that is displayed after they are logged in. Because the site uses ASP.NET and the form has __VIEWSTATE and __EVENTVALIDATION hidden fields, as I understand it, I must get those values first, then submit them in the POST to the login form for this to work. I am new to PHP. The script that I have created should do the following:

1) GET the login form and grab __VIEWSTATE and __EVENTVALIDATION
2) POST to the login form with appropriate post data.
3) GET the summary.htm
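The three steps in the question look roughly like this in Python with requests and BeautifulSoup (the PHP/cURL version follows the same flow; the URLs, form field names, and credentials below are placeholders, and only __VIEWSTATE and __EVENTVALIDATION come from the question):

    import requests
    from bs4 import BeautifulSoup

    LOGIN_URL = "https://example.com/login.aspx"       # placeholder URLs
    SUMMARY_URL = "https://example.com/summary.aspx"

    session = requests.Session()

    # 1) GET the login form and grab the ASP.NET hidden fields.
    soup = BeautifulSoup(session.get(LOGIN_URL).text, "html.parser")
    payload = {
        "__VIEWSTATE": soup.find("input", {"name": "__VIEWSTATE"})["value"],
        "__EVENTVALIDATION": soup.find("input", {"name": "__EVENTVALIDATION"})["value"],
        "txtUsername": "user",                         # placeholder field names/credentials
        "txtPassword": "secret",
        "btnLogin": "Log In",
    }

    # 2) POST the form; the Session object keeps the auth cookie.
    session.post(LOGIN_URL, data=payload)

    # 3) GET the summary page now that the session is logged in.
    print(session.get(SUMMARY_URL).text[:500])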