screen-scraping

Parsing dl with HtmlAgilityPack

痴心易碎 submitted on 2019-12-12 10:58:13
Question: This is the sample HTML I am trying to parse with Html Agility Pack in ASP.NET (C#):

    <div class="content-div">
      <dl>
        <dt><b><a href="1.html" title="1">1</a></b></dt>
        <dd> First Entry</dd>
        <dt><b><a href="2.html" title="2">2</a></b></dt>
        <dd> Second Entry</dd>
        <dt><b><a href="3.html" title="3">3</a></b></dt>
        <dd> Third Entry</dd>
      </dl>
    </div>

The values I want are: the hyperlink -> 1.html, the anchor text -> 1, and the inner text of the dd -> First Entry. (I have taken examples of the first entry here…
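The excerpt is cut off, but the crux is pairing each dt with the dd that follows it. The question is about HtmlAgilityPack in C#; as an illustration of the same sibling-walking idea, here is a minimal sketch in Python with BeautifulSoup, run against the sample HTML above:

    from bs4 import BeautifulSoup

    html = """<div class="content-div"><dl>
    <dt><b><a href="1.html" title="1">1</a></b></dt><dd> First Entry</dd>
    <dt><b><a href="2.html" title="2">2</a></b></dt><dd> Second Entry</dd>
    <dt><b><a href="3.html" title="3">3</a></b></dt><dd> Third Entry</dd>
    </dl></div>"""

    soup = BeautifulSoup(html, "html.parser")
    for dt in soup.select("div.content-div dl dt"):
        a = dt.find("a")                    # the anchor inside the <dt>
        dd = dt.find_next_sibling("dd")     # the <dd> paired with this <dt>
        print(a["href"], a.get_text(), dd.get_text(strip=True))

This prints "1.html 1 First Entry" and so on for each dt/dd pair.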

Beautifulsoup get value in table

99封情书 submitted on 2019-12-12 09:28:17
Question: I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)". What I have works, but it is really ugly and surely not the best way, so I am looking for a better one. Here is what I have:

    soup = BeautifulSoup(url_opener.open(url))
    x = soup('table', text = re.compile("Owner Name"))
    print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

The relevant HTML is:

    <td valign="top">
      <table border="1" cellpadding="1"…
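The long parent/nextSibling chain is brittle because it hard-codes every intermediate node, whitespace included. A sturdier pattern is to anchor on the label text and climb by tag name. A sketch in Python 3 with a current BeautifulSoup (only part of the page's markup is shown above, so the exact hops are an assumption; page_html stands in for the fetched page):

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page_html, "html.parser")
    label = soup.find(string=re.compile("Owner Name"))
    row = label.find_parent("tr")                      # row holding the label
    owner = row.find_next("tr").get_text(strip=True)   # text of the next row
    print("And the owner is", owner)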

Selenium: Not able to understand xPath

北慕城南 submitted on 2019-12-12 09:26:09
Question: I have some HTML like this:

    <h4 class="box_header clearfix">
      <span>
        <a rel="dialog" href="http://www.google.com/?q=word">Search</a>
      </span>
      <small>
        <span>
          <a rel="dialog" href="http://www.google.com/?q=word">Search</a>
        </span>
    </h4>

I am trying to get the href here in Java using Selenium. I have tried the following:

    selenium.getText("xpath=/descendant::h4[@class='box_header clearfix']/");
    selenium.getAttribute("xpath=/descendant::h4[@class='box_header clearfix']/");

But neither of these works. It…
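Two things stand out: both locators end in a bare /, which makes the XPath invalid, and getText returns the element's text rather than its href. The question uses the old Selenium RC Java API; a hedged sketch of the fix with today's Selenium WebDriver Python bindings (the markup is assumed to match the snippet above, and the URL is a placeholder):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://the-page-under-test.example")   # placeholder URL
    # Take the first anchor under the h4 and read its href attribute.
    link = driver.find_element(
        By.XPATH, "//h4[@class='box_header clearfix']//a[@rel='dialog']")
    print(link.get_attribute("href"))
    driver.quit()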

Using Ruby with Mechanize to log into a website

流过昼夜 submitted on 2019-12-12 08:09:42
Question: I need to scrape data from a site, but it requires logging in first. I've been using Hpricot to successfully scrape other sites, but I'm new to Mechanize and truly baffled by how to work it. I see this example commonly quoted:

    require 'rubygems'
    require 'mechanize'

    a = Mechanize.new
    a.get('http://rubyforge.org/') do |page|
      # Click the login link
      login_page = a.click(page.link_with(:text => /Log In/))

      # Submit the login form
      my_page = login_page.form_with(:action => '/account/login…
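The excerpt stops mid-form, but the flow is: fetch the page, locate the login form, fill in its fields, submit, and keep using the resulting session so the login cookies ride along. The same flow sketched in Python with requests (the URLs and form field names below are placeholders; read the real ones off the site's login form):

    import requests

    session = requests.Session()
    # Placeholder field names; inspect the login <form> for the real ones.
    resp = session.post("https://example.com/account/login",
                        data={"login": "me", "password": "secret"})
    resp.raise_for_status()
    # The session keeps the cookies, so later requests remain signed in.
    page = session.get("https://example.com/members/only/page")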

Scraping hidden HTML (when visible = false) using Hpricot (Ruby on Rails)

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-12 05:48:35
Question: I've come across an issue which unfortunately I can't seem to get past; I'm also brand new to Ruby on Rails, hence the number of questions. I am attempting to scrape a webpage such as the following: http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx I would like to scrape the addresses, the phone numbers, and the URL of the next page, which in this case is http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo+Ismol.aspx I've been trying…
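The excerpt cuts off before the attempted code, but the task splits into two parts: pull the fields out of the current page, then follow the next-page link and repeat. A hedged sketch of that loop in Python (every selector and link label here is a placeholder; the real class names have to be read off the live page):

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    url = ("http://www.yellowpages.com.mt/Malta/"
           "Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx")
    while url:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for entry in soup.select("div.listing"):      # placeholder selector
            print(entry.get_text(" ", strip=True))    # address, phone, etc.
        nxt = soup.find("a", string="Next")           # placeholder link text
        url = urljoin(url, nxt["href"]) if nxt else None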

Can't separate cells properly with simplehtmldom

允我心安 submitted on 2019-12-12 04:48:26
Question: I am trying to write a web scraper. I want to get all the cells in a row. The row before the one I want has THOROUGHBRED MEETINGS as its plain-text value. I can successfully get this row, but I can't figure out how to get the next row's children, which are the cells (<td> tags).

    if ($foundTag = FindTagByText("THOROUGHBRED MEETINGS", $html)) {
        $cell = $foundTag->parent();
        $row = $cell->parent();
        $nextRow = $row->next_sibling();
        echo "Row: ".$row->plaintext."<br />\n";
        echo "Next Row: ".…
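The parent/next_sibling walk is the right idea; the missing step is iterating the children of $nextRow (its td nodes). For comparison, the same anchor-on-text, step-to-next-row traversal as a Python/BeautifulSoup sketch (table_html stands in for the fetched markup):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(table_html, "html.parser")
    label = soup.find(string=lambda s: s and "THOROUGHBRED MEETINGS" in s)
    row = label.find_parent("tr")             # the row holding the label
    next_row = row.find_next_sibling("tr")    # the row whose cells we want
    for td in next_row.find_all("td"):        # iterate that row's cells
        print(td.get_text(strip=True))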

R - Scraping aspx web error

强颜欢笑 submitted on 2019-12-12 04:32:33
Question:

    library(rvest)
    url <- "http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=14-12-2016&venue=HV&raceno=1&lang=en"
    R1odds <- url %>%
      read_html() %>%
      html_nodes("table") %>%
      .[[2]] %>%
      html_table(fill=TRUE)
    R1odds

I got this error message:

    Error: input conversion failed due to input error, bytes 0x3C 0x2F 0x6E 0x6F [6003]

How do I solve this?

Answer 1: For others who might run into something like this in a non-gambling context, here is the solution to get around the nulls. You'll have to deal with your…
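The answer is truncated, but the error means the parser choked on raw bytes it could not convert, and the answer's mention of "the nulls" points at NUL bytes embedded in the response. The usual remedy is to fetch the raw bytes yourself, scrub them, and hand clean text to the parser. That clean-before-parse idea, sketched in Python with requests and lxml:

    import requests
    from lxml import html

    url = ("http://bet.hkjc.com/racing/pages/odds_wp.aspx"
           "?date=14-12-2016&venue=HV&raceno=1&lang=en")
    raw = requests.get(url).content
    # Drop NUL bytes and decode leniently before parsing.
    text = raw.replace(b"\x00", b"").decode("utf-8", errors="replace")
    tables = html.fromstring(text).findall(".//table")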

Nokogiri Ruby HTML Parser

半腔热情 submitted on 2019-12-12 04:25:36
Question: I'm running into problems scraping across multiple pages with Nokogiri. I need to be able to narrow down the results of what I am searching for based on the qualifying hrefs first. So here is a script to get all of the hrefs I'm interested in obtaining. However, I'm having trouble parsing out the titles of the articles so that I can link to them. It would be great to know that I can manually inspect the elements so that I have the links I want, and whenever I find a link I want I can also grab…
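The excerpt ends before the script itself, but the shape it describes is: collect the anchors whose href matches a qualifying pattern, and keep each anchor's text as the article title. That filter, sketched in Python (the "/articles/" qualifier and start_url are placeholders for whatever the real hrefs share):

    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(start_url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if "/articles/" in a["href"]:             # placeholder qualifier
            title = a.get_text(strip=True)        # the link's visible title
            print(title, "->", a["href"])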

Are web developers allowed to scrape html content?

*爱你&永不变心* submitted on 2019-12-12 03:08:24
Question: I want to scrape HTML content from a couple of websites and view it on my website as a kind of mashup. I will reference and link to them as well! Thank you

Answer 1: Go ahead and do it, but check their robots.txt and make sure there is a way for them to contact you if they have a problem with it. Most people will be happy to get traffic from your mash-up. In any case, the burden is on them to ask you not to.

Answer 2: It is not considered "polite," but it is done often nonetheless. Some websites take…

A PHP HTML parser that lets me do class select and get parent nodes

[亡魂溺海] submitted on 2019-12-12 03:03:56
Question: I'm in a situation where I am scraping a website with PHP, and I need to be able to get a node based on its CSS class. I need to get a ul tag that doesn't have an id attribute but does have a CSS class. I then need to get only the li tags inside it which contain specific anchor tags, not all the li tags. I've looked through DOMDocument and Zend_Dom, and neither meets both requirements: class selection and DOM traversal (specifically, ascending to parents).

Answer 1: You could use QueryPath and…
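The accepted suggestion (QueryPath) adds jQuery-style selectors to PHP. For reference, both requirements, class-based selection and climbing back up to parents, look like this as a Python/BeautifulSoup sketch (the class name and the href filter are placeholders):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page_html, "html.parser")
    for a in soup.find_all("a", href=lambda h: h and "specific" in h):
        li = a.find_parent("li")                        # ascend to the <li>
        ul = li.find_parent("ul", class_="some-class")  # then its classed <ul>
        if ul is not None and not ul.has_attr("id"):    # class but no id
            print(li.get_text(strip=True))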