screen-scraping

Parsing dl with HtmlAgilityPack

痴心易碎 submitted on 2019-12-12 10:58:13
Question: This is the sample HTML I am trying to parse with Html Agility Pack in ASP.NET (C#):

    <div class="content-div">
      <dl>
        <dt><b><a href="1.html" title="1">1</a></b></dt>
        <dd> First Entry</dd>
        <dt><b><a href="2.html" title="2">2</a></b></dt>
        <dd> Second Entry</dd>
        <dt><b><a href="3.html" title="3">3</a></b></dt>
        <dd> Third Entry</dd>
      </dl>
    </div>

The values I want are: the hyperlink -> 1.html, the anchor text -> 1, and the inner text of the dd -> First Entry. (I have taken examples of the first entry here…
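The excerpt is cut off, but the crux is pairing each dt with the dd that follows it. The question is about HtmlAgilityPack in C#; as an illustration of the same sibling-walking idea, here is a minimal sketch in Python with BeautifulSoup, run against the sample HTML above:

    from bs4 import BeautifulSoup

    html = """<div class="content-div"><dl>
    <dt><b><a href="1.html" title="1">1</a></b></dt><dd> First Entry</dd>
    <dt><b><a href="2.html" title="2">2</a></b></dt><dd> Second Entry</dd>
    <dt><b><a href="3.html" title="3">3</a></b></dt><dd> Third Entry</dd>
    </dl></div>"""

    soup = BeautifulSoup(html, "html.parser")
    for dt in soup.select("div.content-div dl dt"):
        a = dt.find("a")                    # the anchor inside the <dt>
        dd = dt.find_next_sibling("dd")     # the <dd> paired with this <dt>
        print(a["href"], a.get_text(), dd.get_text(strip=True))

This prints "1.html 1 First Entry" and so on for each dt/dd pair.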

Beautifulsoup get value in table

99封情书 submitted on 2019-12-12 09:28:17
Question: I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)". What I have works, but it is really ugly and surely not the best way, so I am looking for a better one. Here is what I have:

    soup = BeautifulSoup(url_opener.open(url))
    x = soup('table', text = re.compile("Owner Name"))
    print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next

The relevant HTML is:

    <td valign="top">
      <table border="1" cellpadding="1"…
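The long parent/nextSibling chain is brittle because it hard-codes every intermediate node, whitespace included. A sturdier pattern is to anchor on the label text and climb by tag name. A sketch in Python 3 with a current BeautifulSoup (only part of the page's markup is shown above, so the exact hops are an assumption; page_html stands in for the fetched page):

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page_html, "html.parser")
    label = soup.find(string=re.compile("Owner Name"))
    row = label.find_parent("tr")                      # row holding the label
    owner = row.find_next("tr").get_text(strip=True)   # text of the next row
    print("And the owner is", owner)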

Selenium: Not able to understand xPath

北慕城南 submitted on 2019-12-12 09:26:09
Question: I have some HTML like this:

    <h4 class="box_header clearfix">
      <span>
        <a rel="dialog" href="http://www.google.com/?q=word">Search</a>
      </span>
      <small>
        <span>
          <a rel="dialog" href="http://www.google.com/?q=word">Search</a>
        </span>
    </h4>

I am trying to get the href here in Java using Selenium. I have tried the following:

    selenium.getText("xpath=/descendant::h4[@class='box_header clearfix']/");
    selenium.getAttribute("xpath=/descendant::h4[@class='box_header clearfix']/");

But neither of these works. It…
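Two things stand out: both locators end in a bare /, which makes the XPath invalid, and getText returns the element's text rather than its href. The question uses the old Selenium RC Java API; a hedged sketch of the fix with today's Selenium WebDriver Python bindings (the markup is assumed to match the snippet above, and the URL is a placeholder):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://the-page-under-test.example")   # placeholder URL
    # Take the first anchor under the h4 and read its href attribute.
    link = driver.find_element(
        By.XPATH, "//h4[@class='box_header clearfix']//a[@rel='dialog']")
    print(link.get_attribute("href"))
    driver.quit()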

Using Ruby with Mechanize to log into a website

流过昼夜 submitted on 2019-12-12 08:09:42
Question: I need to scrape data from a site, but it requires logging in first. I've been using Hpricot to successfully scrape other sites, but I'm new to Mechanize and truly baffled by how to work it. I see this example commonly quoted:

    require 'rubygems'
    require 'mechanize'

    a = Mechanize.new
    a.get('http://rubyforge.org/') do |page|
      # Click the login link
      login_page = a.click(page.link_with(:text => /Log In/))

      # Submit the login form
      my_page = login_page.form_with(:action => '/account/login…
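The excerpt stops mid-form, but the flow is: fetch the page, locate the login form, fill in its fields, submit, and keep using the resulting session so the login cookies ride along. The same flow sketched in Python with requests (the URLs and form field names below are placeholders; read the real ones off the site's login form):

    import requests

    session = requests.Session()
    # Placeholder field names; inspect the login <form> for the real ones.
    resp = session.post("https://example.com/account/login",
                        data={"login": "me", "password": "secret"})
    resp.raise_for_status()
    # The session keeps the cookies, so later requests remain signed in.
    page = session.get("https://example.com/members/only/page")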

Scraping hidden HTML (when visible = false) using Hpricot (Ruby on Rails)

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-12 05:48:35
Question: I've come across an issue which unfortunately I can't seem to get past; I'm also brand new to Ruby on Rails, hence the number of questions. I am attempting to scrape a webpage such as the following: http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx I would like to scrape the addresses, the phone numbers, and the URL of the next page, which in this case is http://www.yellowpages.com.mt/Malta/Grocers-Mini-Markets-Retail-In-Malta-Gozo+Ismol.aspx I've been trying…
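The excerpt cuts off before the attempted code, but the task splits into two parts: pull the fields out of the current page, then follow the next-page link and repeat. A hedged sketch of that loop in Python (every selector and link label here is a placeholder; the real class names have to be read off the live page):

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    url = ("http://www.yellowpages.com.mt/Malta/"
           "Grocers-Mini-Markets-Retail-In-Malta-Gozo.aspx")
    while url:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for entry in soup.select("div.listing"):      # placeholder selector
            print(entry.get_text(" ", strip=True))    # address, phone, etc.
        nxt = soup.find("a", string="Next")           # placeholder link text
        url = urljoin(url, nxt["href"]) if nxt else None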

Can't separate cells properly with simplehtmldom

允我心安 submitted on 2019-12-12 04:48:26
Question: I am trying to write a web scraper. I want to get all the cells in a row. The row before the one I want has THOROUGHBRED MEETINGS as its plain-text value. I can successfully get this row, but I can't figure out how to get the next row's children, which are the cells (<td> tags).

    if ($foundTag = FindTagByText("THOROUGHBRED MEETINGS", $html)) {
        $cell = $foundTag->parent();
        $row = $cell->parent();
        $nextRow = $row->next_sibling();
        echo "Row: ".$row->plaintext."<br />\n";
        echo "Next Row: ".…
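The parent/next_sibling walk is the right idea; the missing step is iterating the children of $nextRow (its td nodes). For comparison, the same anchor-on-text, step-to-next-row traversal as a Python/BeautifulSoup sketch (table_html stands in for the fetched markup):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(table_html, "html.parser")
    label = soup.find(string=lambda s: s and "THOROUGHBRED MEETINGS" in s)
    row = label.find_parent("tr")             # the row holding the label
    next_row = row.find_next_sibling("tr")    # the row whose cells we want
    for td in next_row.find_all("td"):        # iterate that row's cells
        print(td.get_text(strip=True))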

R - Scraping aspx web error

强颜欢笑 submitted on 2019-12-12 04:32:33
Question:

    library(rvest)
    url <- "http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=14-12-2016&venue=HV&raceno=1&lang=en"
    R1odds <- url %>%
      read_html() %>%
      html_nodes("table") %>%
      .[[2]] %>%
      html_table(fill=TRUE)
    R1odds

I got this error message:

    Error: input conversion failed due to input error, bytes 0x3C 0x2F 0x6E 0x6F [6003]

How do I solve this?

Answer 1: For others who might run into something like this in a non-gambling context, here is the solution to get around the nulls. You'll have to deal with your…
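The answer is truncated, but the error means the parser choked on raw bytes it could not convert, and the answer's mention of "the nulls" points at NUL bytes embedded in the response. The usual remedy is to fetch the raw bytes yourself, scrub them, and hand clean text to the parser. That clean-before-parse idea, sketched in Python with requests and lxml:

    import requests
    from lxml import html

    url = ("http://bet.hkjc.com/racing/pages/odds_wp.aspx"
           "?date=14-12-2016&venue=HV&raceno=1&lang=en")
    raw = requests.get(url).content
    # Drop NUL bytes and decode leniently before parsing.
    text = raw.replace(b"\x00", b"").decode("utf-8", errors="replace")
    tables = html.fromstring(text).findall(".//table")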

Nokogiri Ruby HTML Parser

半腔热情 submitted on 2019-12-12 04:25:36
Question: I'm running into problems scraping across multiple pages with Nokogiri. I need to be able to narrow down the results of what I am searching for based on the qualifying hrefs first. So here is a script to get all of the hrefs I'm interested in obtaining. However, I'm having trouble parsing out the titles of the articles so that I can link to them. It would be great to know that I can manually inspect the elements so that I have the links I want, and whenever I find a link I want I can also grab…
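The excerpt ends before the script itself, but the shape it describes is: collect the anchors whose href matches a qualifying pattern, and keep each anchor's text as the article title. That filter, sketched in Python (the "/articles/" qualifier and start_url are placeholders for whatever the real hrefs share):

    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(start_url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if "/articles/" in a["href"]:             # placeholder qualifier
            title = a.get_text(strip=True)        # the link's visible title
            print(title, "->", a["href"])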

Are web developers allowed to scrape html content?

*爱你&永不变心* submitted on 2019-12-12 03:08:24
Question: I want to scrape HTML content from a couple of websites and view it on my website as a kind of mashup. I will reference and link to them as well! Thank you

Answer 1: Go ahead and do it, but check their robots.txt and make sure there is a way for them to contact you if they have a problem with it. Most people will be happy to get traffic from your mash-up. In any case, the burden is on them to ask you not to.

Answer 2: It is not considered "polite," but it is done often nonetheless. Some websites take…

A PHP HTML parser that lets me do class select and get parent nodes

[亡魂溺海] submitted on 2019-12-12 03:03:56
Question: I'm in a situation where I am scraping a website with PHP, and I need to be able to get a node based on its CSS class. I need to get a ul tag that doesn't have an id attribute but does have a CSS class. I then need to get only the li tags inside it which contain specific anchor tags, not all the li tags. I've looked through DOMDocument and Zend_Dom, and neither meets both requirements: class selection and DOM traversal (specifically, ascending to parents).

Answer 1: You could use QueryPath and…
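The accepted suggestion (QueryPath) adds jQuery-style selectors to PHP. For reference, both requirements, class-based selection and climbing back up to parents, look like this as a Python/BeautifulSoup sketch (the class name and the href filter are placeholders):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page_html, "html.parser")
    for a in soup.find_all("a", href=lambda h: h and "specific" in h):
        li = a.find_parent("li")                        # ascend to the <li>
        ul = li.find_parent("ul", class_="some-class")  # then its classed <ul>
        if ul is not None and not ul.has_attr("id"):    # class but no id
            print(li.get_text(strip=True))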