html-parsing

Have HTMLParser differentiate between link-text and other data?

时光总嘲笑我的痴心妄想 submitted on 2019-12-14 04:21:07

Question: Say I have HTML similar to this:

```html
<a href="http://example.org/">Stuff I do want</a>
<p>Stuff I don't want</p>
```

Using HTMLParser's handle_data doesn't differentiate between the link text, i.e. the "stuff I do want" (is this even the right term?), and the stuff I don't want. Does HTMLParser have a built-in way to have handle_data return only link text and nothing else?

Answer 1: Basically you have to write a handle_starttag() method as well. Just save off every tag you see as self.lasttag or something. Then, …
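The answer above is cut off, but its tag-tracking idea can be sketched roughly as follows (the attribute name `lasttag` comes from the answer; the class name and the rest of the wiring are my assumptions):

```python
from html.parser import HTMLParser  # Python 3 module path


class LinkTextParser(HTMLParser):
    """Collect only text that appears inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.lasttag = None
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        self.lasttag = tag  # remember the most recent opening tag

    def handle_endtag(self, tag):
        self.lasttag = None

    def handle_data(self, data):
        # Keep the data only if the enclosing tag was <a>
        if self.lasttag == 'a' and data.strip():
            self.link_texts.append(data.strip())


parser = LinkTextParser()
parser.feed('<a href="http://example.org/">Stuff I do want</a>'
            "<p>Stuff I don't want</p>")
print(parser.link_texts)  # → ['Stuff I do want']
```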

How to parse an HTML file and get the text between tags using Python? [duplicate]

寵の児 submitted on 2019-12-14 04:14:14

Question: This question already has answers here. Closed 7 years ago. Possible duplicate: Parsing HTML in Python. I have searched all over the internet for how to get the text between tags using Python. Can you please explain?

Answer 1: Here is an example of using BeautifulSoup to parse HTML:

```python
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

soup = BeautifulSoup("""<html><body>
<div id="a" class="c1"> We want to get this </div>
<div id="b"> We don't want to get this </div>
</body></html>""")
```
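The answer's snippet ends before any text is extracted. One way it might continue, ported to the maintained `bs4` package (the port and the selection step are my assumptions; the original uses the long-deprecated BeautifulSoup 3 import):

```python
from bs4 import BeautifulSoup  # bs4 is the successor of BeautifulSoup 3

soup = BeautifulSoup("""<html><body>
<div id="a" class="c1"> We want to get this </div>
<div id="b"> We don't want to get this </div>
</body></html>""", "html.parser")

# Select the target <div> by its id, then pull out its text.
target = soup.find("div", id="a")
print(target.get_text(strip=True))  # → We want to get this
```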

DOM Parser Foreach

半腔热情 submitted on 2019-12-14 02:42:41

Question: Does anyone know why this wouldn't work?

```php
foreach ($html->find('tbody.result') as $article) {
    // get retail
    $item['Retail'] = trim($article->find('span.price', 0)->plaintext);
    // get soldby
    $item['SoldBy'] = trim($article->find('img', 0)->getAttribute('alt'));
    $articles[] = $item;
}
print_r($articles);
```

Answer 1: Try this:

```php
$html = file_get_html('http://www.amazon.com/gp/offer-listing/B002UYSHMM');
$articles = array();
foreach ($html->find('table tbody.result tr') as $article) {
    if ($article->find('span …
```

Xpath and wildcards

一曲冷凌霜 submitted on 2019-12-14 02:05:26

Question: I have tried several combinations without success. The full XPath to that data is `.//*[@id='detail_row_seek_37878']/td`. The problem is that the number portion ('37878') changes for each node, so I can't use a foreach to loop through the nodes. Is there some way to use a wildcard, reducing the XPath to something like `.//*[@id='detail` plus a wildcard, in order to bypass the literal value portion? I am using Html Agility Pack.

```csharp
HtmlNode ddate = node.SelectSingleNode(".//*[@id='detail_row_seek_37878']/td");
```
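XPath 1.0's `starts-with()` function covers exactly this case: match every `id` that begins with the fixed prefix, whatever the numeric suffix. A small demonstration of the predicate using Python's lxml (the library and the sample rows are my choices for illustration; Html Agility Pack, which the asker uses, accepts the same XPath 1.0 expression):

```python
from lxml import html  # lxml implements XPath 1.0, like Html Agility Pack

doc = html.fromstring("""
<table>
  <tr id="detail_row_seek_37878"><td>2019-01-01</td></tr>
  <tr id="detail_row_seek_52101"><td>2019-02-02</td></tr>
</table>
""")

# starts-with() matches any id beginning with the fixed prefix,
# so the changing numeric suffix no longer matters.
cells = doc.xpath(".//*[starts-with(@id, 'detail_row_seek_')]/td")
print([c.text for c in cells])  # → ['2019-01-01', '2019-02-02']
```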

Download documents from aspx web page in R

拥有回忆 submitted on 2019-12-13 21:18:12

Question: I'm trying to automatically download documents for oil and gas wells from the Colorado Oil and Gas Conservation Commission (COGCC) using the "rvest" and "downloader" packages in R. The link to the table/form that contains the documents for a particular well is:

http://ogccweblink.state.co.us/results.aspx?id=12337064

The "id=12337064" is the unique identifier for the well. The documents on the form page can be downloaded by clicking them. An example is below.

http://ogccweblink.state.co.us

Web crawler to extract in between the list

谁说我不能喝 submitted on 2019-12-13 20:01:07

Question: I am writing a web crawler in Python. I want to get all the content between `<li> </li>` tags. For example:

```html
<li>January 13, 1991: At least 40 people <a href="......."> </a> </li>
```

So here I want to: (a) extract the date and convert it into dd/mm/yyyy format, and (b) extract the number before "people".

```python
soup = BeautifulSoup(page1)
h2 = soup.find_all("li")
count = 0
while count < len(h2):
    print(str(h2[count].get_text().encode('ascii', 'ignore')))
    count += 1
```

I can only extract the text right now.

Answer 1: Get the …
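The answer is truncated, but the two extraction steps the asker describes can be sketched with a regex plus `datetime` (the regex patterns and sample markup are my assumptions, modeled on the example `<li>` above):

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

page1 = '<li>January 13, 1991: At least 40 people <a href="#"> </a></li>'
soup = BeautifulSoup(page1, "html.parser")

results = []
for li in soup.find_all("li"):
    text = li.get_text()
    # (a) parse the leading "Month DD, YYYY" date and reformat to dd/mm/yyyy
    date = datetime.strptime(
        re.match(r"\w+ \d{1,2}, \d{4}", text).group(), "%B %d, %Y")
    # (b) grab the number immediately before the word "people"
    people = re.search(r"(\d+)\s+people", text).group(1)
    results.append((date.strftime("%d/%m/%Y"), people))

print(results)  # → [('13/01/1991', '40')]
```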

Exceptions while I am extracting data from a Web site

久未见 submitted on 2019-12-13 18:37:49

Question: I am using Jsoup to extract data by ZIP code from a website. The ZIP codes are read from a text file and the results are written to the console. I have around 1500 ZIP codes. The program throws two kinds of exceptions:

```
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=500, URL=http://www.moving.com/real-estate/city-profile/...
java.net.SocketTimeoutException: Read timed out
```

I thought the solution was to read only a little data at a time, so I used a counter to count 200 ZIP codes

How to use Python's HTMLParser to extract specific links

醉酒当歌 submitted on 2019-12-13 17:42:53

Question: I've been working on a basic web crawler in Python using the HTMLParser class. I fetch my links with a modified handle_starttag method that looks like this:

```python
def handle_starttag(self, tag, attrs):
    if tag == 'a':
        for (key, value) in attrs:
            if key == 'href':
                newUrl = urljoin(self.baseUrl, value)
                self.links = self.links + [newUrl]
```

This worked very well when I wanted to find every link on the page. Now I only want to fetch certain links. How would I go about only fetching links that are between …
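The question is cut off, but a common form of it is collecting only the links inside one container element. A sketch using a state flag (the container choice, `<div id="content">`, and the class name are my assumptions; the `urljoin`-based link collection mirrors the asker's code):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SectionLinkParser(HTMLParser):
    """Collect <a href> links only while inside <div id="content">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_section = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('id') == 'content':
            self.in_section = True
        elif tag == 'a' and self.in_section and 'href' in attrs:
            self.links.append(urljoin(self.base_url, attrs['href']))

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_section = False  # simplification: assumes no nested divs


parser = SectionLinkParser('http://example.org/')
parser.feed('<a href="/skip">no</a>'
            '<div id="content"><a href="/keep">yes</a></div>')
print(parser.links)  # → ['http://example.org/keep']
```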

Simplexml: parsing HTML leaves out nested elements inside an element with a text node

谁都会走 submitted on 2019-12-13 16:32:54

Question: I'm trying to parse a specific HTML document, a sort of dictionary with about 10,000 words and descriptions. It went well until I noticed that entries in a specific format don't get parsed well. Here is an example:

```php
<?php
$html = '
<p>
  <b>
    <span>zot; zotz </span>
  </b>
  <span>Nista; nula. Isto <b>zilch; zip.</b>
  </span>
</p>
';
$xml = simplexml_load_string($html);
var_dump($xml);
?>
```

The result of var_dump() is:

```
object(SimpleXMLElement)#1 (2) {
  ["b"]=>
  object(SimpleXMLElement)#2 (1) {
    ["span"]
```

How can I parse a remote HTML page using pure JavaScript?

ぃ、小莉子 submitted on 2019-12-13 14:27:01

Question: I have a requirement to parse a remote HTML page (e.g. www.mywesite.com/home). How can I get that website's HTML page source, and how can I parse it? The HTML is like this:

```html
<html>
<body>
  <div class="my-class1">
    <a href="home/link?id=1">hello</a>
  </div>
  <div class="my-class1">
    <a href="home/link?id=2">hey</a>
  </div>
  <div class="my-class1">
    <a href="home/link?id=3">bye</a>
  </div>
</body>
</html>
```

I want the output to be:

hello
hey
bye

I'm not using any server-side technology (like Java or .NET). I want to