html-parsing | 易学教程

Matching pair tag with regex

Question: I'm trying to retrieve specific tags with their content out of an XHTML document, but the regex is matching the wrong ending tags. In the following content: <cache_namespace name="content"> <content_block id="15"> some content here <cache_namespace name="user"> <content_block id="welcome"> Welcome Apikot! </content_block> </cache_namespace> </content_block> </cache_namespace> the content_block ending tag for id="welcome" actually gets matched as the ending tag of the first opening content_block tag.
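A regular expression has no notion of nesting depth, which is why the first closing tag it meets gets paired with the outermost opening tag. The following is a minimal sketch of the alternative, assuming the fragment is well-formed XML as shown in the question. It swaps the regex for Python's standard `xml.etree.ElementTree` parser; the helper name `block_by_id` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The nested fragment from the question, reproduced as one well-formed string.
DOC = (
    '<cache_namespace name="content">'
    '<content_block id="15"> some content here '
    '<cache_namespace name="user">'
    '<content_block id="welcome"> Welcome Apikot! </content_block>'
    '</cache_namespace>'
    '</content_block>'
    '</cache_namespace>'
)

root = ET.fromstring(DOC)


def block_by_id(tree, block_id):
    """Return the content_block element with the given id, or None."""
    for block in tree.iter("content_block"):
        if block.get("id") == block_id:
            return block
    return None


outer = block_by_id(root, "15")
inner = block_by_id(root, "welcome")

# Serializing an element emits its own, correctly paired closing tag.
print(ET.tostring(inner, encoding="unicode"))
print(ET.tostring(outer, encoding="unicode"))
```

Because the parser builds a tree, each `content_block` carries its own subtree, so the inner block never steals the outer block's closing tag.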

How to extract separate text nodes with Jsoup?

Question: I have an element like this: <td> TextA <br/> TextB </td> How can I extract TextA and TextB separately?

Answer 1: Several ways. It really depends on the document itself and whether the given HTML markup is consistent. In this particular example you could get the td's child nodes with Element#childNodes() and then test each node individually to see whether it is a TextNode. E.g. Element td = getItSomehow(); for (Node child : td.childNodes()) { if (child instanceof TextNode) { System.out.println
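The same idea, visiting the children of the td and keeping only the text nodes, can be sketched outside Jsoup as well. The example below is an illustration in Python using the standard `html.parser` module rather than Jsoup's API; the class name `TdTextCollector` is hypothetical.

```python
from html.parser import HTMLParser


class TdTextCollector(HTMLParser):
    """Collect each non-blank text node found inside <td> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of <td> elements currently open
        self.texts = []  # one entry per text node, whitespace trimmed

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text is reported in separate chunks on either side of <br/>.
        if self.depth and data.strip():
            self.texts.append(data.strip())


collector = TdTextCollector()
collector.feed("<td> TextA <br/> TextB </td>")
collector.close()
print(collector.texts)
```

The `<br/>` tag splits the text into two separate data events, which is what makes the two strings retrievable individually.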

How to get node value / innerHTML with XPath?

Question: I have an XPath that selects the class I want: //div[@class='myclass']. But it returns the whole div (including the <div class='myclass'> tag itself), and I would like to return only the contents of this tag without the tag itself. How can I do it?

Answer 1: With XPath, what you get back is the last thing in the path that is not a condition. What does that mean? Well, conditions are the stuff between the []'s (but you already knew that), and yours reads like pathElement[ that has a 'class' attribute
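In a full XPath engine, appending a step such as /text() or /node() to the path selects the content rather than the element. As a rough sketch of the same result in Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and has no text() step, the content can be assembled from the selected element. The sample markup and the helper name `inner_markup` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; only the div with class 'myclass' matters here.
PAGE = "<html><body><div class='myclass'>Hello <b>world</b>!</div></body></html>"

root = ET.fromstring(PAGE)
div = root.find(".//div[@class='myclass']")


def inner_markup(element):
    """Serialize an element's content without its own start and end tags."""
    parts = [element.text or ""]
    for child in element:
        # tostring() appends the child's tail text by default.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


print(inner_markup(div))          # markup inside the div
print("".join(div.itertext()))    # text only, child tags dropped
```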

TagSoup vs. Jsoup vs. HTML Parser vs. HotSax vs [closed]

Question (closed 6 years ago as not a good fit for the Q&A format): The abundance of HTML parsers to choose from (and stick with) is mind-boggling: http://java-source.net/open-source/html-parsers How do

Extract values from HTML TD and Tr

Question: I have some HTML source that I get from a website for option quotes (please see below). What is the best way to extract the various text values in each tr and store them in a collection keyed on the strike price (4700 in this case, available in the middle td as 4700.00)? Some people recommend regex while others suggest using an HTML parser. I'm doing this in VBA, so what's the best way? <!--<td><a href="javascript:popup1('','','1')">Quote</a></td> <td><a href="javascript:popup1('','','','','CE')"><img src="

Parsing JS with Beautiful soup

Question: I have a page parsed with Beautiful Soup, but it contains JS code: <script type="text/javascript"> var utag_data = { customer_id : "_PHL2883198554", customer_type : "New", loyalty_id : "N", declined_loyalty_interstitial : "false", site_version : "Desktop Site", site_currency: "de_DE_EURO", site_region: "uk", site_language: "en-GB", customer_address_zip : "", customer_email_hash : "", referral_source : "", page_type : "product", product_category_name : ["Lingerie"], product_category_id :
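An HTML parser treats the body of a script element as opaque text, so the values inside `utag_data` have to be pulled out in a second step. The sketch below is one possible approach, assuming the script text has already been extracted from the parsed page and that only simple `key : "value"` pairs are wanted. The shortened `SCRIPT` string, the `PAIR` pattern, and the helper `string_fields` are all hypothetical.

```python
import re

# Shortened stand-in for the script body quoted in the question.
SCRIPT = '''
var utag_data = {
  customer_type : "New",
  site_region: "uk",
  site_language: "en-GB",
  page_type : "product",
  product_category_name : ["Lingerie"]
};
'''

# A bare identifier key, a colon, then a double-quoted string value.
PAIR = re.compile(r'(\w+)\s*:\s*"([^"]*)"')


def string_fields(script_text):
    """Return the key/value pairs whose value is a plain quoted string."""
    return dict(PAIR.findall(script_text))


fields = string_fields(SCRIPT)
print(fields["site_language"])
print(sorted(fields))
```

Array values such as `product_category_name` are deliberately skipped by this pattern; handling them would need a real JavaScript or JSON-like parser.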

HTML Agility Pack Parsing With Upper & Lower Case Tags?

Question: I am using the HTML Agility Pack to great effect, and am really impressed with it. However, I am selecting content like so: doc.DocumentNode.SelectSingleNode("//body").InnerHtml How do I deal with the following situation, with different documents? <body> <Body> <BODY> Will my code above only get the lower case versions?

Answer 1: The Html Agility Pack handles HTML in a case-insensitive way. It means it will parse BODY, Body and body the same way. It's by design since HTML is not case sensitive
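Case-insensitive tag handling is common among HTML parsers, not specific to the Html Agility Pack. As an illustration in a different library, Python's standard `html.parser` reports every tag name in lower case regardless of how it was written. The class `TagNames` and the helper `start_tags` are hypothetical names used only for this sketch.

```python
from html.parser import HTMLParser


class TagNames(HTMLParser):
    """Record the name of every start tag as the parser reports it."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def start_tags(markup):
    """Return the start-tag names found in the given markup."""
    parser = TagNames()
    parser.feed(markup)
    parser.close()
    return parser.names


for variant in ("<body>", "<Body>", "<BODY>"):
    print(variant, start_tags(variant))
```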

Beautiful Soup and Table Scraping - lxml vs html parser

Question: I'm trying to extract the HTML code of a table from a webpage using BeautifulSoup. <table class="facts_label" id="facts_table">...</table> I would like to know why the code below works with "html.parser" but prints None if I change "html.parser" to "lxml". #! /usr/bin/python from bs4 import BeautifulSoup from urllib import urlopen webpage = urlopen('http://www.thewebpage.com') soup=BeautifulSoup(webpage, "html.parser") table = soup.find('table', {'class' : 'facts_label'}) print

How to parse malformed HTML in python, using standard libraries

Question: There are so many HTML and XML libraries built into Python that it's hard to believe there's no support for real-world HTML parsing. I've found plenty of great third-party libraries for this task, but this question is about the Python standard library. Requirements: use only Python standard library components (any 2.x version); DOM support; handle HTML entities ( ); handle partial documents (like: Hello, <i>World</i>! ). Bonus points: XPath support; handle unclosed/malformed tags. ( <big>does
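As a partial illustration only, the sketch below uses the event-based `html.parser` module from the Python 3 standard library, not the 2.x versions the question asks about, and it provides neither a DOM nor XPath. It shows that a partial document with an entity and an unclosed tag can be tokenized without an exception. The class `TolerantText` and the sample input are hypothetical.

```python
from html.parser import HTMLParser


class TolerantText(HTMLParser):
    """Gather text and track which tags were opened but never closed."""

    def __init__(self):
        super().__init__()   # character references are decoded by default
        self.chunks = []     # text pieces in document order
        self.open_tags = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        self.chunks.append(data)


parser = TolerantText()
parser.feed("Hello &amp; <i>World</i>! <big>unclosed")
parser.close()  # flushes any buffered trailing text

print("".join(parser.chunks))
print(parser.open_tags)
```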

PHP function to strip tags, except a list of whitelisted tags and attributes

Question: I have to strip all HTML tags and attributes from user input except the ones considered "safe" (i.e., a whitelist approach). strip_tags() strips all tags except the ones listed in the $allowable_tags parameter. But I also need to strip all attributes that are not whitelisted; for example, I want to allow the <b> tag, but I don't want to allow the onclick attribute, for obvious reasons. Is there a function to do that, or will I have to make my own?

Answer 1: As far as I know, the strip_tags
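The whitelist approach described in the question, allowed tags plus allowed attributes per tag, can be sketched as follows. This is an illustration in Python with the standard `html.parser` module rather than PHP, and it is not a vetted sanitizer: for instance, it does not inspect href values for javascript: URLs. The `ALLOWED` table, the class `WhitelistFilter`, and the function `strip_to_whitelist` are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attribute names permitted on that tag.
ALLOWED = {"b": set(), "i": set(), "a": {"href"}}


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep all text."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself, its text content still passes through
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text was entity-decoded by the parser, so escape it again on output.
        self.out.append(escape(data, quote=False))


def strip_to_whitelist(markup):
    """Return markup with non-whitelisted tags and attributes removed."""
    filt = WhitelistFilter()
    filt.feed(markup)
    filt.close()
    return "".join(filt.out)


print(strip_to_whitelist('<b onclick="evil()">bold</b> and <span style="x">plain</span>'))
```

Rebuilding the output from parser events, instead of deleting patterns from the input string, means anything the filter does not explicitly re-emit is dropped by default.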