html-parsing | 易学教程

JSoup.clean() is not preserving relative URLs

Question: I have tried: Whitelist.relaxed(); Whitelist.relaxed().preserveRelativeLinks(true); Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp"); Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp").preserveRelativeLinks(true); None of them work: when I try to clean a relative URL, like <a href="/test.xhtml">test</a>, the href attribute is removed (<a>test</a>). I am using JSoup 1.8.2. Any ideas? Answer 1: The problem most likely stems from

How to parse HTML in ng-repeat in angular.js [duplicate]

Question: This question already has answers here: With ng-bind-html-unsafe removed, how do I inject HTML? (10 answers). Closed last year. I need to parse optional HTML from my model in ng-repeat. I have a repeater in a .jade template like this: tr(ng-repeat='car in cars') td(class='arrived-{{car.arrived}}') {{car.number}} td(class='arrived-{{car.arrived}}') {{car.location}} My car.location can be plain text, like: City name, or it can have some HTML in it, like this: In transit, <a href="http://example

How to create an Jsoup Selector with an AND operation?

Question: I want to find the following tag in an HTML document: <a href="http://www.google.com/AAA" class="link">AAA</a> I know I can use a selector like a[href^=http://www.google.com/] or a[class=link]. But how can I combine these two conditions? Or is there a better way to do this, such as a regex? If so, how? Thanks! Answer 1: Just combine them in a single CSS selector. Elements links = document.select("a[href^=http://www.google.com/][class=link]"); // ... or Elements links = document.select("a.link[href^=http://www

how to get text between a specific span with HtmlUnit

Question: I'm new to HtmlUnit and I'm not even sure it is the right tool for my project. I'm trying to parse a website and extract the values I need from it. I need to get the value "07:05" from this: <span class="tim tim-dep">07:05</span> I know that I can use getTextContent() to extract the value, but I don't know how to select a specific span. I used getElementById to find the <div> tag that this element belongs to, but when I get the text content of that div, I get a whole line

Convert HTML to plain text and maintain structure/formatting, with ruby

Question: I'd like to convert HTML to plain text. I don't want to just strip the tags, though; I'd like to intelligently retain as much formatting as possible: inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc. The input is pretty simple, usually well-formatted HTML (not entire documents, just a bunch of content, usually with no anchors or images). I could put together a couple of regexes that get me 80% there, but figured there might be some existing solutions with
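
The question asks about Ruby; the following is only a minimal sketch of the same idea in Python, using the standard library's html.parser rather than any existing Ruby solution. The names StructuredText, html_to_text, and BLOCK_TAGS are hypothetical, and the set of block-level tags is an assumption that would need extending for real content. It emits a line break for each <br> and a blank line around paragraphs and other block elements.

```python
import re
from html.parser import HTMLParser

# Assumed set of block-level tags; extend it to suit the actual input.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "blockquote"}


class StructuredText(HTMLParser):
    """Hypothetical helper: collects text and marks structural breaks."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")        # single line break
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")      # blank line before a block

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")      # blank line after a block

    def handle_data(self, data):
        # Collapse whitespace runs inside text, roughly as a browser does.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    parser = StructuredText()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" *\n *", "\n", text)     # trim spaces around breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()
```

For example, html_to_text("<p>Hello<br>world</p><p>Second para</p>") yields two paragraphs separated by a blank line, with the <br> kept as a line break inside the first.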

c# using HtmlAgilityPack to get data from HTML table

Question: I am trying to get information out of an HTML table by parsing the HTML using HtmlAgilityPack. Here is what the HTML looks like: ... ... ... <tbody> <tr> <td class="style_19" style="vertical-align: baseline;"> <div class="style_18">AA00857</div> </td> <td class="style_19" style="vertical-align: baseline;"> <div></div> <div class="style_20">TPRCF</div> </td> <td class="style_19" style="vertical-align: baseline;"> <div class="style_21"></div> </td> <td class="style_19" style="vertical-align:

How do I scrape only the <body> tag off of a website

Question: I'm working on a web crawler. At the moment I scrape the whole content and then, using regular expressions, I remove <meta>, <script>, <style> and other tags to get the content of the body. However, I'm trying to optimise the performance and I was wondering if there's a way I could scrape only the <body> of the page? namespace WebScrapper { public static class KrioScraper { public static string scrapeIt(string siteToScrape) { string HTML = getHTML(siteToScrape); string text = stripCode(HTML);

Python HTMLParser: UnicodeDecodeError

Question: I'm using HTMLParser to parse pages I pull down with urllib, and am coming across UnicodeDecodeError exceptions when passing some of them to HTMLParser. I tried using chardet to detect the encodings and to convert to ascii or utf-8 (the docs don't seem to say what it should be). Lossiness is acceptable, but while the decode/encode lines work just fine, I always get the error after self.feed(). The information is there if I just print it out. from HTMLParser import HTMLParser import urllib import
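
The excerpt uses the Python 2 HTMLParser module and is cut off before the failing code, so the actual cause is not shown. As a hedged illustration only, here is a Python 3 sketch (html.parser) of one common way to avoid decode errors: decode the downloaded bytes to text once, with errors="replace", and feed only text to the parser. TextCollector and parse_bytes are hypothetical names, and the default encoding of utf-8 is an assumption; in practice the encoding would come from the HTTP headers or a detector such as chardet.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that collects the text nodes it is fed."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_bytes(raw, encoding="utf-8"):
    # Decode exactly once, before parsing. errors="replace" substitutes
    # U+FFFD for undecodable bytes instead of raising, which fits the
    # "lossiness is acceptable" requirement in the question.
    text = raw.decode(encoding, errors="replace")
    parser = TextCollector()
    parser.feed(text)      # the parser only ever sees str, never bytes
    parser.close()
    return "".join(parser.chunks)
```

The design choice is to keep bytes and text strictly separate: all decoding happens at the boundary, so the parser never has to mix the two.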

iPhone HTML Parsing using TouchXML and tidy

Question: I'm trying to parse HTML using TouchXML. However, it seems that the data I want to parse (I do not control the source; it's downloaded from the internet) is partially malformed, and I get various errors during the parse. It therefore seems that I should be using the built-in tidy support to fix the HTML, but I cannot find any documentation or information on how to enable it or link libtidy successfully into my project. If anyone has any information on how to do this, it'd be much

How to find all text inside <p> elements in an HTML page using BeautifulSoup

Question: I need to find all the visible tags inside paragraph elements in an HTML file using BeautifulSoup in Python. For example, <p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p> should return: Many hundreds of cultivars exist. P.S. Some files contain Unicode characters (Hindi) which need to be extracted. Any ideas how to do that? Answer 1: Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the
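
The answer above is cut off, so its BeautifulSoup code is not reproduced here. As a rough, standard-library-only sketch of the underlying idea (not the answer's method), the following uses html.parser to gather the text inside each <p>, dropping tags such as <a> while keeping their content. ParagraphText and paragraph_texts are hypothetical names. Note that this sketch keeps every text node inside the paragraph, so for the sample markup in the question it retains "named mango" as well as the link text.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Hypothetical helper: collects the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.inside = False      # True while inside a <p>
        self.current = []        # text pieces of the open paragraph
        self.paragraphs = []     # finished paragraph strings

    def flush(self):
        if self.inside:
            self.paragraphs.append("".join(self.current).strip())
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()         # an unclosed <p> ends where the next begins
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.current.append(data)   # nested tags are dropped, text kept


def paragraph_texts(html):
    parser = ParagraphText()
    parser.feed(html)            # expects str, so Hindi text works as-is
    parser.close()
    parser.flush()               # pick up a trailing unclosed paragraph
    return parser.paragraphs
```

Because the parser operates on str, Unicode content such as Hindi needs no special handling once the file has been decoded with the right encoding.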