html-parsing

JSoup.clean() is not preserving relative URLs

浪尽此生 submitted on 2020-01-02 08:40:08
Question: I have tried:
Whitelist.relaxed();
Whitelist.relaxed().preserveRelativeLinks(true);
Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp");
Whitelist.relaxed().addProtocols("a","href","#","/","http","https","mailto","ftp").preserveRelativeLinks(true);
None of them work: when I try to clean a relative URL, like <a href="/test.xhtml">test</a>, the href attribute is removed ( <a>test</a> ). I am using JSoup 1.8.2. Any ideas?
Answer 1: The problem most likely stems from

How to parse HTML in ng-repeat in angular.js [duplicate]

一曲冷凌霜 submitted on 2020-01-02 07:08:19
Question: This question already has answers here: With ng-bind-html-unsafe removed, how do I inject HTML? (10 answers). Closed last year. I need to parse optional HTML from my model in ng-repeat. I have a repeater in a .jade template like this:
tr(ng-repeat='car in cars')
  td(class='arrived-{{car.arrived}}') {{car.number}}
  td(class='arrived-{{car.arrived}}') {{car.location}}
My car.location can be plain text like: City name, or it can have some HTML in it, like this: In transit, <a href="http://example

How to create a Jsoup Selector with an AND operation?

久未见 submitted on 2020-01-02 05:50:04
Question: I want to find the following tag in an HTML document: <a href="http://www.google.com/AAA" class="link">AAA</a> I know I can use a selector like a[href^=http://www.google.com/] or a[class=link]. But how can I combine these two conditions? Or is there a better way to do this, such as a regex, and how? Thanks!
Answer 1: Just combine them in a single CSS selector.
Elements links = document.select("a[href^=http://www.google.com/][class=link]");
// ... or
Elements links = document.select("a.link[href^=http://www
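The answer's idea is that two conditions placed on the same element act as a logical AND. As a rough illustration of that logic only (not Jsoup, and not the answerer's code), here is a minimal sketch using Python's standard-library html.parser. The names LinkFilter and find_links and the extra sample links are hypothetical. The class test is token-based, so it behaves like a.link rather than the exact-match [class=link].

```python
from html.parser import HTMLParser


class LinkFilter(HTMLParser):
    """Collect hrefs of <a> tags that satisfy BOTH conditions (hypothetical helper)."""

    def __init__(self, href_prefix, class_name):
        super().__init__()
        self.href_prefix = href_prefix
        self.class_name = class_name
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href") or ""
        classes = (attr_map.get("class") or "").split()
        # Condition 1 AND condition 2, like a.link[href^=prefix] in CSS.
        if href.startswith(self.href_prefix) and self.class_name in classes:
            self.matches.append(href)


def find_links(html, href_prefix, class_name):
    parser = LinkFilter(href_prefix, class_name)
    parser.feed(html)
    parser.close()
    return parser.matches


# Sample markup: only the first link meets both conditions.
sample = (
    '<a href="http://www.google.com/AAA" class="link">AAA</a>'
    '<a href="http://www.google.com/BBB" class="other">BBB</a>'
    '<a href="http://example.com/CCC" class="link">CCC</a>'
)
print(find_links(sample, "http://www.google.com/", "link"))
```

Links that match only the prefix or only the class are skipped, which is the same narrowing effect the combined selector has.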

how to get text between a specific span with HtmlUnit

…衆ロ難τιáo~ submitted on 2020-01-02 04:49:09
Question: I'm new to HtmlUnit and I'm not even sure if it is the right tool for my project. I'm trying to parse a website and extract the values I need from it. I need to get the value "07:05" from this: <span class="tim tim-dep">07:05</span> I know that I can use getTextContent() to extract the value, but I don't know how to select a specific span. I used getElementById to find the <div> tag that this expression belongs to, but when I get the text content of that div, I get a whole line

Convert HTML to plain text and maintain structure/formatting, with ruby

放肆的年华 submitted on 2020-01-02 04:36:05
Question: I'd like to convert HTML to plain text. I don't want to just strip the tags, though; I'd like to intelligently retain as much formatting as possible: inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc. The input is pretty simple, usually well-formatted HTML (not entire documents, just a bunch of content, usually with no anchors or images). I could put together a couple of regexes that get me 80% there but figured there might be some existing solutions with
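The behaviour described in the question (a line break for <br>, a blank line between paragraphs) can be sketched with an event-based parser instead of regexes. The question asks about Ruby; the sketch below uses Python's standard-library html.parser purely to illustrate the approach. TextExtractor, html_to_text and the BLOCK_TAGS set are hypothetical names and choices, not an existing library.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags end a block and should be followed by a blank line.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Accumulate text, turning <br> and block ends into newlines."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Collapse whitespace runs but keep single spaces around inline tags.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()


print(html_to_text("<p>First <b>bold</b> line<br>second line</p><p>Next paragraph</p>"))
```

This covers only the cases the question names; lists, tables and headings would need their own rules.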

C# using HtmlAgilityPack to get data from HTML table

流过昼夜 submitted on 2020-01-02 04:31:26
Question: I am trying to get information out of an HTML table by parsing the HTML using HtmlAgilityPack. Here is what the HTML looks like: ... ... ... <tbody> <tr> <td class="style_19" style="vertical-align: baseline;"> <div class="style_18">AA00857</div> </td> <td class="style_19" style="vertical-align: baseline;"> <div></div> <div class="style_20">TPRCF</div> </td> <td class="style_19" style="vertical-align: baseline;"> <div class="style_21"></div> </td> <td class="style_19" style="vertical-align:

How do I scrape only the <body> tag off of a website

旧时模样 submitted on 2020-01-02 02:32:12
Question: I'm working on a web crawler. At the moment I scrape the whole content and then use regular expressions to remove <meta>, <script>, <style> and other tags and get the content of the body. However, I'm trying to optimise the performance, and I was wondering if there's a way I could scrape only the <body> of the page?
namespace WebScrapper
{
    public static class KrioScraper
    {
        public static string scrapeIt(string siteToScrape)
        {
            string HTML = getHTML(siteToScrape);
            string text = stripCode(HTML);

Python HTMLParser: UnicodeDecodeError

有些话、适合烂在心里 submitted on 2020-01-02 00:55:19
Question: I'm using HTMLParser to parse pages I pull down with urllib, and am coming across UnicodeDecodeError exceptions when passing some of them to HTMLParser. I tried using chardet to detect the encodings and to convert to ascii or utf-8 (the docs don't seem to say what it should be). Lossiness is acceptable, but while the decode/encode lines work just fine, I always get the error after self.feed(). The information is there if I just print it out.
from HTMLParser import HTMLParser
import urllib
import
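One common way to avoid this class of error is to decode the downloaded bytes into text once, up front, and feed only text to the parser. The sketch below assumes Python 3, where the module is html.parser rather than the HTMLParser module imported in the question, and it swaps chardet for a fixed list of candidate encodings. to_text, DataCollector, extract_data and the encoding list are hypothetical, and this is not necessarily what the accepted answer proposes.

```python
from html.parser import HTMLParser


def to_text(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes to str, trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Lossy last resort: undecodable bytes become U+FFFD.
    return raw.decode("utf-8", errors="replace")


class DataCollector(HTMLParser):
    """Collect the non-empty text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_data(raw):
    parser = DataCollector()
    parser.feed(to_text(raw))  # feed() always receives str, never bytes
    parser.close()
    return parser.chunks


# b"\xe9" is not valid UTF-8 here, so the cp1252 fallback is used.
print(extract_data(b"<p>caf\xe9</p>"))
```

The final errors="replace" step matches the question's note that lossiness is acceptable.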

iPhone HTML Parsing using TouchXML and tidy

主宰稳场 submitted on 2020-01-01 22:41:30
Question: I'm trying to parse HTML using TouchXML. However, it seems that the data I want to parse (I do not control the source; it's downloaded from the internet) is partially malformed: I get various errors during the parse. Therefore, it seems that I should be using the built-in tidy support to fix the HTML, but I cannot find any documentation or information on how to enable it or link libtidy successfully into my project. If anyone has any information on how to do this, it'd be much

How to find all text inside <p> elements in an HTML page using BeautifulSoup

杀马特。学长 韩版系。学妹 submitted on 2020-01-01 19:38:34
Question: I need to find all the visible tags inside paragraph elements in an HTML file using BeautifulSoup in Python. For example, <p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p> should return: Many hundreds of cultivars exist. P.S. Some files contain Unicode characters (Hindi) which need to be extracted. Any ideas how to do that?
Answer 1: Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the
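For comparison with the BeautifulSoup answer, here is a minimal standard-library sketch of the "drop the tags, keep their content" idea, limited to text inside <p> elements. It is not the answerer's VALID_TAGS code. ParagraphText and paragraph_texts are hypothetical names, and the sketch keeps the text of every nested tag, so the link text in the question's sample is preserved along with the surrounding words.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Gather the text content of each <p>, ignoring nested tag markup."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.current = []
        self.paragraphs = []

    def _flush(self):
        if self.in_p:
            self.paragraphs.append("".join(self.current).strip())
        self.current = []
        self.in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self.in_p:
            self.current.append(data)


def paragraph_texts(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing <p> with no closing tag
    return parser.paragraphs


sample = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
)
print(paragraph_texts(sample))
```

Because the parser is fed Python str objects, non-ASCII text such as Hindi passes through unchanged as long as the file was decoded correctly before parsing.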