tag-soup

Processing (too) many XML files (with TagSoup)

怎甘沉沦 submitted on 2019-12-24 02:16:36
Question: I have a directory with about 4500 XML (HTML5) files, and I want to create a "manifest" of their data (essentially title and base/@href). To this end, I've been using a function to collect all the relevant file paths, opening them with readFile, sending them into a tagsoup-based parser and then outputting/formatting the resultant list. This works for a subset of the files, but eventually runs into an openFile: resource exhausted (Too many open files) error. After doing some reading, this isn

Hello World Saxon with Java

感情迁移 submitted on 2019-12-11 18:07:24
Question: Using the JAR files installed through apt for Saxon-HE and TagSoup, parsing HTML is a one-liner: thufir@dur:~/saxon$ thufir@dur:~/saxon$ java -cp /usr/share/java/Saxon-HE-9.8.0.14.jar:/usr/share/java/tagsoup-1.2.1.jar net.sf.saxon.Query -x:org.ccil.cowan.tagsoup.Parser -qs:doc\(\'http://books.toscrape.com/\'\) <?xml version="1.0" encoding="UTF-8"?><!--[if lt IE 7]> <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]> <html lang="en-us" class="no-js lt-ie9 lt-ie8

jTidy and TagSoup documentation

泪湿孤枕 submitted on 2019-12-10 04:24:46
Question: I'm looking for documentation (official documentation, if possible) for the TagSoup and jTidy libraries. I want to use these libraries to manipulate HTML "tagsoup" files that include XML tags with different namespaces mixed in between HTML (html, xhtml or html5) tags. I have tested HTMLCleaner, NekoHTML and Jericho, but I can't find documentation for jTidy and TagSoup, apart from the simplest examples of cleaning a file. I need documentation about manipulating contents, replacing tags, extracting info, etc...

XPath Expression returns nothing for //element, but //* returns a count

懵懂的女人 submitted on 2019-12-09 18:33:51
Question: I'm using XOM with the following sample data: Element root = cleanDoc.getRootElement(); //find all the bold elements, as those mark institution and clinic. Nodes nodes = root.query("//*"); <html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml"> <head> <title>Patient Information</title> </head> </html> The following query returns many elements (from real data): //* but something like //head returns nothing. If I run through the children of the root, the numbers
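A symptom like this usually comes from the default namespace: in XPath 1.0 an unprefixed name such as head only matches elements in no namespace, while the sample document puts every element in http://www.w3.org/1999/xhtml. The sketch below reproduces the effect with the JDK's own JAXP classes rather than XOM (a substitution made so the example runs without extra libraries); the class name, the helper count and the inline sample string are illustrative assumptions, not part of the question.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: JAXP stands in for XOM here.
final class NamespacedXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Same shape as the sample data in the question, without whitespace.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:html=\"http://www.w3.org/1999/xhtml\">"
        + "<head><title>Patient Information</title></head></html>";

    /** Counts the nodes an XPath 1.0 expression selects; the prefix "html" is bound to the XHTML namespace. */
    static int count(String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // without this the DOM carries no namespace information
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(SAMPLE)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) { return null; }            // not needed for evaluation
                @Override
                public Iterator<String> getPrefixes(String uri) { return null; } // not needed for evaluation
            });
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//*"));         // any element, regardless of namespace
        System.out.println(count("//head"));      // unprefixed name: elements in no namespace only
        System.out.println(count("//html:head")); // prefixed name: elements in the XHTML namespace
    }
}
```

XOM appears to offer the same remedy through an XPathContext passed as a second argument to query, binding a prefix to the XHTML namespace URI; check the XOM documentation before relying on that.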

SAX error: incompatible types: String cannot be converted to InputSource

為{幸葍}努か submitted on 2019-12-08 09:35:17
Question: Relevant code; barfs on instantiating the SAXSource: TransformerFactory factory = TransformerFactory.newInstance(); XMLReader xmlReader = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser"); Source input = new SAXSource(xmlReader, "http://books.toscrape.com/"); Result output = new StreamResult(System.out); factory.newTransformer().transform(input, output); The Javadoc says: public SAXSource(XMLReader reader, InputSource inputSource) Create a SAXSource, using an XMLReader and a
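The constructor quoted from the Javadoc takes an org.xml.sax.InputSource as its second argument, so the String has to be wrapped first; InputSource has a constructor that accepts a system identifier (a URL string) and another that accepts a Reader. In the code above that would presumably mean new SAXSource(xmlReader, new InputSource("http://books.toscrape.com/")). The sketch below shows the same wiring with an identity transform. To stay runnable without TagSoup or network access it assumes the JDK's default XMLReader and an in-memory string, and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical demo class; the JDK parser stands in for org.ccil.cowan.tagsoup.Parser.
final class SaxSourceSketch {
    /** Runs an identity transform over the given XML text and returns the serialized result. */
    static String copy(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();

            // SAXSource wants an InputSource, not a String.
            // new InputSource("http://...") would wrap a URL instead of a Reader.
            InputSource in = new InputSource(new StringReader(xml));

            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new SAXSource(reader, in), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(copy("<p>Hello</p>"));
    }
}
```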

TagSoup and XPath

核能气质少年 submitted on 2019-12-06 09:07:54
Question: I'm trying to use TagSoup with XPath (JAXP). I know how to obtain a SAX parser (or XMLReader) from TagSoup. But I failed to find out how to create a DocumentBuilder that will use that SAX parser. How do I do that? Thank you. EDIT: Sorry for being so general, but the Java XML API is such a pain. EDIT2: Problem solved: public static void main(String[] args) throws XPathExpressionException, IOException, SAXNotRecognizedException, SAXNotSupportedException, TransformerFactoryConfigurationError,

jTidy and TagSoup documentation

痴心易碎 submitted on 2019-12-05 06:15:01
I'm looking for documentation (official documentation, if possible) for the TagSoup and jTidy libraries. I want to use these libraries to manipulate HTML "tagsoup" files that include XML tags with different namespaces mixed in between HTML (html, xhtml or html5) tags. I have tested HTMLCleaner, NekoHTML and Jericho, but I can't find documentation for jTidy and TagSoup, apart from the simplest examples of cleaning a file. I need documentation about manipulating contents, replacing tags, extracting info, etc... Thanks. Note: After testing all options, I used StAX / Woodstox: http://wiki.fasterxml.com/WoodstoxHome

TagSoup and XPath

匆匆过客 submitted on 2019-12-04 11:51:18
I'm trying to use TagSoup with XPath (JAXP). I know how to obtain a SAX parser (or XMLReader) from TagSoup. But I failed to find out how to create a DocumentBuilder that will use that SAX parser. How do I do that? Thank you. EDIT: Sorry for being so general, but the Java XML API is such a pain. EDIT2: Problem solved: public static void main(String[] args) throws XPathExpressionException, IOException, SAXNotRecognizedException, SAXNotSupportedException, TransformerFactoryConfigurationError, TransformerException { XPathFactory xpathFac = XPathFactory.newInstance(); XPath xpath = xpathFac.newXPath();
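Since the posted solution is cut off here, the following is only one common way to bridge a SAX parser and XPath, not necessarily the one the author used: skip DocumentBuilder altogether, let an identity Transformer copy the SAX events into a DOMResult, and evaluate XPath against the resulting node. The sketch assumes the JDK's default XMLReader in place of TagSoup's Parser, and the class name, method name and sample markup are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical demo class; any XMLReader (for example TagSoup's) could replace the JDK one.
final class SaxToDomXPathSketch {
    /** Builds a DOM from SAX events, then evaluates an XPath expression to its string value. */
    static String evaluate(String xml, String expression) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();

            // The identity transform replays the reader's SAX events into a fresh DOM tree.
            DOMResult dom = new DOMResult();
            TransformerFactory.newInstance().newTransformer()
                .transform(new SAXSource(reader, new InputSource(new StringReader(xml))), dom);

            Node root = dom.getNode();
            return XPathFactory.newInstance().newXPath().evaluate(expression, root);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><head><title>Books</title></head><body/></html>";
        System.out.println(evaluate(xml, "/html/head/title"));
    }
}
```

With a reader that reports elements in the XHTML namespace, unprefixed paths such as /html/head/title would stop matching, so a prefix binding or a parser setting that turns namespaces off would be needed.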

How to get an attribute from an XMLReader

做~自己de王妃 submitted on 2019-12-04 09:41:25
Question: I have some HTML that I'm converting to a Spanned using Html.fromHtml(...), and I have a custom tag that I'm using in it: <customtag id="1234"> So I've implemented a TagHandler to handle this custom tag, like so: public void handleTag( boolean opening, String tag, Editable output, XMLReader xmlReader ) { if ( tag.equalsIgnoreCase( "customtag" ) ) { String id = xmlReader.getProperty( "id" ).toString(); } } In this case I get a SAX exception, as I believe the "id" field is actually an
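In plain SAX, XMLReader.getProperty looks up parser configuration properties by name, which would explain the exception; element attributes are instead passed to ContentHandler.startElement as an Attributes object. The sketch below shows that mechanism outside Android, using the JDK parser and a small well-formed sample. How to reach the attributes from inside Html.TagHandler is not covered, and the class and method names are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: plain SAX, no Android classes involved.
final class SaxAttributeSketch {
    /** Returns the "id" attribute of the first customtag element, or null if there is none. */
    static String firstCustomTagId(String xml) {
        final String[] found = new String[1];
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    // Attributes arrive with the start-element event, not as reader properties.
                    if (found[0] == null && "customtag".equalsIgnoreCase(qName)) {
                        found[0] = atts.getValue("id");
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found[0];
    }

    public static void main(String[] args) {
        System.out.println(firstCustomTagId("<doc><customtag id=\"1234\">text</customtag></doc>"));
    }
}
```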

Jsoup css selector code (xpath code included)

元气小坏坏 submitted on 2019-12-03 18:05:35
Question: I am trying to parse the HTML below using jsoup but am not able to get the right syntax for it. <div class="info"><strong>Line 1:</strong> some text 1<br> <b>some text 2</b><br> <strong>Line 3:</strong> some text 3<br> </div> I need to capture some text 1, some text 2 and some text 3 in three different variables. I have the XPath for the first line (which should be similar for line 3) but am unable to work out the equivalent CSS selector. //div[@class='info']/strong[1]/following::text() Please help. On a