tag-soup | 易学教程

Processing (too) many XML files (with TagSoup)

Question: I have a directory with about 4500 XML (HTML5) files, and I want to create a "manifest" of their data (essentially title and base/@href). To this end, I've been using a function to collect all the relevant file paths, opening them with readFile, sending them into a TagSoup-based parser, and then formatting and outputting the resulting list. This works for a subset of the files, but eventually runs into an openFile: resource exhausted (Too many open files) error. After doing some reading, this isn

Hello World Saxon with Java

Question: Using the JAR files installed through apt for Saxon-HE and TagSoup, parsing HTML is a one-liner: thufir@dur:~/saxon$ java -cp /usr/share/java/Saxon-HE-9.8.0.14.jar:/usr/share/java/tagsoup-1.2.1.jar net.sf.saxon.Query -x:org.ccil.cowan.tagsoup.Parser -qs:doc\(\'http://books.toscrape.com/\'\) <?xml version="1.0" encoding="UTF-8"?><!--[if IE 7]> <html lang="en-us" class="no-js lt-ie9 lt-ie8

jTidy and TagSoup documentation

Question: I'm looking for documentation (official documentation, if possible) for the TagSoup and jTidy libraries. I want to use these libraries to manipulate HTML "tag soup" files that include XML tags from different namespaces mixed in with HTML (HTML, XHTML or HTML5) tags. I have tested HTMLCleaner, NekoHTML and Jericho, but I can't find documentation for jTidy and TagSoup apart from the simplest examples of cleaning a file. I need documentation on manipulating content, replacing tags, extracting information, etc...

XPath Expression returns nothing for //element, but //* returns a count

Question: I'm using XOM with the following sample data: Element root = cleanDoc.getRootElement(); //find all the bold elements, as those mark institution and clinic. Nodes nodes = root.query("//*"); <html xmlns="http://www.w3.org/1999/xhtml" xmlns:html="http://www.w3.org/1999/xhtml"> <head> <title>Patient Information</title> </head> </html> The following expression returns many elements (from real data): //* but something like //head returns nothing. If I run through the children of the root, the numbers
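The symptom described above (`//*` matches, `//head` does not) is typical of XPath 1.0 against a document with a default namespace: an unprefixed name in the expression refers to an element in no namespace, while every element in the sample is in the XHTML namespace. The following is a minimal sketch of that behavior using the JDK's built-in JAXP classes rather than XOM, so it only illustrates the idea. The class name `NamespaceXPathDemo` and the prefix `h` are assumptions made for this example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // The sample document from the question, with whitespace removed.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\""
        + " xmlns:html=\"http://www.w3.org/1999/xhtml\">"
        + "<head><title>Patient Information</title></head></html>";

    // Counts the nodes selected by an XPath expression in SAMPLE.
    static int count(String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for namespace-based matching
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the (hypothetical) prefix "h" to the XHTML namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "h" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                        ? Collections.singletonList("h").iterator()
                        : Collections.<String>emptyIterator();
                }
            });

            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("//*      -> " + count("//*"));      // all elements
        System.out.println("//head   -> " + count("//head"));   // no-namespace name
        System.out.println("//h:head -> " + count("//h:head")); // namespaced name
    }
}
```

XOM's `query` method has an overload that accepts an `XPathContext` for the same purpose; binding a prefix there and querying `//h:head` should behave the same way, though that is not shown here.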

SAX error: incompatible types: String cannot be converted to InputSource

Question: Relevant code; it fails when instantiating the SAXSource: TransformerFactory factory = TransformerFactory.newInstance(); XMLReader xmlReader = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser"); Source input = new SAXSource(xmlReader, "http://books.toscrape.com/"); Result output = new StreamResult(System.out); factory.newTransformer().transform(input, output); The Javadoc says: public SAXSource(XMLReader reader, InputSource inputSource) Create a SAXSource, using an XMLReader and a
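The compile error follows from the constructor signature quoted above: the second argument must be an `org.xml.sax.InputSource`, not a `String`. `InputSource` has a constructor that takes a system identifier (a URL string) and another that takes a `Reader`, so the string only needs to be wrapped. The sketch below assumes the JDK's default `XMLReader` as a stand-in so that it runs without TagSoup on the classpath, and it reads from an in-memory string instead of the network. The class name `SaxSourceDemo` is hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxSourceDemo {
    // Parses the given markup through a SAXSource and serializes it back out.
    static String serialize(String xml) {
        try {
            // Stand-in reader (assumption). With TagSoup available this could be:
            //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();

            // Wrap the input in an InputSource. For a URL, the equivalent is:
            //   new InputSource("http://books.toscrape.com/")
            InputSource input = new InputSource(new StringReader(xml));

            SAXSource source = new SAXSource(reader, input);

            // Identity transform: copies the parsed events to the output.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            transformer.transform(source, new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("<p>hello</p>"));
    }
}
```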

TagSoup and XPath

Question: I'm trying to use TagSoup with XPath (JAXP). I know how to obtain a SAX parser (or XMLReader) from TagSoup, but I couldn't find out how to create a DocumentBuilder that uses that SAX parser. How do I do that? Thank you. EDIT: Sorry for being so general, but the Java XML API is such a pain. EDIT2: Problem solved: public static void main(String[] args) throws XPathExpressionException, IOException, SAXNotRecognizedException, SAXNotSupportedException, TransformerFactoryConfigurationError,

jTidy and TagSoup documentation

I'm looking for documentation (official documentation, if possible) for the TagSoup and jTidy libraries. I want to use these libraries to manipulate HTML "tag soup" files that include XML tags from different namespaces mixed in with HTML (HTML, XHTML or HTML5) tags. I have tested HTMLCleaner, NekoHTML and Jericho, but I can't find documentation for jTidy and TagSoup apart from the simplest examples of cleaning a file. I need documentation on manipulating content, replacing tags, extracting information, etc... Thanks. Note: After testing all the options, I used StAX / Woodstox: http://wiki.fasterxml.com/WoodstoxHome

TagSoup and XPath

I'm trying to use TagSoup with XPath (JAXP). I know how to obtain a SAX parser (or XMLReader) from TagSoup, but I couldn't find out how to create a DocumentBuilder that uses that SAX parser. How do I do that? Thank you. EDIT: Sorry for being so general, but the Java XML API is such a pain. EDIT2: Problem solved: public static void main(String[] args) throws XPathExpressionException, IOException, SAXNotRecognizedException, SAXNotSupportedException, TransformerFactoryConfigurationError, TransformerException { XPathFactory xpathFac = XPathFactory.newInstance(); XPath xpath = xpathFac.newXPath();
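The posted solution is cut off, so the following is not a reconstruction of it. It is a sketch of one common way to bridge a SAX `XMLReader` and JAXP XPath without a `DocumentBuilder`: run an identity `Transformer` from a `SAXSource` into a `DOMResult`, then evaluate XPath against the resulting DOM node. The JDK's default reader is used here as a stand-in for TagSoup (an assumption, so the example runs without extra JARs), and the class name `SaxToDomXPathDemo` is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDomXPathDemo {
    // Builds a DOM tree from SAX events, then evaluates an XPath expression
    // against it and returns the string value of the result.
    static String evaluate(String xml, String expression) {
        try {
            // Stand-in reader (assumption). With TagSoup available this could be:
            //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();

            // The identity transform replays the SAX events into a DOM tree.
            DOMResult dom = new DOMResult();
            TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(reader, new InputSource(new StringReader(xml))), dom);

            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, dom.getNode());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><head><title>Books</title></head><body/></html>";
        System.out.println(evaluate(xml, "/html/head/title"));
    }
}
```

One thing to keep in mind when swapping in TagSoup: by default it reports elements in the XHTML namespace, so unprefixed XPath steps such as `/html/head/title` would need either a bound prefix or namespace reporting turned off on the reader.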

How to get an attribute from an XMLReader

Question: I have some HTML that I'm converting to a Spanned using Html.fromHtml(...), and I have a custom tag that I'm using in it: <customtag id="1234"> So I've implemented a TagHandler to handle this custom tag, like so: public void handleTag( boolean opening, String tag, Editable output, XMLReader xmlReader ) { if ( tag.equalsIgnoreCase( "customtag" ) ) { String id = xmlReader.getProperty( "id" ).toString(); } } In this case I get a SAX exception, as I believe the "id" field is actually an
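In the SAX API, `XMLReader.getProperty` looks up parser-level properties (identified by name, usually a URI), not attributes of the element being parsed. Element attributes are delivered to `ContentHandler.startElement` through its `Attributes` argument. The sketch below shows that mechanism in plain SAX on the JDK; it does not cover how to reach the attributes from inside Android's `Html.TagHandler`, which receives only the tag name and the reader. The class name `AttributeReaderDemo` and the method `idsOf` are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class AttributeReaderDemo {
    // Collects the "id" attribute of every element named tagName.
    static List<String> idsOf(String xml, final String tagName) {
        final List<String> ids = new ArrayList<>();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();

            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    if (qName.equalsIgnoreCase(tagName)) {
                        // Attributes arrive with the start-element event.
                        String id = attributes.getValue("id");
                        if (id != null) {
                            ids.add(id);
                        }
                    }
                }
            });

            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<p><customtag id=\"1234\">text</customtag></p>";
        System.out.println(idsOf(xml, "customtag"));
    }
}
```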

Jsoup css selector code (xpath code included)

Question: I am trying to parse the HTML below using jsoup but can't get the syntax right. <div class="info"><strong>Line 1:</strong> some text 1<br> <b>some text 2</b><br> <strong>Line 3:</strong> some text 3<br> </div> I need to capture some text 1, some text 2 and some text 3 in three different variables. I have the XPath for the first line (which should be similar for line 3) but can't work out the equivalent CSS selector. //div[@class='info']/strong[1]/following::text() Please help. On a