tag-soup

How to get an attribute from an XMLReader

大兔子大兔子 submitted on 2019-12-03 03:51:26
I have some HTML that I'm converting to a Spanned using Html.fromHtml(...), and I have a custom tag that I'm using in it:

    <customtag id="1234">

So I've implemented a TagHandler to handle this custom tag, like so:

    public void handleTag( boolean opening, String tag, Editable output, XMLReader xmlReader ) {
        if ( tag.equalsIgnoreCase( "customtag" ) ) {
            String id = xmlReader.getProperty( "id" ).toString();
        }
    }

In this case I get a SAX exception, as I believe the "id" field is actually an attribute, not a property. However, there isn't a getAttribute() method for XMLReader. So my question is, how
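
For context on why the call above fails: XMLReader.getProperty looks up parser-level properties by name and throws for names it does not recognise, whereas SAX hands element attributes to ContentHandler.startElement through its Attributes argument. The following is a minimal plain-SAX sketch of that, not a fix for the TagHandler case, since handleTag is not given the Attributes object. The class name SaxAttributeDemo, the method firstAttribute, and the sample markup are assumptions made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original question.
final class SaxAttributeDemo {

    // Returns the value of attrName on the first <tagName> element that
    // carries it, or null if no such element/attribute is found.
    static String firstAttribute(String xml, final String tagName, final String attrName)
            throws Exception {
        final String[] found = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                // Attributes arrive here, once per element; they are not parser properties.
                if (found[0] == null && qName.equalsIgnoreCase(tagName)) {
                    found[0] = attributes.getValue(attrName);
                }
            }
        };
        // The default factory is not namespace-aware, so qName is the plain tag name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return found[0];
    }

    public static void main(String[] args) throws Exception {
        // Sample input is an assumption; it must be well-formed XML for this parser.
        String xml = "<p>before <customtag id=\"1234\">inside</customtag> after</p>";
        System.out.println(firstAttribute(xml, "customtag", "id"));
    }
}
```

The sketch uses the JDK's built-in SAX parser, which only accepts well-formed input; the HTML parser behind Html.fromHtml is more lenient, so behaviour on real-world markup may differ.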

With Haskell, how do I process large volumes of XML?

≡放荡痞女 submitted on 2019-12-02 18:19:26
I've been exploring the Stack Overflow data dumps and thus far taking advantage of the friendly XML and “parsing” with regular expressions. My attempts with various Haskell XML libraries to find the first post in document-order by a particular user all ran into nasty thrashing.

TagSoup

    import Control.Monad
    import Text.HTML.TagSoup

    userid = "83805"

    main = do
      posts <- liftM parseTags (readFile "posts.xml")
      print $ head $ map (fromAttrib "Id") $
        filter (~== ("<row OwnerUserId=" ++ userid ++ ">"))
        posts

hxt

    import Text.XML.HXT.Arrow
    import Text.XML.HXT.XPath

    userid = "83805"

    main = do
      runX $

How to use JAXB with HTML?

帅比萌擦擦* submitted on 2019-12-01 09:08:18
I would like to unmarshall some nasty HTML to a Java object using JAXB. (I'm on Java 7.) TagSoup is a SAX-compliant XML parser that can handle nasty HTML. How can I set up JAXB to use TagSoup for unmarshalling HTML? I tried setting

    System.setProperty("org.xml.sax.driver", "org.ccil.cowan.tagsoup.Parser");

If I create an XMLReader, it uses TagSoup, but not when I use JAXB. Does com.sun.xml.internal.bind.v2.runtime.JAXBContextImpl use DOM or SAX for parsing XML? How can I tell JAXB to use SAX? How can I tell JAXB to use TagSoup as its SAX implementation? As per Blaise's suggestion, tried below,

Jsoup css selector code (xpath code included)

痞子三分冷 submitted on 2019-11-29 07:57:32
I am trying to parse the HTML below using jsoup but am not able to get the right syntax for it.

    <div class="info"><strong>Line 1:</strong> some text 1<br>
    <b>some text 2</b><br>
    <strong>Line 3:</strong> some text 3<br>
    </div>

I need to capture some text 1, some text 2 and some text 3 in three different variables. I have the XPath for the first line (which should be similar for line 3) but am unable to work out the equivalent CSS selector.

    //div[@class='info']/strong[1]/following::text()

Please help. On a separate note, I have a few hundred HTML files and need to parse and extract data from them to store in a