sax | 易学教程

Parsing of badly formatted HTML in PHP

Question: In my code I convert a styled XLS document to HTML using OpenOffice. I then parse the tables using xml_parser_create. The problem is that OpenOffice creates old-school HTML with unclosed <BR> and <HR> tags; it doesn't create doctypes and doesn't quote attributes (<TABLE WIDTH=4>). The PHP parsers I know of don't like this and yield XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast. Do you know a (hopefully

Is there a SaxParser that reads json and fires events so it looks like xml

Question: This would be great, as it would allow my XML code to read JSON without any change except for the different SAX parser. Answer 1: If you mean an event-based parser, then there are a couple of projects out there that do this: http://code.google.com/p/json-simple/ Stoppable SAX-like interface for streaming input of JSON text. This project has moved to https://github.com/fangyidong/json-simple http://jackson.codehaus.org/Tutorial The Jackson Streaming API is similar to the StAX API. This project has moved to https
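
The adapter idea behind the question can be illustrated with a small sketch. It is written in Python rather than Java, and it is not a streaming parser: it assumes the JSON fits in memory, loads it with json.loads, and then fires SAX-style events at an ordinary ContentHandler. The names fire_json_events and Recorder, the default element name "root", and the sample data are all hypothetical, and the mapping of arrays to repeated elements is just one possible convention.

```python
import json
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


def fire_json_events(value, handler, name="root"):
    """Walk parsed JSON and fire SAX-style events (hypothetical helper).

    Assumptions: keys are valid XML element names, and the top-level
    value is not a list (a list would produce several root elements).
    """
    if isinstance(value, list):
        # An array becomes repeated elements that share the same name.
        for item in value:
            fire_json_events(item, handler, name)
        return
    handler.startElement(name, AttributesImpl({}))
    if isinstance(value, dict):
        for key, child in value.items():
            fire_json_events(child, handler, key)
    elif value is not None:
        handler.characters(str(value))
    handler.endElement(name)


class Recorder(ContentHandler):
    """Stand-in for an existing XML handler; it only records the events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


recorder = Recorder()
data = json.loads('{"name": "Simpson", "members": ["Homer", "Marge"]}')
fire_json_events(data, recorder, "family")
for event in recorder.events:
    print(event)
```

Because the handler only ever sees startElement, characters, and endElement calls, an existing SAX handler could in principle be reused unchanged, which is the property the question asks for.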

Cure for 'The string “--” is not permitted within comments.' exception?

Question: I'm using Java 6. I have this dependency in my pom: <dependency> <groupId>xerces</groupId> <artifactId>xercesImpl</artifactId> <version>2.10.0</version> </dependency> I'm trying to parse an XHTML doc with this line: <!--[if gte mso 9]><xml> <w:WordDocument> <w:View>Normal</w:View> <w:Zoom>0</w:Zoom> <w:TrackMoves/> <w:TrackFormatting/> <w:PunctuationKerning/> <w:ValidateAgainstSchemas/> <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid> <w:IgnoreMixedContent>false</w:IgnoreMixedContent> <w

Storing specific XML node values with R's xmlEventParse

Question: I have a big XML file which I need to parse with xmlEventParse in R. Unfortunately, the online examples are more complex than I need; I just want to flag a matching node tag and store the matched node's text (not its attributes), with each text in a separate list. See the comments in the code below: library(XML) z <- xmlEventParse( "my.xml", handlers = list( startDocument = function() { cat("Starting document\n") }, startElement = function(name,attr) { if ( name == "myNodeToMatch1" ){ cat("FLAG Matched
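
The flag-and-store pattern the question describes can be sketched as follows. This is a Python xml.sax analogue rather than R code, offered only to show the shape of the logic: raise a flag in the start-element handler, buffer text while the flag is up, and store the text in a per-tag list in the end-element handler. The class name NodeTextCollector, the tag myNodeToMatch2, and the sample document are assumptions made for illustration.

```python
import xml.sax


class NodeTextCollector(xml.sax.ContentHandler):
    """Collect the text of selected elements, one list per tag name."""

    def __init__(self, wanted):
        super().__init__()
        # One result list per tag we care about.
        self.results = {name: [] for name in wanted}
        self._current = None  # tag currently being captured, if any
        self._buffer = []     # text chunks seen inside that tag

    def startElement(self, name, attrs):
        if name in self.results:
            self._current = name  # raise the flag
            self._buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks.
        if self._current is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._current:
            self.results[name].append("".join(self._buffer))
            self._current = None  # lower the flag


# Hypothetical sample input; the real file in the question is "my.xml".
sample = b"""<root>
  <myNodeToMatch1>alpha</myNodeToMatch1>
  <other>ignored</other>
  <myNodeToMatch2>beta</myNodeToMatch2>
  <myNodeToMatch1>gamma</myNodeToMatch1>
</root>"""

handler = NodeTextCollector(["myNodeToMatch1", "myNodeToMatch2"])
xml.sax.parseString(sample, handler)
print(handler.results)
```

Buffering in characters and joining in endElement matters because event parsers are allowed to split one text node across several callbacks.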

How to Parse Big (50 GB) XML Files in Java

Question: Currently I'm trying to use a SAX parser, but about 3/4 of the way through the file it completely freezes up. I have tried allocating more memory, etc., but I'm not getting any improvement. Is there any way to speed this up? A better method? I stripped it to the bare bones, so I now have the following code, and when running it from the command line it still doesn't go as fast as I would like. Running it with "java -Xms-4096m -Xmx8192m -jar reader.jar" I get a "GC overhead limit exceeded" error around article 700000. Main: public

ElementTree iterparse strategy

Question: I have to handle XML documents that are fairly big (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing). My concern is the following: imagine you have an XML document like this: <?xml version="1.0" encoding="UTF-8" ?> <families> <family> <name>Simpson</name> <members> <name>Homer</name> <name>Marge</name> <name>Bart</name> </members> </family> <family> <name>Griffin</name> <members> <name>Peter</name> <name>Brian</name> <name>Meg</name> </members> </family> <
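
One possible strategy for a document shaped like the one above is sketched here. The excerpt is cut off, so the closing </families> tag in the sample is an assumption, as are the function name read_families and the choice to return a dictionary. The idea is to wait for the "end" event of each <family>, read the family name and the member names by their position in the subtree (which tells the two kinds of <name> apart), and then clear the element so its children can be freed.

```python
import io
import xml.etree.ElementTree as ET

# Sample data following the excerpt; the closing tag is assumed.
xml_data = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
      <name>Homer</name>
      <name>Marge</name>
      <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
      <name>Peter</name>
      <name>Brian</name>
      <name>Meg</name>
    </members>
  </family>
</families>"""


def read_families(source):
    """Map each family name to its member names (hypothetical helper)."""
    families = {}
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "family":
            # A direct child <name> is the family name ...
            surname = elem.findtext("name")
            # ... while <members>/<name> entries are the members.
            members = [m.text for m in elem.findall("members/name")]
            families[surname] = members
            # Drop the subtree we just processed. The root still keeps a
            # reference to the emptied <family> element itself.
            elem.clear()
    return families


result = read_families(io.BytesIO(xml_data))
print(result)
```

For a real file, the path or an open file object would be passed instead of the io.BytesIO wrapper used here.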

Is there any XPath processor for SAX model?

Question: I'm looking for an XPath evaluator that doesn't build the whole DOM document in order to look up nodes. The goal is to manage a large amount of XML data (ideally over 2 GB) with the SAX model, which is very good for memory management, while still being able to search for nodes. Thank you all for the support! For all those who say it's not possible: recently, after asking the question, I found a project named "saxpath" ( http://www.saxpath.org/ ), but I can't find any project implementing it. koppor: My current list (compiled from web search results and the other answers) is:
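
To make the idea of path matching over SAX events concrete, here is a minimal Python sketch. It is not an XPath processor and is not taken from any of the projects mentioned: it only matches absolute paths made of plain child steps, such as /library/book/title, by keeping a stack of open element names. The class name PathMatcher and the sample document are hypothetical.

```python
import xml.sax


class PathMatcher(xml.sax.ContentHandler):
    """Collect the text of elements whose full path equals a fixed path.

    Only absolute, child-axis paths are supported: no predicates,
    wildcards, or '//' steps.
    """

    def __init__(self, path):
        super().__init__()
        self.target = path.strip("/").split("/")
        self.stack = []      # names of the currently open elements
        self.matches = []
        self._buffer = None  # text chunks while inside a matching element

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.target:
            self._buffer = []

    def characters(self, content):
        # Text of descendants is included, like an XPath string value.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if self.stack == self.target and self._buffer is not None:
            self.matches.append("".join(self._buffer))
            self._buffer = None
        self.stack.pop()


sample = (
    b"<library>"
    b"<book><title>A</title></book>"
    b"<book><title>B</title><note><title>skip</title></note></book>"
    b"</library>"
)

handler = PathMatcher("/library/book/title")
xml.sax.parseString(sample, handler)
print(handler.matches)
```

Memory use depends on the nesting depth and on the size of the matched text, not on the size of the document, which is the property the question is after.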

What ever happened to XPathReader

Question: XPathReader is/was an implementation of a forward-reading XML parser (built on XMLReader) which allowed you to register XPath queries for it to find (or at least a subset of XPath called Sequential XPath). This seems to be the perfect choice for easy access to elements of XML streams, or for cases where you just need to pull some information out of the start of a large XML document and therefore don't want to load the whole thing into memory. There seemed to be a flurry of excitement about the

SAX parser: Ignoring special characters

Question: I'm using Xerces to parse my XML document. The issue is that XML-escaped characters like ' ' appear in the characters() method as unescaped ones. I need to get the escaped characters inside the characters() method as is. Thanks. UPD: I tried to override the resolveEntity method in my DefaultHandler's descendant. I can see from the debugger that it's set as the entity resolver on the XML reader, but the code from the overridden method is not invoked. Answer 1: I think your solution is not too bad: a few lines of code to do exactly what you
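
One way to get escaped text back inside the character callback is to re-escape it there. The sketch below shows this in Python's xml.sax rather than Xerces, and it is only an illustration of the approach, not necessarily what the truncated answer goes on to recommend. The class name EscapingHandler and the sample input are assumptions. Note the limitation: after parsing, a literal apostrophe and one written as &apos; are indistinguishable, so both come out escaped.

```python
import xml.sax
from xml.sax.saxutils import escape


class EscapingHandler(xml.sax.ContentHandler):
    """Re-escape character data that the parser has already unescaped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # escape() handles &, < and > by default; the extra mapping
        # adds the two quote characters.
        self.chunks.append(escape(content, {"'": "&apos;", '"': "&quot;"}))


handler = EscapingHandler()
xml.sax.parseString(
    b"<msg>Tom &amp; Jerry&apos;s &lt;show&gt;</msg>", handler
)
text = "".join(handler.chunks)
print(text)
```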

Is XPath much more efficient as compared to DOM and SAX?

Question: I need to parse an XML string and find the values of specific text nodes, attribute values, etc. I'm doing this in JavaScript and was using the DOMParser class for this. Later I was told that DOM takes up a lot of memory and that SAX is a better option. Recently I found that XPath also provides a simple way to find nodes. But I'm not sure which of these three would be the most efficient way to parse XML. Kindly help. Answer 1: SAX is a top-down parser and allows serial access to an XML document