sax

Parsing of badly formatted HTML in PHP

て烟熏妆下的殇ゞ submitted on 2019-11-27 02:49:28
Question: In my code I convert a styled XLS document to HTML using OpenOffice. I then parse the tables using xml_parser_create. The problem is that OpenOffice creates old-school HTML with unclosed <BR> and <HR> tags; it doesn't create doctypes and doesn't quote attributes (<TABLE WIDTH=4>). The PHP parsers I know of don't like this and yield XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast. Do you know a (hopefully

Is there a SaxParser that reads json and fires events so it looks like xml

匆匆过客 submitted on 2019-11-27 02:40:37
Question: This would be great, as it would allow my XML code to read JSON without any change except for the different SAX parser. Answer 1: If you mean an event-based parser, there are a couple of projects out there that do this: http://code.google.com/p/json-simple/ (stoppable SAX-like interface for streaming input of JSON text; this project has moved to https://github.com/fangyidong/json-simple) and http://jackson.codehaus.org/Tutorial (the Jackson Streaming API is similar to the StAX API). This project has moved to https
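
To illustrate the idea behind the question (not the json-simple or Jackson APIs mentioned in the answer), here is a minimal Python sketch that walks a parsed JSON value and fires SAX-style events on a standard `xml.sax.ContentHandler`. The names `json_to_sax`, `parse_json`, `Recorder`, and the element names `"root"` and `"item"` are assumptions made for this example. It also loads the whole JSON text with `json.loads`, so it is not a streaming parser.

```python
import json
import xml.sax
from xml.sax.xmlreader import AttributesImpl


def json_to_sax(value, handler, name="root"):
    """Fire start/characters/end events for a JSON value (hypothetical mapping)."""
    handler.startElement(name, AttributesImpl({}))
    if isinstance(value, dict):
        # Each object key becomes a child element (keys are assumed to be valid XML names).
        for key, child in value.items():
            json_to_sax(child, handler, key)
    elif isinstance(value, list):
        # Array entries have no name in JSON, so an assumed "item" element is used.
        for child in value:
            json_to_sax(child, handler, "item")
    elif value is not None:
        # Strings are passed through; numbers and booleans use their JSON spelling.
        handler.characters(value if isinstance(value, str) else json.dumps(value))
    handler.endElement(name)


def parse_json(text, handler):
    """Drive an existing SAX ContentHandler from JSON text."""
    handler.startDocument()
    json_to_sax(json.loads(text), handler)
    handler.endDocument()


class Recorder(xml.sax.ContentHandler):
    """Example handler that just records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))


recorder = Recorder()
parse_json('{"a": 1, "b": ["x"]}', recorder)
```

Any handler written for XML input could be passed in place of `Recorder`, provided it does not depend on attributes or namespaces, which this mapping never produces.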

Cure for 'The string “--” is not permitted within comments.' exception?

*爱你&永不变心* submitted on 2019-11-27 02:34:36
Question: I'm using Java 6. I have this dependency in my pom ... <dependency> <groupId>xerces</groupId> <artifactId>xercesImpl</artifactId> <version>2.10.0</version> </dependency> I'm trying to parse an XHTML doc with this line: <!--[if gte mso 9]><xml> <w:WordDocument> <w:View>Normal</w:View> <w:Zoom>0</w:Zoom> <w:TrackMoves/> <w:TrackFormatting/> <w:PunctuationKerning/> <w:ValidateAgainstSchemas/> <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid> <w:IgnoreMixedContent>false</w:IgnoreMixedContent> <w

Storing specific XML node values with R's xmlEventParse

无人久伴 submitted on 2019-11-27 02:24:53
Question: I have a big XML file which I need to parse with xmlEventParse in R. Unfortunately, the online examples are more complex than I need; I just want to flag a matching node tag and store the matched node text (not the attributes), each text in a separate list. See the comments in the code below: library(XML) z <- xmlEventParse( "my.xml", handlers = list( startDocument = function() { cat("Starting document\n") }, startElement = function(name,attr) { if ( name == "myNodeToMatch1" ){ cat("FLAG Matched

How to Parse Big (50 GB) XML Files in Java

吃可爱长大的小学妹 submitted on 2019-11-27 00:30:45
Question: Currently I'm trying to use a SAX parser, but about 3/4 of the way through the file it completely freezes up. I have tried allocating more memory etc., but I'm not getting any improvement. Is there any way to speed this up? A better method? I stripped it down to the bare bones, so I now have the following code, and when running it from the command line it still doesn't go as fast as I would like. Running it with "java -Xms-4096m -Xmx8192m -jar reader.jar" I get a "GC overhead limit exceeded" error around article 700000. Main: public
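
A "GC overhead limit exceeded" error during SAX parsing usually suggests that the handler keeps accumulating data as it goes, since the parser itself holds very little state. As a rough sketch of the streaming pattern (in Python's `xml.sax` rather than the asker's Java code, which is not shown in full), the handler below buffers text for one element at a time, hands it to a callback, and then discards it. The element name `article` and the names `ArticleHandler` and `article_lengths` are assumptions made for this example.

```python
import io
import xml.sax


class ArticleHandler(xml.sax.ContentHandler):
    """Buffers the text of one <article> at a time, then releases it."""

    def __init__(self, on_article):
        super().__init__()
        self.on_article = on_article  # callback invoked once per article
        self._buf = None              # None means "not inside an article"

    def startElement(self, name, attrs):
        if name == "article":         # assumed element name
            self._buf = []

    def characters(self, content):
        # characters() may be called several times per text node, so append chunks.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "article":
            self.on_article("".join(self._buf))
            self._buf = None          # drop the buffer so memory stays bounded


def article_lengths(stream):
    """Return the text length of each article without keeping the text itself."""
    lengths = []
    xml.sax.parse(stream, ArticleHandler(lambda text: lengths.append(len(text))))
    return lengths


sample = io.BytesIO(b"<root><article>ab</article><article>cde</article></root>")
result = article_lengths(sample)
```

For a real file, a filename or an open binary file object can be passed to `xml.sax.parse` instead of the in-memory sample.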

ElementTree iterparse strategy

杀马特。学长 韩版系。学妹 submitted on 2019-11-27 00:18:12
Question: I have to handle XML documents that are fairly big (up to 1 GB) and parse them with Python. I am using the iterparse() function (SAX-style parsing). My concern is the following: imagine you have an XML document like this: <?xml version="1.0" encoding="UTF-8" ?> <families> <family> <name>Simpson</name> <members> <name>Homer</name> <name>Marge</name> <name>Bart</name> </members> </family> <family> <name>Griffin</name> <members> <name>Peter</name> <name>Brian</name> <name>Meg</name> </members> </family> <
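
One common way to use `iterparse()` on a document of this shape is to react only to the `end` event of each `<family>` element, when its whole subtree is available, so that the family's `<name>` can be told apart from the members' `<name>` elements by path. The sketch below assumes a small made-up document modelled on the excerpt above (the original is truncated), and `iter_families` is a name chosen for this example.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample shaped like the document in the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members><name>Homer</name><name>Marge</name><name>Bart</name></members>
  </family>
  <family>
    <name>Griffin</name>
    <members><name>Peter</name><name>Brian</name><name>Meg</name></members>
  </family>
</families>"""


def iter_families(source):
    """Yield (family_name, [member_names]) one family at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "family":
            # Direct child <name> is the family name; members/name are the members.
            family_name = elem.findtext("name")
            members = [m.text for m in elem.findall("members/name")]
            # Free the subtree that was just processed before moving on.
            elem.clear()
            yield family_name, members


families = list(iter_families(io.BytesIO(SAMPLE)))
```

Note that `elem.clear()` empties each processed `<family>`, but the root element still holds a reference to the emptied element, so very large files may need additional cleanup of the root's children.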

Is there any XPath processor for SAX model?

跟風遠走 submitted on 2019-11-26 23:55:57
I'm looking for an XPath evaluator that doesn't build the whole DOM document to look for nodes: the objective is to handle a large amount of XML data (ideally over 2 GB) with the SAX model, which is very good for memory management, while still being able to search for nodes. Thank you all for the support! For all those who say it's not possible: after asking the question, I recently found a project named "saxpath" ( http://www.saxpath.org/ ), but I can't find any project that implements it. koppor: My current list (compiled from web search results and the other answers) is:

What ever happened to XPathReader

ぐ巨炮叔叔 submitted on 2019-11-26 23:13:19
Question: XPathReader is/was an implementation of a forward-reading XML parser (built on XMLReader) which allowed you to register XPath queries for it to find (or at least a subset of XPath called Sequential XPath). This seems to be the perfect choice for easy access to elements of XML streams, or for cases where you just need to pull some information out of the start of a large XML document and therefore don't want to load the whole thing into memory. There seemed to be a flurry of excitement about the

SAX parser: Ignoring special characters

让人想犯罪 __ submitted on 2019-11-26 22:04:56
Question: I'm using Xerces to parse my XML document. The issue is that XML-escaped characters like ' ' appear in the characters() method as unescaped ones. I need to get the escaped characters inside the characters() method as is. Thanks. UPD: I tried to override the resolveEntity method in my DefaultHandler's descendant. I can see from debugging that it is set as the entity resolver on the XML reader, but the code in the overridden method is not invoked. Answer 1: I think your solution is not too bad: a few lines of code to do exactly what you
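
A SAX parser is expected to resolve predefined entities before calling `characters()`, so one common workaround is to re-escape the text inside the handler. The sketch below shows that approach with Python's `xml.sax` rather than Xerces; it is not necessarily the solution the truncated answer refers to, and `EscapingHandler` and `escaped_text` are names assumed for this example. It cannot recover the original spelling of a reference (for example `&amp;` versus `&#38;`); it only produces a normalized escaped form.

```python
import xml.sax
from xml.sax.saxutils import escape


class EscapingHandler(xml.sax.ContentHandler):
    """Collects character data, re-escaping &, < and > as it arrives."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser delivers already-unescaped text; escape() turns it back
        # into its XML-safe form.
        self.chunks.append(escape(content))


def escaped_text(xml_bytes):
    """Return all character data of the document in escaped form."""
    handler = EscapingHandler()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


text = escaped_text(b"<a>x &amp; y &lt; z</a>")
```

Here `text` holds `x &amp; y &lt; z`, whereas a handler without the `escape()` call would collect `x & y < z`.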

Is XPath much more efficient as compared to DOM and SAX?

爷,独闯天下 submitted on 2019-11-26 22:00:21
Question: I need to parse an XML string and find the values of specific text nodes, attribute values, etc. I'm doing this in JavaScript and was using the DOMParser class for this. Later I was told that DOM takes up a lot of memory and that SAX is a better option. Recently I found that XPath also provides a simple way to find nodes. But I'm not sure which of these three would be the most efficient way to parse XML. Kindly help. Answer 1: SAX is a top-down parser and allows serial access to an XML document