Question
Currently I have ~20,000 XML files that range in size from a couple of KB to a few MB. Although it may not be ideal, I am using the "xmlTreeParse" function in the XML package to loop through each of the files, extract the text that I need, and save the result as a CSV file.
The code below works fine for files <1 MB in size:
library(XML)  # provides xmlTreeParse(), xmlRoot(), xmlValue()

files <- list.files()
for (i in files) {
  doc <- xmlTreeParse(i, useInternalNodes = TRUE)
  root <- xmlRoot(doc)
  name <- xmlValue(root[[8]][[1]][[1]])  # Name
  data <- xmlValue(root[[8]][[1]])       # Full text
  x <- data.frame(name = name)
  x$data <- data
  # paste0() avoids the space that paste() would insert before ".csv"
  write.csv(x, paste0(i, ".csv"), row.names = FALSE, na = "")
}
The trouble is that any file >1 MB gives me the following error:
Excessive depth in document: 256 use XML_PARSE_HUGE option
Extra content at the end of the document
Error: 1: Excessive depth in document: 256 use XML_PARSE_HUGE option
2: Extra content at the end of the document
Please forgive my ignorance; I have tried searching for "XML_PARSE_HUGE" in the XML package and can't seem to find it. Has anyone had any experience using this option? If so, I would greatly appreciate any advice on how to get this code to handle slightly larger XML files.
Thanks!
Answer 1:
To choose "XML_PARSE_HUGE" you need to specify it via the options argument. XML:::parserOptions lists the available choices:
> XML:::parserOptions
  RECOVER      NOENT    DTDLOAD    DTDATTR   DTDVALID    NOERROR  NOWARNING
        1          2          4          8         16         32         64
 PEDANTIC   NOBLANKS       SAX1   XINCLUDE      NONET     NODICT    NSCLEAN
      128        256        512       1024       2048       4096       8192
  NOCDATA NOXINCNODE    COMPACT      OLD10  NOBASEFIX       HUGE     OLDSAX
    16384      32768      65536     131072     262144     524288    1048576
For example:
> HUGE
[1] 524288
It is sufficient to pass a vector of integers containing any of these options. In your case:

xmlTreeParse(i, useInternalNodes = TRUE, options = HUGE)
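Putting it together, here is a minimal sketch of your original loop with the HUGE option applied. It is untested against your files and assumes the same root[[8]][[1]] document structure as in your question:

library(XML)

files <- list.files()
for (i in files) {
  # HUGE lifts libxml2's hard-coded parser limits (e.g. the depth-256 cap)
  doc <- xmlTreeParse(i, useInternalNodes = TRUE, options = HUGE)
  root <- xmlRoot(doc)
  name <- xmlValue(root[[8]][[1]][[1]])  # Name
  data <- xmlValue(root[[8]][[1]])       # Full text
  x <- data.frame(name = name)
  x$data <- data
  write.csv(x, paste0(i, ".csv"), row.names = FALSE, na = "")
}

And since options accepts a vector, several options can be combined if needed, e.g. options = c(HUGE, NOBLANKS).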
Source: https://stackoverflow.com/questions/17154308/parse-xml-files-1-megabyte-in-r