How to retrieve a very long XML-string from an SQL database with R?

Posted by ℡╲_俬逩灬. on 2019-12-01 05:06:30

Question


I have a script to get an XML file from an SQL database. Here is how I do this:

    library(RODBC)
    library(XML)

    myconn <- odbcConnect("mydsn")

    # Fetch the XML field as a character string rather than a factor
    query.text <- "SELECT xmlfield FROM db WHERE id = 12345"
    doc <- sqlQuery(myconn, query.text, stringsAsFactors=FALSE)
    # Convert the string to UTF-8, then parse it into an XML tree
    doc <- iconv(doc[1,1], from="latin1", to="UTF-8")
    doc <- xmlInternalTreeParse(doc, encoding="UTF-8")

However, parsing failed for one particular database row, although it worked when I copied the content of that field into a separate file and parsed the file. After two days of trial and error I identified the main problem: querying short XML documents this way works fine, but when I query larger ones, the string gets cut off after 65534 characters. The end of the XML document is therefore missing and it can't be parsed.
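A quick way to spot an affected row is to compare the size of the retrieved string with the suspected cut-off. The following is only a sketch: `possibly_truncated` and its `closing_tag` argument are hypothetical names that are not part of RODBC, and the default limit of 65534 is simply the length observed above.

```r
# Hypothetical helper (not part of RODBC): flags a string whose size in bytes
# reaches the suspected cut-off. If the expected closing tag of the document is
# known, a string is only flagged when it also fails to end with that tag.
possibly_truncated <- function(x, limit = 65534L, closing_tag = NULL) {
  at_limit <- nchar(x, type = "bytes") >= limit
  if (is.null(closing_tag)) {
    return(at_limit)
  }
  at_limit && !endsWith(trimws(x), closing_tag)
}
```

In the script above this could be called as `possibly_truncated(doc[1,1])` right after `sqlQuery`, before the string is converted and parsed.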

I thought this might be a general restriction on ODBC connections on my computer. However, another programme that also uses ODBC to get the same XML field from the same database does so without any problems. So I guess it's an R-specific problem.

Any ideas how to fix it?


Answer 1:


I've written to the package author and have finally received the following answer:

Your inability to read is not my problem, nor is it a reasonable excuse.

The manual says

'\item[Character types] Character types can be classified three ways: fixed or variable length, by the maximum size and by the character
set used. The most commonly used types\footnote{the SQL names for
these are \code{CHARACTER VARYING} and \code{CHARACTER}, but these
are too cumbersome for routine use.} are \code{varchar} for short
strings of variable length (up to some maximum) and \code{char} for
short strings of fixed length (usually right-padded with spaces).
The value of `short' differs by DBMS and is at least 254, often a
few thousand---often other types will be available for longer
character strings. There is a sanity check which will allow only
strings of up to 65535 bytes when reading: this can be removed by
recompiling \pkg{RODBC}.'

This manual can be found in the doc directory of the RODBC package. This information is not contained within the reference manual.

Since I have meanwhile found a good way to retrieve my data without using RODBC, I haven't tried recompiling the package. But I hope this answer will help others who run into the same issue.




Answer 2:


If you want to change the RODBC source and recompile it, that is fairly easy using GitHub and the devtools package:

  1. fork the repo here: https://github.com/cran/RODBC
  2. comment out the line (this one from the R-3.0.3 release): https://github.com/cran/RODBC/blob/R-3.0.3/src/RODBC.c#L734

            if (datalen > 65535) datalen = 65535;
    
  3. (re)install from github:

    devtools::install_github("<yourgithubname>/RODBC")
    

Now you should be able to read in large strings. Note, though, that you may get errors from trying to allocate too much memory. The line following the sanity check is:

    thisHandle->ColData[i].pData = Calloc(nRows * (datalen + 1), char);

Hence the simplest way to proceed is to set the argument rows_at_time = 1 in your sqlQuery call from R.
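To see why the number of rows matters, the allocation can be worked through by hand. This is a rough sketch only: `buffer_bytes` is a hypothetical helper that mirrors the arithmetic of the `Calloc` line, the 50 MiB column size is made up for illustration, and 100 rows per fetch is assumed to be the connection default.

```r
# Bytes requested by Calloc(nRows * (datalen + 1), char) for a single column.
# buffer_bytes is a hypothetical name used only for this illustration.
buffer_bytes <- function(n_rows, datalen) {
  n_rows * (datalen + 1)
}

# Assumption: the driver reports a maximum length of 50 MiB for the XML column.
datalen <- 50 * 1024^2

buffer_bytes(100, datalen)  # buffer size when 100 rows are fetched at a time
buffer_bytes(1, datalen)    # buffer size with rows_at_time = 1
```

With the sanity check removed, the buffer grows in proportion to the rows fetched per call, so a call such as `sqlQuery(myconn, query.text, stringsAsFactors=FALSE, rows_at_time = 1)` keeps it to a single row's worth.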

HTH



Source: https://stackoverflow.com/questions/13525539/how-to-retrieve-a-very-long-xml-string-from-an-sql-database-with-r
