Question
I'd like to use R to extract the value(s) of the description element from a .kml file.
Here is the file:
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2"
xmlns:gx="http://www.google.com/kml/ext/2.2"
xmlns:atom="http://www.w3.org/2005/Atom">
<Document>
<open>1</open>
<visibility>1</visibility>
<name><![CDATA[2013-07-06 4:18pm]]></name>
...
<Placemark>
<name><![CDATA[2013-07-06 4:18pm (Start)]]></name>
<description><![CDATA[]]></description>
<TimeStamp><when>2013-07-06T20:18:56.000Z</when></TimeStamp>
<styleUrl>#start</styleUrl>
<Point>
<coordinates>-78.353348,45.020615,340.29998779296875</coordinates>
</Point>
</Placemark>
<Placemark id="tour">
<name><![CDATA[2013-07-06 4:18pm]]></name>
<description><![CDATA[]]></description>
...
<gx:Track>
<when>2013-07-06T20:18:56.000Z</when>
<gx:coord>-78.353348 45.020615 340.29998779296875</gx:coord>
<when>2013-07-06T20:19:12.000Z</when>
<gx:coord>-78.353315 45.020644 340.29998779296875</gx:coord>
<when>2013-07-06T22:12:23.000Z</when>
<gx:coord>-78.353108 45.020736 342.29998779296875</gx:coord>
<ExtendedData>
...
<Placemark>
<name><![CDATA[2013-07-06 4:18pm (End)]]></name>
<description><![CDATA[Created by Google My Tracks on Android.
Name: 2013-07-06 4:18pm
Activity type: cycling
Description: -
Total distance: 49.62 km (30.8 mi)
Total time: 1:53:28
Moving time: 1:50:17
Average speed: 26.24 km/h (16.3 mi/h)
Average moving speed: 27.00 km/h (16.8 mi/h)
Max speed: 61.20 km/h (38.0 mi/h)
Average pace: 2.29 min/km (3.7 min/mi)
Average moving pace: 2.22 min/km (3.6 min/mi)
Fastest pace: 0.98 min/km (1.6 min/mi)
Max elevation: 406 m (1333 ft)
Min elevation: 265 m (868 ft)
Elevation gain: 690 m (2263 ft)
Max grade: 12 %
Min grade: -11 %
Recorded: 2013-07-06 4:18pm
]]></description>
...
</Placemark>
</Document>
</kml>
And here is what I want to extract: the text contained in
<description><![CDATA[Created by Google My Tracks on Android.: ]]></description>
i.e.:
Name: 2013-07-06 4:18pm
Activity type: cycling
Description: -
Total distance: 49.62 km (30.8 mi)
Total time: 1:53:28
Moving time: 1:50:17
Average speed: 26.24 km/h (16.3 mi/h)
Average moving speed: 27.00 km/h (16.8 mi/h)
Max speed: 61.20 km/h (38.0 mi/h)
Average pace: 2.29 min/km (3.7 min/mi)
Average moving pace: 2.22 min/km (3.6 min/mi)
Fastest pace: 0.98 min/km (1.6 min/mi)
Max elevation: 406 m (1333 ft)
Min elevation: 265 m (868 ft)
Elevation gain: 690 m (2263 ft)
Max grade: 12 %
Min grade: -11 %
Recorded: 2013-07-06 4:18pm
xmlToList returns NULL, I think because the CDATA tag means that the content following it is not processed by the parser:
xml <- xmlTreeParse("test1.kml", useInternalNodes=TRUE)
xmllist <- xmlToList(xml)
xmllist$Document$Placemark$description
[[1]]
NULL
I think that is what this means: "The term CDATA is used about text data that should not be parsed by the XML parser ... Everything inside a CDATA section is ignored by the parser. A CDATA section starts with "<![CDATA[" and ends with "]]>"."
The following will not work for me either, perhaps for the same reason related to CDATA:
z1 <- xpathApply(xml, "//description", xmlValue)
z1
list()
Can anyone help me extract the text in the file?
Here is a link to the file: https://docs.google.com/file/d/0B__iOdFGJbXYOHJGbWJVNW0tS3M/edit?usp=sharing
Answer 1:
library(XML)  # provides xmlTreeParse(), xmlRoot() and xmlValue()
doc <- xmlTreeParse("test1.kml", useInternalNodes = TRUE)
root <- xmlRoot(doc)
# xmlValue() returns the text of a node, including text wrapped in CDATA
xmlValue(root[["Document"]][["name"]])
# [1] "2013-07-06 4:18pm"
Also, xmlToDataFrame(root) and xmlToDataFrame(doc) return that value in the name column. Using xmlToList on either root or doc returns NULL for the value of any CDATA. I'm looking at the name node because your example, when copied and pasted, doesn't parse with xmlParse. From my own small tests, it looks like xmlValue should work on any CDATA.
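A note on the empty list() shown in the question: it is most likely caused by the default namespace declared on the <kml> element rather than by CDATA. An unprefixed XPath step such as //description only matches elements that are in no namespace. The sketch below is a minimal illustration, assuming a trimmed-down, hypothetical stand-in for the file (kml_text) and an arbitrary prefix k; neither name comes from the original post.

```r
library(XML)

# Hypothetical, shortened stand-in for the KML file in the question.
kml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<name><![CDATA[2013-07-06 4:18pm]]></name>
<Placemark>
<name><![CDATA[2013-07-06 4:18pm (End)]]></name>
<description><![CDATA[Created by Google My Tracks on Android.

Name: 2013-07-06 4:18pm
Activity type: cycling
]]></description>
</Placemark>
</Document>
</kml>'

doc <- xmlParse(kml_text, asText = TRUE)

# No prefix: XPath looks for <description> in no namespace, so nothing matches.
no_ns <- xpathSApply(doc, "//description", xmlValue)

# Bind an arbitrary prefix ("k") to the KML namespace and use it in the path.
kml_ns <- c(k = "http://www.opengis.net/kml/2.2")
with_ns <- xpathSApply(doc, "//k:description", xmlValue, namespaces = kml_ns)

length(no_ns)   # zero matches without the namespace
cat(with_ns)    # the CDATA text of the description
```

With the namespace supplied, xmlValue reads the CDATA content in the same way as the xmlRoot-based indexing above.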
Answer 2:
Jake Burkhead answered this in the comments, and his solution works; I am most grateful. Here is how the text is extracted from the .kml file:
> xml1 <- xmlTreeParse("2013-07-06 4-18pm.kml", useInternalNodes=TRUE)
> root <-xmlRoot(xml1)
> names(root[["Document"]])
open visibility name author Style Style Style Style
"open" "visibility" "name" "author" "Style" "Style" "Style" "Style"
Style Schema Placemark Placemark Placemark
"Style" "Schema" "Placemark" "Placemark" "Placemark"
> # note that I want the text in the third "Placemark" which is in position [13] so:
> xmlValue(root[["Document"]][[13]][["description"]])
[1] "Created by Google My Tracks on Android.\n\nName: 2013-07-06 4:18pm\nActivity type: cycling\nDescription: -\nTotal distance: 49.62 km (30.8 mi)\nTotal time: 1:53:28\nMoving time: 1:50:17\nAverage speed: 26.24 km/h (16.3 mi/h)\nAverage moving speed: 27.00 km/h (16.8 mi/h)\nMax speed: 61.20 km/h (38.0 mi/h)\nAverage pace: 2.29 min/km (3.7 min/mi)\nAverage moving pace: 2.22 min/km (3.6 min/mi)\nFastest pace: 0.98 min/km (1.6 min/mi)\nMax elevation: 406 m (1333 ft)\nMin elevation: 265 m (868 ft)\nElevation gain: 690 m (2263 ft)\nMax grade: 12 %\nMin grade: -11 %\nRecorded: 2013-07-06 4:18pm\n"
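Indexing by position ([[13]]) depends on how many Style and Schema nodes precede the Placemark, so it may break on a different export. A possible alternative, sketched here under the assumption that the summary always sits in the Placemark whose name contains "(End)", is to select the node by name with a namespace-aware XPath. The names kml_text, kml_ns and end_desc are hypothetical, and the inline document is a shortened stand-in for the real file.

```r
library(XML)

# Hypothetical, shortened stand-in for the KML file in the question.
kml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
<Placemark>
<name><![CDATA[2013-07-06 4:18pm (Start)]]></name>
<description><![CDATA[]]></description>
</Placemark>
<Placemark>
<name><![CDATA[2013-07-06 4:18pm (End)]]></name>
<description><![CDATA[Created by Google My Tracks on Android.

Name: 2013-07-06 4:18pm
Total time: 1:53:28
]]></description>
</Placemark>
</Document>
</kml>'

doc <- xmlParse(kml_text, asText = TRUE)
kml_ns <- c(k = "http://www.opengis.net/kml/2.2")

# Select the <description> of the Placemark whose <name> contains "(End)",
# instead of relying on its position among the children of <Document>.
end_desc <- xpathSApply(
  doc,
  "//k:Placemark[contains(k:name, '(End)')]/k:description",
  xmlValue,
  namespaces = kml_ns
)

cat(end_desc)
```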
I have accepted the answer but thought I would put the complete solution here in case it helps others.
Many thanks for your persistence, Jake. Thanks also to Ricardo and agstudy.
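Once the description is available as a single string, it may be convenient to split it into named fields. The following is only a sketch in base R, assuming every field line has the form "key: value" as in the output above; desc is a shortened, hand-copied version of that string, and fields, keys and values are illustrative names.

```r
# Shortened copy of the description string returned by xmlValue().
desc <- paste0(
  "Created by Google My Tracks on Android.\n\n",
  "Name: 2013-07-06 4:18pm\n",
  "Activity type: cycling\n",
  "Total distance: 49.62 km (30.8 mi)\n",
  "Total time: 1:53:28\n",
  "Max grade: 12 %\n"
)

lines <- strsplit(desc, "\n", fixed = TRUE)[[1]]

# Keep only "key: value" lines; this drops the banner line and the blank line.
lines <- lines[grepl(": ", lines, fixed = TRUE)]

# Split at the first ": " only, so values such as "1:53:28" stay intact.
keys   <- sub(": .*$", "", lines)
values <- sub("^[^:]*: ", "", lines)

fields <- setNames(values, keys)
fields[["Total time"]]
```

A named character vector keeps the lookup simple; converting values such as the distances to numbers would need further, format-specific parsing.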
Source: https://stackoverflow.com/questions/17540262/extract-cdata-tagged-values-from-kml-in-r