I am fairly new to using R and very new to using the XML package and xpath. I need to pull four elements from an xml file that looks like this (except that I have trimmed off a lot of other xmlnodes to simplify it here):
<?xml version="1.0" encoding="utf-8"?>
<iati-activities version="1.03" generated-datetime="2015-07-07T16:49:09+00:00">
<iati-activity last-updated-datetime="2014-08-11T14:36:59+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100530</iati-identifier>
<title>Improvement of basic health care</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<participating-org role="Funding" ref="EU" type="15">EU</participating-org>
<participating-org role="Funding" type="21">Cordaid Memisa</participating-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="21">CORDAID RCA</participating-org>
<recipient-country percentage="100" code="CF">CENTRAL AFRICAN REPUBLIC</recipient-country>
<budget type="1">
<period-start iso-date="2010-01-01"></period-start>
<period-end iso-date="2013-02-28"></period-end>
</budget>
</iati-activity>
<iati-activity last-updated-datetime="2013-07-19T14:12:14+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100625</iati-identifier>
<title>Pigs for Pencils</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Funding" type="60">Stichting Kapatiran</participating-org>
<participating-org role="Implementing" type="22">PREDA Foundation Inc.</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
<period-start iso-date="2010-04-20"></period-start>
<period-end iso-date="2012-10-02"></period-end>
<value value-date="2010-04-20">12500</value>
</budget>
</iati-activity>
<iati-activity last-updated-datetime="2015-04-08T03:01:58+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100815</iati-identifier>
<title>Job and housing opportunities for women </title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="22">WISE</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
<period-start iso-date="2010-10-01"></period-start>
<period-end iso-date="2011-12-31"></period-end>
<value value-date="2010-10-01">227000</value>
</budget>
</iati-activity>
</iati-activities>
Also this is my first question ever on StackOverflow, so apologies if I'm not doing it correctly (and bc that xml is not perfectly aligned). The elements I need, and what I'm assigning them to are:
UniqueID <- "//iati-activity/iati-identifier"
GrantTitle <- "//iati-activity/title"
GrantAmount <- "//iati-activity/budget/value"
Recipient <- "//iati-activity/participatingorg[@role='Implementing']"
So far (after much trial and tribulation) I have come up with this code, that goes through the current node (x), pulling the 4 variables, and cbinding them into a row, then using xpathApply to loop through iati-activity nodes calling the function and rbinding the resulting rows together.
This code works when all four elements exist in each activity. However, note the absence of the budget/value node from the xml sample. This is because I removed it in order to solve this problem of missing nodes, which occur frequently in the full file for almost all the elements I need.
Also note the [1] at the end of my xpath expressions- I've included these because there are also multiple titles, multiple participating-orgs of all types, etc.
Given the multiples of some elements and the nonexistence of others, it makes it impossible to simple pull all the same elements into a vector and pop it into a data frame. Thus the need to loop through each activity pulling the elements with it. My code currently doesn't work to account for missing elements (the missing budget/value in the first iati-activity) because cbinding (and rbinding) ignore null vectors.
xmltestNA = xmlInternalTreeParse("XMLtoDF_TestNA.xml", useInternalNodes=TRUE)
bodyToDF <- function(x){
UniqueID <- xpathSApply(x, "./iati-identifier", xmlValue)
GrantTitle <- xpathSApply(x, "./title[1]", xmlValue)
GrantAmount <- xpathSApply(x, "./budget/value[1]", xmlValue)
Recipient <- xpathSApply(x, "./participating-org[@role='Implementing'][1]", xmlValue)
cbind(UniqueID=UniqueID, GrantTitle=GrantTitle, GrantAmount=GrantAmount, Recipient=Recipient)
}
res <-xpathApply(xmltestNA, '//iati-activity', fun=bodyToDF)
IatiNA <-do.call(rbind, res)
IatiNA
How can I keep the null values/missing nodes in order to turn it into a matrix or dataframe that looks like this:
UniqueID GrantTitle GrantAmount Recipient
1 NL-KVK-41160054-100530 Improvement of basic health care NA CORDAID RCA
2 NL-KVK-41160054-100625 Pigs for Pencils 12500 PREDA Foundation Inc.
3 NL-KVK-41160054-100815 Job and housing opportunities for women 227000 WISE
Because I'm still new, the simpler the code, the better. Thanks in advance!
If your xpath queries return too many or few results, I think it's easier to work with the nodes
doc <- xmlParse( '<your xml here>')
nodes<- getNodeSet(doc, "//iati-activity")
#Compare
xpathSApply(doc, "//budget/value", xmlValue)
xpathSApply(doc, "//participating-org[@role='Funding']", xmlValue)
sapply(nodes, function(x) xpathSApply(x, "./budget/value", xmlValue))
sapply(nodes, function(x) xpathSApply(x, "./participating-org[@role='Funding']", xmlValue))
Add a function to handle missing or multiple nodes and then create the data.frame
xpath2 <-function(x, path, fun = xmlValue, ...){
y <- xpathSApply(x, path, fun, ...)
ifelse(length(y) == 0, NA,
ifelse(length(y) > 1, paste(unlist(y), collapse=", "), y))
}
GrantAmount <- sapply(nodes, xpath2, "./budget/value")
UniqueID <- sapply(nodes, xpath2, "./iati-identifier")
GrantTitle <- sapply(nodes, xpath2, "./title")
Recipient <- sapply(nodes, xpath2, "./participating-org[@role='Implementing']")
## updated xpath2 so xmlGetAttr will also work
Funding_ref <- sapply(nodes, xpath2, "./participating-org[@role='Funding']", xmlGetAttr, "ref")
Budget_start <- sapply(nodes, xpath2, ".//period-start", xmlGetAttr, "iso-date")
data.frame(UniqueID, GrantTitle, GrantAmount, Recipient)
UniqueID GrantTitle GrantAmount Recipient
1 NL-KVK-41160054-100530 Improvement of basic health care <NA> CORDAID RCA
2 NL-KVK-41160054-100625 Pigs for Pencils 12500 PREDA Foundation Inc.
3 NL-KVK-41160054-100815 Job and housing opportunities for women 227000 WISE
Consider using an XSLT transformation file. As information, XSLT is a declarative special-purpose programming language used to transform, style, or re-format XML files in various structures. Sadly this forgotten language in the web age is still very useful for every day needs.
Specifically for you, you need a simpler structure for R to easily import. The one caveat is R does not have a dedicated, reliable XSLT library. Therefore, you need to find an external XSLT processor. Fortunately, several open-source solutions are available which R can leverage.
XSLT Stylesheet
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" omit-xml-declaration="no" />
<xsl:template match="iati-activities">
<GrantData>
<xsl:for-each select="iati-activity">
<iati-activity>
<UniqueID><xsl:value-of select="iati-identifier"/></UniqueID>
<GrantTitle><xsl:value-of select="title"/></GrantTitle>
<GrantAmount><xsl:value-of select="budget/value"/></GrantAmount>
<Recipient><xsl:value-of select="participating-org[@role='Implementing']"/></Recipient>
</iati-activity>
</xsl:for-each>
</GrantData>
</xsl:template>
</xsl:stylesheet>
Python (using its lxml library)
#!/usr/bin/python
import lxml.etree as ET
dom = ET.parse('C:\Path\To\RawXMLFile.xml')
xslt = ET.parse('C:\Path\To\RawXSLTFile.xsl')
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
print(tree_out)
xmlfile = open('C:\Path\To\OutputXMLFile.xml','ab')
xmlfile.write(tree_out)
xmlfile.close()
PHP (using its xsl extension)
This is for those who think PHP is not a general-purpose language but only restricted to the web:
<?php
// Load the XML source and XSLT file
$cd = dirname(__FILE__);
$xml = new DOMDocument;
$xml->load('C:\Path\To\RawXMLFile.xml');
$xsl = new DOMDocument;
$xsl->load('C:\Path\To\RawXSLTFile.xsl');
// Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules
// Transform XML source
$newXml = $proc->transformToXML($xml);
// Save output to file
$xmlfile = 'C:\Path\To\OutputXMLFile.xml.xml';
file_put_contents($xmlfile, $newXml);
?>
Both of which produces XML with a simpler structure for your R data frame needs (notice the empty node for GrantAmount).
<?xml version="1.0"?>
<GrantData>
<iati-activity>
<UniqueID>NL-KVK-41160054-100530</UniqueID>
<GrantTitle>Improvement of basic health care</GrantTitle>
<GrantAmount/>
<Recipient>CORDAID RCA</Recipient>
</iati-activity>
<iati-activity>
<UniqueID>NL-KVK-41160054-100625</UniqueID>
<GrantTitle>Pigs for Pencils</GrantTitle>
<GrantAmount>12500</GrantAmount>
<Recipient>PREDA Foundation Inc.</Recipient>
</iati-activity>
<iati-activity>
<UniqueID>NL-KVK-41160054-100815</UniqueID>
<GrantTitle>Job and housing opportunities for women </GrantTitle>
<GrantAmount>227000</GrantAmount>
<Recipient>WISE</Recipient>
</iati-activity>
</GrantData>
Once the simpler XML is generated, simply import into R:
library(XML)
# LOADING TRANSFORMED XML INTO R DATA FRAME
doc<-xmlParse("C:\\Path\\To\\OutputXMLFile.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//iati-activity"))
VBA (using the MSXML object)
Finally, there is a VBA solution to be run in MS Excel or MS Access. Fortunately, for this last solution you can connect the VBA with R using the R's RDCOMClient package (only a PC solution):
library(XML)
library(RDCOMClient)
xmlfile = COMCreate("MSXML2.DOMDocument")
xslfile = COMCreate("MSXML2.DOMDocument")
newxmlfile = COMCreate("MSXML2.DOMDocument")
xmlstr = "C:\\Path\\To\\RawXMLFile.xml"
xslstr = "C:\\Path\\To\\RawXSLTFile.xsl"
newxmlstr = "C:\\Path\\To\\OutputXMLFile.xml"
# LOADING XML & XSLT FILES
xmlfile.async = FALSE
xmlfile$Load(xmlstr)
xslfile.async = FALSE
xslfile$Load(xslstr)
# TRANSFORMING XML FILE USING XLST INTO NEW FILE
xmlfile$transformNodeToObject(xslfile, newxmlfile)
newxmlfile$Save(newxmlstr)
# LOADING TRANSFORMED XML INTO R DATA FRAME
doc<-xmlParse("C:\\Path\\To\\OutputXMLFile.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//iati-activity"))
View(xmldf)
xmlfile <- NULL
xslfile <- NULL
newxmlfile <- NULL
Data frame results with missing node as NaN:
UniqueID GrantTitle GrantAmount Recipient
1 NL-KVK-41160054-100530 Improvement of basic health care CORDAID RCA
2 NL-KVK-41160054-100625 Pigs for Pencils 12500 PREDA Foundation Inc.
3 NL-KVK-41160054-100815 Job and housing opportunities for women 227000 WISE
来源:https://stackoverflow.com/questions/31999486/r-xml-pulling-nodes-into-matrix-df-accounting-for-missing-nodes