How can I extract XML nodes and key values to a data.frame in RStudio, including NA values?

Question

The data sample contains words (orth) and categories (prop key="sense:ukb:unitsstr"). I'd like to extract each pair of orth and prop key="sense:ukb:unitsstr" as a row of a data frame. However, some words may not have any prop data, like the last two records; for those I'd like to see NA.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>ktoś</orth>
    <lex disamb="1"><base>ktoś</base><ctag>subst:sg:nom:m1</ctag></lex>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:syns_id">11511</prop>
    <prop key="sense:ukb:syns_rank">11511/128.6156573170 243094/95.1234745165</prop>
    <prop key="sense:ukb:unitsstr">ktoś.2(15:os)</prop>
   </tok>
   <tok>
    <orth>go</orth>
    <lex disamb="1"><base>go</base><ctag>subst:sg:nom:n</ctag></lex>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:syns_id">47620</prop>
    <prop key="sense:ukb:syns_rank">47620/108.9010709884 234524/90.4766173102</prop>
    <prop key="sense:ukb:unitsstr">go.1(2:czy)</prop>
   </tok>
   <tok>
    <orth>krokodyl</orth>
    <lex disamb="1"><base>krokodyl</base><ctag>subst:sg:nom:m2</ctag></lex>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:syns_id">12879</prop>
    <prop key="sense:ukb:syns_rank">12879/40.5162836207 254796/35.9915058408 7063215/33.3657479890 7063214/26.6770712118 7063217/25.5775738130 7063213/23.6851347572 7063212/23.6300037076</prop>
    <prop key="sense:ukb:unitsstr">krokodyl.1(21:zw) krokodyl_właściwy.1(21:zw)</prop>
   </tok>
   <tok>
    <orth>się</orth>
    <lex disamb="1"><base>się</base><ctag>qub</ctag></lex>
   </tok>
   <tok>
    <orth>ja</orth>
    <lex disamb="1"><base>ja</base><ctag>ppron12:sg:nom:m1:pri</ctag></lex>
   </tok>

I assumed I could get it with a few XPath queries, but I got stuck:

library(XML)

# parse the file into an internal (C-level) document so XPath can be used
doc = xmlTreeParse("statsUCZESTxfreqkeyword xml.txt", useInternalNodes = TRUE)
top = xmlRoot(doc)
xmlName(top)
names(top)
names(top[[1]])
sent <- top[[1]][["sentence"]]
names(sent)
names(sent[[1]])
xmlSApply(sent[[1]], xmlValue)
xmlSApply(sent, function(x) xmlSApply(x, xmlValue))
# only the tokens that actually have the wanted prop are returned here
nodes = getNodeSet(top, "//prop[@key='sense:ukb:unitsstr']")
lapply(nodes, function(x) xmlSApply(x, xmlValue)) # 152 words have prop

Answer 1:

Here is a solution using the xml2 package. I find the syntax of xml2 easier than that of the XML package, though both have their advantages and disadvantages.
The logic is similar to the answer I provided here: rvest: Return NAs for empty nodes given multiple listings. The idea is to find every parent tok node first and then look up the child nodes within each parent, so that a parent without the wanted child still produces a row. The code's comments explain each step. In the code below, xmltext is either the XML text or the filename of the XML file you would like to process.

library(xml2)

# read the XML (xmltext is either the XML text or a file name)
page <- read_xml(xmltext)
# find every tok node; each one becomes a row
listings <- xml_find_all(page, ".//tok")

# find the text associated with the orth nodes
orthotext <- sapply(listings, function(x) xml_text(xml_find_first(x, ".//orth")))

# find the text associated with prop key="sense:ukb:unitsstr"
ukb <- sapply(listings, function(x) {
  nodes <- xml_find_all(x, ".//prop")
  # keep the node with the wanted key
  wantednode <- nodes[xml_attr(nodes, "key") == "sense:ukb:unitsstr"]
  # extract text
  wantednode <- xml_text(wantednode)
  # return NA if no such node exists (xml_text() then gives character(0))
  if (length(wantednode) > 0) wantednode[1] else NA_character_
})

# create data frame
finalanswer <- data.frame(orthotext, ukb, stringsAsFactors = FALSE)
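
A more compact variant of the same parent-then-child idea is sketched below. It relies on xml_find_first() returning one result per input node (a missing node where nothing matches), which xml_text() then reports as NA. The inline sample_xml string and the column names orth and unitsstr are assumptions made for illustration; they stand in for the real file and are not part of the original data.

```r
library(xml2)

# Hypothetical inline sample standing in for the real file: two tokens,
# the second of which has no "sense:ukb:unitsstr" property.
sample_xml <- '<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>go</orth>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:unitsstr">go.1(2:czy)</prop>
   </tok>
   <tok>
    <orth>ja</orth>
   </tok>
  </sentence>
 </chunk>
</chunkList>'

page <- read_xml(sample_xml)
toks <- xml_find_all(page, ".//tok")

# xml_find_first() returns one result per tok; a tok without a match
# yields a missing node, and xml_text() turns that into NA.
result <- data.frame(
  orth = xml_text(xml_find_first(toks, "./orth")),
  unitsstr = xml_text(xml_find_first(toks, "./prop[@key='sense:ukb:unitsstr']")),
  stringsAsFactors = FALSE
)
print(result)
```

Selecting the key with an XPath predicate avoids the per-token sapply() loop; whether that is preferable to the explicit version above is mostly a matter of taste.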

Source: https://stackoverflow.com/questions/54574176/how-can-i-extract-xml-nodes-and-key-values-to-data-frame-in-r-studio-including

Tags

python

xml

extract