Question
The data sample contains words (orth) and categories (prop key="sense:ukb:unitsstr"). I'd like to extract each pair of orth and prop key="sense:ukb:unitsstr" as a row of a data frame. However, some words may not have any prop data, like the last two records. For those, I'd like to see NA.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
<chunk id="ch1" type="p">
<sentence id="s1">
<tok>
<orth>ktoś</orth>
<lex disamb="1"><base>ktoś</base><ctag>subst:sg:nom:m1</ctag></lex>
<prop key="polarity">0</prop>
<prop key="sense:ukb:syns_id">11511</prop>
<prop key="sense:ukb:syns_rank">11511/128.6156573170 243094/95.1234745165</prop>
<prop key="sense:ukb:unitsstr">ktoś.2(15:os)</prop>
</tok>
<tok>
<orth>go</orth>
<lex disamb="1"><base>go</base><ctag>subst:sg:nom:n</ctag></lex>
<prop key="polarity">0</prop>
<prop key="sense:ukb:syns_id">47620</prop>
<prop key="sense:ukb:syns_rank">47620/108.9010709884 234524/90.4766173102</prop>
<prop key="sense:ukb:unitsstr">go.1(2:czy)</prop>
</tok>
<tok>
<orth>krokodyl</orth>
<lex disamb="1"><base>krokodyl</base><ctag>subst:sg:nom:m2</ctag></lex>
<prop key="polarity">0</prop>
<prop key="sense:ukb:syns_id">12879</prop>
<prop key="sense:ukb:syns_rank">12879/40.5162836207 254796/35.9915058408 7063215/33.3657479890 7063214/26.6770712118 7063217/25.5775738130 7063213/23.6851347572 7063212/23.6300037076</prop>
<prop key="sense:ukb:unitsstr">krokodyl.1(21:zw) krokodyl_właściwy.1(21:zw)</prop>
</tok>
<tok>
<orth>się</orth>
<lex disamb="1"><base>się</base><ctag>qub</ctag></lex>
</tok>
<tok>
<orth>ja</orth>
<lex disamb="1"><base>ja</base><ctag>ppron12:sg:nom:m1:pri</ctag></lex>
</tok>
I assumed I could get this with a few XPath queries, but I got stuck:
library(XML)
doc = xmlTreeParse("statsUCZESTxfreqkeyword xml.txt", useInternalNodes = TRUE)
top = xmlRoot(doc)
xmlName(top)
names(top)
names( top[[ 1 ]] )
sent <- top[[ 1 ]] [[ "sentence" ]]
names(sent)
names(sent[[1]])
xmlSApply(sent[[1]], xmlValue)
xmlSApply(sent, function(x) xmlSApply(x, xmlValue))
nodes = getNodeSet(top, "//prop[@key='sense:ukb:unitsstr']")
lapply(nodes, function(x) xmlSApply(x, xmlValue)) # 152 words have prop
xmlSApply(sent, function(x) xmlSApply(x, xmlValue))
Answer 1:
Here is a solution using the xml2 library. I find the syntax of xml2 easier than that of the XML package. Both have their advantages and disadvantages.
The logic is similar to the answer I provided here: rvest: Return NAs for empty nodes given multiple listings. The code's comments explain each step. In the code below, xmltext is either the XML text or the filename of the XML file you would like to process.
library(xml2)
# read the XML document
page <- read_xml(xmltext)
# find every tok node; each one becomes a row
listings <- xml_find_all(page, ".//tok")
# find the text associated with the orth nodes
orthotext <- sapply(listings, function(x) {
  xml_text(xml_find_first(x, ".//orth"))
})
# find the text associated with prop key="sense:ukb:unitsstr"
ukb <- sapply(listings, function(x) {
  nodes <- xml_find_all(x, ".//prop")
  # keep only the node with the wanted key
  wantednode <- nodes[xml_attr(nodes, "key") == "sense:ukb:unitsstr"]
  # extract its text
  wantednode <- xml_text(wantednode)
  # return NA if the tok has no such prop
  if (length(wantednode) == 0) NA_character_ else wantednode[1]
})
# create the data frame
finalanswer <- data.frame(orthotext, ukb)
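A more compact variant is sketched here, relying on the documented xml2 behavior that xml_find_first() applied to a node set returns one result per input node, with a missing node (which xml_text() turns into NA) where nothing matches. The small sample is a shortened copy of the question's data with the closing tags added so that it parses. The names toks, orth and result are only illustrative.

```r
library(xml2)

# Shortened, well-formed sample modelled on the question's data
# (closing tags added here so that the snippet parses on its own).
xmltext <- '<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>go</orth>
    <prop key="polarity">0</prop>
    <prop key="sense:ukb:unitsstr">go.1(2:czy)</prop>
   </tok>
   <tok>
    <orth>ja</orth>
   </tok>
  </sentence>
 </chunk>
</chunkList>'

page <- read_xml(xmltext)
toks <- xml_find_all(page, ".//tok")

# One result per tok; a tok without a match gives a missing node, hence NA text.
orth <- xml_text(xml_find_first(toks, "./orth"))
ukb  <- xml_text(xml_find_first(toks, "./prop[@key='sense:ukb:unitsstr']"))

result <- data.frame(orth = orth, ukb = ukb, stringsAsFactors = FALSE)
print(result)
```

Putting the key test into the XPath predicate avoids the per-node sapply loop and the explicit check for an empty result. If a tok could hold several matching prop nodes, only the first one would be kept.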
Source: https://stackoverflow.com/questions/54574176/how-can-i-extract-xml-nodes-and-key-values-to-data-frame-in-r-studio-including