R: convert XML data to data frame

醉话见心 2020-12-08 04:52

For a homework assignment I am attempting to convert an XML file into a data frame in R. I have tried many different things, and I have searched for ideas on the internet bu

4 answers
  •  自闭症患者
    2020-12-08 05:20

    Great answers above! For future readers: whenever you face a complex XML document that needs importing into R, consider restructuring it with XSLT (a special-purpose declarative language that transforms XML content for various end uses). Then simply use the xmlToDataFrame() function from R's XML package.

    Unfortunately, R does not have a dedicated XSLT package on CRAN that works across all operating systems. The listed SXLT appears to be a Linux-only package that cannot be used on Windows. See the unanswered SO questions here and here. I understand @hrbrmstr (above) maintains a GitHub XSLT project. Nonetheless, nearly all general-purpose languages have XSLT processors, including Java, C#, Python, PHP, Perl, and VB.

    Below is the open-source Python route. Because the XML document is pretty nuanced, two XSLTs are used (XSLT gurus can of course combine them into one, but try as I might, I couldn't get that to work).

    FIRST XSLT (using a recursive template)

    (stylesheet markup not preserved; the Python script below loads it as Olive.xsl)

    SECOND XSLT

    (stylesheet markup not preserved; the Python script below loads it as Olive2.xsl)

    Python (using lxml module)

    import os

    import lxml.etree as ET

    # Directory holding this script, both stylesheets, and the output file
    cd = os.path.dirname(os.path.abspath(__file__))

    # FIRST TRANSFORMATION: source XML -> intermediate XML
    dom = ET.parse('http://www.ggobi.org/book/data/olive.xml')
    xslt = ET.parse(os.path.join(cd, 'Olive.xsl'))
    transform = ET.XSLT(xslt)
    newdom = transform(dom)

    tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)

    with open(os.path.join(cd, 'Olive_py.xml'), 'wb') as xmlfile:
        xmlfile.write(tree_out)

    # SECOND TRANSFORMATION: intermediate XML -> final XML (overwrites Olive_py.xml)
    dom = ET.parse(os.path.join(cd, 'Olive_py.xml'))
    xslt = ET.parse(os.path.join(cd, 'Olive2.xsl'))
    transform = ET.XSLT(xslt)
    newdom = transform(dom)

    tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)

    with open(os.path.join(cd, 'Olive_py.xml'), 'wb') as xmlfile:
        xmlfile.write(tree_out)
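
To make the general idea concrete on a much smaller input, here is a minimal, self-contained sketch of flattening nested XML into one record per row with lxml. The input document and the stylesheet are hypothetical and invented for illustration; they are not the olive.xml structure or the two stylesheets used in this answer.

```python
from lxml import etree

# Hypothetical nested input (not the real olive.xml layout):
# each <area> groups several <sample> children.
SOURCE = b"""<data>
  <area name="North-Apulia">
    <sample><oleic>7823</oleic><linoleic>672</linoleic></sample>
    <sample><oleic>7709</oleic><linoleic>781</linoleic></sample>
  </area>
</data>"""

# Hypothetical stylesheet: emit one flat <record> per <sample> and
# copy the parent's name attribute down into each record.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/data">
    <records>
      <xsl:for-each select="area/sample">
        <record>
          <area_name><xsl:value-of select="../@name"/></area_name>
          <xsl:copy-of select="*"/>
        </record>
      </xsl:for-each>
    </records>
  </xsl:template>
</xsl:stylesheet>"""

# Compile the stylesheet once, then apply it to the parsed source.
transform = etree.XSLT(etree.fromstring(STYLESHEET))
result = transform(etree.fromstring(SOURCE))

# Serialize the flattened tree; every <record> now has only simple children.
flat_xml = etree.tostring(result, pretty_print=True, encoding="unicode")
print(flat_xml)
```

A document in this flat shape, with one element per row and one child element per column, is what a node-based loader can turn into a table directly.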
    

    R

    library(XML)
    
    # LOADING TRANSFORMED XML INTO R DATA FRAME
    doc <- xmlParse("Olive_py.xml")
    xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//record"))
    View(xmldf)
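
Before loading the file in R, it can help to check that every record really has the same flat set of child elements. The following standard-library sketch does that on a hypothetical two-record excerpt; the element names are assumed to match the output columns, and the values are taken from the second and third output rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of the transformed file (only the first four columns).
FLAT_XML = """<data>
  <record><area_name>North-Apulia</area_name><area>1</area><region>1</region><palmitic>1088</palmitic></record>
  <record><area_name>North-Apulia</area_name><area>1</area><region>1</region><palmitic>911</palmitic></record>
</data>"""

root = ET.fromstring(FLAT_XML)

# One dict per <record>: child tag -> trimmed text.
rows = [{field.tag: (field.text or "").strip() for field in record}
        for record in root.iter("record")]

# Column names come from the first record; all records should agree.
columns = list(rows[0])
consistent = all(list(row) == columns for row in rows)
print(columns, consistent)
```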
    

    Output

    area_name     area  region  palmitic  palmitoleic  stearic  oleic  linoleic  linolenic  arachidic  eicosenoic
    North-Apulia  1     1       1075      75           226      7823   672       na                    60
    North-Apulia  1     1       1088      73           224      7709   781       31         61         29
    North-Apulia  1     1       911       54           246      8113   549       31         63         29
    North-Apulia  1     1       966       57           240      7952   619       50         78         35
    North-Apulia  1     1       1051      67           259      7771   672       50         80         46
    ...
    

    (Slight cleanup of the very first record is needed: an extra space after "na" in the XML document shifted the arachidic and eicosenoic values forward by one column.)
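
The shift can be reproduced in a few lines, assuming the record's values arrive as a single space-delimited string (an assumption about the source layout). The string below is illustrative: its values follow the first output row, and the trailing 29 is a guess borrowed from neighbouring rows. Splitting on one space keeps an empty token where the doubled space sits; splitting on runs of whitespace does not.

```python
# Illustrative record text with a doubled space after "na".
# The trailing 29 is an assumed value, not taken from the source file.
raw = "1 1 1075 75 226 7823 672 na  60 29"

columns = ["area", "region", "palmitic", "palmitoleic", "stearic",
           "oleic", "linoleic", "linolenic", "arachidic", "eicosenoic"]

naive = raw.split(" ")  # single-space split: yields an empty token after "na"
clean = raw.split()     # whitespace-run split: no empty tokens

shifted = dict(zip(columns, naive))  # empty token lands in arachidic
fixed = dict(zip(columns, clean))    # values line up with their columns

print(shifted["arachidic"], shifted["eicosenoic"])
print(fixed["arachidic"], fixed["eicosenoic"])
```

Normalizing the whitespace in the source file, or tokenizing on whitespace runs, would remove the need for manual cleanup of that row.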
