Meteorological Data from XML to Dataframe in R

Submitted by 时光怂恿深爱的人放手 on 2021-02-11 10:36:47

Question


I'm trying to analyze meteorological data by importing it directly into R from its native XML structure. But the XML format seems very complicated and does not follow the commonly used standard of "one observation per row". The data provider has grouped the variables by each ten-minute interval recorded.

Here is a piece of the XML code:

<?xml version= "1.0" encoding="ISO-8859-1" ?>
<mes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="C069_2018_1.xsd">
    <dia Dia="2018-1-01">
        <hora Hora="00:00">
            <Meteoros>
                <Dir.Med._a_1800cm>250.5</Dir.Med._a_1800cm>
                <Humedad._a_170cm>43.94</Humedad._a_170cm>
                <Irradia.._a_273cm>0.0</Irradia.._a_273cm>
                <Precip.._a_144cm>0.0</Precip.._a_144cm>
                <Sig.Dir._a_1800cm>17.82</Sig.Dir._a_1800cm>
                <Sig.Vel._a_1800cm>2.78</Sig.Vel._a_1800cm>
                <Tem.Aire._a_170cm>12.57</Tem.Aire._a_170cm>
                <Vel.Max._a_1800cm>15.48</Vel.Max._a_1800cm>
                <Vel.Med._a_1800cm>8.6</Vel.Med._a_1800cm>
            </Meteoros>
        </hora>
        <hora Hora="00:10">
            <Meteoros>
                <Dir.Med._a_1800cm>249.3</Dir.Med._a_1800cm>
                <Humedad._a_170cm>44.65</Humedad._a_170cm>
                <Irradia.._a_273cm>0.0</Irradia.._a_273cm>
                <Precip.._a_144cm>0.0</Precip.._a_144cm>
                <Sig.Dir._a_1800cm>20.21</Sig.Dir._a_1800cm>
                <Sig.Vel._a_1800cm>2.32</Sig.Vel._a_1800cm>
                <Tem.Aire._a_170cm>12.55</Tem.Aire._a_170cm>
                <Vel.Max._a_1800cm>14.5</Vel.Max._a_1800cm>
                <Vel.Med._a_1800cm>7.8</Vel.Med._a_1800cm>
            </Meteoros>
        </hora>
        <hora Hora="00:20">
            <Meteoros>
                <Dir.Med._a_1800cm>250.3</Dir.Med._a_1800cm>
                <Humedad._a_170cm>46.17</Humedad._a_170cm>
                <Irradia.._a_273cm>0.0</Irradia.._a_273cm>
                <Precip.._a_144cm>0.0</Precip.._a_144cm>
                <Sig.Dir._a_1800cm>23.02</Sig.Dir._a_1800cm>
                <Sig.Vel._a_1800cm>2.25</Sig.Vel._a_1800cm>
                <Tem.Aire._a_170cm>12.45</Tem.Aire._a_170cm>
                <Vel.Max._a_1800cm>13.72</Vel.Max._a_1800cm>
                <Vel.Med._a_1800cm>5.55</Vel.Med._a_1800cm>
            </Meteoros>
        </hora>
...

And here is the full XML for the data of January 2018 (>60 MB):

http://opendata.euskadi.eus/contenidos/ds_meteorologicos/met_stations_ds_2018/opendata/2018/C069/C069_2018_1.xml

When I used the function "xmlTreeParse" I got the error:

"Error: XML content does not seem to be XML"

It's my first attempt with an XML data structure, but I've been trying the recommendations from similar questions on this site, such as:

Transforming data from xml into R dataframe

xml to dataframe in r

R XML to Dataframe

But those seem to be simple XML structures that work just fine when parsed directly, or even when converted directly to data frames with the libraries "XML" and "methods".

I need to obtain a data frame with a structure similar to this:

dia         hora  Dir.Med._a_1800cm  Humedad._a_170cm  Irradia.._a_273cm  Precip.._a_144cm  Sig.Dir._a_1800cm  Sig.Vel._a_1800cm  Tem.Aire._a_170cm  Vel.Max._a_1800cm  Vel.Med._a_1800cm
01/01/2018  0:00  250.5              43.94             0.0                0.0               17.82              2.78               12.57              15.48              8.6
01/01/2018  0:10  249.3              44.65             0.0                0.0               20.21              2.32               12.55              14.5               7.8
01/01/2018  0:20  250.3              46.17             0.0                0.0               23.02              2.25               12.45              13.72              5.55

Answer 1:


It takes quite some work, but it is not impossible. This solution also works when the number of observations differs from day to day.


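library(xml2)    # read_xml(), xml_find_all(), xml_attrs(), xml_children(), ...
library(purrr)   # map(), map_df(), pmap_df(), %>% (the *_df variants need dplyr installed)
library(tibble)  # as_tibble()

# xmlstr: the XML as a string, or a path/URL that read_xml() can open
# One node per day, plus each day's attributes (Dia) as a one-row tibble
dia <- xmlstr %>% read_xml() %>%  xml_find_all("//dia")
dia.dat <- dia %>% map(xml_attrs) %>% map(~t(.) %>% as_tibble)

# The hora nodes of each day, and their attributes (Hora), one row per interval
hora <- dia %>% map(xml_children) 
hora.dat <- hora %>% map(xml_attrs) %>% map(~map_df(., ~t(.) %>% as_tibble))

# The measurements below each hora/Meteoros node, one row per interval
hora.details <- hora %>% 
  map(~map(.,xml_children) %>% 
        map(xml_children) %>% 
        map(~setNames(xml_text(.), xml_name(.)) %>% t() %>% as_tibble)) %>% map(.,~do.call(rbind,.) %>% as_tibble)

# Bind day, hour and measurements column-wise, then stack all days
pmap_df(list(dia.dat, hora.dat, hora.details),cbind)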

I added some data to your XML example to test it (one extra day, and an extra hour on the second day).

Result:


        Dia  Hora Dir.Med._a_1800cm Humedad._a_170cm Irradia.._a_273cm Precip.._a_144cm Sig.Dir._a_1800cm
1 2018-1-01 00:00             250.5            43.94               0.0              0.0             17.82
2 2018-1-01 00:10             249.3            44.65               0.0              0.0             20.21
3 2018-1-01 00:20             250.3            46.17               0.0              0.0             23.02
4 2018-1-02 00:00             250.5            43.94               0.0              0.0             17.82
5 2018-1-02 00:10             249.3            44.65               0.0              0.0             20.21
6 2018-1-02 00:20             250.3            46.17               0.0              0.0             23.02
7 2018-1-02 00:30             250.3            46.17               0.0              0.0             23.02
  Sig.Vel._a_1800cm Tem.Aire._a_170cm Vel.Max._a_1800cm Vel.Med._a_1800cm
1              2.78             12.57             15.48               8.6
2              2.32             12.55              14.5               7.8
3              2.25             12.45             13.72              5.55
4              2.78             12.57             15.48               8.6
5              2.32             12.55              14.5               7.8
6              2.25             12.45             13.72              5.55
7              2.25             12.45             13.72              5.55
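
The values in a result like the one above are all text, because xml_text() and xml_attrs() return character vectors. A minimal sketch of a follow-up step, assuming you want numeric measurements and a single date-time column: the small `meteo` data frame below is a hypothetical stand-in for the real result, and the UTC time zone is an assumption, since the source data does not state one.

```r
# Hypothetical stand-in for the data frame built above (all columns are text).
meteo <- data.frame(
  Dia  = c("2018-1-01", "2018-1-01", "2018-1-02"),
  Hora = c("00:00", "00:10", "00:00"),
  Tem.Aire._a_170cm = c("12.57", "12.55", "12.57"),
  Vel.Med._a_1800cm = c("8.6", "7.8", "8.6"),
  stringsAsFactors = FALSE
)

# Convert every measurement column (everything except Dia and Hora) to numeric.
value_cols <- setdiff(names(meteo), c("Dia", "Hora"))
meteo[value_cols] <- lapply(meteo[value_cols], as.numeric)

# Combine date and time into one POSIXct column. The time zone is an assumption.
meteo$timestamp <- as.POSIXct(paste(meteo$Dia, meteo$Hora),
                              format = "%Y-%m-%d %H:%M", tz = "UTC")

str(meteo)
```

Keeping Dia and Hora alongside the new timestamp column makes it easy to check the conversion against the original XML attributes.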

Credits to the answers to:

converting XML nodes to a dataframe

Getting all the children nodes of XML file to data.frame or data.table




Answer 2:


Consider XSLT, a special-purpose language designed to transform XML files and a sibling of XPath, to copy the top-level dia and hora values down into each Meteoros node, regardless of the number of nodes:

XSLT (save as .xsl, a special .xml file)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- REPEAT dia and hora NODES -->
    <xsl:template match="Meteoros">
        <xsl:copy>
            <dia><xsl:value-of select="ancestor::dia/@Dia"/></dia>
            <hora><xsl:value-of select="ancestor::hora/@Hora"/></hora>    
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
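
To see what the two ancestor:: lookups in the stylesheet select, the same XPath axis can be tried directly from R. This is only an illustrative sketch using the xml2 package on a tiny inline document (`xml_txt` is made-up test input modeled on the question's sample), not a replacement for the XSLT approach.

```r
library(xml2)

# Tiny inline document with the same nesting as the question's file.
xml_txt <- '<mes>
  <dia Dia="2018-1-01">
    <hora Hora="00:00"><Meteoros><Humedad._a_170cm>43.94</Humedad._a_170cm></Meteoros></hora>
    <hora Hora="00:10"><Meteoros><Humedad._a_170cm>44.65</Humedad._a_170cm></Meteoros></hora>
  </dia>
</mes>'

doc <- read_xml(xml_txt)
meteoros <- xml_find_all(doc, "//Meteoros")

# Climb from each Meteoros node to its enclosing dia and hora elements,
# then read the attributes the stylesheet copies down.
dia  <- xml_attr(xml_find_first(meteoros, "ancestor::dia"), "Dia")
hora <- xml_attr(xml_find_first(meteoros, "ancestor::hora"), "Hora")

# A child element of each Meteoros node, for comparison.
humedad <- xml_text(xml_find_first(meteoros, "Humedad._a_170cm"))

data.frame(dia, hora, humedad)
```

Each Meteoros node yields exactly one dia and one hora value, which is why the stylesheet can repeat them as child elements without knowing how many days or intervals the file contains.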


R (no loops or mapping needed)

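library(xml2)   # read_xml()
library(xslt)   # xml_xslt()
library(XML)    # xmlParse(), getNodeSet(), xmlToDataFrame()

# Load the source XML and the stylesheet shown above
doc <- read_xml("Import.xml")
style <- read_xml("Script.xsl")

# Apply the transformation; the result is an xml2 document
new_xml <- xml_xslt(doc, style)

# Serialize it to text so the XML package can parse it
new_doc <- XML::xmlParse(as.character(new_xml), asText = TRUE)
meteoros_df <- XML::xmlToDataFrame(nodes=getNodeSet(new_doc, "//Meteoros"))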


Source: https://stackoverflow.com/questions/57875654/meteorological-data-from-xml-to-dataframe-in-r
