Parsing a nested XML string from a Hive table using PIG

Question

I'm trying to use PIG to extract some XML from a field in a Hive table, rather than from an XML file (which is the assumption of most of the examples I have read). The XML comes from a table arranged as follows:

ID, {XML_string}

The XML string contains n rows, and each row contains at least one and at most 10 attributes. We can assume that attribute #1 will always be present and will be unique.

<row>
 <att1></att1>
 <att2></att2>
 ...
</row>
<row>
 <att1></att1>
 <att2></att2>
 ...
</row>
...

I want to transform this into a new table with each row in the XML string exploded out into a separate row in the new table, but I still want to include the ID from the existing table.

ID, att1, att2, att3
==  ====  ====  ====
1   1     xxx   xxx
1   2     xxx   xxx
1   3     xxx   xxx
2   1     xxx   xxx
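
For illustration only, the target shape can be sketched outside Pig. The Python snippet below is not a Pig solution; it only shows the intended row-by-row explosion under stated assumptions: the element names `row`, `att1`, `att2`, `att3` follow the example above, while the `<rows>` wrapper, the function name `explode_rows`, and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

ATTRS = ["att1", "att2", "att3"]  # assumed attribute names, taken from the example above


def explode_rows(record_id, xml_string, attrs=ATTRS):
    """Return one (id, att1, att2, ...) tuple per <row> element in xml_string."""
    # The string holds several sibling <row> elements and no single root,
    # so wrap it in a hypothetical <rows> element before parsing.
    root = ET.fromstring("<rows>" + xml_string + "</rows>")
    out = []
    for row in root.findall("row"):
        # findtext returns None when an attribute is absent from this row.
        out.append((record_id,) + tuple(row.findtext(a) for a in attrs))
    return out


# Hypothetical sample: the second row has no <att2>.
sample = (
    "<row><att1>1</att1><att2>x</att2><att3>y</att3></row>"
    "<row><att1>2</att1><att3>z</att3></row>"
)

for record in explode_rows(1, sample):
    print(record)
```

Parsing each row as a unit keeps a row's attributes together, so a row that omits an attribute yields an empty slot rather than shifting the other values.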

I've approached this so far in PIG by using XPathAll. I've read a lot of advice that suggests avoiding Regex for XML parsing.

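-- Register piggybank and alias the XPathAll UDF
REGISTER /home/piggybank-0.12.0.jar;
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();
-- Load the Hive table through HCatalog
A = LOAD 'HiveTable' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- One XPathAll call per attribute; the projected fields are separated by commas
B = FOREACH A GENERATE id,
    XPathAll(xml_string, 'ROW/_ATT1') AS att1,
    XPathAll(xml_string, 'ROW/_ATT2') AS att2,
    XPathAll(xml_string, 'ROW/_ATT3') AS att3;
DUMP B;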

This results in the following output, assuming there are three row instances for item 1:

(1,(Att1-i1,Att1-i2,Att1-i3),(Att2-i1,Att2-i2,Att2-i3),(Att3-i1,Att3-i2,Att3-i3))

All of the information appears to be there, but I can't work out how to pull the first element from each of the embedded tuples into a new row, then the second elements, and so on. In other words:

(1, Att1-i1, Att2-i1, Att3-i1)
(1, Att1-i2, Att2-i2, Att3-i2)
(1, Att1-i3, Att2-i3, Att3-i3)
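
The reshaping described here is a positional "zip" of the per-attribute tuples. As a hedged illustration (plain Python, not Pig, and not a claim about any built-in Pig function), the sketch below shows the operation on the example record; the function name `transpose` and the variable `b_record` are hypothetical.

```python
def transpose(record_id, *columns):
    """Turn (id, col1_tuple, col2_tuple, ...) into one row per tuple position."""
    # zip pairs the i-th element of every column tuple. It stops at the
    # shortest tuple, so rows only line up if all tuples have equal length.
    return [(record_id,) + values for values in zip(*columns)]


# The record from the dump above, written as a Python tuple.
b_record = (
    1,
    ("Att1-i1", "Att1-i2", "Att1-i3"),
    ("Att2-i1", "Att2-i2", "Att2-i3"),
    ("Att3-i1", "Att3-i2", "Att3-i3"),
)

rows = transpose(b_record[0], *b_record[1:])
for row in rows:
    print(row)
```

One caveat with this positional approach: if some rows omit an attribute, the per-attribute tuples could differ in length, and the values would no longer line up by row.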

I'm clinging to the hope this can be done using Hive + Pig without having to resort to Java, etc. I'd appreciate any insights. I'm not precious about the approach taken so far, so if I have gone the long way round, please tell me!

Source: https://stackoverflow.com/questions/36240294/parsing-a-nested-xml-string-from-a-hive-table-using-pig

Tags

xml

Hadoop

xpath

Hive

apache-pig