Question
I'm trying to use Pig to extract XML from a field in a Hive table, rather than from an XML file (which is what most of the examples I have read assume). The XML comes from a table arranged as follows:
ID, {XML_string}
The XML string contains n rows, each containing at least one of up to 10 attributes. We can assume that attribute #1 will always be present and will be unique.
<row>
<att1></att1>
<att2></att2>
...
</row>
<row>
<att1></att1>
<att2></att2>
...
</row>
...
I want to transform this into a new table with each row in the XML string exploded out into a separate row in the new table, but I still want to include the ID from the existing table.
ID, att1, att2, att3
== ==== ==== ====
1 1 xxx xxx
1 2 xxx xxx
1 3 xxx xxx
2 1 xxx xxx
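To make the target shape concrete, here is a minimal Python sketch of the reshaping described above. It is not Pig code and not a proposed solution, only an illustration. The function name `explode_xml`, the `ATTRIBUTES` list, and the sample string are hypothetical, and the lowercase element names are taken from the XML sample in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of the (up to 10) attribute names.
ATTRIBUTES = ["att1", "att2", "att3"]


def explode_xml(record_id, xml_string, attributes=ATTRIBUTES):
    """Return one (id, att1, att2, ...) tuple per <row> element."""
    # The field holds several sibling <row> elements, which is not a
    # well-formed document on its own, so wrap it in a dummy root first.
    root = ET.fromstring("<root>" + xml_string + "</root>")
    rows = []
    for row in root.findall("row"):
        # findtext returns None when the attribute element is absent.
        values = [row.findtext(name) for name in attributes]
        rows.append((record_id,) + tuple(values))
    return rows


# Made-up sample data: two rows, each missing one attribute.
sample = (
    "<row><att1>1</att1><att2>a</att2></row>"
    "<row><att1>2</att1><att3>c</att3></row>"
)
for out_row in explode_xml(1, sample):
    print(out_row)
```

Working row by row like this keeps each row's attributes together, so a missing attribute shows up as an empty slot in that row rather than shifting values between rows.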
So far I've approached this in Pig using XPathAll. I've read a lot of advice that suggests avoiding regex for XML parsing.
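-- Make the piggybank XPathAll UDF available under a short name.
REGISTER /home/piggybank-0.12.0.jar;
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();
-- Read the Hive table through HCatalog.
A = LOAD 'HiveTable' USING org.apache.hive.hcatalog.pig.HCatLoader();
-- One XPathAll call per attribute; each returns a tuple of all matches.
B = FOREACH A GENERATE id,
    XPathAll(xml_string, 'ROW/_ATT1') AS att1,
    XPathAll(xml_string, 'ROW/_ATT2') AS att2,
    XPathAll(xml_string, 'ROW/_ATT3') AS att3;
DUMP B;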
This results in the following output, assuming there are three row instances for item 1:
(1,(Att1-i1,Att1-i2,Att1-i3),(Att2-i1,Att2-i2,Att2-i3),(Att3-i1,Att3-i2,Att3-i3))
All of the information appears to be there, but I can't work out how to pull the first element from each of the embedded tuples into a new row, then the second elements, and so on. In other words:
(1, Att1-i1, Att2-i1, Att3-i1)
(1, Att1-i2, Att2-i2, Att3-i2)
(1, Att1-i3, Att2-i3, Att3-i3)
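The reshaping wanted here is a positional "zip" of the per-attribute tuples. The Python sketch below only illustrates that logic on the sample output above. It is not Pig code, and the name `zip_columns` is hypothetical. It assumes every tuple has exactly one entry per row. If some rows lack an attribute, the tuples may differ in length and positions may no longer correspond to the same row.

```python
def zip_columns(record_id, *columns):
    """Turn parallel per-attribute tuples into one row per position."""
    # zip stops at the shortest column, so this assumes every column
    # has one entry per <row>, as in the sample output.
    return [(record_id,) + values for values in zip(*columns)]


# The record from the DUMP output, written as a Python tuple.
b_record = (
    1,
    ("Att1-i1", "Att1-i2", "Att1-i3"),
    ("Att2-i1", "Att2-i2", "Att2-i3"),
    ("Att3-i1", "Att3-i2", "Att3-i3"),
)

rows = zip_columns(b_record[0], *b_record[1:])
for row in rows:
    print(row)
```

Pig can register Python UDFs through Jython, so logic of this kind could in principle be wrapped as a UDF, but that wiring is not shown and has not been tested here.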
I'm clinging to the hope this can be done using Hive + Pig without having to resort to Java, etc. I'd appreciate any insights. I'm not precious about the approach taken so far, so if I have gone the long way round, please tell me!
Source: https://stackoverflow.com/questions/36240294/parsing-a-nested-xml-string-from-a-hive-table-using-pig