Pig not loading data into HCatalog table - HortonWorks Sandbox [closed]

Submitted by ぐ巨炮叔叔 on 2019-12-10 12:03:37

Question


I am running a Pig script in the HortonWorks virtual machine to extract certain parts of my XML dataset and load them into columns of an HCatalog table. On my local machine, running the script against the XML file produces an output file with all the extracted parts. When I run the same script in the HortonWorks VM, however, it appears to run successfully but the HCatalog table is still empty.

Here is my local script:

REGISTER piggybank.jar

items = LOAD 'data1.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS  (row:chararray);

data = FOREACH items GENERATE 
REGEX_EXTRACT(row, 'Id="([^"]*)"', 1) AS  id:int,
REGEX_EXTRACT(row, 'CreationDate="([^"]*)"', 1) AS  creationdate:chararray,
REGEX_EXTRACT(row, 'Score="([^"]*)"', 1) AS  score:int,
REGEX_EXTRACT(row, 'Title="([^"]*)"', 1) AS  title:chararray;


STORE data INTO '/tmp/postsETLResults' USING PigStorage();

The one I'm using in HortonWorks:

REGISTER piggybank.jar

items = LOAD 'data1.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS  (row:chararray);

data = FOREACH items GENERATE 
REGEX_EXTRACT(row, 'Id="([^"]*)"', 1) AS  id:int,
REGEX_EXTRACT(row, 'CreationDate="([^"]*)"', 1) AS  creationdate:chararray,
REGEX_EXTRACT(row, 'Score="([^"]*)"', 1) AS  score:int,
REGEX_EXTRACT(row, 'Title="([^"]*)"', 1) AS  title:chararray;


STORE data into 'posts_table_1' USING org.apache.hcatalog.pig.HCatStorer();


validate = LOAD 'default.posts_table_1' USING org.apache.hcatalog.pig.HCatLoader();

Sample XML row (from the StackOverflow public dataset):

<row Id="149115" PostTypeId="2" ParentId="149078" CreationDate="2008-09-29T15:16:23.870" Score="1" Body="&lt;p&gt;I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.&lt;/p&gt;&#xA;" OwnerDisplayName="user16324" LastActivityDate="2008-09-29T15:16:23.870" CommentCount="1" />
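
To make the extraction step concrete, here is a minimal Python sketch of what the four patterns pull out of the sample row above. It is only an analogue, not Pig: the helper `regex_extract` is a hypothetical stand-in for Pig's `REGEX_EXTRACT`, assumed to return the captured group or nothing when the pattern does not match, and the `Body` and `LastActivityDate` attributes are left out of the row for readability.

```python
import re

# Sample row from the question, with the Body and LastActivityDate
# attributes omitted for readability.
row = ('<row Id="149115" PostTypeId="2" ParentId="149078" '
       'CreationDate="2008-09-29T15:16:23.870" Score="1" '
       'OwnerDisplayName="user16324" CommentCount="1" />')


def regex_extract(text, pattern, group):
    """Hypothetical stand-in for Pig's REGEX_EXTRACT.

    Returns the requested capture group of the first match,
    or None when the pattern does not match at all.
    """
    match = re.search(pattern, text)
    return match.group(group) if match else None


# Same patterns as in the Pig script.
fields = {
    "id": regex_extract(row, r'Id="([^"]*)"', 1),
    "creationdate": regex_extract(row, r'CreationDate="([^"]*)"', 1),
    "score": regex_extract(row, r'Score="([^"]*)"', 1),
    "title": regex_extract(row, r'Title="([^"]*)"', 1),
}

print(fields)
```

For this particular row the `title` entry comes back empty, because the row has no `Title` attribute. Note also that the `Id` pattern is unanchored, so it only picks up the right attribute here because `Id` appears before `PostTypeId` and `ParentId`.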

I created the HCatalog table manually, and all the correct fields exist and are of the correct type.

The strange thing is that if I run DUMP data in Pig, I get no output. If I run ILLUSTRATE data, I see pieces of my data in the log, followed by large blank areas, followed by more data, and so on.

What am I missing here? I'd really like to take this messy XML file and get a neat table in HCatalog. Again, I get the results I'm looking for when running the local script on my machine, but when I run the second version designed for storing the output into the posts_table_1 HCatalog table, I get a success message but an empty table.

Alternatively, if I can just get the output on my local machine as a comma-delimited file, I can use that file and have HCatalog automatically load the data in the Hue interface. As of now, the output is space-delimited which is problematic in Hue because the titles of posts contain spaces.

Thanks in advance! This has me stumped.


Answer 1:


I found the issue. I created the HCatalog table manually and had used all of the default options, including the delimiter, which was set to ^A (\001). My output had columns separated by tabs (\t), so when the table received the data, it found no ^A delimiter and stored an empty dataset. I recreated the table to look for \t and everything worked fine.
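
The general idea of a delimiter mismatch can be sketched in a few lines of Python. This is not a reproduction of what HCatalog or Hive does internally; the `parse` helper, the four-column layout, and the record values (including the title) are assumptions made up for illustration.

```python
CTRL_A = "\x01"  # the ^A control character
TAB = "\t"

# One record written as tab-delimited text. The values are illustrative only.
line = TAB.join(["149115", "2008-09-29T15:16:23.870", "1", "Example title"])


def parse(text, delimiter, n_columns):
    """Split a text line into n_columns fields, padding missing ones with None."""
    parts = text.split(delimiter)[:n_columns]
    return parts + [None] * (n_columns - len(parts))


# Reading tab-delimited data with the ^A delimiter: nothing is split,
# so the whole line lands in the first column and the rest stay empty.
wrong = parse(line, CTRL_A, 4)

# Reading the same data with the tab delimiter: four clean columns.
right = parse(line, TAB, 4)

print(wrong)
print(right)
```

With the wrong delimiter the reader never finds a column boundary, which matches the spirit of the fix above: the table definition and the data have to agree on the separator.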



Source: https://stackoverflow.com/questions/22627693/pig-not-loading-data-into-hcatalog-table-hortonworks-sandbox
