Avoid exception in ToDate in Pig for individual rows

Posted by 北慕城南 on 2019-12-12 21:19:36

Question


I have a CSV file as input that I am trying to process with Pig. The CSV has a date column that contains corrupt values in some rows. Please suggest a way to filter out the rows with a corrupt date column before I apply the ToDate() function to that column in a FOREACH...GENERATE statement.

A sample format of my data is:

A,21,12/1/2010 8:26
B,33,12/1/2010 8:26
C,42,i am corrupted
D,30,12/1/2013 9:26

I want to load this and then transform it as follows, assuming the CSV file is loaded into Y(name,id,date):

X = FOREACH Y GENERATE ToDate(date, 'MM/dd/yyyy HH:mm') AS newdate;

I want to apply a FILTER to Y before the statement above to drop the row starting with C. As written, the statement throws an exception and the job fails when I DUMP X;


Answer 1:


ToDate fails in two cases:

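1) When the date is missing or its syntax is wrong. Keep only the well-formed dates with a regular expression, then run the FOREACH over the filtered relation instead of Y. Pig's matches operator uses Java regular expressions and has to match the whole string, so the pattern covers the time part as well and is written without /.../ delimiters:

Y_valid = FILTER Y BY (date matches '(0?[1-9]|1[0-2])/(0?[1-9]|[12][0-9]|3[01])/(19|20)[0-9]{2} ([01]?[0-9]|2[0-3]):[0-5][0-9]');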

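2) When the timestamp falls into a daylight saving time transition (DST, https://en.wikipedia.org/wiki/Daylight_saving_time) of your time zone, that is, a local time that is skipped when the clocks move forward. A regular expression cannot detect this, so you have to filter those rows manually.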
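
For reference, the filter and the conversion could be combined roughly as follows. This is only a sketch: the file name input.csv, the column types, and the alias Y_valid are assumptions, and the pattern is tailored to the sample rows in the question (slash separators, optional leading zeros, 24-hour time). The format string uses MM for the month, since mm stands for minutes.

```pig
-- Load the CSV; the date column stays a string until it has been validated.
-- 'input.csv' and the schema types are assumed for illustration.
Y = LOAD 'input.csv' USING PigStorage(',') AS (name:chararray, id:int, date:chararray);

-- Keep only rows whose date column has the shape M/d/yyyy H:mm.
-- matches must match the entire string, so the time part is included.
Y_valid = FILTER Y BY date matches '(0?[1-9]|1[0-2])/(0?[1-9]|[12][0-9]|3[01])/(19|20)[0-9]{2} ([01]?[0-9]|2[0-3]):[0-5][0-9]';

-- Convert only the rows that passed the filter.
X = FOREACH Y_valid GENERATE name, id, ToDate(date, 'MM/dd/yyyy HH:mm') AS newdate;

DUMP X;
```

With the sample data, the row starting with C does not match and is dropped before ToDate runs. The pattern only checks the shape of the value, so an impossible date such as 2/30/2010 or a time inside a DST gap would still pass the filter and fail in ToDate.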



Source: https://stackoverflow.com/questions/38147906/avoid-exception-in-todate-in-pig-for-individual-rows
