apache-pig

Apache PIG: Get the day of the week and split accordingly

て烟熏妆下的殇ゞ submitted on 2019-12-11 15:55:23
Question: I need to expand the range between two dates into individual days and leave out Saturdays and Sundays. A built-in function in 0.11.1 returns the day of the week, but how do I find out whether that day is a Saturday or a Sunday? Does anyone have an idea? My expected output is described below.

Input:

    User   Fromdate   Todate
    Raj    10/3/2013  10/8/2013
    James  10/4/2013  10/7/2013
    etc.

Expected output:

    Raj    10/3/2013
    Raj    10/4/2013
    Raj    10/7/2013
    Raj    10/8/2013
    James  10/4/2013
    James  10/7/2013

Answer 1: Since the Pig DateTime objects are really Unix
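The expected output in the question can be reproduced with a short sketch of the logic: walk every day in the range and keep only Monday through Friday. This is plain Python rather than Pig, offered only to illustrate the idea; the helper names `weekdays_between` and `format_day` are made up for this example and do not come from Pig or from the answer, which is cut off.

```python
from datetime import datetime, timedelta


def weekdays_between(start, end, fmt="%m/%d/%Y"):
    """Return every date from start to end (inclusive) that is not a Saturday or Sunday."""
    first = datetime.strptime(start, fmt).date()
    last = datetime.strptime(end, fmt).date()
    days = []
    current = first
    while current <= last:
        # date.weekday() returns 0 for Monday ... 5 for Saturday, 6 for Sunday
        if current.weekday() < 5:
            days.append(current)
        current += timedelta(days=1)
    return days


def format_day(day):
    """Format a date without zero padding, matching the question's M/D/YYYY style."""
    return "{}/{}/{}".format(day.month, day.day, day.year)


# Input rows copied from the question
rows = [
    ("Raj", "10/3/2013", "10/8/2013"),
    ("James", "10/4/2013", "10/7/2013"),
]

expanded = [
    (user, format_day(day))
    for user, start, end in rows
    for day in weekdays_between(start, end)
]

for user, day in expanded:
    print(user, day)
```

In a Pig script the same test would have to be expressed through Pig's own date functions or a UDF; the sketch only shows which days survive the filter.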

Extracting string from logs with regex in pig script

匆匆过客 submitted on 2019-12-11 15:04:07
Question: I have log data and I want to extract each piece of information into a variable. The following is one sample log line:

    {:id=>306, :name=>"bblite", :cpu=>{:quota=>4, :allocated=>4, :actual=>0}, :memory=>{:quota=>8192, :allocated=>8192, :actual=>8578}, :cluster_stats=>{"wc1104"=>{:cpu=>0, :mem=>8578}}}

I need one variable that holds all ids, one that holds all names, one that holds the CPUs, and one that holds all cluster stats. The following is a portion of my Pig script. I can store the ids but
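One way to think about the extraction is as one regular expression per field, each with a single capturing group. The sketch below tries that idea in plain Python against the sample line from the question; it is not the asker's Pig script, and the patterns are assumptions that fit this one sample line only (they rely on `:cpu=>` appearing first at the top level and on `:cluster_stats` being the last key).

```python
import re

# Sample log line copied from the question
line = (
    '{:id=>306, :name=>"bblite", '
    ':cpu=>{:quota=>4, :allocated=>4, :actual=>0}, '
    ':memory=>{:quota=>8192, :allocated=>8192, :actual=>8578}, '
    ':cluster_stats=>{"wc1104"=>{:cpu=>0, :mem=>8578}}}'
)

# One pattern per field; each pattern has exactly one capturing group.
PATTERNS = {
    "id": r':id=>(\d+)',
    "name": r':name=>"([^"]*)"',
    # The first ":cpu=>" in the line is the top-level one; its block has no nested braces.
    "cpu": r':cpu=>(\{[^}]*\})',
    # cluster_stats is the last key, so capture up to the brace before the final one.
    "cluster_stats": r':cluster_stats=>(\{.*\})\}$',
}


def extract_fields(log_line):
    """Return a dict of field name -> captured text, or None when a pattern does not match."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, log_line)
        fields[name] = match.group(1) if match else None
    return fields


fields = extract_fields(line)
for name, value in fields.items():
    print(name, "=", value)
```

Pig's regex functions use Java regular expressions, so patterns like these would need to be checked and escaped again before being used in a script.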

How do you decode JSON in Pig that comes from a column?

大憨熊 submitted on 2019-12-11 12:59:22
Question: There are numerous examples of how to use JsonLoader() to load JSON data with a schema from a file, but none for JSON that comes from any other kind of output.

Answer 1: You are looking for the JsonStringToMap UDF provided in Elephant Bird: https://github.com/kevinweil/elephant-bird/search?q=JsonStringToMap&ref=cmdform

Sample file:

    foo bar {"version":1, "type":"an event", "count": 1}
    foo bar {"version":1, "type":"another event", "count": 1}

Pig script:

    REGISTER /path/to/elephant-bird.jar;
    DEFINE JsonStringToMap com

PIG Error 1066 after iterating through a joined set.

我与影子孤独终老i submitted on 2019-12-11 12:47:56
Question: I am trying to join one set, which holds the number of days in each month, with a data set on the year-month key. After the join, when I try to run a FOREACH over the result, I get: ERROR: 1066 ... Backend error : Scalar has more than one row in the output. Here is an abbreviated set with the same problem:

    $ hadoop fs -cat DIM/\*
    2011,01,31
    2011,02,28
    2011,03,31
    2011,04,30
    2011,05,31
    2011,06,30
    2011,07,31
    2011,08,31
    2011,09,30
    2011,10,31
    2011,11,30
    2011,12,31
    $ hadoop fs -cat ACCT/\*
    2011,7,26,key1,23.25,2470

How to get the current time stamp in PIG

孤人 submitted on 2019-12-11 12:47:44
Question: I have a question about a Pig script that I am writing. How can I get the current Unix timestamp in a Pig script? Do I need a UDF for this, or can Pig provide the current timestamp? Please advise. Thanks.

Answer 1: Here are two solutions.

First: use CurrentTime() and convert the result with ToUnixTime() to get the timestamp. Example:

    X = load 'xx' ......... ;
    X1 = FOREACH X GENERATE ToUnixTime(CurrentTime());

Second: pass it in from the command line as a parameter.

    pig -f myscript

Find average by joining two datasets

核能气质少年 submitted on 2019-12-11 12:34:41
Question: I have two data sets:

    EmployeeDetail (data set 1): id name gender location
    SalaryDetail (data set 2): id salary

I need to join both and find the average salary of male and female employees in each location. So I tried the following code:

    EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
    SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
    JoinedEmpDetail = join EmpDetail by id,
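The computation being asked for has three steps: join the two sets on id, group the joined rows by (location, gender), and average the salary in each group. The plain-Python sketch below walks through those steps; it is not the asker's Pig script, the function name is invented, and the sample rows are hypothetical values made up purely for illustration (the question shows no data).

```python
from collections import defaultdict

# Hypothetical sample rows (id, name, gender, location); not from the question
employees = [
    (1, "A", "M", "Pune"),
    (2, "B", "F", "Pune"),
    (3, "C", "M", "Pune"),
    (4, "D", "F", "Delhi"),
    (5, "E", "M", "Delhi"),
]

# Hypothetical sample rows (id, salary); employee 5 deliberately has no salary row
salaries = [(1, 1000.0), (2, 2000.0), (3, 3000.0), (4, 4000.0)]


def average_salary_by_location_and_gender(employees, salaries):
    """Inner-join on id, group by (location, gender), and average the salaries."""
    salary_by_id = dict(salaries)  # assumes at most one salary row per id
    totals = defaultdict(lambda: [0.0, 0])  # (location, gender) -> [sum, count]
    for emp_id, _name, gender, location in employees:
        if emp_id not in salary_by_id:
            continue  # inner join: employees without a salary row are dropped
        bucket = totals[(location, gender)]
        bucket[0] += salary_by_id[emp_id]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


averages = average_salary_by_location_and_gender(employees, salaries)
for (location, gender), avg in sorted(averages.items()):
    print(location, gender, avg)
```

The grouping key is the pair (location, gender) rather than either field alone, which is the part that is easy to miss when translating the requirement into a GROUP statement.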

How to use REGEX_EXTRACT_ALL in Pig

末鹿安然 submitted on 2019-12-11 11:57:49
Question: This is my sample data:

    subId=00001111911128052627,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212218.4621702216543667E17
    subId=00001111911128052639,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212219.6726312167218586E17
    subId=00001111911128052615,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212216.9431647633139046E17

My expected output is a tuple where each field represents a matched group: (capturing_group1,
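The expected output, one tuple field per capturing group, can be tried out quickly outside Pig. The sketch below is plain Python, not Pig, and the pattern is an assumption based only on the three sample lines; `extract_all` is a made-up helper name. It requires the whole line to match, so a malformed line yields `None` instead of a partial tuple.

```python
import re

# Three capturing groups, one per key=value pair in the sample lines
RECORD = re.compile(r"subId=(\d+),towerid=(\w+),bytes=(\S+)")


def extract_all(line):
    """Return a tuple with one field per capturing group, or None if the line does not match."""
    match = RECORD.fullmatch(line.strip())
    return match.groups() if match else None


# First sample line copied from the question
sample = (
    "subId=00001111911128052627,"
    "towerid=11232w34532543456345623453456984756894756,"
    "bytes=122112212212212218.4621702216543667E17"
)

print(extract_all(sample))
```

A pattern that works here still needs to be re-tested inside Pig, since Pig uses Java regular expressions and backslashes in Pig string literals need their own escaping.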

Hive: How to calculate time difference

匆匆过客 submitted on 2019-12-11 11:48:29
Question: My requirement is simple: how do I calculate the time difference between two columns in Hive? Example:

    Time_Start: 10:15:00
    Time_End: 11:45:00

I need (Time_End - Time_Start) = 1:30:00. Note that both columns are of String datatype. Please help me get the required result.

Answer 1: The language manual describes all available datetime functions. The difference in seconds can be calculated this way:

    hour(time_end) * 3600 + minute(time_end) * 60 + second(time_end) - hour(time_start) * 3600 - minute
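The arithmetic the answer starts to spell out (convert each time to seconds, subtract, then turn the result back into hours, minutes, and seconds) can be checked against the question's example with a small sketch. This is plain Python, not HiveQL, and the function names `to_seconds` and `time_difference` are invented for illustration.

```python
def to_seconds(hms):
    """Convert an 'HH:MM:SS' string to a number of seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def time_difference(start, end):
    """Return end - start as an 'H:MM:SS' string, with a leading '-' when end is earlier."""
    diff = to_seconds(end) - to_seconds(start)
    sign = "-" if diff < 0 else ""
    hours, remainder = divmod(abs(diff), 3600)
    minutes, seconds = divmod(remainder, 60)
    return "{}{}:{:02d}:{:02d}".format(sign, hours, minutes, seconds)


# Values from the question
print(time_difference("10:15:00", "11:45:00"))  # 1:30:00
```

The sketch assumes both times fall on the same day; a range that crosses midnight would need a date component as well.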

Pig script to read Cassandra table

时间秒杀一切 submitted on 2019-12-11 11:06:58
Question: I am trying to write a Pig script that will extract data from a Cassandra table. The Pig script looks like this:

    REGISTER ./cassandra-all-2.0.8.39.jar
    REGISTER ./datastax-agent-4.1.4-standalone.jar
    REGISTER ./cassandra-driver-core-2.0.2.1.jar
    REGISTER ./apache-cassandra-thrift-2.0.12.jar
    A = LOAD 'cql://username:password/mykeyspace/mycolumnfamily' USING org.apache.cassandra.hadoop.pig.CqlStorage() AS (user_id:long, fname:chararray, last_update_date:chararray, lname:chararray);
    DUMP A;

I keep

Apache Pig: Extract query parameters from web log

孤街浪徒 submitted on 2019-12-11 11:06:45
Question: I am working on analyzing AWS CloudFront access logs. I have the code to load the lines of the file:

    raw_logs2 = LOAD 'file:///home/ec2-user/ENWRZAC68E00M.2011-02-28-18.72jA8eGh' USING PigStorage('\t') AS (
        date: chararray,
        time: chararray,
        x_edge_location: chararray,
        sc_bytes: int,
        c_ip: chararray,
        cs_method: chararray,
        cs_host: chararray,
        cs_uri_stem: chararray,
        sc_status: chararray,
        cs_referer: chararray,
        cs_user_agent: chararray,
        cs_uri_query: chararray
    );

Now I am trying to parse the query