apache-pig | 易学教程

Apache PIG: Get the day of the week and split accordingly

Question: I need to expand the range between two dates into individual dates and leave out Saturdays and Sundays. The built-in functions in 0.11.1 help get the day of the week, but how do I find out whether that day is a Saturday or Sunday? Does anyone have an idea? My expected output is described below.
Input:
User Fromdate Todate
Raj 10/3/2013 10/8/2013
James 10/4/2013 10/7/2013
etc.
Expected output:
Raj 10/3/2013
Raj 10/4/2013
Raj 10/7/2013
Raj 10/8/2013
James 10/4/2013
James 10/7/2013
Answer 1: Since the Pig DateTime objects are really Unix
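The answer excerpt above is cut off, so the following is not a reconstruction of it. It is only a minimal Python sketch of the logic the question asks for: walk each user's date range day by day and keep the days that are not Saturday or Sunday. The function names (`weekdays_between`, `expand_rows`) are hypothetical, and the rows are the two sample rows from the question.

```python
from datetime import date, datetime, timedelta


def weekdays_between(start: date, end: date) -> list:
    """Return every date from start to end (inclusive) that falls on Monday-Friday."""
    days = []
    current = start
    while current <= end:
        # date.weekday() returns 0 for Monday ... 5 for Saturday, 6 for Sunday.
        if current.weekday() < 5:
            days.append(current)
        current += timedelta(days=1)
    return days


def expand_rows(rows: list) -> list:
    """Turn (user, from_date, to_date) rows into one (user, date) row per weekday."""
    out = []
    for user, from_text, to_text in rows:
        start = datetime.strptime(from_text, "%m/%d/%Y").date()
        end = datetime.strptime(to_text, "%m/%d/%Y").date()
        for d in weekdays_between(start, end):
            # Format without zero padding to match the question's M/D/YYYY style.
            out.append((user, f"{d.month}/{d.day}/{d.year}"))
    return out


# Sample rows taken from the question.
rows = [("Raj", "10/3/2013", "10/8/2013"), ("James", "10/4/2013", "10/7/2013")]
for user, day in expand_rows(rows):
    print(user, day)
```

For the sample rows this prints the six lines listed under "Expected output" in the question, because 10/5/2013 and 10/6/2013 fall on a Saturday and a Sunday.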

Extracting string from logs with regex in pig script

Question: I have log data and I want to extract each piece of information into a variable. The following is a sample one-line log entry:
{:id=>306, :name=>"bblite", :cpu=>{:quota=>4, :allocated=>4, :actual=>0}, :memory=>{:quota=>8192, :allocated=>8192, :actual=>8578}, :cluster_stats=>{"wc1104"=>{:cpu=>0, :mem=>8578}}}
I need one variable that holds all the ids, one that holds all the names, one that holds the CPUs, and one that holds all the cluster stats. The following is a portion of my Pig script. I can store the ids but
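The asker's Pig script is not included in the excerpt, so the following is only a hedged Python sketch of the extraction itself: one regular expression with a named group per field, applied to the sample log line from the question. The pattern and group names are assumptions made for illustration, and the pattern assumes every line has the same field order as the sample.

```python
import re

# One named group per field; assumes the field order shown in the sample line.
LOG_PATTERN = re.compile(
    r':id=>(?P<id>\d+), '
    r':name=>"(?P<name>[^"]*)", '
    r':cpu=>\{(?P<cpu>[^}]*)\}, '
    r':memory=>\{(?P<memory>[^}]*)\}, '
    r':cluster_stats=>\{(?P<cluster_stats>.*)\}\}$'
)

# Sample line copied from the question.
line = (
    '{:id=>306, :name=>"bblite", '
    ':cpu=>{:quota=>4, :allocated=>4, :actual=>0}, '
    ':memory=>{:quota=>8192, :allocated=>8192, :actual=>8578}, '
    ':cluster_stats=>{"wc1104"=>{:cpu=>0, :mem=>8578}}}'
)

match = LOG_PATTERN.search(line)
fields = match.groupdict() if match else {}

print(fields["id"])             # 306
print(fields["name"])           # bblite
print(fields["cpu"])            # :quota=>4, :allocated=>4, :actual=>0
print(fields["cluster_stats"])  # "wc1104"=>{:cpu=>0, :mem=>8578}
```

The same capture groups could in principle be reused in a Pig regex function, but that translation is not shown in the source and is not attempted here.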

How do you decode JSON in Pig that comes from a column?

Question: There are numerous examples of how to use JsonLoader() to load JSON data with a schema from a file, but none for JSON that arrives as a column in some other output.
Answer 1: You are looking for the JsonStringToMap UDF provided in Elephant Bird: https://github.com/kevinweil/elephant-bird/search?q=JsonStringToMap&ref=cmdform
Sample file:
foo bar {"version":1, "type":"an event", "count": 1}
foo bar {"version":1, "type":"another event", "count": 1}
Pig script:
REGISTER /path/to/elephant-bird.jar;
DEFINE JsonStringToMap com
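The Pig script in the answer is truncated, so it is left as-is. As a language-neutral illustration of what decoding JSON from a column means, here is a small Python sketch that treats the third field of each sample line as a JSON string and turns it into a map. It assumes the first two fields are separated by whitespace (the real delimiter is not visible in the excerpt), and `decode_json_column` is a hypothetical helper name.

```python
import json

# Sample lines copied from the answer; the field separator is an assumption.
sample_lines = [
    'foo bar {"version":1, "type":"an event", "count": 1}',
    'foo bar {"version":1, "type":"another event", "count": 1}',
]


def decode_json_column(line: str) -> tuple:
    """Split a line into two plain fields plus a JSON column decoded to a dict."""
    # maxsplit=2 keeps the JSON text intact even though it contains spaces.
    first, second, json_text = line.split(None, 2)
    return first, second, json.loads(json_text)


for line in sample_lines:
    first, second, event = decode_json_column(line)
    print(first, second, event["type"], event["count"])
```

Limiting the split to two separators is the important design choice: the JSON payload itself contains spaces and commas, so it must be kept as a single field before it is parsed.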

PIG Error 1066 after iterating through a joined set.

Question: I am trying to join one set, which holds the number of days in each month, with a data set on the year-month key. After the join, when I try to run a FOREACH over the result, I get ERROR: 1066 ... Backend error : Scalar has more than one row in the output. Here is an abbreviated set with the same problem:
$ hadoop fs -cat DIM/\*
2011,01,31
2011,02,28
2011,03,31
2011,04,30
2011,05,31
2011,06,30
2011,07,31
2011,08,31
2011,09,30
2011,10,31
2011,11,30
2011,12,31
$ hadoop fs -cat ACCT/\*
2011,7,26,key1,23.25,2470

How to get the current time stamp in PIG

Question: I have a question about a Pig script that I am writing. How can I get the current Unix timestamp in a Pig script? Do I need a UDF for this, or can Pig provide the current timestamp? Kindly advise. Thanks.
Answer 1: Here are two solutions.
First: call CurrentTime() and pass the result to ToUnixTime() to get the timestamp. Example:
X = load 'xx' ......... ;
X1 = FOREACH X GENERATE ToUnixTime(CurrentTime());
Second: pass it from the command line as a parameter.
pig -f myscript

Find average by joining two datasets

Question: I have two data sets:
EmployeeDetail (data set 1): id name gender location
SalaryDetail (data set 2): id salary
I need to join both and find the average salary of males and females in each location. So I tried the following code:
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id,

How to use REGEX_EXTRACT_ALL in Pig

Question: This is my sample data:
subId=00001111911128052627,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212218.4621702216543667E17
subId=00001111911128052639,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212219.6726312167218586E17
subId=00001111911128052615,towerid=11232w34532543456345623453456984756894756,bytes=122112212212212216.9431647633139046E17
My expected output is a tuple where each field represents a matched group: (capturing_group1,
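The expected output in the excerpt is cut off, so which groups the asker wanted is not known. As a minimal sketch of "one tuple field per capturing group", the Python below assumes three groups, the values after `subId=`, `towerid=`, and `bytes=`, and applies them to the first sample line. The pattern itself is an assumption, not the asker's regex.

```python
import re

# Three capturing groups, one per key=value pair (an assumed grouping).
PATTERN = re.compile(r"subId=([^,]*),towerid=([^,]*),bytes=(.*)")

# First sample line copied from the question.
line = (
    "subId=00001111911128052627,"
    "towerid=11232w34532543456345623453456984756894756,"
    "bytes=122112212212212218.4621702216543667E17"
)

# fullmatch requires the pattern to cover the whole line, not just a prefix.
match = PATTERN.fullmatch(line)
groups = match.groups() if match else ()

print(groups)
```

Each element of `groups` corresponds to one pair of parentheses in the pattern, in order, which is the tuple shape the question describes.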

Hive: How to calculate time difference

Question: My requirement is simple: how do I calculate the time difference between two columns in Hive? Example:
Time_Start: 10:15:00
Time_End: 11:45:00
I need (Time_End - Time_Start) = 1:30:00. Note that both columns are of String datatype. Kindly help me get the required result.
Answer 1: The language manual describes all available datetime functions. The difference in seconds can be calculated this way: hour(time_end) * 3600 + minute(time_end) * 60 + second(time_end) - hour(time_start) * 3600 - minute
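The Hive expression in the answer is truncated, but its arithmetic is clear: convert each HH:MM:SS string to seconds, subtract, and format the result. The following Python sketch works through that arithmetic on the question's example values. It is not Hive code, the function names are hypothetical, and it assumes the end time is not earlier than the start time.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to a number of seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def time_difference(time_start: str, time_end: str) -> str:
    """Return time_end - time_start formatted as H:MM:SS (assumes end >= start)."""
    diff = to_seconds(time_end) - to_seconds(time_start)
    hours, remainder = divmod(diff, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Example values from the question: 11:45:00 - 10:15:00 is 5400 seconds.
print(time_difference("10:15:00", "11:45:00"))  # 1:30:00
```

The worked numbers are 11 * 3600 + 45 * 60 = 42300 and 10 * 3600 + 15 * 60 = 36900, so the difference is 5400 seconds, which is 1 hour and 30 minutes.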

Pig script to read Cassandra table

Question: I am trying to write a Pig script that will extract data from a Cassandra table. The Pig script looks like this:
REGISTER ./cassandra-all-2.0.8.39.jar
REGISTER ./datastax-agent-4.1.4-standalone.jar
REGISTER ./cassandra-driver-core-2.0.2.1.jar
REGISTER ./apache-cassandra-thrift-2.0.12.jar
A = LOAD 'cql://username:password/mykeyspace/mycolumnfamily' USING org.apache.cassandra.hadoop.pig.CqlStorage() AS (user_id:long, fname:chararray, last_update_date:chararray, lname:chararray);
DUMP A;
I keep

Apache Pig: Extra query parameters from web log

Question: I am working on analyzing AWS CloudFront access logs. I have the code to load the lines of the file:
raw_logs2 = LOAD 'file:///home/ec2-user/ENWRZAC68E00M.2011-02-28-18.72jA8eGh' USING PigStorage('\t') AS ( date: chararray, time: chararray, x_edge_location: chararray, sc_bytes: int, c_ip: chararray, cs_method: chararray, cs_host: chararray, cs_uri_stem: chararray, sc_status: chararray, cs_referer: chararray, cs_user_agent: chararray, cs_uri_query: chararray );
Now I am trying to parse the query