问题
I have two data sets ,
EmployeeDetail(data set 1):-
id
name
gender
location
SalaryDetail(data set 2):-
id
salary
I need to join both and find out average salary of male and female in each location. So I tried following code .
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as
(id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as
(id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by
id;
GroupedByLocation = group JoinedEmpDetail by location;
AverageSalary = foreach GroupedByLocation {
genderGrp = group JoinedEmpDetail by JoinedEmpDetail.EmpDetail::gender;
avgSalary = foreach genderGrp generate group,
AVG(JoinedEmpDetail.SalaryDetail::salary);
generate group as location, JoinedEmpDetail.EmpDetail::gender, avgSalary;
};
But it is throwing below error
<line 6, column 22> Syntax error, unexpected symbol at or near
'JoinedEmpDetail'
Can anyone please help where am I doing the mistake or how to do it properly?
For more clarity about my requirement I am giving some sample data sets.
EmpDetail.txt
1 Biswa Male Bangalore
12 Bratati Mahapatra Female Chennai
2 Bibhu kalyan Male Bangalore
3 Chinta Male Mumbai
10 Amrit Anand Male Bangalore
11 Sateesh panda Male Bangalore
4 Kirti Kumar Male Mumbai
6 Shruthi Female Chennai
7 Vijay Male Chennai
5 Bibhu Male Chennai
9 Bratati Mohanty Female Bangalore
8 Rupa Mahapatra Female Bangalore
13 Salini Female Mumbai
14 Priyanka Chopra Female Mumbai
EmpSalary.txt
1 10000
12 12000
2 15900
3 9000
10 8000
11 13400
4 7600
6 22000
7 17000
5 16800
9 9800
8 10000
13 11000
14 12500
Final result I need is:
Mumbai male <avgsalary amount>
Mumbai female <avgsalary amount>
Bangalore male <avgsalary amount>
Bangalore female <avgsalary amount>
Chennai male <avgsalary amount>
Chennai female <avgsalary amount>
回答1:
You can solve this problem using simple foreach stmt
so don't go for nested foreach stmt.
Group command
will not work inside nested Foreach, its restricted in pig. Only few commands are allowed inside the nested foreach (CROSS, DISTINCT, FILTER, FOREACH, LIMIT, and ORDER BY).
Can you change your script like this?
EmpDetail = load '/Users/bmohanty6/EmployeeDetails/EmpDetail.txt' as (id:int, name:chararray, gender:chararray, location:chararray);
SalaryDetail = load '/Users/bmohanty6/EmployeeDetails/EmpSalary.txt' as (id:int, salary:float);
JoinedEmpDetail = join EmpDetail by id, SalaryDetail by id;
GroupedByLocation = group JoinedEmpDetail by (location,gender);
AverageSalary = FOREACH GroupedByLocation GENERATE FLATTEN(group),AVG(JoinedEmpDetail.SalaryDetail::salary);
DUMP AverageSalary;
Output:
(Mumbai,Male,8300.0)
(Mumbai,Female,11750.0)
(Chennai,Male,16900.0)
(Chennai,Female,17000.0)
(Bangalore,Male,11825.0)
(Bangalore,Female,9900.0)
来源:https://stackoverflow.com/questions/27791699/find-average-by-joining-two-datasets