Question
There's probably something very trivial that I'm missing, but I just can't get this to work. I have a "movies" relation with the fields title, actor, year and role. What I want is one result per title, along with a nested bag containing (actor, role) pairs.
If I just do
    group movies by title
I end up with results like (title, {movie objects}), which would be perfect, except that the title and year also appear in the movie objects there. I want just the actor and role.
I also tried
    foreach movie_groups generate group, movies.actor, movies.role
but then I end up with (title, {all actors}, {all roles}), which is obviously wrong.
In SQL this would be so trivial that I can't help but feel incredibly stupid for not being able to figure this out. Would anyone have a suggestion?
Answer 1:
It would be helpful to see the format of movies, but I'm assuming it is something like this:
MovieTitle1 Year1 Actor1 Role1
MovieTitle1 Year2 Actor2 Role2
etc.
In that case, I would do it like this:
    result = FOREACH (GROUP movies BY title)
             GENERATE FLATTEN(group), movies.(actor, role) AS actors;
Also, you mention that the movies contain the year as well. If you do not need that field it might be worthwhile to project only the fields that you need (title, actor, role) first.
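As a rough sketch of what projecting first could look like: the input path 'movies.tsv', the tab delimiter, the field types, and the alias name slim are all assumptions made for illustration, since the question does not show how movies is loaded.

```pig
-- Assumed input path, delimiter and schema; adjust to the real data.
movies = LOAD 'movies.tsv' USING PigStorage('\t')
         AS (title:chararray, year:int, actor:chararray, role:chararray);

-- Drop the year early so it is not carried through the GROUP.
slim = FOREACH movies GENERATE title, actor, role;

-- After grouping, each bag holds (title, actor, role) tuples.
grouped = GROUP slim BY title;

-- Project the bag down to (actor, role) pairs, one bag per title.
result = FOREACH grouped
         GENERATE group AS title, slim.(actor, role) AS actors;

DUMP result;
```

Dropping the unused field before the GROUP should mean less data has to be shuffled, and the bag projection slim.(actor, role) keeps each actor paired with the matching role instead of producing two separate bags.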
Source: https://stackoverflow.com/questions/17370222/selecting-fields-after-grouping-in-pig