Joining two Tables in Hive using HiveQL(Hadoop) [duplicate]


眼角桃花 2020-12-06 03:40

2 answers

春和景丽 (OP)

2020-12-06 03:54
You will probably need Hive's TRANSFORM functionality with a custom reducer that matches up the records from the two tables, t1 and t2, where t1 is simply TestingTable1 and t2 is
```
   SELECT
      user_id,
      prod_and_ts.product_id as product_id,
      prod_and_ts.timestamps as timestamps
   FROM 
      TestingTable2 
      LATERAL VIEW explode(purchased_item) exploded_table as prod_and_ts
```
as I explained in my answer to another question of yours. The full query would then be:
```
FROM (
   FROM (
      SELECT
         buyer_id,
         item_id,
         created_time,
         id 
      FROM (
         SELECT
            buyer_id,
            item_id,
            created_time,
            't1' as id
         FROM
            TestingTable1 t1
         UNION ALL
         SELECT
            user_id as buyer_id,
            prod_and_ts.product_id as item_id,
            prod_and_ts.timestamps as created_time,
            't2' as id
         FROM 
            TestingTable2
            LATERAL VIEW explode(purchased_item) exploded_table as prod_and_ts
         )t
      )x
      MAP
         buyer_id,
         item_id,
         created_time,
         id
      USING '/bin/cat'
      AS
         buyer_id,
         item_id,
         create_time,
         id
      CLUSTER BY
         buyer_id
      ) map_output
   REDUCE 
      buyer_id,
      item_id,
      create_time,
      id
   USING 'my_custom_reducer'
   AS
      buyer_id,
      item_id,
      create_time,
      product_id,
      timestamps;
```
The query above has two distinct parts, MAP and REDUCE, separated by a shuffle phase (expressed by CLUSTER BY buyer_id) that Hive takes care of automatically. The Map part reads from both tables and tags every record with an identifier (the id column) that says which table the record came from. The Shuffle phase groups the records by buyer_id. The Reduce phase takes in all the records for a given buyer_id and emits only those that satisfy the matching criteria. You will have to write the reducer yourself, based on your matching criteria, in any language you like. All records with the same buyer_id are guaranteed to go to the same reducer script, and because CLUSTER BY also sorts on buyer_id, they arrive as consecutive lines.
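
To make the Reduce step concrete, here is a minimal sketch of what `my_custom_reducer` could look like in Python. It relies on Hive's default TRANSFORM behaviour of passing rows to the script as tab-separated strings on stdin and reading tab-separated rows back from stdout. The matching rule in `is_match` (same item on both sides) is purely an assumption for illustration, as are the function names; swap in your own criteria.

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def parse(lines):
    """Yield [buyer_id, item_id, create_time, id] from tab-separated lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4:  # skip malformed rows
            yield fields


def is_match(t1_row, t2_row):
    # Hypothetical criterion: the item is the same in both tables.
    # Replace this with your real matching rule.
    return t1_row[1] == t2_row[1]


def reduce_rows(rows):
    """Pair each t1 record with every matching t2 record of the same buyer.

    Assumes rows for one buyer_id are consecutive, which CLUSTER BY
    buyer_id provides because it sorts as well as distributes.
    """
    for buyer_id, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        t1_rows = [r for r in group if r[3] == "t1"]
        t2_rows = [r for r in group if r[3] == "t2"]
        for a in t1_rows:
            for b in t2_rows:
                if is_match(a, b):
                    # buyer_id, item_id, create_time, product_id, timestamps
                    yield (buyer_id, a[1], a[2], b[1], b[2])


def main():
    for out in reduce_rows(parse(sys.stdin)):
        print("\t".join(out))


if __name__ == "__main__":
    main()
```

If you go this route, the script would have to be shipped to the cluster first (for example with `ADD FILE`), and the `USING` clause would then name it, e.g. `USING 'python my_custom_reducer.py'` if you saved it under that file name.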

There might be an easier way to do this, but this is the method I can think of right now. Good luck! To better understand why I chose this method, see my recent answer here.