Hive Explode / Lateral View multiple arrays

刺人心 2020-11-30 02:23

I have a hive table with the following schema:

COOKIE  | PRODUCT_ID | CAT_ID     | QTY
1234123 | [1,2,3]    | [r,t,null] | [2,1,null]

How

5 answers
  • 2020-11-30 02:31

    You can do this with posexplode, which returns, for each element of the array, an integer from 0 to n-1 giving that element's position. Use this integer, call it pos (for position), to fetch the matching values from the other arrays with bracket notation, like this:

    -- One output row per element of product_id; pos indexes the other arrays.
    select
      cookie,
      n.pos as position,
      n.prd_id as product_id,
      cat_id[pos] as catalog_id,
      qty[pos] as qty
    from table
    lateral view posexplode(product_id) n as pos, prd_id;
    

    This avoids importing UDFs and also avoids joining the exploded arrays against each other, so it performs much better.
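
    To make the row-by-row behaviour concrete, here is a minimal plain-Python sketch (not Hive code) that emulates posexplode plus the bracket lookup on the sample row from the question. The helper names `posexplode`, `element_at` and `explode_row` are hypothetical stand-ins written for this illustration, None stands in for NULL, and returning None for an out-of-range position is an assumption of the sketch.

```python
# Plain-Python emulation of posexplode plus index lookup (not Hive code).
# The row below is the sample row from the question; None stands in for NULL.

def posexplode(values):
    """Return (pos, value) pairs with pos starting at 0."""
    return list(enumerate(values))

def element_at(values, pos):
    """Look up values[pos]; return None when pos is past the end (sketch assumption)."""
    return values[pos] if 0 <= pos < len(values) else None

def explode_row(row):
    """Expand one row into one output row per element of product_id."""
    return [
        (row["cookie"], pos, prd_id,
         element_at(row["cat_id"], pos), element_at(row["qty"], pos))
        for pos, prd_id in posexplode(row["product_id"])
    ]

row = {
    "cookie": "1234123",
    "product_id": [1, 2, 3],
    "cat_id": ["r", "t", None],
    "qty": [2, 1, None],
}

for out in explode_row(row):
    print(out)
```

    Each printed tuple corresponds to one row of the query result: cookie, position, product_id, catalog_id, qty.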

  • 2020-11-30 02:35

    I found a good solution to this problem that needs no UDF: posexplode.

    SELECT COOKIE,
    ePRODUCT_ID,
    eCAT_ID,
    eQTY
    FROM TABLE
    LATERAL VIEW posexplode(PRODUCT_ID) ePRODUCT_ID AS seqp, ePRODUCT_ID
    LATERAL VIEW posexplode(CAT_ID) eCAT_ID AS seqc, eCAT_ID
    LATERAL VIEW posexplode(QTY) eQTY AS seqq, eQTY
    WHERE seqp = seqc AND seqc = seqq;
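
    As a rough illustration of what this query does conceptually, the plain-Python sketch below (not Hive code) builds every combination of positions from the three arrays and then keeps only the combinations whose positions agree. The function name `cross_join_then_filter` is hypothetical, and how many intermediate rows Hive really materializes depends on its planner, so treat the count as a conceptual upper bound rather than a measurement.

```python
from itertools import product

def cross_join_then_filter(product_id, cat_id, qty):
    """Emulate three lateral views followed by a filter on equal positions."""
    # Every combination of (position, value) pairs across the three arrays.
    candidates = list(
        product(enumerate(product_id), enumerate(cat_id), enumerate(qty))
    )
    # Keep only combinations where all three positions match.
    kept = [
        (p, c, q)
        for (seqp, p), (seqc, c), (seqq, q) in candidates
        if seqp == seqc == seqq
    ]
    return len(candidates), kept

n_candidates, rows = cross_join_then_filter(
    [1, 2, 3], ["r", "t", None], [2, 1, None]
)
print(n_candidates)  # 27 combinations for three arrays of length 3
print(rows)
```

    The candidate count grows as the product of the array lengths, which is why a single posexplode with index lookups tends to be cheaper for long arrays.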
  • 2020-11-30 02:43

    If you are using Spark 2.4 or later with PySpark, use arrays_zip with posexplode:

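    from pyspark.sql.functions import arrays_zip, posexplode

    # Zip the arrays element by element, then emit one row per position.
    df = (df
        .withColumn('zipped', arrays_zip('col1', 'col2'))
        .select('id', posexplode('zipped')))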
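
    For readers without a Spark session at hand, the following plain-Python sketch (not PySpark) mimics the shape of the result: the arrays are zipped into one array of tuples, and each tuple is emitted together with its position. The names `arrays_zip_sketch` and `posexplode_sketch` are hypothetical stand-ins, and padding shorter arrays with None is an assumption about how arrays of unequal length are handled.

```python
from itertools import zip_longest

def arrays_zip_sketch(*arrays):
    """Combine arrays position by position, padding short ones with None (assumption)."""
    return list(zip_longest(*arrays, fillvalue=None))

def posexplode_sketch(values):
    """Return (pos, value) pairs with pos starting at 0."""
    return list(enumerate(values))

zipped = arrays_zip_sketch([1, 2, 3], ["r", "t", None], [2, 1, None])
for pos, col in posexplode_sketch(zipped):
    # col holds the values that share the same position in every input array.
    print(pos, col)
```

    In the PySpark version the zipped elements are structs, so the individual values would then be selected from the exploded column by field name.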
    
  • 2020-11-30 02:53

    You can use the numeric_range and array_index UDFs from Brickhouse ( http://github.com/klout/brickhouse ) to solve this problem. A blog post describes the approach in detail at http://brickhouseconfessions.wordpress.com/2013/03/07/exploding-multiple-arrays-at-the-same-time-with-numeric_range/

    Using those UDFs, the query would be something like:

    select cookie,
       array_index( product_id_arr, n ) as product_id,
       array_index( catalog_id_arr, n ) as catalog_id,
       array_index( qty_id_arr, n ) as qty
    from table
    lateral view numeric_range( size( product_id_arr )) n1 as n;
    
  • 2020-11-30 02:54

    I tried to work through your scenario. Please try this code:

    create table info(cookie string,productid int,catid string,qty string);
    
    insert into table info
    select cookie,productid[myprod],categoryid[mycat],qty[myqty] from table
    lateral view posexplode(productid) pro as myprod,pro
    lateral view posexplode(categoryid) cate as mycat,cate
    lateral view posexplode(qty) q as myqty,q
    where myprod=mycat and mycat=myqty;
    

    Note: if you replace select cookie,productid[myprod],categoryid[mycat],qty[myqty] from table with select cookie,myprod,mycat,myqty from table in the statements above, the output will contain the index of each element within the productid, categoryid and qty arrays instead of the element itself. Hope this helps.
