pySpark mapping multiple variables

后端 未结 1 576
谎友^
谎友^ 2020-12-07 06:12

The code below maps values and column names of my reference df with my actual dataset, finding exact matches and if an exact match is found, return the OutputValue

相关标签:
1条回答
  • 2020-12-07 07:08

    From discussion in comments, I think you just need to add a default mappings from the existing one and then use coalease() function to find the first non-null value, see below:

    from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map, coalesce
    
    # skip some old code
    
    d    
    #[['LeaseStatus\x00Abandoned', 'Active'],
    # ['LeaseStatus\x00DEFAULT', 'Pending'],
    # ['LeaseRecoveryType\x00Gross-modified', 'Modified Gross'],
    # ['LeaseStatus\x00Archive', 'Expired'],
    # ['LeaseStatus\x00Terminated', 'Terminated'],
    # ['LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'],
    # ['LeaseRecoveryType\x00Gross', 'Gross']]
    
    # original mapping
    mappings = create_map([ lit(j) for i in d for j in i ])
    
    # default mapping
    mappings_default = create_map([ lit(j.split('\0')[0]) for i in d if i[0].upper().endswith('\x00DEFAULT') for j in i ])
    #Column<b'map(LeaseStatus, Pending)'>
    
    # a set of available PrimaryLookupAttributeName
    available_list = set([ i[0].split('\0')[0] for i in d ])
    # {'LeaseRecoveryType', 'LeaseStatus'}
    
    # use coalesce to find the first non-null values from mappings, mappings_defaul etc
    datasetPrimaryAttributes_False = datasetMatchedPortfolio.select("*",*[ 
      coalesce(
        mappings[concat_ws('\0', lit(c), col(c))],
        mappings_default[c],
        lit("Not Specified at Source" if c in available_list else "Lookup not found")
      ).alias(c_name) for c,c_name in matchedAttributeName_List.items()])
    

    Some explanation:

    (1) d is a list of lists retrieved from the reference_df, we use a list comprehension [ lit(j) for i in d for j in i ] to flatten this to a list and apply the flattened list to the create_map function:

    (2) The mappings_default is similar to the above, but add a if condition to serve as a filter and keep only entries having PrimaryLookupAttributeValue (which is the first item of the inner list i[0]) ending with \x00DEFAULT and then use split to strip PrimaryLookupAttributeValue(which is basically \x00DEFAULT) off from the map_key.

    0 讨论(0)
提交回复
热议问题