PySpark mapping multiple variables

Submitted by 天涯浪子 on 2020-06-05 11:39:15

Question


The code below maps values and column names from my reference df to my actual dataset, finds exact matches, and, when an exact match is found, returns the OutputItemNameByValue. However, I'm trying to add a rule so that when PrimaryLookupAttributeValue = DEFAULT, the OutputItemNameByValue is also returned.

The approach I'm trying is to create a new dataframe containing the null values, since a null means no match was produced by the code below. The next step would then be to target the null values whose corresponding PrimaryLookupAttributeValue = DEFAULT and replace the null with that row's OutputItemNameByValue.

  from itertools import chain
  from pyspark.sql.functions import array, col, collect_set, concat_ws, create_map, lit

  #create a map key/value based on columns from reference_df
  map_key = concat_ws('\0', final_reference.PrimaryLookupAttributeName, final_reference.PrimaryLookupAttributeValue)
  map_value = final_reference.OutputItemNameByValue

  #list of concatenated mappings to get the corresponding OutputValues from the reference table
  d = final_reference.agg(collect_set(array(concat_ws('\0', 'PrimaryLookupAttributeName', 'PrimaryLookupAttributeValue'), 'OutputItemNameByValue')).alias('m')).first().m
  #display(d)

  #flatten the mapped key/value pairs into a single MapType column
  mappings = create_map([lit(i) for i in chain.from_iterable(d)])

  #dataframe with the corresponding matched OutputValues
  datasetPrimaryAttributes_False = datasetMatchedPortfolio.select("*", *[mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c, c_name in matchedAttributeName_List.items()])
  display(datasetPrimaryAttributes_False)
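
As an aside, here is a minimal sketch (toy dataframe and toy map, names illustrative only, not the real reference data) of how this map lookup behaves, which is why unmatched values come back as null:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, create_map, lit

spark = SparkSession.builder.getOrCreate()

# hypothetical toy data: one status that exists in the map, one that does not
toy = spark.createDataFrame([("Terminated",), ("Draft",)], ["LeaseStatus"])
toy_map = create_map(lit("LeaseStatus\x00Terminated"), lit("Terminated"))

toy.select(
    "LeaseStatus",
    toy_map[concat_ws('\0', lit("LeaseStatus"), col("LeaseStatus"))].alias("ConformedLeaseStatusName"),
).show()
# 'Terminated' hits the key 'LeaseStatus\x00Terminated'; 'Draft' has no key, so the lookup yields null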

reference df

+--------------+----------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-------------------+-----------------------+---------------------------+-----------------------------+-------------------+---------------------+---------------------+
|SourceSystemId|SourceSystemName|     Portfolio|DomainId|       DomainName| PrimaryLookupEntity|PrimaryLookupAttributeName|SecondaryLookupAttributeName|StandardDomainMapId|StandardDomainItemMapId|PrimaryLookupAttributeValue|SecondaryLookupAttributeValue|OutputItemIdByValue|OutputItemCodeByValue|OutputItemNameByValue|
+--------------+----------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-------------------+-----------------------+---------------------------+-----------------------------+-------------------+---------------------+---------------------+
|             4|          ABC123|ALL_PORTFOLIOS|  100022|LeaseRecoveryType|ABC123_FF_Leases.csv|         LeaseRecoveryType|                            |                  9|                    329|             Gross-modified|                             |                 15|                     |       Modified Gross|
|             4|          ABC123|ALL_PORTFOLIOS|  100022|LeaseRecoveryType|ABC123_FF_Leases.csv|         LeaseRecoveryType|                            |                  9|                    330|                      Gross|                             |                 11|                     |                Gross|
|             4|          ABC123|ALL_PORTFOLIOS|  100022|LeaseRecoveryType|ABC123_FF_Leases.csv|         LeaseRecoveryType|                            |                  9|                    331|          Gross w/base year|                             |                 18|                     |       Modified Gross|
|             4|          ABC123|ALL_PORTFOLIOS|  100011|      LeaseStatus|ABC123_FF_Leases.csv|               LeaseStatus|                            |                 10|                   1872|                  Abandoned|                             |                 10|                     |               Active|
|             4|          ABC123|ALL_PORTFOLIOS|  100011|      LeaseStatus|ABC123_FF_Leases.csv|               LeaseStatus|                            |                 10|                    332|                 Terminated|                             |                 10|                     |           Terminated|
|             4|          ABC123|ALL_PORTFOLIOS|  100011|      LeaseStatus|ABC123_FF_Leases.csv|               LeaseStatus|                            |                 10|                   1873|                    Archive|                             |                 11|                     |              Expired|
|             4|          ABC123|ALL_PORTFOLIOS|  100011|      LeaseStatus|ABC123_FF_Leases.csv|               LeaseStatus|                            |                 10|                    333|                 DEFAULT   |                             |                 10|                     |              Pending|
+--------------+----------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-------------------+-----------------------+---------------------------+-----------------------------+-------------------+---------------------+---------------------+

dataset

This dataset is filtered down to rows containing null, which means that no match was found. However, some of those null values need to be replaced by the OutputValue of the corresponding DEFAULT row.

For example, the reference df rows sampled below will never yield a match against the dataset, because the value "DEFAULT" will never exist in the dataset. However, given a matching Domain | PrimaryLookupName, if PrimaryLookupValue = DEFAULT, I want to output that row's OutputItemNameByValue instead of the null that I am outputting right now.

There should only be a null if no match is found at all. Since asfsadf will not yield any matches, its output should be null.

Domain      | PrimaryLookupName | SecondaryLookupName | PrimaryLookupValue | OutputItemNameByValue
------------|-------------------|---------------------|--------------------|----------------------
LeaseStatus | LeaseStatus       |                     | DEFAULT            | Pending
LeaseStatus | LeaseStatus       |                     | asfsadf            | asfdsdf
from functools import reduce
from pyspark.sql import functions as f

#rows where at least one column is null, i.e. where no match was found
datasetPrimaryA_Nulls_False = datasetPrimaryAttributes_False.where(reduce(lambda x, y: x | y, (f.col(x).isNull() for x in datasetPrimaryAttributes_False.columns)))
#display(datasetPrimaryA_Nulls_False)
datasetPrimaryA_Nulls_False.show()



+----------------+----------------+-------------+----------------+-----------+-----------------+---------------+--------------+----------------------+------------------------------+------------------------+
|SourceSystemName|       Portfolio|SourceLeaseID|SourcePropertyID|LeaseStatus|LeaseRecoveryType|      LeaseType| PortfolioRule|ConformedLeaseTypeName|ConformedLeaseRecoveryTypeName|ConformedLeaseStatusName|
+----------------+----------------+-------------+----------------+-----------+-----------------+---------------+--------------+----------------------+------------------------------+------------------------+
|          ABC123|ABC123 Portfolio|         4265|            1892|      Draft|             null|Field Associate|ALL_PORTFOLIOS|                   N/A|                          null|                 Pending|
|          ABC123|ABC123 Portfolio|         4266|            1893|      Draft|             null|Field Associate|ALL_PORTFOLIOS|                   N/A|                          null|                 Pending|
|          ABC123|ABC123 Portfolio|         1676|            1894| Terminated|             null|  Expense Lease|ALL_PORTFOLIOS|                Tenant|                          null|              Terminated|
|          ABC123|ABC123 Portfolio|         4304|            1937|      Draft|             null|Field Associate|ALL_PORTFOLIOS|                   N/A|                          null|                 Pending|
+----------------+----------------+-------------+----------------+-----------+-----------------+---------------+--------------+----------------------+------------------------------+------------------------+
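
A quick note on the filter above: reduce folds the per-column isNull() checks into one combined predicate. A minimal sketch on a hypothetical toy dataframe (made-up column values, two of the real column names):

from functools import reduce
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame(
    [("Draft", None), ("Terminated", "Gross")],
    ["LeaseStatus", "ConformedLeaseRecoveryTypeName"])

# reduce ORs the per-column conditions together, equivalent to
# f.col("LeaseStatus").isNull() | f.col("ConformedLeaseRecoveryTypeName").isNull()
any_null = reduce(lambda x, y: x | y, (f.col(c).isNull() for c in toy.columns))
toy.where(any_null).show()   # keeps only the ("Draft", None) row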

Answer 1:


From the discussion in comments, I think you just need to build a default mapping from the existing one and then use the coalesce() function to find the first non-null value, see below:

from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map, coalesce

# skip some old code

d    
#[['LeaseStatus\x00Abandoned', 'Active'],
# ['LeaseStatus\x00DEFAULT', 'Pending'],
# ['LeaseRecoveryType\x00Gross-modified', 'Modified Gross'],
# ['LeaseStatus\x00Archive', 'Expired'],
# ['LeaseStatus\x00Terminated', 'Terminated'],
# ['LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'],
# ['LeaseRecoveryType\x00Gross', 'Gross']]

# original mapping
mappings = create_map([ lit(j) for i in d for j in i ])

# default mapping
mappings_default = create_map([ lit(j.split('\0')[0]) for i in d if i[0].upper().endswith('\x00DEFAULT') for j in i ])
#Column<b'map(LeaseStatus, Pending)'>

# a set of available PrimaryLookupAttributeName
available_list = set([ i[0].split('\0')[0] for i in d ])
# {'LeaseRecoveryType', 'LeaseStatus'}

# use coalesce to find the first non-null value from mappings, mappings_default etc.
datasetPrimaryAttributes_False = datasetMatchedPortfolio.select("*",*[ 
  coalesce(
    mappings[concat_ws('\0', lit(c), col(c))],
    mappings_default[c],
    lit("Not Specified at Source" if c in available_list else "Lookup not found")
  ).alias(c_name) for c,c_name in matchedAttributeName_List.items()])

Some explanation:

(1) d is a list of lists retrieved from reference_df. We use the list comprehension [ lit(j) for i in d for j in i ] to flatten it into a single list of alternating keys and values, and pass that flattened list to the create_map function.
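
For example, a minimal sketch on a hypothetical two-entry d (shortened from the list shown above):

from pyspark.sql.functions import create_map, lit

d_sample = [['LeaseStatus\x00DEFAULT', 'Pending'],
            ['LeaseRecoveryType\x00Gross', 'Gross']]

# flatten the list of [key, value] pairs into one alternating key/value list
flat = [lit(j) for i in d_sample for j in i]
# -> [lit('LeaseStatus\x00DEFAULT'), lit('Pending'), lit('LeaseRecoveryType\x00Gross'), lit('Gross')]

mappings_sample = create_map(flat)
# a single MapType column: {'LeaseStatus\x00DEFAULT': 'Pending', 'LeaseRecoveryType\x00Gross': 'Gross'}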

(2) mappings_default is built in a similar way, but adds an if condition as a filter, keeping only entries whose map_key (the first item of the inner list, i[0]) ends with \x00DEFAULT, and then uses split to strip that \x00DEFAULT suffix off the map_key, leaving only the PrimaryLookupAttributeName as the key.
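
Continuing the same sketch, the default-mapping comprehension keeps only the DEFAULT pair from the hypothetical d_sample above and strips the suffix from its key:

# only ['LeaseStatus\x00DEFAULT', 'Pending'] survives the filter;
# 'LeaseStatus\x00DEFAULT'.split('\0')[0] -> 'LeaseStatus', 'Pending'.split('\0')[0] -> 'Pending'
default_pairs = [lit(j.split('\0')[0])
                 for i in d_sample if i[0].upper().endswith('\x00DEFAULT')
                 for j in i]
mappings_default_sample = create_map(default_pairs)
# Column<b'map(LeaseStatus, Pending)'>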



Source: https://stackoverflow.com/questions/61964179/pyspark-mapping-multiple-variables
