Question
I have code that currently works, but I'd like to make it more efficient and avoid hard-coding:
1) Avoid hard-coding: for NotDefined_filterDomainLookup, I would like to reference the default_reference df for the corresponding Code and Name when Id = 4, instead of hard-coding the Code and Name values.
2) I repeat the same code and process for Id/Code/Name. Is there a way to loop over all of that instead of coding each scenario? How can I iterate over the current logic?
Question 1
List of column names and their corresponding new column names:
test_matchedAttributeName_List =dict(matchedDomains.agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)
Output: {'LeaseType': 'ConformedLeaseTypeName', 'LeaseRecoveryType': 'ConformedLeaseRecoveryTypeName', 'LeaseStatus': 'ConformedLeaseStatusName'}
Working code, except that I'm looking to avoid hard-coding. Specifically, I would like to reference the default_reference df for the corresponding Code and Name when Id = 4:
cond = col('PrimaryLookupAttributeName').isNull() & col('SecondaryLookupAttributeName').isNull()
NotDefined_filterDomainLookup = filterDomainLookup \
.withColumn('OutputItemIdByAttribute', when(cond, lit('4')).otherwise(col('OutputItemIdByAttribute'))) \
.withColumn('OutputItemCodeByAttribute', when(cond, lit('N/D')).otherwise(col('OutputItemCodeByAttribute'))) \
.withColumn('OutputItemNameByAttribute', when(cond, lit('Not Defined')).otherwise(col('OutputItemNameByAttribute')))
NotDefined_filterDomainLookup
+--------------+----------------+-----------------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-----------------------+-------------------------+-------------------------+
|SourceSystemId|SourceSystemName|IsPortfolioSpecificRule| Portfolio|DomainId| DomainName| PrimaryLookupEntity|PrimaryLookupAttributeName|SecondaryLookupAttributeName|OutputItemIdByAttribute|OutputItemCodeByAttribute|OutputItemNameByAttribute|
+--------------+----------------+-----------------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-----------------------+-------------------------+-------------------------+
| 4| ABC123| false|ALL_PORTFOLIOS| 100011| LeaseStatus|ABC123_FF_Leases.csv| LeaseStatus| | null| null| null|
| 4| ABC123| false|ALL_PORTFOLIOS| 100022|LeaseRecoveryType|ABC123_FF_Leases.csv| LeaseRecoveryType| | null| null| null|
| 4| ABC123| false|ALL_PORTFOLIOS| 100022|LeaseRecoveryType|ABC123_FF_Leases.csv| LeaseRecoveryType| Net| null| null| null|
| 4| ABC123| false|ALL_PORTFOLIOS| 100028| LeaseType|ABC123_FF_Leases.csv| LeaseType| | null| null| null|
| 4| ABC123| true| Boeing| 100011| LeaseStatus|ABC123_FF_Leases.csv| LeaseStatus| | null| null| null|
+--------------+----------------+-----------------------+--------------+--------+-----------------+--------------------+--------------------------+----------------------------+-----------------------+-------------------------+-------------------------+
default_reference
+--------+----------+------+--------+--------------------+
|DomainId|DomainName|ItemId|ItemCode| ItemName|
+--------+----------+------+--------+--------------------+
| 0| Default| 0| N/S|Not Specified at ...|
| 0| Default| 1| N/F| Lookup Not Found|
| 0| Default| 2| N/I|Not Implemented i...|
| 0| Default| 3| N/A| Not Applicable|
| 0| Default| 4| N/D| Not Defined|
| 0| Default| 5| O/R| Consumer Override|
| 0| Default| 6| SCM|Standard Column M...|
+--------+----------+------+--------+--------------------+
Question 2
Working code, except that I'm looking to simplify it and avoid repeating logic:
#----
#--AttributeId
#create a map based on columns from reference_df
NotDefined_AttributeId_List =dict(matchedDomains.agg(collect_set(array('DomainName', 'TargetAttributeForId')).alias('m')).first().m)
NotDefinedId_map_key = concat_ws('\0', NotDefined_filterDomainLookup.DomainName, NotDefined_filterDomainLookup.PrimaryLookupAttributeName)
NotDefinedId_map_value = NotDefined_filterDomainLookup.OutputItemIdByAttribute
testingId = NotDefined_filterDomainLookup.agg(collect_set(array(concat_ws('\0','DomainName','PrimaryLookupAttributeName'), col('OutputItemIdByAttribute').astype('string'))).alias('id')).first().id
#iterate through mapped values
testing_mappings = create_map([lit(i) for i in chain.from_iterable(testingId)])
#--AttributeCode
NotDefined_AttributeCode_List =dict(matchedDomains.agg(collect_set(array('DomainName', 'TargetAttributeForCode')).alias('m')).first().m)
NotDefinedCode_map_key = concat_ws('\0', NotDefined_filterDomainLookup.DomainName, NotDefined_filterDomainLookup.PrimaryLookupAttributeName)
NotDefinedCode_map_value = NotDefined_filterDomainLookup.OutputItemCodeByAttribute
testingCode = NotDefined_filterDomainLookup.agg(collect_set(array(concat_ws('\0','DomainName','PrimaryLookupAttributeName'), col('OutputItemCodeByAttribute').astype('string'))).alias('code')).first().code
#iterate through mapped values
testing_mappings = create_map([lit(i) for i in chain.from_iterable(testingCode)])
#--AttributeName
NotDefined_AttributeName_List =dict(matchedDomains.agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)
NotDefinedName_map_key = concat_ws('\0', NotDefined_filterDomainLookup.DomainName, NotDefined_filterDomainLookup.PrimaryLookupAttributeName)
NotDefinedName_map_value = NotDefined_filterDomainLookup.OutputItemNameByAttribute
testingName = NotDefined_filterDomainLookup.agg(collect_set(array(concat_ws('\0','DomainName','PrimaryLookupAttributeName'), col('OutputItemNameByAttribute').astype('string'))).alias('name')).first().name
#iterate through mapped values
testing_mappings = create_map([lit(i) for i in chain.from_iterable(testingName)])
#--Dataframe with corresponding matched Outputs for Not Defined TargetAttribute columns
testing_NotDefined = datasetMatchedPortfolio.select("*",*[ testing_mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in NotDefined_AttributeId_List.items() if c_name]).select("*",*[ testing_mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in NotDefined_AttributeCode_List.items() if c_name]).select("*",*[ testing_mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in NotDefined_AttributeName_List.items() if c_name])
Final output. Currently, all values are null since cond does not have any nulls. I'm also working on making this conditional: if cond has nothing, then do nothing; otherwise, do Step 1 and Step 2, returning the output below.
+----------------+----------------+-------------+----------------+-------------+-----------+-----------------+-------------+--------------+--------------------+----------------------+----------------------------+----------------------+------------------------------+------------------------+
|SourceSystemName| Portfolio|SourceLeaseID|SourcePropertyID|ClientLeaseID|LeaseStatus|LeaseRecoveryType| LeaseType| PortfolioRule|ConformedLeaseTypeId|ConformedLeaseStatusId|ConformedLeaseRecoveryTypeId|ConformedLeaseTypeName|ConformedLeaseRecoveryTypeName|ConformedLeaseStatusName|
+----------------+----------------+-------------+----------------+-------------+-----------+-----------------+-------------+--------------+--------------------+----------------------+----------------------------+----------------------+------------------------------+------------------------+
| ABC123|ABC123 Portfolio| 1814| 1865| null| Terminated| Gross|Expense Lease|ALL_PORTFOLIOS| null| null| null| null| null| null|
| ABC123|ABC123 Portfolio| 1508| 1866| null| Active|Gross w/base year|Expense Lease|ALL_PORTFOLIOS| null| null| null| null| null| null|
| ABC123|ABC123 Portfolio| 1826| 1875| null| Active| Gross-modified|Expense Lease|ALL_PORTFOLIOS| null| null| null| null| null| null|
| ABC123|ABC123 Portfolio| 1865| 1881| null| Active| Net-triple|Expense Lease|ALL_PORTFOLIOS| null| null| null| null| null| null|
| ABC123|ABC123 Portfolio| 1831| 1883| null| Active|Gross w/base year|Expense Lease|ALL_PORTFOLIOS| null| null| null| null| null| null|
+----------------+----------------+-------------+----------------+-------------+-----------+-----------------+-------------+--------------+--------------------+----------------------+----------------------------+----------------------+------------------------------+------------------------+
Answer 1:
For Question 2, based on your code, I'd suggest the following adjustments:
- Set up item_keys containing Id, Name and Code, and merge the repeated logic using list comprehensions
- Use struct instead of array to implement the above logic
- There is no need to create a Python dictionary for NotDefined_Attribute_List; a list of tuples is enough and works better
See the steps below:
(1) Set up two aggregate expressions to calculate item_map, used for testing_mappings and NotDefined_Attribute_List. Both named_struct and struct are shown (two methods for the same task, for your practice):
from itertools import chain
from pyspark.sql.functions import expr, collect_set, struct, col, create_map, lit, concat_ws
item_keys = ['Id', 'Name', 'Code']
# use SQL expression
m1_by_sql_expr = expr("""
collect_set(
named_struct(
'attr_name', PrimaryLookupAttributeName,
'attr_value', PrimaryLookupAttributeValue,
'Id', OutputItemIdByValue,
'Name', OutputItemNameByValue,
'Code', OutputItemCodeByValue
)
) as item_map
""")
# use PySpark API functions
m2_by_func = collect_set(
struct(
col('DomainName').alias('domain'),
col('TargetAttributeForId').alias('Id'),
col('TargetAttributeForName').alias('Name'),
col('TargetAttributeForCode').alias('Code')
)
).alias('item_map')
(2) Set up the mapping from ItemKey (Id, Code or Name) + PrimaryLookupAttributeName + PrimaryLookupAttributeValue to ItemValue:
m1 = NotDefined_filterDomainLookup.agg(m1_by_sql_expr).first().item_map
"""create a list of tuples of (map_key, map_value) to create MapType column:
| map_key = concat_ws('\0', item_key, attr_name, attr_value)
| map_value = item_value
"""
testingId = [('\0'.join([k, row.attr_name, row.attr_value]), row[k]) for row in m1 for k in item_keys if row[k]]
#[('Id\x00LeaseRecoveryType\x00Gross w/base year', '18'),
# ('Name\x00LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'),
# ('Id\x00LeaseStatus\x00Abandoned', '10'),
# ('Name\x00LeaseStatus\x00Abandoned', 'Active'),
# ('Id\x00LeaseStatus\x00Draft', '10'),
# ('Name\x00LeaseStatus\x00Draft', 'Pending'),
# ('Id\x00LeaseStatus\x00Archive', '11'),
# ('Name\x00LeaseStatus\x00Archive', 'Expired'),
# ('Id\x00LeaseStatus\x00Terminated', '10'),
# ('Name\x00LeaseStatus\x00Terminated', 'Terminated'),
# ('Id\x00LeaseRecoveryType\x00Gross', '11'),
# ('Name\x00LeaseRecoveryType\x00Gross', 'Gross'),
# ('Id\x00LeaseRecoveryType\x00Gross-modified', '15'),
# ('Name\x00LeaseRecoveryType\x00Gross-modified', 'Modified Gross')]
# Note: this could become a problem if there are too many entries.
testing_mappings = create_map([lit(i) for i in chain.from_iterable(testingId)])
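To see what the keys in testing_mappings look like without running Spark, here is a minimal plain-Python sketch of step (2). The ItemRow type and the row values are made up for illustration (they stand in for the structs collected into item_map), and a plain dict stands in for the MapType column:

```python
from collections import namedtuple

# Hypothetical stand-in for the structs collected into item_map.
ItemRow = namedtuple("ItemRow", ["attr_name", "attr_value", "Id", "Name", "Code"])

item_keys = ["Id", "Name", "Code"]

# Illustrative rows only; the real values come from NotDefined_filterDomainLookup.
rows = [
    ItemRow("LeaseStatus", "Terminated", "10", "Terminated", None),
    ItemRow("LeaseRecoveryType", "Gross", "11", "Gross", "GRS"),
]

# One entry per (item_key, attr_name, attr_value); empty item values are skipped.
pairs = [
    ("\0".join([k, row.attr_name, row.attr_value]), getattr(row, k))
    for row in rows
    for k in item_keys
    if getattr(row, k)
]

# A dict plays the role of the MapType column built by create_map.
lookup = dict(pairs)
```

Because the item key is part of the map key, a single map can serve Id, Name and Code at once, which is what removes the three copies of the original logic.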
(3) Create NotDefined_Attribute_List (same logic as in (2), using PySpark API functions for m2):
m2 = matchedDomains.agg(m2_by_func).first().item_map
NotDefined_Attribute_List = [(k, row.domain, row[k]) for row in m2 for k in item_keys if row[k]]
(4) Get a list of additional columns based on NotDefined_Attribute_List:
additional_cols = [
    testing_mappings[concat_ws('\0', lit(k), lit(c), col(c))].alias(c_name)
    for k, c, c_name in NotDefined_Attribute_List
]
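The comprehension in (4) can be read as: for every (item key, source column, target column) tuple, look up the record's value in the map. A rough plain-Python equivalent for a single record is sketched below. The map entries, the record and the add_conformed_columns helper are all hypothetical and only mirror the shape of the Spark expressions:

```python
# Hypothetical map entries in the same key layout as testing_mappings.
lookup = {
    "Id\0LeaseStatus\0Terminated": "10",
    "Name\0LeaseStatus\0Terminated": "Terminated",
}

# (item_key, source column, target column), as in NotDefined_Attribute_List.
attribute_list = [
    ("Id", "LeaseStatus", "ConformedLeaseStatusId"),
    ("Name", "LeaseStatus", "ConformedLeaseStatusName"),
    ("Name", "LeaseType", "ConformedLeaseTypeName"),
]

# A single made-up record standing in for one row of datasetMatchedPortfolio.
record = {"LeaseStatus": "Terminated", "LeaseType": "Expense Lease"}


def add_conformed_columns(record, attribute_list, lookup):
    """Return a copy of record with one extra column per attribute tuple."""
    out = dict(record)
    for k, c, c_name in attribute_list:
        # A missing key yields None, the analogue of the null values
        # shown in the question's final output.
        out[c_name] = lookup.get("\0".join([k, c, record[c]]))
    return out


result = add_conformed_columns(record, attribute_list, lookup)
```

In the Spark version the same lookup is expressed once per target column and evaluated for every row by the single select in step (5).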
(5) select the additional columns
if count_ND > 0:
    # move code above in (2), (3) and (4) here
    # set up testing_NotDefined
    testing_NotDefined = datasetMatchedPortfolio.select("*", *additional_cols)
else:
    print("no Not Defines exist")
Answer 2:
Clean code
map_ids = dict((int(e[0]), {'code':e[1],'name':e[2]}) for e in default.agg(collect_set(array('ItemId','ItemCode','ItemName')).alias('id')).first().id)
cond = col('PrimaryLookupAttributeName').isNull() & col('SecondaryLookupAttributeName').isNull()
item_id = 4
item = map_ids[item_id]
NotDefined_filterDomainLookup = filterDomainLookup \
.filter(cond) \
.withColumn('OutputItemIdByValue', lit(item_id)) \
.withColumn('OutputItemCodeByValue', lit(item['code'])) \
.withColumn('OutputItemNameByValue', lit(item['name']))
#display(NotDefined_filterDomainLookup )
count_ND = NotDefined_filterDomainLookup.count()
#-- Prep to combine with filterDomainLookup view in order to end with OutputItem..values with Conformed column names
#--AttributeId
#create a map based on columns from reference_df
NotDefined_AttributeId_List =dict(matchedDomains.agg(collect_set(array('DomainName', 'TargetAttributeForId')).alias('m')).first().m)
#print(NotDefined_AttributeId_List)
testingId = NotDefined_filterDomainLookup.agg(collect_set(array(concat_ws('\0','DomainName','PrimaryLookupAttributeName'), col('OutputItemIdByAttribute').astype('string'))).alias('id')).first().id
#display(testing)
#iterate through mapped values
testingId_mappings = create_map([lit(i) for i in chain.from_iterable(testingId)])
#--AttributeCode
NotDefined_AttributeCode_List =dict(matchedDomains.agg(collect_set(array('DomainName', 'TargetAttributeForCode')).alias('m')).first().m)
#print(NotDefined_AttributeCode_List)
testingCode = NotDefined_filterDomainLookup.agg(collect_set(array(concat_ws('\0','DomainName','PrimaryLookupAttributeName'), col('OutputItemCodeByAttribute').astype('string'))).alias('code')).first().code
#display(testing)
#iterate through mapped values
testingCode_mappings = create_map([lit(i) for i in chain.from_iterable(testingCode)])
#--AttributeName
NotDefined_AttributeName_List =dict(matchedDomains.agg(collect_set(array('DomainName', 'TargetAttributeForName')).alias('m')).first().m)
#print(NotDefined_AttributeName_List)
testingName = NotDefined_filterDomainLookup.agg(collect_set(array(concat_ws('\0','DomainName','PrimaryLookupAttributeName'), col('OutputItemNameByAttribute').astype('string'))).alias('name')).first().name
#display(testing)
#iterate through mapped values
testingName_mappings = create_map([lit(i) for i in chain.from_iterable(testingName)])
if count_ND > 0:
    #--Dataframe with corresponding matched Outputs for Not Defined TargetAttribute columns
    testing_NotDefined = datasetMatchedPortfolio \
        .select("*", *[testingId_mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c, c_name in NotDefined_AttributeId_List.items() if c_name]) \
        .select("*", *[testingCode_mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c, c_name in NotDefined_AttributeCode_List.items() if c_name]) \
        .select("*", *[testingName_mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c, c_name in NotDefined_AttributeName_List.items() if c_name])
else:
    print("no Not Defines exist")
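The first line of this answer builds a dictionary keyed by ItemId from the reference table, so the Code and Name for any Id can be read from it instead of being hard-coded. A minimal plain-Python sketch of that idea, using a few rows copied from the default_reference table in the question (the list-of-lists layout stands in for the collected arrays and is an assumption):

```python
# Rows as [ItemId, ItemCode, ItemName], taken from the default_reference table.
collected = [
    ["1", "N/F", "Lookup Not Found"],
    ["3", "N/A", "Not Applicable"],
    ["4", "N/D", "Not Defined"],
]

# ItemId arrives as a string inside the array, so it is cast back to int.
map_ids = {int(e[0]): {"code": e[1], "name": e[2]} for e in collected}

item_id = 4
item = map_ids[item_id]
```

Changing item_id is then the only edit needed to switch to a different default item.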
Source: https://stackoverflow.com/questions/62053976/pyspark-iterating-repetitive-variables