How to optimize a percentage check and column drop in a large PySpark dataframe?
I have a sample pandas dataframe as shown below, but my real data is 40 million rows and 5200 columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    'readings': ['READ_1', 'READ_2', 'READ_1', 'READ_3', np.nan, 'READ_5', np.nan,
                 'READ_8', 'READ_10', 'READ_12', 'READ_11', 'READ_14', 'READ_09',
                 'READ_08', 'READ_07'],
    'val': [5, 6, 7, np.nan, np.nan, 7, np.nan, 12, 13, 56, 32, 13, 45, 43, 46],
})

from pyspark.sql.types import *
from pyspark.sql.functions import isnan, when, count, col

mySchema =
```
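Since the task is to compute per-column missing-value percentages and drop sparse columns at scale, here is a minimal sketch of how the rest of the setup and the check might look. The schema field types, the `spark` session creation, the NaN-to-None conversion workaround, the `is_missing` helper, and the 80% threshold are all assumptions for illustration, not taken from the original code.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DoubleType, FloatType)
from pyspark.sql.functions import col, count, isnan, when

spark = SparkSession.builder.getOrCreate()  # assumed session setup

# Assumed schema matching the three sample columns above
mySchema = StructType([
    StructField('subject_id', IntegerType(), True),
    StructField('readings', StringType(), True),
    StructField('val', DoubleType(), True),
])

# Replace NaN with None in the object column so StringType accepts it,
# then convert the pandas dataframe `df` defined above
spark_df = spark.createDataFrame(df.where(pd.notnull(df), None), schema=mySchema)

def is_missing(field):
    """Missing-value test per column; isnan() only applies to float/double columns."""
    c = col(field.name)
    if isinstance(field.dataType, (DoubleType, FloatType)):
        return isnan(c) | c.isNull()
    return c.isNull()

# One pass over the data: count missing values for every column at once
missing_counts = spark_df.select([
    count(when(is_missing(f), f.name)).alias(f.name)
    for f in spark_df.schema.fields
]).first().asDict()

total_rows = spark_df.count()
threshold = 0.80  # assumed cut-off: drop columns with more than 80% missing values

cols_to_drop = [c for c, n in missing_counts.items() if n / total_rows > threshold]
spark_df_reduced = spark_df.drop(*cols_to_drop)
```

With thousands of columns the `select` builds one wide aggregation, so all the counts come from a single job; caching `spark_df` before the aggregation would also keep the separate `count()` action from re-reading the source data.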