Question
I'm using an EMR cluster with 1 master node (m5.2xlarge) and 4 core nodes (c5.2xlarge), and I'm running a PySpark job on it that joins 5 fact tables (150 columns and 100k rows each) and 5 small dimension tables (10 columns each, with fewer than 100 records). When I join all these tables, the resulting DataFrame has 600 columns and about 420k records (approximately 1.5 GB of data).
Please suggest something here. I'm from a SQL and DWH background, so I have used a single SQL query to join all 5 facts and 5 dimensions, since I have to derive 600 columns out of this and I feel that a single, properly formatted query would be easier to maintain in the future. The other option on my mind is to use core Spark (DataFrame API) joins, e.g. df = a.join(b, a['id'] == b['id'], how='inner'), as sketched below.
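To make the comparison concrete, here is a stripped-down sketch of what the DataFrame-API version would look like. The table names (fact_a, fact_b, dim_x) and key columns are made up for illustration; the real job chains 5 facts and 5 dimensions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fact-dim-join").getOrCreate()

# Hypothetical tables; the real job reads 5 facts (~150 cols, ~100k rows each)
# and 5 small dimensions (~10 cols, < 100 rows each).
fact_a = spark.table("fact_a")
fact_b = spark.table("fact_b")
dim_x = spark.table("dim_x")

# Chain the joins, broadcasting the small dimension so it is shipped to every executor.
joined = (
    fact_a.alias("a")
    .join(fact_b.alias("b"), F.col("a.id") == F.col("b.id"), "inner")
    .join(F.broadcast(dim_x.alias("d")), F.col("a.dim_key") == F.col("d.key"), "left")
)
```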
Which of the two would be the better/recommended way to join these tables? Currently, in my huge SQL query, I'm broadcasting the small dimensions, and that has helped to a certain extent.
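For reference, this is roughly how the broadcast looks inside the big SQL query today, using a broadcast hint; the table and column names are again placeholders, and the real query joins all 5 facts and 5 dimensions and derives the 600 output columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# fact_a and dim_x must already be registered as tables/temp views.
# The /*+ BROADCAST(d) */ hint tells Spark to broadcast the small dimension side.
result = spark.sql("""
    SELECT /*+ BROADCAST(d) */
           f.*,
           d.*
    FROM fact_a f
    JOIN dim_x d
      ON f.dim_key = d.key
""")
```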
Source: https://stackoverflow.com/questions/62029473/etl-1-5-gb-dataframe-within-pyspark-on-aws-emr