Question
I'm using an EMR cluster with 1 master node (m5.2xlarge) and 4 core nodes (c5.2xlarge), and I'm running a PySpark job on it that joins 5 fact tables (150 columns and 100k rows each) and 5 small dimension tables (10 columns each, with fewer than 100 records). When I join all these tables, the resulting DataFrame has 600 columns and about 420k records (approximately 1.5 GB of data).
Please suggest something here. I'm from a SQL and DWH background, so I have used a single SQL query to join all 5 facts and 5 dimensions, since I have to derive 600 columns out of this and I feel that a single, properly formatted query would be easier to maintain in the future. The other option on my mind is to use core Spark (DataFrame API) joins, e.g. df = a.join(b, a['id'] == b['id'], how='inner'), as sketched below.
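To make the comparison concrete, here is a stripped-down sketch of what the DataFrame-API version would look like. The table names (fact_a, fact_b, dim_x) and key columns are made up for illustration; the real job chains 5 facts and 5 dimensions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fact-dim-join").getOrCreate()

# Hypothetical tables; the real job reads 5 facts (~150 cols, ~100k rows each)
# and 5 small dimensions (~10 cols, < 100 rows each).
fact_a = spark.table("fact_a")
fact_b = spark.table("fact_b")
dim_x = spark.table("dim_x")

# Chain the joins, broadcasting the small dimension so it is shipped to every executor.
joined = (
    fact_a.alias("a")
    .join(fact_b.alias("b"), F.col("a.id") == F.col("b.id"), "inner")
    .join(F.broadcast(dim_x.alias("d")), F.col("a.dim_key") == F.col("d.key"), "left")
)
```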
Which of the two would be the better/recommended way to join these tables? Currently, in my huge SQL query, I'm broadcasting the small dimensions, and that has helped to a certain extent.
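For reference, this is roughly how the broadcast looks inside the big SQL query today, using a broadcast hint; the table and column names are again placeholders, and the real query joins all 5 facts and 5 dimensions and derives the 600 output columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# fact_a and dim_x must already be registered as tables/temp views.
# The /*+ BROADCAST(d) */ hint tells Spark to broadcast the small dimension side.
result = spark.sql("""
    SELECT /*+ BROADCAST(d) */
           f.*,
           d.*
    FROM fact_a f
    JOIN dim_x d
      ON f.dim_key = d.key
""")
```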
Source: https://stackoverflow.com/questions/62029473/etl-1-5-gb-dataframe-within-pyspark-on-aws-emr