apache-spark-sql

Using broadcasted dataframe in pyspark UDF

冷眼眸甩不掉的悲伤 submitted on 2021-01-18 05:07:56
Question: Is it possible to use a broadcast data frame in the UDF of a PySpark SQL application? My code calls the broadcast DataFrame inside a PySpark DataFrame like below:

fact_ent_df_data = sparkSession.sparkContext.broadcast(fact_ent_df.collect())

def generate_lookup_code(col1, col2, col3):
    fact_ent_df_count = fact_ent_df_data.select(
        fact_ent_df_br.TheDate.between(col1, col2),
        fact_ent_df_br.Ent.isin('col3')).count()
    return fact_ent_df_count

sparkSession.udf.register("generate_lookup_code",
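
A broadcast variable wraps whatever object was broadcast (here, the plain Python list of Rows produced by collect()), so DataFrame methods such as select are not available on it inside a UDF. A minimal sketch of one way this is commonly handled, iterating over the broadcast rows with ordinary Python; the column names TheDate and Ent come from the question, and the lookup table is assumed small enough to collect:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Broadcast plain Python rows, not a DataFrame.
fact_bc = spark.sparkContext.broadcast(fact_ent_df.collect())

def generate_lookup_code(col1, col2, col3):
    # Count matching rows with ordinary Python on the executor via bc.value.
    return sum(1 for r in fact_bc.value
               if col1 <= r["TheDate"] <= col2 and r["Ent"] == col3)

spark.udf.register("generate_lookup_code", generate_lookup_code, IntegerType())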

Converting query from SQL to pyspark

霸气de小男生 submitted on 2021-01-07 01:37:08
Question: I am trying to convert the following SQL query into PySpark:

SELECT COUNT(
         CASE WHEN COALESCE(data.pred, 0) != 0
               AND COALESCE(data.val, 0) != 0
               AND (ABS(COALESCE(data.pred, 0) - COALESCE(data.val, 0)) / COALESCE(data.val, 0)) > 0.1
              THEN data.pred END
       ) / COUNT(*) AS Result

The code I have in PySpark right now is this:

Result = data.select(
    count(
        (coalesce(data["pred"], lit(0)) != 0) &
        (coalesce(data["val"], lit(0)) != 0) &
        (abs(coalesce(data["pred"], lit(0)) - coalesce(data["val"], lit(0))) /
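
For reference, COUNT(CASE WHEN ... THEN data.pred END) only counts rows where the condition holds, because the CASE yields NULL otherwise; counting a boolean expression directly does not replicate that. A hedged sketch of one way to express the same query, assuming data has numeric columns pred and val:

from pyspark.sql import functions as F

pred = F.coalesce(F.col("pred"), F.lit(0))
val = F.coalesce(F.col("val"), F.lit(0))

# Rows failing the condition become NULL inside when() and are ignored by count().
cond = (pred != 0) & (val != 0) & ((F.abs(pred - val) / val) > 0.1)

result = data.select(
    (F.count(F.when(cond, F.col("pred"))) / F.count(F.lit(1))).alias("Result")
)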

Spark: unusually slow data write to Cloud Storage

大憨熊 submitted on 2021-01-07 01:24:25
Question: As the final stage of the PySpark job, I need to save 33 GB of data to Cloud Storage. My cluster is on Dataproc and consists of 15 n1-standard-v4 workers. I'm working with Avro, and the code I use to save the data:

df = spark.createDataFrame(df.rdd, avro_schema_str)
df \
    .write \
    .format("avro") \
    .partitionBy('<field_with_<5_unique_values>', '<field_with_lots_of_unique_values>') \
    .save(f"gs://{output_path}")

The write stage stats from the UI: (screenshot)
My worker stats: (screenshot)
Quite strangely for the adequate
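
A frequently cited cause of slow partitionBy writes is that each task opens a writer for every output partition it touches, producing many small files and heavy overhead. A common mitigation, sketched below under the assumption that this is the bottleneck, is to repartition on the partition columns first and drop the DataFrame-to-RDD round trip (which forces a full re-serialization just to reapply the schema). The column names are placeholders standing in for the two fields in the question:

# Placeholder column names; substitute the real partition fields.
low_card = "field_with_few_unique_values"
high_card = "field_with_lots_of_unique_values"

(df
    .repartition(low_card, high_card)   # each task now writes to only a few directories
    .write
    .format("avro")
    .partitionBy(low_card, high_card)
    .save(f"gs://{output_path}"))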

How to create edge list from spark data frame in Pyspark?

独自空忆成欢 submitted on 2021-01-06 03:42:25
Question: I am using graphframes in PySpark for some graph analytics and am wondering what would be the best way to create the edge-list data frame from a vertices data frame. For example, below is my vertices data frame. I have a list of ids and they belong to different groups.

+---+-----+
|id |group|
+---+-----+
|a  |1    |
|b  |2    |
|c  |1    |
|d  |2    |
|e  |3    |
|a  |3    |
|f  |1    |
+---+-----+

My objective is to create an edge list data frame to indicate ids which appear in common groups. Please note that 1 id
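
A minimal sketch of one common approach, assuming the vertices data frame is named v and has the id and group columns shown: self-join on group, then keep each unordered pair once so the edge list contains no self-loops or mirrored duplicates.

from pyspark.sql import functions as F

a = v.alias("a")
b = v.alias("b")

edges = (
    a.join(b, F.col("a.group") == F.col("b.group"))
     .where(F.col("a.id") < F.col("b.id"))   # drops self-loops and mirrored pairs
     .select(F.col("a.id").alias("src"),
             F.col("b.id").alias("dst"))
     .distinct()                             # ids sharing several groups appear once
)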

Scala/Spark - How to get first elements of all sub-arrays

怎甘沉沦 submitted on 2021-01-05 09:11:38
Question: I have the following DataFrame in Spark (I'm using Scala):

[[1003014, 0.95266926], [15, 0.9484202], [754, 0.94236785], [1029530, 0.880922],
 [3066, 0.7085166], [1066440, 0.69400793], [1045811, 0.663178], [1020059, 0.6274495],
 [1233982, 0.6112905], [1007801, 0.60937023], [1239278, 0.60044676], [1000088, 0.5789191],
 [1056268, 0.5747936], [1307569, 0.5676605], [10334513, 0.56592846], [930, 0.5446228],
 [1170206, 0.52525467], [300, 0.52473146], [2105178, 0.4972785], [1088572, 0.4815367]]

I want
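
The question uses Scala, but the idea named in the title can be sketched with the SQL transform higher-order function (available since Spark 2.4), which maps each inner array to its first element; the same expression works from Scala via expr. The column name pairs is an assumption, since the question does not show the schema:

from pyspark.sql import functions as F

# `pairs` is assumed to be the array-of-arrays column holding the data shown above.
firsts = df.select(F.expr("transform(pairs, x -> x[0])").alias("first_elements"))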