apache-spark-sql

Implementing a recursive algorithm in pyspark to find pairings within a dataframe

我的梦境 submitted on 2021-02-06 09:22:31
Question: I have a Spark dataframe ( prof_student_df ) that lists student/professor pairs for a timestamp. There are 4 professors and 4 students for each timestamp, and each professor-student pair has a “score” (so there are 16 rows per time frame). For each time frame, I need to find the one-to-one pairing between professors and students that maximizes the overall score. Each professor can only be matched with one student in a single time frame. For example, here are the pairings/scores for one time frame.
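
The example table is cut off in this excerpt. Since each time frame contains only a 4 x 4 grid of scores, one possible approach (a sketch only, not necessarily the approach from the original answers) is to group by timestamp and brute-force the best one-to-one assignment inside a grouped-map pandas UDF. The column names timestamp, professor, student and score, and the schema string, are assumptions based on the description; applyInPandas needs Spark 3.0+.

from itertools import permutations
import pandas as pd

def best_assignment(pdf: pd.DataFrame) -> pd.DataFrame:
    # 4 professors x 4 students => only 4! = 24 assignments to try per time frame
    profs = sorted(pdf["professor"].unique())
    studs = sorted(pdf["student"].unique())
    score = {(r.professor, r.student): r.score for r in pdf.itertuples()}
    best = max(permutations(studs),
               key=lambda perm: sum(score[(p, s)] for p, s in zip(profs, perm)))
    return pd.DataFrame({
        "timestamp": pdf["timestamp"].iloc[0],
        "professor": profs,
        "student": list(best),
        "score": [score[(p, s)] for p, s in zip(profs, best)],
    })

result = (prof_student_df
          .groupBy("timestamp")
          .applyInPandas(best_assignment,
                         schema="timestamp long, professor string, student string, score double"))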

Compare two dataframes in Pyspark

馋奶兔 submitted on 2021-02-06 06:31:48
Question: I'm trying to compare two data frames that have the same number of columns, i.e. 4 columns with id as the key column in both data frames:

df1 = spark.read.csv("/path/to/data1.csv")
df2 = spark.read.csv("/path/to/data2.csv")

Now I want to append a new column to df2, column_names, which is the list of the columns with different values than df1:

df2.withColumn("column_names", udf())

DF1

+---+------+------+---------+
| id| name | sal  | Address |
+---+------+------+---------+
|  1| ABC  | 5000 | US
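
The excerpt is cut off above. A sketch of one way to build such a column_names column, assuming the non-key columns shown (name, sal, Address) and Spark 3.1+ for functions.filter: join the two frames on id and collect the names of the columns whose values differ.

from pyspark.sql import functions as F

compare_cols = [c for c in df1.columns if c != "id"]

joined = df2.alias("d2").join(df1.alias("d1"), on="id", how="left")

# For each non-key column, emit its name when the two frames disagree (null-safe),
# then drop the nulls so column_names only lists the mismatching columns.
column_names = F.filter(
    F.array(*[F.when(~F.col(f"d1.{c}").eqNullSafe(F.col(f"d2.{c}")), F.lit(c))
              for c in compare_cols]),
    lambda x: x.isNotNull(),
)

result = joined.select("id",
                       *[F.col(f"d2.{c}").alias(c) for c in compare_cols],
                       column_names.alias("column_names"))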

How to handle small file problem in spark structured streaming?

半世苍凉 submitted on 2021-02-06 02:59:53
Question: I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using structured streaming. Once the data is received and processed, I need to save it into the respective Parquet files in the HDFS store. I am able to store and read the Parquet files; I kept a trigger time of 15 seconds to 1 minute. These files are very small in size, hence resulting in many files. These Parquet files need to be read later by Hive
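
The question is cut off above. Two common mitigations, sketched below under the assumption that the sink is a plain Parquet path on HDFS (the paths, trigger interval and partition counts are made up, stream_df stands for the processed streaming DataFrame, and spark is the existing SparkSession): write less often with fewer partitions in the streaming query itself, and compact already-written files with a separate periodic batch job.

# 1) In the streaming query: a longer trigger and fewer output partitions
#    mean fewer, larger files per micro-batch.
query = (stream_df.coalesce(4)
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///data/out")
         .option("checkpointLocation", "hdfs:///checkpoints/out")
         .trigger(processingTime="5 minutes")
         .start())

# 2) Separately, a scheduled batch job can rewrite the small files that have
#    already accumulated into a handful of larger ones.
(spark.read.parquet("hdfs:///data/out/date=2021-02-05")
      .repartition(8)                      # target number of output files
      .write.mode("overwrite")
      .parquet("hdfs:///data/out_compacted/date=2021-02-05"))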

Difference between explode and explode_outer

六眼飞鱼酱① submitted on 2021-02-05 11:39:22
Question: What is the difference between explode and explode_outer? The documentation for both functions is the same, and the examples for both functions are identical:

SELECT explode(array(10, 20));
10
20

and

SELECT explode_outer(array(10, 20));
10
20

The Spark source suggests that there is a difference between the two functions:

expression[Explode]("explode"),
expressionGeneratorOuter[Explode]("explode_outer")

but what is the effect of expressionGeneratorOuter compared to expression?

Answer 1: explode
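
The rest of the answer is cut off in this excerpt. The documented examples look identical because the difference only appears when the array (or map) is null or empty: explode drops those rows, while explode_outer keeps them and yields a null. A small illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, [10, 20]), (2, []), (3, None)], ["id", "values"])

df.select("id", F.explode("values")).show()
# only id 1 survives: rows whose array is empty or null are dropped
# +---+---+
# | id|col|
# +---+---+
# |  1| 10|
# |  1| 20|
# +---+---+

df.select("id", F.explode_outer("values")).show()
# ids 2 and 3 are kept, with null in the exploded column
# +---+----+
# | id| col|
# +---+----+
# |  1|  10|
# |  1|  20|
# |  2|null|
# |  3|null|
# +---+----+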

Normalize a complex nested JSON file

心不动则不痛 submitted on 2021-02-05 09:37:41
Question: I'm trying to normalize the JSON file below into 4 tables - "content", "Modules", "Images" and "Everything Else in another table":

{
  "id": "0000050a",
  "revision": 1580225050941,
  "slot": "product-description",
  "type": "E",
  "create_date": 1580225050941,
  "modified_date": 1580225050941,
  "creator": "Auto",
  "modifier": "Auto",
  "audit_info": {
    "date": 1580225050941,
    "source": "AutoService",
    "username": "Auto"
  },
  "total_ID": 1,
  "name": "Auto_A1AM78C64UM0Y8_B07JCJR5HW",
  "content": [{
    "ID": ["B01"],
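
The JSON is cut off above. A sketch of one way to split such a document in PySpark: read it with multiLine and explode each nested array into its own DataFrame. The file path is a placeholder, and where exactly the Modules and Images arrays live is not visible in the excerpt, so those field names are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("multiLine", True).json("/path/to/file.json")

# One row per element of the nested "content" array, keyed by the parent id.
content_df = (raw
              .select("id", F.explode_outer("content").alias("c"))
              .select("id", "c.*"))

# Hypothetical: if each content element carries "modules" and "images" arrays,
# explode them out the same way, one level at a time.
modules_df = (content_df
              .select("id", F.explode_outer("modules").alias("m"))
              .select("id", "m.*"))
images_df = (content_df
             .select("id", F.explode_outer("images").alias("im"))
             .select("id", "im.*"))

# Everything that is not one of the nested arrays goes into the remaining table.
other_df = raw.drop("content")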

Does the state also get removed on event timeout with Spark structured streaming?

試著忘記壹切 submitted on 2021-02-05 09:26:35
Question: Q. Does the state get timed out and also get removed at the same time, or does it only get timed out while the state itself remains, for both ProcessingTimeout and EventTimeout? I was doing some experiments with mapGroupsWithState/flatMapGroupsWithState and I have some confusion about the state timeout. Consider that I am maintaining state with a watermark of 10 seconds and applying a timeout based on event time, say:

ds.withWatermark("timestamp", "10 seconds")
  .groupByKey(...)
  .mapGroupsWithState(

pyspark: groupby and aggregate avg and first on multiple columns

我的梦境 submitted on 2021-02-05 09:26:33
Question: I have the following sample PySpark dataframe, and after a groupby I want to calculate the mean and the first value of multiple columns. In the real case I have hundreds of columns, so I can't do it individually:

sp = spark.createDataFrame([['a', 2, 4, 'cc', 'anc'],
                            ['a', 4, 7, 'cd', 'abc'],
                            ['b', 6, 0, 'as', 'asd'],
                            ['b', 2, 4, 'ad', 'acb'],
                            ['c', 4, 4, 'sd', 'acc']],
                           ['id', 'col1', 'col2', 'col3', 'col4'])

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  a|   2|   4|  cc| anc|
|  a|   4|   7|  cd| abc|
|  b|   6|   0|
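
The excerpt is cut off above. A sketch of one way to avoid writing each aggregation by hand: split the columns by data type and build the list of aggregation expressions programmatically (column names follow the sample above).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sp = spark.createDataFrame([['a', 2, 4, 'cc', 'anc'], ['a', 4, 7, 'cd', 'abc'],
                            ['b', 6, 0, 'as', 'asd'], ['b', 2, 4, 'ad', 'acb'],
                            ['c', 4, 4, 'sd', 'acc']],
                           ['id', 'col1', 'col2', 'col3', 'col4'])

# Average the numeric columns, take the first value of everything else.
numeric_cols = [f.name for f in sp.schema.fields
                if f.name != 'id'
                and f.dataType.typeName() in ('integer', 'long', 'float', 'double')]
other_cols = [c for c in sp.columns if c != 'id' and c not in numeric_cols]

aggs = ([F.avg(c).alias(f'avg_{c}') for c in numeric_cols]
        + [F.first(c).alias(f'first_{c}') for c in other_cols])

sp.groupBy('id').agg(*aggs).show()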