apache-spark-sql

Implementing a recursive algorithm in pyspark to find pairings within a dataframe

我的梦境 submitted on 2021-02-06 09:22:31
Question: I have a Spark dataframe ( prof_student_df ) that lists student/professor pairs for a timestamp. There are 4 professors and 4 students for each timestamp, and each professor-student pair has a “score” (so there are 16 rows per time frame). For each time frame, I need to find the one-to-one pairing between professors and students that maximizes the overall score. Each professor can only be matched with one student in a single time frame. For example, here are the pairings/scores for one time frame.
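
The example table is cut off in this excerpt. Since each time frame contains only a 4 x 4 grid of scores, one possible approach (a sketch only, not necessarily the approach from the original answers) is to group by timestamp and brute-force the best one-to-one assignment inside a grouped-map pandas UDF. The column names timestamp, professor, student and score, and the schema string, are assumptions based on the description; applyInPandas needs Spark 3.0+.

from itertools import permutations
import pandas as pd

def best_assignment(pdf: pd.DataFrame) -> pd.DataFrame:
    # 4 professors x 4 students => only 4! = 24 assignments to try per time frame
    profs = sorted(pdf["professor"].unique())
    studs = sorted(pdf["student"].unique())
    score = {(r.professor, r.student): r.score for r in pdf.itertuples()}
    best = max(permutations(studs),
               key=lambda perm: sum(score[(p, s)] for p, s in zip(profs, perm)))
    return pd.DataFrame({
        "timestamp": pdf["timestamp"].iloc[0],
        "professor": profs,
        "student": list(best),
        "score": [score[(p, s)] for p, s in zip(profs, best)],
    })

result = (prof_student_df
          .groupBy("timestamp")
          .applyInPandas(best_assignment,
                         schema="timestamp long, professor string, student string, score double"))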

Compare two dataframes in Pyspark

馋奶兔 submitted on 2021-02-06 06:31:48
Question: I'm trying to compare two data frames that have the same number of columns, i.e. 4 columns with id as the key column in both data frames:

df1 = spark.read.csv("/path/to/data1.csv")
df2 = spark.read.csv("/path/to/data2.csv")

Now I want to append a new column to df2, column_names, which is the list of the columns with different values than df1:

df2.withColumn("column_names", udf())

DF1

+---+------+------+---------+
| id| name | sal  | Address |
+---+------+------+---------+
|  1| ABC  | 5000 | US
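
The excerpt is cut off above. A sketch of one way to build such a column_names column, assuming the non-key columns shown (name, sal, Address) and Spark 3.1+ for functions.filter: join the two frames on id and collect the names of the columns whose values differ.

from pyspark.sql import functions as F

compare_cols = [c for c in df1.columns if c != "id"]

joined = df2.alias("d2").join(df1.alias("d1"), on="id", how="left")

# For each non-key column, emit its name when the two frames disagree (null-safe),
# then drop the nulls so column_names only lists the mismatching columns.
column_names = F.filter(
    F.array(*[F.when(~F.col(f"d1.{c}").eqNullSafe(F.col(f"d2.{c}")), F.lit(c))
              for c in compare_cols]),
    lambda x: x.isNotNull(),
)

result = joined.select("id",
                       *[F.col(f"d2.{c}").alias(c) for c in compare_cols],
                       column_names.alias("column_names"))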

How to handle small file problem in spark structured streaming?

半世苍凉 submitted on 2021-02-06 02:59:53
Question: I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using structured streaming. Once the data is received and processed, I need to save it into the respective Parquet files in the HDFS store. I am able to store and read the Parquet files; I kept a trigger time of 15 seconds to 1 minute. These files are very small in size, hence resulting in many files. These Parquet files need to be read later by Hive
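
The question is cut off above. Two common mitigations, sketched below under the assumption that the sink is a plain Parquet path on HDFS (the paths, trigger interval and partition counts are made up, stream_df stands for the processed streaming DataFrame, and spark is the existing SparkSession): write less often with fewer partitions in the streaming query itself, and compact already-written files with a separate periodic batch job.

# 1) In the streaming query: a longer trigger and fewer output partitions
#    mean fewer, larger files per micro-batch.
query = (stream_df.coalesce(4)
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///data/out")
         .option("checkpointLocation", "hdfs:///checkpoints/out")
         .trigger(processingTime="5 minutes")
         .start())

# 2) Separately, a scheduled batch job can rewrite the small files that have
#    already accumulated into a handful of larger ones.
(spark.read.parquet("hdfs:///data/out/date=2021-02-05")
      .repartition(8)                      # target number of output files
      .write.mode("overwrite")
      .parquet("hdfs:///data/out_compacted/date=2021-02-05"))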

Difference between explode and explode_outer

六眼飞鱼酱① submitted on 2021-02-05 11:39:22
Question: What is the difference between explode and explode_outer? The documentation for both functions is the same, and the examples for both functions are identical:

SELECT explode(array(10, 20));
10
20

and

SELECT explode_outer(array(10, 20));
10
20

The Spark source suggests that there is a difference between the two functions:

expression[Explode]("explode"),
expressionGeneratorOuter[Explode]("explode_outer")

but what is the effect of expressionGeneratorOuter compared to expression?

Answer 1: explode
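
The rest of the answer is cut off in this excerpt. The documented examples look identical because the difference only appears when the array (or map) is null or empty: explode drops those rows, while explode_outer keeps them and yields a null. A small illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, [10, 20]), (2, []), (3, None)], ["id", "values"])

df.select("id", F.explode("values")).show()
# only id 1 survives: rows whose array is empty or null are dropped
# +---+---+
# | id|col|
# +---+---+
# |  1| 10|
# |  1| 20|
# +---+---+

df.select("id", F.explode_outer("values")).show()
# ids 2 and 3 are kept, with null in the exploded column
# +---+----+
# | id| col|
# +---+----+
# |  1|  10|
# |  1|  20|
# |  2|null|
# |  3|null|
# +---+----+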

Normalize a complex nested JSON file

心不动则不痛 submitted on 2021-02-05 09:37:41
Question: I'm trying to normalize the JSON file below into 4 tables - "content", "Modules", "Images" and "Everything Else in another table":

{
  "id": "0000050a",
  "revision": 1580225050941,
  "slot": "product-description",
  "type": "E",
  "create_date": 1580225050941,
  "modified_date": 1580225050941,
  "creator": "Auto",
  "modifier": "Auto",
  "audit_info": {
    "date": 1580225050941,
    "source": "AutoService",
    "username": "Auto"
  },
  "total_ID": 1,
  "name": "Auto_A1AM78C64UM0Y8_B07JCJR5HW",
  "content": [{
    "ID": ["B01"],
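
The JSON is cut off above. A sketch of one way to split such a document in PySpark: read it with multiLine and explode each nested array into its own DataFrame. The file path is a placeholder, and where exactly the Modules and Images arrays live is not visible in the excerpt, so those field names are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.option("multiLine", True).json("/path/to/file.json")

# One row per element of the nested "content" array, keyed by the parent id.
content_df = (raw
              .select("id", F.explode_outer("content").alias("c"))
              .select("id", "c.*"))

# Hypothetical: if each content element carries "modules" and "images" arrays,
# explode them out the same way, one level at a time.
modules_df = (content_df
              .select("id", F.explode_outer("modules").alias("m"))
              .select("id", "m.*"))
images_df = (content_df
             .select("id", F.explode_outer("images").alias("im"))
             .select("id", "im.*"))

# Everything that is not one of the nested arrays goes into the remaining table.
other_df = raw.drop("content")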

Does the state also get removed on event timeout with Spark structured streaming?

試著忘記壹切 submitted on 2021-02-05 09:26:35
Question: Q. Does the state get timed out and also get removed at the same time, or does it only get timed out while the state itself remains, for both ProcessingTimeout and EventTimeout? I was doing some experiments with mapGroupsWithState/flatMapGroupsWithState and I have some confusion about the state timeout. Consider that I am maintaining state with a watermark of 10 seconds and applying a timeout based on event time, say:

ds.withWatermark("timestamp", "10 seconds")
  .groupByKey(...)
  .mapGroupsWithState(

pyspark: groupby and aggregate avg and first on multiple columns

我的梦境 submitted on 2021-02-05 09:26:33
Question: I have the following sample PySpark dataframe, and after a groupby I want to calculate the mean and the first value of multiple columns. In the real case I have hundreds of columns, so I can't do it individually:

sp = spark.createDataFrame([['a', 2, 4, 'cc', 'anc'],
                            ['a', 4, 7, 'cd', 'abc'],
                            ['b', 6, 0, 'as', 'asd'],
                            ['b', 2, 4, 'ad', 'acb'],
                            ['c', 4, 4, 'sd', 'acc']],
                           ['id', 'col1', 'col2', 'col3', 'col4'])

+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
|  a|   2|   4|  cc| anc|
|  a|   4|   7|  cd| abc|
|  b|   6|   0|
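
The excerpt is cut off above. A sketch of one way to avoid writing each aggregation by hand: split the columns by data type and build the list of aggregation expressions programmatically (column names follow the sample above).

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sp = spark.createDataFrame([['a', 2, 4, 'cc', 'anc'], ['a', 4, 7, 'cd', 'abc'],
                            ['b', 6, 0, 'as', 'asd'], ['b', 2, 4, 'ad', 'acb'],
                            ['c', 4, 4, 'sd', 'acc']],
                           ['id', 'col1', 'col2', 'col3', 'col4'])

# Average the numeric columns, take the first value of everything else.
numeric_cols = [f.name for f in sp.schema.fields
                if f.name != 'id'
                and f.dataType.typeName() in ('integer', 'long', 'float', 'double')]
other_cols = [c for c in sp.columns if c != 'id' and c not in numeric_cols]

aggs = ([F.avg(c).alias(f'avg_{c}') for c in numeric_cols]
        + [F.first(c).alias(f'first_{c}') for c in other_cols])

sp.groupBy('id').agg(*aggs).show()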