pyspark

Implementing a recursive algorithm in pyspark to find pairings within a dataframe

懵懂的女人 submitted on 2021-02-06 09:21:11
Question: I have a Spark dataframe (prof_student_df) that lists a student/professor pair for a timestamp. There are 4 professors and 4 students for each timestamp, and each professor-student pair has a "score" (so there are 16 rows per time frame). For each time frame, I need to find the one-to-one pairing between professors and students that maximizes the overall score. Each professor can be matched with only one student for a single time frame. For example, here are the pairings/scores for one time frame.
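One way to approach this (a minimal sketch, not the poster's code): treat each time frame as an independent assignment problem and solve it with the Hungarian algorithm via scipy.optimize.linear_sum_assignment inside a grouped-map pandas UDF. The column names timestamp, professor, student and score are assumptions, SciPy must be installed on the workers, and applyInPandas requires Spark 3.0+.

    import pandas as pd
    from scipy.optimize import linear_sum_assignment

    def best_pairing(pdf: pd.DataFrame) -> pd.DataFrame:
        # Pivot this time frame into a 4x4 professor-by-student score matrix.
        matrix = pdf.pivot(index="professor", columns="student", values="score")
        # linear_sum_assignment minimizes cost, so negate the scores to maximize.
        rows, cols = linear_sum_assignment(-matrix.values)
        return pd.DataFrame({
            "timestamp": pdf["timestamp"].iloc[0],
            "professor": matrix.index[rows],
            "student": matrix.columns[cols],
            "score": matrix.values[rows, cols],
        })

    best = prof_student_df.groupBy("timestamp").applyInPandas(
        best_pairing,
        schema="timestamp string, professor string, student string, score double",
    )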

Compare two dataframes in PySpark

馋奶兔 submitted on 2021-02-06 06:31:48
Question: I'm trying to compare two data frames that have the same number of columns, i.e. 4 columns with id as the key column in both data frames: df1 = spark.read.csv("/path/to/data1.csv") df2 = spark.read.csv("/path/to/data2.csv") Now I want to append a new column to df2, column_names, which is the list of the columns whose values differ from df1: df2.withColumn("column_names", udf()) DF1 +------+------+------+---------+ | id | name | sal | Address | +------+------+------+---------+ | 1 | ABC | 5000 | US
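A minimal sketch of one way to build such a column (not the poster's code; the non-key column names name, sal and Address are assumptions): join the two frames on id and collect the names of the columns whose values differ into an array.

    from pyspark.sql import functions as F

    compare_cols = ["name", "sal", "Address"]  # every column except the id key

    joined = df2.alias("new").join(df1.alias("old"), on="id", how="left")

    result = joined.select(
        "id",
        *[F.col("new." + c).alias(c) for c in compare_cols],
        F.array_remove(
            F.array(*[
                # eqNullSafe treats two NULLs as equal; keep the column name on a mismatch
                F.when(~F.col("new." + c).eqNullSafe(F.col("old." + c)), F.lit(c))
                 .otherwise(F.lit(""))
                for c in compare_cols
            ]),
            "",
        ).alias("column_names"),
    )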

Converting a dataframe into JSON (in pyspark) and then selecting desired fields

醉酒当歌 submitted on 2021-02-06 02:41:06
Question: I'm new to Spark. I have a dataframe that contains the results of some analysis. I converted that dataframe into JSON so I could display it in a Flask app: results = result.toJSON().collect() An example entry in my JSON output is below. I then tried to run a for loop in order to get specific fields: {"userId":"1","systemId":"30","title":"interest"} for i in results: print i["userId"] This doesn't work at all and I get errors such as: Python (json): TypeError: expected string or buffer I used
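A sketch of the likely fix (an assumption about the cause, not the accepted answer): toJSON().collect() returns a list of JSON strings, not dictionaries, so each element has to be parsed with json.loads before it can be indexed by key.

    import json

    results = result.toJSON().collect()   # a list of JSON strings, one per row
    for row_str in results:
        row = json.loads(row_str)         # parse the string into a Python dict
        print(row["userId"])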

SparkContext Error - File not found /tmp/spark-events does not exist

狂风中的少年 submitted on 2021-02-05 20:19:06
Question: I'm running a Python Spark application via an API call. On submitting the application, the response is Failed. I SSH into the worker; my Python application is at /root/spark/work/driver-id/wordcount.py and the error can be found in /root/spark/work/driver-id/stderr, which shows the following: Traceback (most recent call last): File "/root/wordcount.py", line 34, in <module> main() File "/root/wordcount.py", line 18, in main sc = SparkContext(conf=conf) File "/root/spark/python/lib/pyspark.zip/pyspark/context.py
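A minimal sketch of the usual remedy (an assumption about the cause, not the accepted answer): this error typically appears when spark.eventLog.enabled is true but the event-log directory, /tmp/spark-events by default, does not exist. Either create it on the relevant nodes (mkdir -p /tmp/spark-events) or point spark.eventLog.dir at a directory that does exist:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("wordcount")
            .set("spark.eventLog.enabled", "true")
            # hypothetical path; use any existing, writable local directory or HDFS URI
            .set("spark.eventLog.dir", "file:///root/spark-events"))
    sc = SparkContext(conf=conf)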

PySpark: aggregate complex function (difference of consecutive events)

喜你入骨 submitted on 2021-02-05 11:16:29
Question: I have a DataFrame (df) whose columns are userid (the user id) and day (the day). I'm interested in computing, for every user, the average time interval between the days he/she was active. For instance, for a given user the DataFrame may look something like this: userid day 1 2016-09-18 1 2016-09-20 1 2016-09-25 If the DataFrame were a Pandas DataFrame, I could compute the quantity I'm interested in like this: import numpy as np np.mean(np.diff(df[df.userid==1].day)) However, this is quite
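A minimal Spark sketch of the same computation (not the original answer; assumes day is a date or date-castable column): use lag over a per-user window to get the previous active day, take the day difference, and average it per user.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("userid").orderBy("day")

    gaps = df.withColumn(
        "days_since_prev",
        F.datediff(F.col("day"), F.lag("day").over(w)),  # NULL for each user's first day
    )

    avg_interval = gaps.groupBy("userid").agg(
        F.avg("days_since_prev").alias("avg_interval_days")  # avg ignores the NULLs
    )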