pyspark

PySpark socket timeout exception after the application has been running for a while

余生长醉 submitted on 2019-12-30 06:34:06
Question: I am using PySpark to estimate the parameters of a logistic regression model. I use Spark to calculate the likelihood and gradients and then use SciPy's minimize function (L-BFGS-B) for the optimization. I run my application in yarn-client mode. The application starts without any problem, but after a while it reports the following error: Traceback (most recent call last): File "/home/panc/research/MixedLogistic/software/mixedlogistic/mixedlogistic_spark/simulation/20160716-1626
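Below is a minimal sketch of the workflow this question describes, not the asker's actual code: the negative log-likelihood and its gradient are accumulated with a Spark map/reduce and handed to scipy.optimize.minimize with method="L-BFGS-B". The toy data, the per-point loss, and the two-dimensional parameter vector are assumptions made for illustration.

import numpy as np
from scipy.optimize import minimize
from pyspark import SparkContext

sc = SparkContext(appName="logistic-lbfgsb")

# Hypothetical data: an RDD of (feature vector, label) pairs.
data = sc.parallelize([(np.array([1.0, 2.0]), 1),
                       (np.array([0.5, -1.0]), 0)])

def objective(beta):
    b = sc.broadcast(beta)

    def per_point(point):
        x, y = point
        p = 1.0 / (1.0 + np.exp(-np.dot(x, b.value)))
        loss = -(y * np.log(p) + (1 - y) * np.log(1.0 - p))
        grad = (p - y) * x
        return loss, grad

    # Sum the per-point losses and gradients across the cluster.
    loss, grad = data.map(per_point).reduce(lambda a, c: (a[0] + c[0], a[1] + c[1]))
    return loss, grad

# jac=True tells minimize that the objective returns (value, gradient).
result = minimize(objective, x0=np.zeros(2), jac=True, method="L-BFGS-B")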

PySpark - Add a new nested column or change the value of existing nested columns

走远了吗. submitted on 2019-12-30 04:46:07
Question: Suppose I have a JSON file whose lines have the following structure: { "a": 1, "b": { "bb1": 1, "bb2": 2 } } I want to change the value of the key bb1 or add a new key, such as bb3. Currently, I use spark.read.json to load the JSON file into Spark as a DataFrame and df.rdd.map to map each row of the RDD to a dict. Then I change the nested key's value or add a nested key and convert the dict back to a Row. Finally, I convert the RDD back to a DataFrame. The workflow works as follows: def map_func(row): dictionary = row.asDict(True) adding
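For reference, a minimal sketch of one way to do this without the RDD round trip, rebuilding the nested struct with withColumn and struct; the field names follow the JSON above, while the input path and the new values are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("data.json")  # hypothetical path

# Rebuild the struct column "b": overwrite bb1, keep bb2, add bb3.
df_updated = df.withColumn(
    "b",
    F.struct(
        F.lit(10).alias("bb1"),
        F.col("b.bb2").alias("bb2"),
        F.lit(3).alias("bb3"),
    ),
)
df_updated.show(truncate=False)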

PySpark broadcast variables from local functions

女生的网名这么多〃 submitted on 2019-12-30 04:24:04
Question: I'm attempting to create broadcast variables from within Python methods (trying to abstract some utility methods I'm creating that rely on distributed operations). However, I can't seem to access the broadcast variables from within the Spark workers. Let's say I have this setup: def main(): sc = SparkContext() SomeMethod(sc) def SomeMethod(sc): someValue = rand() V = sc.broadcast(someValue) A = sc.parallelize().map(worker) def worker(element): element *= V.value ### NameError: global name 'V'
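A minimal sketch of the usual fix for the setup above: define the worker inside the method (or pass the broadcast handle to it explicitly) so the closure shipped to the executors actually captures V. The sample data and the multiplication are placeholders.

from random import random
from pyspark import SparkContext

def some_method(sc):
    some_value = random()
    V = sc.broadcast(some_value)

    def worker(element):
        # V is captured in this closure, so executors can resolve it.
        return element * V.value

    A = sc.parallelize(range(10)).map(worker)
    return A.collect()

def main():
    sc = SparkContext()
    print(some_method(sc))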

How to calculate date difference in pyspark?

笑着哭i submitted on 2019-12-30 04:03:20
Question: I have data like this: df = sqlContext.createDataFrame([ ('1986/10/15', 'z', 'null'), ('1986/10/15', 'z', 'null'), ('1986/10/15', 'c', 'null'), ('1986/10/15', 'null', 'null'), ('1986/10/16', 'null', '4.0')], ('low', 'high', 'normal')) I want to calculate the date difference between the low column and 2017-05-02 and replace the low column with the difference. I've tried related solutions on Stack Overflow but none of them works. Answer 1: You need to cast the column low to class date and then you can use
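A minimal sketch of the cast-then-datediff approach the answer points to, assuming Spark 2.2+ (where to_date accepts a format string) and the df and column names from the example above:

from pyspark.sql import functions as F

# df is the DataFrame constructed in the question above.
# Parse the 'yyyy/MM/dd' strings in "low" into dates, then take the day
# difference from the reference date 2017-05-02.
df2 = df.withColumn(
    "low",
    F.datediff(F.to_date(F.lit("2017-05-02")), F.to_date("low", "yyyy/MM/dd")),
)
df2.show()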

Group By, Rank and aggregate spark data frame using pyspark

会有一股神秘感。 submitted on 2019-12-30 02:11:32
Question: I have a dataframe that looks like: A B C --------------- A1 B1 0.8 A1 B2 0.55 A1 B3 0.43 A2 B1 0.7 A2 B2 0.5 A2 B3 0.5 A3 B1 0.2 A3 B2 0.3 A3 B3 0.4 How do I convert column 'C' into a relative rank (higher score -> better rank) per value of column A? Expected output: A B Rank --------------- A1 B1 1 A1 B2 2 A1 B3 3 A2 B1 1 A2 B2 2 A2 B3 2 A3 B1 3 A3 B2 2 A3 B3 1 The final state I want to reach is to aggregate column B and store the ranks for each A, for example: B Ranks B1 [1,1,3] B2 [2,2,2] B3 [3,2
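A minimal sketch of the two steps described above: rank C within each A partition with a window function, then collect the ranks per B. Using dense_rank (so ties such as A2/B2 and A2/B3 share a rank) is an assumption that matches the expected output shown.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# df is the dataframe shown in the question.
# Rank rows within each value of A, higher C first.
w = Window.partitionBy("A").orderBy(F.desc("C"))
ranked = df.withColumn("Rank", F.dense_rank().over(w))

# Aggregate: one list of ranks per value of B.
aggregated = ranked.groupBy("B").agg(F.collect_list("Rank").alias("Ranks"))
aggregated.show()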

Pyspark DataFrame UDF on Text Column

帅比萌擦擦* submitted on 2019-12-30 01:58:04
Question: I'm trying to do some NLP text clean-up of some Unicode columns in a PySpark DataFrame. I've tried Spark 1.3, 1.5 and 1.6 and can't seem to get things to work for the life of me. I've also tried using Python 2.7 and Python 3.4. I've created an extremely simple UDF, as seen below, that should just return a string back for each record in a new column. Other functions will manipulate the text and then return the changed text back in a new column. import pyspark from pyspark.sql import
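A minimal sketch of such a string-returning UDF, written against the Spark 2.x API for brevity (the question mentions Spark 1.3 to 1.6, where a SQLContext/HiveContext would be used instead); the clean-up logic and column names are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(u"Some Unicode text ",)], ["text"])

def clean_text(s):
    # Placeholder clean-up: trim whitespace and lower-case.
    return s.strip().lower() if s is not None else None

clean_text_udf = udf(clean_text, StringType())
df = df.withColumn("text_clean", clean_text_udf(df["text"]))
df.show()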

Pyspark dataframe operator “IS NOT IN”

空扰寡人 submitted on 2019-12-30 01:40:07
Question: I would like to rewrite this from R to PySpark, any nice-looking suggestions? array <- c(1,2,3) dataset <- filter(!(column %in% array)) Answer 1: In PySpark you can do it like this: array = [1, 2, 3] dataframe.filter(dataframe.column.isin(*array) == False) Or using the binary NOT operator: dataframe.filter(~dataframe.column.isin(*array)) Answer 2: Use the ~ operator, which means negation: df_filtered = df.filter(~df["column_name"].isin([1, 2, 3])) Answer 3: df_result = df[df.column_name.isin([1, 2, 3]) ==
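A self-contained version of the "NOT IN" pattern from the answers above; the example data and column name are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,)], ["column_name"])

array = [1, 2, 3]
# Keep only the rows whose column_name is NOT in the list.
df_filtered = df.filter(~df["column_name"].isin(array))
df_filtered.show()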

How to convert type Row into Vector to feed to KMeans

泄露秘密 submitted on 2019-12-30 00:39:49
Question: When I try to feed df2 to KMeans I get the following error: clusters = KMeans.train(df2, 10, maxIterations=30, runs=10, initializationMode="random") The error I get is: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector. df2 is a DataFrame created as follows: df = sqlContext.read.json("data/ALS3.json") df2 = df.select('latitude','longitude') df2.show() latitude| longitude| 60.1643075| 24.9460844| 60.4686748| 22.2774728| How can I convert these two columns to a Vector and feed it to KMeans
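A minimal sketch of the usual fix: MLlib's KMeans.train expects an RDD of vectors, so map each Row of df2 to a DenseVector first. Column names follow the question; the deprecated runs argument is dropped here.

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

# df2 is the DataFrame selected in the question above.
# Turn each Row(latitude, longitude) into a dense vector.
points = df2.rdd.map(lambda row: Vectors.dense([row.latitude, row.longitude]))

clusters = KMeans.train(points, 10, maxIterations=30,
                        initializationMode="random")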

Pyspark convert a standard list to data frame [duplicate]

走远了吗. submitted on 2019-12-30 00:35:49
Question: This question already has an answer here: Create Spark DataFrame. Can not infer schema for type: <type 'float'> (1 answer). Closed last year. The case is really simple: I need to convert a Python list into a data frame with the following code: from pyspark.sql.types import StructType from pyspark.sql.types import StructField from pyspark.sql.types import StringType, IntegerType schema = StructType([StructField("value", IntegerType(), True)]) my_list = [1, 2, 3, 4] rdd = sc.parallelize(my_list) df =
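A minimal sketch of a working version of the snippet above: each list element is wrapped in a tuple so it matches the one-field schema (shown with the Spark 2.x SparkSession entry point; with an existing sc/sqlContext the createDataFrame call is the same):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]

# Wrap each element in a tuple so every record has one field matching the schema.
df = spark.createDataFrame([(x,) for x in my_list], schema)
df.show()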