pyspark

PySpark socket timeout exception after the application has been running for a while

余生长醉 submitted on 2019-12-30 06:34:06
Question: I am using PySpark to estimate the parameters of a logistic regression model. I use Spark to calculate the likelihood and gradients and then use SciPy's minimize function (L-BFGS-B) for the optimization. I run my application in yarn-client mode. The application starts without any problem, but after a while it reports the following error: Traceback (most recent call last): File "/home/panc/research/MixedLogistic/software/mixedlogistic/mixedlogistic_spark/simulation/20160716-1626
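Below is a minimal sketch of the workflow this question describes, not the asker's actual code: the negative log-likelihood and its gradient are accumulated with a Spark map/reduce and handed to scipy.optimize.minimize with method="L-BFGS-B". The toy data, the per-point loss, and the two-dimensional parameter vector are assumptions made for illustration.

import numpy as np
from scipy.optimize import minimize
from pyspark import SparkContext

sc = SparkContext(appName="logistic-lbfgsb")

# Hypothetical data: an RDD of (feature vector, label) pairs.
data = sc.parallelize([(np.array([1.0, 2.0]), 1),
                       (np.array([0.5, -1.0]), 0)])

def objective(beta):
    b = sc.broadcast(beta)

    def per_point(point):
        x, y = point
        p = 1.0 / (1.0 + np.exp(-np.dot(x, b.value)))
        loss = -(y * np.log(p) + (1 - y) * np.log(1.0 - p))
        grad = (p - y) * x
        return loss, grad

    # Sum the per-point losses and gradients across the cluster.
    loss, grad = data.map(per_point).reduce(lambda a, c: (a[0] + c[0], a[1] + c[1]))
    return loss, grad

# jac=True tells minimize that the objective returns (value, gradient).
result = minimize(objective, x0=np.zeros(2), jac=True, method="L-BFGS-B")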

PySpark - Add a new nested column or change the value of existing nested columns

走远了吗. submitted on 2019-12-30 04:46:07
Question: Suppose I have a JSON file whose lines have the following structure: { "a": 1, "b": { "bb1": 1, "bb2": 2 } } I want to change the value of the key bb1 or add a new key, such as bb3. Currently, I use spark.read.json to load the JSON file into Spark as a DataFrame and df.rdd.map to map each row of the RDD to a dict. Then I change the nested key's value or add a nested key and convert the dict back to a Row. Finally, I convert the RDD back to a DataFrame. The workflow works as follows: def map_func(row): dictionary = row.asDict(True) adding
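For reference, a minimal sketch of one way to do this without the RDD round trip, rebuilding the nested struct with withColumn and struct; the field names follow the JSON above, while the input path and the new values are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("data.json")  # hypothetical path

# Rebuild the struct column "b": overwrite bb1, keep bb2, add bb3.
df_updated = df.withColumn(
    "b",
    F.struct(
        F.lit(10).alias("bb1"),
        F.col("b.bb2").alias("bb2"),
        F.lit(3).alias("bb3"),
    ),
)
df_updated.show(truncate=False)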

PySpark broadcast variables from local functions

女生的网名这么多〃 submitted on 2019-12-30 04:24:04
Question: I'm attempting to create broadcast variables from within Python methods (trying to abstract some utility methods I'm creating that rely on distributed operations). However, I can't seem to access the broadcast variables from within the Spark workers. Let's say I have this setup: def main(): sc = SparkContext() SomeMethod(sc) def SomeMethod(sc): someValue = rand() V = sc.broadcast(someValue) A = sc.parallelize().map(worker) def worker(element): element *= V.value ### NameError: global name 'V'
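A minimal sketch of the usual fix for the setup above: define the worker inside the method (or pass the broadcast handle to it explicitly) so the closure shipped to the executors actually captures V. The sample data and the multiplication are placeholders.

from random import random
from pyspark import SparkContext

def some_method(sc):
    some_value = random()
    V = sc.broadcast(some_value)

    def worker(element):
        # V is captured in this closure, so executors can resolve it.
        return element * V.value

    A = sc.parallelize(range(10)).map(worker)
    return A.collect()

def main():
    sc = SparkContext()
    print(some_method(sc))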

How to calculate date difference in pyspark?

笑着哭i submitted on 2019-12-30 04:03:20
Question: I have data like this: df = sqlContext.createDataFrame([ ('1986/10/15', 'z', 'null'), ('1986/10/15', 'z', 'null'), ('1986/10/15', 'c', 'null'), ('1986/10/15', 'null', 'null'), ('1986/10/16', 'null', '4.0')], ('low', 'high', 'normal')) I want to calculate the date difference between the low column and 2017-05-02 and replace the low column with the difference. I've tried related solutions on Stack Overflow but none of them works. Answer 1: You need to cast the column low to class date and then you can use
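A minimal sketch of the cast-then-datediff approach the answer points to, assuming Spark 2.2+ (where to_date accepts a format string) and the df and column names from the example above:

from pyspark.sql import functions as F

# df is the DataFrame constructed in the question above.
# Parse the 'yyyy/MM/dd' strings in "low" into dates, then take the day
# difference from the reference date 2017-05-02.
df2 = df.withColumn(
    "low",
    F.datediff(F.to_date(F.lit("2017-05-02")), F.to_date("low", "yyyy/MM/dd")),
)
df2.show()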

Group By, Rank and aggregate spark data frame using pyspark

会有一股神秘感。 submitted on 2019-12-30 02:11:32
Question: I have a dataframe that looks like: A B C --------------- A1 B1 0.8 A1 B2 0.55 A1 B3 0.43 A2 B1 0.7 A2 B2 0.5 A2 B3 0.5 A3 B1 0.2 A3 B2 0.3 A3 B3 0.4 How do I convert column 'C' into a relative rank (higher score -> better rank) per value of column A? Expected output: A B Rank --------------- A1 B1 1 A1 B2 2 A1 B3 3 A2 B1 1 A2 B2 2 A2 B3 2 A3 B1 3 A3 B2 2 A3 B3 1 The final state I want to reach is to aggregate column B and store the ranks for each A, for example: B Ranks B1 [1,1,3] B2 [2,2,2] B3 [3,2
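A minimal sketch of the two steps described above: rank C within each A partition with a window function, then collect the ranks per B. Using dense_rank (so ties such as A2/B2 and A2/B3 share a rank) is an assumption that matches the expected output shown.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# df is the dataframe shown in the question.
# Rank rows within each value of A, higher C first.
w = Window.partitionBy("A").orderBy(F.desc("C"))
ranked = df.withColumn("Rank", F.dense_rank().over(w))

# Aggregate: one list of ranks per value of B.
aggregated = ranked.groupBy("B").agg(F.collect_list("Rank").alias("Ranks"))
aggregated.show()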

Pyspark DataFrame UDF on Text Column

帅比萌擦擦* submitted on 2019-12-30 01:58:04
Question: I'm trying to do some NLP text clean-up of some Unicode columns in a PySpark DataFrame. I've tried Spark 1.3, 1.5 and 1.6 and can't seem to get things to work for the life of me. I've also tried using Python 2.7 and Python 3.4. I've created an extremely simple UDF, as seen below, that should just return a string back for each record in a new column. Other functions will manipulate the text and then return the changed text back in a new column. import pyspark from pyspark.sql import
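A minimal sketch of such a string-returning UDF, written against the Spark 2.x API for brevity (the question mentions Spark 1.3 to 1.6, where a SQLContext/HiveContext would be used instead); the clean-up logic and column names are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(u"Some Unicode text ",)], ["text"])

def clean_text(s):
    # Placeholder clean-up: trim whitespace and lower-case.
    return s.strip().lower() if s is not None else None

clean_text_udf = udf(clean_text, StringType())
df = df.withColumn("text_clean", clean_text_udf(df["text"]))
df.show()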

Pyspark dataframe operator “IS NOT IN”

空扰寡人 submitted on 2019-12-30 01:40:07
Question: I would like to rewrite this from R to PySpark, any nice-looking suggestions? array <- c(1,2,3) dataset <- filter(!(column %in% array)) Answer 1: In PySpark you can do it like this: array = [1, 2, 3] dataframe.filter(dataframe.column.isin(*array) == False) Or using the binary NOT operator: dataframe.filter(~dataframe.column.isin(*array)) Answer 2: Use the ~ operator, which means negation: df_filtered = df.filter(~df["column_name"].isin([1, 2, 3])) Answer 3: df_result = df[df.column_name.isin([1, 2, 3]) ==
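A self-contained version of the "NOT IN" pattern from the answers above; the example data and column name are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,)], ["column_name"])

array = [1, 2, 3]
# Keep only the rows whose column_name is NOT in the list.
df_filtered = df.filter(~df["column_name"].isin(array))
df_filtered.show()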

How to convert type Row into Vector to feed to KMeans

泄露秘密 submitted on 2019-12-30 00:39:49
Question: When I try to feed df2 to KMeans I get the following error: clusters = KMeans.train(df2, 10, maxIterations=30, runs=10, initializationMode="random") The error I get is: Cannot convert type <class 'pyspark.sql.types.Row'> into Vector. df2 is a DataFrame created as follows: df = sqlContext.read.json("data/ALS3.json") df2 = df.select('latitude','longitude') df2.show() latitude| longitude| 60.1643075| 24.9460844| 60.4686748| 22.2774728| How can I convert these two columns to a Vector and feed it to KMeans
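A minimal sketch of the usual fix: MLlib's KMeans.train expects an RDD of vectors, so map each Row of df2 to a DenseVector first. Column names follow the question; the deprecated runs argument is dropped here.

from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

# df2 is the DataFrame selected in the question above.
# Turn each Row(latitude, longitude) into a dense vector.
points = df2.rdd.map(lambda row: Vectors.dense([row.latitude, row.longitude]))

clusters = KMeans.train(points, 10, maxIterations=30,
                        initializationMode="random")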

Pyspark convert a standard list to data frame [duplicate]

走远了吗. submitted on 2019-12-30 00:35:49
Question: This question already has an answer here: Create Spark DataFrame. Can not infer schema for type: <type 'float'> (1 answer). Closed last year. The case is really simple: I need to convert a Python list into a data frame with the following code: from pyspark.sql.types import StructType from pyspark.sql.types import StructField from pyspark.sql.types import StringType, IntegerType schema = StructType([StructField("value", IntegerType(), True)]) my_list = [1, 2, 3, 4] rdd = sc.parallelize(my_list) df =
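A minimal sketch of a working version of the snippet above: each list element is wrapped in a tuple so it matches the one-field schema (shown with the Spark 2.x SparkSession entry point; with an existing sc/sqlContext the createDataFrame call is the same):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("value", IntegerType(), True)])
my_list = [1, 2, 3, 4]

# Wrap each element in a tuple so every record has one field matching the schema.
df = spark.createDataFrame([(x,) for x in my_list], schema)
df.show()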