pyspark

How to create multiple flag columns based on list values found in the dataframe column?

Submitted by 余生颓废 on 2020-02-04 05:33:21
Question: The table looks like this:

ID | CITY
----------------------------------
1  | London|Paris|Tokyo
2  | Tokyo|Barcelona|Mumbai|London
3  | Vienna|Paris|Seattle

The CITY column contains around 1000+ values, which are |-delimited. I want to create a flag column to indicate if a person visited only the cities of interest.

city_of_interest = ['Paris', 'Seattle', 'Tokyo']

There are 20 such values in the list. The output should look like this:

ID | Paris | Seattle | Tokyo
-------------------------------------------
1  | 1
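A minimal sketch of one possible approach, assuming Spark 2.x or later and that CITY is a plain |-delimited string column (the sample data and names below are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("city_flags").getOrCreate()

df = spark.createDataFrame(
    [(1, "London|Paris|Tokyo"),
     (2, "Tokyo|Barcelona|Mumbai|London"),
     (3, "Vienna|Paris|Seattle")],
    ["ID", "CITY"],
)

city_of_interest = ["Paris", "Seattle", "Tokyo"]

# Split the delimited string once, then derive one 0/1 flag column per city.
cities = F.split(F.col("CITY"), r"\|")
flags = [F.array_contains(cities, c).cast("int").alias(c) for c in city_of_interest]
df.select("ID", *flags).show()

With roughly 20 cities of interest, the list comprehension keeps this to a single projection instead of 20 separate withColumn calls.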

Cannot load a saved Spark model in pyspark: “java.lang.NoSuchMethodException”

Submitted by 点点圈 on 2020-02-03 10:12:09
Question: When I run the following Python program

from pyspark.ml.classification import LinearSVC
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sparkmodel").getOrCreate()
data = spark.read.format("libsvm").load("/usr/local/spark/data/mllib/sample_libsvm_data.txt")
model = LinearSVC().fit(data)
model.save("mymodel")
LinearSVC.load("mymodel")

the load fails with a "java.lang.NoSuchMethodException".

/anaconda3/envs/scratch/bin/python /Users/billmcn/src/toy/sparkmodel
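A hedged sketch of the usual fix for this pattern: the object returned by fit() is a LinearSVCModel, so it would be loaded with the model class rather than the LinearSVC estimator class (assuming that mismatch is what triggers the NoSuchMethodException here):

from pyspark.ml.classification import LinearSVC, LinearSVCModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sparkmodel").getOrCreate()
data = spark.read.format("libsvm").load("/usr/local/spark/data/mllib/sample_libsvm_data.txt")

model = LinearSVC().fit(data)
model.save("mymodel")

# Load the fitted model with the model class, not the estimator class.
loaded = LinearSVCModel.load("mymodel")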

What is causing 'unicode' object has no attribute 'toordinal' in pyspark?

Submitted by 筅森魡賤 on 2020-02-03 08:20:26
Question: I got this error but I don't know what causes it. My Python code runs in pyspark. The stack trace is long, so I only show part of it. None of the stack trace points to my code, so I don't know where to look. What is the possible cause of this error?

/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306             raise Py4JJavaError(
    307                 "An error occurred while calling {0}{1}{2}.\n".
--> 308                 format(target_id, ".",
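For context only, this error is often reported when a schema field declared as DateType (or TimestampType) is given a Python string instead of a date/datetime object, because Spark calls .toordinal() while serializing the value. A minimal sketch of that situation and one way around it, assuming that is the cause here (the schema and data below are illustrative):

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType

spark = SparkSession.builder.appName("date_example").getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("visit_date", DateType(), True),
])

rows = [("alice", "2015-12-30"), ("bob", "2015-12-29")]

# Passing the raw strings for the DateType field is what raises
# "'unicode' object has no attribute 'toordinal'" on Python 2.
# Converting them to datetime.date first avoids it.
parsed = [(name, datetime.strptime(d, "%Y-%m-%d").date()) for name, d in rows]
df = spark.createDataFrame(parsed, schema)
df.show()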

Chained spark column expressions with distinct window specs produce inefficient DAG

Submitted by 醉酒当歌 on 2020-02-03 05:16:34
Question:

Context: Let's say you deal with time-series data. Your desired outcome relies on multiple window functions with distinct window specifications. The result may resemble a single Spark column expression, like an identifier for intervals.

Status quo: Usually, I don't store intermediate results with df.withColumn but rather chain/stack column expressions and trust Spark to find the most effective DAG (when dealing with a DataFrame).

Reproducible example: However, in the following example (PySpark 2.4
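For illustration, a runnable sketch of the interval-identifier pattern the question alludes to, using two distinct window specifications; unlike the question's approach, the intermediate results are materialized with withColumn here, and all names and the gap threshold are made up:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("interval_ids").getOrCreate()

df = spark.createDataFrame(
    [("s1", 1), ("s1", 2), ("s1", 20), ("s1", 21), ("s2", 5)],
    ["sensor", "ts"],
)

# First window spec: ordered within each sensor.
w_ordered = Window.partitionBy("sensor").orderBy("ts")

# A gap of more than 10 time units starts a new interval.
is_new = (F.col("ts") - F.lag("ts").over(w_ordered)) > 10
df = df.withColumn("new_interval", F.coalesce(is_new.cast("int"), F.lit(1)))

# Running sum over the ordered window yields an interval identifier.
df = df.withColumn("interval_id", F.sum("new_interval").over(w_ordered))

# Second, unordered window spec: size of each interval.
w_interval = Window.partitionBy("sensor", "interval_id")
df = df.withColumn("interval_size", F.count("*").over(w_interval))

df.show()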

Concat multiple columns of a dataframe using pyspark

Submitted by 允我心安 on 2020-02-02 13:46:30
Question: Suppose I have a list of columns, for example:

col_list = ['col1', 'col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1', 'col2', 'col3']

I need to create a new column by concatenating col1 and col2. I don't want to hard-code the column names while concatenating but need to pick them from the list. How can I do this?

Answer 1: You can use pyspark.sql.functions.concat() to concatenate as many columns as you specify in your list. Keep passing them as arguments.

from pyspark.sql
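A short sketch of how that answer is typically completed, unpacking the list into concat() (the sample DataFrame below is illustrative, and the columns are assumed to be string-typed):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("concat_cols").getOrCreate()

df = spark.createDataFrame([("a", "b", "c")], ["col1", "col2", "col3"])
col_list = ["col1", "col2"]

# Unpack the list so each column name becomes a separate argument to concat().
df = df.withColumn("concatenated", F.concat(*col_list))
df.show()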

How to overwrite the rdd saveAsPickleFile(path) if the file already exists in pyspark?

Submitted by 狂风中的少年 on 2020-02-02 12:45:36
Question: How do I overwrite RDD output objects at an existing path when saving?

test1:
975078|56691|2.000|20171001_926_570_1322
975078|42993|1.690|20171001_926_570_1322
975078|46462|2.000|20171001_926_570_1322
975078|87815|1.000|20171001_926_570_1322

rdd = sc.textFile('/home/administrator/work/test1').map(
    lambda x: x.split("|")[:4]).map(
    lambda r: Row(user_code=r[0], item_code=r[1], qty=float(r[2])))
rdd.coalesce(1).saveAsPickleFile("/home/administrator/work/foobar_seq1")

The first time it
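For reference, saveAsPickleFile has no overwrite mode, so a common workaround is to delete the target path through the Hadoop FileSystem API before saving again. A sketch of that idea, reusing sc and rdd from the question (note that sc._jsc and sc._jvm are internal attributes of SparkContext):

target = "/home/administrator/work/foobar_seq1"

# Recursively delete the output directory if it already exists, then save as usual.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
if fs.exists(Path(target)):
    fs.delete(Path(target), True)

rdd.coalesce(1).saveAsPickleFile(target)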

PySpark - Split/Filter DataFrame by column's values

Submitted by 筅森魡賤 on 2020-02-02 04:43:19
Question: I have a DataFrame similar to this example:

Timestamp  | Word      | Count
30/12/2015 | example_1 | 3
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
27/12/2015 | example_3 | 7
...        | ...       | ...

and I want to split this DataFrame by the 'Word' column's values to obtain a "list" of DataFrames (to plot some figures in a next step). For example:

DF1
Timestamp  | Word      | Count
30/12/2015 | example_1 | 3

DF2
Timestamp  | Word      | Count
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9

DF3
Timestamp |
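One possible sketch: collect the distinct values of Word and build a dictionary of filtered DataFrames, one per word (the sample data is taken from the question; whether this scales depends on how many distinct words there are):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("split_by_word").getOrCreate()

df = spark.createDataFrame(
    [("30/12/2015", "example_1", 3),
     ("29/12/2015", "example_2", 1),
     ("28/12/2015", "example_2", 9),
     ("27/12/2015", "example_3", 7)],
    ["Timestamp", "Word", "Count"],
)

# Collect the distinct words, then build one filtered DataFrame per word.
words = [row["Word"] for row in df.select("Word").distinct().collect()]
frames = {w: df.filter(F.col("Word") == w) for w in words}

frames["example_2"].show()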