apache-spark-sql

Add column to pyspark dataframe based on a condition [duplicate]

Submitted by 寵の児 on 2020-06-28 01:57:57
Question: This question already has answers here: Spark Equivalent of IF Then ELSE (4 answers). Closed last year.

My data.csv file has three columns, as shown below, and I have converted this file to a PySpark DataFrame.

| A | B | C |
| 1 | -3 | 4 |
| 2 | 0 | 5 |
| 6 | 6 | 6 |

I want to add another column D to the Spark DataFrame, with the value Yes or No based on the condition that if the corresponding value in column B is greater than 0 then Yes, otherwise No.

| A | B | C | D |
| 1 | -3 | 4 | No |
| 2 | 0 | 5 | No |
| 6 | 6 | 6 | Yes |
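
A minimal sketch of the standard approach with when/otherwise from pyspark.sql.functions; the DataFrame construction below is only illustrative and mirrors the question's table:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data matching the question's table
df = spark.createDataFrame([(1, -3, 4), (2, 0, 5), (6, 6, 6)], ["A", "B", "C"])

# D is "Yes" when B > 0, otherwise "No"
df = df.withColumn("D", F.when(F.col("B") > 0, "Yes").otherwise("No"))
df.show()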

Spark 2.1 cannot write Vector field on CSV

Submitted by 此生再无相见时 on 2020-06-27 21:55:37
Question: I was migrating my code from Spark 2.0 to 2.1 when I stumbled into a problem related to DataFrame saving. Here's the code:

import org.apache.spark.sql.types._
import org.apache.spark.ml.linalg.VectorUDT

val df = spark.createDataFrame(Seq(Tuple1(1))).toDF("values")
val toSave = new org.apache.spark.ml.feature.VectorAssembler().setInputCols(Array("values")).transform(df)
toSave.write.csv(path)

This code succeeds with Spark 2.0.0. Using Spark 2.1.0.cloudera1, I get the following error: java
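
For context, Spark's CSV source cannot serialize ML vector columns, so a common workaround is to convert the vector to a string (or drop it) before writing. A hedged PySpark sketch of that idea; the output column name and path below are placeholders, not from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,)], ["values"])
assembled = VectorAssembler(inputCols=["values"], outputCol="features").transform(df)

# Stringify the vector column so the CSV writer can handle it
to_str = F.udf(lambda v: str(v), StringType())
assembled.withColumn("features", to_str("features")).write.csv("/tmp/vector_as_csv")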

Pyspark transform method that's equivalent to the Scala Dataset#transform method

Submitted by 随声附和 on 2020-06-27 17:49:05
Question: The Spark Scala API has a Dataset#transform method that makes it easy to chain custom DataFrame transformations, like so:

val weirdDf = df
  .transform(myFirstCustomTransformation)
  .transform(anotherCustomTransformation)

I don't see an equivalent transform method for PySpark in the documentation. Is there a PySpark way to chain custom transformations? If not, how can the pyspark.sql.DataFrame class be monkey patched to add a transform method?

Update: The transform method was added to PySpark as
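
Before the method existed natively, the usual workaround was exactly the monkey patch the question asks about: attach a transform that simply applies a function to the DataFrame. A sketch, with hypothetical custom transformations:

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

# Mirror Scala's Dataset#transform: apply f to the DataFrame and return the result
DataFrame.transform = lambda self, f: f(self)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["x"])

# Hypothetical custom transformations
def with_double(df):
    return df.withColumn("double_x", F.col("x") * 2)

def with_even_flag(df):
    return df.withColumn("is_even", F.col("x") % 2 == 0)

weird_df = df.transform(with_double).transform(with_even_flag)
weird_df.show()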

Calculate new column in spark Dataframe, crossing a tokens list column in df1 with a text column in df2 with pyspark

Submitted by [亡魂溺海] on 2020-06-27 17:00:29
Question: I am using Spark 2.4.5 and I need to calculate a sentiment score from a token-list column (the MeaningfulWords column) of df1, according to the words in df2 (a Spanish sentiment dictionary). In df1 I must create a new column with the list of token scores and another column with the mean of those scores (sum of scores / count of words) for each record. If any token in the list (df1) is not in the dictionary (df2), it scores zero. The DataFrames look like this: df1.select("ID","MeaningfulWords").show
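
One hedged way to approach this (the dictionary column names word and score are assumptions, since the question is cut off): explode the token list, left join against the dictionary, score missing words as 0, then aggregate back per ID.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-ins for the question's DataFrames
df1 = spark.createDataFrame([(1, ["bueno", "malo", "xyz"])], ["ID", "MeaningfulWords"])
df2 = spark.createDataFrame([("bueno", 1.0), ("malo", -1.0)], ["word", "score"])

# Explode tokens, left join on the dictionary, and score missing words as 0
exploded = df1.select("ID", F.explode("MeaningfulWords").alias("word"))
scored = (exploded.join(df2, on="word", how="left")
          .withColumn("score", F.coalesce(F.col("score"), F.lit(0.0))))

# Aggregate back per record: the list of scores and their mean
result = (scored.groupBy("ID")
          .agg(F.collect_list("score").alias("scores"),
               F.avg("score").alias("mean_score")))
result.show(truncate=False)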

Spark decimal type precision loss

Submitted by 大兔子大兔子 on 2020-06-27 08:53:12
Question: I'm doing some testing of Spark decimal types for currency measures and am seeing some odd precision results when I set the scale and precision as shown below. I want to be sure that I won't have any data loss during calculations, but the example below is not reassuring. Can anyone tell me why this is happening with Spark SQL? Currently on version 2.3.0.

val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val"""
spark.sql(sql).show

This
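
The question is cut off, but the behaviour can be reproduced in PySpark as below. Spark's decimal arithmetic rules shrink the scale of a division result so the overall precision fits in 38 digits (down to a minimum scale, typically 6), and the outer cast back to decimal(38,14) cannot recover the dropped digits; that is the usual explanation for the apparent loss, offered here as background rather than a definitive diagnosis.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The question's query verbatim
sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val"""

# Expect a value like 0.33333300000000 rather than 14 significant decimals
spark.sql(sql).show(truncate=False)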

Modify a struct column in spark dataframe

Submitted by 时光毁灭记忆、已成空白 on 2020-06-27 04:17:13
Question: I have a PySpark DataFrame which contains a column "student", as follows:

"student": { "name": "kaleem", "rollno": "12" }

The schema of this column in the DataFrame is:

structType(List( name: String, rollno: String ))

I need to modify this column to:

"student": { "student_details": { "name": "kaleem", "rollno": "12" } }

so that its schema in the DataFrame becomes:

structType(List( student_details: structType(List( name: String, rollno: String )) ))

How can I do this in Spark?

Answer 1: Use the named_struct function to
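
The answer is cut off, but the idea of wrapping the existing struct one level deeper can be sketched in PySpark with functions.struct (named_struct is the Spark SQL counterpart the answer mentions); the sample data is illustrative:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("student", StructType([
    StructField("name", StringType()),
    StructField("rollno", StringType()),
]))])
df = spark.createDataFrame([(("kaleem", "12"),)], schema)

# Wrap the existing struct inside a new outer struct field "student_details"
nested = df.withColumn("student", F.struct(F.col("student").alias("student_details")))
nested.printSchema()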

Serialization issues DF vs. RDD

Submitted by 半城伤御伤魂 on 2020-06-27 04:11:12
Question: The hardest thing in Spark is serialization, IMHO. I looked at https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54 some time ago and I am fairly sure I get it, at least the Object aspects; I ran the code and it behaves as in the examples. However, I am curious about a few other aspects when testing in a notebook on a Databricks Community Edition account (not a real cluster, BTW). I did also check and confirm the behaviour on a Spark Standalone cluster via the spark-shell. This does
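
The question is cut off, but the core point of the linked article (referencing a member of an enclosing object drags the whole object into the serialized closure) has a direct PySpark analogue; the class below is a generic illustration, not the poster's code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

class Scaler:
    def __init__(self, factor):
        self.factor = factor

    def scale_bad(self, rdd):
        # Referencing self.factor captures the whole Scaler instance in the closure;
        # anything unpicklable inside it would make the task fail.
        return rdd.map(lambda x: x * self.factor)

    def scale_good(self, rdd):
        # Copying the field to a local variable keeps the closure small and safe.
        factor = self.factor
        return rdd.map(lambda x: x * factor)

print(Scaler(2).scale_good(sc.parallelize([1, 2, 3])).collect())  # [2, 4, 6]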

Rename written CSV file Spark

Submitted by 偶尔善良 on 2020-06-27 03:52:09
Question: I'm running Spark 2.1 and I want to write a CSV with results to Amazon S3. After repartitioning, the CSV file has a rather long, cryptic name and I want to change that to a specific filename. I'm using the Databricks library for writing to S3.

dataframe
  .repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("folder/dataframe/")

Is there a way to rename the file afterwards, or even save it directly with the correct name? I've already looked for solutions and
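
The question is cut off, but a common pattern is to let Spark write its single part file and then rename it through the Hadoop FileSystem API. A hedged PySpark sketch; the paths and target filename are placeholders, and spark._jvm / spark._jsc are internal accessors:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ...first write the single part file, e.g. df.repartition(1).write...save("folder/dataframe/")

Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# Locate the part-* file and rename it; note that on S3 a rename is really a copy plus delete
part_file = fs.globStatus(Path("folder/dataframe/part-*"))[0].getPath()
fs.rename(part_file, Path("folder/dataframe/result.csv"))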