apache-spark-sql

Add column to pyspark dataframe based on a condition [duplicate]

Submitted by 寵の児 on 2020-06-28 01:57:57
Question: This question already has answers here: Spark Equivalent of IF Then ELSE (4 answers). Closed last year.

My data.csv file has three columns, as shown below, and I have converted this file to a PySpark DataFrame.

| A | B | C |
| 1 | -3 | 4 |
| 2 | 0 | 5 |
| 6 | 6 | 6 |

I want to add another column D to the Spark DataFrame, with the value Yes or No based on the condition that if the corresponding value in column B is greater than 0 then Yes, otherwise No.

| A | B | C | D |
| 1 | -3 | 4 | No |
| 2 | 0 | 5 | No |
| 6 | 6 | 6 | Yes |
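
A minimal sketch of the standard approach with when/otherwise from pyspark.sql.functions; the DataFrame construction below is only illustrative and mirrors the question's table:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data matching the question's table
df = spark.createDataFrame([(1, -3, 4), (2, 0, 5), (6, 6, 6)], ["A", "B", "C"])

# D is "Yes" when B > 0, otherwise "No"
df = df.withColumn("D", F.when(F.col("B") > 0, "Yes").otherwise("No"))
df.show()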

Spark 2.1 cannot write Vector field on CSV

Submitted by 此生再无相见时 on 2020-06-27 21:55:37
Question: I was migrating my code from Spark 2.0 to 2.1 when I stumbled into a problem related to DataFrame saving. Here's the code:

import org.apache.spark.sql.types._
import org.apache.spark.ml.linalg.VectorUDT

val df = spark.createDataFrame(Seq(Tuple1(1))).toDF("values")
val toSave = new org.apache.spark.ml.feature.VectorAssembler().setInputCols(Array("values")).transform(df)
toSave.write.csv(path)

This code succeeds with Spark 2.0.0. Using Spark 2.1.0.cloudera1, I get the following error: java
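
For context, Spark's CSV source cannot serialize ML vector columns, so a common workaround is to convert the vector to a string (or drop it) before writing. A hedged PySpark sketch of that idea; the output column name and path below are placeholders, not from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,)], ["values"])
assembled = VectorAssembler(inputCols=["values"], outputCol="features").transform(df)

# Stringify the vector column so the CSV writer can handle it
to_str = F.udf(lambda v: str(v), StringType())
assembled.withColumn("features", to_str("features")).write.csv("/tmp/vector_as_csv")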

Pyspark transform method that's equivalent to the Scala Dataset#transform method

Submitted by 随声附和 on 2020-06-27 17:49:05
Question: The Spark Scala API has a Dataset#transform method that makes it easy to chain custom DataFrame transformations, like so:

val weirdDf = df
  .transform(myFirstCustomTransformation)
  .transform(anotherCustomTransformation)

I don't see an equivalent transform method for PySpark in the documentation. Is there a PySpark way to chain custom transformations? If not, how can the pyspark.sql.DataFrame class be monkey patched to add a transform method?

Update: The transform method was added to PySpark as
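
Before the method existed natively, the usual workaround was exactly the monkey patch the question asks about: attach a transform that simply applies a function to the DataFrame. A sketch, with hypothetical custom transformations:

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

# Mirror Scala's Dataset#transform: apply f to the DataFrame and return the result
DataFrame.transform = lambda self, f: f(self)

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["x"])

# Hypothetical custom transformations
def with_double(df):
    return df.withColumn("double_x", F.col("x") * 2)

def with_even_flag(df):
    return df.withColumn("is_even", F.col("x") % 2 == 0)

weird_df = df.transform(with_double).transform(with_even_flag)
weird_df.show()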

Calculate new column in spark Dataframe, crossing a tokens list column in df1 with a text column in df2 with pyspark

Submitted by [亡魂溺海] on 2020-06-27 17:00:29
Question: I am using Spark 2.4.5 and I need to calculate a sentiment score from a token-list column (the MeaningfulWords column) of df1, according to the words in df2 (a Spanish sentiment dictionary). In df1 I must create a new column with the list of token scores and another column with the mean of those scores (sum of scores / count of words) for each record. If any token in the list (df1) is not in the dictionary (df2), it scores zero. The DataFrames look like this: df1.select("ID","MeaningfulWords").show
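
One hedged way to approach this (the dictionary column names word and score are assumptions, since the question is cut off): explode the token list, left join against the dictionary, score missing words as 0, then aggregate back per ID.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-ins for the question's DataFrames
df1 = spark.createDataFrame([(1, ["bueno", "malo", "xyz"])], ["ID", "MeaningfulWords"])
df2 = spark.createDataFrame([("bueno", 1.0), ("malo", -1.0)], ["word", "score"])

# Explode tokens, left join on the dictionary, and score missing words as 0
exploded = df1.select("ID", F.explode("MeaningfulWords").alias("word"))
scored = (exploded.join(df2, on="word", how="left")
          .withColumn("score", F.coalesce(F.col("score"), F.lit(0.0))))

# Aggregate back per record: the list of scores and their mean
result = (scored.groupBy("ID")
          .agg(F.collect_list("score").alias("scores"),
               F.avg("score").alias("mean_score")))
result.show(truncate=False)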

Spark decimal type precision loss

Submitted by 大兔子大兔子 on 2020-06-27 08:53:12
Question: I'm doing some testing of Spark decimal types for currency measures and am seeing some odd precision results when I set the scale and precision as shown below. I want to be sure that I won't have any data loss during calculations, but the example below is not reassuring. Can anyone tell me why this is happening with Spark SQL? Currently on version 2.3.0.

val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val"""
spark.sql(sql).show

This
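
The question is cut off, but the behaviour can be reproduced in PySpark as below. Spark's decimal arithmetic rules shrink the scale of a division result so the overall precision fits in 38 digits (down to a minimum scale, typically 6), and the outer cast back to decimal(38,14) cannot recover the dropped digits; that is the usual explanation for the apparent loss, offered here as background rather than a definitive diagnosis.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The question's query verbatim
sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14)) as decimal(38,14)) val"""

# Expect a value like 0.33333300000000 rather than 14 significant decimals
spark.sql(sql).show(truncate=False)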

Modify a struct column in spark dataframe

Submitted by 时光毁灭记忆、已成空白 on 2020-06-27 04:17:13
Question: I have a PySpark DataFrame which contains a column "student", as follows:

"student": { "name": "kaleem", "rollno": "12" }

The schema of this column in the DataFrame is:

structType(List( name: String, rollno: String ))

I need to modify this column to:

"student": { "student_details": { "name": "kaleem", "rollno": "12" } }

so that its schema in the DataFrame becomes:

structType(List( student_details: structType(List( name: String, rollno: String )) ))

How can I do this in Spark?

Answer 1: Use the named_struct function to
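
The answer is cut off, but the idea of wrapping the existing struct one level deeper can be sketched in PySpark with functions.struct (named_struct is the Spark SQL counterpart the answer mentions); the sample data is illustrative:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("student", StructType([
    StructField("name", StringType()),
    StructField("rollno", StringType()),
]))])
df = spark.createDataFrame([(("kaleem", "12"),)], schema)

# Wrap the existing struct inside a new outer struct field "student_details"
nested = df.withColumn("student", F.struct(F.col("student").alias("student_details")))
nested.printSchema()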

Serialization issues DF vs. RDD

Submitted by 半城伤御伤魂 on 2020-06-27 04:11:12
Question: The hardest thing in Spark is serialization, IMHO. I looked at https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54 some time ago and I am fairly sure I get it, at least the Object aspects; I ran the code and it behaves as in the examples. However, I am curious about a few other aspects when testing in a notebook on a Databricks Community Edition account (not a real cluster, BTW). I did also check and confirm the behaviour on a Spark Standalone cluster via the spark-shell. This does
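
The question is cut off, but the core point of the linked article (referencing a member of an enclosing object drags the whole object into the serialized closure) has a direct PySpark analogue; the class below is a generic illustration, not the poster's code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

class Scaler:
    def __init__(self, factor):
        self.factor = factor

    def scale_bad(self, rdd):
        # Referencing self.factor captures the whole Scaler instance in the closure;
        # anything unpicklable inside it would make the task fail.
        return rdd.map(lambda x: x * self.factor)

    def scale_good(self, rdd):
        # Copying the field to a local variable keeps the closure small and safe.
        factor = self.factor
        return rdd.map(lambda x: x * factor)

print(Scaler(2).scale_good(sc.parallelize([1, 2, 3])).collect())  # [2, 4, 6]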

Rename written CSV file Spark

Submitted by 偶尔善良 on 2020-06-27 03:52:09
Question: I'm running Spark 2.1 and I want to write a CSV with results to Amazon S3. After repartitioning, the CSV file has a rather long, cryptic name and I want to change that to a specific filename. I'm using the Databricks library for writing to S3.

dataframe
  .repartition(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("folder/dataframe/")

Is there a way to rename the file afterwards, or even save it directly with the correct name? I've already looked for solutions and
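
The question is cut off, but a common pattern is to let Spark write its single part file and then rename it through the Hadoop FileSystem API. A hedged PySpark sketch; the paths and target filename are placeholders, and spark._jvm / spark._jsc are internal accessors:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# ...first write the single part file, e.g. df.repartition(1).write...save("folder/dataframe/")

Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# Locate the part-* file and rename it; note that on S3 a rename is really a copy plus delete
part_file = fs.globStatus(Path("folder/dataframe/part-*"))[0].getPath()
fs.rename(part_file, Path("folder/dataframe/result.csv"))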