apache-spark-sql

Error creating transactional connection factory when running a Spark on Hive project in IDEA

Submitted by 一笑奈何 on 2021-01-27 04:51:34
Question: I am trying to set up a development environment for a Spark Streaming project that needs to write data into Hive. I have a cluster with 1 master, 2 slaves, and 1 development machine (coding in IntelliJ IDEA 14). Within the spark-shell everything seems to work fine, and I am able to store data into the default database in Hive via Spark 1.5 using DataFrame.write.insertInto("testtable"). However, when I create a Scala project in IDEA and run it against the same cluster with the same settings, an error is thrown when …
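
This error often indicates that the application launched from IDEA is not picking up the Hive metastore configuration, so DataNucleus falls back to creating its own connection. Below is a minimal sketch of how such a standalone app can be wired up against Spark 1.5's HiveContext; it assumes spark-hive is on the classpath and that a hive-site.xml pointing at the real metastore sits in src/main/resources. The master URL and table name are placeholders, not values from the question.

    // Minimal sketch (Spark 1.5 era API) of writing into Hive from a compiled Scala app.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveWriteExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("HiveWriteExample")
          .setMaster("spark://master-host:7077") // placeholder master URL
        val sc = new SparkContext(conf)
        val hiveContext = new HiveContext(sc)    // reads hive-site.xml from the classpath
        import hiveContext.implicits._

        val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")
        df.write.insertInto("testtable")         // same call that works in spark-shell
        sc.stop()
      }
    }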

Iterate through Columns of a Spark Dataframe and update specified values

Submitted by 我怕爱的太早我们不能终老 on 2021-01-24 21:23:37
Question: To iterate through the columns of a Spark DataFrame created from a Hive table and update all occurrences of the desired column values, I tried the following code.

    import org.apache.spark.sql.{DataFrame}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.functions.udf

    val a: DataFrame = spark.sql(s"select * from default.table_a")
    val column_names: Array[String] = a.columns
    val required_columns: Array[String] = column_names.filter(name => name.endsWith("_date"))
    val func = udf((value: …
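
The snippet above is cut off at the UDF definition. One common way to finish this pattern, sketched below, is to fold over the matching column names and rewrite each one with the same UDF; the UDF body here (mapping a hypothetical sentinel date to null) is an assumption, since the original code is truncated.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}

    val a: DataFrame = spark.sql("select * from default.table_a")
    val requiredColumns: Array[String] = a.columns.filter(_.endsWith("_date"))

    // Hypothetical replacement rule: treat "9999-12-31" as a missing date.
    val fixDate = udf((value: String) => if (value == "9999-12-31") null else value)

    // Fold over the matching columns, overwriting each one in turn.
    val updated: DataFrame = requiredColumns.foldLeft(a) { (df, name) =>
      df.withColumn(name, fixDate(col(name)))
    }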

Validate date format in a dataframe column in pyspark

Submitted by 自闭症网瘾萝莉.ら on 2021-01-21 12:12:28
Question: I have a DataFrame with a Date column along with a few other columns. I want to validate the values in the Date column and check whether they are in the "dd/MM/yyyy" format. If the Date column holds any other format, the row should be marked as a bad record. So I am using option("dateFormat", "dd/MM/yyyy") to accept dates in that format, and it accepts dates in "dd/MM/yyyy" properly, but if I pass an invalid format (YYYY/mm/dd) the record is still not marked as invalid and the passed date is converted to garbage …
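
One common approach, sketched below, is to re-parse the column with the expected pattern via to_date and flag rows where the parse comes back null as bad records. The question is tagged pyspark, but the same to_date/when functions exist there, so this Scala sketch translates directly; note that Spark 3.x's stricter parser rejects mismatched values, while the legacy Spark 2.x parser can be lenient. The DataFrame name df and the flag column name are assumptions.

    import org.apache.spark.sql.functions.{col, lit, to_date, when}

    // Rows whose raw Date is non-null but fails to parse as dd/MM/yyyy get flagged.
    val withFlag = df
      .withColumn("parsed_date", to_date(col("Date"), "dd/MM/yyyy"))
      .withColumn("is_bad_record",
        when(col("Date").isNotNull && col("parsed_date").isNull, lit(true))
          .otherwise(lit(false)))

    val badRecords = withFlag.filter(col("is_bad_record"))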

How to dynamically slice an Array column in Spark?

Submitted by 人盡茶涼 on 2021-01-21 10:36:52
Question: Spark 2.4 introduced the new SQL function slice, which can be used to extract a certain range of elements from an array column. I want to define that range dynamically per row, based on an integer column that holds the number of elements I want to pick from the array column. However, simply passing the column to the slice function fails; the function appears to expect integers for the start and length values. Is there a way of doing this without writing a UDF? To visualize the problem with an example: I have …
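
On Spark 2.4 the usual workaround is to go through expr(): the Scala slice() helper only takes Int start and length, but the SQL form of the function accepts column references (later Spark versions added a Column-based overload). A sketch with placeholder column names letters and n, assuming a SparkSession named spark:

    import org.apache.spark.sql.functions.expr
    import spark.implicits._

    val df = Seq(
      (Seq("a", "b", "c", "d"), 2),
      (Seq("x", "y", "z"), 1)
    ).toDF("letters", "n")

    // Take the first n elements of each row's array, with n read per row.
    val sliced = df.withColumn("first_n", expr("slice(letters, 1, n)"))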

Calculate value based on value from same column of the previous row in spark

Submitted by 非 Y 不嫁゛ on 2021-01-21 10:23:29
Question: I have an issue where I have to calculate a column using a formula that uses the value computed in the previous row. I am unable to figure it out using the withColumn API. I need to calculate a new column using the formula: MovingRate = MonthlyRate + (0.7 * MovingRatePrevious) ... where MovingRatePrevious is the MovingRate of the prior row. For month 1 I have the value, so I do not need to re-calculate it, but I need that value to be able to calculate the subsequent rows. I …
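
Because each MovingRate depends on the previously computed MovingRate, a plain lag() over the raw column cannot express the recursion. One option, sketched below, is to group by key, sort each group by month, and run the recursion with an ordinary Scala fold via flatMapGroups. The field names (id, month, monthlyRate), the grouping key, and seeding month 1 with its own MonthlyRate are assumptions about the asker's data; a SparkSession named spark and an input DataFrame df are assumed, and each group is assumed to fit in memory.

    import spark.implicits._

    case class RateRow(id: String, month: Int, monthlyRate: Double)
    case class RateWithMoving(id: String, month: Int, monthlyRate: Double, movingRate: Double)

    val result = df.as[RateRow]
      .groupByKey(_.id)
      .flatMapGroups { (id, rows) =>
        // Sort the group's rows by month, then carry the computed rate forward.
        val sorted = rows.toSeq.sortBy(_.month)
        val first = sorted.head
        sorted.tail.scanLeft(
          RateWithMoving(id, first.month, first.monthlyRate, first.monthlyRate)
        ) { (prev, r) =>
          RateWithMoving(id, r.month, r.monthlyRate, r.monthlyRate + 0.7 * prev.movingRate)
        }
      }
      .toDF()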