apache-spark-sql

Error creating transactional connection factory when running a Spark on Hive project in IDEA

Submitted by 一笑奈何 on 2021-01-27 04:51:34
Question: I am trying to set up a development environment for a Spark Streaming project that needs to write data into Hive. I have a cluster with 1 master, 2 slaves, and 1 development machine (coding in IntelliJ IDEA 14). Within the spark-shell everything seems to work fine, and I am able to store data into the default database in Hive via Spark 1.5 using DataFrame.write.insertInto("testtable"). However, when I create a Scala project in IDEA and run it against the same cluster with the same settings, an error is thrown when …
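
This error often indicates that the application launched from IDEA is not picking up the Hive metastore configuration, so DataNucleus falls back to creating its own connection. Below is a minimal sketch of how such a standalone app can be wired up against Spark 1.5's HiveContext; it assumes spark-hive is on the classpath and that a hive-site.xml pointing at the real metastore sits in src/main/resources. The master URL and table name are placeholders, not values from the question.

    // Minimal sketch (Spark 1.5 era API) of writing into Hive from a compiled Scala app.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveWriteExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("HiveWriteExample")
          .setMaster("spark://master-host:7077") // placeholder master URL
        val sc = new SparkContext(conf)
        val hiveContext = new HiveContext(sc)    // reads hive-site.xml from the classpath
        import hiveContext.implicits._

        val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "value")
        df.write.insertInto("testtable")         // same call that works in spark-shell
        sc.stop()
      }
    }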

Iterate through Columns of a Spark Dataframe and update specified values

Submitted by 我怕爱的太早我们不能终老 on 2021-01-24 21:23:37
Question: To iterate through the columns of a Spark DataFrame created from a Hive table and update all occurrences of the desired column values, I tried the following code.

    import org.apache.spark.sql.{DataFrame}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.functions.udf

    val a: DataFrame = spark.sql(s"select * from default.table_a")
    val column_names: Array[String] = a.columns
    val required_columns: Array[String] = column_names.filter(name => name.endsWith("_date"))
    val func = udf((value: …
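
The snippet above is cut off at the UDF definition. One common way to finish this pattern, sketched below, is to fold over the matching column names and rewrite each one with the same UDF; the UDF body here (mapping a hypothetical sentinel date to null) is an assumption, since the original code is truncated.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}

    val a: DataFrame = spark.sql("select * from default.table_a")
    val requiredColumns: Array[String] = a.columns.filter(_.endsWith("_date"))

    // Hypothetical replacement rule: treat "9999-12-31" as a missing date.
    val fixDate = udf((value: String) => if (value == "9999-12-31") null else value)

    // Fold over the matching columns, overwriting each one in turn.
    val updated: DataFrame = requiredColumns.foldLeft(a) { (df, name) =>
      df.withColumn(name, fixDate(col(name)))
    }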

Validate date format in a dataframe column in pyspark

Submitted by 自闭症网瘾萝莉.ら on 2021-01-21 12:12:28
Question: I have a DataFrame with a Date column along with a few other columns. I want to validate the values in the Date column and check whether they are in the "dd/MM/yyyy" format. If the Date column holds any other format, the row should be marked as a bad record. So I am using option("dateFormat", "dd/MM/yyyy") to accept dates in that format, and it accepts dates in "dd/MM/yyyy" properly, but if I pass an invalid format (YYYY/mm/dd) the record is still not marked as invalid and the passed date is converted to garbage …
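
One common approach, sketched below, is to re-parse the column with the expected pattern via to_date and flag rows where the parse comes back null as bad records. The question is tagged pyspark, but the same to_date/when functions exist there, so this Scala sketch translates directly; note that Spark 3.x's stricter parser rejects mismatched values, while the legacy Spark 2.x parser can be lenient. The DataFrame name df and the flag column name are assumptions.

    import org.apache.spark.sql.functions.{col, lit, to_date, when}

    // Rows whose raw Date is non-null but fails to parse as dd/MM/yyyy get flagged.
    val withFlag = df
      .withColumn("parsed_date", to_date(col("Date"), "dd/MM/yyyy"))
      .withColumn("is_bad_record",
        when(col("Date").isNotNull && col("parsed_date").isNull, lit(true))
          .otherwise(lit(false)))

    val badRecords = withFlag.filter(col("is_bad_record"))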

How to dynamically slice an Array column in Spark?

Submitted by 人盡茶涼 on 2021-01-21 10:36:52
Question: Spark 2.4 introduced the new SQL function slice, which can be used to extract a certain range of elements from an array column. I want to define that range dynamically per row, based on an integer column that holds the number of elements I want to pick from the array column. However, simply passing the column to the slice function fails; the function appears to expect integers for the start and length values. Is there a way of doing this without writing a UDF? To visualize the problem with an example: I have …
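
On Spark 2.4 the usual workaround is to go through expr(): the Scala slice() helper only takes Int start and length, but the SQL form of the function accepts column references (later Spark versions added a Column-based overload). A sketch with placeholder column names letters and n, assuming a SparkSession named spark:

    import org.apache.spark.sql.functions.expr
    import spark.implicits._

    val df = Seq(
      (Seq("a", "b", "c", "d"), 2),
      (Seq("x", "y", "z"), 1)
    ).toDF("letters", "n")

    // Take the first n elements of each row's array, with n read per row.
    val sliced = df.withColumn("first_n", expr("slice(letters, 1, n)"))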

Calculate value based on value from same column of the previous row in spark

Submitted by 非 Y 不嫁゛ on 2021-01-21 10:23:29
Question: I have an issue where I have to calculate a column using a formula that uses the value computed in the previous row. I am unable to figure it out using the withColumn API. I need to calculate a new column using the formula: MovingRate = MonthlyRate + (0.7 * MovingRatePrevious) ... where MovingRatePrevious is the MovingRate of the prior row. For month 1 I have the value, so I do not need to re-calculate it, but I need that value to be able to calculate the subsequent rows. I …
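
Because each MovingRate depends on the previously computed MovingRate, a plain lag() over the raw column cannot express the recursion. One option, sketched below, is to group by key, sort each group by month, and run the recursion with an ordinary Scala fold via flatMapGroups. The field names (id, month, monthlyRate), the grouping key, and seeding month 1 with its own MonthlyRate are assumptions about the asker's data; a SparkSession named spark and an input DataFrame df are assumed, and each group is assumed to fit in memory.

    import spark.implicits._

    case class RateRow(id: String, month: Int, monthlyRate: Double)
    case class RateWithMoving(id: String, month: Int, monthlyRate: Double, movingRate: Double)

    val result = df.as[RateRow]
      .groupByKey(_.id)
      .flatMapGroups { (id, rows) =>
        // Sort the group's rows by month, then carry the computed rate forward.
        val sorted = rows.toSeq.sortBy(_.month)
        val first = sorted.head
        sorted.tail.scanLeft(
          RateWithMoving(id, first.month, first.monthlyRate, first.monthlyRate)
        ) { (prev, r) =>
          RateWithMoving(id, r.month, r.monthlyRate, r.monthlyRate + 0.7 * prev.movingRate)
        }
      }
      .toDF()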