apache-spark-sql

How to use CROSS JOIN and CROSS APPLY in Spark SQL

Submitted by …衆ロ難τιáo~ at 2021-01-27 13:51:38
Question: I am very new to Spark and Scala and am writing Spark SQL code. I need to apply CROSS JOIN and CROSS APPLY in my logic. Here is the SQL query that I have to convert to Spark SQL:

select Table1.Column1, Table2.Column2, Table3.Column3
from Table1
CROSS JOIN Table2
CROSS APPLY Table3

I need to run the above query through SQLContext in Spark SQL. Kindly help me. Thanks in advance.

Answer 1: First set the property spark.sql.crossJoin.enabled=true in the Spark conf, then dataFrame1
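The answer excerpt is cut off above. Purely as an illustrative sketch (the DataFrame names df1, df2, df3 and the use of spark.table are assumptions, not taken from the original answer), the equivalent Scala DataFrame code could look roughly like this:

import org.apache.spark.sql.SparkSession

// Sketch only: enable cross joins and load the three tables (names assumed).
val spark = SparkSession.builder()
  .appName("CrossJoinExample")
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()

val df1 = spark.table("Table1")
val df2 = spark.table("Table2")
val df3 = spark.table("Table3")

// CROSS JOIN maps directly to crossJoin; CROSS APPLY of a plain table
// (no table-valued function, no correlation) degenerates to a cross join.
val result = df1.crossJoin(df2).crossJoin(df3)
  .select(df1("Column1"), df2("Column2"), df3("Column3"))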

How to find the index of the maximum value in a vector column?

Submitted by 丶灬走出姿态 at 2021-01-27 07:45:53
Question: I have a Spark DataFrame with the following structure:

root
 |-- distribution: vector (nullable = true)

+------------------+
| topicDistribution|
+------------------+
|        [0.1, 0.2]|
|        [0.3, 0.2]|
|        [0.5, 0.2]|
|        [0.1, 0.7]|
|        [0.1, 0.8]|
|        [0.1, 0.9]|
+------------------+

My question is: how do I add a column with the index of the maximum value for each row? It should be something like this:

root
 |-- distribution: vector (nullable = true)
 |-- max_index: integer (nullable = true)

+----
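One way to do this (a sketch only, not taken from the thread: the DataFrame name df is assumed, and the column is referred to by the name shown in the printed table, topicDistribution) is a small UDF over the ML vector type:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// argmax returns the position of the largest entry in an ml Vector.
val maxIndex = udf { v: Vector => v.argmax }

// df and the column name are assumptions for illustration.
val withMaxIndex = df.withColumn("max_index", maxIndex(col("topicDistribution")))
withMaxIndex.show(false)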

Spark efficiently filtering entries from big dataframe that exist in a small dataframe

Submitted by 自作多情 at 2021-01-27 07:44:02
Question: I have a Spark program that reads a relatively big dataframe (~3.2 terabytes) containing two columns, id and name, and another relatively small dataframe (~20k entries) containing a single column, id. What I'm trying to do is take both the id and the name from the big dataframe whenever the id appears in the small dataframe. I was wondering what an efficient solution would be, and why. Several options I had in mind:

Broadcast join the two dataframes
Broadcast the small dataframe and
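The list of options above is cut off in the excerpt. As a non-authoritative sketch of the broadcast-join idea (bigDF and smallDF are made-up names), a left-semi join keeps only the columns of the big dataframe while letting Spark ship the small side to every executor:

import org.apache.spark.sql.functions.broadcast

// The broadcast hint avoids shuffling the ~3.2 TB side; a left_semi join
// returns the big dataframe's rows (id, name) whose id exists in smallDF.
val filtered = bigDF.join(broadcast(smallDF), Seq("id"), "left_semi")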

Spark Streaming Exception: java.util.NoSuchElementException: None.get

Submitted by 我怕爱的太早我们不能终老 at 2021-01-27 06:33:10
Question: I am writing Spark Streaming data to HDFS by converting it to a dataframe:

Code

object KafkaSparkHdfs {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
  sparkConf.set("spark.driver.allowMultipleContexts", "true");
  val sc = new SparkContext(sparkConf)

  def main(args: Array[String]): Unit = {
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val ssc = new StreamingContext(sparkConf, Seconds(20))
    val kafkaParams = Map[String,
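The code excerpt stops mid-declaration. A common source of None.get errors in this kind of setup is juggling several contexts built from the same conf; purely as an illustration (not the answer from the thread, which is not shown here), a sketch that shares one SparkContext between SQL and streaming might look like this:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaSparkHdfsSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")

    // One SparkSession for DataFrame work; reuse its SparkContext for
    // streaming instead of creating a second context from the same conf.
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(20))

    // ... build the Kafka direct stream here and, inside foreachRDD,
    // convert each RDD to a DataFrame and write it to HDFS, e.g.
    // df.write.mode("append").parquet("hdfs://...")

    ssc.start()
    ssc.awaitTermination()
  }
}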

How does Spark SQL read compressed csv files?

Submitted by 早过忘川 at 2021-01-27 05:43:11
Question: I have tried the API spark.read.csv to read compressed csv files with the extension bz or gzip, and it worked. But in the source code I don't find any option parameter where we can declare the codec type. Even in this link, there is only a setting for the codec on the writing side. Could anyone tell me, or point me to the source code showing, how Spark 2.x deals with compressed csv files?

Answer 1: All text-related data sources, including CSVDataSource, use the Hadoop File API to deal with files (it was in
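The answer is cut off, but its point is that the codec is inferred from the file extension by Hadoop's codec machinery, so nothing has to be passed to the reader. A minimal sketch (the paths are placeholders, and a SparkSession named spark is assumed):

// No codec option is needed on the read side; the extension is enough.
val gzDF  = spark.read.option("header", "true").csv("/data/input/file.csv.gz")
val bz2DF = spark.read.option("header", "true").csv("/data/input/file.csv.bz2")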
