apache-spark-sql

How to use CROSS JOIN and CROSS APPLY in Spark SQL

Submitted by …衆ロ難τιáo~ at 2021-01-27 13:51:38
Question: I am very new to Spark and Scala and am writing Spark SQL code. I need to apply CROSS JOIN and CROSS APPLY in my logic. Here is the SQL query that I have to convert to Spark SQL:

select Table1.Column1, Table2.Column2, Table3.Column3
from Table1
CROSS JOIN Table2
CROSS APPLY Table3

I need to run the above query through SQLContext in Spark SQL. Kindly help me. Thanks in advance.

Answer 1: First set the property spark.sql.crossJoin.enabled=true in the Spark conf, then dataFrame1
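The answer excerpt is cut off above. Purely as an illustrative sketch (the DataFrame names df1, df2, df3 and the use of spark.table are assumptions, not taken from the original answer), the equivalent Scala DataFrame code could look roughly like this:

import org.apache.spark.sql.SparkSession

// Sketch only: enable cross joins and load the three tables (names assumed).
val spark = SparkSession.builder()
  .appName("CrossJoinExample")
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()

val df1 = spark.table("Table1")
val df2 = spark.table("Table2")
val df3 = spark.table("Table3")

// CROSS JOIN maps directly to crossJoin; CROSS APPLY of a plain table
// (no table-valued function, no correlation) degenerates to a cross join.
val result = df1.crossJoin(df2).crossJoin(df3)
  .select(df1("Column1"), df2("Column2"), df3("Column3"))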

How to find the index of the maximum value in a vector column?

Submitted by 丶灬走出姿态 at 2021-01-27 07:45:53
Question: I have a Spark DataFrame with the following structure:

root
 |-- distribution: vector (nullable = true)

+------------------+
| topicDistribution|
+------------------+
|        [0.1, 0.2]|
|        [0.3, 0.2]|
|        [0.5, 0.2]|
|        [0.1, 0.7]|
|        [0.1, 0.8]|
|        [0.1, 0.9]|
+------------------+

My question is: how do I add a column with the index of the maximum value for each row? It should be something like this:

root
 |-- distribution: vector (nullable = true)
 |-- max_index: integer (nullable = true)

+----
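One way to do this (a sketch only, not taken from the thread: the DataFrame name df is assumed, and the column is referred to by the name shown in the printed table, topicDistribution) is a small UDF over the ML vector type:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// argmax returns the position of the largest entry in an ml Vector.
val maxIndex = udf { v: Vector => v.argmax }

// df and the column name are assumptions for illustration.
val withMaxIndex = df.withColumn("max_index", maxIndex(col("topicDistribution")))
withMaxIndex.show(false)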

Spark efficiently filtering entries from big dataframe that exist in a small dataframe

Submitted by 自作多情 at 2021-01-27 07:44:02
Question: I have a Spark program that reads a relatively big dataframe (~3.2 terabytes) containing two columns, id and name, and another relatively small dataframe (~20k entries) containing a single column, id. What I'm trying to do is take both the id and the name from the big dataframe whenever the id appears in the small dataframe. I was wondering what an efficient solution would be, and why. Several options I had in mind:

Broadcast join the two dataframes
Broadcast the small dataframe and
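The list of options above is cut off in the excerpt. As a non-authoritative sketch of the broadcast-join idea (bigDF and smallDF are made-up names), a left-semi join keeps only the columns of the big dataframe while letting Spark ship the small side to every executor:

import org.apache.spark.sql.functions.broadcast

// The broadcast hint avoids shuffling the ~3.2 TB side; a left_semi join
// returns the big dataframe's rows (id, name) whose id exists in smallDF.
val filtered = bigDF.join(broadcast(smallDF), Seq("id"), "left_semi")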

Spark Streaming Exception: java.util.NoSuchElementException: None.get

Submitted by 我怕爱的太早我们不能终老 at 2021-01-27 06:33:10
Question: I am writing Spark Streaming data to HDFS by converting it to a dataframe:

Code

object KafkaSparkHdfs {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
  sparkConf.set("spark.driver.allowMultipleContexts", "true");
  val sc = new SparkContext(sparkConf)

  def main(args: Array[String]): Unit = {
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val ssc = new StreamingContext(sparkConf, Seconds(20))
    val kafkaParams = Map[String,
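The code excerpt stops mid-declaration. A common source of None.get errors in this kind of setup is juggling several contexts built from the same conf; purely as an illustration (not the answer from the thread, which is not shown here), a sketch that shares one SparkContext between SQL and streaming might look like this:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaSparkHdfsSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")

    // One SparkSession for DataFrame work; reuse its SparkContext for
    // streaming instead of creating a second context from the same conf.
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(20))

    // ... build the Kafka direct stream here and, inside foreachRDD,
    // convert each RDD to a DataFrame and write it to HDFS, e.g.
    // df.write.mode("append").parquet("hdfs://...")

    ssc.start()
    ssc.awaitTermination()
  }
}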

How does Spark SQL read compressed csv files?

Submitted by 早过忘川 at 2021-01-27 05:43:11
Question: I have tried the API spark.read.csv to read compressed csv files with the extension bz or gzip, and it worked. But in the source code I don't find any option parameter where we can declare the codec type. Even in this link, there is only a setting for the codec on the writing side. Could anyone tell me, or point me to the source code showing, how Spark 2.x deals with compressed csv files?

Answer 1: All text-related data sources, including CSVDataSource, use the Hadoop File API to deal with files (it was in
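The answer is cut off, but its point is that the codec is inferred from the file extension by Hadoop's codec machinery, so nothing has to be passed to the reader. A minimal sketch (the paths are placeholders, and a SparkSession named spark is assumed):

// No codec option is needed on the read side; the extension is enough.
val gzDF  = spark.read.option("header", "true").csv("/data/input/file.csv.gz")
val bz2DF = spark.read.option("header", "true").csv("/data/input/file.csv.bz2")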
