apache-spark-sql

Processing Data on Spark Structured Streaming before outputting to the console

萝らか妹 submitted on 2020-06-26 09:57:07
Question: I'll try to keep it simple. I periodically read some data from a Kafka producer and output it using Spark Structured Streaming. The data that comes out looks like this:

    +------------------------------------------+-------------------+--------------+-----------------+
    |window                                    |timestamp          |Online_Emp    |Available_Emp    |
    +------------------------------------------+-------------------+--------------+-----------------+
    |[2017-12-31 16:01:00, 2017-12-31 16:02:00]|2017-12-31 16:01:27|1             |0                |
    |[2017
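For context, a minimal sketch (not taken from the post; the topic, broker address, and column names are assumptions) of the kind of job that produces such windowed counts: read from Kafka, aggregate over a one-minute event-time window, and write the result to the console.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, window}

    val spark = SparkSession.builder().appName("streaming-sketch").getOrCreate()

    // Assumed Kafka source: topic "employees" on a local broker.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "employees")
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")

    // Count events per one-minute window of the Kafka message timestamp.
    val counts = events
      .groupBy(window(col("timestamp"), "1 minute"))
      .agg(count("*").as("Online_Emp"))

    // Write the running aggregation to the console.
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", "false")
      .start()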

How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?

孤街浪徒 submitted on 2020-06-25 18:11:28
Question: Is there any configuration property we can set to explicitly disable/enable Hive support through spark-shell in Spark 1.6? I tried to list all the sqlContext configuration properties with

    sqlContext.getAllConfs.foreach(println)

but I am not sure which property is actually required to disable/enable Hive support. Or is there any other way to do this?

Answer 1: Spark >= 2.0: Enabling and disabling the Hive context is possible with the config spark.sql.catalogImplementation. Possible values for spark
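A small sketch of the Spark >= 2.0 approach the answer starts to describe (a sketch only, not part of the original answer): the catalog implementation is chosen when the SparkSession is built and can be inspected afterwards; the two possible values are "in-memory" and "hive".

    import org.apache.spark.sql.SparkSession

    // Explicitly select the catalog implementation when building the session.
    val spark = SparkSession.builder()
      .appName("catalog-sketch")
      .config("spark.sql.catalogImplementation", "in-memory") // or "hive"
      .getOrCreate()

    // Check which catalog implementation is in use.
    println(spark.conf.get("spark.sql.catalogImplementation"))

The same property can be passed to spark-shell on the command line, e.g. --conf spark.sql.catalogImplementation=in-memory.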

NOT IN implementation of Presto vs. Spark SQL

我们两清 submitted on 2020-06-25 10:51:33
Question: I have a very simple query that shows a significant performance difference when run on Spark SQL versus Presto (3 hours vs. 3 minutes) on the same hardware:

    SELECT field FROM test1 WHERE field NOT IN (SELECT field FROM test2)

After some research into the query plan, I found that the reason is how Spark SQL deals with the NOT IN predicate subquery. To correctly handle the NULL semantics of NOT IN, Spark SQL translates the NOT IN predicate as Left AntiJoin((test1=test2) OR isNULL(test1=test2)). Spark SQL introduces
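For illustration, a sketch of the common workaround (not part of the post): if the join column is known to never be NULL, the exclusion can be expressed as a plain left anti join, which avoids the null-aware OR isNULL condition in the plan.

    // Assumes tables test1 and test2 are registered and `field` contains no NULLs.
    val test1 = spark.table("test1")
    val test2 = spark.table("test2")

    // Plain LEFT ANTI JOIN: keeps the rows of test1 whose `field` has no match in test2.
    val result = test1.join(test2, Seq("field"), "left_anti")
    result.show()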

How to parse a JSON column in a DataFrame in Scala [duplicate]

|▌冷眼眸甩不掉的悲伤 submitted on 2020-06-25 07:21:25
Question: This question already has answers here: How to query JSON data column using Spark DataFrames? (5 answers). Closed last year. I have a data frame with a JSON string column. Example below: there are 3 columns, a, b, and c, and column c is StringType.

    | a  | b   | c                        |
    --------------------------------------
    | 77 | ABC | {"12549":38,"333513":39} |
    | 78 | ABC | {"12540":38,"333513":39} |

I want to turn them into columns of the data frame (pivot), as in the example below:

    | a | b | 12549 |
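A minimal sketch of the usual approach (the keys come from the sample rows; the rest is an assumption, not the accepted answer): parse column c with from_json as a map and promote the keys of interest to top-level columns.

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

    // df has columns a, b and a JSON string column c, e.g. {"12549":38,"333513":39}.
    val parsed = df.withColumn("c_map", from_json(col("c"), MapType(StringType, IntegerType)))

    // Promote known keys to columns.
    val result = parsed
      .withColumn("12549", col("c_map").getItem("12549"))
      .withColumn("333513", col("c_map").getItem("333513"))
      .drop("c", "c_map")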

Converting epoch to datetime in PySpark data frame using udf

若如初见. submitted on 2020-06-25 04:03:11
Question: I have a PySpark dataframe with this schema:

    root
     |-- epoch: double (nullable = true)
     |-- var1: double (nullable = true)
     |-- var2: double (nullable = true)

where epoch is in seconds and should be converted to a datetime. In order to do so, I define a user defined function (udf) as follows:

    from pyspark.sql.functions import udf
    import time

    def epoch_to_datetime(x):
        return time.localtime(x)
        # return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))
        # return x * 0 + 1

    epoch_to_datetime_udf =
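As a side note (a sketch, not part of the question): this conversion can usually be done without a UDF by using the built-in from_unixtime function; it is shown here in Scala, and the same function exists in pyspark.sql.functions.

    import org.apache.spark.sql.functions.{col, from_unixtime}

    // df has a numeric `epoch` column holding seconds since the Unix epoch.
    val withDatetime = df.withColumn(
      "datetime",
      from_unixtime(col("epoch").cast("long"), "yyyy-MM-dd HH:mm:ss")
    )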

Does Kryo help in SparkSQL?

家住魔仙堡 submitted on 2020-06-25 02:30:29
Question: Kryo helps improve the performance of Spark applications through its efficient serialization approach. I'm wondering whether Kryo will help in the case of Spark SQL, and how I should use it. In Spark SQL applications we do a lot of column-based operations like df.select($"c1", $"c2"), and the schema of a DataFrame Row is not quite static. I'm not sure how to register one or several serializer classes for this use case. For example:

    case class Info(name: String, address: String)
    ...
    val df = spark
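For reference, a minimal sketch of how Kryo classes are normally registered (an illustration, not the answer to this question): Kryo mainly affects serialization of JVM objects in RDD operations, closures, shuffles and broadcasts, while DataFrame/Dataset columns are stored in Spark's internal binary format, so registering a case class such as Info matters mostly where it appears as a JVM object (for example in a Dataset[Info]).

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Enable Kryo and register application classes up front.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Info]))

    val spark = SparkSession.builder()
      .config(conf)
      .appName("kryo-sketch")
      .getOrCreate()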

Trying to use map on a Spark DataFrame

こ雲淡風輕ζ submitted on 2020-06-24 22:24:07
Question: I recently started experimenting with both Spark and Java. I initially went through the famous WordCount example using RDDs and everything went as expected. Now I am trying to implement my own example, but using DataFrames and not RDDs. So I am reading a dataset from a file with

    DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("delimiter", ";")
        .option("header", "true")
        .load(inputFilePath);

and then I try to select a specific column
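As an illustration (a sketch in Scala rather than the question's Java, using the Spark 2.x API; the column name is an assumption): map over a DataFrame needs an encoder for the result type, which the implicits in spark.implicits._ provide for common types such as String.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("map-sketch").getOrCreate()
    import spark.implicits._

    // Read the CSV the same way the question does, then map over one column.
    val df = spark.read
      .option("inferSchema", "true")
      .option("delimiter", ";")
      .option("header", "true")
      .csv("input.csv")

    // Assumed column name "name"; the result is a Dataset[String].
    val upper = df.map(row => row.getAs[String]("name").toUpperCase)
    upper.show()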

PySpark filter using startswith from a list

有些话、适合烂在心里 submitted on 2020-06-24 07:44:33
Question: I have a list of elements that may start some of the strings recorded in an RDD. If I have an element list of yes and no, they should match yes23 and no3 but not 35yes or 41no. Using PySpark, how can I use startswith with any element in a list or tuple? An example DF would be:

    +-----+------+
    |index| label|
    +-----+------+
    |    1|yes342|
    |    2| 45yes|
    |    3| no123|
    |    4|  75no|
    +-----+------+

When I try:

    Element_List = ['yes','no']
    filter_DF = DF.where(DF.label.startswith(tuple(Element_List)))

The
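A sketch of the usual workaround (shown here in Scala; the Column startswith/startsWith method expects a single string, not a tuple, in both APIs): build one startsWith predicate per prefix and OR them together.

    import org.apache.spark.sql.functions.col

    // Keep rows whose `label` starts with any of the given prefixes.
    val prefixes = Seq("yes", "no")
    val condition = prefixes
      .map(p => col("label").startsWith(p))
      .reduce(_ || _)

    val filtered = df.where(condition)
    filtered.show()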