apache-spark-sql

Processing Data on Spark Structured Streaming before outputting to the console

萝らか妹 submitted on 2020-06-26 09:57:07
Question: I'll try to keep it simple. I periodically read some data from a Kafka producer and output it using Spark Structured Streaming. The data that comes out looks like this:

    +------------------------------------------+-------------------+--------------+-----------------+
    |window                                    |timestamp          |Online_Emp    |Available_Emp    |
    +------------------------------------------+-------------------+--------------+-----------------+
    |[2017-12-31 16:01:00, 2017-12-31 16:02:00]|2017-12-31 16:01:27|1             |0                |
    |[2017
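For context, a minimal sketch (not taken from the post; the topic, broker address, and column names are assumptions) of the kind of job that produces such windowed counts: read from Kafka, aggregate over a one-minute event-time window, and write the result to the console.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, count, window}

    val spark = SparkSession.builder().appName("streaming-sketch").getOrCreate()

    // Assumed Kafka source: topic "employees" on a local broker.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "employees")
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")

    // Count events per one-minute window of the Kafka message timestamp.
    val counts = events
      .groupBy(window(col("timestamp"), "1 minute"))
      .agg(count("*").as("Online_Emp"))

    // Write the running aggregation to the console.
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .option("truncate", "false")
      .start()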

How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?

孤街浪徒 submitted on 2020-06-25 18:11:28
Question: Is there any configuration property we can set to explicitly disable/enable Hive support through spark-shell in Spark 1.6? I tried to list all the sqlContext configuration properties with

    sqlContext.getAllConfs.foreach(println)

but I am not sure which property is actually required to disable/enable Hive support. Or is there any other way to do this?

Answer 1: Spark >= 2.0: Enabling and disabling the Hive context is possible with the config spark.sql.catalogImplementation. Possible values for spark
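A small sketch of the Spark >= 2.0 approach the answer starts to describe (a sketch only, not part of the original answer): the catalog implementation is chosen when the SparkSession is built and can be inspected afterwards; the two possible values are "in-memory" and "hive".

    import org.apache.spark.sql.SparkSession

    // Explicitly select the catalog implementation when building the session.
    val spark = SparkSession.builder()
      .appName("catalog-sketch")
      .config("spark.sql.catalogImplementation", "in-memory") // or "hive"
      .getOrCreate()

    // Check which catalog implementation is in use.
    println(spark.conf.get("spark.sql.catalogImplementation"))

The same property can be passed to spark-shell on the command line, e.g. --conf spark.sql.catalogImplementation=in-memory.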

NOT IN implementation of Presto vs. Spark SQL

我们两清 submitted on 2020-06-25 10:51:33
Question: I have a very simple query that shows a significant performance difference when run on Spark SQL versus Presto (3 hours vs. 3 minutes) on the same hardware:

    SELECT field FROM test1 WHERE field NOT IN (SELECT field FROM test2)

After some research into the query plan, I found that the reason is how Spark SQL deals with the NOT IN predicate subquery. To correctly handle the NULL semantics of NOT IN, Spark SQL translates the NOT IN predicate as Left AntiJoin((test1=test2) OR isNULL(test1=test2)). Spark SQL introduces
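For illustration, a sketch of the common workaround (not part of the post): if the join column is known to never be NULL, the exclusion can be expressed as a plain left anti join, which avoids the null-aware OR isNULL condition in the plan.

    // Assumes tables test1 and test2 are registered and `field` contains no NULLs.
    val test1 = spark.table("test1")
    val test2 = spark.table("test2")

    // Plain LEFT ANTI JOIN: keeps the rows of test1 whose `field` has no match in test2.
    val result = test1.join(test2, Seq("field"), "left_anti")
    result.show()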

How to parse a JSON column in a DataFrame in Scala [duplicate]

|▌冷眼眸甩不掉的悲伤 submitted on 2020-06-25 07:21:25
Question: This question already has answers here: How to query JSON data column using Spark DataFrames? (5 answers). Closed last year. I have a data frame with a JSON string column. Example below: there are 3 columns, a, b, and c, and column c is StringType.

    | a  | b   | c                        |
    --------------------------------------
    | 77 | ABC | {"12549":38,"333513":39} |
    | 78 | ABC | {"12540":38,"333513":39} |

I want to turn them into columns of the data frame (pivot), as in the example below:

    | a | b | 12549 |
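A minimal sketch of the usual approach (the keys come from the sample rows; the rest is an assumption, not the accepted answer): parse column c with from_json as a map and promote the keys of interest to top-level columns.

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

    // df has columns a, b and a JSON string column c, e.g. {"12549":38,"333513":39}.
    val parsed = df.withColumn("c_map", from_json(col("c"), MapType(StringType, IntegerType)))

    // Promote known keys to columns.
    val result = parsed
      .withColumn("12549", col("c_map").getItem("12549"))
      .withColumn("333513", col("c_map").getItem("333513"))
      .drop("c", "c_map")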

Converting epoch to datetime in PySpark data frame using udf

若如初见. submitted on 2020-06-25 04:03:11
Question: I have a PySpark dataframe with this schema:

    root
     |-- epoch: double (nullable = true)
     |-- var1: double (nullable = true)
     |-- var2: double (nullable = true)

where epoch is in seconds and should be converted to a datetime. In order to do so, I define a user defined function (udf) as follows:

    from pyspark.sql.functions import udf
    import time

    def epoch_to_datetime(x):
        return time.localtime(x)
        # return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))
        # return x * 0 + 1

    epoch_to_datetime_udf =
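As a side note (a sketch, not part of the question): this conversion can usually be done without a UDF by using the built-in from_unixtime function; it is shown here in Scala, and the same function exists in pyspark.sql.functions.

    import org.apache.spark.sql.functions.{col, from_unixtime}

    // df has a numeric `epoch` column holding seconds since the Unix epoch.
    val withDatetime = df.withColumn(
      "datetime",
      from_unixtime(col("epoch").cast("long"), "yyyy-MM-dd HH:mm:ss")
    )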

Does Kryo help in SparkSQL?

家住魔仙堡 submitted on 2020-06-25 02:30:29
Question: Kryo helps improve the performance of Spark applications through its efficient serialization approach. I'm wondering whether Kryo will help in the case of Spark SQL, and how I should use it. In Spark SQL applications we do a lot of column-based operations like df.select($"c1", $"c2"), and the schema of a DataFrame Row is not quite static. I'm not sure how to register one or several serializer classes for this use case. For example:

    case class Info(name: String, address: String)
    ...
    val df = spark
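For reference, a minimal sketch of how Kryo classes are normally registered (an illustration, not the answer to this question): Kryo mainly affects serialization of JVM objects in RDD operations, closures, shuffles and broadcasts, while DataFrame/Dataset columns are stored in Spark's internal binary format, so registering a case class such as Info matters mostly where it appears as a JVM object (for example in a Dataset[Info]).

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Enable Kryo and register application classes up front.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Info]))

    val spark = SparkSession.builder()
      .config(conf)
      .appName("kryo-sketch")
      .getOrCreate()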

Trying to use map on a Spark DataFrame

こ雲淡風輕ζ submitted on 2020-06-24 22:24:07
Question: I recently started experimenting with both Spark and Java. I initially went through the famous WordCount example using RDDs and everything went as expected. Now I am trying to implement my own example, but using DataFrames and not RDDs. So I am reading a dataset from a file with

    DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("inferSchema", "true")
        .option("delimiter", ";")
        .option("header", "true")
        .load(inputFilePath);

and then I try to select a specific column
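As an illustration (a sketch in Scala rather than the question's Java, using the Spark 2.x API; the column name is an assumption): map over a DataFrame needs an encoder for the result type, which the implicits in spark.implicits._ provide for common types such as String.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("map-sketch").getOrCreate()
    import spark.implicits._

    // Read the CSV the same way the question does, then map over one column.
    val df = spark.read
      .option("inferSchema", "true")
      .option("delimiter", ";")
      .option("header", "true")
      .csv("input.csv")

    // Assumed column name "name"; the result is a Dataset[String].
    val upper = df.map(row => row.getAs[String]("name").toUpperCase)
    upper.show()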

PySpark filter using startswith from a list

有些话、适合烂在心里 submitted on 2020-06-24 07:44:33
Question: I have a list of elements that may start some of the strings recorded in an RDD. If I have an element list of yes and no, they should match yes23 and no3 but not 35yes or 41no. Using PySpark, how can I use startswith with any element in a list or tuple? An example DF would be:

    +-----+------+
    |index| label|
    +-----+------+
    |    1|yes342|
    |    2| 45yes|
    |    3| no123|
    |    4|  75no|
    +-----+------+

When I try:

    Element_List = ['yes','no']
    filter_DF = DF.where(DF.label.startswith(tuple(Element_List)))

The
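A sketch of the usual workaround (shown here in Scala; the Column startswith/startsWith method expects a single string, not a tuple, in both APIs): build one startsWith predicate per prefix and OR them together.

    import org.apache.spark.sql.functions.col

    // Keep rows whose `label` starts with any of the given prefixes.
    val prefixes = Seq("yes", "no")
    val condition = prefixes
      .map(p => col("label").startsWith(p))
      .reduce(_ || _)

    val filtered = df.where(condition)
    filtered.show()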