apache-spark

Getting error saying “Queries with streaming sources must be executed with writeStream.start()” on spark structured streaming [duplicate]

拟墨画扇 submitted on 2021-02-08 12:00:31
Question: This question already has answers here: How to display a streaming DataFrame (as show fails with AnalysisException)? (2 answers). Closed 2 years ago. I am running into issues while executing Spark SQL on top of Spark Structured Streaming. PFA for the error. Here is my code:

object sparkSqlIntegration {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("StructuredStreaming")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work …
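The usual fix for this class of error (a hedged sketch below, not necessarily the linked answer's exact code) is to run the streaming query via writeStream.start() instead of calling an action such as show() on a streaming DataFrame. The rate source and console sink are stand-ins for whatever source and sink the original job used.

import org.apache.spark.sql.SparkSession

object StreamingConsoleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredStreaming")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical streaming source standing in for the original one
    val streamDf = spark.readStream
      .format("rate")            // built-in test source, emits rows per second
      .load()

    // A streaming DataFrame cannot be materialised with show();
    // it has to be run as a streaming query via writeStream.start()
    val query = streamDf.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}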

Array of struct parsing in Spark dataframe

丶灬走出姿态 submitted on 2021-02-08 11:54:14
Question: I have a DataFrame with one struct-type column. The sample DataFrame schema is:

root
 |-- Data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)

The field name holds the column name and the field value holds the column value. The number of elements in the Data column is not fixed, so it can vary. I need to parse that data and get rid of the nested structure. (Array explode will not work in this case because data in one row …
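The question is cut off before the constraint on explode is explained, so the following is only a hedged sketch of the usual flattening approach for this exact schema: explode the array, pull out the name/value fields, and pivot the names into columns. The id key and the sample values are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("FlattenNameValue").getOrCreate()
import spark.implicits._

// Assumed sample: each row carries an array of (name, value) structs plus a key
val df = Seq(
  (1, Seq(("colA", "1"), ("colB", "x"))),
  (2, Seq(("colA", "2"), ("colC", "y")))
).toDF("id", "Data")

val flattened = df
  .select($"id", explode($"Data").as("d"))                   // one row per struct element
  .select($"id", $"d._1".as("name"), $"d._2".as("value"))
  .groupBy("id")
  .pivot("name")                                             // names become column headers
  .agg(first("value"))

flattened.show()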

scala - how to substring column names after the last dot?

倖福魔咒の submitted on 2021-02-08 11:27:34
Question: After exploding a nested structure I have a DataFrame with column names like this:

sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3

When performing a select I'm getting the error:

cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]

How should I select from the DataFrame so the column names are parsed correctly? I've tried the following: the substrings after the dots are extracted successfully, but …
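A hedged sketch of one way to handle this: the dotted names have to be quoted in backticks so Spark does not treat them as struct field access, and each column can then be renamed to the part after its last dot. The df parameter stands for the exploded DataFrame from the question; beware of collisions if two names share the same suffix.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def stripDottedPrefixes(df: DataFrame): DataFrame = {
  val renamed = df.columns.map { name =>
    // Backticks stop "sales_data.metric1" being parsed as struct access
    col(s"`$name`").as(name.substring(name.lastIndexOf('.') + 1))
  }
  df.select(renamed: _*)
}

// usage: val clean = stripDottedPrefixes(explodedDf)  -- hypothetical variable name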

Pyspark TextParsingException while loading a file

限于喜欢 submitted on 2021-02-08 11:26:30
Question: I am loading a CSV file with 1 million records using PySpark, but I am getting this error:

TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000)

I checked whether any record in the file has more than 1,000,000 characters, but none does; the maximum record length in my file is 850. Please help. Code snippet:

input_df = spark.read.format('com.databricks.spark.csv').option("delimiter"," …
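Shown in Scala for consistency with the other snippets on this page (the option strings are the same in PySpark), here is a hedged sketch of the usual remedies: raise or disable the parser's per-column character limit via maxCharsPerColumn, and double-check the quote and delimiter settings, since an unbalanced quote can make the parser read many records as a single field. The path and delimiter are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("CsvParse").getOrCreate()

val inputDf = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ",")            // placeholder; the original value is truncated above
  .option("quote", "\"")               // verify quoting: a stray quote can merge records
  .option("maxCharsPerColumn", "-1")   // -1 = unlimited (the default limit is 1,000,000)
  .load("/path/to/input.csv")          // hypothetical path

inputDf.show(5)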

pyspark split dataframe by two columns without creating a folder structure for the 2nd

[亡魂溺海] submitted on 2021-02-08 11:05:56
Question: Two-part question. I have a PySpark DataFrame that I'm reading from a list of JSON files in my Azure Blob Storage. After some simple ETL I need to move this from blob storage to a data lake as a Parquet file, simple so far. I'm unsuccessfully trying to efficiently write this into a folder structure defined by two columns, one a date column and the other an ID. Using partitionBy gets me close:

id | date                | nested_json_data | path
1  | 2019-01-01 12:01:01 | {data : [data]}  | dbfs:\mnt\..
1  | 2019-01 …
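One hedged reading of the (truncated) requirement, sketched below: keep a directory per id only, while still getting roughly one Parquet file per (id, day) pair by repartitioning on both columns before the write. The column names, the date truncation, and the target path are assumptions.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def writeByIdAndDay(df: DataFrame, target: String): Unit = {
  val withDay = df.withColumn("event_day", to_date(col("date")))

  withDay
    .repartition(col("id"), col("event_day"))  // roughly one output file per id/day pair
    .write
    .partitionBy("id")                         // folder structure only for the first column
    .mode("overwrite")
    .parquet(target)
}

// usage: writeByIdAndDay(etlDf, "dbfs:/mnt/datalake/output")  -- hypothetical names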

Creating a UDF with a non-primitive data type and using it in a Spark SQL query: Scala

梦想与她 submitted on 2021-02-08 11:00:42
Question: I am creating a function in Scala that I want to use in my Spark SQL query. The query works fine in Hive, and also if I run the same query in Spark SQL, but I use it in multiple places, so I want to turn it into a reusable function/method I can call whenever it's required. I have created the function below in my Scala class:

def date_part(date_column: Column) = {
  val m1: Column = month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))) // gives value as 01 …
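Based only on the fragment above, a hedged sketch of the two usual flavours, assuming spark is the active SparkSession: the Column-based helper is reusable from the DataFrame API as-is, but to call it from a SQL string it has to be registered as a UDF over plain values rather than Columns. The table and column names in the commented query are hypothetical.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// DataFrame-API flavour: callable anywhere a Column expression is accepted
def date_part(date_column: Column): Column =
  month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy"))))

// SQL flavour: a plain String => Int function registered for spark.sql queries
spark.udf.register("date_part_udf", (s: String) =>
  LocalDate.parse(s, DateTimeFormatter.ofPattern("dd-MM-yyyy")).getMonthValue)

// spark.sql("SELECT date_part_udf(order_date) AS order_month FROM orders")  -- hypothetical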

Kryo vs Encoder vs Java Serialization in Spark?

廉价感情. submitted on 2021-02-08 10:40:35
Question: Which serialization is used in which case? The Spark documentation says it provides two serialization libraries: 1. Java (default) and 2. Kryo. So where do Encoders come from, and why aren't they mentioned in the doc? Databricks also says Encoders perform faster for Datasets; what about RDDs, and how do all of these map together? In which case should which serializer be used? Answer 1: Encoders are used for Datasets only. Kryo is used internally in Spark. Kryo and Java serialization is …
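A hedged sketch illustrating where each mechanism sits: Encoders belong to the Dataset API, while the spark.serializer setting (Java by default, Kryo if configured) governs how shuffled or serialized-cached RDD objects are encoded. The case class and names are illustrative only.

import org.apache.spark.sql.{Encoders, SparkSession}

case class Record(id: Long, name: String)

val spark = SparkSession.builder
  .master("local[*]")
  .appName("SerializationDemo")
  // affects serialization of shuffled/cached RDD objects, not Dataset Encoders
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

val ds = Seq(Record(1L, "a"), Record(2L, "b")).toDS()           // uses the compile-time product Encoder
val kryoBacked = Encoders.kryo[Record]                          // explicit Kryo-backed Encoder, if ever needed
val rdd = spark.sparkContext.parallelize(Seq(Record(3L, "c")))  // uses spark.serializer when shuffled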

Create Empty dataframe Java Spark

帅比萌擦擦* submitted on 2021-02-08 10:22:21
Question: There are many examples of how to create an empty DataFrame/Dataset using Spark with Scala/Python, but I would like to know how to create an empty DataFrame/Dataset in Java Spark. I have to create an empty DataFrame with just one column, with the header Column_1 and type String. Answer 1: Alternative 1: create an empty DataFrame with a user-defined schema:

// alternative - 1
StructType s = new StructType()
    .add(new StructField("Column_1", DataTypes.StringType, true, Metadata.empty()));
Dataset<Row> csv = spark …
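The answer's Java snippet is cut off above. As a hedged sketch of the same idea, shown in Scala for consistency with the other snippets on this page (the Java API offers the analogous createDataFrame(List<Row>, StructType)):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{Metadata, StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").appName("EmptyDf").getOrCreate()

val schema = StructType(Seq(
  StructField("Column_1", StringType, nullable = true, Metadata.empty)))

// Empty RDD[Row] + user-defined schema => empty single-column DataFrame
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

emptyDf.printSchema()  // root |-- Column_1: string (nullable = true)
emptyDf.show()         // prints just the Column_1 header with no rows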

Scala: Get every combination of the last 24 months

笑着哭i submitted on 2021-02-08 10:20:17
Question: I'm trying to generate a DataFrame in Spark (though perhaps plain Scala is enough) in which I have every combination of the last 24 months where the second year-month is always greater than the first year-month. For example, it is 1 March 2019 as of writing this, and I'm after something like:

List(
  (2017, 3, 2017, 4),
  (2017, 3, 2017, 5),
  (2017, 3, 2017, 6),
  // ..
  (2017, 3, 2019, 3),
  (2017, 4, 2017, 5),
  // ..
  (2019, 1, 2019, 3),
  (2019, 2, 2019, 3),
)

Answer 1: This is easiest done with pure Scala without …
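The answer is truncated above; here is a hedged sketch of how the pure-Scala approach could look. The fixed reference date and the 25-value window mirror the question's sample output, and converting the result to a DataFrame is left as a comment.

import java.time.YearMonth

val reference = YearMonth.of(2019, 3)                      // "1 March 2019" from the question
// 0 to 24 spans both endpoints shown in the sample output (2017-03 .. 2019-03)
val months = (0 to 24).map(reference.minusMonths(_)).reverse

val combos = for {
  first  <- months
  second <- months
  if second.isAfter(first)
} yield (first.getYear, first.getMonthValue, second.getYear, second.getMonthValue)

combos.take(3).foreach(println)  // (2017,3,2017,4), (2017,3,2017,5), (2017,3,2017,6)
// In Spark: import spark.implicits._; combos.toDF("y1", "m1", "y2", "m2")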