apache-spark

Getting error saying “Queries with streaming sources must be executed with writeStream.start()” on spark structured streaming [duplicate]

拟墨画扇 submitted on 2021-02-08 12:00:31
Question: This question already has answers here: How to display a streaming DataFrame (as show fails with AnalysisException)? (2 answers). Closed 2 years ago. I am running into issues while executing Spark SQL on top of Spark Structured Streaming. PFA for the error. Here is my code:

object sparkSqlIntegration {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("StructuredStreaming")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work …
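The usual fix for this class of error (a hedged sketch below, not necessarily the linked answer's exact code) is to run the streaming query via writeStream.start() instead of calling an action such as show() on a streaming DataFrame. The rate source and console sink are stand-ins for whatever source and sink the original job used.

import org.apache.spark.sql.SparkSession

object StreamingConsoleSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredStreaming")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical streaming source standing in for the original one
    val streamDf = spark.readStream
      .format("rate")            // built-in test source, emits rows per second
      .load()

    // A streaming DataFrame cannot be materialised with show();
    // it has to be run as a streaming query via writeStream.start()
    val query = streamDf.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}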

Array of struct parsing in Spark dataframe

丶灬走出姿态 submitted on 2021-02-08 11:54:14
Question: I have a DataFrame with one struct-type column. The sample DataFrame schema is:

root
 |-- Data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)

The field name holds the column name and the field value holds the column value. The number of elements in the Data column is not fixed, so it can vary. I need to parse that data and get rid of the nested structure. (Array explode will not work in this case because data in one row …
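The question is cut off before the constraint on explode is explained, so the following is only a hedged sketch of the usual flattening approach for this exact schema: explode the array, pull out the name/value fields, and pivot the names into columns. The id key and the sample values are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("FlattenNameValue").getOrCreate()
import spark.implicits._

// Assumed sample: each row carries an array of (name, value) structs plus a key
val df = Seq(
  (1, Seq(("colA", "1"), ("colB", "x"))),
  (2, Seq(("colA", "2"), ("colC", "y")))
).toDF("id", "Data")

val flattened = df
  .select($"id", explode($"Data").as("d"))                   // one row per struct element
  .select($"id", $"d._1".as("name"), $"d._2".as("value"))
  .groupBy("id")
  .pivot("name")                                             // names become column headers
  .agg(first("value"))

flattened.show()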

scala - how to substring column names after the last dot?

倖福魔咒の submitted on 2021-02-08 11:27:34
Question: After exploding a nested structure I have a DataFrame with column names like this:

sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3

When performing a select I'm getting the error:

cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]

How should I select from the DataFrame so the column names are parsed correctly? I've tried the following: the substrings after the dots are extracted successfully, but …
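A hedged sketch of one way to handle this: the dotted names have to be quoted in backticks so Spark does not treat them as struct field access, and each column can then be renamed to the part after its last dot. The df parameter stands for the exploded DataFrame from the question; beware of collisions if two names share the same suffix.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def stripDottedPrefixes(df: DataFrame): DataFrame = {
  val renamed = df.columns.map { name =>
    // Backticks stop "sales_data.metric1" being parsed as struct access
    col(s"`$name`").as(name.substring(name.lastIndexOf('.') + 1))
  }
  df.select(renamed: _*)
}

// usage: val clean = stripDottedPrefixes(explodedDf)  -- hypothetical variable name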

Pyspark TextParsingException while loading a file

限于喜欢 submitted on 2021-02-08 11:26:30
Question: I am loading a CSV file with 1 million records using PySpark, but I am getting this error:

TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000)

I checked whether any record in the file has more than 1,000,000 characters, but none does; the maximum record length in my file is 850. Please help. Code snippet:

input_df = spark.read.format('com.databricks.spark.csv').option("delimiter"," …
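Shown in Scala for consistency with the other snippets on this page (the option strings are the same in PySpark), here is a hedged sketch of the usual remedies: raise or disable the parser's per-column character limit via maxCharsPerColumn, and double-check the quote and delimiter settings, since an unbalanced quote can make the parser read many records as a single field. The path and delimiter are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("CsvParse").getOrCreate()

val inputDf = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ",")            // placeholder; the original value is truncated above
  .option("quote", "\"")               // verify quoting: a stray quote can merge records
  .option("maxCharsPerColumn", "-1")   // -1 = unlimited (the default limit is 1,000,000)
  .load("/path/to/input.csv")          // hypothetical path

inputDf.show(5)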

pyspark split dataframe by two columns without creating a folder structure for the 2nd

[亡魂溺海] submitted on 2021-02-08 11:05:56
Question: Two-part question. I have a PySpark DataFrame that I'm reading from a list of JSON files in my Azure Blob Storage. After some simple ETL I need to move this from blob storage to a data lake as a Parquet file, simple so far. I'm unsuccessfully trying to efficiently write this into a folder structure defined by two columns, one a date column and the other an ID. Using partitionBy gets me close:

id | date                | nested_json_data | path
1  | 2019-01-01 12:01:01 | {data : [data]}  | dbfs:\mnt\..
1  | 2019-01 …
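One hedged reading of the (truncated) requirement, sketched below: keep a directory per id only, while still getting roughly one Parquet file per (id, day) pair by repartitioning on both columns before the write. The column names, the date truncation, and the target path are assumptions.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def writeByIdAndDay(df: DataFrame, target: String): Unit = {
  val withDay = df.withColumn("event_day", to_date(col("date")))

  withDay
    .repartition(col("id"), col("event_day"))  // roughly one output file per id/day pair
    .write
    .partitionBy("id")                         // folder structure only for the first column
    .mode("overwrite")
    .parquet(target)
}

// usage: writeByIdAndDay(etlDf, "dbfs:/mnt/datalake/output")  -- hypothetical names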

Creating a UDF with a non-primitive data type and using it in a Spark SQL query: Scala

梦想与她 submitted on 2021-02-08 11:00:42
Question: I am creating a function in Scala that I want to use in my Spark SQL query. The query works fine in Hive, and also if I run the same query in Spark SQL, but I use it in multiple places, so I want to turn it into a reusable function/method I can call whenever it's required. I have created the function below in my Scala class:

def date_part(date_column: Column) = {
  val m1: Column = month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy")))) // gives value as 01 …
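Based only on the fragment above, a hedged sketch of the two usual flavours, assuming spark is the active SparkSession: the Column-based helper is reusable from the DataFrame API as-is, but to call it from a SQL string it has to be registered as a UDF over plain values rather than Columns. The table and column names in the commented query are hypothetical.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// DataFrame-API flavour: callable anywhere a Column expression is accepted
def date_part(date_column: Column): Column =
  month(to_date(from_unixtime(unix_timestamp(date_column, "dd-MM-yyyy"))))

// SQL flavour: a plain String => Int function registered for spark.sql queries
spark.udf.register("date_part_udf", (s: String) =>
  LocalDate.parse(s, DateTimeFormatter.ofPattern("dd-MM-yyyy")).getMonthValue)

// spark.sql("SELECT date_part_udf(order_date) AS order_month FROM orders")  -- hypothetical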

Kryo vs Encoder vs Java Serialization in Spark?

廉价感情. submitted on 2021-02-08 10:40:35
Question: Which serialization is used in which case? The Spark documentation says it provides two serialization libraries: 1. Java (default) and 2. Kryo. So where do Encoders come from, and why aren't they mentioned in the doc? Databricks also says Encoders perform faster for Datasets; what about RDDs, and how do all of these map together? In which case should which serializer be used? Answer 1: Encoders are used for Datasets only. Kryo is used internally in Spark. Kryo and Java serialization is …
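A hedged sketch illustrating where each mechanism sits: Encoders belong to the Dataset API, while the spark.serializer setting (Java by default, Kryo if configured) governs how shuffled or serialized-cached RDD objects are encoded. The case class and names are illustrative only.

import org.apache.spark.sql.{Encoders, SparkSession}

case class Record(id: Long, name: String)

val spark = SparkSession.builder
  .master("local[*]")
  .appName("SerializationDemo")
  // affects serialization of shuffled/cached RDD objects, not Dataset Encoders
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

val ds = Seq(Record(1L, "a"), Record(2L, "b")).toDS()           // uses the compile-time product Encoder
val kryoBacked = Encoders.kryo[Record]                          // explicit Kryo-backed Encoder, if ever needed
val rdd = spark.sparkContext.parallelize(Seq(Record(3L, "c")))  // uses spark.serializer when shuffled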

Create Empty dataframe Java Spark

帅比萌擦擦* submitted on 2021-02-08 10:22:21
Question: There are many examples of how to create an empty DataFrame/Dataset using Spark with Scala/Python, but I would like to know how to create an empty DataFrame/Dataset in Java Spark. I have to create an empty DataFrame with just one column, with the header Column_1 and type String. Answer 1: Alternative 1: create an empty DataFrame with a user-defined schema:

// alternative - 1
StructType s = new StructType()
    .add(new StructField("Column_1", DataTypes.StringType, true, Metadata.empty()));
Dataset<Row> csv = spark …
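The answer's Java snippet is cut off above. As a hedged sketch of the same idea, shown in Scala for consistency with the other snippets on this page (the Java API offers the analogous createDataFrame(List<Row>, StructType)):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{Metadata, StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").appName("EmptyDf").getOrCreate()

val schema = StructType(Seq(
  StructField("Column_1", StringType, nullable = true, Metadata.empty)))

// Empty RDD[Row] + user-defined schema => empty single-column DataFrame
val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

emptyDf.printSchema()  // root |-- Column_1: string (nullable = true)
emptyDf.show()         // prints just the Column_1 header with no rows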

Scala: Get every combination of the last 24 months

笑着哭i submitted on 2021-02-08 10:20:17
Question: I'm trying to generate a DataFrame in Spark (though perhaps plain Scala is enough) in which I have every combination of the last 24 months where the second year-month is always greater than the first year-month. For example, it is 1 March 2019 as of writing this, and I'm after something like:

List(
  (2017, 3, 2017, 4),
  (2017, 3, 2017, 5),
  (2017, 3, 2017, 6),
  // ..
  (2017, 3, 2019, 3),
  (2017, 4, 2017, 5),
  // ..
  (2019, 1, 2019, 3),
  (2019, 2, 2019, 3),
)

Answer 1: This is easiest done with pure Scala without …
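The answer is truncated above; here is a hedged sketch of how the pure-Scala approach could look. The fixed reference date and the 25-value window mirror the question's sample output, and converting the result to a DataFrame is left as a comment.

import java.time.YearMonth

val reference = YearMonth.of(2019, 3)                      // "1 March 2019" from the question
// 0 to 24 spans both endpoints shown in the sample output (2017-03 .. 2019-03)
val months = (0 to 24).map(reference.minusMonths(_)).reverse

val combos = for {
  first  <- months
  second <- months
  if second.isAfter(first)
} yield (first.getYear, first.getMonthValue, second.getYear, second.getMonthValue)

combos.take(3).foreach(println)  // (2017,3,2017,4), (2017,3,2017,5), (2017,3,2017,6)
// In Spark: import spark.implicits._; combos.toDF("y1", "m1", "y2", "m2")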