apache-spark-sql

Create a new dataset based on a given operation column

半腔热情 submitted on 2020-06-30 08:38:59

Question: I am using spark-sql-2.3.1v and have the below scenario. Given a dataset:

    val ds = Seq(
      (1, "x1", "y1", "0.1992019"),
      (2, null, "y2", "2.2500000"),
      (3, "x3", null, "15.34567"),
      (4, null, "y4", null),
      (5, "x4", "y4", "0")
    ).toDF("id", "col_x", "col_y", "value")

i.e.

    +---+-----+-----+---------+
    | id|col_x|col_y|    value|
    +---+-----+-----+---------+
    |  1|   x1|   y1|0.1992019|
    |  2| null|   y2|2.2500000|
    |  3|   x3| null| 15.34567|
    |  4| null|   y4|     null|
    |  5|   x4|   y4|        0|
    +---+-----+-----+---------+

Requirement: I

How to pass a configuration file hosted in HDFS to a Spark application?

心已入冬 submitted on 2020-06-29 08:03:05

Question: I'm working with Spark Structured Streaming, using Scala. I want to pass a config file to my Spark application. This configuration file is hosted in HDFS. For example:

spark_job.conf (HOCON)

    spark {
      appName: "",
      master: "",
      shuffle.size: 4
      etc..
    }

    kafkaSource {
      servers: "",
      topic: "",
      etc..
    }

    redisSink {
      host: "",
      port: 999,
      timeout: 2000,
      checkpointLocation: "hdfs location",
      etc..
    }

How can I pass it to the Spark application? How can I read this file (hosted in HDFS) in Spark?

Answer 1:
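The answer body is not included in this excerpt. Below is a minimal Scala sketch of one common approach, assuming the Typesafe Config (HOCON) library is on the classpath; the HDFS path and the object name are hypothetical. The idea is to open the file through Hadoop's FileSystem API and hand the stream to ConfigFactory.

    import java.io.InputStreamReader
    import com.typesafe.config.{Config, ConfigFactory}
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    object ConfigFromHdfs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("config-demo").getOrCreate()

        // Open the HOCON file directly from HDFS using the cluster's Hadoop configuration.
        val confPath = new Path("hdfs:///configs/spark_job.conf") // hypothetical path
        val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
        val reader = new InputStreamReader(fs.open(confPath))
        val conf: Config = try ConfigFactory.parseReader(reader) finally reader.close()

        // Read values using the keys from the example file above.
        val topic = conf.getString("kafkaSource.topic")
        val redisPort = conf.getInt("redisSink.port")
        println(s"topic=$topic, redisPort=$redisPort")

        spark.stop()
      }
    }

An alternative is to ship the file with spark-submit --files and parse it from the application's working directory with ConfigFactory.parseFile(new java.io.File("spark_job.conf")).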

Error including a column in a join between spark dataframes

我怕爱的太早我们不能终老 submitted on 2020-06-29 06:42:21

Question: I have a join between cleanDF and sentiment_df using array_contains that works fine (from solution 61687997), and I need to include in the resulting df a new column ('Year') from cleanDF. This is the join:

    from pyspark.sql import functions
    Result = cleanDF.join(sentiment_df, expr("""array_contains(MeaningfulWords,word)"""), how='left')\
        .groupBy("ID")\
        .agg(first("MeaningfulWords").alias("MeaningfulWords")\
             ,collect_list("score").alias("ScoreList")\
             ,mean("score").alias("MeanScore"))

This is the
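The excerpt cuts off here, but a common way to carry an extra column such as Year through this kind of aggregation is to group by it as well, or to take first() of it inside the agg. A minimal sketch of that idea, written in Scala (the PySpark calls mirror it) and assuming the column names from the question:

    import org.apache.spark.sql.functions.{collect_list, expr, first, mean}

    val result = cleanDF
      .join(sentiment_df, expr("array_contains(MeaningfulWords, word)"), "left")
      .groupBy("ID", "Year") // or keep groupBy("ID") and add first("Year").alias("Year") to the agg
      .agg(
        first("MeaningfulWords").alias("MeaningfulWords"),
        collect_list("score").alias("ScoreList"),
        mean("score").alias("MeanScore")
      )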

Parsing Nested JSON into a Spark DataFrame Using PySpark

陌路散爱 submitted on 2020-06-29 05:44:49

Question: I would really love some help with parsing nested JSON data using PySpark-SQL. The data has the following schema (blank spaces are edits for confidentiality purposes...)

Schema

    root
     |-- location_info: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- restaurant_type: string (nullable = true)
     |    |    |
     |    |    |
     |    |    |-- other_data: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- other_data_1: string (nullable = true)
     |    |    |    |    |-- other_data_2
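For a schema shaped like this, the usual pattern is to explode the outer array, then the inner one, and read nested fields with dot notation. A minimal sketch in Scala (the PySpark functions have the same names), assuming a DataFrame df already loaded with the schema above and using only the field names that are visible in it:

    import org.apache.spark.sql.functions.{col, explode}

    val flattened = df
      .select(explode(col("location_info")).alias("loc"))
      .select(
        col("loc.restaurant_type").alias("restaurant_type"),
        explode(col("loc.other_data")).alias("od")
      )
      .select(col("restaurant_type"), col("od.other_data_1"), col("od.other_data_2"))

    flattened.printSchema()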

Self join in spark and apply multiple filter criteria in spark Scala

删除回忆录丶 submitted on 2020-06-29 05:10:43

Question: I want to write Spark code in Scala to filter out the rows to fill. I already have a Spark SQL query but want to convert it into Spark Scala code. In the query I perform an inner join on the same data frame and apply some filter criteria, such as: the difference between two date fields should be within the range of 1 to 9. The Spark query is self-explanatory, hence I am not explaining it.

    spark.sql("select * from df1 where Container not in(select a.Container from df1 a inner
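The query in the excerpt is truncated, but its general shape (a self join combined with a "not in (subquery)" and a date-difference filter between 1 and 9) can be sketched with the DataFrame API as below. Apart from Container, the column names (here event_date) are placeholders for illustration:

    import org.apache.spark.sql.functions.{col, datediff}

    val a = df1.alias("a")
    val b = df1.alias("b")

    // Containers that have a matching row whose two dates differ by 1 to 9 days.
    val matched = a.join(b, col("a.Container") === col("b.Container"))
      .filter(datediff(col("a.event_date"), col("b.event_date")).between(1, 9))
      .select(col("a.Container"))
      .distinct()

    // Keep only the rows whose Container is NOT in that set (the "not in" part).
    val result = df1.join(matched, Seq("Container"), "left_anti")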

From the following code, how to convert a JavaRDD<Integer> to a DataFrame or Dataset?

笑着哭i submitted on 2020-06-29 03:56:07

Question:

    public static void main(String[] args) {
        SparkSession sessn = SparkSession.builder().appName("RDD2DF").master("local").getOrCreate();
        List<Integer> lst = Arrays.asList(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
        Dataset<Integer> DF = sessn.createDataset(lst, Encoders.INT());
        System.out.println(DF.javaRDD().getNumPartitions());
        JavaRDD<Integer> mappartRdd = DF.repartition(3).javaRDD().mapPartitions(it -> Arrays.asList(JavaConversions.asScalaIterator(it).length()).iterator());
    }

From
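The question text breaks off here, but the conversion it asks about (turning the per-partition counts held in an RDD of integers back into a DataFrame or Dataset) can be sketched as follows. The sketch is in Scala; the Java API exposes matching createDataset / createDataFrame methods on SparkSession.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("RDD2DF").master("local").getOrCreate()
    import spark.implicits._

    val ds = spark.createDataset(1 to 20)                                      // Dataset[Int], as in the Java code
    val counts = ds.repartition(3).rdd.mapPartitions(it => Iterator(it.size))  // RDD[Int] of per-partition record counts

    val countsDF = counts.toDF("records_per_partition")                        // RDD[Int] -> DataFrame via implicits
    val countsDS = spark.createDataset(counts)                                 // RDD[Int] -> Dataset[Int]
    countsDF.show()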

Spark: merge two dataframes; if an ID is duplicated across the two dataframes, the row in df1 overwrites the row in df2

痴心易碎 submitted on 2020-06-28 08:17:29

Question: There are two dataframes, df1 and df2, with the same schema; ID is the primary key. I need to merge df1 and df2. This could be done with union, except for one special requirement: if there are duplicate rows with the same ID in df1 and df2, I need to keep the one in df1.

df1:

    ID col1 col2
    1  AA   2019
    2  B    2018

df2:

    ID col1 col2
    1  A    2019
    3  C    2017

I need the following output (df1):

    ID col1 col2
    1  AA   2019
    2  B    2018
    3  C    2017

How to do this? Thanks. I think it is possible to register two tmp tables, do full
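One way to express this, sketched in Scala: keep all of df1 and append only those df2 rows whose ID does not already appear in df1; a left_anti join covers the "not already in df1" part. Assuming df1 and df2 as shown above:

    // df2 rows whose ID is absent from df1, stacked under df1 (the schemas are identical).
    val merged = df1.union(df2.join(df1, Seq("ID"), "left_anti"))

    merged.orderBy("ID").show()
    // +---+----+----+
    // | ID|col1|col2|
    // +---+----+----+
    // |  1|  AA|2019|
    // |  2|   B|2018|
    // |  3|   C|2017|
    // +---+----+----+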

Changing the date format of the column values in a Spark dataframe

耗尽温柔 submitted on 2020-06-28 04:05:25

Question: I am reading an Excel sheet into a DataFrame in Spark 2.0 and then trying to convert some columns with date values in MM/DD/YY format into YYYY-MM-DD format. The values are in string format. Below is a sample:

    +---------------+--------------+
    |modified       | created      |
    +---------------+--------------+
    |           null| 12/4/17 13:45|
    |        2/20/18|  2/2/18 20:50|
    |        3/20/18|  2/2/18 21:10|
    |        2/20/18|  2/2/18 21:23|
    |        2/28/18|12/12/17 15:42|
    |        1/25/18| 11/9/17 13:10|
    |        1/29/18| 12/6/17 10:07|
    +---------------+-----
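A minimal Scala sketch, assuming the columns are plain strings as shown (the created column also carries a time component). Because to_date only gained a format argument in Spark 2.2, a pattern-based unix_timestamp followed by from_unixtime is one approach that works on Spark 2.0:

    import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp}

    val converted = df
      .withColumn("modified",
        from_unixtime(unix_timestamp(col("modified"), "M/d/yy"), "yyyy-MM-dd"))
      .withColumn("created",
        from_unixtime(unix_timestamp(col("created"), "M/d/yy H:mm"), "yyyy-MM-dd"))

    converted.show() // nulls stay null; values that do not match the pattern also become null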

Add column to pyspark dataframe based on a condition [duplicate]

老子叫甜甜 submitted on 2020-06-28 01:59:05

Question: This question already has answers here: Spark Equivalent of IF Then ELSE (4 answers). Closed last year.

My data.csv file has three columns, as given below. I have converted this file to a PySpark dataframe.

      A    B   C
    | 1 | -3 | 4 |
    | 2 |  0 | 5 |
    | 6 |  6 | 6 |

I want to add another column D to the Spark dataframe, with values Yes or No, based on the condition that if the corresponding value in column B is greater than 0 then Yes, otherwise No.

      A    B   C   D
    | 1 | -3 | 4 | No |
    | 2 |  0 | 5 | No |
    | 6 |  6 |
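The linked duplicate points at when/otherwise, which is the usual way to express this. A minimal sketch in Scala (pyspark.sql.functions.when/otherwise behaves the same), assuming the dataframe df with columns A, B, C:

    import org.apache.spark.sql.functions.{col, lit, when}

    val withD = df.withColumn("D", when(col("B") > 0, lit("Yes")).otherwise(lit("No")))
    withD.show()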