apache-spark-sql

Create a new dataset based on a given operation column

半腔热情 submitted on 2020-06-30 08:38:59

Question: I am using spark-sql-2.3.1v and have the below scenario. Given a dataset:

    val ds = Seq(
      (1, "x1", "y1", "0.1992019"),
      (2, null, "y2", "2.2500000"),
      (3, "x3", null, "15.34567"),
      (4, null, "y4", null),
      (5, "x4", "y4", "0")
    ).toDF("id", "col_x", "col_y", "value")

i.e.

    +---+-----+-----+---------+
    | id|col_x|col_y|    value|
    +---+-----+-----+---------+
    |  1|   x1|   y1|0.1992019|
    |  2| null|   y2|2.2500000|
    |  3|   x3| null| 15.34567|
    |  4| null|   y4|     null|
    |  5|   x4|   y4|        0|
    +---+-----+-----+---------+

Requirement: I

How to pass a configuration file hosted in HDFS to a Spark application?

心已入冬 submitted on 2020-06-29 08:03:05

Question: I'm working with Spark Structured Streaming, using Scala. I want to pass a config file to my Spark application. This configuration file is hosted in HDFS. For example:

spark_job.conf (HOCON)

    spark {
      appName: "",
      master: "",
      shuffle.size: 4
      etc..
    }

    kafkaSource {
      servers: "",
      topic: "",
      etc..
    }

    redisSink {
      host: "",
      port: 999,
      timeout: 2000,
      checkpointLocation: "hdfs location",
      etc..
    }

How can I pass it to the Spark application? How can I read this file (hosted in HDFS) in Spark?

Answer 1:
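The answer body is not included in this excerpt. Below is a minimal Scala sketch of one common approach, assuming the Typesafe Config (HOCON) library is on the classpath; the HDFS path and the object name are hypothetical. The idea is to open the file through Hadoop's FileSystem API and hand the stream to ConfigFactory.

    import java.io.InputStreamReader
    import com.typesafe.config.{Config, ConfigFactory}
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    object ConfigFromHdfs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("config-demo").getOrCreate()

        // Open the HOCON file directly from HDFS using the cluster's Hadoop configuration.
        val confPath = new Path("hdfs:///configs/spark_job.conf") // hypothetical path
        val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
        val reader = new InputStreamReader(fs.open(confPath))
        val conf: Config = try ConfigFactory.parseReader(reader) finally reader.close()

        // Read values using the keys from the example file above.
        val topic = conf.getString("kafkaSource.topic")
        val redisPort = conf.getInt("redisSink.port")
        println(s"topic=$topic, redisPort=$redisPort")

        spark.stop()
      }
    }

An alternative is to ship the file with spark-submit --files and parse it from the application's working directory with ConfigFactory.parseFile(new java.io.File("spark_job.conf")).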

Error including a column in a join between spark dataframes

我怕爱的太早我们不能终老 submitted on 2020-06-29 06:42:21

Question: I have a join between cleanDF and sentiment_df using array_contains that works fine (from solution 61687997), and I need to include in the resulting df a new column ('Year') from cleanDF. This is the join:

    from pyspark.sql import functions
    Result = cleanDF.join(sentiment_df, expr("""array_contains(MeaningfulWords,word)"""), how='left')\
        .groupBy("ID")\
        .agg(first("MeaningfulWords").alias("MeaningfulWords")\
             ,collect_list("score").alias("ScoreList")\
             ,mean("score").alias("MeanScore"))

This is the
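The excerpt cuts off here, but a common way to carry an extra column such as Year through this kind of aggregation is to group by it as well, or to take first() of it inside the agg. A minimal sketch of that idea, written in Scala (the PySpark calls mirror it) and assuming the column names from the question:

    import org.apache.spark.sql.functions.{collect_list, expr, first, mean}

    val result = cleanDF
      .join(sentiment_df, expr("array_contains(MeaningfulWords, word)"), "left")
      .groupBy("ID", "Year") // or keep groupBy("ID") and add first("Year").alias("Year") to the agg
      .agg(
        first("MeaningfulWords").alias("MeaningfulWords"),
        collect_list("score").alias("ScoreList"),
        mean("score").alias("MeanScore")
      )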

Parsing Nested JSON into a Spark DataFrame Using PySpark

陌路散爱 submitted on 2020-06-29 05:44:49

Question: I would really love some help with parsing nested JSON data using PySpark-SQL. The data has the following schema (blank spaces are edits for confidentiality purposes...)

Schema

    root
     |-- location_info: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- restaurant_type: string (nullable = true)
     |    |    |
     |    |    |
     |    |    |-- other_data: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- other_data_1: string (nullable = true)
     |    |    |    |    |-- other_data_2
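For a schema shaped like this, the usual pattern is to explode the outer array, then the inner one, and read nested fields with dot notation. A minimal sketch in Scala (the PySpark functions have the same names), assuming a DataFrame df already loaded with the schema above and using only the field names that are visible in it:

    import org.apache.spark.sql.functions.{col, explode}

    val flattened = df
      .select(explode(col("location_info")).alias("loc"))
      .select(
        col("loc.restaurant_type").alias("restaurant_type"),
        explode(col("loc.other_data")).alias("od")
      )
      .select(col("restaurant_type"), col("od.other_data_1"), col("od.other_data_2"))

    flattened.printSchema()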

Self join in spark and apply multiple filter criteria in spark Scala

删除回忆录丶 submitted on 2020-06-29 05:10:43

Question: I want to write Spark code in Scala to filter out the rows to fill. I already have a Spark SQL query but want to convert it into Spark Scala code. In the query I perform an inner join on the same data frame and apply some filter criteria, such as: the difference between two date fields should be within the range of 1 to 9. The Spark query is self-explanatory, hence I am not explaining it.

    spark.sql("select * from df1 where Container not in(select a.Container from df1 a inner
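The query in the excerpt is truncated, but its general shape (a self join combined with a "not in (subquery)" and a date-difference filter between 1 and 9) can be sketched with the DataFrame API as below. Apart from Container, the column names (here event_date) are placeholders for illustration:

    import org.apache.spark.sql.functions.{col, datediff}

    val a = df1.alias("a")
    val b = df1.alias("b")

    // Containers that have a matching row whose two dates differ by 1 to 9 days.
    val matched = a.join(b, col("a.Container") === col("b.Container"))
      .filter(datediff(col("a.event_date"), col("b.event_date")).between(1, 9))
      .select(col("a.Container"))
      .distinct()

    // Keep only the rows whose Container is NOT in that set (the "not in" part).
    val result = df1.join(matched, Seq("Container"), "left_anti")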

From the following code, how to convert a JavaRDD<Integer> to a DataFrame or Dataset?

笑着哭i submitted on 2020-06-29 03:56:07

Question:

    public static void main(String[] args) {
        SparkSession sessn = SparkSession.builder().appName("RDD2DF").master("local").getOrCreate();
        List<Integer> lst = Arrays.asList(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
        Dataset<Integer> DF = sessn.createDataset(lst, Encoders.INT());
        System.out.println(DF.javaRDD().getNumPartitions());
        JavaRDD<Integer> mappartRdd = DF.repartition(3).javaRDD().mapPartitions(it -> Arrays.asList(JavaConversions.asScalaIterator(it).length()).iterator());
    }

From
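The question text breaks off here, but the conversion it asks about (turning the per-partition counts held in an RDD of integers back into a DataFrame or Dataset) can be sketched as follows. The sketch is in Scala; the Java API exposes matching createDataset / createDataFrame methods on SparkSession.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("RDD2DF").master("local").getOrCreate()
    import spark.implicits._

    val ds = spark.createDataset(1 to 20)                                      // Dataset[Int], as in the Java code
    val counts = ds.repartition(3).rdd.mapPartitions(it => Iterator(it.size))  // RDD[Int] of per-partition record counts

    val countsDF = counts.toDF("records_per_partition")                        // RDD[Int] -> DataFrame via implicits
    val countsDS = spark.createDataset(counts)                                 // RDD[Int] -> Dataset[Int]
    countsDF.show()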

Spark: merge two dataframes; if an ID is duplicated across the two dataframes, the row in df1 overwrites the row in df2

痴心易碎 submitted on 2020-06-28 08:17:29

Question: There are two dataframes, df1 and df2, with the same schema; ID is the primary key. I need to merge df1 and df2. This could be done with union, except for one special requirement: if there are duplicate rows with the same ID in df1 and df2, I need to keep the one in df1.

df1:

    ID col1 col2
    1  AA   2019
    2  B    2018

df2:

    ID col1 col2
    1  A    2019
    3  C    2017

I need the following output (df1):

    ID col1 col2
    1  AA   2019
    2  B    2018
    3  C    2017

How to do this? Thanks. I think it is possible to register two tmp tables, do full
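One way to express this, sketched in Scala: keep all of df1 and append only those df2 rows whose ID does not already appear in df1; a left_anti join covers the "not already in df1" part. Assuming df1 and df2 as shown above:

    // df2 rows whose ID is absent from df1, stacked under df1 (the schemas are identical).
    val merged = df1.union(df2.join(df1, Seq("ID"), "left_anti"))

    merged.orderBy("ID").show()
    // +---+----+----+
    // | ID|col1|col2|
    // +---+----+----+
    // |  1|  AA|2019|
    // |  2|   B|2018|
    // |  3|   C|2017|
    // +---+----+----+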

Changing the date format of the column values in a Spark dataframe

耗尽温柔 submitted on 2020-06-28 04:05:25

Question: I am reading an Excel sheet into a DataFrame in Spark 2.0 and then trying to convert some columns with date values in MM/DD/YY format into YYYY-MM-DD format. The values are in string format. Below is a sample:

    +---------------+--------------+
    |modified       | created      |
    +---------------+--------------+
    |           null| 12/4/17 13:45|
    |        2/20/18|  2/2/18 20:50|
    |        3/20/18|  2/2/18 21:10|
    |        2/20/18|  2/2/18 21:23|
    |        2/28/18|12/12/17 15:42|
    |        1/25/18| 11/9/17 13:10|
    |        1/29/18| 12/6/17 10:07|
    +---------------+-----
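A minimal Scala sketch, assuming the columns are plain strings as shown (the created column also carries a time component). Because to_date only gained a format argument in Spark 2.2, a pattern-based unix_timestamp followed by from_unixtime is one approach that works on Spark 2.0:

    import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp}

    val converted = df
      .withColumn("modified",
        from_unixtime(unix_timestamp(col("modified"), "M/d/yy"), "yyyy-MM-dd"))
      .withColumn("created",
        from_unixtime(unix_timestamp(col("created"), "M/d/yy H:mm"), "yyyy-MM-dd"))

    converted.show() // nulls stay null; values that do not match the pattern also become null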

Add column to pyspark dataframe based on a condition [duplicate]

老子叫甜甜 submitted on 2020-06-28 01:59:05

Question: This question already has answers here: Spark Equivalent of IF Then ELSE (4 answers). Closed last year.

My data.csv file has three columns, as given below. I have converted this file to a PySpark dataframe.

      A    B   C
    | 1 | -3 | 4 |
    | 2 |  0 | 5 |
    | 6 |  6 | 6 |

I want to add another column D to the Spark dataframe, with values Yes or No, based on the condition that if the corresponding value in column B is greater than 0 then Yes, otherwise No.

      A    B   C   D
    | 1 | -3 | 4 | No |
    | 2 |  0 | 5 | No |
    | 6 |  6 |
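The linked duplicate points at when/otherwise, which is the usual way to express this. A minimal sketch in Scala (pyspark.sql.functions.when/otherwise behaves the same), assuming the dataframe df with columns A, B, C:

    import org.apache.spark.sql.functions.{col, lit, when}

    val withD = df.withColumn("D", when(col("B") > 0, lit("Yes")).otherwise(lit("No")))
    withD.show()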