apache-spark-sql

PySpark groupby and max value selection

Submitted by 六月ゝ 毕业季﹏ on 2020-06-11 05:33:19
Question: I have a PySpark dataframe like:

name   city    date
satya  Mumbai  13/10/2016
satya  Pune    02/11/2016
satya  Mumbai  22/11/2016
satya  Pune    29/11/2016
satya  Delhi   30/11/2016
panda  Delhi   29/11/2016
brata  BBSR    28/11/2016
brata  Goa     30/10/2016
brata  Goa     30/10/2016

I need to find the most preferred city for each name. The logic is: take a city as fav_city if it has the maximum number of occurrences for the aggregated 'name'+'city' pair. If multiple cities have the same occurrence count, pick the city with the latest date. Will …
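A minimal PySpark sketch of one common approach (not necessarily the asker's final solution): count each name/city pair, then rank cities within each name by count and latest date using a window, keeping the top row. Column names follow the example above; the dd/MM/yyyy date format is an assumption.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count occurrences of each (name, city) pair and keep the latest date seen for that pair.
pair_stats = (
    df.withColumn("date", F.to_date("date", "dd/MM/yyyy"))  # assumed date format
      .groupBy("name", "city")
      .agg(F.count("*").alias("cnt"), F.max("date").alias("latest_date"))
)

# Within each name, rank cities by occurrence count, breaking ties by the latest date.
w = Window.partitionBy("name").orderBy(F.desc("cnt"), F.desc("latest_date"))

fav_city = (
    pair_stats.withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .select("name", F.col("city").alias("fav_city"))
)
fav_city.show()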

Why do I get so many empty partitions when repartitioning a Spark DataFrame?

Submitted by ぃ、小莉子 on 2020-06-10 05:09:27
Question: I want to partition a dataframe "df1" on 3 columns. This dataframe has exactly 990 unique combinations of those 3 columns:

In [17]: df1.createOrReplaceTempView("df1_view")
In [18]: spark.sql("select count(*) from (select distinct(col1,col2,col3) from df1_view) as t").show()
+--------+
|count(1)|
+--------+
|     990|
+--------+

To optimize processing of this dataframe, I want to repartition df1 so that I get 990 partitions, one for each key combination:

In [19]: df1.rdd …
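For context, repartitioning on key columns uses hash partitioning, so several keys can hash to the same partition and leave others empty; 990 distinct keys will almost never map one-to-one onto 990 partitions. A minimal sketch illustrating this, assuming the column names col1/col2/col3 from the excerpt:

from pyspark.sql import functions as F

# Repartition on the three key columns; Spark assigns each row by hash(key) % numPartitions,
# so distinct keys can collide into the same partition while other partitions stay empty.
df2 = df1.repartition(990, "col1", "col2", "col3")

# Count how many of the 990 partitions actually received rows.
non_empty = (
    df2.withColumn("pid", F.spark_partition_id())
       .groupBy("pid")
       .count()
)
print(non_empty.count())  # typically well below 990 because of hash collisions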

Building a StructType from a dataframe in pyspark

Submitted by 我怕爱的太早我们不能终老 on 2020-06-09 11:17:46
Question: I am new to Spark and Python, and I am having difficulty building a schema from a metadata file so that it can be applied to my data file. Scenario: the metadata file for the data file (CSV format) contains the columns and their types, for example:

id,int,10,"","",id,"","",TRUE,"",0
created_at,timestamp,"","","",created_at,"","",FALSE,"",0

I have successfully converted this to a dataframe that looks like:

+--------------------+---------------+
|                name|           type|
+--------------------+---------------+
|                  id| …
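A minimal sketch of one way to turn such a name/type dataframe into a StructType; the type-name mapping, the meta_df variable, and the file path are illustrative assumptions, not the asker's actual setup:

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, TimestampType)

# Assumed mapping from the metadata's type names to Spark types.
type_map = {
    "int": IntegerType(),
    "string": StringType(),
    "timestamp": TimestampType(),
}

# meta_df is the dataframe with "name" and "type" columns shown in the excerpt above.
fields = [
    StructField(row["name"], type_map.get(row["type"], StringType()), True)
    for row in meta_df.collect()
]
schema = StructType(fields)

# The resulting schema can then be applied when reading the data file.
data_df = spark.read.csv("path/to/data.csv", schema=schema, header=False)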

Perform NLTK in pyspark

Submitted by 依然范特西╮ on 2020-06-09 07:08:06
Question: I am very new to PySpark and I have developed a program to run NLTK processing on an HDFS file. I'm using Spark 2.3.1. These are the steps (a sketch of steps 2-6 follows the list):

1. Get the file from HDFS
2. Perform lemmatization
3. Remove punctuation marks
4. Convert the RDD to a DataFrame
5. Run the Tokenizer
6. Remove stop words
7. Explode the column data to create a unique row for each record
8. I want to keep all files' data in a single file, so I am merging the output with the old file
9. Now write this entire merged output into HDFS
10. …
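A minimal sketch of steps 2-6 above (lemmatization, punctuation removal, tokenizing, stop-word removal) using NLTK inside a PySpark UDF plus Spark ML transformers; the input column name and the NLTK resources are assumptions:

import re
from nltk.stem import WordNetLemmatizer
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from pyspark.ml.feature import Tokenizer, StopWordsRemover

# Assumes the WordNet data is available on the executors (nltk.download('wordnet')).
lemmatizer = WordNetLemmatizer()

@F.udf(StringType())
def lemmatize_and_clean(text):
    # Steps 2-3: lemmatize each word and strip punctuation.
    words = [lemmatizer.lemmatize(w) for w in text.split()]
    return re.sub(r"[^\w\s]", "", " ".join(words))

df = raw_df.withColumn("clean_text", lemmatize_and_clean("text"))  # 'text' column is assumed

# Steps 5-6: tokenize and remove stop words with Spark ML feature transformers.
tokens = Tokenizer(inputCol="clean_text", outputCol="tokens").transform(df)
no_stop = StopWordsRemover(inputCol="tokens", outputCol="filtered").transform(tokens)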

How to use Databricks Job Spark Configuration spark_conf?

Submitted by 半世苍凉 on 2020-06-09 05:49:08
Question: I have some sample Spark code in which I am trying to read table names from the Spark configuration supplied through the spark_conf option, using a Typesafe application.conf together with the Spark conf set in the Databricks UI. The code I am using is below. When I hit the Run button in the Databricks UI, the job finishes successfully, but the println call prints dummyValue instead of ThisIsTableAOne, ThisIsTableBOne... I can see from the Spark UI that the configurations for the table names are being …
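The question itself concerns Scala code with a Typesafe application.conf, but the general pattern of reading values passed through a Databricks job's spark_conf is the same: they show up in the running session's Spark configuration. A minimal PySpark sketch under that assumption, with illustrative key names rather than the asker's actual keys:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Values supplied in the job's spark_conf (e.g. "spark.tableNameA": "ThisIsTableAOne")
# are visible on the session's runtime configuration; the second argument is only
# used as a fallback when the key was not set on the job or cluster.
table_a = spark.conf.get("spark.tableNameA", "dummyValue")
table_b = spark.conf.get("spark.tableNameB", "dummyValue")
print(table_a, table_b)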

remove null array field from dataframe while converting it to JSON

Submitted by ▼魔方 西西 on 2020-06-09 05:29:26
Question: Is there any method by which I can create JSON from a Spark dataframe while leaving out the fields that are null? Suppose I have a dataframe:

+-------+----------------+
|   name|       hit_songs|
+-------+----------------+
|beatles|[help, hey jude]|
|  romeo|      [eres mia]|
| juliet|            null|
+-------+----------------+

I want to convert it into JSON like:

[{ name: "beatles", hit_songs: [help, hey jude] },
 { name: "romeo", hit_songs: [eres mia] },
 { name: "juliet" }]

I don't want the field hit_songs in …
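A minimal sketch of one approach: DataFrame.toJSON renders each row as a JSON string and omits null fields by default, which is close to the output asked for (df and its columns are taken from the example above):

# toJSON leaves out fields that are null, so the 'juliet' row comes out
# as {"name":"juliet"} with no hit_songs key at all.
json_rows = df.toJSON().collect()
for row in json_rows:
    print(row)

# Writing to files behaves the same way (null fields are dropped by default):
# df.write.json("output/path")  # path is a placeholder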

Apache spark case with multiple when clauses on different columns

Submitted by 依然范特西╮ on 2020-06-08 05:59:07
Question: Given the structure below:

val df = Seq("Color", "Shape", "Range", "Size").map(Tuple1.apply).toDF("color")
val df1 = df.withColumn("Success", when($"color" <=> "white", "Diamond").otherwise(0))

I want to add one more WHEN condition to the above: where size > 10 and the Shape column value is Rhombus, the value "Diamond" should be inserted into the column, else 0. I tried the following, but it fails:

val df1 = df.withColumn("Success", when($"color" <=> "white", "Diamond").otherwise(0)).when($"size">10) …
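The failure comes from chaining .when after .otherwise: otherwise closes the CASE expression, so no further .when can be chained after it, and every branch has to come before it. The question's code is Scala; a minimal equivalent sketch in PySpark, with the Size/Shape column names taken from the prose:

from pyspark.sql import functions as F

# Chain all when() branches first; otherwise() must come last because it ends the CASE expression.
df1 = df.withColumn(
    "Success",
    F.when(F.col("color").eqNullSafe("white"), "Diamond")  # null-safe equality, like <=> in Scala
     .when((F.col("Size") > 10) & (F.col("Shape") == "Rhombus"), "Diamond")
     .otherwise(0),
)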

pySpark mapping multiple variables

Submitted by 天涯浪子 on 2020-06-05 11:39:15
Question: The code below maps values and column names of my reference df against my actual dataset, finding exact matches; when an exact match is found, it returns the OutputItemNameByValue. However, I'm trying to add the rule that when PrimaryLookupAttributeValue = DEFAULT, the OutputItemNameByValue should also be returned. The solution I'm trying out is to create a new dataframe with null values, since no match was produced by the code below, and then as a next step to target those null values …
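A minimal sketch of one way to express the "exact match, otherwise fall back to the DEFAULT row" rule with a second lookup and coalesce; the PrimaryLookupAttributeValue and OutputItemNameByValue column names come from the excerpt, while the dataframe names and the join key are placeholders:

from pyspark.sql import functions as F

# Exact-match lookup against the reference dataframe (join key is a placeholder).
exact = dataset_df.join(
    ref_df,
    dataset_df["AttributeValue"] == ref_df["PrimaryLookupAttributeValue"],
    "left",
).select(dataset_df["*"], ref_df["OutputItemNameByValue"].alias("exact_out"))

# Fallback lookup: the reference row(s) where PrimaryLookupAttributeValue = 'DEFAULT'.
default_out = (
    ref_df.filter(F.col("PrimaryLookupAttributeValue") == "DEFAULT")
          .select(F.col("OutputItemNameByValue").alias("default_out"))
)

# Keep the exact match when present, otherwise fall back to the DEFAULT value.
result = (
    exact.crossJoin(default_out)
         .withColumn("OutputItemNameByValue", F.coalesce(F.col("exact_out"), F.col("default_out")))
         .drop("exact_out", "default_out")
)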