apache-spark-sql

PySpark groupby and max value selection

Submitted by 六月ゝ 毕业季﹏ on 2020-06-11 05:33:19
Question: I have a PySpark dataframe like:

name   city    date
satya  Mumbai  13/10/2016
satya  Pune    02/11/2016
satya  Mumbai  22/11/2016
satya  Pune    29/11/2016
satya  Delhi   30/11/2016
panda  Delhi   29/11/2016
brata  BBSR    28/11/2016
brata  Goa     30/10/2016
brata  Goa     30/10/2016

I need to find the most preferred city for each name. The logic is: take a city as fav_city if it has the maximum number of occurrences for the aggregated 'name'+'city' pair. If multiple cities have the same occurrence count, pick the city with the latest date. Will …
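A minimal PySpark sketch of one common approach (not necessarily the asker's final solution): count each name/city pair, then rank cities within each name by count and latest date using a window, keeping the top row. Column names follow the example above; the dd/MM/yyyy date format is an assumption.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count occurrences of each (name, city) pair and keep the latest date seen for that pair.
pair_stats = (
    df.withColumn("date", F.to_date("date", "dd/MM/yyyy"))  # assumed date format
      .groupBy("name", "city")
      .agg(F.count("*").alias("cnt"), F.max("date").alias("latest_date"))
)

# Within each name, rank cities by occurrence count, breaking ties by the latest date.
w = Window.partitionBy("name").orderBy(F.desc("cnt"), F.desc("latest_date"))

fav_city = (
    pair_stats.withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .select("name", F.col("city").alias("fav_city"))
)
fav_city.show()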

Why do I get so many empty partitions when repartitioning a Spark DataFrame?

Submitted by ぃ、小莉子 on 2020-06-10 05:09:27
Question: I want to partition a dataframe "df1" on 3 columns. This dataframe has exactly 990 unique combinations of those 3 columns:

In [17]: df1.createOrReplaceTempView("df1_view")
In [18]: spark.sql("select count(*) from (select distinct(col1,col2,col3) from df1_view) as t").show()
+--------+
|count(1)|
+--------+
|     990|
+--------+

To optimize processing of this dataframe, I want to repartition df1 so that I get 990 partitions, one for each key combination:

In [19]: df1.rdd …
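For context, repartitioning on key columns uses hash partitioning, so several keys can hash to the same partition and leave others empty; 990 distinct keys will almost never map one-to-one onto 990 partitions. A minimal sketch illustrating this, assuming the column names col1/col2/col3 from the excerpt:

from pyspark.sql import functions as F

# Repartition on the three key columns; Spark assigns each row by hash(key) % numPartitions,
# so distinct keys can collide into the same partition while other partitions stay empty.
df2 = df1.repartition(990, "col1", "col2", "col3")

# Count how many of the 990 partitions actually received rows.
non_empty = (
    df2.withColumn("pid", F.spark_partition_id())
       .groupBy("pid")
       .count()
)
print(non_empty.count())  # typically well below 990 because of hash collisions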

Building a StructType from a dataframe in pyspark

Submitted by 我怕爱的太早我们不能终老 on 2020-06-09 11:17:46
Question: I am new to Spark and Python, and I am having difficulty building a schema from a metadata file so that it can be applied to my data file. Scenario: the metadata file for the data file (CSV format) contains the columns and their types, for example:

id,int,10,"","",id,"","",TRUE,"",0
created_at,timestamp,"","","",created_at,"","",FALSE,"",0

I have successfully converted this to a dataframe that looks like:

+--------------------+---------------+
|                name|           type|
+--------------------+---------------+
|                  id| …
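A minimal sketch of one way to turn such a name/type dataframe into a StructType; the type-name mapping, the meta_df variable, and the file path are illustrative assumptions, not the asker's actual setup:

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, TimestampType)

# Assumed mapping from the metadata's type names to Spark types.
type_map = {
    "int": IntegerType(),
    "string": StringType(),
    "timestamp": TimestampType(),
}

# meta_df is the dataframe with "name" and "type" columns shown in the excerpt above.
fields = [
    StructField(row["name"], type_map.get(row["type"], StringType()), True)
    for row in meta_df.collect()
]
schema = StructType(fields)

# The resulting schema can then be applied when reading the data file.
data_df = spark.read.csv("path/to/data.csv", schema=schema, header=False)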

Perform NLTK in pyspark

Submitted by 依然范特西╮ on 2020-06-09 07:08:06
Question: I am very new to PySpark and I have developed a program to run NLTK processing on an HDFS file. I'm using Spark 2.3.1. These are the steps (a sketch of steps 2-6 follows the list):

1. Get the file from HDFS
2. Perform lemmatization
3. Remove punctuation marks
4. Convert the RDD to a DataFrame
5. Run the Tokenizer
6. Remove stop words
7. Explode the column data to create a unique row for each record
8. I want to keep all files' data in a single file, so I am merging the output with the old file
9. Now write this entire merged output into HDFS
10. …
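A minimal sketch of steps 2-6 above (lemmatization, punctuation removal, tokenizing, stop-word removal) using NLTK inside a PySpark UDF plus Spark ML transformers; the input column name and the NLTK resources are assumptions:

import re
from nltk.stem import WordNetLemmatizer
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from pyspark.ml.feature import Tokenizer, StopWordsRemover

# Assumes the WordNet data is available on the executors (nltk.download('wordnet')).
lemmatizer = WordNetLemmatizer()

@F.udf(StringType())
def lemmatize_and_clean(text):
    # Steps 2-3: lemmatize each word and strip punctuation.
    words = [lemmatizer.lemmatize(w) for w in text.split()]
    return re.sub(r"[^\w\s]", "", " ".join(words))

df = raw_df.withColumn("clean_text", lemmatize_and_clean("text"))  # 'text' column is assumed

# Steps 5-6: tokenize and remove stop words with Spark ML feature transformers.
tokens = Tokenizer(inputCol="clean_text", outputCol="tokens").transform(df)
no_stop = StopWordsRemover(inputCol="tokens", outputCol="filtered").transform(tokens)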

How to use Databricks Job Spark Configuration spark_conf?

Submitted by 半世苍凉 on 2020-06-09 05:49:08
Question: I have some sample Spark code in which I am trying to read table names from the Spark configuration supplied through the spark_conf option, using a Typesafe application.conf together with the Spark conf set in the Databricks UI. The code I am using is below. When I hit the Run button in the Databricks UI, the job finishes successfully, but the println call prints dummyValue instead of ThisIsTableAOne, ThisIsTableBOne... I can see from the Spark UI that the configurations for the table names are being …
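The question itself concerns Scala code with a Typesafe application.conf, but the general pattern of reading values passed through a Databricks job's spark_conf is the same: they show up in the running session's Spark configuration. A minimal PySpark sketch under that assumption, with illustrative key names rather than the asker's actual keys:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Values supplied in the job's spark_conf (e.g. "spark.tableNameA": "ThisIsTableAOne")
# are visible on the session's runtime configuration; the second argument is only
# used as a fallback when the key was not set on the job or cluster.
table_a = spark.conf.get("spark.tableNameA", "dummyValue")
table_b = spark.conf.get("spark.tableNameB", "dummyValue")
print(table_a, table_b)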

remove null array field from dataframe while converting it to JSON

Submitted by ▼魔方 西西 on 2020-06-09 05:29:26
Question: Is there any method by which I can create JSON from a Spark dataframe while leaving out the fields that are null? Suppose I have a dataframe:

+-------+----------------+
|   name|       hit_songs|
+-------+----------------+
|beatles|[help, hey jude]|
|  romeo|      [eres mia]|
| juliet|            null|
+-------+----------------+

I want to convert it into JSON like:

[{ name: "beatles", hit_songs: [help, hey jude] },
 { name: "romeo", hit_songs: [eres mia] },
 { name: "juliet" }]

I don't want the field hit_songs in …
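A minimal sketch of one approach: DataFrame.toJSON renders each row as a JSON string and omits null fields by default, which is close to the output asked for (df and its columns are taken from the example above):

# toJSON leaves out fields that are null, so the 'juliet' row comes out
# as {"name":"juliet"} with no hit_songs key at all.
json_rows = df.toJSON().collect()
for row in json_rows:
    print(row)

# Writing to files behaves the same way (null fields are dropped by default):
# df.write.json("output/path")  # path is a placeholder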

Apache spark case with multiple when clauses on different columns

Submitted by 依然范特西╮ on 2020-06-08 05:59:07
Question: Given the structure below:

val df = Seq("Color", "Shape", "Range", "Size").map(Tuple1.apply).toDF("color")
val df1 = df.withColumn("Success", when($"color" <=> "white", "Diamond").otherwise(0))

I want to add one more WHEN condition to the above: where size > 10 and the Shape column value is Rhombus, the value "Diamond" should be inserted into the column, else 0. I tried the following, but it fails:

val df1 = df.withColumn("Success", when($"color" <=> "white", "Diamond").otherwise(0)).when($"size">10) …
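The failure comes from chaining .when after .otherwise: otherwise closes the CASE expression, so no further .when can be chained after it, and every branch has to come before it. The question's code is Scala; a minimal equivalent sketch in PySpark, with the Size/Shape column names taken from the prose:

from pyspark.sql import functions as F

# Chain all when() branches first; otherwise() must come last because it ends the CASE expression.
df1 = df.withColumn(
    "Success",
    F.when(F.col("color").eqNullSafe("white"), "Diamond")  # null-safe equality, like <=> in Scala
     .when((F.col("Size") > 10) & (F.col("Shape") == "Rhombus"), "Diamond")
     .otherwise(0),
)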

pySpark mapping multiple variables

Submitted by 天涯浪子 on 2020-06-05 11:39:15
Question: The code below maps values and column names of my reference df against my actual dataset, finding exact matches; when an exact match is found, it returns the OutputItemNameByValue. However, I'm trying to add the rule that when PrimaryLookupAttributeValue = DEFAULT, the OutputItemNameByValue should also be returned. The solution I'm trying out is to create a new dataframe with null values, since no match was produced by the code below, and then as a next step to target those null values …
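A minimal sketch of one way to express the "exact match, otherwise fall back to the DEFAULT row" rule with a second lookup and coalesce; the PrimaryLookupAttributeValue and OutputItemNameByValue column names come from the excerpt, while the dataframe names and the join key are placeholders:

from pyspark.sql import functions as F

# Exact-match lookup against the reference dataframe (join key is a placeholder).
exact = dataset_df.join(
    ref_df,
    dataset_df["AttributeValue"] == ref_df["PrimaryLookupAttributeValue"],
    "left",
).select(dataset_df["*"], ref_df["OutputItemNameByValue"].alias("exact_out"))

# Fallback lookup: the reference row(s) where PrimaryLookupAttributeValue = 'DEFAULT'.
default_out = (
    ref_df.filter(F.col("PrimaryLookupAttributeValue") == "DEFAULT")
          .select(F.col("OutputItemNameByValue").alias("default_out"))
)

# Keep the exact match when present, otherwise fall back to the DEFAULT value.
result = (
    exact.crossJoin(default_out)
         .withColumn("OutputItemNameByValue", F.coalesce(F.col("exact_out"), F.col("default_out")))
         .drop("exact_out", "default_out")
)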