apache-spark-sql

In Spark, how to do One Hot Encoding for top N frequent values only?

◇◆丶佛笑我妖孽 submitted on 2021-01-29 22:22:16
Question: In my dataframe df I have a column my_category containing different values, and I can view the value counts using:

    df.groupBy("my_category").count().show()

    value  count
    a        197
    b        166
    c        210
    d          5
    e          2
    f          9
    g          3

Now I'd like to apply One Hot Encoding (OHE) on this column, but only for the top N most frequent values (say N = 3), and put all the remaining infrequent values into a dummy column (say, "default"). E.g., the output should be something like:

    a  b  c  default
    0  0  1  0
    1  0  0  0
    0  1  0  0
    1  0  0  0
    ...
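
No answer text survives in this excerpt. A minimal PySpark sketch of one way to do it, assuming a dataframe df with a column my_category as in the question (every other name here is illustrative):

```python
from pyspark.sql import functions as F

N = 3

# 1. Find the N most frequent category values.
top_n = [
    r["my_category"]
    for r in df.groupBy("my_category").count()
               .orderBy(F.desc("count")).limit(N).collect()
]

# 2. Map every value outside the top N to "default".
out = df.withColumn(
    "bucketed",
    F.when(F.col("my_category").isin(top_n), F.col("my_category")).otherwise("default"),
)

# 3. One 0/1 indicator column per kept category, plus "default".
for cat in top_n + ["default"]:
    out = out.withColumn(cat, (F.col("bucketed") == cat).cast("int"))

out.select(*top_n, "default").show()
```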

Discard Bad record and load only good records to dataframe from json file in pyspark

不羁的心 submitted on 2021-01-29 21:35:52
Question: The API-generated JSON file looks like the one below. The format of the JSON file is not correct. Can we discard the bad records and load only the good rows into a dataframe using pyspark?

    {
      "name": "PowerAmplifier",
      "Component": "12uF Capacitor\n1/21Resistor\n3 Inductor In Henry\PowerAmplifier\n ",
      "url": "https://www.onsemi.com/products/amplifiers-comparators/",
      "image": "https://www.onsemi.com/products/amplifiers-comparators/",
      "ThresholdTime": "48min",
      "MFRDate": "2019-05-08",
      "FallTime":
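
No answer is preserved in this excerpt. As a hedged sketch, Spark's JSON reader has parse modes that can drop or quarantine records it cannot parse; the path below is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bad-json").getOrCreate()

# Option 1: silently discard malformed records.
good_df = (
    spark.read
         .option("mode", "DROPMALFORMED")
         .json("/path/to/api_output.json")
)

# Option 2: keep malformed records in a dedicated column for inspection,
# then filter them out.
raw_df = (
    spark.read
         .option("mode", "PERMISSIVE")
         .option("columnNameOfCorruptRecord", "_corrupt_record")
         .json("/path/to/api_output.json")
).cache()  # cache first: Spark disallows queries that touch only the corrupt-record column on the raw file

good_only = raw_df.filter(raw_df["_corrupt_record"].isNull()).drop("_corrupt_record")
```

Note that these modes operate per line by default (one JSON object per line); behaviour differs for a single multi-line document read with the multiLine option.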

Rowwise sum per group and add total as a new row in dataframe in Pyspark

自古美人都是妖i submitted on 2021-01-29 14:50:09
Question: I have a dataframe like this sample:

    df = spark.createDataFrame(
        [(2, "A", "A2", 2500), (2, "A", "A11", 3500), (2, "A", "A12", 5500),
         (4, "B", "B25", 7600), (4, "B", "B26", 5600),
         (5, "C", "c25", 2658), (5, "C", "c27", 1100), (5, "C", "c28", 1200)],
        ['parent', 'group', "brand", "usage"])

Output:

    +------+-----+-----+-----+
    |parent|group|brand|usage|
    +------+-----+-----+-----+
    |     2|    A|   A2| 2500|
    |     2|    A|  A11| 3500|
    |     4|    B|  B25| 7600|
    |     4|    B|  B26| 5600|
    |     5|    C|  c25| 2658|
    |     5|    C|
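
The answer is not included in this excerpt. A minimal sketch of the idea in the title (sum usage per group and append the totals as new rows), assuming the df above; the "total" label is an illustrative choice:

```python
from pyspark.sql import functions as F

totals = (
    df.groupBy("parent", "group")
      .agg(F.sum("usage").alias("usage"))
      .withColumn("brand", F.lit("total"))
      .select("parent", "group", "brand", "usage")  # match df's column order
)

result = df.unionByName(totals).orderBy("parent", "group", "brand")
result.show()
```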

How to parse the JSON data using Spark-Scala

偶尔善良 submitted on 2021-01-29 12:50:33
Question: I have a requirement to parse JSON data as shown in the expected results below; currently I can't work out how to include the signal names (ABS, ADA, ADW) in the Signal column. Any help would be much appreciated. I tried the following, which gives the results shown below, but I still need to include all the signal names in the SIGNAL column, as shown in the expected results.

    jsonDF.select(explode($"ABS") as "element")
          .withColumn("stime", col("element.E"))
          .withColumn("can_value", col("element.V"))
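
The question is Spark-Scala and its expected output is not preserved here. As a rough PySpark sketch of one common pattern for this shape of problem — unpivot the signal columns into (Signal, elements) pairs with stack, then explode — under the assumption that ABS, ADA and ADW are each arrays of structs with fields E and V:

```python
from pyspark.sql import functions as F

signals = ["ABS", "ADA", "ADW"]

# stack(3, 'ABS', ABS, 'ADA', ADA, 'ADW', ADW) turns the three signal columns
# into one row per signal, keeping the signal name alongside its array.
stack_expr = "stack({n}, {args}) as (Signal, elements)".format(
    n=len(signals),
    args=", ".join("'{0}', {0}".format(s) for s in signals),
)

result = (
    jsonDF.select(F.expr(stack_expr))
          .select("Signal", F.explode("elements").alias("element"))
          .withColumn("stime", F.col("element.E"))
          .withColumn("can_value", F.col("element.V"))
          .drop("element")
)
result.show()
```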

Best way to get null counts, min and max values of multiple (100+) columns from a pyspark dataframe

狂风中的少年 submitted on 2021-01-29 11:19:31
Question: Say I have a list of column names that all exist in the dataframe, Cols = ['A', 'B', 'C', 'D']. I'm looking for a quick way to get a table/dataframe like:

      NA_counts  min  max
    A         5    0  100
    B        10    0  120
    C         8    1   99
    D         2    0  500

TIA

Answer 1: You can calculate each metric separately and then union all, like this:

    nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
    max_cols = [max(col(c)).alias(c) for c in cols]
    min_cols = [min(col(c)).alias(c) for c in cols]
    nulls_df = df
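
The answer is cut off above at "nulls_df = df". A hedged sketch of how the same idea might continue (one labelled row per metric, then a union), assuming numeric columns so the three result rows have compatible types:

```python
from pyspark.sql import functions as F

cols = ['A', 'B', 'C', 'D']

nulls_cols = [F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c) for c in cols]
max_cols = [F.max(F.col(c)).alias(c) for c in cols]
min_cols = [F.min(F.col(c)).alias(c) for c in cols]

# One single-row dataframe per metric, tagged so the rows stay identifiable.
nulls_df = df.select(nulls_cols).withColumn("metric", F.lit("NA_counts"))
min_df = df.select(min_cols).withColumn("metric", F.lit("min"))
max_df = df.select(max_cols).withColumn("metric", F.lit("max"))

summary = nulls_df.unionByName(min_df).unionByName(max_df).select("metric", *cols)
summary.show()
```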

spark streaming kafka : Unknown error fetching data for topic-partition

自作多情 submitted on 2021-01-29 10:31:12
Question: I'm trying to read a Kafka topic from a Spark cluster using the Structured Streaming API with Spark's Kafka integration.

    val sparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("some-app")
      .getOrCreate()

Kafka stream creation:

    import sparkSession.implicits._
    val dataFrame = sparkSession
      .readStream
      .format("kafka")
      .option("subscribepattern", "preprod-*")
      .option("kafka.bootstrap.servers", "<brokerUrl>:9094")
      .option("kafka.ssl.protocol", "TLS")
      .option("kafka.security.protocol",
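
The Scala snippet above is cut off and no answer is preserved. For reference only, a PySpark sketch of an equivalent SSL-secured Kafka source (the security options shown are common ones, not taken from the post; all values are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("some-app").getOrCreate()

stream_df = (
    spark.readStream
         .format("kafka")
         .option("subscribePattern", "preprod-*")
         .option("kafka.bootstrap.servers", "<brokerUrl>:9094")
         .option("kafka.security.protocol", "SSL")
         # Placeholder truststore settings; the post's actual SSL config is not shown.
         .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")
         .option("kafka.ssl.truststore.password", "********")
         .load()
)
```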

spark.sql.files.maxPartitionBytes not limiting max size of written partitions

≯℡__Kan透↙ submitted on 2021-01-29 10:07:27
Question: I'm trying to copy parquet data from another S3 bucket to my S3 bucket, and I want to limit the size of each partition to a maximum of 128 MB. I thought spark.sql.files.maxPartitionBytes defaulted to 128 MB, but when I look at the partition files in S3 after my copy, I see individual partition files of around 226 MB instead. I was looking at this post, which suggested setting this Spark config key in order to limit the maximum size of my partitions: Limiting maximum size of dataframe
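
No answer is included in this excerpt. One point worth noting: spark.sql.files.maxPartitionBytes only controls how input files are split when reading; the size of written files depends on how many rows each write task receives. A hedged sketch of the usual ways to cap output file size (paths and the record count are placeholders to be tuned):

```python
df = spark.read.parquet("s3://source-bucket/data/")

# Option 1: cap the number of rows per output file; pick a count whose
# on-disk size lands near 128 MB for this dataset.
(df.write
   .option("maxRecordsPerFile", 1000000)
   .parquet("s3://dest-bucket/data/"))

# Option 2: repartition to a file count that yields roughly 128 MB per file.
df.repartition(64).write.parquet("s3://dest-bucket/data/")
```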

Difference between loading a csv file into RDD and Dataframe in spark

孤街浪徒 submitted on 2021-01-29 08:19:58
Question: I'm not sure whether this specific question has been asked before; it could be a duplicate, but I wasn't able to find a use case matching it. As we know, we can load a CSV file directly into a dataframe, or load it into an RDD and convert that RDD to a dataframe later.

    RDD = sc.textFile("pathlocation")

We can apply map, filter and other operations on this RDD and then convert it into a dataframe. We can also create a dataframe directly by reading a CSV file:

    Dataframe =
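
The question is cut off above. A minimal sketch of the two loading routes it contrasts, with illustrative header/parsing handling (the path is the question's own placeholder):

```python
# Route 1: load as an RDD of lines, parse manually, then convert to a DataFrame.
rdd = sc.textFile("pathlocation")
header = rdd.first()
rows = (rdd.filter(lambda line: line != header)   # drop the header line
           .map(lambda line: line.split(",")))    # naive CSV split
df_from_rdd = rows.toDF(header.split(","))

# Route 2: read the CSV directly as a DataFrame; parsing, header handling and
# schema inference are done by the DataFrame reader.
df_direct = (spark.read
                  .option("header", "true")
                  .option("inferSchema", "true")
                  .csv("pathlocation"))
```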

Save a dataframe view after groupBy using pyspark

谁都会走 submitted on 2021-01-29 08:11:27
Question: My homework is giving me a hard time with pyspark. I have this view of my "df2" after a groupBy:

    df2.groupBy('years').count().show()
    +-----+-----+
    |years|count|
    +-----+-----+
    | 2003|11904|
    | 2006| 3476|
    | 1997| 3979|
    | 2004|13362|
    | 1996| 3180|
    | 1998| 4969|
    | 1995| 1995|
    | 2001|11532|
    | 2005|11389|
    | 2000| 7462|
    | 1999| 6593|
    | 2002|11799|
    +-----+-----+

Every attempt to save this (and then load it with pandas) to a file gives back the original source data text file I read with pyspark
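
No answer is preserved here. A minimal sketch of the usual fix: .show() only prints, so keep the aggregated dataframe in a variable and write that, not the original df2. The output paths are placeholders:

```python
counts = df2.groupBy('years').count()

# Write the aggregated result as a single CSV file with a header.
(counts.coalesce(1)
       .write.mode("overwrite")
       .option("header", "true")
       .csv("/tmp/year_counts"))

# Or, since the result is tiny, hand it straight to pandas.
pdf = counts.toPandas()
pdf.to_csv("/tmp/year_counts.csv", index=False)
```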