apache-spark-sql

In Spark, how to do One Hot Encoding for top N frequent values only?

◇◆丶佛笑我妖孽 submitted on 2021-01-29 22:22:16
Question: In my dataframe df I have a column my_category containing different values, and I can view the value counts using:

    df.groupBy("my_category").count().show()

    value  count
    a        197
    b        166
    c        210
    d          5
    e          2
    f          9
    g          3

Now I'd like to apply One Hot Encoding (OHE) on this column, but only for the top N most frequent values (say N = 3), and put all the remaining infrequent values into a dummy column (say, "default"). E.g., the output should be something like:

    a  b  c  default
    0  0  1  0
    1  0  0  0
    0  1  0  0
    1  0  0  0
    ...
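
No answer text survives in this excerpt. A minimal PySpark sketch of one way to do it, assuming a dataframe df with a column my_category as in the question (every other name here is illustrative):

```python
from pyspark.sql import functions as F

N = 3

# 1. Find the N most frequent category values.
top_n = [
    r["my_category"]
    for r in df.groupBy("my_category").count()
               .orderBy(F.desc("count")).limit(N).collect()
]

# 2. Map every value outside the top N to "default".
out = df.withColumn(
    "bucketed",
    F.when(F.col("my_category").isin(top_n), F.col("my_category")).otherwise("default"),
)

# 3. One 0/1 indicator column per kept category, plus "default".
for cat in top_n + ["default"]:
    out = out.withColumn(cat, (F.col("bucketed") == cat).cast("int"))

out.select(*top_n, "default").show()
```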

Discard Bad record and load only good records to dataframe from json file in pyspark

不羁的心 submitted on 2021-01-29 21:35:52
Question: The API-generated JSON file looks like the one below. The format of the JSON file is not correct. Can we discard the bad records and load only the good rows into a dataframe using pyspark?

    {
      "name": "PowerAmplifier",
      "Component": "12uF Capacitor\n1/21Resistor\n3 Inductor In Henry\PowerAmplifier\n ",
      "url": "https://www.onsemi.com/products/amplifiers-comparators/",
      "image": "https://www.onsemi.com/products/amplifiers-comparators/",
      "ThresholdTime": "48min",
      "MFRDate": "2019-05-08",
      "FallTime":
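
No answer is preserved in this excerpt. As a hedged sketch, Spark's JSON reader has parse modes that can drop or quarantine records it cannot parse; the path below is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bad-json").getOrCreate()

# Option 1: silently discard malformed records.
good_df = (
    spark.read
         .option("mode", "DROPMALFORMED")
         .json("/path/to/api_output.json")
)

# Option 2: keep malformed records in a dedicated column for inspection,
# then filter them out.
raw_df = (
    spark.read
         .option("mode", "PERMISSIVE")
         .option("columnNameOfCorruptRecord", "_corrupt_record")
         .json("/path/to/api_output.json")
).cache()  # cache first: Spark disallows queries that touch only the corrupt-record column on the raw file

good_only = raw_df.filter(raw_df["_corrupt_record"].isNull()).drop("_corrupt_record")
```

Note that these modes operate per line by default (one JSON object per line); behaviour differs for a single multi-line document read with the multiLine option.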

Rowwise sum per group and add total as a new row in dataframe in Pyspark

自古美人都是妖i submitted on 2021-01-29 14:50:09
Question: I have a dataframe like this sample:

    df = spark.createDataFrame(
        [(2, "A", "A2", 2500), (2, "A", "A11", 3500), (2, "A", "A12", 5500),
         (4, "B", "B25", 7600), (4, "B", "B26", 5600),
         (5, "C", "c25", 2658), (5, "C", "c27", 1100), (5, "C", "c28", 1200)],
        ['parent', 'group', "brand", "usage"])

Output:

    +------+-----+-----+-----+
    |parent|group|brand|usage|
    +------+-----+-----+-----+
    |     2|    A|   A2| 2500|
    |     2|    A|  A11| 3500|
    |     4|    B|  B25| 7600|
    |     4|    B|  B26| 5600|
    |     5|    C|  c25| 2658|
    |     5|    C|
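
The answer is not included in this excerpt. A minimal sketch of the idea in the title (sum usage per group and append the totals as new rows), assuming the df above; the "total" label is an illustrative choice:

```python
from pyspark.sql import functions as F

totals = (
    df.groupBy("parent", "group")
      .agg(F.sum("usage").alias("usage"))
      .withColumn("brand", F.lit("total"))
      .select("parent", "group", "brand", "usage")  # match df's column order
)

result = df.unionByName(totals).orderBy("parent", "group", "brand")
result.show()
```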

How to parse the JSON data using Spark-Scala

偶尔善良 submitted on 2021-01-29 12:50:33
Question: I have a requirement to parse JSON data as shown in the expected results below; currently I can't work out how to include the signal names (ABS, ADA, ADW) in the Signal column. Any help would be much appreciated. I tried the following, which gives the results shown below, but I still need to include all the signal names in the SIGNAL column, as shown in the expected results.

    jsonDF.select(explode($"ABS") as "element")
          .withColumn("stime", col("element.E"))
          .withColumn("can_value", col("element.V"))
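
The question is Spark-Scala and its expected output is not preserved here. As a rough PySpark sketch of one common pattern for this shape of problem — unpivot the signal columns into (Signal, elements) pairs with stack, then explode — under the assumption that ABS, ADA and ADW are each arrays of structs with fields E and V:

```python
from pyspark.sql import functions as F

signals = ["ABS", "ADA", "ADW"]

# stack(3, 'ABS', ABS, 'ADA', ADA, 'ADW', ADW) turns the three signal columns
# into one row per signal, keeping the signal name alongside its array.
stack_expr = "stack({n}, {args}) as (Signal, elements)".format(
    n=len(signals),
    args=", ".join("'{0}', {0}".format(s) for s in signals),
)

result = (
    jsonDF.select(F.expr(stack_expr))
          .select("Signal", F.explode("elements").alias("element"))
          .withColumn("stime", F.col("element.E"))
          .withColumn("can_value", F.col("element.V"))
          .drop("element")
)
result.show()
```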

Best way to get null counts, min and max values of multiple (100+) columns from a pyspark dataframe

狂风中的少年 submitted on 2021-01-29 11:19:31
Question: Say I have a list of column names that all exist in the dataframe, Cols = ['A', 'B', 'C', 'D']. I'm looking for a quick way to get a table/dataframe like:

      NA_counts  min  max
    A         5    0  100
    B        10    0  120
    C         8    1   99
    D         2    0  500

TIA

Answer 1: You can calculate each metric separately and then union all, like this:

    nulls_cols = [sum(when(col(c).isNull(), lit(1)).otherwise(lit(0))).alias(c) for c in cols]
    max_cols = [max(col(c)).alias(c) for c in cols]
    min_cols = [min(col(c)).alias(c) for c in cols]
    nulls_df = df
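
The answer is cut off above at "nulls_df = df". A hedged sketch of how the same idea might continue (one labelled row per metric, then a union), assuming numeric columns so the three result rows have compatible types:

```python
from pyspark.sql import functions as F

cols = ['A', 'B', 'C', 'D']

nulls_cols = [F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c) for c in cols]
max_cols = [F.max(F.col(c)).alias(c) for c in cols]
min_cols = [F.min(F.col(c)).alias(c) for c in cols]

# One single-row dataframe per metric, tagged so the rows stay identifiable.
nulls_df = df.select(nulls_cols).withColumn("metric", F.lit("NA_counts"))
min_df = df.select(min_cols).withColumn("metric", F.lit("min"))
max_df = df.select(max_cols).withColumn("metric", F.lit("max"))

summary = nulls_df.unionByName(min_df).unionByName(max_df).select("metric", *cols)
summary.show()
```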

spark streaming kafka : Unknown error fetching data for topic-partition

自作多情 submitted on 2021-01-29 10:31:12
Question: I'm trying to read a Kafka topic from a Spark cluster using the Structured Streaming API with Spark's Kafka integration.

    val sparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("some-app")
      .getOrCreate()

Kafka stream creation:

    import sparkSession.implicits._
    val dataFrame = sparkSession
      .readStream
      .format("kafka")
      .option("subscribepattern", "preprod-*")
      .option("kafka.bootstrap.servers", "<brokerUrl>:9094")
      .option("kafka.ssl.protocol", "TLS")
      .option("kafka.security.protocol",
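
The Scala snippet above is cut off and no answer is preserved. For reference only, a PySpark sketch of an equivalent SSL-secured Kafka source (the security options shown are common ones, not taken from the post; all values are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("some-app").getOrCreate()

stream_df = (
    spark.readStream
         .format("kafka")
         .option("subscribePattern", "preprod-*")
         .option("kafka.bootstrap.servers", "<brokerUrl>:9094")
         .option("kafka.security.protocol", "SSL")
         # Placeholder truststore settings; the post's actual SSL config is not shown.
         .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")
         .option("kafka.ssl.truststore.password", "********")
         .load()
)
```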

spark.sql.files.maxPartitionBytes not limiting max size of written partitions

≯℡__Kan透↙ submitted on 2021-01-29 10:07:27
Question: I'm trying to copy parquet data from another S3 bucket to my S3 bucket, and I want to limit the size of each partition to a maximum of 128 MB. I thought spark.sql.files.maxPartitionBytes defaulted to 128 MB, but when I look at the partition files in S3 after my copy, I see individual partition files of around 226 MB instead. I was looking at this post, which suggested setting this Spark config key in order to limit the maximum size of my partitions: Limiting maximum size of dataframe
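
No answer is included in this excerpt. One point worth noting: spark.sql.files.maxPartitionBytes only controls how input files are split when reading; the size of written files depends on how many rows each write task receives. A hedged sketch of the usual ways to cap output file size (paths and the record count are placeholders to be tuned):

```python
df = spark.read.parquet("s3://source-bucket/data/")

# Option 1: cap the number of rows per output file; pick a count whose
# on-disk size lands near 128 MB for this dataset.
(df.write
   .option("maxRecordsPerFile", 1000000)
   .parquet("s3://dest-bucket/data/"))

# Option 2: repartition to a file count that yields roughly 128 MB per file.
df.repartition(64).write.parquet("s3://dest-bucket/data/")
```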

Difference between loading a csv file into RDD and Dataframe in spark

孤街浪徒 submitted on 2021-01-29 08:19:58
Question: I'm not sure whether this specific question has been asked before; it could be a duplicate, but I wasn't able to find a use case matching it. As we know, we can load a CSV file directly into a dataframe, or load it into an RDD and convert that RDD to a dataframe later.

    RDD = sc.textFile("pathlocation")

We can apply map, filter and other operations on this RDD and then convert it into a dataframe. We can also create a dataframe directly by reading a CSV file:

    Dataframe =
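
The question is cut off above. A minimal sketch of the two loading routes it contrasts, with illustrative header/parsing handling (the path is the question's own placeholder):

```python
# Route 1: load as an RDD of lines, parse manually, then convert to a DataFrame.
rdd = sc.textFile("pathlocation")
header = rdd.first()
rows = (rdd.filter(lambda line: line != header)   # drop the header line
           .map(lambda line: line.split(",")))    # naive CSV split
df_from_rdd = rows.toDF(header.split(","))

# Route 2: read the CSV directly as a DataFrame; parsing, header handling and
# schema inference are done by the DataFrame reader.
df_direct = (spark.read
                  .option("header", "true")
                  .option("inferSchema", "true")
                  .csv("pathlocation"))
```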

Save a dataframe view after groupBy using pyspark

谁都会走 submitted on 2021-01-29 08:11:27
Question: My homework is giving me a hard time with pyspark. I have this view of my "df2" after a groupBy:

    df2.groupBy('years').count().show()
    +-----+-----+
    |years|count|
    +-----+-----+
    | 2003|11904|
    | 2006| 3476|
    | 1997| 3979|
    | 2004|13362|
    | 1996| 3180|
    | 1998| 4969|
    | 1995| 1995|
    | 2001|11532|
    | 2005|11389|
    | 2000| 7462|
    | 1999| 6593|
    | 2002|11799|
    +-----+-----+

Every attempt to save this (and then load it with pandas) to a file gives back the original source data text file I read with pyspark
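
No answer is preserved here. A minimal sketch of the usual fix: .show() only prints, so keep the aggregated dataframe in a variable and write that, not the original df2. The output paths are placeholders:

```python
counts = df2.groupBy('years').count()

# Write the aggregated result as a single CSV file with a header.
(counts.coalesce(1)
       .write.mode("overwrite")
       .option("header", "true")
       .csv("/tmp/year_counts"))

# Or, since the result is tiny, hand it straight to pandas.
pdf = counts.toPandas()
pdf.to_csv("/tmp/year_counts.csv", index=False)
```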