apache-spark

spark streaming kafka : Unknown error fetching data for topic-partition

自作多情 submitted on 2021-01-29 10:31:12

Question: I'm trying to read a Kafka topic from a Spark cluster using the Structured Streaming API with the Kafka integration in Spark.

    val sparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("some-app")
      .getOrCreate()

Kafka stream creation:

    import sparkSession.implicits._
    val dataFrame = sparkSession
      .readStream
      .format("kafka")
      .option("subscribepattern", "preprod-*")
      .option("kafka.bootstrap.servers", "<brokerUrl>:9094")
      .option("kafka.ssl.protocol", "TLS")
      .option("kafka.security.protocol",
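For reference, a minimal sketch of an SSL-secured Kafka source, assuming a truststore is available (broker URL, paths and passwords are placeholders; note that subscribePattern takes a Java regex):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kafka-ssl-sketch")
      .getOrCreate()

    // Options prefixed with "kafka." are passed straight to the underlying Kafka consumer.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<brokerUrl>:9094")
      .option("subscribePattern", "preprod-.*")                            // a regex, not a glob
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")  // placeholder path
      .option("kafka.ssl.truststore.password", "changeit")                 // placeholder password
      .option("startingOffsets", "latest")
      .option("failOnDataLoss", "false")
      .load()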

How to select the N highest values for each category in spark scala

感情迁移 submitted on 2021-01-29 10:18:56

Question: Say I have this dataset:

    val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
      ("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10))
      .toDF("teams","homeruns","hits")

which looks like this: I want to pivot on the teams column, and for all the other columns return the 2 (or N) highest values for that column. So for yankees-mets and homeruns it would return this, since the 2 highest homerun totals for them were 8 and 6. How would I do this in
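One common way to get the top N rows per group (a sketch, not necessarily the accepted answer) is a window function ranked within each team:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Rank rows within each team by homeruns, then keep the top N (here N = 2).
    val w = Window.partitionBy("teams").orderBy(col("homeruns").desc)

    val topN = main_df
      .withColumn("rank", row_number().over(w))
      .filter(col("rank") <= 2)
      .drop("rank")

    topN.show()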

spark.sql.files.maxPartitionBytes not limiting max size of written partitions

≯℡__Kan透↙ submitted on 2021-01-29 10:07:27

Question: I'm trying to copy parquet data from another S3 bucket to my S3 bucket. I want to limit the size of each partition to a maximum of 128 MB. I thought spark.sql.files.maxPartitionBytes defaulted to 128 MB, but when I look at the partition files in S3 after my copy I see individual partition files of around 226 MB instead. I was looking at this post, which suggested that I set this Spark config key in order to limit the max size of my partitions: Limiting maximum size of dataframe
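For context (my reading of the situation, not the accepted answer): spark.sql.files.maxPartitionBytes only controls how input files are split into read partitions; the size of the written files depends on how the data is partitioned at write time. A sketch of two ways to cap output file size (paths and numbers are placeholders):

    // Option 1: cap the number of records per output file; tune the count so that
    // record count * average record size stays under ~128 MB for this schema.
    df.write
      .option("maxRecordsPerFile", 1000000L)
      .parquet("s3a://my-bucket/output/")

    // Option 2: repartition before writing so each task writes a smaller file,
    // choosing the partition count as roughly total data size / 128 MB.
    df.repartition(200)
      .write
      .parquet("s3a://my-bucket/output/")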

TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

自闭症网瘾萝莉.ら submitted on 2021-01-29 09:48:01

Question: I use pyspark streaming to read Kafka data, but it goes wrong:

    import os
    from pyspark.streaming.kafka import KafkaUtils
    from pyspark.streaming import StreamingContext
    from pyspark import SparkContext

    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'

    sc = SparkContext(appName="test")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 60)
    kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
    kafkaStream
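A likely cause (an assumption based on the snippet, not the accepted answer) is that the Maven coordinate lacks the Scala version suffix, so the connector jar never reaches the classpath; the coordinate for this connector is normally of the form org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2. For reference, a sketch of the same receiver-based stream on the Scala side, with the suffixed artifact on the classpath:

    // Scala sketch, assuming spark-streaming-kafka-0-8_2.11 (matching the Spark/Scala
    // version) is provided via --packages or a build dependency.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("test").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Receiver-based stream: ZooKeeper quorum, consumer group id, topic -> thread count.
    val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", Map("test" -> 2))
    kafkaStream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()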

Scala — Conditional replace column value of a data frame

蹲街弑〆低调 submitted on 2021-01-29 08:43:08

Question: DataFrame 1 is what I have now, and I want to write a Scala function to make DataFrame 1 look like DataFrame 2. Transfer is the big category; e-Transfer and IMT are subcategories. The logic is: for the same ID (31898), if both Transfer and e-Transfer are tagged to it, the result should be only e-Transfer; if Transfer, IMT and e-Transfer are all tagged to the same ID (32614), it should be e-Transfer + IMT; if only Transfer is tagged to an ID (33987), it should be Other; if only e-Transfer or IMT is tagged to a
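One way to express this kind of rule (a sketch under the assumption that the frame has ID and category columns; the exact names are my reading of the description) is to collect the distinct categories per ID and map the resulting set with when/otherwise:

    import org.apache.spark.sql.functions._

    // Collect the distinct tags per ID, then translate the set into a single label.
    val collected = df1
      .groupBy("ID")
      .agg(collect_set("category").as("cats"))

    val labelled = collected
      .withColumn("category",
        when(array_contains(col("cats"), "e-Transfer") && array_contains(col("cats"), "IMT"),
          "e-Transfer + IMT")
          .when(array_contains(col("cats"), "e-Transfer"), "e-Transfer")
          .when(array_contains(col("cats"), "IMT"), "IMT")
          .otherwise("Other"))   // only Transfer left => Other
      .drop("cats")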

How to filter in Spark using two “conditions”?

北城以北 submitted on 2021-01-29 08:25:50

Question: I'm doing an exercise where I'm required to filter the number of crimes per year based on a file that has more than 13 million lines (in case that's important info). For that, I did this and it's working fine:

    JavaRDD<String> anoRDD = arquivo.map(s -> {
        String[] campos = s.split(";");
        return campos[2];
    });
    System.out.println(anoRDD.countByValue());

But the next question to be answered is "How many 'NARCOTIC' crimes happen per YEAR?". I managed to filter the total amount, but not per
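A sketch of the same idea in Scala (the Java RDD API is analogous): keep only the NARCOTICS rows first, then count by year. The field indexes, 2 for the year and 5 for the crime type, are assumptions about the file layout.

    // arquivo is assumed to be an RDD[String] with ';'-separated fields.
    val narcoticsPerYear = arquivo
      .map(_.split(";"))
      .filter(campos => campos.length > 5 && campos(5).contains("NARCOTIC"))
      .map(campos => campos(2))          // keep only the year field
      .countByValue()                    // Map[year, number of NARCOTICS crimes]

    narcoticsPerYear.foreach(println)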

Spark Scala S3 storage: permission denied

|▌冷眼眸甩不掉的悲伤 submitted on 2021-01-29 08:12:33

Question: I've read a lot of topics on the Internet about how to get Spark working with S3, and still nothing works properly. I've downloaded Spark 2.3.2 with Hadoop 2.7 and above. I've copied only some libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) to the Spark jars folder:

    hadoop-aws-2.7.7.jar
    hadoop-auth-2.7.7.jar
    aws-java-sdk-1.7.4.jar

Still I can't use either S3N or S3A to get my file read by Spark. For S3A I have this exception:

    sc.hadoopConfiguration.set("fs.s3a.access.key",
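For reference, a minimal S3A configuration sketch, assuming the hadoop-aws and aws-java-sdk jars on the classpath match the Hadoop version (keys, region and bucket are placeholders):

    // Set the S3A credentials and filesystem implementation on the Hadoop configuration.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com") // match your region
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    // Read a file through the s3a:// scheme.
    val lines = sc.textFile("s3a://my-bucket/my-file.txt")
    println(lines.count())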

Save a dataframe view after groupBy using pyspark

谁都会走 submitted on 2021-01-29 08:11:27

Question: My homework is giving me a hard time with pyspark. I have this view of my "df2" after a groupBy:

    df2.groupBy('years').count().show()
    +-----+-----+
    |years|count|
    +-----+-----+
    | 2003|11904|
    | 2006| 3476|
    | 1997| 3979|
    | 2004|13362|
    | 1996| 3180|
    | 1998| 4969|
    | 1995| 1995|
    | 2001|11532|
    | 2005|11389|
    | 2000| 7462|
    | 1999| 6593|
    | 2002|11799|
    +-----+-----+

Every attempt to save this (and then load it with pandas) to a file gives back the original source data in the text file form I read with pyspark
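A sketch in Scala of one way this is usually handled (the PySpark call chain uses the same method names): write the aggregated frame itself, coalesced to a single CSV file that pandas can load directly. The output path is a placeholder.

    // Materialise the aggregation, then write it out as one small CSV with a header.
    val counts = df2.groupBy("years").count()

    counts
      .coalesce(1)                       // single output file for easy loading in pandas
      .write
      .option("header", "true")
      .mode("overwrite")
      .csv("/tmp/years_counts")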

Pyspark explode json string

拜拜、爱过 submitted on 2021-01-29 08:04:04

Question: Input_dataframe:

    id   name   collection
    111  aaaaa  {"1":{"city":"city_1","state":"state_1","country":"country_1"},
                "2":{"city":"city_2","state":"state_2","country":"country_2"},
                "3":{"city":"city_3","state":"state_3","country":"country_3"}}
    222  bbbbb  {"1":{"city":"city_1","state":"state_1","country":"country_1"},
                "2":{"city":"city_2","state":"state_2","country":"country_2"},
                "3":{"city":"city_3","state":"state_3","country":"country_3"}}

here

    id ==> string
    name ==> string
    collection ==> string
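A sketch in Scala of the usual approach (the PySpark functions have the same names): parse the collection string with from_json as a map of structs, then explode one row per map entry. inputDf stands in for the input dataframe.

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    // Schema of each value in the JSON map.
    val valueSchema = new StructType()
      .add("city", StringType)
      .add("state", StringType)
      .add("country", StringType)

    val exploded = inputDf
      .withColumn("collection", from_json(col("collection"), MapType(StringType, valueSchema)))
      .select(col("id"), col("name"), explode(col("collection")))   // adds key/value columns
      .select("id", "name", "key", "value.city", "value.state", "value.country")

    exploded.show(false)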

How to store nested custom objects in Spark Dataset?

a 夏天 submitted on 2021-01-29 07:48:20

Question: The question is a follow-up of How to store custom objects in Dataset? Spark version: 3.0.1. A non-nested custom type is achievable:

    import spark.implicits._
    import org.apache.spark.sql.{Encoder, Encoders}

    class AnObj(val a: Int, val b: String)

    implicit val myEncoder: Encoder[AnObj] = Encoders.kryo[AnObj]
    val d = spark.createDataset(Seq(new AnObj(1, "a")))

    d.printSchema
    root
     |-- value: binary (nullable = true)

However, if the custom type is nested inside a product type (i.e. a case class), it
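One workaround worth sketching here (an assumption on my part, not necessarily the accepted answer): supply a Kryo encoder for the whole wrapper case class rather than relying on the derived product encoder, which cannot handle the non-product field.

    import org.apache.spark.sql.{Encoder, Encoders}

    class AnObj(val a: Int, val b: String)
    case class Wrapper(name: String, obj: AnObj)

    // The derived product encoder for Wrapper fails on the AnObj field, so encode
    // Wrapper itself with Kryo (the schema collapses to a single binary column).
    implicit val wrapperEncoder: Encoder[Wrapper] = Encoders.kryo[Wrapper]

    val ds = spark.createDataset(Seq(Wrapper("x", new AnObj(1, "a"))))
    ds.printSchema()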