apache-spark

spark streaming kafka : Unknown error fetching data for topic-partition

自作多情 submitted on 2021-01-29 10:31:12

Question: I'm trying to read a Kafka topic from a Spark cluster using the Structured Streaming API with the Kafka integration in Spark.

    val sparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("some-app")
      .getOrCreate()

Kafka stream creation:

    import sparkSession.implicits._
    val dataFrame = sparkSession
      .readStream
      .format("kafka")
      .option("subscribepattern", "preprod-*")
      .option("kafka.bootstrap.servers", "<brokerUrl>:9094")
      .option("kafka.ssl.protocol", "TLS")
      .option("kafka.security.protocol",
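For reference, a minimal sketch of an SSL-secured Kafka source, assuming a truststore is available (broker URL, paths and passwords are placeholders; note that subscribePattern takes a Java regex):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kafka-ssl-sketch")
      .getOrCreate()

    // Options prefixed with "kafka." are passed straight to the underlying Kafka consumer.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<brokerUrl>:9094")
      .option("subscribePattern", "preprod-.*")                            // a regex, not a glob
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")  // placeholder path
      .option("kafka.ssl.truststore.password", "changeit")                 // placeholder password
      .option("startingOffsets", "latest")
      .option("failOnDataLoss", "false")
      .load()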

How to select the N highest values for each category in spark scala

感情迁移 submitted on 2021-01-29 10:18:56

Question: Say I have this dataset:

    val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
      ("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10))
      .toDF("teams","homeruns","hits")

which looks like this: I want to pivot on the teams column, and for all the other columns return the 2 (or N) highest values for that column. So for yankees-mets and homeruns it would return this, since the 2 highest homerun totals for them were 8 and 6. How would I do this in
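One common way to get the top N rows per group (a sketch, not necessarily the accepted answer) is a window function ranked within each team:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Rank rows within each team by homeruns, then keep the top N (here N = 2).
    val w = Window.partitionBy("teams").orderBy(col("homeruns").desc)

    val topN = main_df
      .withColumn("rank", row_number().over(w))
      .filter(col("rank") <= 2)
      .drop("rank")

    topN.show()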

spark.sql.files.maxPartitionBytes not limiting max size of written partitions

≯℡__Kan透↙ submitted on 2021-01-29 10:07:27

Question: I'm trying to copy parquet data from another S3 bucket to my S3 bucket. I want to limit the size of each partition to a maximum of 128 MB. I thought spark.sql.files.maxPartitionBytes defaulted to 128 MB, but when I look at the partition files in S3 after my copy I see individual partition files of around 226 MB instead. I was looking at this post, which suggested that I set this Spark config key in order to limit the max size of my partitions: Limiting maximum size of dataframe
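For context (my reading of the situation, not the accepted answer): spark.sql.files.maxPartitionBytes only controls how input files are split into read partitions; the size of the written files depends on how the data is partitioned at write time. A sketch of two ways to cap output file size (paths and numbers are placeholders):

    // Option 1: cap the number of records per output file; tune the count so that
    // record count * average record size stays under ~128 MB for this schema.
    df.write
      .option("maxRecordsPerFile", 1000000L)
      .parquet("s3a://my-bucket/output/")

    // Option 2: repartition before writing so each task writes a smaller file,
    // choosing the partition count as roughly total data size / 128 MB.
    df.repartition(200)
      .write
      .parquet("s3a://my-bucket/output/")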

TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

自闭症网瘾萝莉.ら submitted on 2021-01-29 09:48:01

Question: I use pyspark streaming to read Kafka data, but it goes wrong:

    import os
    from pyspark.streaming.kafka import KafkaUtils
    from pyspark.streaming import StreamingContext
    from pyspark import SparkContext

    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'

    sc = SparkContext(appName="test")
    sc.setLogLevel("WARN")
    ssc = StreamingContext(sc, 60)
    kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
    kafkaStream
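A likely cause (an assumption based on the snippet, not the accepted answer) is that the Maven coordinate lacks the Scala version suffix, so the connector jar never reaches the classpath; the coordinate for this connector is normally of the form org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2. For reference, a sketch of the same receiver-based stream on the Scala side, with the suffixed artifact on the classpath:

    // Scala sketch, assuming spark-streaming-kafka-0-8_2.11 (matching the Spark/Scala
    // version) is provided via --packages or a build dependency.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("test").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Receiver-based stream: ZooKeeper quorum, consumer group id, topic -> thread count.
    val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", Map("test" -> 2))
    kafkaStream.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()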

Scala — Conditional replace column value of a data frame

蹲街弑〆低调 submitted on 2021-01-29 08:43:08

Question: DataFrame 1 is what I have now, and I want to write a Scala function to make DataFrame 1 look like DataFrame 2. Transfer is the big category; e-Transfer and IMT are subcategories. The logic is: for the same ID (31898), if both Transfer and e-Transfer are tagged to it, the result should be only e-Transfer; if Transfer, IMT and e-Transfer are all tagged to the same ID (32614), it should be e-Transfer + IMT; if only Transfer is tagged to an ID (33987), it should be Other; if only e-Transfer or IMT is tagged to a
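One way to express this kind of rule (a sketch under the assumption that the frame has ID and category columns; the exact names are my reading of the description) is to collect the distinct categories per ID and map the resulting set with when/otherwise:

    import org.apache.spark.sql.functions._

    // Collect the distinct tags per ID, then translate the set into a single label.
    val collected = df1
      .groupBy("ID")
      .agg(collect_set("category").as("cats"))

    val labelled = collected
      .withColumn("category",
        when(array_contains(col("cats"), "e-Transfer") && array_contains(col("cats"), "IMT"),
          "e-Transfer + IMT")
          .when(array_contains(col("cats"), "e-Transfer"), "e-Transfer")
          .when(array_contains(col("cats"), "IMT"), "IMT")
          .otherwise("Other"))   // only Transfer left => Other
      .drop("cats")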

How to filter in Spark using two “conditions”?

北城以北 submitted on 2021-01-29 08:25:50

Question: I'm doing an exercise where I'm required to filter the number of crimes per year based on a file that has more than 13 million lines (in case that's important info). For that, I did this and it's working fine:

    JavaRDD<String> anoRDD = arquivo.map(s -> {
        String[] campos = s.split(";");
        return campos[2];
    });
    System.out.println(anoRDD.countByValue());

But the next question to be answered is "How many 'NARCOTIC' crimes happen per YEAR?". I managed to filter the total amount, but not per
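A sketch of the same idea in Scala (the Java RDD API is analogous): keep only the NARCOTICS rows first, then count by year. The field indexes, 2 for the year and 5 for the crime type, are assumptions about the file layout.

    // arquivo is assumed to be an RDD[String] with ';'-separated fields.
    val narcoticsPerYear = arquivo
      .map(_.split(";"))
      .filter(campos => campos.length > 5 && campos(5).contains("NARCOTIC"))
      .map(campos => campos(2))          // keep only the year field
      .countByValue()                    // Map[year, number of NARCOTICS crimes]

    narcoticsPerYear.foreach(println)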

Spark Scala S3 storage: permission denied

|▌冷眼眸甩不掉的悲伤 submitted on 2021-01-29 08:12:33

Question: I've read a lot of topics on the Internet about how to get Spark working with S3, and still nothing works properly. I've downloaded Spark 2.3.2 with Hadoop 2.7 and above. I've copied only some libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) to the Spark jars folder:

    hadoop-aws-2.7.7.jar
    hadoop-auth-2.7.7.jar
    aws-java-sdk-1.7.4.jar

Still I can't use either S3N or S3A to get my file read by Spark. For S3A I have this exception:

    sc.hadoopConfiguration.set("fs.s3a.access.key",
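For reference, a minimal S3A configuration sketch, assuming the hadoop-aws and aws-java-sdk jars on the classpath match the Hadoop version (keys, region and bucket are placeholders):

    // Set the S3A credentials and filesystem implementation on the Hadoop configuration.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com") // match your region
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    // Read a file through the s3a:// scheme.
    val lines = sc.textFile("s3a://my-bucket/my-file.txt")
    println(lines.count())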

Save a dataframe view after groupBy using pyspark

谁都会走 submitted on 2021-01-29 08:11:27

Question: My homework is giving me a hard time with pyspark. I have this view of my "df2" after a groupBy:

    df2.groupBy('years').count().show()
    +-----+-----+
    |years|count|
    +-----+-----+
    | 2003|11904|
    | 2006| 3476|
    | 1997| 3979|
    | 2004|13362|
    | 1996| 3180|
    | 1998| 4969|
    | 1995| 1995|
    | 2001|11532|
    | 2005|11389|
    | 2000| 7462|
    | 1999| 6593|
    | 2002|11799|
    +-----+-----+

Every attempt to save this (and then load it with pandas) to a file gives back the original source data in the text file form I read with pyspark
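A sketch in Scala of one way this is usually handled (the PySpark call chain uses the same method names): write the aggregated frame itself, coalesced to a single CSV file that pandas can load directly. The output path is a placeholder.

    // Materialise the aggregation, then write it out as one small CSV with a header.
    val counts = df2.groupBy("years").count()

    counts
      .coalesce(1)                       // single output file for easy loading in pandas
      .write
      .option("header", "true")
      .mode("overwrite")
      .csv("/tmp/years_counts")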

Pyspark explode json string

拜拜、爱过 submitted on 2021-01-29 08:04:04

Question: Input_dataframe:

    id   name   collection
    111  aaaaa  {"1":{"city":"city_1","state":"state_1","country":"country_1"},
                "2":{"city":"city_2","state":"state_2","country":"country_2"},
                "3":{"city":"city_3","state":"state_3","country":"country_3"}}
    222  bbbbb  {"1":{"city":"city_1","state":"state_1","country":"country_1"},
                "2":{"city":"city_2","state":"state_2","country":"country_2"},
                "3":{"city":"city_3","state":"state_3","country":"country_3"}}

here

    id ==> string
    name ==> string
    collection ==> string
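A sketch in Scala of the usual approach (the PySpark functions have the same names): parse the collection string with from_json as a map of structs, then explode one row per map entry. inputDf stands in for the input dataframe.

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    // Schema of each value in the JSON map.
    val valueSchema = new StructType()
      .add("city", StringType)
      .add("state", StringType)
      .add("country", StringType)

    val exploded = inputDf
      .withColumn("collection", from_json(col("collection"), MapType(StringType, valueSchema)))
      .select(col("id"), col("name"), explode(col("collection")))   // adds key/value columns
      .select("id", "name", "key", "value.city", "value.state", "value.country")

    exploded.show(false)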

How to store nested custom objects in Spark Dataset?

a 夏天 submitted on 2021-01-29 07:48:20

Question: The question is a follow-up of How to store custom objects in Dataset? Spark version: 3.0.1. A non-nested custom type is achievable:

    import spark.implicits._
    import org.apache.spark.sql.{Encoder, Encoders}

    class AnObj(val a: Int, val b: String)

    implicit val myEncoder: Encoder[AnObj] = Encoders.kryo[AnObj]
    val d = spark.createDataset(Seq(new AnObj(1, "a")))

    d.printSchema
    root
     |-- value: binary (nullable = true)

However, if the custom type is nested inside a product type (i.e. a case class), it
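One workaround worth sketching here (an assumption on my part, not necessarily the accepted answer): supply a Kryo encoder for the whole wrapper case class rather than relying on the derived product encoder, which cannot handle the non-product field.

    import org.apache.spark.sql.{Encoder, Encoders}

    class AnObj(val a: Int, val b: String)
    case class Wrapper(name: String, obj: AnObj)

    // The derived product encoder for Wrapper fails on the AnObj field, so encode
    // Wrapper itself with Kryo (the schema collapses to a single binary column).
    implicit val wrapperEncoder: Encoder[Wrapper] = Encoders.kryo[Wrapper]

    val ds = spark.createDataset(Seq(Wrapper("x", new AnObj(1, "a"))))
    ds.printSchema()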