apache-spark

Spark writing to Cassandra with varying TTL

江枫思渺然 submitted on 2021-02-10 18:04:59
Question: In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to. I want to write the dataframe to a Cassandra DB. The data must be written to the DB with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured. Currently I am writing to Cassandra with Spark using a …
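One way to get a per-row TTL with the spark-cassandra-connector is to compute the TTL as an extra column using the formula above and write through the RDD API with TTLOption.perRow, which takes each row's TTL (in seconds) from a named field instead of applying one constant to the whole write. Below is a rough Scala sketch (the question is in Java, but the connector exposes the same WriteConf/TTLOption classes there); the case class, keyspace, table, and all column names other than bucket_timestamp are made up for illustration, and df stands for the question's dataframe:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TTLOption, WriteConf}
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical row layout; ttlSeconds carries the per-row TTL in seconds
// and is consumed by the writer rather than stored as a table column.
case class BucketRow(bucket_id: String, bucket_timestamp: Long, payload: String, ttlSeconds: Int)

val constTtl = 86400L // CONST_TTL, in seconds

// ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp)
val withTtl = df.withColumn(
  "ttlSeconds",
  (lit(constTtl) - (unix_timestamp(current_timestamp()) - col("bucket_timestamp"))).cast("int"))

// TTLOption.perRow("ttlSeconds") tells the connector to read each row's TTL
// from that field of the saved objects.
withTtl.as[BucketRow].rdd.saveToCassandra(
  "my_keyspace", "my_table",
  writeConf = WriteConf(ttl = TTLOption.perRow("ttlSeconds")))
```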

Convert a JavaRDD<Tuple2<Object, long[]>> into a Spark Dataset<Row> in Java

跟風遠走 submitted on 2021-02-10 16:19:55
Question: In Java (not Scala!) Spark 3.0.1, I have a JavaRDD instance object neighborIdsRDD whose type is JavaRDD<Tuple2<Object, long[]>> . Part of my code related to the generation of the JavaRDD is the following: GraphOps<String, String> graphOps = new GraphOps<>(graph, stringTag, stringTag); JavaRDD<Tuple2<Object, long[]>> neighborIdsRDD = graphOps.collectNeighborIds(EdgeDirection.Either()).toJavaRDD(); I have had to get a JavaRDD using toJavaRDD() because collectNeighborIds returns a org.apache …
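A common way to do this conversion is to map each tuple to a generic Row and then build the Dataset from the Row RDD plus an explicit schema; in Java the same two steps are JavaRDD.map(...) with RowFactory.create(...) followed by SparkSession.createDataFrame(JavaRDD<Row>, StructType). A Scala sketch of the idea, with assumed field names (vertexId, neighborIds) and spark standing for the existing SparkSession:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Assumed column names; adjust to the real vertex-id / neighbour columns.
val schema = StructType(Seq(
  StructField("vertexId", LongType, nullable = false),
  StructField("neighborIds", ArrayType(LongType), nullable = true)))

// Map each Tuple2(vertexId, long[]) to a generic Row.
val rowRdd = neighborIdsRDD.rdd.map { case (vertexId, neighbors) =>
  Row(vertexId.asInstanceOf[Long], neighbors.toSeq)
}

// Build the Dataset<Row> from the Row RDD and the explicit schema.
val neighborsDs = spark.createDataFrame(rowRdd, schema)
neighborsDs.printSchema()
```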

Logger is not working inside spark UDF on cluster

陌路散爱 submitted on 2021-02-10 15:54:51
Question: I have placed log.info statements inside my UDF, but it is not working on the cluster; locally it works fine. Here is the snippet: def relType = udf((colValue: String, relTypeV: String) => { var relValue = "NA" val relType = relTypeV.split(",").toList val relTypeMap = relType.map { col => val split = col.split(":") (split(0), split(1)) }.toMap // val keySet = relTypeMap relTypeMap.foreach { x => if ((x._1 != null || colValue != null || x._1.trim() != "" || colValue.trim() != "") && colValue …
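A likely reason the messages seem to disappear on a cluster is that the UDF body runs on the executors, so anything it logs goes to the executor logs (YARN container stderr / the Executors tab in the Spark UI) rather than the driver console, and a logger created on the driver does not serialize cleanly into the closure. One common pattern, sketched below with illustrative names, is to keep the logger in an object as a @transient lazy val so each executor creates its own instance:

```scala
import org.apache.log4j.LogManager
import org.apache.spark.sql.functions.udf

// Holder object so the logger is created lazily on each executor instead of
// being serialized from the driver into the UDF closure.
object UdfLogging extends Serializable {
  @transient lazy val log = LogManager.getLogger("relTypeUdf")
}

val relType = udf { (colValue: String, relTypeV: String) =>
  // These messages show up in the executor logs, not on the driver console.
  UdfLogging.log.info(s"relType called with colValue=$colValue relTypeV=$relTypeV")
  // ... original mapping logic goes here ...
  "NA"
}
```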

How can missing columns be added as null while read a nested JSON using pyspark and a predefined struct schema

梦想与她 submitted on 2021-02-10 15:49:45
Question: Python = 3.6, Spark = 2.4. My sample JSON data: {"data":{"header":"someheader","body":{"name":"somename","value":"somevalue","books":[{"name":"somename"},{"value":"somevalue"},{"author":"someauthor"}]}}}, {"data":{"header":"someheader1","body":{"name":"somename1","value":"somevalue1","books":[{"name":"somename1"},{"value":"somevalue1"},{"author":"someauthor1"}]}}},.... My struct schema: Schema = StructType([StructField('header',StringType(),True),StructField('body',StructType([StructField('name1' …
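For reference, when the reader is given the full schema up front, any field that a particular JSON record does not contain simply comes back as null, which is the behaviour being asked for. A minimal Scala sketch of that idea follows (the question uses PySpark, where pyspark.sql.types has the same StructType/StructField/ArrayType classes); the field names follow the sample data and the input path is hypothetical:

```scala
import org.apache.spark.sql.types._

// Schema for the entries of the "books" array.
val bookSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("value", StringType, nullable = true),
  StructField("author", StringType, nullable = true)))

// Schema for the nested "body" struct.
val bodySchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("value", StringType, nullable = true),
  StructField("books", ArrayType(bookSchema), nullable = true)))

val schema = StructType(Seq(
  StructField("data", StructType(Seq(
    StructField("header", StringType, nullable = true),
    StructField("body", bodySchema, nullable = true))), nullable = true)))

// Records that lack any of the declared fields get null for those fields.
val df = spark.read.schema(schema).json("/path/to/input.json")
df.select("data.header", "data.body.books").show(truncate = false)
```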

Spark: subtract dataframes but preserve duplicate values

南笙酒味 submitted on 2021-02-10 14:51:08
Question: Suppose I have two Spark SQL dataframes A and B. I want to subtract the items in B from the items in A while preserving duplicates from A. I followed the instructions to use DataFrame.except() that I found in another StackOverflow question ("Spark: subtract two DataFrames"), but that function removes all duplicates from the original dataframe A. As a conceptual example, if I have two dataframes: words = [the, quick, fox, a, brown, fox] stopWords = [the, a] then I want the output to be, in …
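DataFrame.except performs a set difference, which is why the duplicates vanish; since Spark 2.4 there is exceptAll, the multiset version that keeps duplicates coming from the left-hand side, and a left_anti join on the key column is an equivalent alternative. A small Scala sketch using the example data from the question:

```scala
import spark.implicits._

val words     = Seq("the", "quick", "fox", "a", "brown", "fox").toDF("word")
val stopWords = Seq("the", "a").toDF("word")

// exceptAll keeps the duplicates that originate in `words`.
val kept = words.exceptAll(stopWords)
kept.show() // quick, fox, brown, fox (in some order)

// Equivalent formulation with an anti join on the shared column.
val keptViaJoin = words.join(stopWords, Seq("word"), "left_anti")
```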

Scala cannot infer

☆樱花仙子☆ submitted on 2021-02-10 14:33:30
Question: I have a very simple snippet of Spark code which was working on Scala 2.11 and stopped compiling after 2.12. import spark.implicits._ val ds = Seq("val").toDF("col1") ds.foreachPartition(part => { part.foreach(println) }) It fails with the error: Error:(22, 12) value foreach is not a member of Object part.foreach(println) The workaround is to help the compiler with such code: import spark.implicits._ val ds = Seq("val").toDF("col1") println(ds.getClass) ds.foreachPartition((part: Iterator[Row]) …
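The usual explanation is that Dataset.foreachPartition is overloaded (a Scala Iterator[T] => Unit variant and a Java ForeachPartitionFunction[T] variant); with Scala 2.12's SAM conversion an untyped lambda matches both, so the compiler cannot pick an overload and the parameter ends up typed as Object. Besides the explicit (part: Iterator[Row]) annotation shown above, a couple of other ways to disambiguate, sketched here:

```scala
import org.apache.spark.sql.Row

// Option 1: give the function an explicit type first, so only the
// Iterator[Row] => Unit overload applies.
val printPartition: Iterator[Row] => Unit = _.foreach(println)
ds.foreachPartition(printPartition)

// Option 2: drop to the RDD API, whose foreachPartition has a single,
// non-overloaded signature, so inference works as it did on 2.11.
ds.rdd.foreachPartition(_.foreach(println))
```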

spark-class java no such file or directory

眉间皱痕 submitted on 2021-02-10 14:20:46
Question: I am a newbie to Spark / Scala... I have set up Spark / Scala and sbt on a fully distributed cluster. When I test and issue the command pyspark, I get the following error: /home/hadoop/spark/bin/spark-class line 75 /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java - no such file or directory. My .bashrc contains: export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 hadoop-env.sh contains: export JAVA_HOME=/usr/lib/jvm/java7-openjdk-amd64/jre/ conf/spark-env.sh contains: JAVA_HOME=usr/lib/jvm/java7 …

What is the performance difference between accumulator and collect() in Spark?

泄露秘密 submitted on 2021-02-10 14:11:31
Question: Accumulators are basically shared variables in Spark that are updated by executors but read only by the driver. collect() in Spark gets all the data into the driver from the executors. So in both cases I ultimately get the data only on the driver. What, then, is the performance difference between using an accumulator and using collect() to convert a large RDD into a List? Code to convert a dataframe to a List using an accumulator: val queryOutput = spark.sql(query) val acc = spark.sparkContext.collectionAccumulator …
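Both approaches end up materialising every row on the driver, so neither avoids the driver-memory cost; collect() is the mechanism designed for this, whereas a collection accumulator ships its contents back with the task results and is only guaranteed to be applied exactly once when it is updated inside an action (updates inside transformations can be re-applied on task retries). A rough Scala sketch of the two variants, reusing the question's queryOutput and query names:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row

val queryOutput = spark.sql(query)

// Variant 1: collect() - the supported way to pull a (small enough) result
// set onto the driver.
val viaCollect: List[Row] = queryOutput.collect().toList

// Variant 2: a collection accumulator, updated on the executors inside an
// action and read back on the driver afterwards.
val acc = spark.sparkContext.collectionAccumulator[Row]("rows")
queryOutput.rdd.foreach(row => acc.add(row))
val viaAccumulator: List[Row] = acc.value.asScala.toList
```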

spark-nlp 'JavaPackage' object is not callable

会有一股神秘感。 submitted on 2021-02-10 12:56:19
Question: I am using JupyterLab to run spark-nlp text analysis. At the moment I am just running the sample code: import sparknlp from pyspark.sql import SparkSession from sparknlp.pretrained import PretrainedPipeline #create or get Spark Session #spark = sparknlp.start() spark = SparkSession.builder \ .appName("ner")\ .master("local[4]")\ .config("spark.driver.memory","8G")\ .config("spark.driver.maxResultSize", "2G") \ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.5")\ …