apache-spark

Spark writing to Cassandra with varying TTL

江枫思渺然 submitted on 2021-02-10 18:04:59
Question: In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to. I want to write the dataframe to a Cassandra DB. The data must be written to the DB with a TTL. The TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured. Currently I am writing to Cassandra with Spark using a …
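One way to get a per-row TTL with the spark-cassandra-connector is to compute the TTL as an extra column using the formula above and write through the RDD API with TTLOption.perRow, which takes each row's TTL (in seconds) from a named field instead of applying one constant to the whole write. Below is a rough Scala sketch (the question is in Java, but the connector exposes the same WriteConf/TTLOption classes there); the case class, keyspace, table, and all column names other than bucket_timestamp are made up for illustration, and df stands for the question's dataframe:

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TTLOption, WriteConf}
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical row layout; ttlSeconds carries the per-row TTL in seconds
// and is consumed by the writer rather than stored as a table column.
case class BucketRow(bucket_id: String, bucket_timestamp: Long, payload: String, ttlSeconds: Int)

val constTtl = 86400L // CONST_TTL, in seconds

// ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp)
val withTtl = df.withColumn(
  "ttlSeconds",
  (lit(constTtl) - (unix_timestamp(current_timestamp()) - col("bucket_timestamp"))).cast("int"))

// TTLOption.perRow("ttlSeconds") tells the connector to read each row's TTL
// from that field of the saved objects.
withTtl.as[BucketRow].rdd.saveToCassandra(
  "my_keyspace", "my_table",
  writeConf = WriteConf(ttl = TTLOption.perRow("ttlSeconds")))
```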

Convert a JavaRDD<Tuple2<Object, long[]>> into a Spark Dataset<Row> in Java

跟風遠走 submitted on 2021-02-10 16:19:55
Question: In Java (not Scala!) Spark 3.0.1, I have a JavaRDD instance object neighborIdsRDD whose type is JavaRDD<Tuple2<Object, long[]>> . Part of my code related to the generation of the JavaRDD is the following: GraphOps<String, String> graphOps = new GraphOps<>(graph, stringTag, stringTag); JavaRDD<Tuple2<Object, long[]>> neighborIdsRDD = graphOps.collectNeighborIds(EdgeDirection.Either()).toJavaRDD(); I have had to get a JavaRDD using toJavaRDD() because collectNeighborIds returns a org.apache …
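A common way to do this conversion is to map each tuple to a generic Row and then build the Dataset from the Row RDD plus an explicit schema; in Java the same two steps are JavaRDD.map(...) with RowFactory.create(...) followed by SparkSession.createDataFrame(JavaRDD<Row>, StructType). A Scala sketch of the idea, with assumed field names (vertexId, neighborIds) and spark standing for the existing SparkSession:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Assumed column names; adjust to the real vertex-id / neighbour columns.
val schema = StructType(Seq(
  StructField("vertexId", LongType, nullable = false),
  StructField("neighborIds", ArrayType(LongType), nullable = true)))

// Map each Tuple2(vertexId, long[]) to a generic Row.
val rowRdd = neighborIdsRDD.rdd.map { case (vertexId, neighbors) =>
  Row(vertexId.asInstanceOf[Long], neighbors.toSeq)
}

// Build the Dataset<Row> from the Row RDD and the explicit schema.
val neighborsDs = spark.createDataFrame(rowRdd, schema)
neighborsDs.printSchema()
```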

Logger is not working inside spark UDF on cluster

陌路散爱 submitted on 2021-02-10 15:54:51
Question: I have placed log.info statements inside my UDF, but it is not working on the cluster; locally it works fine. Here is the snippet: def relType = udf((colValue: String, relTypeV: String) => { var relValue = "NA" val relType = relTypeV.split(",").toList val relTypeMap = relType.map { col => val split = col.split(":") (split(0), split(1)) }.toMap // val keySet = relTypeMap relTypeMap.foreach { x => if ((x._1 != null || colValue != null || x._1.trim() != "" || colValue.trim() != "") && colValue …
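A likely reason the messages seem to disappear on a cluster is that the UDF body runs on the executors, so anything it logs goes to the executor logs (YARN container stderr / the Executors tab in the Spark UI) rather than the driver console, and a logger created on the driver does not serialize cleanly into the closure. One common pattern, sketched below with illustrative names, is to keep the logger in an object as a @transient lazy val so each executor creates its own instance:

```scala
import org.apache.log4j.LogManager
import org.apache.spark.sql.functions.udf

// Holder object so the logger is created lazily on each executor instead of
// being serialized from the driver into the UDF closure.
object UdfLogging extends Serializable {
  @transient lazy val log = LogManager.getLogger("relTypeUdf")
}

val relType = udf { (colValue: String, relTypeV: String) =>
  // These messages show up in the executor logs, not on the driver console.
  UdfLogging.log.info(s"relType called with colValue=$colValue relTypeV=$relTypeV")
  // ... original mapping logic goes here ...
  "NA"
}
```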

How can missing columns be added as null while read a nested JSON using pyspark and a predefined struct schema

梦想与她 submitted on 2021-02-10 15:49:45
Question: Python = 3.6, Spark = 2.4. My sample JSON data: {"data":{"header":"someheader","body":{"name":"somename","value":"somevalue","books":[{"name":"somename"},{"value":"somevalue"},{"author":"someauthor"}]}}}, {"data":{"header":"someheader1","body":{"name":"somename1","value":"somevalue1","books":[{"name":"somename1"},{"value":"somevalue1"},{"author":"someauthor1"}]}}},.... My struct schema: Schema = StructType([StructField('header',StringType(),True),StructField('body',StructType([StructField('name1' …
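For reference, when the reader is given the full schema up front, any field that a particular JSON record does not contain simply comes back as null, which is the behaviour being asked for. A minimal Scala sketch of that idea follows (the question uses PySpark, where pyspark.sql.types has the same StructType/StructField/ArrayType classes); the field names follow the sample data and the input path is hypothetical:

```scala
import org.apache.spark.sql.types._

// Schema for the entries of the "books" array.
val bookSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("value", StringType, nullable = true),
  StructField("author", StringType, nullable = true)))

// Schema for the nested "body" struct.
val bodySchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("value", StringType, nullable = true),
  StructField("books", ArrayType(bookSchema), nullable = true)))

val schema = StructType(Seq(
  StructField("data", StructType(Seq(
    StructField("header", StringType, nullable = true),
    StructField("body", bodySchema, nullable = true))), nullable = true)))

// Records that lack any of the declared fields get null for those fields.
val df = spark.read.schema(schema).json("/path/to/input.json")
df.select("data.header", "data.body.books").show(truncate = false)
```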

Spark: subtract dataframes but preserve duplicate values

南笙酒味 submitted on 2021-02-10 14:51:08
Question: Suppose I have two Spark SQL dataframes A and B. I want to subtract the items in B from the items in A while preserving duplicates from A. I followed the instructions to use DataFrame.except() that I found in another StackOverflow question ("Spark: subtract two DataFrames"), but that function removes all duplicates from the original dataframe A. As a conceptual example, if I have two dataframes: words = [the, quick, fox, a, brown, fox] stopWords = [the, a] then I want the output to be, in …
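DataFrame.except performs a set difference, which is why the duplicates vanish; since Spark 2.4 there is exceptAll, the multiset version that keeps duplicates coming from the left-hand side, and a left_anti join on the key column is an equivalent alternative. A small Scala sketch using the example data from the question:

```scala
import spark.implicits._

val words     = Seq("the", "quick", "fox", "a", "brown", "fox").toDF("word")
val stopWords = Seq("the", "a").toDF("word")

// exceptAll keeps the duplicates that originate in `words`.
val kept = words.exceptAll(stopWords)
kept.show() // quick, fox, brown, fox (in some order)

// Equivalent formulation with an anti join on the shared column.
val keptViaJoin = words.join(stopWords, Seq("word"), "left_anti")
```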

Scala cannot infer

☆樱花仙子☆ submitted on 2021-02-10 14:33:30
Question: I have a very simple snippet of Spark code which was working on Scala 2.11 and stopped compiling after 2.12. import spark.implicits._ val ds = Seq("val").toDF("col1") ds.foreachPartition(part => { part.foreach(println) }) It fails with the error: Error:(22, 12) value foreach is not a member of Object part.foreach(println) The workaround is to help the compiler with such code: import spark.implicits._ val ds = Seq("val").toDF("col1") println(ds.getClass) ds.foreachPartition((part: Iterator[Row]) …
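The usual explanation is that Dataset.foreachPartition is overloaded (a Scala Iterator[T] => Unit variant and a Java ForeachPartitionFunction[T] variant); with Scala 2.12's SAM conversion an untyped lambda matches both, so the compiler cannot pick an overload and the parameter ends up typed as Object. Besides the explicit (part: Iterator[Row]) annotation shown above, a couple of other ways to disambiguate, sketched here:

```scala
import org.apache.spark.sql.Row

// Option 1: give the function an explicit type first, so only the
// Iterator[Row] => Unit overload applies.
val printPartition: Iterator[Row] => Unit = _.foreach(println)
ds.foreachPartition(printPartition)

// Option 2: drop to the RDD API, whose foreachPartition has a single,
// non-overloaded signature, so inference works as it did on 2.11.
ds.rdd.foreachPartition(_.foreach(println))
```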

spark-class java no such file or directory

眉间皱痕 submitted on 2021-02-10 14:20:46
Question: I am a newbie to Spark / Scala... I have set up Spark / Scala and sbt on a fully distributed cluster. When I test and issue the command pyspark, I get the following error: /home/hadoop/spark/bin/spark-class line 75 /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java - no such file or directory. My .bashrc contains: export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 hadoop-env.sh contains: export JAVA_HOME=/usr/lib/jvm/java7-openjdk-amd64/jre/ conf/spark-env.sh contains: JAVA_HOME=usr/lib/jvm/java7 …

What is the performance difference between accumulator and collect() in Spark?

泄露秘密 submitted on 2021-02-10 14:11:31
Question: Accumulators are basically shared variables in Spark that are updated by executors but read only by the driver. collect() in Spark gets all the data into the driver from the executors. So in both cases I ultimately get the data only on the driver. What, then, is the performance difference between using an accumulator and using collect() to convert a large RDD into a List? Code to convert a dataframe to a List using an accumulator: val queryOutput = spark.sql(query) val acc = spark.sparkContext.collectionAccumulator …
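Both approaches end up materialising every row on the driver, so neither avoids the driver-memory cost; collect() is the mechanism designed for this, whereas a collection accumulator ships its contents back with the task results and is only guaranteed to be applied exactly once when it is updated inside an action (updates inside transformations can be re-applied on task retries). A rough Scala sketch of the two variants, reusing the question's queryOutput and query names:

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row

val queryOutput = spark.sql(query)

// Variant 1: collect() - the supported way to pull a (small enough) result
// set onto the driver.
val viaCollect: List[Row] = queryOutput.collect().toList

// Variant 2: a collection accumulator, updated on the executors inside an
// action and read back on the driver afterwards.
val acc = spark.sparkContext.collectionAccumulator[Row]("rows")
queryOutput.rdd.foreach(row => acc.add(row))
val viaAccumulator: List[Row] = acc.value.asScala.toList
```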

spark-nlp 'JavaPackage' object is not callable

会有一股神秘感。 submitted on 2021-02-10 12:56:19
Question: I am using JupyterLab to run spark-nlp text analysis. At the moment I am just running the sample code: import sparknlp from pyspark.sql import SparkSession from sparknlp.pretrained import PretrainedPipeline #create or get Spark Session #spark = sparknlp.start() spark = SparkSession.builder \ .appName("ner")\ .master("local[4]")\ .config("spark.driver.memory","8G")\ .config("spark.driver.maxResultSize", "2G") \ .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.6.5")\ …