apache-spark

Escaping quotes is not working in Spark 2.2.0 while reading CSV

断了今生、忘了曾经 submitted on 2021-02-07 10:34:18

Question: I am trying to read a tab-separated delimited file but I am not able to read all records. Here are my input records:

head1   head2   head3
a       b       c
a2      a3      a4
a1      "b1     "c1

My code:

var inputDf = sparkSession.read
  .option("delimiter", "\t")
  .option("header", "true")
  // .option("inferSchema", "true")
  .option("nullValue", "")
  .option("escape", "\"")
  .option("multiLine", true)
  .option("nullValue", null)
  .option("nullValue", "NULL")
  .schema(finalSchema)
  .csv("file:///C:/Users/prhasija/Desktop
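The question is cut off above, but the sample rows show stray double quotes inside tab-separated fields. A minimal sketch of one common workaround, assuming those quotes are ordinary data and should not open a quoted field: turn quote processing off by setting the quote option to an empty string. This is not the asker's fix; finalSchema and the path are placeholders.

val inputDf = sparkSession.read
  .option("delimiter", "\t")
  .option("header", "true")
  .option("quote", "")           // empty string disables quote handling entirely
  .option("nullValue", "NULL")
  .schema(finalSchema)
  .csv("file:///path/to/input.tsv")
inputDf.show(truncate = false)   // all three data rows, quotes kept as literal characters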

Spark Dataset: data transformation

旧城冷巷雨未停 submitted on 2021-02-07 10:19:06

Question: I have a Spark Dataset of the format:

+--------------+-----+----+
|name          |type |cost|
+--------------+-----+----+
|AAAAAAAAAAAAAA|XXXXX|0.24|
|AAAAAAAAAAAAAA|YYYYY|1.14|
|BBBBBBBBBBBBBB|XXXXX|0.78|
|BBBBBBBBBBBBBB|YYYYY|2.67|
|BBBBBBBBBBBBBB|ZZZZZ|0.15|
|CCCCCCCCCCCCCC|XXXXX|1.86|
|CCCCCCCCCCCCCC|YYYYY|1.50|
|CCCCCCCCCCCCCC|ZZZZZ|1.00|
+--------------+-----+----+

I want to transform this into an object of type:

public class CostPerName {
    private String name;
    private Map
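The class definition is truncated above, but it evidently pairs each name with a map of type to cost. A sketch of one way to build that per-name map with the DataFrame API, assuming Spark 2.4+ (for map_from_entries) and a DataFrame named df holding the columns shown; this is not the asker's code.

import org.apache.spark.sql.functions.{col, collect_list, map_from_entries, struct}

val costPerName = df
  .groupBy("name")
  .agg(map_from_entries(collect_list(struct(col("type"), col("cost")))).as("costs"))
// Each resulting row carries a name plus a Map of type -> cost, which can then
// be mapped onto the CostPerName bean.
costPerName.show(truncate = false)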

Apache Kudu slow insert, high queuing time

两盒软妹~` submitted on 2021-02-07 10:15:38

Question: I have been using the Spark data source to write to Kudu from Parquet, and the write performance is terrible: about 12,000 rows per second, with each row roughly 160 bytes. We have 7 Kudu nodes, each with 24 cores, 64 GB of RAM, and 12 SATA disks. None of the resources seems to be the bottleneck: tserver CPU usage is around 3-4 cores, RAM usage around 10 GB, and there is no disk congestion. Still, I see that most of the time write requests are stuck queuing. Any ideas are appreciated.

W0811 12:34:03.526340 7753 rpcz_store.cc:251] Call kudu.tserver
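The tserver log line above is truncated, so only the Spark-side write path can be sketched here. A minimal sketch using the kudu-spark connector (master addresses, table name, and the partition count are placeholder assumptions); how many Spark tasks write concurrently, and how the target table is hash/range-partitioned, are usually the first knobs to check when requests pile up in the tserver queues.

import org.apache.kudu.spark.kudu.KuduContext

val df = spark.read.parquet("hdfs:///path/to/parquet")
val kuduContext = new KuduContext("kudu-master-1:7051,kudu-master-2:7051", spark.sparkContext)

// Spread the insert across more (or fewer) parallel tasks than the Parquet
// file layout happens to give; each task batches its rows per tablet server.
kuduContext.insertRows(df.repartition(48), "impala::default.target_table")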

Spark: Accumulators do not work properly when I use them in a Range

Deadly submitted on 2021-02-07 10:10:31

Question: I don't understand why my accumulator is not updated properly by Spark.

object AccumulatorsExample extends App {
  val acc = sc.accumulator(0L, "acc")
  sc range(0, 20000, step = 25) map { _ => acc += 1 } count()
  assert(acc.value == 800) // not equals
}

My Spark config: setMaster("local[*]") // should use 8 CPU cores

I'm not sure whether Spark distributes the accumulator computation across every core and maybe that's the problem. My question is how can I aggregate all acc values into one single sum and
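The question is cut off above. A sketch of the same program written without the App trait and with the Spark 2.x longAccumulator API; App's delayed initialization is a frequent cause of this exact symptom, so treat this as an assumption about the fix rather than the accepted answer.

import org.apache.spark.sql.SparkSession

object AccumulatorsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("acc").getOrCreate()
    val sc = spark.sparkContext
    val acc = sc.longAccumulator("acc")
    // range(0, 20000, step = 25) has 800 elements; count() forces the lazy map to run.
    sc.range(0, 20000, step = 25).map { _ => acc.add(1) }.count()
    assert(acc.value == 800L)
    spark.stop()
  }
}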

How to check if a DataFrame has already been cached/persisted?

旧巷老猫 submitted on 2021-02-07 09:57:38

Question: For Spark's RDD object this is quite trivial, as it exposes a getStorageLevel method, but a DataFrame does not seem to expose anything similar. Anyone?

Answer 1: You can check whether a DataFrame is cached or not using the Catalog (org.apache.spark.sql.catalog.Catalog), which comes with Spark 2. Code example:

val sparkSession = SparkSession.builder
  .master("local")
  .appName("example")
  .getOrCreate()
val df = sparkSession.read.csv("src/main/resources/sales.csv")
df.createTempView("sales")
// interacting with
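The answer is truncated above; a short sketch of how the check itself presumably continues with the Catalog API it names, plus the Dataset.storageLevel alternative available since Spark 2.1 (view name and path reused from the answer):

// Cache via the catalog (or df.cache()), then ask the catalog about it.
sparkSession.catalog.cacheTable("sales")
println(sparkSession.catalog.isCached("sales"))   // true

// Alternative that needs no temp view (Spark 2.1+):
println(df.storageLevel)                          // StorageLevel.NONE when not persisted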

Spark start-slave not connecting to master

最后都变了- submitted on 2021-02-07 09:27:53

Question: I am using Ubuntu 16 and trying to set up a Spark cluster on my LAN. I have managed to configure a Spark master, and I managed to connect a slave from the same machine and see it on localhost:8080. When I try to connect from another machine the problems start. I configured passwordless SSH as explained here. When I try to connect to the master using start-slave.sh spark://master:port as explained here, I am getting this error log. I tried accessing the master using the local IP and the local name (i
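The error log is not included above, so this is only a sketch of the usual checklist for this symptom; the address and the default 7077 port are assumptions, not values from the question.

# On the master, bind to the LAN address instead of localhost (conf/spark-env.sh):
export SPARK_MASTER_HOST=192.168.1.10
sbin/stop-master.sh && sbin/start-master.sh

# On the worker machine, pass the spark:// URL exactly as the master web UI
# at http://192.168.1.10:8080 displays it:
sbin/start-slave.sh spark://192.168.1.10:7077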

PySpark: Getting output layer neuron values for Spark ML Multilayer Perceptron Classifier

北战南征 submitted on 2021-02-07 09:11:16

Question: I am doing binary classification using the Spark ML Multilayer Perceptron Classifier.

mlp = MultilayerPerceptronClassifier(labelCol="evt", featuresCol="features",
                                     layers=[inputneurons, (inputneurons * 2) + 1, 2])

The output layer has two neurons, as it is a binary classification problem. Now I would like to get the values of the two output neurons for each row in the test set, instead of just getting the prediction column containing either 0 or 1. I could not find anything for that in the API documentation.

Answer 1:
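The answer itself is cut off above. A sketch of one way to get per-class scores, assuming Spark 2.3 or later, where the trained MLP model adds rawPrediction and probability columns to the transform() output; train_df, test_df, and inputneurons are placeholders taken from the question.

from pyspark.ml.classification import MultilayerPerceptronClassifier

mlp = MultilayerPerceptronClassifier(
    labelCol="evt",
    featuresCol="features",
    layers=[inputneurons, (inputneurons * 2) + 1, 2],
)
model = mlp.fit(train_df)

# probability/rawPrediction hold the two output-layer scores per row;
# prediction is still the 0/1 argmax.
model.transform(test_df) \
     .select("rawPrediction", "probability", "prediction") \
     .show(truncate=False)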