apache-spark

Escaping quotes is not working in Spark 2.2.0 while reading CSV

断了今生、忘了曾经 submitted on 2021-02-07 10:34:18

Question: I am trying to read a tab-separated delimited file but I am not able to read all records. Here are my input records:

head1   head2   head3
a       b       c
a2      a3      a4
a1      "b1     "c1

My code:

var inputDf = sparkSession.read
  .option("delimiter", "\t")
  .option("header", "true")
  // .option("inferSchema", "true")
  .option("nullValue", "")
  .option("escape", "\"")
  .option("multiLine", true)
  .option("nullValue", null)
  .option("nullValue", "NULL")
  .schema(finalSchema)
  .csv("file:///C:/Users/prhasija/Desktop
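The question is cut off above, but the sample rows show stray double quotes inside tab-separated fields. A minimal sketch of one common workaround, assuming those quotes are ordinary data and should not open a quoted field: turn quote processing off by setting the quote option to an empty string. This is not the asker's fix; finalSchema and the path are placeholders.

val inputDf = sparkSession.read
  .option("delimiter", "\t")
  .option("header", "true")
  .option("quote", "")           // empty string disables quote handling entirely
  .option("nullValue", "NULL")
  .schema(finalSchema)
  .csv("file:///path/to/input.tsv")
inputDf.show(truncate = false)   // all three data rows, quotes kept as literal characters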

Spark Dataset: data transformation

旧城冷巷雨未停 submitted on 2021-02-07 10:19:06

Question: I have a Spark Dataset of the format:

+--------------+-----+----+
|name          |type |cost|
+--------------+-----+----+
|AAAAAAAAAAAAAA|XXXXX|0.24|
|AAAAAAAAAAAAAA|YYYYY|1.14|
|BBBBBBBBBBBBBB|XXXXX|0.78|
|BBBBBBBBBBBBBB|YYYYY|2.67|
|BBBBBBBBBBBBBB|ZZZZZ|0.15|
|CCCCCCCCCCCCCC|XXXXX|1.86|
|CCCCCCCCCCCCCC|YYYYY|1.50|
|CCCCCCCCCCCCCC|ZZZZZ|1.00|
+--------------+-----+----+

I want to transform this into an object of type:

public class CostPerName {
    private String name;
    private Map
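The class definition is truncated above, but it evidently pairs each name with a map of type to cost. A sketch of one way to build that per-name map with the DataFrame API, assuming Spark 2.4+ (for map_from_entries) and a DataFrame named df holding the columns shown; this is not the asker's code.

import org.apache.spark.sql.functions.{col, collect_list, map_from_entries, struct}

val costPerName = df
  .groupBy("name")
  .agg(map_from_entries(collect_list(struct(col("type"), col("cost")))).as("costs"))
// Each resulting row carries a name plus a Map of type -> cost, which can then
// be mapped onto the CostPerName bean.
costPerName.show(truncate = false)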

Apache Kudu slow insert, high queuing time

两盒软妹~` submitted on 2021-02-07 10:15:38

Question: I have been using the Spark data source to write to Kudu from Parquet, and the write performance is terrible: about 12,000 rows per second, with each row roughly 160 bytes. We have 7 Kudu nodes, each with 24 cores, 64 GB of RAM, and 12 SATA disks. None of the resources seems to be the bottleneck: tserver CPU usage is around 3-4 cores, RAM usage around 10 GB, and there is no disk congestion. Still, I see that most of the time write requests are stuck queuing. Any ideas are appreciated.

W0811 12:34:03.526340 7753 rpcz_store.cc:251] Call kudu.tserver
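The tserver log line above is truncated, so only the Spark-side write path can be sketched here. A minimal sketch using the kudu-spark connector (master addresses, table name, and the partition count are placeholder assumptions); how many Spark tasks write concurrently, and how the target table is hash/range-partitioned, are usually the first knobs to check when requests pile up in the tserver queues.

import org.apache.kudu.spark.kudu.KuduContext

val df = spark.read.parquet("hdfs:///path/to/parquet")
val kuduContext = new KuduContext("kudu-master-1:7051,kudu-master-2:7051", spark.sparkContext)

// Spread the insert across more (or fewer) parallel tasks than the Parquet
// file layout happens to give; each task batches its rows per tablet server.
kuduContext.insertRows(df.repartition(48), "impala::default.target_table")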

Spark: Accumulators do not work properly when I use them in a Range

Deadly submitted on 2021-02-07 10:10:31

Question: I don't understand why my accumulator is not updated properly by Spark.

object AccumulatorsExample extends App {
  val acc = sc.accumulator(0L, "acc")
  sc range(0, 20000, step = 25) map { _ => acc += 1 } count()
  assert(acc.value == 800) // not equals
}

My Spark config: setMaster("local[*]") // should use 8 CPU cores

I'm not sure whether Spark distributes the accumulator computation across every core and maybe that's the problem. My question is how can I aggregate all acc values into one single sum and
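The question is cut off above. A sketch of the same program written without the App trait and with the Spark 2.x longAccumulator API; App's delayed initialization is a frequent cause of this exact symptom, so treat this as an assumption about the fix rather than the accepted answer.

import org.apache.spark.sql.SparkSession

object AccumulatorsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("acc").getOrCreate()
    val sc = spark.sparkContext
    val acc = sc.longAccumulator("acc")
    // range(0, 20000, step = 25) has 800 elements; count() forces the lazy map to run.
    sc.range(0, 20000, step = 25).map { _ => acc.add(1) }.count()
    assert(acc.value == 800L)
    spark.stop()
  }
}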

How to check if a DataFrame has already been cached/persisted?

旧巷老猫 submitted on 2021-02-07 09:57:38

Question: For Spark's RDD object this is quite trivial, as it exposes a getStorageLevel method, but a DataFrame does not seem to expose anything similar. Anyone?

Answer 1: You can check whether a DataFrame is cached or not using the Catalog (org.apache.spark.sql.catalog.Catalog), which comes with Spark 2. Code example:

val sparkSession = SparkSession.builder
  .master("local")
  .appName("example")
  .getOrCreate()
val df = sparkSession.read.csv("src/main/resources/sales.csv")
df.createTempView("sales")
// interacting with
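The answer is truncated above; a short sketch of how the check itself presumably continues with the Catalog API it names, plus the Dataset.storageLevel alternative available since Spark 2.1 (view name and path reused from the answer):

// Cache via the catalog (or df.cache()), then ask the catalog about it.
sparkSession.catalog.cacheTable("sales")
println(sparkSession.catalog.isCached("sales"))   // true

// Alternative that needs no temp view (Spark 2.1+):
println(df.storageLevel)                          // StorageLevel.NONE when not persisted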

Spark start-slave not connecting to master

最后都变了- submitted on 2021-02-07 09:27:53

Question: I am using Ubuntu 16 and trying to set up a Spark cluster on my LAN. I have managed to configure a Spark master, and I managed to connect a slave from the same machine and see it on localhost:8080. When I try to connect from another machine the problems start. I configured passwordless SSH as explained here. When I try to connect to the master using start-slave.sh spark://master:port as explained here, I am getting this error log. I tried accessing the master using the local IP and the local name (i
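The error log is not included above, so this is only a sketch of the usual checklist for this symptom; the address and the default 7077 port are assumptions, not values from the question.

# On the master, bind to the LAN address instead of localhost (conf/spark-env.sh):
export SPARK_MASTER_HOST=192.168.1.10
sbin/stop-master.sh && sbin/start-master.sh

# On the worker machine, pass the spark:// URL exactly as the master web UI
# at http://192.168.1.10:8080 displays it:
sbin/start-slave.sh spark://192.168.1.10:7077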

PySpark: Getting output layer neuron values for Spark ML Multilayer Perceptron Classifier

北战南征 submitted on 2021-02-07 09:11:16

Question: I am doing binary classification using the Spark ML Multilayer Perceptron Classifier.

mlp = MultilayerPerceptronClassifier(labelCol="evt", featuresCol="features",
                                     layers=[inputneurons, (inputneurons * 2) + 1, 2])

The output layer has two neurons, as it is a binary classification problem. Now I would like to get the values of the two output neurons for each row in the test set, instead of just getting the prediction column containing either 0 or 1. I could not find anything for that in the API documentation.

Answer 1:
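The answer itself is cut off above. A sketch of one way to get per-class scores, assuming Spark 2.3 or later, where the trained MLP model adds rawPrediction and probability columns to the transform() output; train_df, test_df, and inputneurons are placeholders taken from the question.

from pyspark.ml.classification import MultilayerPerceptronClassifier

mlp = MultilayerPerceptronClassifier(
    labelCol="evt",
    featuresCol="features",
    layers=[inputneurons, (inputneurons * 2) + 1, 2],
)
model = mlp.fit(train_df)

# probability/rawPrediction hold the two output-layer scores per row;
# prediction is still the 0/1 argmax.
model.transform(test_df) \
     .select("rawPrediction", "probability", "prediction") \
     .show(truncate=False)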