apache-zeppelin

Build error when installing Apache Zeppelin

谁说胖子不能爱 submitted on 2019-12-23 12:08:08
Question: I am at my wits' end trying to get Apache Zeppelin running on my Linux VM. I am following this tutorial: http://madhukaudantha.blogspot.ca/2015/03/building-apache-zeppelin.html. So far I have cloned the repository to the machine with git clone and am now trying to run 'mvn clean package'. I get the following error summary; I really need to get this running.
[INFO] Reactor Summary:
[INFO]
[INFO] Zeppelin .......................................... SUCCESS [16.124s]
[INFO] Zeppelin: Interpreter ........................

Spark throws java.util.NoSuchElementException: key not found: 67

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-23 07:55:54
Question: I am running the Spark bisecting k-means algorithm in Zeppelin.
// I transform my data using the TF-IDF algorithm
val idf = new IDF(minFreq).fit(data)
val hashIDF_features = idf.transform(dbTF)
// and pass the transformed data to the clustering algorithm.
val bkm = new BisectingKMeans().setK(100).setMaxIterations(2)
val model = bkm.run(hashIDF_features)
val cluster_rdd = model.predict(hashIDF_features)
I always get this error though: org.apache.spark.SparkException: Job aborted due to stage failure
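
For reference, a minimal self-contained sketch of the same MLlib pipeline, assuming the term-frequency vectors come from a HashingTF step (the input file, the placeholder names, and the minDocFreq value are illustrative, not from the question). Caching and materialising the transformed features before the iterative clustering job is a common mitigation when the input RDD would otherwise be recomputed non-deterministically:

import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Placeholder input: one whitespace-tokenised document per line.
val tokens: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)

// Term-frequency vectors, then IDF weighting (dbTF mirrors the question's name).
val dbTF: RDD[Vector] = new HashingTF().transform(tokens)
val idf = new IDF(minDocFreq = 2).fit(dbTF)

// Cache and materialise the transformed features so the iterative
// clustering job reads a stable, fully evaluated input.
val hashIDF_features = idf.transform(dbTF).cache()
hashIDF_features.count()

val bkm = new BisectingKMeans().setK(100).setMaxIterations(2)
val model = bkm.run(hashIDF_features)
val cluster_rdd = model.predict(hashIDF_features)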

Using pyspark on Windows not working- py4j

半城伤御伤魂 submitted on 2019-12-23 06:00:32
Question: I installed Zeppelin on Windows using this tutorial and this. I also installed Java 8 to avoid problems. I'm now able to start the Zeppelin server, and I'm trying to run this code:
%pyspark
a = 5*4
print("value = %i" % (a))
sc.version
I'm getting an error related to py4j. I had other problems with this library before (same as here), and to avoid them I replaced the py4j library in both the Zeppelin and Spark installations on my computer with the latest version, py4j 0.10.7. This is the error I get:

SPARK 1.6.1: Task not serializable when evaluating a classifier on a DataFrame

為{幸葍}努か submitted on 2019-12-22 12:25:40
Question: I have a DataFrame that I map into an RDD of () to test an SVMModel. I am using Zeppelin and Spark 1.6.1. Here is my code:
val loadedSVMModel = SVMModel.load(sc, pathToSvmModel)
// Clear the default threshold.
loadedSVMModel.clearThreshold()
// Compute raw scores on the test set.
val scoreAndLabels = df.select($"features", $"label")
  .map { case Row(features: Vector, label: Double) =>
    val score = loadedSVMModel.predict(features)
    (score, label)
  }
// Get evaluation metrics.
val metrics = new
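
A common cause of "Task not serializable" in a notebook is that the map closure drags in the enclosing interpreter object along with the model. Below is a sketch of the same evaluation, assuming pathToSvmModel and df come from the question, that references the model through a local val so only the model itself is captured (broadcasting the model is another option):

import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row
import sqlContext.implicits._  // for the $"col" syntax in the Spark 1.6 interpreter

val loadedSVMModel = SVMModel.load(sc, pathToSvmModel)
loadedSVMModel.clearThreshold()

// Copy the model into a local val so the closure captures only the
// (serializable) model, not the enclosing notebook wrapper object.
val localModel = loadedSVMModel

val scoreAndLabels = df.select($"features", $"label")
  .map { case Row(features: Vector, label: Double) =>
    (localModel.predict(features), label)
  }

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Area under ROC = ${metrics.areaUnderROC()}")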

pyspark interpreter not found in apache zeppelin

旧巷老猫 submitted on 2019-12-22 10:46:04
Question: I am having an issue using pyspark in an Apache Zeppelin (version 0.6.0) notebook. Running the following simple code gives me a "pyspark interpreter not found" error:
%pyspark
a = 1+3
Running sc.version gave me res2: String = 1.6.0, which is the version of Spark installed on my machine, and running z returns res0: org.apache.zeppelin.spark.ZeppelinContext = {}. Pyspark works from the CLI (using Spark 1.6.0 and Python 2.6.6). The default Python on the machine is 2.6.6, while anaconda-python 3.5 is also

structured streaming Kafka 2.1->Zeppelin 0.8->Spark 2.4: spark does not use jar

被刻印的时光 ゝ submitted on 2019-12-22 08:54:49
Question: I have a Kafka 2.1 message broker and want to do some processing of the message data in Spark 2.4. I want to use Zeppelin 0.8.1 notebooks for rapid prototyping. I downloaded the spark-streaming-kafka-0-10_2.11.jar that is necessary for structured streaming (http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) and added it as a "Dependencies" artifact to the "spark" interpreter of Zeppelin (which also handles the %pyspark paragraphs). I restarted this
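
For reference, structured streaming reads from Kafka through the spark-sql-kafka-0-10 connector rather than the DStream-oriented spark-streaming-kafka-0-10 jar mentioned above. A minimal sketch, with a placeholder broker address and topic name:

// Structured-streaming source via the spark-sql-kafka-0-10 connector;
// "broker:9092" and "topic1" are placeholders, not from the question.
val kafkaDF = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topic1")
  .load()

// The Kafka payload arrives as binary; cast it to strings for processing.
val messages = kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")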

Spark DataFrame filtering: retain elements belonging to a list

狂风中的少年 submitted on 2019-12-22 08:37:37
Question: I am using Spark 1.5.1 with Scala in a Zeppelin notebook. I have a DataFrame with a column called userID of type Long. In total I have about 4 million rows and 200,000 unique userIDs. I also have a list of 50,000 userIDs to exclude, and I can easily build the list of userIDs to retain. What is the best way to delete all the rows that belong to the users to exclude? Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain? I saw this post and
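
One common approach for a list this size is to broadcast the IDs as a Set and filter with a UDF, rather than building a very long isin expression. A sketch assuming the DataFrame is called df and the exclusion list has already been collected on the driver (the literal IDs below are placeholders):

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._  // for the $"col" syntax

// Placeholder exclusion list; in the question it holds ~50,000 userIDs.
val excludeIDs: Set[Long] = Set(1L, 2L, 3L)

// Broadcast the set so each executor can test membership locally.
val excludeBC = sc.broadcast(excludeIDs)
val keepUser = udf((id: Long) => !excludeBC.value.contains(id))

// Keep only the rows whose userID is not in the exclusion set.
val retainedDF = df.filter(keepUser($"userID"))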

How to set spark.driver.memory for Spark/Zeppelin on EMR

£可爱£侵袭症+ submitted on 2019-12-22 07:00:03
Question: When using EMR (with Spark and Zeppelin), changing spark.driver.memory in Zeppelin's Spark interpreter settings doesn't work. What is the best and quickest way to set the Spark driver memory when using the EMR web interface (not the AWS CLI) to create clusters? Could a bootstrap action be a solution? If so, can you please provide an example of what the bootstrap action file should look like?
Answer 1: You can always try to add the following configuration on job flow/cluster creation:
[ { "Classification":
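
The answer is cut off right after the opening of the JSON. A typical EMR configuration classification that sets the driver memory through spark-defaults looks roughly like the following; the "5g" value is only an illustration and is not taken from the original answer:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "5g"
    }
  }
]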

Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

孤街浪徒 submitted on 2019-12-21 16:17:49
Question: I am creating clusters on EMR and configuring Zeppelin to read the notebooks from S3. To do that I am using a JSON object that looks like this:
[
  {
    "Classification": "zeppelin-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET": "hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER": "user"
        },
        "Configurations": []
      }
    ]
  }
]
I am pasting this object in the