apache-zeppelin

Build error when installing Apache Zeppelin

谁说胖子不能爱 submitted on 2019-12-23 12:08:08
Question: I am at my wits' end trying to get Apache Zeppelin running on my Linux VM. I am following this tutorial: http://madhukaudantha.blogspot.ca/2015/03/building-apache-zeppelin.html. So far I have cloned the repository to the machine with git clone and am now trying to run 'mvn clean package'. I get the following error summary; I really need to get this running.
[INFO] Reactor Summary:
[INFO]
[INFO] Zeppelin .......................................... SUCCESS [16.124s]
[INFO] Zeppelin: Interpreter ........................

Spark throws java.util.NoSuchElementException: key not found: 67

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-23 07:55:54
Question: I am running the Spark bisecting k-means algorithm in Zeppelin.
// I transform my data using the TF-IDF algorithm
val idf = new IDF(minFreq).fit(data)
val hashIDF_features = idf.transform(dbTF)
// and pass the transformed data to the clustering algorithm.
val bkm = new BisectingKMeans().setK(100).setMaxIterations(2)
val model = bkm.run(hashIDF_features)
val cluster_rdd = model.predict(hashIDF_features)
I always get this error though: org.apache.spark.SparkException: Job aborted due to stage failure
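
For reference, a minimal self-contained sketch of the same MLlib pipeline, assuming the term-frequency vectors come from a HashingTF step (the input file, the placeholder names, and the minDocFreq value are illustrative, not from the question). Caching and materialising the transformed features before the iterative clustering job is a common mitigation when the input RDD would otherwise be recomputed non-deterministically:

import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Placeholder input: one whitespace-tokenised document per line.
val tokens: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)

// Term-frequency vectors, then IDF weighting (dbTF mirrors the question's name).
val dbTF: RDD[Vector] = new HashingTF().transform(tokens)
val idf = new IDF(minDocFreq = 2).fit(dbTF)

// Cache and materialise the transformed features so the iterative
// clustering job reads a stable, fully evaluated input.
val hashIDF_features = idf.transform(dbTF).cache()
hashIDF_features.count()

val bkm = new BisectingKMeans().setK(100).setMaxIterations(2)
val model = bkm.run(hashIDF_features)
val cluster_rdd = model.predict(hashIDF_features)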

Using pyspark on Windows not working- py4j

半城伤御伤魂 submitted on 2019-12-23 06:00:32
Question: I installed Zeppelin on Windows using this tutorial and this. I also installed Java 8 to avoid problems. I'm now able to start the Zeppelin server, and I'm trying to run this code:
%pyspark
a = 5*4
print("value = %i" % (a))
sc.version
I'm getting an error related to py4j. I had other problems with this library before (same as here), and to avoid them I replaced the py4j library in both the Zeppelin and Spark installations on my computer with the latest version, py4j 0.10.7. This is the error I get:

SPARK 1.6.1: Task not serializable when evaluating a classifier on a DataFrame

為{幸葍}努か submitted on 2019-12-22 12:25:40
Question: I have a DataFrame that I map into an RDD of () to test an SVMModel. I am using Zeppelin and Spark 1.6.1. Here is my code:
val loadedSVMModel = SVMModel.load(sc, pathToSvmModel)
// Clear the default threshold.
loadedSVMModel.clearThreshold()
// Compute raw scores on the test set.
val scoreAndLabels = df.select($"features", $"label")
  .map { case Row(features: Vector, label: Double) =>
    val score = loadedSVMModel.predict(features)
    (score, label)
  }
// Get evaluation metrics.
val metrics = new
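
A common cause of "Task not serializable" in a notebook is that the map closure drags in the enclosing interpreter object along with the model. Below is a sketch of the same evaluation, assuming pathToSvmModel and df come from the question, that references the model through a local val so only the model itself is captured (broadcasting the model is another option):

import org.apache.spark.mllib.classification.SVMModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row
import sqlContext.implicits._  // for the $"col" syntax in the Spark 1.6 interpreter

val loadedSVMModel = SVMModel.load(sc, pathToSvmModel)
loadedSVMModel.clearThreshold()

// Copy the model into a local val so the closure captures only the
// (serializable) model, not the enclosing notebook wrapper object.
val localModel = loadedSVMModel

val scoreAndLabels = df.select($"features", $"label")
  .map { case Row(features: Vector, label: Double) =>
    (localModel.predict(features), label)
  }

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Area under ROC = ${metrics.areaUnderROC()}")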

pyspark interpreter not found in apache zeppelin

旧巷老猫 submitted on 2019-12-22 10:46:04
Question: I am having an issue using pyspark in an Apache Zeppelin (version 0.6.0) notebook. Running the following simple code gives me a "pyspark interpreter not found" error:
%pyspark
a = 1+3
Running sc.version gave me res2: String = 1.6.0, which is the version of Spark installed on my machine, and running z returns res0: org.apache.zeppelin.spark.ZeppelinContext = {}. Pyspark works from the CLI (using Spark 1.6.0 and Python 2.6.6). The default Python on the machine is 2.6.6, while anaconda-python 3.5 is also

structured streaming Kafka 2.1->Zeppelin 0.8->Spark 2.4: spark does not use jar

被刻印的时光 ゝ submitted on 2019-12-22 08:54:49
Question: I have a Kafka 2.1 message broker and want to do some processing of the message data in Spark 2.4. I want to use Zeppelin 0.8.1 notebooks for rapid prototyping. I downloaded the spark-streaming-kafka-0-10_2.11.jar that is necessary for structured streaming (http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) and added it as a "Dependencies" artifact to the "spark" interpreter of Zeppelin (which also handles the %pyspark paragraphs). I restarted this
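
For reference, structured streaming reads from Kafka through the spark-sql-kafka-0-10 connector rather than the DStream-oriented spark-streaming-kafka-0-10 jar mentioned above. A minimal sketch, with a placeholder broker address and topic name:

// Structured-streaming source via the spark-sql-kafka-0-10 connector;
// "broker:9092" and "topic1" are placeholders, not from the question.
val kafkaDF = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "topic1")
  .load()

// The Kafka payload arrives as binary; cast it to strings for processing.
val messages = kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")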

Spark DataFrame filtering: retain elements belonging to a list

狂风中的少年 submitted on 2019-12-22 08:37:37
Question: I am using Spark 1.5.1 with Scala in a Zeppelin notebook. I have a DataFrame with a column called userID of type Long. In total I have about 4 million rows and 200,000 unique userIDs. I also have a list of 50,000 userIDs to exclude, and I can easily build the list of userIDs to retain. What is the best way to delete all the rows that belong to the users to exclude? Another way to ask the same question is: what is the best way to keep the rows that belong to the users to retain? I saw this post and
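
One common approach for a list this size is to broadcast the IDs as a Set and filter with a UDF, rather than building a very long isin expression. A sketch assuming the DataFrame is called df and the exclusion list has already been collected on the driver (the literal IDs below are placeholders):

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._  // for the $"col" syntax

// Placeholder exclusion list; in the question it holds ~50,000 userIDs.
val excludeIDs: Set[Long] = Set(1L, 2L, 3L)

// Broadcast the set so each executor can test membership locally.
val excludeBC = sc.broadcast(excludeIDs)
val keepUser = udf((id: Long) => !excludeBC.value.contains(id))

// Keep only the rows whose userID is not in the exclusion set.
val retainedDF = df.filter(keepUser($"userID"))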

How to set spark.driver.memory for Spark/Zeppelin on EMR

£可爱£侵袭症+ submitted on 2019-12-22 07:00:03
Question: When using EMR (with Spark and Zeppelin), changing spark.driver.memory in Zeppelin's Spark interpreter settings doesn't work. What is the best and quickest way to set the Spark driver memory when using the EMR web interface (not the AWS CLI) to create clusters? Could a bootstrap action be a solution? If so, can you please provide an example of what the bootstrap action file should look like?
Answer 1: You can always try to add the following configuration on job flow/cluster creation:
[ { "Classification":
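
The answer is cut off right after the opening of the JSON. A typical EMR configuration classification that sets the driver memory through spark-defaults looks roughly like the following; the "5g" value is only an illustration and is not taken from the original answer:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "5g"
    }
  }
]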

Configure Zeppelin's Spark Interpreter on EMR when starting a cluster

孤街浪徒 submitted on 2019-12-21 16:17:49
Question: I am creating clusters on EMR and configuring Zeppelin to read the notebooks from S3. To do that I am using a JSON object that looks like this:
[
  {
    "Classification": "zeppelin-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET": "hs-zeppelin-notebooks",
          "ZEPPELIN_NOTEBOOK_USER": "user"
        },
        "Configurations": []
      }
    ]
  }
]
I am pasting this object in the