apache-zeppelin

Apache Zeppelin & Spark Streaming: Twitter example only works locally

百般思念 submitted on 2019-12-06 13:35:58
Question: I just added the example project from http://zeppelin-project.org/docs/tutorial/tutorial.html (section "Tutorial with Streaming Data") to my Zeppelin notebook. The problem is that the application only seems to work locally. If I change the Spark interpreter setting "master" from "local[*]" to "spark://master:7077", the same SQL statement no longer returns any results. Am I doing anything wrong? I have already restarted the Zeppelin interpreter, also the…
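For context, a minimal sketch of the tutorial's streaming pattern that this question refers to (Spark 1.x APIs assumed; the case class name Tweet and the batch interval are illustrative, not taken from the asker's notebook): each micro-batch is turned into a DataFrame and registered as a temp table, which a later %sql paragraph queries.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

case class Tweet(createdAt: Long, text: String)

val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)

tweets.map(s => Tweet(s.getCreatedAt.getTime / 1000, s.getText))
  .foreachRDD { rdd =>
    // The temp table lives in the driver's SQLContext; when running against a
    // remote master, the %sql paragraph must use the same interpreter/context,
    // otherwise the table will appear empty.
    sqlContext.createDataFrame(rdd).registerTempTable("tweets")
  }

ssc.start()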

Connect Apache Zeppelin to Hive

隐身守侯 submitted on 2019-12-06 13:34:04
I am trying to connect Apache Zeppelin to my Hive metastore. I use Zeppelin 0.7.3, so there is no Hive interpreter, only JDBC. I have copied my hive-site.xml to the Zeppelin conf folder, but I don't know how to create a new Hive interpreter. I also tried to access Hive tables through Spark's Hive context, but when I do it that way I cannot see my Hive databases; only a default database is shown. Can someone explain either how to create a Hive interpreter or how to access my Hive metastore through Spark correctly? Any answer is appreciated. I solved it by following this documentation. After adding…
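A minimal sketch of the Spark route, assuming Spark 2.x, that hive-site.xml (with hive.metastore.uris set) is also visible to Spark's configuration, and that zeppelin.spark.useHiveContext is enabled in the Spark interpreter settings. Seeing only a "default" database usually means Spark fell back to a local Derby metastore instead of the real one.

%spark
// With Hive support enabled, this should list the real metastore's databases,
// not just "default".
spark.sql("show databases").show()

// If a session is built by hand, Hive support must be requested explicitly.
import org.apache.spark.sql.SparkSession
val hiveSpark = SparkSession.builder()
  .appName("zeppelin-hive-check")
  .enableHiveSupport()
  .getOrCreate()
hiveSpark.sql("show databases").show()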

How to use dependencies from S3 in Zeppelin?

跟風遠走 submitted on 2019-12-06 12:23:10
Question: Is there a way to add jars that are in a bucket on S3 as a dependency of Zeppelin? I tried z.load(s3n://...) and z.addRepo(some_name).url(s3n://...), but they don't seem to do the job.

Answer 1: You could download the jars from S3 and put them on the local file system. It can be done inside the %dep interpreter, like this:

%dep
import com.amazonaws.services.s3.AmazonS3Client
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

val dest = "/tmp/dependency.jar"
val s3 = new AmazonS3Client()
val stream =…
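The answer is cut off above. A sketch of how the same idea could be completed (the bucket and key names are placeholders, not from the original answer):

%dep
import com.amazonaws.services.s3.AmazonS3Client
import java.nio.file.{Files, Paths, StandardCopyOption}

val bucket = "my-bucket"            // placeholder
val key    = "jars/dependency.jar"  // placeholder
val dest   = "/tmp/dependency.jar"

// Stream the S3 object down to the local file system...
val s3 = new AmazonS3Client()
val obj = s3.getObject(bucket, key)
Files.copy(obj.getObjectContent, Paths.get(dest), StandardCopyOption.REPLACE_EXISTING)

// ...and load the local copy as a notebook dependency.
z.load(dest)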

Spark 1.6.1: Task not serializable when evaluating a classifier on a DataFrame

最后都变了- submitted on 2019-12-06 09:13:45
I have a DataFrame that I map into an RDD of (score, label) pairs to test an SVMModel. I am using Zeppelin and Spark 1.6.1. Here is my code:

val loadedSVMModel = SVMModel.load(sc, pathToSvmModel)

// Clear the default threshold.
loadedSVMModel.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = df.select($"features", $"label")
  .map { case Row(features: Vector, label: Double) =>
    val score = loadedSVMModel.predict(features)
    (score, label)
  }

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " +…
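Task-not-serializable errors in notebooks frequently come from the closure capturing the enclosing, non-serializable interpreter object rather than the model itself. A sketch of a common workaround under that assumption (not confirmed to be the cause in this exact case): bind the model to a local val and map over the underlying RDD.

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// A plain local val keeps the closure from dragging in the enclosing object.
val model = loadedSVMModel

val scoreAndLabels = df.select("features", "label").rdd.map {
  case Row(features: Vector, label: Double) => (model.predict(features), label)
}

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println("Area under ROC = " + metrics.areaUnderROC())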

Apache Zeppelin - Set default interpreter

回眸只為那壹抹淺笑 submitted on 2019-12-06 02:36:47
Question: In Zeppelin I am having to specify the interpreter in every paragraph. Is there a way to set the interpreter for the whole session?

%pyspark
import re
Took 0 seconds.

import pandas as pd
console :1: error: '.' expected but identifier found.
import pandas as pd

%pyspark
import pandas as pd
Took 0 seconds.

How do I set the interpreter for the whole session?

Answer 1: The Spark interpreter group currently has 4 interpreters, as listed here: https://zeppelin.incubator.apache.org/docs/0.5.0…
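In Zeppelin releases of that era, the default interpreter within a group is the first one listed in the zeppelin.interpreters property. A sketch of conf/zeppelin-site.xml that puts PySpark first so bare paragraphs default to it (ordering semantics as I recall them for 0.5.x; verify against the linked docs for your version):

<property>
  <name>zeppelin.interpreters</name>
  <value>org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.spark.DepInterpreter</value>
</property>

Zeppelin has to be restarted for the change to take effect.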

Apache zeppelin build process failure in zeppelin-web with bower

房东的猫 submitted on 2019-12-05 22:51:52
I am trying to build Zeppelin locally on Windows with babun/cygwin. This site got me headed in the right direction, but I run into the following error when the build gets to the web application:

[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.23:bower (bower install) on project zeppelin-web: Failed to run task: 'bower --allow-root install' failed. (error code 8) -> [Help 1]

I can go into the zeppelin-web directory and run bower install successfully, but I'm not sure where to go from here. If I try mvn install -DskipTests, it tries to run the bower command again…
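A sketch of one commonly suggested remedy for bower exiting with code 8 under cygwin, not a confirmed fix for this particular build: force git to use https instead of the git:// protocol and clear bower's cache before retrying.

# bower failures with error code 8 are often git-protocol or proxy problems;
# rewriting git:// URLs to https and clearing the cache sometimes resolves them.
git config --global url."https://".insteadOf git://
cd zeppelin-web
bower cache clean
cd ..

# Then retry the full build.
mvn clean install -DskipTests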

Is it possible to integrate Zeppelin notes with git?

南笙酒味 submitted on 2019-12-05 18:59:04
Is it possible to integrate Zeppelin notes with git? One can set the repository location, but how can that be pointed at a remote git repository? This functionality is, however, available on Amazon EMR.

Yes, it's possible. I use the following approach. Create a GitHub repo and push all notebooks:

git clone https://github.com/rockiey/zeppelin-notebooks.git
cd zeppelin-notebooks
cp -rf ../zeppelin/notebook/* .
git add -A
git commit -m "init"
git push

Delete the notebook directory:

cd zeppelin
rm -rf notebook

Clone the GitHub repo to notebook:

cd zeppelin
git clone https://github.com/rockiey/zeppelin-notebooks.git

How is an imported name resolved in Scala? (Spark / Zeppelin)

落花浮王杯 submitted on 2019-12-05 12:49:57
I have a script running in a paragraph with the Spark interpreter in Zeppelin. It has an import, and the imported name can be resolved from the global namespace and from a function, but not from a method inside a class. This runs fine on my computer's installation of Scala (2.12), but it doesn't work in Zeppelin (Scala 2.11).

import java.util.Calendar

def myFun: String = {
  // this works
  return Calendar.getInstance.toString
}

class MyClass {
  def myFun(): String = {
    // this doesn't
    return Calendar.getInstance.toString
    // this works
    return java.util.Calendar.getInstance.toString
  }
}

The error…
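A workaround sketch, on the assumption that the Scala 2.11 REPL's wrapping of paragraph code is what shadows the top-level import inside user-defined classes (not confirmed as the root cause here): repeat the import inside the class, or keep using the fully qualified name as the question already does.

class MyClass {
  // Importing inside the class keeps the name resolvable regardless of how
  // the interpreter wraps the surrounding paragraph.
  import java.util.Calendar

  def myFun(): String = Calendar.getInstance.toString
}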

Spark DataFrame filtering: retain elements belonging to a list

谁都会走 submitted on 2019-12-05 12:13:56
I am using Spark 1.5.1 with Scala in a Zeppelin notebook. I have a DataFrame with a column called userID of type Long. In total I have about 4 million rows and 200,000 unique userIDs. I also have a list of 50,000 userIDs to exclude, and I can easily build the list of userIDs to retain. What is the best way to delete all the rows that belong to the users to exclude? Another way to ask the same question: what is the best way to keep the rows that belong to the users to retain? I saw this post and applied its solution (see the code below), but the execution is slow, knowing that I am running Spark 1…
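Two common approaches for this kind of membership filter, sketched under the question's setup; the collection names excludeUserIDs and retainUserIDs are placeholders, and neither snippet is claimed to be the solution from the post the asker tried.

import org.apache.spark.sql.functions.{broadcast, udf}
import sqlContext.implicits._

// Approach 1: broadcast the exclusion list as a Set and filter with a UDF.
// 50,000 Long IDs fit comfortably in memory, so each executor gets one copy.
val excludeSet = sc.broadcast(excludeUserIDs.toSet)   // excludeUserIDs: Seq[Long], placeholder
val keepRow = udf((id: Long) => !excludeSet.value.contains(id))
val filteredDF = df.filter(keepRow($"userID"))

// Approach 2: inner-join against the IDs to retain, hinting a broadcast join
// (Spark 1.5 has no left-anti join, so join on the "keep" side instead).
val retainDF = retainUserIDs.map(Tuple1(_)).toDF("userID") // retainUserIDs: Seq[Long], placeholder
val keptDF = df.join(broadcast(retainDF), "userID")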

Spark 1.6: filtering DataFrames generated by describe()

a 夏天 submitted on 2019-12-05 02:44:43
The problem arises when I call the describe function on a DataFrame:

val statsDF = myDataFrame.describe()

Calling describe yields the following output:

statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]

I can show statsDF normally by calling statsDF.show():

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             53173|
|   mean|104.76128862392568|
| stddev|3577.8184333911513|
|    min|                 1|
|    max|            558407|
+-------+------------------+

I would now like to get the standard deviation and the mean from statsDF, but when I am trying to collect the…
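A sketch of one way to pull individual statistics out of the describe() output; describe() returns every value as a string, so an explicit conversion is needed (access by column position assumes the two-column shape shown above).

// The summary DataFrame is tiny, so collecting it to the driver is cheap;
// index the rows by the "summary" column and convert the string values.
val stats = statsDF.collect()
  .map(row => row.getString(0) -> row.getString(1).toDouble)
  .toMap

val mean   = stats("mean")
val stddev = stats("stddev")
println(s"mean=$mean stddev=$stddev")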