apache-zeppelin

Apache Zeppelin & Spark Streaming: Twitter example only works locally

百般思念 submitted on 2019-12-06 13:35:58
Question: I just added the example project from http://zeppelin-project.org/docs/tutorial/tutorial.html (section "Tutorial with Streaming Data") to my Zeppelin notebook. The problem is that the application only seems to work locally. If I change the Spark interpreter setting "master" from "local[*]" to "spark://master:7077", the same SQL statement no longer returns any results. Am I doing anything wrong? I have already restarted the Zeppelin interpreter, also the…
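For context, a minimal sketch of the tutorial's streaming pattern that this question refers to (Spark 1.x APIs assumed; the case class name Tweet and the batch interval are illustrative, not taken from the asker's notebook): each micro-batch is turned into a DataFrame and registered as a temp table, which a later %sql paragraph queries.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

case class Tweet(createdAt: Long, text: String)

val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)

tweets.map(s => Tweet(s.getCreatedAt.getTime / 1000, s.getText))
  .foreachRDD { rdd =>
    // The temp table lives in the driver's SQLContext; when running against a
    // remote master, the %sql paragraph must use the same interpreter/context,
    // otherwise the table will appear empty.
    sqlContext.createDataFrame(rdd).registerTempTable("tweets")
  }

ssc.start()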

Connect Apache Zeppelin to Hive

隐身守侯 submitted on 2019-12-06 13:34:04
I am trying to connect Apache Zeppelin to my Hive metastore. I use Zeppelin 0.7.3, so there is no Hive interpreter, only JDBC. I have copied my hive-site.xml to the Zeppelin conf folder, but I don't know how to create a new Hive interpreter. I also tried to access Hive tables through Spark's Hive context, but when I do it that way I cannot see my Hive databases; only a default database is shown. Can someone explain either how to create a Hive interpreter or how to access my Hive metastore through Spark correctly? Any answer is appreciated. I solved it by following this documentation. After adding…
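A minimal sketch of the Spark route, assuming Spark 2.x, that hive-site.xml (with hive.metastore.uris set) is also visible to Spark's configuration, and that zeppelin.spark.useHiveContext is enabled in the Spark interpreter settings. Seeing only a "default" database usually means Spark fell back to a local Derby metastore instead of the real one.

%spark
// With Hive support enabled, this should list the real metastore's databases,
// not just "default".
spark.sql("show databases").show()

// If a session is built by hand, Hive support must be requested explicitly.
import org.apache.spark.sql.SparkSession
val hiveSpark = SparkSession.builder()
  .appName("zeppelin-hive-check")
  .enableHiveSupport()
  .getOrCreate()
hiveSpark.sql("show databases").show()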

How to use dependencies from S3 in Zeppelin?

跟風遠走 submitted on 2019-12-06 12:23:10
Question: Is there a way to add jars that are in a bucket on S3 as a dependency of Zeppelin? I tried z.load(s3n://...) and z.addRepo(some_name).url(s3n://...), but they don't seem to do the job.

Answer 1: You could download the jars from S3 and put them on the local file system. It can be done inside the %dep interpreter, like this:

%dep
import com.amazonaws.services.s3.AmazonS3Client
import java.io.File
import java.nio.file.{Files, StandardCopyOption}

val dest = "/tmp/dependency.jar"
val s3 = new AmazonS3Client()
val stream =…
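The answer is cut off above. A sketch of how the same idea could be completed (the bucket and key names are placeholders, not from the original answer):

%dep
import com.amazonaws.services.s3.AmazonS3Client
import java.nio.file.{Files, Paths, StandardCopyOption}

val bucket = "my-bucket"            // placeholder
val key    = "jars/dependency.jar"  // placeholder
val dest   = "/tmp/dependency.jar"

// Stream the S3 object down to the local file system...
val s3 = new AmazonS3Client()
val obj = s3.getObject(bucket, key)
Files.copy(obj.getObjectContent, Paths.get(dest), StandardCopyOption.REPLACE_EXISTING)

// ...and load the local copy as a notebook dependency.
z.load(dest)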

Spark 1.6.1: Task not serializable when evaluating a classifier on a DataFrame

最后都变了- submitted on 2019-12-06 09:13:45
I have a DataFrame that I map into an RDD of (score, label) pairs to test an SVMModel. I am using Zeppelin and Spark 1.6.1. Here is my code:

val loadedSVMModel = SVMModel.load(sc, pathToSvmModel)

// Clear the default threshold.
loadedSVMModel.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = df.select($"features", $"label")
  .map { case Row(features: Vector, label: Double) =>
    val score = loadedSVMModel.predict(features)
    (score, label)
  }

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " +…
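Task-not-serializable errors in notebooks frequently come from the closure capturing the enclosing, non-serializable interpreter object rather than the model itself. A sketch of a common workaround under that assumption (not confirmed to be the cause in this exact case): bind the model to a local val and map over the underlying RDD.

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// A plain local val keeps the closure from dragging in the enclosing object.
val model = loadedSVMModel

val scoreAndLabels = df.select("features", "label").rdd.map {
  case Row(features: Vector, label: Double) => (model.predict(features), label)
}

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println("Area under ROC = " + metrics.areaUnderROC())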

Apache Zeppelin - Set default interpreter

回眸只為那壹抹淺笑 submitted on 2019-12-06 02:36:47
Question: In Zeppelin I am having to specify the interpreter in every paragraph. Is there a way to set the interpreter for the whole session?

%pyspark
import re
Took 0 seconds.

import pandas as pd
console :1: error: '.' expected but identifier found.
import pandas as pd

%pyspark
import pandas as pd
Took 0 seconds.

How do I set the interpreter for the whole session?

Answer 1: The Spark interpreter group currently has 4 interpreters, as listed here: https://zeppelin.incubator.apache.org/docs/0.5.0…
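In Zeppelin releases of that era, the default interpreter within a group is the first one listed in the zeppelin.interpreters property. A sketch of conf/zeppelin-site.xml that puts PySpark first so bare paragraphs default to it (ordering semantics as I recall them for 0.5.x; verify against the linked docs for your version):

<property>
  <name>zeppelin.interpreters</name>
  <value>org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.spark.DepInterpreter</value>
</property>

Zeppelin has to be restarted for the change to take effect.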

Apache zeppelin build process failure in zeppelin-web with bower

房东的猫 submitted on 2019-12-05 22:51:52
I am trying to build Zeppelin locally on Windows with babun/cygwin. This site got me headed in the right direction, but I run into the following error when the build gets to the web application:

[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.23:bower (bower install) on project zeppelin-web: Failed to run task: 'bower --allow-root install' failed. (error code 8) -> [Help 1]

I can go into the zeppelin-web directory and run bower install successfully, but I'm not sure where to go from here. If I try mvn install -DskipTests, it tries to run the bower command again…
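A sketch of one commonly suggested remedy for bower exiting with code 8 under cygwin, not a confirmed fix for this particular build: force git to use https instead of the git:// protocol and clear bower's cache before retrying.

# bower failures with error code 8 are often git-protocol or proxy problems;
# rewriting git:// URLs to https and clearing the cache sometimes resolves them.
git config --global url."https://".insteadOf git://
cd zeppelin-web
bower cache clean
cd ..

# Then retry the full build.
mvn clean install -DskipTests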

Is it possible to integrate Zeppelin notes with git?

南笙酒味 submitted on 2019-12-05 18:59:04
Is it possible to integrate Zeppelin notes with git? One can set the repository location, but how can that be pointed at a remote git repository? This functionality is, however, available on Amazon EMR.

Yes, it's possible. I use the following approach. Create a GitHub repo and push all notebooks:

git clone https://github.com/rockiey/zeppelin-notebooks.git
cd zeppelin-notebooks
cp -rf ../zeppelin/notebook/* .
git add -A
git commit -m "init"
git push

Delete the notebook directory:

cd zeppelin
rm -rf notebook

Clone the GitHub repo to notebook:

cd zeppelin
git clone https://github.com/rockiey/zeppelin-notebooks.git

How is an imported name resolved in Scala? (Spark / Zeppelin)

落花浮王杯 submitted on 2019-12-05 12:49:57
I have a script running in a paragraph with the Spark interpreter in Zeppelin. It has an import, and the imported name can be resolved from the global namespace and from a function, but not from a method inside a class. This runs fine on my computer's installation of Scala (2.12), but it doesn't work in Zeppelin (Scala 2.11).

import java.util.Calendar

def myFun: String = {
  // this works
  return Calendar.getInstance.toString
}

class MyClass {
  def myFun(): String = {
    // this doesn't
    return Calendar.getInstance.toString
    // this works
    return java.util.Calendar.getInstance.toString
  }
}

The error…
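A workaround sketch, on the assumption that the Scala 2.11 REPL's wrapping of paragraph code is what shadows the top-level import inside user-defined classes (not confirmed as the root cause here): repeat the import inside the class, or keep using the fully qualified name as the question already does.

class MyClass {
  // Importing inside the class keeps the name resolvable regardless of how
  // the interpreter wraps the surrounding paragraph.
  import java.util.Calendar

  def myFun(): String = Calendar.getInstance.toString
}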

Spark DataFrame filtering: retain elements belonging to a list

谁都会走 submitted on 2019-12-05 12:13:56
I am using Spark 1.5.1 with Scala in a Zeppelin notebook. I have a DataFrame with a column called userID of type Long. In total I have about 4 million rows and 200,000 unique userIDs. I also have a list of 50,000 userIDs to exclude, and I can easily build the list of userIDs to retain. What is the best way to delete all the rows that belong to the users to exclude? Another way to ask the same question: what is the best way to keep the rows that belong to the users to retain? I saw this post and applied its solution (see the code below), but the execution is slow, knowing that I am running Spark 1…
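Two common approaches for this kind of membership filter, sketched under the question's setup; the collection names excludeUserIDs and retainUserIDs are placeholders, and neither snippet is claimed to be the solution from the post the asker tried.

import org.apache.spark.sql.functions.{broadcast, udf}
import sqlContext.implicits._

// Approach 1: broadcast the exclusion list as a Set and filter with a UDF.
// 50,000 Long IDs fit comfortably in memory, so each executor gets one copy.
val excludeSet = sc.broadcast(excludeUserIDs.toSet)   // excludeUserIDs: Seq[Long], placeholder
val keepRow = udf((id: Long) => !excludeSet.value.contains(id))
val filteredDF = df.filter(keepRow($"userID"))

// Approach 2: inner-join against the IDs to retain, hinting a broadcast join
// (Spark 1.5 has no left-anti join, so join on the "keep" side instead).
val retainDF = retainUserIDs.map(Tuple1(_)).toDF("userID") // retainUserIDs: Seq[Long], placeholder
val keptDF = df.join(broadcast(retainDF), "userID")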

Spark 1.6: filtering DataFrames generated by describe()

a 夏天 submitted on 2019-12-05 02:44:43
The problem arises when I call the describe function on a DataFrame:

val statsDF = myDataFrame.describe()

Calling describe yields the following output:

statsDF: org.apache.spark.sql.DataFrame = [summary: string, count: string]

I can show statsDF normally by calling statsDF.show():

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             53173|
|   mean|104.76128862392568|
| stddev|3577.8184333911513|
|    min|                 1|
|    max|            558407|
+-------+------------------+

I would now like to get the standard deviation and the mean from statsDF, but when I am trying to collect the…
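A sketch of one way to pull individual statistics out of the describe() output; describe() returns every value as a string, so an explicit conversion is needed (access by column position assumes the two-column shape shown above).

// The summary DataFrame is tiny, so collecting it to the driver is cheap;
// index the rows by the "summary" column and convert the string values.
val stats = statsDF.collect()
  .map(row => row.getString(0) -> row.getString(1).toDouble)
  .toMap

val mean   = stats("mean")
val stddev = stats("stddev")
println(s"mean=$mean stddev=$stddev")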