apache-spark

Getting “org.apache.zeppelin.interpreter.InterpreterException: java.io.IOException: Interpreter process is not running null”

有些话、适合烂在心里 submitted on 2021-01-07 02:49:34
Question: Hi, I am on Docker on Mac (with Kubernetes enabled) and am trying to deploy Zeppelin on K8s by following https://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html. After deploying the Zeppelin server on K8s, I try to run the Spark example but get the following exception:
org.apache.zeppelin.interpreter.InterpreterException: java.io.IOException: Interpreter process is not running null
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:134)
at org.apache

Failed to find data source: org.apache.dsext.spark.datasource.rest.RestDataSource

≡放荡痞女 submitted on 2021-01-07 02:32:57
Question: I'm using the REST Data Source and I keep running into an issue where the output says the following:
hope_prms = { 'url' : search_url , 'input' : 'new_view' , 'method' : 'GET' , 'readTimeout' : '10000' , 'connectionTimeout' : '2000' , 'partitions' : '10'}
sodasDf = spark.read.format('org.apache.dsext.spark.datasource.rest.RestDataSource').options(**hope_prms).load()
An error occurred while calling o117.load. : java.lang.ClassNotFoundException: Failed to find data source: org.apache.dsext.spark
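
A note on the truncated error above: a ClassNotFoundException for org.apache.dsext.spark.datasource.rest.RestDataSource usually just means the REST data source jar is not on the driver and executor classpath. Below is a minimal sketch of one way to attach it when building the SparkSession; the jar path and search_url are placeholders, and the option names are copied from the question.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rest-datasource-sketch")
    # Hypothetical path to a locally built spark-datasource-rest jar; point this
    # at your actual artifact (or ship it via --jars / spark.jars.packages).
    .config("spark.jars", "/path/to/spark-datasource-rest.jar")
    .getOrCreate()
)

search_url = "https://example.com/api/search"  # placeholder endpoint

hope_prms = {
    "url": search_url,
    "input": "new_view",            # value taken from the question
    "method": "GET",
    "readTimeout": "10000",
    "connectionTimeout": "2000",
    "partitions": "10",
}

sodasDf = (
    spark.read
    .format("org.apache.dsext.spark.datasource.rest.RestDataSource")
    .options(**hope_prms)
    .load()
)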

Converting XML string to Spark Dataframe in Databricks

主宰稳场 submitted on 2021-01-07 02:01:29
Question: How can I build a Spark DataFrame from a string that contains XML code? I can easily do it if the code is saved in a file:
dfXml = (sqlContext.read.format("xml")
    .options(rowTag='my_row_tag')
    .load(xml_file_name))
However, as said, I have to build the DataFrame from a string that contains regular XML. Thank you, Mauro
Answer 1: In Scala, the XmlReader class can be used to convert an RDD[String] to a DataFrame:
val result = new XmlReader().xmlRdd(spark, rdd)
If you have a DataFrame as input, it can be
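
On the PySpark side, one option that sidesteps spark-xml entirely for a single in-memory string is to parse it with the standard library on the driver and hand the rows to createDataFrame. The sketch below assumes the string holds repeated my_row_tag elements with flat child fields; the sample XML and field names are invented for illustration.

import xml.etree.ElementTree as ET
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

xml_string = """
<rows>
  <my_row_tag><id>1</id><name>alpha</name></my_row_tag>
  <my_row_tag><id>2</id><name>beta</name></my_row_tag>
</rows>
"""

root = ET.fromstring(xml_string)

# Flatten each row element into a dict of child tag -> text.
records = [{child.tag: child.text for child in row} for row in root.iter("my_row_tag")]

dfXml = spark.createDataFrame([Row(**r) for r in records])
dfXml.show()

For large or deeply nested XML this does not scale (everything is parsed on the driver), and the Scala XmlReader route mentioned in the answer, or writing the string to a file and using the xml reader from the question, is the better fit.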

Converting query from SQL to pyspark

霸气de小男生 submitted on 2021-01-07 01:37:08
Question: I am trying to convert the following SQL query into PySpark:
SELECT COUNT(
    CASE WHEN COALESCE(data.pred,0) != 0
          AND COALESCE(data.val,0) != 0
          AND (ABS(COALESCE(data.pred,0) - COALESCE(data.val,0)) / COALESCE(data.val,0)) > 0.1
         THEN data.pred END) / COUNT(*) AS Result
The code I have in PySpark right now is this:
Result = data.select( count( (coalesce(data["pred"], lit(0)) != 0) & (coalesce(data["val"], lit(0)) != 0) & (abs( coalesce(data["pred"], lit(0)) - coalesce(data["val"], lit(0)) ) /
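
The usual translation of this CASE WHEN ... THEN ... END counting idiom is count(when(condition, column)) divided by count(lit(1)): count() only skips NULLs, so wrapping the condition in when() with no otherwise() is what reproduces the CASE expression. A minimal sketch, assuming data is the DataFrame from the question:

from pyspark.sql import functions as F

cond = (
    (F.coalesce(F.col("pred"), F.lit(0)) != 0)
    & (F.coalesce(F.col("val"), F.lit(0)) != 0)
    & (
        F.abs(F.coalesce(F.col("pred"), F.lit(0)) - F.coalesce(F.col("val"), F.lit(0)))
        / F.coalesce(F.col("val"), F.lit(0))
        > 0.1
    )
)

result = data.select(
    # when() without otherwise() yields NULL for non-matching rows, which count() ignores.
    (F.count(F.when(cond, F.col("pred"))) / F.count(F.lit(1))).alias("Result")
)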

PySpark string syntax error on UDF that returns MapType(StringType(), StringType())

假如想象 submitted on 2021-01-07 01:29:08
Question: I'm getting the following syntax error:
pyspark.sql.utils.AnalysisException: syntax error in attribute name: No I am not.;
when performing some aspect sentiment classification on the text column of a Spark DataFrame df_text that looks more or less like the following:
index  id       text
1995   ev0oyrq  [sign up](
2014   eugwxff  No I am not.
2675   g9f914q  It’s hard for her to move around and even sit down, hard for her to walk and squeeze her hands. She hunches now.
1310   echja0g  Thank you!
2727   gc725t2
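
An AnalysisException of the form "syntax error in attribute name: ..." with a sentence from the data embedded in it usually means a cell value (here "No I am not.") was passed where Spark expected a column name or expression. The sketch below applies a MapType-returning UDF to the text column itself; the classification logic is a placeholder, and df_text is assumed to be the DataFrame from the question.

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

@F.udf(returnType=MapType(StringType(), StringType()))
def classify_aspects(text):
    # Placeholder logic standing in for the real aspect-sentiment model.
    text = text or ""
    return {"length": str(len(text)), "mentions_negation": str("not" in text.lower())}

# Pass the Column object, not a string value pulled from a row.
df_with_aspects = df_text.withColumn("aspects", classify_aspects(F.col("text")))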

Spark: unusually slow data write to Cloud Storage

大憨熊 submitted on 2021-01-07 01:24:25
Question: As the final stage of a PySpark job, I need to save 33 GB of data to Cloud Storage. My cluster is on Dataproc and consists of 15 n1-standard-v4 workers. I'm working with Avro, and the code I use to save the data is:
df = spark.createDataFrame(df.rdd, avro_schema_str)
df \
    .write \
    .format("avro") \
    .partitionBy('<field_with_<5_unique_values>', 'field_with_lots_of_unique_values>') \
    .save(f"gs://{output_path}")
The write stage stats from the UI: [screenshot not included in this excerpt]
My worker stats: [screenshot not included in this excerpt]
Quite strangely for the adequate
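
Two things commonly make a write like this slow: the spark.createDataFrame(df.rdd, ...) round-trip forces every row through Python serialization and back, and partitionBy on a column with many unique values fans the output into a very large number of small files. Below is a minimal sketch of one mitigation, partitioning only by the low-cardinality field; "category" is a hypothetical stand-in for the redacted column names, and output_path is assumed to be defined as in the question.

# Repartition by the same low-cardinality column so each output directory is
# written by a small number of tasks, and keep the high-cardinality field as
# an ordinary column instead of a partition key.
(
    df
    .repartition("category")
    .write
    .format("avro")
    .partitionBy("category")
    .mode("overwrite")
    .save(f"gs://{output_path}")
)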

How to list file keys in Databricks dbfs **without** dbutils

给你一囗甜甜゛ submitted on 2021-01-07 01:21:53
Question: Apparently dbutils cannot be used in command-line spark-submits; you must use JAR jobs for that. But I MUST use spark-submit-style jobs due to other requirements, yet I still need to list and iterate over file keys in DBFS to make some decisions about which files to use as input to a process... Using Scala, what library in Spark or Hadoop can I use to retrieve a list of dbfs:/ file keys matching a particular pattern?
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
def ls
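
The question asks for Scala, where the answer is the plain Hadoop org.apache.hadoop.fs API (FileSystem and Path), which is available without dbutils. To stay consistent with the other Python sketches in this digest, the same API is shown below through PySpark's internal JVM gateway; the dbfs:/ mount point and glob pattern are made up, and spark._jvm / spark._jsc are internal handles, so treat this as a sketch of the approach rather than a drop-in answer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DBFS glob; adjust the mount point and pattern to your layout.
pattern = spark._jvm.org.apache.hadoop.fs.Path("dbfs:/mnt/mydata/part-*.parquet")

# Resolve the filesystem that owns this path (DBFS on a Databricks cluster).
fs = pattern.getFileSystem(spark._jsc.hadoopConfiguration())

statuses = fs.globStatus(pattern) or []
file_keys = [status.getPath().toString() for status in statuses]
print(file_keys)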