apache-spark

Getting “org.apache.zeppelin.interpreter.InterpreterException: java.io.IOException: Interpreter process is not running null”

有些话、适合烂在心里 submitted on 2021-01-07 02:49:34
Question: Hi, I am on Docker on Mac (with Kubernetes enabled) and am trying to deploy Zeppelin on K8s by following https://zeppelin.apache.org/docs/0.9.0-SNAPSHOT/quickstart/kubernetes.html. After deploying the Zeppelin server on K8s, I try to run the Spark example but get the following exception:
org.apache.zeppelin.interpreter.InterpreterException: java.io.IOException: Interpreter process is not running null
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:134)
at org.apache

Failed to find data source: org.apache.dsext.spark.datasource.rest.RestDataSource

≡放荡痞女 submitted on 2021-01-07 02:32:57
Question: I'm using the REST Data Source and I keep running into an issue where the output says the following:
hope_prms = { 'url' : search_url , 'input' : 'new_view' , 'method' : 'GET' , 'readTimeout' : '10000' , 'connectionTimeout' : '2000' , 'partitions' : '10'}
sodasDf = spark.read.format('org.apache.dsext.spark.datasource.rest.RestDataSource').options(**hope_prms).load()
An error occurred while calling o117.load. : java.lang.ClassNotFoundException: Failed to find data source: org.apache.dsext.spark
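
A note on the truncated error above: a ClassNotFoundException for org.apache.dsext.spark.datasource.rest.RestDataSource usually just means the REST data source jar is not on the driver and executor classpath. Below is a minimal sketch of one way to attach it when building the SparkSession; the jar path and search_url are placeholders, and the option names are copied from the question.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rest-datasource-sketch")
    # Hypothetical path to a locally built spark-datasource-rest jar; point this
    # at your actual artifact (or ship it via --jars / spark.jars.packages).
    .config("spark.jars", "/path/to/spark-datasource-rest.jar")
    .getOrCreate()
)

search_url = "https://example.com/api/search"  # placeholder endpoint

hope_prms = {
    "url": search_url,
    "input": "new_view",            # value taken from the question
    "method": "GET",
    "readTimeout": "10000",
    "connectionTimeout": "2000",
    "partitions": "10",
}

sodasDf = (
    spark.read
    .format("org.apache.dsext.spark.datasource.rest.RestDataSource")
    .options(**hope_prms)
    .load()
)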

Converting XML string to Spark Dataframe in Databricks

主宰稳场 submitted on 2021-01-07 02:01:29
Question: How can I build a Spark DataFrame from a string that contains XML code? I can easily do it if the code is saved in a file:
dfXml = (sqlContext.read.format("xml")
    .options(rowTag='my_row_tag')
    .load(xml_file_name))
However, as said, I have to build the DataFrame from a string that contains regular XML. Thank you, Mauro
Answer 1: In Scala, the XmlReader class can be used to convert an RDD[String] to a DataFrame:
val result = new XmlReader().xmlRdd(spark, rdd)
If you have a DataFrame as input, it can be
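
On the PySpark side, one option that sidesteps spark-xml entirely for a single in-memory string is to parse it with the standard library on the driver and hand the rows to createDataFrame. The sketch below assumes the string holds repeated my_row_tag elements with flat child fields; the sample XML and field names are invented for illustration.

import xml.etree.ElementTree as ET
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

xml_string = """
<rows>
  <my_row_tag><id>1</id><name>alpha</name></my_row_tag>
  <my_row_tag><id>2</id><name>beta</name></my_row_tag>
</rows>
"""

root = ET.fromstring(xml_string)

# Flatten each row element into a dict of child tag -> text.
records = [{child.tag: child.text for child in row} for row in root.iter("my_row_tag")]

dfXml = spark.createDataFrame([Row(**r) for r in records])
dfXml.show()

For large or deeply nested XML this does not scale (everything is parsed on the driver), and the Scala XmlReader route mentioned in the answer, or writing the string to a file and using the xml reader from the question, is the better fit.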

Converting query from SQL to pyspark

霸气de小男生 submitted on 2021-01-07 01:37:08
Question: I am trying to convert the following SQL query into PySpark:
SELECT COUNT(
    CASE WHEN COALESCE(data.pred,0) != 0
          AND COALESCE(data.val,0) != 0
          AND (ABS(COALESCE(data.pred,0) - COALESCE(data.val,0)) / COALESCE(data.val,0)) > 0.1
         THEN data.pred END) / COUNT(*) AS Result
The code I have in PySpark right now is this:
Result = data.select( count( (coalesce(data["pred"], lit(0)) != 0) & (coalesce(data["val"], lit(0)) != 0) & (abs( coalesce(data["pred"], lit(0)) - coalesce(data["val"], lit(0)) ) /
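
The usual translation of this CASE WHEN ... THEN ... END counting idiom is count(when(condition, column)) divided by count(lit(1)): count() only skips NULLs, so wrapping the condition in when() with no otherwise() is what reproduces the CASE expression. A minimal sketch, assuming data is the DataFrame from the question:

from pyspark.sql import functions as F

cond = (
    (F.coalesce(F.col("pred"), F.lit(0)) != 0)
    & (F.coalesce(F.col("val"), F.lit(0)) != 0)
    & (
        F.abs(F.coalesce(F.col("pred"), F.lit(0)) - F.coalesce(F.col("val"), F.lit(0)))
        / F.coalesce(F.col("val"), F.lit(0))
        > 0.1
    )
)

result = data.select(
    # when() without otherwise() yields NULL for non-matching rows, which count() ignores.
    (F.count(F.when(cond, F.col("pred"))) / F.count(F.lit(1))).alias("Result")
)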

PySpark string syntax error on UDF that returns MapType(StringType(), StringType())

假如想象 submitted on 2021-01-07 01:29:08
Question: I'm getting the following syntax error:
pyspark.sql.utils.AnalysisException: syntax error in attribute name: No I am not.;
when performing some aspect sentiment classification on the text column of a Spark DataFrame df_text that looks more or less like the following:
index  id       text
1995   ev0oyrq  [sign up](
2014   eugwxff  No I am not.
2675   g9f914q  It’s hard for her to move around and even sit down, hard for her to walk and squeeze her hands. She hunches now.
1310   echja0g  Thank you!
2727   gc725t2
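
An AnalysisException of the form "syntax error in attribute name: ..." with a sentence from the data embedded in it usually means a cell value (here "No I am not.") was passed where Spark expected a column name or expression. The sketch below applies a MapType-returning UDF to the text column itself; the classification logic is a placeholder, and df_text is assumed to be the DataFrame from the question.

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

@F.udf(returnType=MapType(StringType(), StringType()))
def classify_aspects(text):
    # Placeholder logic standing in for the real aspect-sentiment model.
    text = text or ""
    return {"length": str(len(text)), "mentions_negation": str("not" in text.lower())}

# Pass the Column object, not a string value pulled from a row.
df_with_aspects = df_text.withColumn("aspects", classify_aspects(F.col("text")))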

Spark: unusually slow data write to Cloud Storage

大憨熊 submitted on 2021-01-07 01:24:25
Question: As the final stage of a PySpark job, I need to save 33 GB of data to Cloud Storage. My cluster is on Dataproc and consists of 15 n1-standard-v4 workers. I'm working with Avro, and the code I use to save the data is:
df = spark.createDataFrame(df.rdd, avro_schema_str)
df \
    .write \
    .format("avro") \
    .partitionBy('<field_with_<5_unique_values>', 'field_with_lots_of_unique_values>') \
    .save(f"gs://{output_path}")
The write stage stats from the UI: [screenshot not included in this excerpt]
My worker stats: [screenshot not included in this excerpt]
Quite strangely for the adequate
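
Two things commonly make a write like this slow: the spark.createDataFrame(df.rdd, ...) round-trip forces every row through Python serialization and back, and partitionBy on a column with many unique values fans the output into a very large number of small files. Below is a minimal sketch of one mitigation, partitioning only by the low-cardinality field; "category" is a hypothetical stand-in for the redacted column names, and output_path is assumed to be defined as in the question.

# Repartition by the same low-cardinality column so each output directory is
# written by a small number of tasks, and keep the high-cardinality field as
# an ordinary column instead of a partition key.
(
    df
    .repartition("category")
    .write
    .format("avro")
    .partitionBy("category")
    .mode("overwrite")
    .save(f"gs://{output_path}")
)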

How to list file keys in Databricks dbfs **without** dbutils

给你一囗甜甜゛ submitted on 2021-01-07 01:21:53
Question: Apparently dbutils cannot be used in command-line spark-submits; you must use JAR jobs for that. But I MUST use spark-submit-style jobs due to other requirements, yet I still need to list and iterate over file keys in DBFS to make some decisions about which files to use as input to a process... Using Scala, what library in Spark or Hadoop can I use to retrieve a list of dbfs:/ file keys matching a particular pattern?
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
def ls
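
The question asks for Scala, where the answer is the plain Hadoop org.apache.hadoop.fs API (FileSystem and Path), which is available without dbutils. To stay consistent with the other Python sketches in this digest, the same API is shown below through PySpark's internal JVM gateway; the dbfs:/ mount point and glob pattern are made up, and spark._jvm / spark._jsc are internal handles, so treat this as a sketch of the approach rather than a drop-in answer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DBFS glob; adjust the mount point and pattern to your layout.
pattern = spark._jvm.org.apache.hadoop.fs.Path("dbfs:/mnt/mydata/part-*.parquet")

# Resolve the filesystem that owns this path (DBFS on a Databricks cluster).
fs = pattern.getFileSystem(spark._jsc.hadoopConfiguration())

statuses = fs.globStatus(pattern) or []
file_keys = [status.getPath().toString() for status in statuses]
print(file_keys)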