apache-spark

How can I make pyspark and SparkSQL execute Hive on Spark?

独自空忆成欢 submitted on 2021-02-11 16:59:59
Question: I've installed and set up Spark on YARN and integrated Spark with Hive tables. Using spark-shell / pyspark, I followed the simple tutorial and was able to create a Hive table, load data, and select from it properly. I then moved to the next step, setting up Hive on Spark. Using hive / beeline, I was likewise able to create a Hive table, load data, and select from it properly, and Hive is executed on YARN/Spark as expected. How do I know it works? The hive shell displays the following: - hive> select
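A frequent source of confusion here is the direction of the integration: Hive on Spark means Hive uses Spark as its execution engine, whereas Spark reading Hive tables only requires Hive support in the session. A minimal sketch of the latter, assuming hive-site.xml is on Spark's classpath and using a hypothetical table name:

import org.apache.spark.sql.SparkSession

object HiveTablesFromSpark {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport points the session at the Hive metastore,
    // assuming hive-site.xml is available (e.g. in $SPARK_HOME/conf).
    val spark = SparkSession.builder()
      .appName("hive-tables-from-spark")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical database and table name.
    spark.sql("SELECT * FROM some_db.some_table LIMIT 10").show()
    spark.stop()
  }
}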

Converting dataframe to dictionary in pyspark without using pandas

大城市里の小女人 submitted on 2021-02-11 16:55:20
Question: Following up on this question and its dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using: dictionary = df_2.unstack().to_dict(orient='index') However, I need to convert this code to pyspark. Can anyone help me with this? As I understand from previous questions such as this one, I would indeed need to use pandas, but the dataframe is far too big for that. How can I solve this? EDIT: I have now tried the following approach: dictionary_list = map
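The usual workaround when pandas cannot be used is to collect a small key/value DataFrame to the driver and build the map there; this is only viable when the result fits in driver memory. A minimal sketch of that idea (written in Scala, as the later entries in this listing are; the column names and values are assumptions, and the analogous collect-based calls exist in pyspark):

import org.apache.spark.sql.SparkSession

object DataFrameToMap {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-to-map").getOrCreate()
    import spark.implicits._

    // Hypothetical small key/value DataFrame standing in for df_2.
    val df = Seq(("a", 1.0), ("b", 2.0)).toDF("key", "value")

    // Collect to the driver and build a plain Map; only safe when the
    // DataFrame (and hence the resulting map) fits on the driver.
    val asMap: Map[String, Double] =
      df.collect().map(r => r.getString(0) -> r.getDouble(1)).toMap

    println(asMap)
    spark.stop()
  }
}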

libraryDependencies Spark in build.sbt error (IntelliJ)

十年热恋 submitted on 2021-02-11 16:53:30
Question: I am trying to learn Scala with Spark. I am following a tutorial, but I get an error when I try to import the Spark library dependency: libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3" I am getting the following error, and I have 3 unknown artifacts. What could be the problem here? My code is very simple, just a Hello World. Answer 1: You probably need to add this to your build.sbt: resolvers += "spark-core" at "https://mvnrepository.com/artifact/org.apache.spark
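For comparison, a minimal build.sbt that resolves spark-core 2.4.3 from Maven Central is sketched below; the project name and Scala version are assumptions. Spark 2.4.x artifacts are published only for Scala 2.11 and 2.12, so a project set to a different Scala version is a common cause of unresolved ("unknown") artifacts:

// build.sbt -- minimal sketch; name and versions are assumptions.
name := "spark-hello-world"

// spark-core 2.4.x is published for Scala 2.11 and 2.12 only.
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"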

Error when connecting Spark Structured Streaming + Kafka

别说谁变了你拦得住时间么 submitted on 2021-02-11 15:45:49
Question: I'm trying to connect my Structured Streaming Spark 2.4.5 application with Kafka, but every time I try, this Data Source Provider error appears. My Scala code and my sbt build follow: import org.apache.spark.sql._ import org.apache.spark.sql.types._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.streaming.Trigger object streaming_app_demo { def main(args: Array[String]): Unit = { println("Spark Structured Streaming with Kafka Demo Application Started ...") val KAFKA_TOPIC
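A "failed to find data source" style error for Kafka usually means the Kafka connector is missing from the build; spark-sql-kafka-0-10 has to be added alongside spark-sql. A minimal sketch under that assumption (the broker address and topic name are placeholders):

// build.sbt: the connector this error usually points to (match your Spark/Scala version):
// libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.5"

import org.apache.spark.sql.SparkSession

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-demo").getOrCreate()

    // Placeholder broker and topic.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test_topic")
      .load()

    val query = df.selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}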

Training ML models on Spark per partition, such that there is one trained model per partition of the dataframe

余生颓废 submitted on 2021-02-11 15:41:37
Question: How can I do parallel model training per partition in Spark using Scala? The solution given here is in PySpark; I'm looking for a solution in Scala. How can you efficiently build one ML model per partition in Spark with foreachPartition? Answer 1: Get the distinct partitions using the partition column, create a thread pool of, say, 100 threads, and create a Future for each thread and run it. Sample code may be as follows: // Get an ExecutorService val threadPoolExecutorService = getExecutionContext("name", 100) //
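The answer's preview cuts off before the full code (and getExecutionContext there is a helper that isn't shown). A self-contained sketch of the same idea, training one Spark ML model per distinct value of a hypothetical group column from a plain thread pool, could look like this:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object PerGroupTraining {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("per-group-training").getOrCreate()

    // Hypothetical input with "group", "features" and "label" columns already prepared.
    val df = spark.read.parquet("/path/to/training_data")

    // Jobs submitted from different threads run concurrently on the cluster.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

    val groups = df.select("group").distinct().collect().map(_.getString(0))

    val futures = groups.toSeq.map { g =>
      Future {
        val model = new LinearRegression().fit(df.filter(df("group") === g))
        g -> model
      }
    }

    val models = Await.result(Future.sequence(futures), Duration.Inf).toMap
    println(s"Trained ${models.size} models")
    spark.stop()
  }
}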

Reading from S3 in EMR

萝らか妹 submitted on 2021-02-11 15:23:17
Question: I'm having trouble reading CSV files stored in my AWS S3 bucket from EMR. I have read quite a few posts about it and have done the following to make it work: added an IAM policy allowing read and write access to S3, and tried to pass the URIs in the Argument section of the spark-submit request. I thought querying S3 from EMR on a shared account was straightforward (because it works locally after defining a fileSystem and providing AWS credentials), but when I run: df = spark.read.option(
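On EMR the s3:// scheme is served by EMRFS, so credentials normally come from the cluster's IAM instance profile rather than from keys set in code. A minimal read sketch (shown in Scala; the bucket and key are placeholders, and the equivalent spark.read.option(...).csv(...) call exists in pyspark):

import org.apache.spark.sql.SparkSession

object S3ReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-read").getOrCreate()

    // Placeholder bucket and key; on EMR, s3:// goes through EMRFS and
    // the instance profile's IAM role supplies the credentials.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://my-bucket/path/to/file.csv")

    df.show(5)
    spark.stop()
  }
}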

Spark 3 Typed User Defined Aggregate Function over Window

*爱你&永不变心* submitted on 2021-02-11 15:12:56
Question: I am trying to use a custom user-defined aggregator over a window. When I use an untyped aggregator, the query works. However, I am unable to use a typed UDAF as a window function; I get an error stating The query operator `Project` contains one or more unsupported expression types Aggregate, Window or Generate. The following basic program showcases the problem. I think it could work using UserDefinedAggregateFunction rather than Aggregator, but the former is deprecated. import scala
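One approach commonly suggested for Spark 3 is to wrap the typed Aggregator with functions.udaf, which yields an untyped UserDefinedFunction that can appear in a window expression; whether the window accepts it can depend on the 3.x minor version. A minimal sketch with a trivial sum aggregator and assumed column names:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession, functions}
import org.apache.spark.sql.expressions.{Aggregator, Window}

object TypedUdafOverWindow {
  // Trivial typed aggregator: sums doubles.
  object SumAgg extends Aggregator[Double, Double, Double] {
    def zero: Double = 0.0
    def reduce(b: Double, a: Double): Double = b + a
    def merge(b1: Double, b2: Double): Double = b1 + b2
    def finish(r: Double): Double = r
    def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
    def outputEncoder: Encoder[Double] = Encoders.scalaDouble
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("typed-udaf-window").getOrCreate()
    import spark.implicits._

    // Assumed columns: a grouping key and a numeric value.
    val df = Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)).toDF("grp", "x")

    // functions.udaf turns the Aggregator into an untyped UDF usable over a window.
    val sumUdaf = functions.udaf(SumAgg)

    val w = Window.partitionBy("grp")
    df.withColumn("grp_sum", sumUdaf($"x").over(w)).show()

    spark.stop()
  }
}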