apache-spark

How can I make pyspark and SparkSQL execute Hive on Spark?

独自空忆成欢 submitted on 2021-02-11 16:59:59
Question: I've installed and set up Spark on YARN and integrated Spark with Hive tables. Using spark-shell / pyspark, I followed the simple tutorial and was able to create a Hive table, load data, and select from it properly. I then moved to the next step, setting up Hive on Spark. Using hive / beeline, I was likewise able to create a Hive table, load data, and select from it properly, and Hive is executed on YARN/Spark as expected. How do I know it works? The hive shell displays the following: - hive> select
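A frequent source of confusion here is the direction of the integration: Hive on Spark means Hive uses Spark as its execution engine, whereas Spark reading Hive tables only requires Hive support in the session. A minimal sketch of the latter, assuming hive-site.xml is on Spark's classpath and using a hypothetical table name:

import org.apache.spark.sql.SparkSession

object HiveTablesFromSpark {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport points the session at the Hive metastore,
    // assuming hive-site.xml is available (e.g. in $SPARK_HOME/conf).
    val spark = SparkSession.builder()
      .appName("hive-tables-from-spark")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical database and table name.
    spark.sql("SELECT * FROM some_db.some_table LIMIT 10").show()
    spark.stop()
  }
}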

Converting dataframe to dictionary in pyspark without using pandas

大城市里の小女人 submitted on 2021-02-11 16:55:20
Question: Following up on this question and its dataframes, I am trying to convert a dataframe into a dictionary. In pandas I was using: dictionary = df_2.unstack().to_dict(orient='index') However, I need to convert this code to pyspark. Can anyone help me with this? As I understand from previous questions such as this one, I would indeed need to use pandas, but the dataframe is far too big for that. How can I solve this? EDIT: I have now tried the following approach: dictionary_list = map
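The usual workaround when pandas cannot be used is to collect a small key/value DataFrame to the driver and build the map there; this is only viable when the result fits in driver memory. A minimal sketch of that idea (written in Scala, as the later entries in this listing are; the column names and values are assumptions, and the analogous collect-based calls exist in pyspark):

import org.apache.spark.sql.SparkSession

object DataFrameToMap {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-to-map").getOrCreate()
    import spark.implicits._

    // Hypothetical small key/value DataFrame standing in for df_2.
    val df = Seq(("a", 1.0), ("b", 2.0)).toDF("key", "value")

    // Collect to the driver and build a plain Map; only safe when the
    // DataFrame (and hence the resulting map) fits on the driver.
    val asMap: Map[String, Double] =
      df.collect().map(r => r.getString(0) -> r.getDouble(1)).toMap

    println(asMap)
    spark.stop()
  }
}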

libraryDependencies Spark in build.sbt error (IntelliJ)

十年热恋 submitted on 2021-02-11 16:53:30
Question: I am trying to learn Scala with Spark. I am following a tutorial, but I get an error when I try to import the Spark library dependency: libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3" I am getting the following error, and I have 3 unknown artifacts. What could be the problem here? My code is very simple, just a Hello World. Answer 1: You probably need to add this to your build.sbt: resolvers += "spark-core" at "https://mvnrepository.com/artifact/org.apache.spark
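For comparison, a minimal build.sbt that resolves spark-core 2.4.3 from Maven Central is sketched below; the project name and Scala version are assumptions. Spark 2.4.x artifacts are published only for Scala 2.11 and 2.12, so a project set to a different Scala version is a common cause of unresolved ("unknown") artifacts:

// build.sbt -- minimal sketch; name and versions are assumptions.
name := "spark-hello-world"

// spark-core 2.4.x is published for Scala 2.11 and 2.12 only.
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"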

Error when connecting Spark Structured Streaming + Kafka

别说谁变了你拦得住时间么 submitted on 2021-02-11 15:45:49
Question: I'm trying to connect my Structured Streaming Spark 2.4.5 application with Kafka, but every time I try, this Data Source Provider error appears. My Scala code and my sbt build follow: import org.apache.spark.sql._ import org.apache.spark.sql.types._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.streaming.Trigger object streaming_app_demo { def main(args: Array[String]): Unit = { println("Spark Structured Streaming with Kafka Demo Application Started ...") val KAFKA_TOPIC
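A "failed to find data source" style error for Kafka usually means the Kafka connector is missing from the build; spark-sql-kafka-0-10 has to be added alongside spark-sql. A minimal sketch under that assumption (the broker address and topic name are placeholders):

// build.sbt: the connector this error usually points to (match your Spark/Scala version):
// libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.5"

import org.apache.spark.sql.SparkSession

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-demo").getOrCreate()

    // Placeholder broker and topic.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test_topic")
      .load()

    val query = df.selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}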

Training ML models on Spark per partition, such that there is one trained model per partition of the dataframe

余生颓废 submitted on 2021-02-11 15:41:37
Question: How can I do parallel model training per partition in Spark using Scala? The solution given here is in PySpark; I'm looking for a solution in Scala. How can you efficiently build one ML model per partition in Spark with foreachPartition? Answer 1: Get the distinct partitions using the partition column, create a thread pool of, say, 100 threads, and create a Future for each thread and run it. Sample code may be as follows: // Get an ExecutorService val threadPoolExecutorService = getExecutionContext("name", 100) //
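The answer's preview cuts off before the full code (and getExecutionContext there is a helper that isn't shown). A self-contained sketch of the same idea, training one Spark ML model per distinct value of a hypothetical group column from a plain thread pool, could look like this:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object PerGroupTraining {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("per-group-training").getOrCreate()

    // Hypothetical input with "group", "features" and "label" columns already prepared.
    val df = spark.read.parquet("/path/to/training_data")

    // Jobs submitted from different threads run concurrently on the cluster.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

    val groups = df.select("group").distinct().collect().map(_.getString(0))

    val futures = groups.toSeq.map { g =>
      Future {
        val model = new LinearRegression().fit(df.filter(df("group") === g))
        g -> model
      }
    }

    val models = Await.result(Future.sequence(futures), Duration.Inf).toMap
    println(s"Trained ${models.size} models")
    spark.stop()
  }
}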

Reading from S3 in EMR

萝らか妹 submitted on 2021-02-11 15:23:17
Question: I'm having trouble reading CSV files stored in my AWS S3 bucket from EMR. I have read quite a few posts about it and have done the following to make it work: added an IAM policy allowing read and write access to S3, and tried to pass the URIs in the Argument section of the spark-submit request. I thought querying S3 from EMR on a shared account was straightforward (because it works locally after defining a fileSystem and providing AWS credentials), but when I run: df = spark.read.option(
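On EMR the s3:// scheme is served by EMRFS, so credentials normally come from the cluster's IAM instance profile rather than from keys set in code. A minimal read sketch (shown in Scala; the bucket and key are placeholders, and the equivalent spark.read.option(...).csv(...) call exists in pyspark):

import org.apache.spark.sql.SparkSession

object S3ReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-read").getOrCreate()

    // Placeholder bucket and key; on EMR, s3:// goes through EMRFS and
    // the instance profile's IAM role supplies the credentials.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://my-bucket/path/to/file.csv")

    df.show(5)
    spark.stop()
  }
}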

Spark 3 Typed User Defined Aggregate Function over Window

*爱你&永不变心* submitted on 2021-02-11 15:12:56
Question: I am trying to use a custom user-defined aggregator over a window. When I use an untyped aggregator, the query works. However, I am unable to use a typed UDAF as a window function; I get an error stating The query operator `Project` contains one or more unsupported expression types Aggregate, Window or Generate. The following basic program showcases the problem. I think it could work using UserDefinedAggregateFunction rather than Aggregator, but the former is deprecated. import scala
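One approach commonly suggested for Spark 3 is to wrap the typed Aggregator with functions.udaf, which yields an untyped UserDefinedFunction that can appear in a window expression; whether the window accepts it can depend on the 3.x minor version. A minimal sketch with a trivial sum aggregator and assumed column names:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession, functions}
import org.apache.spark.sql.expressions.{Aggregator, Window}

object TypedUdafOverWindow {
  // Trivial typed aggregator: sums doubles.
  object SumAgg extends Aggregator[Double, Double, Double] {
    def zero: Double = 0.0
    def reduce(b: Double, a: Double): Double = b + a
    def merge(b1: Double, b2: Double): Double = b1 + b2
    def finish(r: Double): Double = r
    def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
    def outputEncoder: Encoder[Double] = Encoders.scalaDouble
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("typed-udaf-window").getOrCreate()
    import spark.implicits._

    // Assumed columns: a grouping key and a numeric value.
    val df = Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)).toDF("grp", "x")

    // functions.udaf turns the Aggregator into an untyped UDF usable over a window.
    val sumUdaf = functions.udaf(SumAgg)

    val w = Window.partitionBy("grp")
    df.withColumn("grp_sum", sumUdaf($"x").over(w)).show()

    spark.stop()
  }
}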