apache-spark

Calculate value based on value from same column of the previous row in Spark

Submitted by 非 Y 不嫁゛ on 2021-01-21 10:23:29
Question: I have an issue where I have to calculate a column using a formula that uses the value from the calculation done in the previous row. I am unable to figure it out using the withColumn API. I need to calculate a new column using the formula: MovingRate = MonthlyRate + (0.7 * MovingRatePrevious), where MovingRatePrevious is the MovingRate of the prior row. For month 1 I have the value, so I do not need to re-calculate it, but I need that value to be able to calculate the subsequent rows. I
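
The excerpt is truncated, but the recurrence itself is clear, so here is a minimal sketch of one way to express it with the DataFrame API. Everything below is an assumption rather than taken from the question: the month and MonthlyRate column names, the sample data, and the starting value (month 1's MovingRate taken to be its MonthlyRate). The point of the sketch is that lag() alone cannot do this, because each row needs the previously computed value, so it folds the recurrence over a running collect_list window instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, udf}

object MovingRateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MovingRateSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one row per month, ordered by `month`.
    val df = Seq((1, 0.10), (2, 0.12), (3, 0.11), (4, 0.15)).toDF("month", "MonthlyRate")

    // lag() is not enough, because each row needs the *computed* value of the previous row,
    // so collect every MonthlyRate seen so far and fold the recurrence over that list:
    //   MovingRate(1) = MonthlyRate(1)                               (assumed starting value)
    //   MovingRate(n) = MonthlyRate(n) + 0.7 * MovingRate(n - 1)
    val foldRates = udf { rates: Seq[Double] =>
      rates.tail.foldLeft(rates.head)((prev, cur) => cur + 0.7 * prev)
    }

    // No partitionBy here, so Spark will warn that the whole series lands in one partition.
    val w = Window.orderBy("month").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    df.withColumn("MovingRate", foldRates(collect_list($"MonthlyRate").over(w))).show()
  }
}
```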

What is the difference between “predicate pushdown” and “projection pushdown”?

Submitted by 狂风中的少年 on 2021-01-21 05:22:46
Question: I have come across several sources of information, such as the one found here, which explain "predicate pushdown" as: "… if you can 'push down' parts of the query to where the data is stored, and thus filter out most of the data, then you can greatly reduce network traffic." However, I have also seen the term "projection pushdown" in other documentation, such as here, which appears to be the same thing, but I am not sure in my understanding. Is there a specific difference between the two terms?
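
A small sketch to make the distinction concrete, using a hypothetical Parquet path and column names: select() drives projection pushdown (only the referenced columns are read from the files), while filter() drives predicate pushdown (the condition is evaluated at the data source), and both effects are visible in the physical plan.

```scala
import org.apache.spark.sql.SparkSession

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PushdownSketch").master("local[*]").getOrCreate()

    // Hypothetical Parquet dataset with columns id, country, amount.
    val orders = spark.read.parquet("/tmp/orders.parquet")

    val result = orders
      .select("id", "amount")   // projection pushdown: only these columns are read from the files
      .filter("amount > 100")   // predicate pushdown: the filter is pushed into the Parquet scan

    // The physical plan shows both: a pruned ReadSchema for the projection,
    // and a PushedFilters entry such as [IsNotNull(amount), GreaterThan(amount,100.0)].
    result.explain(true)
  }
}
```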

Spark Select with a List of Columns Scala

Submitted by 喜欢而已 on 2021-01-21 04:22:35
Question: I am trying to find a good way of doing a Spark select with a List[Column]. I am exploding a column and then passing back all the columns I am interested in along with my exploded column. var columns = getColumns(x) // Returns a List[Column] tempDf.select(columns) // trying to get ... I am trying to find a good way of doing this. I know that if it were a string I could do something like val result = dataframe.select(columnNames.head, columnNames.tail: _*) Answer 1: For Spark 2.0 it seems that you have two options. Both
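
A minimal sketch of the varargs expansion this question is circling around (presumably one of the two options in the truncated answer): a List[Column] can be passed straight to select with `: _*`, and plain column names can be mapped through col() first. The helper names below are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

object SelectListSketch {
  // A List[Column] expands into select's Column* varargs with `: _*`.
  def selectColumns(df: DataFrame, columns: List[Column]): DataFrame =
    df.select(columns: _*)

  // With plain column names, map them through col() first
  // (or keep the head/tail form from the question, which works on String overloads).
  def selectNames(df: DataFrame, columnNames: List[String]): DataFrame =
    df.select(columnNames.map(col): _*)
}
```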

Spark-HBase - GCP template (2/3) - Version issue of json4s?

Submitted by ♀尐吖头ヾ on 2021-01-20 07:27:37
Question: I'm trying to test the Spark-HBase connector in the GCP context and tried to follow [1], which asks to locally package the connector [2] using Maven (I tried Maven 3.6.3) for Spark 2.4, and I get the following error when submitting the job on Dataproc (after having completed [3]). Any idea? Thanks for your support. References: [1] https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc [2] https://github.com/hortonworks-spark/shc/tree/branch-2.4 [3] Spark-HBase -

PySpark: ModuleNotFoundError: No module named 'app'

Submitted by 牧云@^-^@ on 2021-01-20 04:50:06
Question: I am saving a dataframe to a CSV file in PySpark using the statement below: df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite') But I am getting the error below: Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type) File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib

How to deploy Spark application jar file to Kubernetes cluster?

Submitted by 混江龙づ霸主 on 2021-01-20 04:49:07
Question: I am currently trying to deploy a Spark example jar on a Kubernetes cluster running on IBM Cloud. If I try to follow these instructions to deploy Spark on a Kubernetes cluster, I am not able to launch Spark Pi, because I always get the error message "The system cannot find the file specified" after entering the command bin/spark-submit \ --master k8s://<url of my kubernetes cluster> \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark

How to use numPartitions, lowerBound and upperBound in the spark-jdbc connection?

Submitted by 余生长醉 on 2021-01-19 08:24:31
Question: I am trying to read a table from a Postgres database using spark-jdbc. For that I have come up with the following code: object PartitionRetrieval { var conf = new SparkConf().setAppName("Spark-JDBC").set("spark.executor.heartbeatInterval","120s").set("spark.network.timeout","12000s").set("spark.default.parallelism", "20") val log = LogManager.getLogger("Spark-JDBC Program") Logger.getLogger("org").setLevel(Level.ERROR) val conFile = "/home/myuser/ReconTest/inputdir/testconnection.properties" val
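
The excerpt cuts off before the actual read, so here is a hedged sketch (with made-up connection details, table and column names) of how the three options in the title fit together: lowerBound, upperBound and numPartitions only control how the stride over partitionColumn is cut into parallel queries; they do not filter rows, and rows outside the bounds still land in the first or last partition.

```scala
import org.apache.spark.sql.SparkSession

object PartitionedJdbcReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Spark-JDBC").master("local[*]").getOrCreate()

    // Hypothetical connection details; partitionColumn must be numeric, date or timestamp.
    // With lowerBound=1, upperBound=1000000 and numPartitions=10, Spark issues roughly
    //   WHERE account_id < 100001, WHERE account_id >= 100001 AND account_id < 200001, ...
    // i.e. ten concurrent queries, one per stride.
    val accounts = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.accounts")
      .option("user", "myuser")
      .option("password", "secret")
      .option("partitionColumn", "account_id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "10")
      .load()

    println(accounts.rdd.getNumPartitions) // expect 10 parallel JDBC partitions
  }
}
```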