apache-spark

Calculate value based on value from same column of the previous row in Spark

Submitted by 非 Y 不嫁゛ on 2021-01-21 10:23:29
Question: I have an issue where I have to calculate a column using a formula that uses the value from the calculation done in the previous row. I am unable to figure it out using the withColumn API. I need to calculate a new column using the formula: MovingRate = MonthlyRate + (0.7 * MovingRatePrevious), where MovingRatePrevious is the MovingRate of the prior row. For month 1 I have the value, so I do not need to re-calculate it, but I need that value to be able to calculate the subsequent rows. I
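
The excerpt is truncated, but the recurrence itself is clear, so here is a minimal sketch of one way to express it with the DataFrame API. Everything below is an assumption rather than taken from the question: the month and MonthlyRate column names, the sample data, and the starting value (month 1's MovingRate taken to be its MonthlyRate). The point of the sketch is that lag() alone cannot do this, because each row needs the previously computed value, so it folds the recurrence over a running collect_list window instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, udf}

object MovingRateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MovingRateSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input: one row per month, ordered by `month`.
    val df = Seq((1, 0.10), (2, 0.12), (3, 0.11), (4, 0.15)).toDF("month", "MonthlyRate")

    // lag() is not enough, because each row needs the *computed* value of the previous row,
    // so collect every MonthlyRate seen so far and fold the recurrence over that list:
    //   MovingRate(1) = MonthlyRate(1)                               (assumed starting value)
    //   MovingRate(n) = MonthlyRate(n) + 0.7 * MovingRate(n - 1)
    val foldRates = udf { rates: Seq[Double] =>
      rates.tail.foldLeft(rates.head)((prev, cur) => cur + 0.7 * prev)
    }

    // No partitionBy here, so Spark will warn that the whole series lands in one partition.
    val w = Window.orderBy("month").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    df.withColumn("MovingRate", foldRates(collect_list($"MonthlyRate").over(w))).show()
  }
}
```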

What is the difference between “predicate pushdown” and “projection pushdown”?

Submitted by 狂风中的少年 on 2021-01-21 05:22:46
Question: I have come across several sources of information, such as the one found here, which explain "predicate pushdown" as: "… if you can 'push down' parts of the query to where the data is stored, and thus filter out most of the data, then you can greatly reduce network traffic." However, I have also seen the term "projection pushdown" in other documentation, such as here, which appears to be the same thing, but I am not sure in my understanding. Is there a specific difference between the two terms?
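
A small sketch to make the distinction concrete, using a hypothetical Parquet path and column names: select() drives projection pushdown (only the referenced columns are read from the files), while filter() drives predicate pushdown (the condition is evaluated at the data source), and both effects are visible in the physical plan.

```scala
import org.apache.spark.sql.SparkSession

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PushdownSketch").master("local[*]").getOrCreate()

    // Hypothetical Parquet dataset with columns id, country, amount.
    val orders = spark.read.parquet("/tmp/orders.parquet")

    val result = orders
      .select("id", "amount")   // projection pushdown: only these columns are read from the files
      .filter("amount > 100")   // predicate pushdown: the filter is pushed into the Parquet scan

    // The physical plan shows both: a pruned ReadSchema for the projection,
    // and a PushedFilters entry such as [IsNotNull(amount), GreaterThan(amount,100.0)].
    result.explain(true)
  }
}
```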

Spark Select with a List of Columns Scala

Submitted by 喜欢而已 on 2021-01-21 04:22:35
Question: I am trying to find a good way of doing a Spark select with a List[Column]. I am exploding a column and then passing back all the columns I am interested in along with my exploded column. var columns = getColumns(x) // Returns a List[Column] tempDf.select(columns) // trying to get ... I am trying to find a good way of doing this. I know that if it were a string I could do something like val result = dataframe.select(columnNames.head, columnNames.tail: _*) Answer 1: For Spark 2.0 it seems that you have two options. Both
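
A minimal sketch of the varargs expansion this question is circling around (presumably one of the two options in the truncated answer): a List[Column] can be passed straight to select with `: _*`, and plain column names can be mapped through col() first. The helper names below are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

object SelectListSketch {
  // A List[Column] expands into select's Column* varargs with `: _*`.
  def selectColumns(df: DataFrame, columns: List[Column]): DataFrame =
    df.select(columns: _*)

  // With plain column names, map them through col() first
  // (or keep the head/tail form from the question, which works on String overloads).
  def selectNames(df: DataFrame, columnNames: List[String]): DataFrame =
    df.select(columnNames.map(col): _*)
}
```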

Spark-HBase - GCP template (2/3) - Version issue of json4s?

Submitted by ♀尐吖头ヾ on 2021-01-20 07:27:37
Question: I'm trying to test the Spark-HBase connector in the GCP context and tried to follow [1], which asks to locally package the connector [2] using Maven (I tried Maven 3.6.3) for Spark 2.4, and I get the following error when submitting the job on Dataproc (after having completed [3]). Any idea? Thanks for your support. References: [1] https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc [2] https://github.com/hortonworks-spark/shc/tree/branch-2.4 [3] Spark-HBase -

PySpark: ModuleNotFoundError: No module named 'app'

Submitted by 牧云@^-^@ on 2021-01-20 04:50:06
Question: I am saving a dataframe to a CSV file in PySpark using the statement below: df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite') But I am getting the error below: Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type) File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib

How to deploy Spark application jar file to Kubernetes cluster?

Submitted by 混江龙づ霸主 on 2021-01-20 04:49:07
Question: I am currently trying to deploy a Spark example jar on a Kubernetes cluster running on IBM Cloud. If I try to follow these instructions to deploy Spark on a Kubernetes cluster, I am not able to launch Spark Pi, because I always get the error message "The system cannot find the file specified" after entering the command bin/spark-submit \ --master k8s://<url of my kubernetes cluster> \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark

How to use numPartitions, lowerBound and upperBound in the spark-jdbc connection?

Submitted by 余生长醉 on 2021-01-19 08:24:31
Question: I am trying to read a table from a Postgres database using spark-jdbc. For that I have come up with the following code: object PartitionRetrieval { var conf = new SparkConf().setAppName("Spark-JDBC").set("spark.executor.heartbeatInterval","120s").set("spark.network.timeout","12000s").set("spark.default.parallelism", "20") val log = LogManager.getLogger("Spark-JDBC Program") Logger.getLogger("org").setLevel(Level.ERROR) val conFile = "/home/myuser/ReconTest/inputdir/testconnection.properties" val
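
The excerpt cuts off before the actual read, so here is a hedged sketch (with made-up connection details, table and column names) of how the three options in the title fit together: lowerBound, upperBound and numPartitions only control how the stride over partitionColumn is cut into parallel queries; they do not filter rows, and rows outside the bounds still land in the first or last partition.

```scala
import org.apache.spark.sql.SparkSession

object PartitionedJdbcReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Spark-JDBC").master("local[*]").getOrCreate()

    // Hypothetical connection details; partitionColumn must be numeric, date or timestamp.
    // With lowerBound=1, upperBound=1000000 and numPartitions=10, Spark issues roughly
    //   WHERE account_id < 100001, WHERE account_id >= 100001 AND account_id < 200001, ...
    // i.e. ten concurrent queries, one per stride.
    val accounts = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.accounts")
      .option("user", "myuser")
      .option("password", "secret")
      .option("partitionColumn", "account_id")
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "10")
      .load()

    println(accounts.rdd.getNumPartitions) // expect 10 parallel JDBC partitions
  }
}
```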