apache-spark

'Column' object is not callable in Spark

Submitted by 人走茶凉 on 2021-02-10 06:22:14

Question: I tried to install Spark and run the commands given in the tutorial (https://spark.apache.org/docs/latest/quick-start.html), but I get the following error:

    P-MBP:spark-2.0.2-bin-hadoop2.4 prem$ ./bin/pyspark
    Python 2.7.13 (default, Apr 4 2017, 08:44:49)
    [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to
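A common trigger for the "TypeError: 'Column' object is not callable" message in the quick-start exercises is calling a method name that does not exist on a PySpark Column (for example a Scala-style startsWith instead of startswith): attribute access silently returns a nested-field Column, and calling it then fails. The sketch below is an assumed illustration of that pattern, not the poster's exact code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quickstart").getOrCreate()
    textFile = spark.read.text("README.md")  # file used in the quick-start guide

    # Fails: Column has no 'startsWith' method, so attribute access returns a
    # Column, and calling it raises TypeError: 'Column' object is not callable
    # textFile.filter(textFile.value.startsWith("S")).count()

    # Works: 'startswith' is the method that actually exists on Column
    print(textFile.filter(textFile.value.startswith("S")).count())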

Adding name of file when using sparklyr::spark_read_json

Submitted by 谁说我不能喝 on 2021-02-10 06:14:30

Question: I have millions of JSON files, where each file contains the same columns, let's say x and y. Note that the lengths of x and y are equal within a single file, but can differ between two different files. The problem is that the only thing that separates the data is the name of the file, so when combining the files I'd like to have the file name included as a third column. Is this possible using sparklyr::spark_read_json, i.e. when using wildcards? MWE:
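Spark itself exposes the source path of each record through the SQL function input_file_name(), and sparklyr should be able to reach it through its SQL translation (e.g. inside dplyr::mutate), though whether that fits the poster's setup is an assumption. As a language-neutral illustration, here is a PySpark sketch with a hypothetical path pattern:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.appName("json-with-filename").getOrCreate()

    # Read many JSON files via a wildcard and keep the source path per row.
    df = (spark.read.json("/data/json/*.json")            # hypothetical path
               .withColumn("file_name", input_file_name()))
    df.show(truncate=False)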

Spark/Scala load Oracle Table to Hive

Submitted by 对着背影说爱祢 on 2021-02-10 05:59:06

Question: I am loading a few Oracle tables into Hive. It seems to be working, but two tables fail with the error

    IllegalArgumentException: requirement failed: Decimal precision 136 exceeds max precision 38

I checked the Oracle table and there is no column with DECIMAL(136) precision in the source. Here is the Spark/Scala code in spark-shell:

    val df_oracle = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@hostname:port:SID")
      .option("user", userName)
      .option("password", passWord)
      .option("driver", "oracle
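This error typically comes from Oracle NUMBER columns declared without an explicit precision/scale: the JDBC metadata reports an unbounded precision, which Spark then maps to something beyond its 38-digit DecimalType limit. One workaround (assuming Spark 2.3 or later, and a hypothetical column and table name) is to pin the column type with the JDBC customSchema option; shown here as a PySpark sketch of the same read:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oracle-to-hive").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@hostname:port:SID")
          .option("user", "userName")                      # placeholder credentials
          .option("password", "passWord")
          .option("driver", "oracle.jdbc.OracleDriver")
          .option("dbtable", "SCHEMA.SOME_TABLE")          # hypothetical table
          # Force the unbounded NUMBER column to a precision Spark can hold
          .option("customSchema", "AMOUNT DECIMAL(38, 10)")
          .load())

    df.write.mode("overwrite").saveAsTable("hive_db.some_table")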

How does Spark choose nodes to run executors? (Spark on YARN)

Submitted by 狂风中的少年 on 2021-02-10 05:21:26

Question: How does Spark choose the nodes on which to run executors (Spark on YARN)? We use Spark in YARN mode with a cluster of 120 nodes. Yesterday one Spark job created 200 executors, with 11 executors on node1, 10 executors on node2, and the remaining executors distributed equally across the other nodes. Because there are so many executors on node1 and node2, the job ran slowly. How does Spark select the nodes to run executors on? According to the YARN ResourceManager?

Answer 1: The cluster manager allocates resources across the other
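In YARN mode, Spark only requests containers from the ResourceManager; which NodeManagers actually host them is decided by the YARN scheduler (subject to any locality preferences in the request). One practical lever is executor sizing: the closer each executor's cores and memory are to a node's capacity, the fewer of them YARN can pack onto one node. A minimal PySpark configuration sketch with illustrative values only:

    from pyspark.sql import SparkSession

    # Spark asks YARN for 200 containers of this size; the YARN scheduler
    # decides their placement. Larger per-executor requests reduce how many
    # executors can land on a single node.
    spark = (SparkSession.builder
             .appName("yarn-placement-demo")
             .config("spark.executor.instances", "200")
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "8g")
             .getOrCreate())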

IllegalArgumentException when computing a PCA with Spark ML

Submitted by 試著忘記壹切 on 2021-02-10 05:08:42

Question: I have a Parquet file containing the id and features columns, and I want to apply the PCA algorithm.

    val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
    val features = new VectorAssembler()
      .setInputCols(Array("id", "features"))
      .setOutputCol("features")
    val pca = new PCA()
      .setInputCol("features")
      .setK(50)
      .fit(dataset)
      .setOutputCol("pcaFeatures")
    val result = pca.transform(dataset).select("pcaFeatures")
    pca.save("/usr/local/spark/dataset/out")

but I have this exception
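Two things stand out in the snippet: the VectorAssembler's output column reuses the existing name "features", and its transform is never applied before PCA is fit on the raw dataset. A PySpark sketch of the usual VectorAssembler-then-PCA flow, using an assumed output column name and the paths from the question:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, PCA

    spark = SparkSession.builder.appName("pca-example").getOrCreate()
    dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")

    # Assemble into a NEW column name to avoid the collision with "features",
    # and actually run the transform before fitting PCA.
    assembler = VectorAssembler(inputCols=["id", "features"],
                                outputCol="assembled_features")
    assembled = assembler.transform(dataset)

    pca = PCA(k=50, inputCol="assembled_features", outputCol="pcaFeatures")
    model = pca.fit(assembled)
    result = model.transform(assembled).select("pcaFeatures")
    model.save("/usr/local/spark/dataset/out")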

Update the Nested Json with another Nested Json using Python

Submitted by 两盒软妹~` on 2021-02-10 05:02:14

Question: For example, I have one full set of nested JSON and I need to update it with the latest values from another nested JSON. Can anyone help me with this? I want to implement this in PySpark. The full-set JSON looks like this:

    {
      "email": "abctest@xxx.com",
      "firstName": "name01",
      "id": 6304,
      "surname": "Optional",
      "layer01": {
        "key1": "value1",
        "key2": "value2",
        "key3": "value3",
        "key4": "value4",
        "layer02": {
          "key1": "value1",
          "key2": "value2"
        },
        "layer03": [
          { "inner_key01": "inner value01" },
          {
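For the dictionary part of such a structure, a recursive merge is usually enough: nested dicts are merged key by key, and every other value is replaced by the newer one. A minimal plain-Python sketch (input variable names are hypothetical, and merging of lists such as layer03 is not addressed); it could run on the driver or inside a UDF:

    import json

    def deep_update(base, updates):
        """Recursively overlay `updates` onto `base`: dicts are merged,
        any other value is replaced by the newer one."""
        for key, value in updates.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                deep_update(base[key], value)
            else:
                base[key] = value
        return base

    full = json.loads(full_json_string)       # hypothetical input strings
    latest = json.loads(latest_json_string)
    merged = deep_update(full, latest)
    print(json.dumps(merged, indent=2))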

How to obtain the average of an array-type column in Scala Spark over all row entries, per entry?

Submitted by 一世执手 on 2021-02-10 04:57:27

Question: I have an array column with 512 double elements and want to get the average. Take an array column with length 3 as an example:

    val x = Seq("2 4 6", "0 0 0").toDF("value").withColumn("value", split($"value", " "))
    x.printSchema()
    x.show()

    root
     |-- value: array (nullable = true)
     |    |-- element: string (containsNull = true)

    +---------+
    |    value|
    +---------+
    |[2, 4, 6]|
    |[0, 0, 0]|
    +---------+

The following result is desired:

    x.select(..... as "avg_value").show()

    ------------
    |avg_value |
    ------------
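Reading this as the element-wise average over all rows (one value per array position), one approach is to explode the array with its position, average per position, and collect the result back into a list. A PySpark sketch under that assumption:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("array-avg").getOrCreate()

    x = (spark.createDataFrame([("2 4 6",), ("0 0 0",)], ["value"])
              .withColumn("value", F.split("value", " ")))

    # Explode with position, average per position, reassemble as a list.
    avg_per_pos = (x.select(F.posexplode("value").alias("pos", "elem"))
                     .groupBy("pos")
                     .agg(F.avg(F.col("elem").cast("double")).alias("avg"))
                     .orderBy("pos"))
    avg_value = [row["avg"] for row in avg_per_pos.collect()]
    print(avg_value)   # [1.0, 2.0, 3.0]

If instead a per-row average is wanted, the aggregate higher-order function (Spark 2.4+) over the array column is the usual route.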

Convert pyspark dataframe into list of python dictionaries

Submitted by 怎甘沉沦 on 2021-02-10 04:50:38

Question: Hi, I'm new to PySpark and I'm trying to convert a pyspark.sql.dataframe into a list of dictionaries. Below is my dataframe; its type is <class 'pyspark.sql.dataframe.DataFrame'>:

    +------------------+----------+------------------------+
    |             title|imdb_score|Worldwide_Gross(dollars)|
    +------------------+----------+------------------------+
    | The Eight Hundred|       7.2|               460699653|
    | Bad Boys for Life|       6.6|               426505244|
    |             Tenet|       7.8|               334000000|
    |Sonic the Hedgehog|       6.5|               308439401|
    |          Dolittle|       5.6|               245229088|
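For a DataFrame small enough to bring to the driver, each Row can be converted with asDict(). A minimal sketch, where df stands for the DataFrame shown above:

    # collect() pulls every row to the driver, so this only suits small data.
    records = [row.asDict() for row in df.collect()]
    print(records[0])
    # e.g. {'title': 'The Eight Hundred', 'imdb_score': 7.2, ...}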