apache-spark

'Column' object is not callable in Spark

Submitted by 人走茶凉 on 2021-02-10 06:22:14

Question: I tried to install Spark and run the commands given in the tutorial (https://spark.apache.org/docs/latest/quick-start.html), but I get the following error:

    P-MBP:spark-2.0.2-bin-hadoop2.4 prem$ ./bin/pyspark
    Python 2.7.13 (default, Apr 4 2017, 08:44:49)
    [GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to
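A common trigger for the "TypeError: 'Column' object is not callable" message in the quick-start exercises is calling a method name that does not exist on a PySpark Column (for example a Scala-style startsWith instead of startswith): attribute access silently returns a nested-field Column, and calling it then fails. The sketch below is an assumed illustration of that pattern, not the poster's exact code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quickstart").getOrCreate()
    textFile = spark.read.text("README.md")  # file used in the quick-start guide

    # Fails: Column has no 'startsWith' method, so attribute access returns a
    # Column, and calling it raises TypeError: 'Column' object is not callable
    # textFile.filter(textFile.value.startsWith("S")).count()

    # Works: 'startswith' is the method that actually exists on Column
    print(textFile.filter(textFile.value.startswith("S")).count())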

Adding name of file when using sparklyr::spark_read_json

Submitted by 谁说我不能喝 on 2021-02-10 06:14:30

Question: I have millions of JSON files, where each file contains the same columns, let's say x and y. Note that the lengths of x and y are equal within a single file, but can differ between two different files. The problem is that the only thing that separates the data is the name of the file, so when combining the files I'd like to have the file name included as a third column. Is this possible using sparklyr::spark_read_json, i.e. when using wildcards? MWE:
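Spark itself exposes the source path of each record through the SQL function input_file_name(), and sparklyr should be able to reach it through its SQL translation (e.g. inside dplyr::mutate), though whether that fits the poster's setup is an assumption. As a language-neutral illustration, here is a PySpark sketch with a hypothetical path pattern:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.appName("json-with-filename").getOrCreate()

    # Read many JSON files via a wildcard and keep the source path per row.
    df = (spark.read.json("/data/json/*.json")            # hypothetical path
               .withColumn("file_name", input_file_name()))
    df.show(truncate=False)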

Spark/Scala load Oracle Table to Hive

Submitted by 对着背影说爱祢 on 2021-02-10 05:59:06

Question: I am loading a few Oracle tables into Hive. It seems to be working, but two tables fail with the error

    IllegalArgumentException: requirement failed: Decimal precision 136 exceeds max precision 38

I checked the Oracle table and there is no column with DECIMAL(136) precision in the source. Here is the Spark/Scala code in spark-shell:

    val df_oracle = spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@hostname:port:SID")
      .option("user", userName)
      .option("password", passWord)
      .option("driver", "oracle
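This error typically comes from Oracle NUMBER columns declared without an explicit precision/scale: the JDBC metadata reports an unbounded precision, which Spark then maps to something beyond its 38-digit DecimalType limit. One workaround (assuming Spark 2.3 or later, and a hypothetical column and table name) is to pin the column type with the JDBC customSchema option; shown here as a PySpark sketch of the same read:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oracle-to-hive").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@hostname:port:SID")
          .option("user", "userName")                      # placeholder credentials
          .option("password", "passWord")
          .option("driver", "oracle.jdbc.OracleDriver")
          .option("dbtable", "SCHEMA.SOME_TABLE")          # hypothetical table
          # Force the unbounded NUMBER column to a precision Spark can hold
          .option("customSchema", "AMOUNT DECIMAL(38, 10)")
          .load())

    df.write.mode("overwrite").saveAsTable("hive_db.some_table")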

How does Spark choose nodes to run executors? (Spark on YARN)

Submitted by 狂风中的少年 on 2021-02-10 05:21:26

Question: How does Spark choose the nodes on which to run executors (Spark on YARN)? We use Spark in YARN mode with a cluster of 120 nodes. Yesterday one Spark job created 200 executors, with 11 executors on node1, 10 executors on node2, and the remaining executors distributed equally across the other nodes. Because there are so many executors on node1 and node2, the job ran slowly. How does Spark select the nodes to run executors on? According to the YARN ResourceManager?

Answer 1: The cluster manager allocates resources across the other
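In YARN mode, Spark only requests containers from the ResourceManager; which NodeManagers actually host them is decided by the YARN scheduler (subject to any locality preferences in the request). One practical lever is executor sizing: the closer each executor's cores and memory are to a node's capacity, the fewer of them YARN can pack onto one node. A minimal PySpark configuration sketch with illustrative values only:

    from pyspark.sql import SparkSession

    # Spark asks YARN for 200 containers of this size; the YARN scheduler
    # decides their placement. Larger per-executor requests reduce how many
    # executors can land on a single node.
    spark = (SparkSession.builder
             .appName("yarn-placement-demo")
             .config("spark.executor.instances", "200")
             .config("spark.executor.cores", "4")
             .config("spark.executor.memory", "8g")
             .getOrCreate())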

IllegalArgumentException when computing a PCA with Spark ML

Submitted by 試著忘記壹切 on 2021-02-10 05:08:42

Question: I have a Parquet file containing the id and features columns, and I want to apply the PCA algorithm.

    val dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")
    val features = new VectorAssembler()
      .setInputCols(Array("id", "features"))
      .setOutputCol("features")
    val pca = new PCA()
      .setInputCol("features")
      .setK(50)
      .fit(dataset)
      .setOutputCol("pcaFeatures")
    val result = pca.transform(dataset).select("pcaFeatures")
    pca.save("/usr/local/spark/dataset/out")

but I have this exception
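Two things stand out in the snippet: the VectorAssembler's output column reuses the existing name "features", and its transform is never applied before PCA is fit on the raw dataset. A PySpark sketch of the usual VectorAssembler-then-PCA flow, using an assumed output column name and the paths from the question:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler, PCA

    spark = SparkSession.builder.appName("pca-example").getOrCreate()
    dataset = spark.read.parquet("/usr/local/spark/dataset/data/user")

    # Assemble into a NEW column name to avoid the collision with "features",
    # and actually run the transform before fitting PCA.
    assembler = VectorAssembler(inputCols=["id", "features"],
                                outputCol="assembled_features")
    assembled = assembler.transform(dataset)

    pca = PCA(k=50, inputCol="assembled_features", outputCol="pcaFeatures")
    model = pca.fit(assembled)
    result = model.transform(assembled).select("pcaFeatures")
    model.save("/usr/local/spark/dataset/out")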

Update the Nested Json with another Nested Json using Python

Submitted by 两盒软妹~` on 2021-02-10 05:02:14

Question: For example, I have one full set of nested JSON and I need to update it with the latest values from another nested JSON. Can anyone help me with this? I want to implement this in PySpark. The full-set JSON looks like this:

    {
      "email": "abctest@xxx.com",
      "firstName": "name01",
      "id": 6304,
      "surname": "Optional",
      "layer01": {
        "key1": "value1",
        "key2": "value2",
        "key3": "value3",
        "key4": "value4",
        "layer02": {
          "key1": "value1",
          "key2": "value2"
        },
        "layer03": [
          { "inner_key01": "inner value01" },
          {
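For the dictionary part of such a structure, a recursive merge is usually enough: nested dicts are merged key by key, and every other value is replaced by the newer one. A minimal plain-Python sketch (input variable names are hypothetical, and merging of lists such as layer03 is not addressed); it could run on the driver or inside a UDF:

    import json

    def deep_update(base, updates):
        """Recursively overlay `updates` onto `base`: dicts are merged,
        any other value is replaced by the newer one."""
        for key, value in updates.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                deep_update(base[key], value)
            else:
                base[key] = value
        return base

    full = json.loads(full_json_string)       # hypothetical input strings
    latest = json.loads(latest_json_string)
    merged = deep_update(full, latest)
    print(json.dumps(merged, indent=2))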

How to obtain the average of an array-type column in Scala Spark over all row entries, per entry?

Submitted by 一世执手 on 2021-02-10 04:57:27

Question: I have an array column with 512 double elements and want to get the average. Take an array column with length 3 as an example:

    val x = Seq("2 4 6", "0 0 0").toDF("value").withColumn("value", split($"value", " "))
    x.printSchema()
    x.show()

    root
     |-- value: array (nullable = true)
     |    |-- element: string (containsNull = true)

    +---------+
    |    value|
    +---------+
    |[2, 4, 6]|
    |[0, 0, 0]|
    +---------+

The following result is desired:

    x.select(..... as "avg_value").show()

    ------------
    |avg_value |
    ------------
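Reading this as the element-wise average over all rows (one value per array position), one approach is to explode the array with its position, average per position, and collect the result back into a list. A PySpark sketch under that assumption:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("array-avg").getOrCreate()

    x = (spark.createDataFrame([("2 4 6",), ("0 0 0",)], ["value"])
              .withColumn("value", F.split("value", " ")))

    # Explode with position, average per position, reassemble as a list.
    avg_per_pos = (x.select(F.posexplode("value").alias("pos", "elem"))
                     .groupBy("pos")
                     .agg(F.avg(F.col("elem").cast("double")).alias("avg"))
                     .orderBy("pos"))
    avg_value = [row["avg"] for row in avg_per_pos.collect()]
    print(avg_value)   # [1.0, 2.0, 3.0]

If instead a per-row average is wanted, the aggregate higher-order function (Spark 2.4+) over the array column is the usual route.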

Convert pyspark dataframe into list of python dictionaries

Submitted by 怎甘沉沦 on 2021-02-10 04:50:38

Question: Hi, I'm new to PySpark and I'm trying to convert a pyspark.sql.dataframe into a list of dictionaries. Below is my dataframe; its type is <class 'pyspark.sql.dataframe.DataFrame'>:

    +------------------+----------+------------------------+
    |             title|imdb_score|Worldwide_Gross(dollars)|
    +------------------+----------+------------------------+
    | The Eight Hundred|       7.2|               460699653|
    | Bad Boys for Life|       6.6|               426505244|
    |             Tenet|       7.8|               334000000|
    |Sonic the Hedgehog|       6.5|               308439401|
    |          Dolittle|       5.6|               245229088|
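For a DataFrame small enough to bring to the driver, each Row can be converted with asDict(). A minimal sketch, where df stands for the DataFrame shown above:

    # collect() pulls every row to the driver, so this only suits small data.
    records = [row.asDict() for row in df.collect()]
    print(records[0])
    # e.g. {'title': 'The Eight Hundred', 'imdb_score': 7.2, ...}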