pyspark

Pyspark : Change nested column datatype

别来无恙 submitted on 2020-01-11 12:14:08
Question: How can we change the datatype of a nested column in PySpark? For example, how can I change the data type of value from string to int? Reference: how to change a Dataframe column from String type to Double type in pyspark

{
  "x": "12",
  "y": {
    "p": { "name": "abc", "value": "10" },
    "q": { "name": "pqr", "value": "20" }
  }
}

Answer 1: You can read the JSON data using

from pyspark import SQLContext
sqlContext = SQLContext(sc)
data_df = sqlContext.read.json("data.json", multiLine = True)
data_df
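The answer excerpt is cut off above. As a rough continuation of the idea (not the original answer), here is a minimal sketch of casting the nested value fields by rebuilding the y struct, using the field names from the sample JSON; one common pattern is to reconstruct the struct with the casts applied, since a nested field cannot simply be overwritten in place.

from pyspark.sql import functions as F

# Rebuild the "y" struct, casting each nested "value" from string to int
data_df = data_df.withColumn(
    "y",
    F.struct(
        F.struct(
            F.col("y.p.name").alias("name"),
            F.col("y.p.value").cast("int").alias("value"),
        ).alias("p"),
        F.struct(
            F.col("y.q.name").alias("name"),
            F.col("y.q.value").cast("int").alias("value"),
        ).alias("q"),
    ),
)

# The top-level "x" field can be cast directly
data_df = data_df.withColumn("x", F.col("x").cast("int"))
data_df.printSchema()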

PYSPARK: how to visualize a GraphFrame?

喜欢而已 submitted on 2020-01-11 11:48:07
Question: Suppose that I have created the following graph. My question is: how can I visualize it?

# Create a Vertex DataFrame with a unique ID column "id"
v = sqlContext.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# Create a GraphFrame
from graphframes import
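The excerpt stops at the GraphFrame import. One commonly suggested approach (outside of GraphFrames itself) is to collect the small vertex and edge DataFrames to the driver and draw them with networkx and matplotlib; a sketch under that assumption, reusing the v and e frames from the question:

import networkx as nx
import matplotlib.pyplot as plt

# Collect the (small) vertex and edge DataFrames to the driver
vertices = [row["id"] for row in v.collect()]
edges = [(row["src"], row["dst"]) for row in e.collect()]

# Build a directed networkx graph and draw it
g = nx.DiGraph()
g.add_nodes_from(vertices)
g.add_edges_from(edges)
nx.draw(g, with_labels=True)
plt.show()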

Spark RDD: How to calculate statistics most efficiently?

懵懂的女人 submitted on 2020-01-11 10:38:28
Question: Assuming the existence of an RDD of tuples similar to the following:

(key1, 1)
(key3, 9)
(key2, 3)
(key1, 4)
(key1, 5)
(key3, 2)
(key2, 7)
...

What is the most efficient (and, ideally, distributed) way to compute statistics corresponding to each key? (At the moment, I am looking to calculate standard deviation / variance, in particular.) As I understand it, my options amount to: Use the colStats function in MLLib: This approach has the advantage of being easily adaptable to use other mllib.stat
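One fully distributed option (separate from the colStats route the excerpt starts to describe) is to accumulate count, sum, and sum of squares per key with aggregateByKey and derive the variance from those moments; a sketch, assuming the pair RDD is named rdd:

# Accumulate (count, sum, sum of squares) per key in a single pass
zero = (0, 0.0, 0.0)

def seq_op(acc, x):
    count, total, total_sq = acc
    return (count + 1, total + x, total_sq + x * x)

def comb_op(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

moments = rdd.aggregateByKey(zero, seq_op, comb_op)

# Derive mean and (population) variance from the accumulated moments
stats = moments.mapValues(
    lambda m: {"mean": m[1] / m[0], "variance": m[2] / m[0] - (m[1] / m[0]) ** 2}
)
stats.collect()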

PySpark: dynamic union of DataFrames with different columns

二次信任 submitted on 2020-01-11 04:18:10
Question: Consider the arrays as shown here. I have 3 sets of arrays:

Array 1:
C1 C2 C3
1  2  3
9  5  6

Array 2:
C2 C3 C4
11 12 13
10 15 16

Array 3:
C1  C4
111 112
110 115

I need the output shown below. The input may contain any subset of C1, ..., C4, but after joining I need the correct values, and any missing value should be zero.

Expected output:
C1  C2 C3 C4
1   2  3  0
9   5  6  0
0   11 12 13
0   10 15 16
111 0  0  112
110 0  0  115

I have written pyspark code but I have hardcoded the value for the
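A sketch of one way to avoid hardcoding the columns, assuming the three arrays have already been loaded as DataFrames df1, df2, and df3 (hypothetical names): collect the union of all column names, add any missing column as a zero literal, and then union the aligned frames.

from functools import reduce
from pyspark.sql import functions as F

dfs = [df1, df2, df3]

# Union of all column names across the input DataFrames
all_cols = sorted(set().union(*[df.columns for df in dfs]))

def align(df):
    # Add any missing column as a zero literal, then fix the column order
    for c in all_cols:
        if c not in df.columns:
            df = df.withColumn(c, F.lit(0))
    return df.select(all_cols)

result = reduce(lambda a, b: a.union(b), [align(df) for df in dfs])
result.show()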

AWS Glue Crawler Classifies json file as UNKNOWN

折月煮酒 submitted on 2020-01-11 02:49:26
Question: I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it), it will classify the file without issue if the result is under 1 MB. I'm having trouble coming up with a workaround. I tried converting the JSON to BSON or gzipping the JSON file, but it is still classified as UNKNOWN. Has anyone else run into this issue? Is there a
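Not a confirmed fix, but given the observation above that files under 1 MB classify fine, one workaround that is often suggested is to reshape large pretty-printed files into newline-delimited JSON (one record per line), which keeps individual records small and is also friendlier to downstream Glue/Spark reads. A sketch, assuming the source file holds a top-level JSON array:

import json

# Convert a pretty-printed JSON array into newline-delimited JSON (JSON Lines),
# so each record sits on its own small, self-contained line.
with open("input.json") as src:
    records = json.load(src)  # assumes a top-level JSON array

with open("output.jsonl", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")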

Transforming PySpark RDD with Scala

戏子无情 submitted on 2020-01-10 19:50:30
Question: TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not converted by Py4j, though. I'm working on a PySpark application that pulls data from Kafka using Spark Streaming. My messages are strings and I would like to call a method in Scala code, passing it a DStream[String] instance. However, I'm unable to receive proper JVM strings in the Scala code. It looks to me like the Python strings are not
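For context on where this breaks, here is a rough sketch of the kind of handoff being attempted (MyScalaHelper and its process method, the topic name, and the broker address are all hypothetical placeholders). The JVM-side stream can be reached through the DStream's _jdstream attribute and the sc._jvm gateway, but, as the question describes, what PySpark ships across that boundary is Python-serialized data rather than plain JVM Strings, so the Scala side does not receive proper DStream[String] elements.

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 5)  # 5-second batches

# Messages arrive as (key, value) pairs; keep just the string payload
lines = KafkaUtils.createDirectStream(
    ssc, ["my-topic"], {"metadata.broker.list": "host:9092"}  # placeholder topic/broker
).map(lambda kv: kv[1])

# Hand the underlying JVM DStream to the (hypothetical) Scala helper via Py4j
sc._jvm.MyScalaHelper.process(lines._jdstream)

ssc.start()
ssc.awaitTermination()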

Dataframe transpose with pyspark in Apache Spark

南楼画角 submitted on 2020-01-10 09:04:53
Question: I have a dataframe df that has the following structure:

+-----+-----+-----+-------+
|  s  |col_1|col_2|col_...|
+-----+-----+-----+-------+
|  f1 |  0.0|  0.6|  ...  |
|  f2 |  0.6|  0.7|  ...  |
|  f3 |  0.5|  0.9|  ...  |
| ... |  ...|  ...|  ...  |

And I want to calculate the transpose of this dataframe so it will look like

+-------+-----+-----+-----+-----+
|   s   |  f1 |  f2 |  f3 | ... |
+-------+-----+-----+-----+-----+
|col_1  |  0.0|  0.6|  0.5| ... |
|col_2  |  0.6|  0.7|  0.9| ... |
|col_...|  ...|  ...|  ...| ... |
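When the frame is small enough to fit on the driver, one simple approach is to convert to pandas, transpose there, and optionally convert back; a sketch assuming the key column is named s as above and a SparkSession called spark is available (older code would use sqlContext.createDataFrame instead):

# Collect to the driver, transpose in pandas, then return to Spark if needed
pdf = df.toPandas().set_index("s").T.reset_index().rename(columns={"index": "s"})
transposed_df = spark.createDataFrame(pdf)
transposed_df.show()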

Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

大城市里の小女人 submitted on 2020-01-10 02:46:15
Question: I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command.

17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed
17/12/27 18:33:19 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.rpc.RpcTimeout
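For reference, and without claiming this is the cause of the heartbeat error above, the two usual directions are to raise spark.driver.maxResultSize with an explicit size unit and, where possible, to avoid pulling large results back to the driver at all; a sketch of both (sizes and paths are illustrative):

from pyspark.sql import SparkSession

# Raise the limit with an explicit unit when building the session
# (equivalent to --conf spark.driver.maxResultSize=2g on spark-submit)
spark = (
    SparkSession.builder
    .appName("example")
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate()
)

# Prefer writing large results out instead of collect()-ing them on the driver
df = spark.range(10000000)
df.write.mode("overwrite").parquet("/tmp/output")  # illustrative path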

PySpark: How to create a nested JSON from spark data frame?

怎甘沉沦 submitted on 2020-01-10 02:21:08
Question: I am trying to create a nested JSON from my Spark dataframe, which has data in the following structure. The code below creates a simple JSON with key and value. Could you please help?

df.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)

Update 1: As per @MaxU's answer, I converted the Spark data frame to pandas and used group by. It is putting the last two fields in a nested array. How could I first put the category and count in a nested array and then inside
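As a Spark-side alternative to the pandas group-by route, the nesting can also be built before writing, using struct and collect_list; a sketch with hypothetical column names (id as the grouping key, plus category and count), reusing the data_output_file path from the question:

from pyspark.sql import functions as F

# Nest (category, count) pairs into an array per grouping key before writing
nested_df = (
    df.groupBy("id")  # hypothetical grouping column
      .agg(F.collect_list(F.struct("category", "count")).alias("items"))
)

nested_df.coalesce(1).write.mode("overwrite").json(data_output_file + "createjson.json")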