pyspark

Pyspark : Change nested column datatype

别来无恙 submitted on 2020-01-11 12:14:08
Question: How can we change the datatype of a nested column in PySpark? For example, how can I change the data type of value from string to int? Reference: how to change a Dataframe column from String type to Double type in pyspark

{
  "x": "12",
  "y": {
    "p": { "name": "abc", "value": "10" },
    "q": { "name": "pqr", "value": "20" }
  }
}

Answer 1: You can read the JSON data using

from pyspark import SQLContext
sqlContext = SQLContext(sc)
data_df = sqlContext.read.json("data.json", multiLine = True)
data_df
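The answer excerpt is cut off above. As a rough continuation of the idea (not the original answer), here is a minimal sketch of casting the nested value fields by rebuilding the y struct, using the field names from the sample JSON; one common pattern is to reconstruct the struct with the casts applied, since a nested field cannot simply be overwritten in place.

from pyspark.sql import functions as F

# Rebuild the "y" struct, casting each nested "value" from string to int
data_df = data_df.withColumn(
    "y",
    F.struct(
        F.struct(
            F.col("y.p.name").alias("name"),
            F.col("y.p.value").cast("int").alias("value"),
        ).alias("p"),
        F.struct(
            F.col("y.q.name").alias("name"),
            F.col("y.q.value").cast("int").alias("value"),
        ).alias("q"),
    ),
)

# The top-level "x" field can be cast directly
data_df = data_df.withColumn("x", F.col("x").cast("int"))
data_df.printSchema()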

PYSPARK: how to visualize a GraphFrame?

喜欢而已 submitted on 2020-01-11 11:48:07
Question: Suppose that I have created the following graph. My question is: how can I visualize it?

# Create a Vertex DataFrame with a unique ID column "id"
v = sqlContext.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# Create a GraphFrame
from graphframes import
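The excerpt stops at the GraphFrame import. One commonly suggested approach (outside of GraphFrames itself) is to collect the small vertex and edge DataFrames to the driver and draw them with networkx and matplotlib; a sketch under that assumption, reusing the v and e frames from the question:

import networkx as nx
import matplotlib.pyplot as plt

# Collect the (small) vertex and edge DataFrames to the driver
vertices = [row["id"] for row in v.collect()]
edges = [(row["src"], row["dst"]) for row in e.collect()]

# Build a directed networkx graph and draw it
g = nx.DiGraph()
g.add_nodes_from(vertices)
g.add_edges_from(edges)
nx.draw(g, with_labels=True)
plt.show()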

Spark RDD: How to calculate statistics most efficiently?

懵懂的女人 submitted on 2020-01-11 10:38:28
Question: Assuming the existence of an RDD of tuples similar to the following:

(key1, 1)
(key3, 9)
(key2, 3)
(key1, 4)
(key1, 5)
(key3, 2)
(key2, 7)
...

What is the most efficient (and, ideally, distributed) way to compute statistics corresponding to each key? (At the moment, I am looking to calculate standard deviation / variance, in particular.) As I understand it, my options amount to: Use the colStats function in MLLib: This approach has the advantage of being easily adaptable to use other mllib.stat
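One fully distributed option (separate from the colStats route the excerpt starts to describe) is to accumulate count, sum, and sum of squares per key with aggregateByKey and derive the variance from those moments; a sketch, assuming the pair RDD is named rdd:

# Accumulate (count, sum, sum of squares) per key in a single pass
zero = (0, 0.0, 0.0)

def seq_op(acc, x):
    count, total, total_sq = acc
    return (count + 1, total + x, total_sq + x * x)

def comb_op(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

moments = rdd.aggregateByKey(zero, seq_op, comb_op)

# Derive mean and (population) variance from the accumulated moments
stats = moments.mapValues(
    lambda m: {"mean": m[1] / m[0], "variance": m[2] / m[0] - (m[1] / m[0]) ** 2}
)
stats.collect()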

PySpark: dynamic union of DataFrames with different columns

二次信任 submitted on 2020-01-11 04:18:10
Question: Consider the arrays as shown here. I have 3 sets of arrays:

Array 1:
C1 C2 C3
1  2  3
9  5  6

Array 2:
C2 C3 C4
11 12 13
10 15 16

Array 3:
C1  C4
111 112
110 115

I need the output shown below. The input may contain any subset of C1, ..., C4, but after joining I need the correct values, and any missing value should be zero.

Expected output:
C1  C2 C3 C4
1   2  3  0
9   5  6  0
0   11 12 13
0   10 15 16
111 0  0  112
110 0  0  115

I have written pyspark code but I have hardcoded the value for the
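A sketch of one way to avoid hardcoding the columns, assuming the three arrays have already been loaded as DataFrames df1, df2, and df3 (hypothetical names): collect the union of all column names, add any missing column as a zero literal, and then union the aligned frames.

from functools import reduce
from pyspark.sql import functions as F

dfs = [df1, df2, df3]

# Union of all column names across the input DataFrames
all_cols = sorted(set().union(*[df.columns for df in dfs]))

def align(df):
    # Add any missing column as a zero literal, then fix the column order
    for c in all_cols:
        if c not in df.columns:
            df = df.withColumn(c, F.lit(0))
    return df.select(all_cols)

result = reduce(lambda a, b: a.union(b), [align(df) for df in dfs])
result.show()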

AWS Glue Crawler Classifies json file as UNKNOWN

折月煮酒 submitted on 2020-01-11 02:49:26
Question: I'm working on an ETL job that will ingest JSON files into an RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1 MB in size. If I minify a file (instead of pretty-printing it), it will classify the file without issue if the result is under 1 MB. I'm having trouble coming up with a workaround. I tried converting the JSON to BSON or gzipping the JSON file, but it is still classified as UNKNOWN. Has anyone else run into this issue? Is there a
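Not a confirmed fix, but given the observation above that files under 1 MB classify fine, one workaround that is often suggested is to reshape large pretty-printed files into newline-delimited JSON (one record per line), which keeps individual records small and is also friendlier to downstream Glue/Spark reads. A sketch, assuming the source file holds a top-level JSON array:

import json

# Convert a pretty-printed JSON array into newline-delimited JSON (JSON Lines),
# so each record sits on its own small, self-contained line.
with open("input.json") as src:
    records = json.load(src)  # assumes a top-level JSON array

with open("output.jsonl", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")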

Transforming PySpark RDD with Scala

戏子无情 submitted on 2020-01-10 19:50:30
Question: TL;DR - I have what looks like a DStream of Strings in a PySpark application. I want to send it as a DStream[String] to a Scala library. Strings are not converted by Py4j, though. I'm working on a PySpark application that pulls data from Kafka using Spark Streaming. My messages are strings and I would like to call a method in Scala code, passing it a DStream[String] instance. However, I'm unable to receive proper JVM strings in the Scala code. It looks to me like the Python strings are not
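For context on where this breaks, here is a rough sketch of the kind of handoff being attempted (MyScalaHelper and its process method, the topic name, and the broker address are all hypothetical placeholders). The JVM-side stream can be reached through the DStream's _jdstream attribute and the sc._jvm gateway, but, as the question describes, what PySpark ships across that boundary is Python-serialized data rather than plain JVM Strings, so the Scala side does not receive proper DStream[String] elements.

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 5)  # 5-second batches

# Messages arrive as (key, value) pairs; keep just the string payload
lines = KafkaUtils.createDirectStream(
    ssc, ["my-topic"], {"metadata.broker.list": "host:9092"}  # placeholder topic/broker
).map(lambda kv: kv[1])

# Hand the underlying JVM DStream to the (hypothetical) Scala helper via Py4j
sc._jvm.MyScalaHelper.process(lines._jdstream)

ssc.start()
ssc.awaitTermination()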

Dataframe transpose with pyspark in Apache Spark

南楼画角 submitted on 2020-01-10 09:04:53
Question: I have a dataframe df that has the following structure:

+-----+-----+-----+-------+
|  s  |col_1|col_2|col_...|
+-----+-----+-----+-------+
|  f1 |  0.0|  0.6|  ...  |
|  f2 |  0.6|  0.7|  ...  |
|  f3 |  0.5|  0.9|  ...  |
| ... |  ...|  ...|  ...  |

And I want to calculate the transpose of this dataframe so it will look like

+-------+-----+-----+-----+-----+
|   s   |  f1 |  f2 |  f3 | ... |
+-------+-----+-----+-----+-----+
|col_1  |  0.0|  0.6|  0.5| ... |
|col_2  |  0.6|  0.7|  0.9| ... |
|col_...|  ...|  ...|  ...| ... |
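When the frame is small enough to fit on the driver, one simple approach is to convert to pandas, transpose there, and optionally convert back; a sketch assuming the key column is named s as above and a SparkSession called spark is available (older code would use sqlContext.createDataFrame instead):

# Collect to the driver, transpose in pandas, then return to Spark if needed
pdf = df.toPandas().set_index("s").T.reset_index().rename(columns={"index": "s"})
transposed_df = spark.createDataFrame(pdf)
transposed_df.show()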

Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

大城市里の小女人 submitted on 2020-01-10 02:46:15
Question: I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command.

17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed
17/12/27 18:33:19 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.rpc.RpcTimeout
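For reference, and without claiming this is the cause of the heartbeat error above, the two usual directions are to raise spark.driver.maxResultSize with an explicit size unit and, where possible, to avoid pulling large results back to the driver at all; a sketch of both (sizes and paths are illustrative):

from pyspark.sql import SparkSession

# Raise the limit with an explicit unit when building the session
# (equivalent to --conf spark.driver.maxResultSize=2g on spark-submit)
spark = (
    SparkSession.builder
    .appName("example")
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate()
)

# Prefer writing large results out instead of collect()-ing them on the driver
df = spark.range(10000000)
df.write.mode("overwrite").parquet("/tmp/output")  # illustrative path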

PySpark: How to create a nested JSON from spark data frame?

怎甘沉沦 submitted on 2020-01-10 02:21:08
Question: I am trying to create a nested JSON from my Spark dataframe, which has data in the following structure. The code below creates a simple JSON with key and value. Could you please help?

df.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)

Update 1: As per @MaxU's answer, I converted the Spark data frame to pandas and used group by. It is putting the last two fields in a nested array. How could I first put the category and count in a nested array and then inside
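As a Spark-side alternative to the pandas group-by route, the nesting can also be built before writing, using struct and collect_list; a sketch with hypothetical column names (id as the grouping key, plus category and count), reusing the data_output_file path from the question:

from pyspark.sql import functions as F

# Nest (category, count) pairs into an array per grouping key before writing
nested_df = (
    df.groupBy("id")  # hypothetical grouping column
      .agg(F.collect_list(F.struct("category", "count")).alias("items"))
)

nested_df.coalesce(1).write.mode("overwrite").json(data_output_file + "createjson.json")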