pyspark

Pyspark Structured Streaming Kafka configuration error

Question: I've been using PySpark for Spark Streaming (Spark 2.0.2) with Kafka (0.10.1.0) successfully before, but my purposes are better suited to Structured Streaming. I've attempted to follow the example online (https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html) with the following analogous code:

ds1 = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic1") \
    .load()

query = ds1 \
    .writeStream \
    .outputMode(…
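The quoted question is cut off before the actual error, but the configuration problem most commonly hit with this example is a missing Kafka connector package. The sketch below is only the full read/write pattern from the linked guide, assuming the spark-sql-kafka-0-10 package is supplied at submit time (the package coordinates shown are for Spark 2.1 on Scala 2.11):

# Minimal sketch of the Structured Streaming Kafka pattern from the linked guide.
# Assumes the job is launched with the Kafka connector on the classpath, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

ds1 = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
       .option("subscribe", "topic1")
       .load())

# Kafka keys and values arrive as binary; cast them before further processing.
values = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

query = (values.writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()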

Pyspark KMeans clustering features column IllegalArgumentException

Question: pyspark==2.4.0. Here is the code giving the exception:

LDA = spark.read.parquet('./LDA.parquet/')
LDA.printSchema()

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(featuresCol='topic_vector_fix_dim').setK(15).setSeed(1)
model = kmeans.fit(LDA)

The printed schema and the exception:

root
 |-- Id: string (nullable = true)
 |-- topic_vector_fix_dim: array (nullable = true)
 |    |-- element: double (containsNull = true)

IllegalArgumentException: 'requirement failed: Column topic…
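The exception text is truncated above, but a frequent cause of this kind of "requirement failed: Column ..." error is that Spark ML's KMeans expects the features column to be a Vector (VectorUDT), while the schema shows topic_vector_fix_dim as array<double>. A minimal sketch of converting the column with a UDF (Spark 2.4 has no built-in array-to-vector helper; pyspark.ml.functions.array_to_vector only arrived in 3.1):

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# Convert array<double> into the Vector type that Spark ML estimators require.
to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())

LDA_vec = LDA.withColumn("topic_vector_fix_dim", to_vector("topic_vector_fix_dim"))

kmeans = KMeans(featuresCol="topic_vector_fix_dim").setK(15).setSeed(1)
model = kmeans.fit(LDA_vec)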

How to transform a DataFrame based on one column to create two new columns in PySpark?

Question: I have a dataframe "x" with two columns, "x1" and "x2":

x1 (status)    x2
kv,true        45
bm,true        65
mp,true        75
kv,null        450
bm,null        550
mp,null        650

I want to convert this dataframe into a format in which the data is pivoted according to its status and value:

x1    true    null
kv    45      450
bm    65      550
mp    75      650

Is there a way to do this? I am using a PySpark dataframe.

Answer 1: Yes, there is a way. First split the first column by , using the split function, then split this dataframe into two dataframes (using…
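The quoted answer is cut off; as an illustrative sketch of one way to reach the pivoted layout, the snippet below uses split together with groupBy/pivot rather than the answer's two-dataframe approach (the sample data is recreated from the table above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

x = spark.createDataFrame(
    [("kv,true", 45), ("bm,true", 65), ("mp,true", 75),
     ("kv,null", 450), ("bm,null", 550), ("mp,null", 650)],
    ["x1", "x2"],
)

# Split "kv,true" into a key ("kv") and a status ("true"/"null") ...
parts = (x.withColumn("key", F.split("x1", ",")[0])
          .withColumn("status", F.split("x1", ",")[1]))

# ... then pivot the status values into columns, one row per key.
result = (parts.groupBy("key")
               .pivot("status", ["true", "null"])
               .agg(F.first("x2")))

result.show()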

How to concatenate two arrays in PySpark

Question: I have a PySpark dataframe. Example:

ID | phone  | name <array>  | age <array>
---|--------|---------------|------------
12 | 827556 | ['AB','AA']   | ['CC']
45 | 87346  | null          | ['DD']
56 | 98356  | ['FF']        | null
34 | 87345  | ['AA','BB']   | ['BB']

I want to concatenate the two arrays name and age. I did it like this: df = df.withColumn("new…
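The question's own attempt is truncated above; as a sketch, assuming Spark 2.4+ (where concat() accepts array columns; older versions usually need a UDF), nulls can be padded with empty arrays so that a null on either side does not null out the whole result. The output column name "name_age" is a placeholder:

from pyspark.sql import functions as F

# Replace a null array with an empty one before concatenating, so rows like
# (null, ['DD']) still produce ['DD'] instead of null.
empty = F.array().cast("array<string>")

df = df.withColumn(
    "name_age",
    F.concat(
        F.coalesce(F.col("name"), empty),
        F.coalesce(F.col("age"), empty),
    ),
)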

Assign SQL schema to Spark DataFrame

Question: I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField, and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark. This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?

create_table_sql = '''
CREATE TABLE public.example (
    id LONG,
    example VARCHAR(80…
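No answer is quoted above; as a sketch, assuming Spark 2.3+, several public PySpark APIs accept a DDL-formatted schema string directly, which avoids creating an empty table just to extract its schema (note the types are Spark SQL types, so LONG becomes BIGINT and VARCHAR(80) is typically just STRING):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A schema expressed as a DDL string rather than StructType/StructField.
ddl_schema = "id BIGINT, example STRING"

# DataFrame readers take the DDL string directly ...
df = spark.read.schema(ddl_schema).csv("/path/to/data.csv")

# ... and so does createDataFrame.
df2 = spark.createDataFrame([(1, "a"), (2, "b")], schema=ddl_schema)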

Why does Spark running in Google Dataproc store temporary files on external storage (GCS) instead of local disk or HDFS while using saveAsTextFile?

Question: I have run the following PySpark code:

from pyspark import SparkContext

sc = SparkContext()
data = sc.textFile('gs://bucket-name/input_blob_path')
sorted_data = data.sortBy(lambda x: sort_criteria(x))
sorted_data.saveAsTextFile(
    'gs://bucket-name/output_blob_path',
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)

The job finished successfully. However, during the job execution Spark created many temporary blobs under the following path: gs://bucket-name/output_blob_path/_temporary/0/…
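The question is cut off, but the _temporary directory it describes comes from Hadoop's FileOutputCommitter, which stages task output under _temporary inside the destination path and renames it into place at commit time, so the staging data lives on whatever filesystem the output path points to (here GCS). The following is a hedged sketch only, not a quoted answer, assuming the v2 committer algorithm is acceptable for this job (it commits task output straight into the destination, reducing the final bulk rename):

from pyspark import SparkConf, SparkContext

conf = SparkConf()
# Hadoop setting: FileOutputCommitter algorithm v2 commits each task's
# output directly into the destination directory at task-commit time.
conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")

sc = SparkContext(conf=conf)

def sort_criteria(line):
    # Placeholder for the question's unspecified sort key function.
    return line

data = sc.textFile('gs://bucket-name/input_blob_path')
sorted_data = data.sortBy(lambda x: sort_criteria(x))
sorted_data.saveAsTextFile(
    'gs://bucket-name/output_blob_path',
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)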

Collecting the result of PySpark Dataframe filter into a variable

Question: I am using the PySpark dataframe. My dataset contains three attributes, id, name and address. I am trying to delete the corresponding row based on the name value. What I've been trying is to get the unique id of the row I want to delete:

ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()

The output I am getting is the following: [Row(id='382')]. I am wondering how I can use this id to delete the row. Also, how can I replace a certain value in a dataframe with another? For example, replacing…
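No answer is quoted; as a sketch of the usual pattern, keeping in mind that Spark DataFrames are immutable, "deleting" a row means filtering it out and "replacing" a value means deriving a new column. The replacement value "Wayne" below is purely illustrative:

from pyspark.sql import functions as F

# Pull the id out of the collected Row list.
row_id = df.filter(df["name"] == "Bruce").select("id").collect()[0]["id"]

# "Delete" the row by keeping everything that does not match that id.
df_without = df.filter(df["id"] != row_id)

# Replace a value by building a new column with when/otherwise.
df_replaced = df.withColumn(
    "name",
    F.when(F.col("name") == "Bruce", "Wayne").otherwise(F.col("name")),
)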

PySpark convert struct field inside array to string

Question: I have a dataframe with a schema like this:

 |-- order: string (nullable = true)
 |-- travel: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- place: struct (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- address: string (nullable = true)
 |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |-- longitude: double (nullable = true)
 |    |    |-- distance_in_kms: float (nullable = true)
 |    |    |-- estimated_time: struct (nullable = true)
 |    |    |    |-- seconds: long…
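The question text stops at the schema; below is a sketch of one way to turn a struct field inside the travel array into a string, assuming Spark 2.4+ (for the transform higher-order function) and assuming estimated_time is the field to be stringified, since the truncated question does not say which one:

from pyspark.sql import functions as F

# Rebuild every element of the travel array, casting estimated_time.seconds
# to a string while keeping the other fields as they are.
df = df.withColumn(
    "travel",
    F.expr("""
        transform(travel, t -> struct(
            t.place AS place,
            t.distance_in_kms AS distance_in_kms,
            CAST(t.estimated_time.seconds AS STRING) AS estimated_time
        ))
    """),
)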

How to convert some PySpark dataframe columns into a dict with their column names and combine them into a JSON column?

Question: I have data in the following format, and I want to change its format using pyspark, ending up with two columns ('tag' and 'data'). The 'tag' column values are unique, and the 'data' column values are a JSON string obtained from the original columns 'date, stock, price', in which 'stock' and 'price' are combined to form the 'A' value, and 'date' and 'num' are combined to form the 'B' value. I didn't find or write a good function to achieve this. My Spark version is 2.1.0.

Original DataFrame:
date, stock,…
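The example data and target layout are truncated above, so the snippet below is only a sketch of the general pattern for packing columns into a JSON string column with struct() and to_json() (both available since Spark 2.1); the exact nesting under 'A' and 'B' is an assumption based on the description:

from pyspark.sql import functions as F

# Bundle (stock, price) as "A" and (date, num) as "B", then serialize the
# whole thing to a JSON string in a single 'data' column.
df = df.withColumn(
    "data",
    F.to_json(F.struct(
        F.struct("stock", "price").alias("A"),
        F.struct("date", "num").alias("B"),
    )),
)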

PySpark collect() causing memory to shoot up to 80 GB

Question: I have a Spark job that reads a CSV file and does a bunch of joins and column renames. The file size is on the order of MBs.

x = info_collect.collect()

The size of x in Python is around 100 MB, however I get a memory crash; checking Ganglia, the memory goes up to 80 GB. I have no idea why collecting 100 MB can cause memory to spike like that. Could someone please advise?

Source: https://stackoverflow.com/questions/52483267/pyspark-collect-causing-memory-to-shoot-up-80gb
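No answer is quoted for this one; as a sketch of the usual ways to sidestep the problem (collect() pulls the whole result through the driver), assuming info_collect is a DataFrame and process() is a hypothetical per-row handler:

# Stream rows to the driver one partition at a time instead of all at once.
for row in info_collect.toLocalIterator():
    process(row)  # process() is a hypothetical per-row handler

# Or keep the result distributed and write it out without ever collecting.
info_collect.write.mode("overwrite").parquet("/tmp/info_collect_result")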