pyspark

Pyspark Structured Streaming Kafka configuration error

Question: I've been using PySpark for Spark Streaming (Spark 2.0.2) with Kafka (0.10.1.0) successfully before, but my purposes are better suited to Structured Streaming. I've attempted to follow the example online (https://spark.apache.org/docs/2.1.0/structured-streaming-kafka-integration.html) with the following analogous code:

ds1 = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .option("subscribe", "topic1") \
    .load()

query = ds1 \
    .writeStream \
    .outputMode(…
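The quoted question is cut off before the actual error, but the configuration problem most commonly hit with this example is a missing Kafka connector package. The sketch below is only the full read/write pattern from the linked guide, assuming the spark-sql-kafka-0-10 package is supplied at submit time (the package coordinates shown are for Spark 2.1 on Scala 2.11):

# Minimal sketch of the Structured Streaming Kafka pattern from the linked guide.
# Assumes the job is launched with the Kafka connector on the classpath, e.g.:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0 app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

ds1 = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
       .option("subscribe", "topic1")
       .load())

# Kafka keys and values arrive as binary; cast them before further processing.
values = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

query = (values.writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()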

Pyspark KMeans clustering features column IllegalArgumentException

Question: pyspark==2.4.0. Here is the code giving the exception:

LDA = spark.read.parquet('./LDA.parquet/')
LDA.printSchema()

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

kmeans = KMeans(featuresCol='topic_vector_fix_dim').setK(15).setSeed(1)
model = kmeans.fit(LDA)

The printed schema and the exception:

root
 |-- Id: string (nullable = true)
 |-- topic_vector_fix_dim: array (nullable = true)
 |    |-- element: double (containsNull = true)

IllegalArgumentException: 'requirement failed: Column topic…
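The exception text is truncated above, but a frequent cause of this kind of "requirement failed: Column ..." error is that Spark ML's KMeans expects the features column to be a Vector (VectorUDT), while the schema shows topic_vector_fix_dim as array<double>. A minimal sketch of converting the column with a UDF (Spark 2.4 has no built-in array-to-vector helper; pyspark.ml.functions.array_to_vector only arrived in 3.1):

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# Convert array<double> into the Vector type that Spark ML estimators require.
to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())

LDA_vec = LDA.withColumn("topic_vector_fix_dim", to_vector("topic_vector_fix_dim"))

kmeans = KMeans(featuresCol="topic_vector_fix_dim").setK(15).setSeed(1)
model = kmeans.fit(LDA_vec)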

How to transform a DataFrame based on one column to create two new columns in PySpark?

Question: I have a dataframe "x" with two columns, "x1" and "x2":

x1 (status)    x2
kv,true        45
bm,true        65
mp,true        75
kv,null        450
bm,null        550
mp,null        650

I want to convert this dataframe into a format in which the data is pivoted according to its status and value:

x1    true    null
kv    45      450
bm    65      550
mp    75      650

Is there a way to do this? I am using a PySpark dataframe.

Answer 1: Yes, there is a way. First split the first column by , using the split function, then split this dataframe into two dataframes (using…
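The quoted answer is cut off; as an illustrative sketch of one way to reach the pivoted layout, the snippet below uses split together with groupBy/pivot rather than the answer's two-dataframe approach (the sample data is recreated from the table above):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

x = spark.createDataFrame(
    [("kv,true", 45), ("bm,true", 65), ("mp,true", 75),
     ("kv,null", 450), ("bm,null", 550), ("mp,null", 650)],
    ["x1", "x2"],
)

# Split "kv,true" into a key ("kv") and a status ("true"/"null") ...
parts = (x.withColumn("key", F.split("x1", ",")[0])
          .withColumn("status", F.split("x1", ",")[1]))

# ... then pivot the status values into columns, one row per key.
result = (parts.groupBy("key")
               .pivot("status", ["true", "null"])
               .agg(F.first("x2")))

result.show()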

How to concatenate two arrays in PySpark

Question: I have a PySpark dataframe. Example:

ID | phone  | name <array>  | age <array>
---|--------|---------------|------------
12 | 827556 | ['AB','AA']   | ['CC']
45 | 87346  | null          | ['DD']
56 | 98356  | ['FF']        | null
34 | 87345  | ['AA','BB']   | ['BB']

I want to concatenate the two arrays name and age. I did it like this: df = df.withColumn("new…
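The question's own attempt is truncated above; as a sketch, assuming Spark 2.4+ (where concat() accepts array columns; older versions usually need a UDF), nulls can be padded with empty arrays so that a null on either side does not null out the whole result. The output column name "name_age" is a placeholder:

from pyspark.sql import functions as F

# Replace a null array with an empty one before concatenating, so rows like
# (null, ['DD']) still produce ['DD'] instead of null.
empty = F.array().cast("array<string>")

df = df.withColumn(
    "name_age",
    F.concat(
        F.coalesce(F.col("name"), empty),
        F.coalesce(F.col("age"), empty),
    ),
)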

Assign SQL schema to Spark DataFrame

Question: I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField, and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark. This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?

create_table_sql = '''
CREATE TABLE public.example (
    id LONG,
    example VARCHAR(80…
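No answer is quoted above; as a sketch, assuming Spark 2.3+, several public PySpark APIs accept a DDL-formatted schema string directly, which avoids creating an empty table just to extract its schema (note the types are Spark SQL types, so LONG becomes BIGINT and VARCHAR(80) is typically just STRING):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A schema expressed as a DDL string rather than StructType/StructField.
ddl_schema = "id BIGINT, example STRING"

# DataFrame readers take the DDL string directly ...
df = spark.read.schema(ddl_schema).csv("/path/to/data.csv")

# ... and so does createDataFrame.
df2 = spark.createDataFrame([(1, "a"), (2, "b")], schema=ddl_schema)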

Why does Spark running in Google Dataproc store temporary files on external storage (GCS) instead of local disk or HDFS while using saveAsTextFile?

Question: I have run the following PySpark code:

from pyspark import SparkContext

sc = SparkContext()
data = sc.textFile('gs://bucket-name/input_blob_path')
sorted_data = data.sortBy(lambda x: sort_criteria(x))
sorted_data.saveAsTextFile(
    'gs://bucket-name/output_blob_path',
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)

The job finished successfully. However, during the job execution Spark created many temporary blobs under the following path: gs://bucket-name/output_blob_path/_temporary/0/…
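The question is cut off, but the _temporary directory it describes comes from Hadoop's FileOutputCommitter, which stages task output under _temporary inside the destination path and renames it into place at commit time, so the staging data lives on whatever filesystem the output path points to (here GCS). The following is a hedged sketch only, not a quoted answer, assuming the v2 committer algorithm is acceptable for this job (it commits task output straight into the destination, reducing the final bulk rename):

from pyspark import SparkConf, SparkContext

conf = SparkConf()
# Hadoop setting: FileOutputCommitter algorithm v2 commits each task's
# output directly into the destination directory at task-commit time.
conf.set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")

sc = SparkContext(conf=conf)

def sort_criteria(line):
    # Placeholder for the question's unspecified sort key function.
    return line

data = sc.textFile('gs://bucket-name/input_blob_path')
sorted_data = data.sortBy(lambda x: sort_criteria(x))
sorted_data.saveAsTextFile(
    'gs://bucket-name/output_blob_path',
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)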

Collecting the result of PySpark Dataframe filter into a variable

Question: I am using the PySpark dataframe. My dataset contains three attributes, id, name and address. I am trying to delete the corresponding row based on the name value. What I've been trying is to get the unique id of the row I want to delete:

ID = df.filter(df["name"] == "Bruce").select(df["id"]).collect()

The output I am getting is the following: [Row(id='382')]. I am wondering how I can use this id to delete the row. Also, how can I replace a certain value in a dataframe with another? For example, replacing…
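No answer is quoted; as a sketch of the usual pattern, keeping in mind that Spark DataFrames are immutable, "deleting" a row means filtering it out and "replacing" a value means deriving a new column. The replacement value "Wayne" below is purely illustrative:

from pyspark.sql import functions as F

# Pull the id out of the collected Row list.
row_id = df.filter(df["name"] == "Bruce").select("id").collect()[0]["id"]

# "Delete" the row by keeping everything that does not match that id.
df_without = df.filter(df["id"] != row_id)

# Replace a value by building a new column with when/otherwise.
df_replaced = df.withColumn(
    "name",
    F.when(F.col("name") == "Bruce", "Wayne").otherwise(F.col("name")),
)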

PySpark convert struct field inside array to string

Question: I have a dataframe with a schema like this:

 |-- order: string (nullable = true)
 |-- travel: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- place: struct (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- address: string (nullable = true)
 |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |-- longitude: double (nullable = true)
 |    |    |-- distance_in_kms: float (nullable = true)
 |    |    |-- estimated_time: struct (nullable = true)
 |    |    |    |-- seconds: long…
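The question text stops at the schema; below is a sketch of one way to turn a struct field inside the travel array into a string, assuming Spark 2.4+ (for the transform higher-order function) and assuming estimated_time is the field to be stringified, since the truncated question does not say which one:

from pyspark.sql import functions as F

# Rebuild every element of the travel array, casting estimated_time.seconds
# to a string while keeping the other fields as they are.
df = df.withColumn(
    "travel",
    F.expr("""
        transform(travel, t -> struct(
            t.place AS place,
            t.distance_in_kms AS distance_in_kms,
            CAST(t.estimated_time.seconds AS STRING) AS estimated_time
        ))
    """),
)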

How to convert some PySpark dataframe columns into a dict with their column names and combine them into a JSON column?

Question: I have data in the following format, and I want to change its format using pyspark, ending up with two columns ('tag' and 'data'). The 'tag' column values are unique, and the 'data' column values are a JSON string obtained from the original columns 'date, stock, price', in which 'stock' and 'price' are combined to form the 'A' value, and 'date' and 'num' are combined to form the 'B' value. I didn't find or write a good function to achieve this. My Spark version is 2.1.0.

Original DataFrame:
date, stock,…
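The example data and target layout are truncated above, so the snippet below is only a sketch of the general pattern for packing columns into a JSON string column with struct() and to_json() (both available since Spark 2.1); the exact nesting under 'A' and 'B' is an assumption based on the description:

from pyspark.sql import functions as F

# Bundle (stock, price) as "A" and (date, num) as "B", then serialize the
# whole thing to a JSON string in a single 'data' column.
df = df.withColumn(
    "data",
    F.to_json(F.struct(
        F.struct("stock", "price").alias("A"),
        F.struct("date", "num").alias("B"),
    )),
)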

PySpark collect() causing memory to shoot up to 80 GB

Question: I have a Spark job that reads a CSV file and does a bunch of joins and column renames. The file size is on the order of MBs.

x = info_collect.collect()

The size of x in Python is around 100 MB, however I get a memory crash; checking Ganglia, the memory goes up to 80 GB. I have no idea why collecting 100 MB can cause memory to spike like that. Could someone please advise?

Source: https://stackoverflow.com/questions/52483267/pyspark-collect-causing-memory-to-shoot-up-80gb
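No answer is quoted for this one; as a sketch of the usual ways to sidestep the problem (collect() pulls the whole result through the driver), assuming info_collect is a DataFrame and process() is a hypothetical per-row handler:

# Stream rows to the driver one partition at a time instead of all at once.
for row in info_collect.toLocalIterator():
    process(row)  # process() is a hypothetical per-row handler

# Or keep the result distributed and write it out without ever collecting.
info_collect.write.mode("overwrite").parquet("/tmp/info_collect_result")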