pyspark

PySpark 2.x: Programmatically adding Maven JAR Coordinates to Spark

Submitted by 冷暖自知 on 2021-02-07 19:42:06
Question: The following is my PySpark startup snippet, which is pretty reliable (I've been using it for a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively "plugging in" Kafka support). That normally triggers automatic dependency downloads by Spark:

import sys, os, multiprocessing
from pyspark.sql import DataFrame, DataFrameStatFunctions, DataFrameNaFunctions
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
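
A minimal sketch of the pattern the question is about, with hypothetical Kafka coordinates (the exact artifacts must match your Spark and Scala versions): spark.jars.packages has to be set on the configuration before the SparkSession is created, at which point Spark resolves and downloads the JARs automatically.

# Sketch only: setting Maven coordinates programmatically via spark.jars.packages.
# The coordinates below are illustrative; pick the ones matching your Spark build.
from pyspark.sql import SparkSession

packages = ",".join([
    "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0",  # hypothetical version
    "org.apache.kafka:kafka-clients:2.0.0",              # hypothetical version
])

spark = (SparkSession.builder
         .appName("kafka-enabled-session")
         .config("spark.jars.packages", packages)  # triggers the Ivy/Maven download at startup
         .getOrCreate())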

How to read multiple partitioned .gzip files into a Spark Dataframe?

Submitted by 扶醉桌前 on 2021-02-07 19:41:50
Question: I have the following folder of partitioned data:

my_folder
|-- part-0000.gzip
|-- part-0001.gzip
|-- part-0002.gzip
|-- part-0003.gzip

I try to read this data into a dataframe using:

>>> my_df = spark.read.csv("/path/to/my_folder/*")
>>> my_df.show(5)
+--------------------+
|                 _c0|
+--------------------+
|��[I���...          |
|��RUu�[*Ք��g��T...  |
|�t��� �qd��8~��...  |
|�(���b4�:������I�...|
|���!y�)�PC��ќ\�...  |
+--------------------+
only showing top 5 rows

I also tried this to check the data:

>>> rdd =
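
A minimal sketch of one likely explanation, offered as an assumption rather than a confirmed fix: Hadoop picks the decompression codec from the file extension, so parts ending in .gz are decompressed transparently while a .gzip suffix is not recognised by default, which would explain the raw compressed bytes in _c0. Renaming (or copying) the parts to a .gz suffix and reading them as usual is one common workaround:

>>> # assumes the files were renamed part-0000.gz, part-0001.gz, ...
>>> my_df = spark.read.csv("/path/to/my_folder/*.gz")
>>> my_df.show(5)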

pyspark - Convert sparse vector obtained after one hot encoding into columns

Submitted by 僤鯓⒐⒋嵵緔 on 2021-02-07 18:43:41
Question: I am using the Apache Spark ML library to handle categorical features with one-hot encoding. After writing the code below, I get a vector c_idx_vec as the output of the one-hot encoding. I understand how to interpret this output vector, but I cannot figure out how to convert it into columns so that I get a new transformed dataframe. Take this dataset as an example:

>>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
>>> ss = StringIndexer
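
A minimal sketch of one way to do the expansion (an illustration, not the asker's code): on Spark 3.x, pyspark.ml.functions.vector_to_array turns the one-hot vector into a plain array column, and indexing that array yields one numeric column per category slot.

from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml.functions import vector_to_array
import pyspark.sql.functions as F

fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
indexed = StringIndexer(inputCol="c", outputCol="c_idx").fit(fd).transform(fd)
encoded = OneHotEncoder(inputCols=["c_idx"], outputCols=["c_idx_vec"]).fit(indexed).transform(indexed)

# Expand the vector into one column per slot (dropLast leaves k-1 slots by default).
expanded = encoded.withColumn("v", vector_to_array("c_idx_vec"))
n_slots = len(expanded.first()["v"])
expanded.select("x", "c", *[F.col("v")[i].alias("c_idx_%d" % i) for i in range(n_slots)]).show()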

Cosine similarity of word2vec more than 1

Submitted by 大憨熊 on 2021-02-07 14:49:26
Question: I used Spark's word2vec algorithm to compute document vectors for a text. I then used the findSynonyms function of the model object to get synonyms of a few words. I see something like this:

w2vmodel.findSynonyms('science', 4).show(5)
+------------+------------------+
|        word|        similarity|
+------------+------------------+
|     physics| 1.714908638833209|
|     fiction|1.5189824643358183|
|neuroscience|1.4968051528391833|
|  psychology| 1.458865636374223|
+------------+------------------+

I do not
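
A minimal sketch (an addition, not part of the question): if the scores need to be genuine cosine similarities bounded by 1, one option is to pull the raw vectors out of the fitted model with getVectors() and normalise them yourself.

import numpy as np

def cosine(u, v):
    # Plain cosine similarity; the result always lies in [-1, 1].
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# w2vmodel is assumed to be the fitted Word2VecModel from the question (ML DataFrame API).
vectors = {row["word"]: row["vector"].toArray() for row in w2vmodel.getVectors().collect()}
print(cosine(vectors["science"], vectors["physics"]))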

Spark Streaming - processing binary data file

Submitted by 巧了我就是萌 on 2021-02-07 14:39:33
Question: I'm using pyspark 1.6.0. I have existing pyspark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code parses the bits in the data to convert them into ints, strings, booleans, and so on. Each binary file holds one record of data. In PySpark I read a binary file using:

sc.binaryFiles("s3n://.......")

This works great because it gives a tuple of (filename, data), but I'm trying to find an equivalent PySpark streaming API to read binary files as a stream (hopefully the
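
A minimal sketch under one assumption (fixed-length records, which is not stated in the question): StreamingContext.binaryRecordsStream is the closest streaming counterpart of sc.binaryFiles in pyspark 1.6; it watches a directory and hands each record to you as raw bytes, which struct can then unpack.

import struct
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                   # sc is the existing SparkContext; 10s batches
RECORD_LENGTH = 128                              # hypothetical record size in bytes
records = ssc.binaryRecordsStream("s3n://bucket/incoming/", RECORD_LENGTH)  # placeholder path

# Illustrative parse: first 4 bytes of each record as a big-endian int.
parsed = records.map(lambda b: struct.unpack(">i", b[:4])[0])
parsed.pprint()

ssc.start()
ssc.awaitTermination()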

pyspark: grouby and then get max value of each group

Submitted by 末鹿安然 on 2021-02-07 13:12:55
Question: I would like to group by a value and then find the max value in each group using PySpark. I have the following code, but now I am a bit stuck on how to extract the max value.

# some file contains tuples ('user', 'item', 'occurrences')
data_file = sc.textFile('file:///some_file.txt')
# Create the triplet so I can index into it
data_file = data_file.map(lambda l: l.split()).map(lambda l: (l[0], l[1], float(l[2])))
# Group by the user, i.e. r[0]
grouped = data_file.groupBy(lambda r: r[0])
# Here is where
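
A minimal sketch of one way to finish the last step (my phrasing, not the asker's): mapping to (user, occurrences) pairs and reducing with max avoids materialising whole groups the way groupBy does.

pairs = data_file.map(lambda t: (t[0], t[2]))   # (user, occurrences)
max_per_user = pairs.reduceByKey(max)           # keep the largest occurrence count per user
print(max_per_user.collect())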

Why does Spark (on Google Dataproc) not use all vcores?

Submitted by 杀马特。学长 韩版系。学妹 on 2021-02-07 12:31:55
Question: I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster, as you can see below. Based on some other questions like this and this, I have set up the cluster to use the DominantResourceCalculator so that both vcpus and memory are considered for resource allocation:

gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
  --zone=europe-west1-c \
  --master-boot-disk-size=500GB \
  --worker-boot-disk-size=500GB \
  --master
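
A minimal sketch with assumed numbers (they would need to be sized to the actual machine types): alongside the YARN DominantResourceCalculator setting, the executors themselves usually have to be told to ask for more than the default single vcore, for example through spark.executor.cores and related properties.

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set("spark.executor.instances", "4")   # hypothetical: total executors
        .set("spark.executor.cores", "4")       # hypothetical: vcores per executor
        .set("spark.executor.memory", "10g"))   # hypothetical: memory per executor

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.defaultParallelism)    # quick sanity check on the granted cores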

Kafka Structured Streaming KafkaSourceProvider could not be instantiated

Submitted by 橙三吉。 on 2021-02-07 11:38:13
Question: I am working on a streaming project where I have a Kafka stream of ping statistics, like so:

64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=1 ttl=62 time=0.913 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=2 ttl=62 time=0.936 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=3 ttl=62 time=0.980 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=4 ttl=62 time=0.889 ms

I am trying to read this as a structured stream in
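
A minimal sketch of the usual shape of such a read (broker address and topic name are placeholders): the KafkaSourceProvider instantiation error typically comes from a missing or version-mismatched spark-sql-kafka package, so the matching org.apache.spark:spark-sql-kafka-0-10_<scala>:<spark-version> coordinate has to be on the classpath (for example via --packages) before this runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ping-stream").getOrCreate()

pings = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
         .option("subscribe", "ping-stats")                    # placeholder topic
         .load()
         .selectExpr("CAST(value AS STRING) AS line"))         # each ping line as text

query = pings.writeStream.format("console").start()
query.awaitTermination()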

Difference of elements in list in PySpark

Submitted by 白昼怎懂夜的黑 on 2021-02-07 10:59:28
Question: I have a PySpark dataframe (df) with a column that contains lists of two elements. The two elements in the list are not sorted in ascending or descending order.

+--------+----------+-------+
| version| timestamp|  list |
+--------+----------+-------+
| v1     |2012-01-10| [5,2] |
| v1     |2012-01-11| [2,5] |
| v1     |2012-01-12| [3,2] |
| v2     |2012-01-12| [2,3] |
| v2     |2012-01-11| [1,2] |
| v2     |2012-01-13| [2,1] |
+--------+----------+-------+

I want to take the difference between the first and the
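
A minimal sketch (column names taken from the question, df assumed to be the dataframe shown above): the two list entries can be subtracted directly with column expressions, and wrapping the result in abs() makes it independent of the element order.

import pyspark.sql.functions as F

result = df.withColumn("diff", F.abs(F.col("list")[0] - F.col("list")[1]))
result.show()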