pyspark

PySpark 2.x: Programmatically adding Maven JAR Coordinates to Spark

Submitted by 冷暖自知 on 2021-02-07 19:42:06
Question: The following is my PySpark startup snippet, which is pretty reliable (I've been using it for a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively "plugging in" Kafka support). That normally triggers automatic dependency downloads by Spark:

import sys, os, multiprocessing
from pyspark.sql import DataFrame, DataFrameStatFunctions, DataFrameNaFunctions
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
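
A minimal sketch of the pattern the question is about, with hypothetical Kafka coordinates (the exact artifacts must match your Spark and Scala versions): spark.jars.packages has to be set on the configuration before the SparkSession is created, at which point Spark resolves and downloads the JARs automatically.

# Sketch only: setting Maven coordinates programmatically via spark.jars.packages.
# The coordinates below are illustrative; pick the ones matching your Spark build.
from pyspark.sql import SparkSession

packages = ",".join([
    "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0",  # hypothetical version
    "org.apache.kafka:kafka-clients:2.0.0",              # hypothetical version
])

spark = (SparkSession.builder
         .appName("kafka-enabled-session")
         .config("spark.jars.packages", packages)  # triggers the Ivy/Maven download at startup
         .getOrCreate())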

How to read multiple partitioned .gzip files into a Spark Dataframe?

Submitted by 扶醉桌前 on 2021-02-07 19:41:50
Question: I have the following folder of partitioned data:

my_folder
|-- part-0000.gzip
|-- part-0001.gzip
|-- part-0002.gzip
|-- part-0003.gzip

I try to read this data into a dataframe using:

>>> my_df = spark.read.csv("/path/to/my_folder/*")
>>> my_df.show(5)
+--------------------+
|                 _c0|
+--------------------+
|��[I���...          |
|��RUu�[*Ք��g��T...  |
|�t��� �qd��8~��...  |
|�(���b4�:������I�...|
|���!y�)�PC��ќ\�...  |
+--------------------+
only showing top 5 rows

I also tried this to check the data:

>>> rdd =
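
A minimal sketch of one likely explanation, offered as an assumption rather than a confirmed fix: Hadoop picks the decompression codec from the file extension, so parts ending in .gz are decompressed transparently while a .gzip suffix is not recognised by default, which would explain the raw compressed bytes in _c0. Renaming (or copying) the parts to a .gz suffix and reading them as usual is one common workaround:

>>> # assumes the files were renamed part-0000.gz, part-0001.gz, ...
>>> my_df = spark.read.csv("/path/to/my_folder/*.gz")
>>> my_df.show(5)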

pyspark - Convert sparse vector obtained after one hot encoding into columns

Submitted by 僤鯓⒐⒋嵵緔 on 2021-02-07 18:43:41
Question: I am using the Apache Spark ML library to handle categorical features with one-hot encoding. After writing the code below, I get a vector c_idx_vec as the output of the one-hot encoding. I understand how to interpret this output vector, but I cannot figure out how to convert it into columns so that I get a new transformed dataframe. Take this dataset as an example:

>>> fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
>>> ss = StringIndexer
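
A minimal sketch of one way to do the expansion (an illustration, not the asker's code): on Spark 3.x, pyspark.ml.functions.vector_to_array turns the one-hot vector into a plain array column, and indexing that array yields one numeric column per category slot.

from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml.functions import vector_to_array
import pyspark.sql.functions as F

fd = spark.createDataFrame([(1.0, "a"), (1.5, "a"), (10.0, "b"), (3.2, "c")], ["x", "c"])
indexed = StringIndexer(inputCol="c", outputCol="c_idx").fit(fd).transform(fd)
encoded = OneHotEncoder(inputCols=["c_idx"], outputCols=["c_idx_vec"]).fit(indexed).transform(indexed)

# Expand the vector into one column per slot (dropLast leaves k-1 slots by default).
expanded = encoded.withColumn("v", vector_to_array("c_idx_vec"))
n_slots = len(expanded.first()["v"])
expanded.select("x", "c", *[F.col("v")[i].alias("c_idx_%d" % i) for i in range(n_slots)]).show()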

Cosine similarity of word2vec more than 1

Submitted by 大憨熊 on 2021-02-07 14:49:26
Question: I used Spark's word2vec algorithm to compute document vectors for a text. I then used the findSynonyms function of the model object to get synonyms of a few words. I see something like this:

w2vmodel.findSynonyms('science', 4).show(5)
+------------+------------------+
|        word|        similarity|
+------------+------------------+
|     physics| 1.714908638833209|
|     fiction|1.5189824643358183|
|neuroscience|1.4968051528391833|
|  psychology| 1.458865636374223|
+------------+------------------+

I do not
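
A minimal sketch (an addition, not part of the question): if the scores need to be genuine cosine similarities bounded by 1, one option is to pull the raw vectors out of the fitted model with getVectors() and normalise them yourself.

import numpy as np

def cosine(u, v):
    # Plain cosine similarity; the result always lies in [-1, 1].
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# w2vmodel is assumed to be the fitted Word2VecModel from the question (ML DataFrame API).
vectors = {row["word"]: row["vector"].toArray() for row in w2vmodel.getVectors().collect()}
print(cosine(vectors["science"], vectors["physics"]))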

Spark Streaming - processing binary data file

Submitted by 巧了我就是萌 on 2021-02-07 14:39:33
Question: I'm using pyspark 1.6.0. I have existing pyspark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code parses the bits in the data to convert them into ints, strings, booleans, and so on. Each binary file holds one record of data. In PySpark I read a binary file using:

sc.binaryFiles("s3n://.......")

This works great because it gives a tuple of (filename, data), but I'm trying to find an equivalent PySpark streaming API to read binary files as a stream (hopefully the
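
A minimal sketch under one assumption (fixed-length records, which is not stated in the question): StreamingContext.binaryRecordsStream is the closest streaming counterpart of sc.binaryFiles in pyspark 1.6; it watches a directory and hands each record to you as raw bytes, which struct can then unpack.

import struct
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                   # sc is the existing SparkContext; 10s batches
RECORD_LENGTH = 128                              # hypothetical record size in bytes
records = ssc.binaryRecordsStream("s3n://bucket/incoming/", RECORD_LENGTH)  # placeholder path

# Illustrative parse: first 4 bytes of each record as a big-endian int.
parsed = records.map(lambda b: struct.unpack(">i", b[:4])[0])
parsed.pprint()

ssc.start()
ssc.awaitTermination()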

pyspark: grouby and then get max value of each group

Submitted by 末鹿安然 on 2021-02-07 13:12:55
Question: I would like to group by a value and then find the max value in each group using PySpark. I have the following code, but now I am a bit stuck on how to extract the max value.

# some file contains tuples ('user', 'item', 'occurrences')
data_file = sc.textFile('file:///some_file.txt')
# Create the triplet so I can index into it
data_file = data_file.map(lambda l: l.split()).map(lambda l: (l[0], l[1], float(l[2])))
# Group by the user, i.e. r[0]
grouped = data_file.groupBy(lambda r: r[0])
# Here is where
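
A minimal sketch of one way to finish the last step (my phrasing, not the asker's): mapping to (user, occurrences) pairs and reducing with max avoids materialising whole groups the way groupBy does.

pairs = data_file.map(lambda t: (t[0], t[2]))   # (user, occurrences)
max_per_user = pairs.reduceByKey(max)           # keep the largest occurrence count per user
print(max_per_user.collect())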

Why does Spark (on Google Dataproc) not use all vcores?

Submitted by 杀马特。学长 韩版系。学妹 on 2021-02-07 12:31:55
Question: I'm running a Spark job on a Google Dataproc cluster, but it looks like Spark is not using all the vcores available in the cluster, as you can see below. Based on some other questions like this and this, I have set up the cluster to use the DominantResourceCalculator so that both vcpus and memory are considered for resource allocation:

gcloud dataproc clusters create cluster_name --bucket="profiling-job-default" \
  --zone=europe-west1-c \
  --master-boot-disk-size=500GB \
  --worker-boot-disk-size=500GB \
  --master
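
A minimal sketch with assumed numbers (they would need to be sized to the actual machine types): alongside the YARN DominantResourceCalculator setting, the executors themselves usually have to be told to ask for more than the default single vcore, for example through spark.executor.cores and related properties.

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set("spark.executor.instances", "4")   # hypothetical: total executors
        .set("spark.executor.cores", "4")       # hypothetical: vcores per executor
        .set("spark.executor.memory", "10g"))   # hypothetical: memory per executor

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.defaultParallelism)    # quick sanity check on the granted cores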

Kafka Structured Streaming KafkaSourceProvider could not be instantiated

Submitted by 橙三吉。 on 2021-02-07 11:38:13
Question: I am working on a streaming project where I have a Kafka stream of ping statistics, like so:

64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=1 ttl=62 time=0.913 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=2 ttl=62 time=0.936 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=3 ttl=62 time=0.980 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=4 ttl=62 time=0.889 ms

I am trying to read this as a structured stream in
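
A minimal sketch of the usual shape of such a read (broker address and topic name are placeholders): the KafkaSourceProvider instantiation error typically comes from a missing or version-mismatched spark-sql-kafka package, so the matching org.apache.spark:spark-sql-kafka-0-10_<scala>:<spark-version> coordinate has to be on the classpath (for example via --packages) before this runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ping-stream").getOrCreate()

pings = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
         .option("subscribe", "ping-stats")                    # placeholder topic
         .load()
         .selectExpr("CAST(value AS STRING) AS line"))         # each ping line as text

query = pings.writeStream.format("console").start()
query.awaitTermination()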

Difference of elements in list in PySpark

Submitted by 白昼怎懂夜的黑 on 2021-02-07 10:59:28
Question: I have a PySpark dataframe (df) with a column that contains lists of two elements. The two elements in the list are not sorted in ascending or descending order.

+--------+----------+-------+
| version| timestamp|  list |
+--------+----------+-------+
| v1     |2012-01-10| [5,2] |
| v1     |2012-01-11| [2,5] |
| v1     |2012-01-12| [3,2] |
| v2     |2012-01-12| [2,3] |
| v2     |2012-01-11| [1,2] |
| v2     |2012-01-13| [2,1] |
+--------+----------+-------+

I want to take the difference between the first and the
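
A minimal sketch (column names taken from the question, df assumed to be the dataframe shown above): the two list entries can be subtracted directly with column expressions, and wrapping the result in abs() makes it independent of the element order.

import pyspark.sql.functions as F

result = df.withColumn("diff", F.abs(F.col("list")[0] - F.col("list")[1]))
result.show()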