pyspark

Spark pyspark vs spark-submit

三世轮回 submitted on 2019-12-22 10:06:17

Question: The documentation on spark-submit says the following: The spark-submit script in Spark's bin directory is used to launch applications on a cluster. Regarding pyspark it says the following: You can also use bin/pyspark to launch an interactive Python shell. This question may sound stupid, but when I run commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right? Answer 1: There is no practical difference between these two. If not
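Both entry points end up with a SparkSession whose master setting decides where tasks execute, so a quick sanity check is to inspect that setting from the shell or script. A minimal sketch (the printed values below are only examples and depend on your cluster configuration):

    # Works the same inside a pyspark shell or a script run via spark-submit:
    # the configured master determines where tasks actually run.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print(spark.sparkContext.master)              # e.g. "yarn" or "local[*]"
    print(spark.sparkContext.defaultParallelism)  # rough number of cores available to tasks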

Partitioning of Data Frame in Pyspark using Custom Partitioner

匆匆过客 submitted on 2019-12-22 10:01:47

Question: Looking for some info on using a custom partitioner in PySpark. I have a dataframe holding data for various countries. If I repartition on the country column, it distributes my data into n partitions, keeping each country's data in specific partitions. This creates skewed partitions, which I can see using the glom() method. Some countries like USA and CHN have a huge amount of data in this particular dataframe. I want to repartition my dataframe such that if the countries are USA and
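A common workaround for this kind of key skew, sketched below under the assumption that only a few countries (e.g. USA and CHN) dominate, is to add a salt column and repartition on (country, salt) so the heavy keys spread across several partitions. Column names and the salt bucket count here are illustrative, not the asker's code:

    from pyspark.sql import functions as F

    HEAVY = ["USA", "CHN"]   # assumed heavy-hitter countries
    SALT_BUCKETS = 8         # illustrative number of sub-partitions per heavy key

    salted = df.withColumn(
        "salt",
        F.when(F.col("country").isin(HEAVY), (F.rand() * SALT_BUCKETS).cast("int"))
         .otherwise(F.lit(0)),
    )
    # Heavy countries now hash into several partitions; all others share bucket 0 per country.
    repartitioned = salted.repartition("country", "salt")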

UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings

走远了吗. submitted on 2019-12-22 09:55:06

Question: I am running Spark 2.4.2 locally through pyspark for an ML project in NLP. Part of the pre-processing in the Pipeline involves pandas_udf functions optimized through pyarrow. Each time I operate on the pre-processed Spark dataframe, the following warning appears: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings.warn("pyarrow.open_stream is deprecated, please use " I tried updating pyarrow but didn't manage to avoid the warning. My
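If the goal is only to quiet the message (it comes from pyarrow when Spark 2.4's Arrow serializer calls the older open_stream API), one option is to filter that specific warning with the standard warnings module; pinning pyarrow to a version Spark 2.4 officially supports is the other commonly suggested route. A minimal sketch:

    import warnings

    # Silence only the pyarrow.open_stream deprecation notice raised while
    # Spark serializes pandas_udf batches via Arrow; other warnings still show.
    warnings.filterwarnings(
        "ignore",
        message="pyarrow.open_stream is deprecated",
        category=UserWarning,
    )

Note that warnings raised inside executor processes may still appear in their own logs.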

How to retrieve all columns using pyspark collect_list functions

纵饮孤独 submitted on 2019-12-22 09:38:37

Question: I am on PySpark 2.0.1. I'm trying to group my data frame and retrieve the values for all fields. I found that z = data1.groupby('country').agg(F.collect_list('names')) gives me the values for the country and names attributes, with the names column headed collect_list(names). But for my job I have a dataframe with around 15 columns; I will run a loop, change the groupby field each time inside the loop, and need the output for all of the remaining fields. Can
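One way to avoid hard-coding column names is to build the collect_list aggregations with a comprehension over all columns except the current grouping key; a sketch against the question's data1 dataframe (the group_col value is whatever the loop sets it to):

    from pyspark.sql import functions as F

    group_col = "country"   # changes on each loop iteration
    agg_exprs = [
        F.collect_list(c).alias(c)   # alias keeps the original column name
        for c in data1.columns if c != group_col
    ]
    z = data1.groupby(group_col).agg(*agg_exprs)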

PySpark PCA: avoiding NotConvergedException

血红的双手。 submitted on 2019-12-22 09:38:32

Question: I'm attempting to reduce a wide dataset (51 features, ~1300 individuals) using PCA through the ml.linalg method as follows:
1) Named my columns as one list: features = indi_prep_df.select([c for c in indi_prep_df.columns if c not in {'indi_nbr','label'}]).columns
2) Imported the necessary libraries: from pyspark.ml.feature import PCA as PCAML; from pyspark.ml.linalg import Vector; from pyspark.ml.feature import VectorAssembler; from pyspark.ml.linalg import DenseVector
3) Collapsed the features to
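For reference, a minimal assembled-features PCA sketch with pyspark.ml, reusing the features list and indi_prep_df from the question; the output column names and k=10 are illustrative. Standardizing the inputs first (e.g. with StandardScaler) is a common way to help convergence:

    from pyspark.ml.feature import PCA as PCAML, VectorAssembler

    assembler = VectorAssembler(inputCols=features, outputCol="features_vec")
    assembled = assembler.transform(indi_prep_df)

    pca = PCAML(k=10, inputCol="features_vec", outputCol="pca_features")
    model = pca.fit(assembled)            # this is the step that can fail to converge
    reduced = model.transform(assembled)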

SparkSQL sql syntax for nth item in array

你离开我真会死。 submitted on 2019-12-22 08:55:41

Question: I have a JSON object with an unfortunate combination of nesting and arrays, so it's not totally obvious how to query it with Spark SQL. Here is a sample object: { stuff: [ {a:1,b:2,c:3} ] } In JavaScript, to get the value for c, I'd write myData.stuff[0].c. In my Spark SQL query, if that array weren't there, I'd be able to use dot notation: SELECT stuff.c FROM blah but I can't, because the innermost object is wrapped in an array. I've tried: SELECT stuff.0.c FROM blah // FAIL SELECT
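For context, Spark SQL indexes array elements with brackets rather than dot-number notation; a minimal sketch using the question's table and field names:

    # Array elements are 0-indexed with [] in Spark SQL expressions.
    spark.sql("SELECT stuff[0].c AS c FROM blah").show()

    # Equivalent DataFrame API form:
    from pyspark.sql import functions as F
    df.select(F.col("stuff").getItem(0).getField("c").alias("c")).show()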

structured streaming Kafka 2.1->Zeppelin 0.8->Spark 2.4: spark does not use jar

被刻印的时光 ゝ submitted on 2019-12-22 08:54:49

Question: I have a Kafka 2.1 message broker and want to process the message data within Spark 2.4. I want to use Zeppelin 0.8.1 notebooks for rapid prototyping. I downloaded the spark-streaming-kafka-0-10_2.11.jar that is necessary for structured streaming (http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) and added it as a "Dependencies" artifact to the "spark" interpreter of Zeppelin (which also handles the %pyspark paragraphs). I restarted this
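For reference, structured streaming reads Kafka through the spark-sql-kafka-0-10 package (the spark-streaming-kafka-0-10 jar targets the older DStream API), and the read itself looks like the sketch below; the broker address and topic are placeholders, and the package must actually be on the Zeppelin interpreter's classpath:

    # %pyspark
    stream_df = (spark.readStream
        .format("kafka")                                         # needs spark-sql-kafka on the classpath
        .option("kafka.bootstrap.servers", "broker-host:9092")   # placeholder broker
        .option("subscribe", "some-topic")                       # placeholder topic
        .load())

    query = (stream_df.selectExpr("CAST(value AS STRING) AS value")
        .writeStream
        .format("console")
        .start())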

Rate limit with Apache Spark GCS connector

拈花ヽ惹草 submitted on 2019-12-22 08:52:32

Question: I'm using Spark on a Google Compute Engine cluster with the Google Cloud Storage connector (instead of HDFS, as recommended), and get a lot of "rate limit" errors, as follows:
java.io.IOException: Error inserting: bucket: *****, object: *****
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.wrapException(GoogleCloudStorageImpl.java:1600)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl$3.run(GoogleCloudStorageImpl.java:475)
at java.util.concurrent.ThreadPoolExecutor.runWorker
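One commonly suggested mitigation, offered here only under the assumption that the errors come from many simultaneous object writes against the same bucket, is to reduce the number of output partitions before writing; the partition count and path below are placeholders:

    # Fewer output partitions means fewer concurrent GCS object inserts.
    (df.coalesce(32)
       .write
       .mode("overwrite")
       .parquet("gs://your-bucket/output/"))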

PySpark replace Null with Array

孤街浪徒 submitted on 2019-12-22 08:52:10

Question: After a join by ID, my data frame looks as follows:
ID | Features | Vector
1 | (50,[...] | Array[1.1,2.3,...]
2 | (50,[...] | Null
I ended up with Null values for some IDs in the column 'Vector'. I would like to replace these Null values with an array of zeros with 300 dimensions (the same format as the non-null vector entries). df.fillna does not work here since it's an array I would like to insert. Any idea how to accomplish this in PySpark? ---edit--- Similar to this post, my current approach: df
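A sketch of one approach using coalesce with an array literal of zeros; the 300-element size follows the question and the column name comes from the example table:

    from pyspark.sql import functions as F

    zeros = F.array([F.lit(0.0) for _ in range(300)])   # 300-dimensional zero array
    df_filled = df.withColumn("Vector", F.coalesce(F.col("Vector"), zeros))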

Python spark Dataframe to Elasticsearch

一个人想着一个人 submitted on 2019-12-22 08:35:48

Question: I can't figure out how to write a dataframe to Elasticsearch using Python from Spark. I followed the steps from here. Here is my code:
# Read file
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .load('/vagrant/data/input/input.csv', schema=customSchema)
df.registerTempTable("data")
# KPIs
kpi1 = sqlContext.sql("SELECT * FROM data")
es_conf = {"es.nodes": "10.10.10.10", "es.port": "9200", "es.resource": "kpi"}
kpi1.rdd.saveAsNewAPIHadoopFile(
    path='-'
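If the elasticsearch-hadoop connector jar is on the classpath, an alternative to the RDD-based saveAsNewAPIHadoopFile route is the DataFrame writer; a sketch reusing the same (placeholder) node, port and resource as above:

    (kpi1.write
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", "10.10.10.10")
        .option("es.port", "9200")
        .option("es.resource", "kpi")
        .mode("append")
        .save())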