pyspark

how to retrieve a column from a pyspark dataframe and insert it as a new column within an existing pyspark dataframe?

[亡魂溺海] Submitted on 2019-12-31 03:56:09
Problem: I've got a PySpark dataframe df1 like this:

    +--------+
    |index   |
    +--------+
    |     121|
    |     122|
    |     123|
    |     124|
    |     125|
    |     121|
    |     121|
    |     126|
    |     127|
    |     120|
    |     121|
    |     121|
    |     121|
    |     127|
    |     129|
    |     132|
    |     122|
    |     121|
    |     121|
    |     121|
    +--------+

I want to retrieve the index column from df1 and insert it into the existing dataframe df2 (which has the same length). df2:

    +--------------------+--------------------+
    |               fact1|               fact2|
    +--------------------+--------------------+
    |  2.4899928731985597|-0
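
The excerpt stops before any answer. One common approach (an assumption here, not taken from the original post) is to give both dataframes a positional row id and join on it. A minimal sketch with hypothetical sample data:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical stand-ins for the df1/df2 from the question.
    df1 = spark.createDataFrame([(121,), (122,), (123,)], ["index"])
    df2 = spark.createDataFrame([(2.48, -0.5), (1.1, 0.3), (0.7, 2.2)], ["fact1", "fact2"])

    # Give each row a positional id; monotonically_increasing_id() is not
    # consecutive, so rank it with row_number() to get 1, 2, 3, ...
    w = Window.orderBy(F.monotonically_increasing_id())
    df1_idx = df1.withColumn("row_id", F.row_number().over(w))
    df2_idx = df2.withColumn("row_id", F.row_number().over(w))

    # Join on the positional id and drop it, leaving df2 plus the new column.
    result = df2_idx.join(df1_idx, on="row_id", how="inner").drop("row_id")
    result.show()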

Invalid status code '400' from .. error payload: "requirement failed: Session isn't active

て烟熏妆下的殇ゞ Submitted on 2019-12-31 03:54:12
Problem: I am running a PySpark script in a Jupyter notebook to write a dataframe to a CSV file, as below:

    df.coalesce(1).write.csv('Data1.csv', header='true')

After an hour of runtime I get the error below.

Error: Invalid status code from http://..... session isn't active.

My configuration looks like this:

    spark.conf.set("spark.dynamicAllocation.enabled","true")
    spark.conf.set("shuffle.service.enabled","true")
    spark.conf.set("spark.dynamicAllocation.minExecutors",6)
    spark.conf.set("spark.executor.heartbeatInterval",
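
The excerpt is cut off before any answer. A frequent cause of this "session isn't active" message on Livy-backed notebooks (an assumption, since the environment is not stated) is the interactive session timing out, and several of the properties above cannot be changed with spark.conf.set() once the session exists; note also that the shuffle-service key is spark.shuffle.service.enabled, not shuffle.service.enabled. A minimal sketch of setting these at session creation instead:

    # A minimal sketch, assuming the error comes from a timed-out interactive session
    # (e.g. Livy on EMR). Timeout-related settings are best supplied before the
    # session starts rather than via spark.conf.set() afterwards.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("csv-writer")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.shuffle.service.enabled", "true")   # correct property name
        .config("spark.dynamicAllocation.minExecutors", "6")
        .config("spark.executor.heartbeatInterval", "60s")
        .config("spark.network.timeout", "600s")
        .getOrCreate()
    )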

Pyspark Merge WrappedArrays Within a Dataframe

只谈情不闲聊 Submitted on 2019-12-31 03:06:05
Problem: The current PySpark dataframe has this structure (col2 is a list of WrappedArrays):

    +---+--------------------------------------------------+
    |id |col2                                              |
    +---+--------------------------------------------------+
    |a  |[WrappedArray(code2), WrappedArray(code1, code3)] |
    +---+--------------------------------------------------+
    |b  |[WrappedArray(code5), WrappedArray(code6, code8)] |
    +---+-------------------------------------------------
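
A minimal sketch of one way to merge the nested arrays (not taken from the original post): Spark 2.4+ ships a built-in flatten() function, and a small UDF covers older versions. The sample data below is hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("a", [["code2"], ["code1", "code3"]]),
         ("b", [["code5"], ["code6", "code8"]])],
        ["id", "col2"],
    )

    # Spark 2.4+: built-in flatten() merges the inner arrays into one flat array.
    merged = df.withColumn("col2_merged", F.flatten("col2"))

    # Pre-2.4 fallback: a small UDF that concatenates the inner lists.
    flatten_udf = F.udf(lambda arrs: [x for arr in arrs for x in arr],
                        ArrayType(StringType()))
    merged_udf = df.withColumn("col2_merged", flatten_udf("col2"))

    merged.show(truncate=False)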

Spark Python: How to calculate Jaccard Similarity between each line within an RDD?

时光怂恿深爱的人放手 Submitted on 2019-12-31 02:42:27
Problem: I have a table of around 50k distinct rows and 2 columns. You can think of each row as a movie and the columns as attributes of that movie: "ID", the id of the movie, and "Tags", some content tags of the movie in the form of a list of strings. The data looks something like this:

    movie_1, ['romantic', 'comedy', 'English']
    movie_2, ['action', 'kongfu', 'Chinese']

My goal is to first calculate the Jaccard similarity between each pair of movies based on their corresponding tags, and once that's
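
The excerpt ends mid-sentence; a minimal sketch of pairwise Jaccard similarity with RDD.cartesian() follows (an illustration only, not from the post: it is O(n^2) in the number of rows and would need blocking or MinHash-style approximation to scale to 50k movies).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Hypothetical sample of (movie id, tag list) pairs.
    movies = sc.parallelize([
        ("movie_1", ["romantic", "comedy", "English"]),
        ("movie_2", ["action", "kongfu", "Chinese"]),
        ("movie_3", ["romantic", "action", "English"]),
    ])

    def jaccard(a, b):
        # |intersection| / |union| of the two tag sets.
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    pairs = (
        movies.cartesian(movies)
              .filter(lambda p: p[0][0] < p[1][0])   # keep each unordered pair once
              .map(lambda p: (p[0][0], p[1][0], jaccard(p[0][1], p[1][1])))
    )
    print(pairs.collect())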

PySpark 2.1: Importing module with UDFs breaks Hive connectivity

那年仲夏 Submitted on 2019-12-31 02:37:49
Problem: I'm currently working with Spark 2.1 and have a main script that calls a helper module containing all my transformation methods. In other words:

    main.py
    helper.py

At the top of my helper.py file I have several custom UDFs defined in the following manner:

    def reformat(s):
        return reformat_logic(s)
    reformat_udf = udf(reformat, StringType())

Before I broke all the UDFs out into the helper file, I was able to connect to my Hive metastore through my SparkSession object using spark
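
The excerpt stops before the resolution. In Spark 2.1, calling udf() at module import time can eagerly create a SparkSession, which pre-empts the Hive-enabled session the main script builds later; deferring UDF construction until that session exists is one common workaround. The layout below is a hypothetical sketch, not the poster's code.

    # helper.py (sketch)
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def reformat(s):
        # Placeholder for the poster's reformat_logic(s).
        return s.strip().lower() if s else s

    def build_udfs():
        """Create UDFs lazily, after the Hive-enabled SparkSession is up."""
        return {"reformat_udf": udf(reformat, StringType())}

    # main.py (sketch)
    # from pyspark.sql import SparkSession
    # import helper
    # spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    # udfs = helper.build_udfs()
    # df = spark.table("my_table").withColumn("col2", udfs["reformat_udf"]("col1"))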

How to Distribute Multiprocessing Pool to Spark Workers

时光总嘲笑我的痴心妄想 Submitted on 2019-12-31 02:34:06
Problem: I am trying to use multiprocessing to read 100 CSV files in parallel (and subsequently process them separately in parallel). Here is my code, running in Jupyter hosted on my EMR master node in AWS. (Eventually it will be 100k CSV files, hence the need for distributed reading.)

    import findspark
    import boto3
    from multiprocessing.pool import ThreadPool
    import logging
    import sys
    findspark.init()
    from pyspark import SparkContext, SparkConf, sql
    conf = SparkConf().setMaster("local[*]")
    conf.set(
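
The code is cut off, but an alternative worth noting (not from the original post): Spark's own reader already parallelizes across files, so passing a list of paths or a glob to spark.read.csv usually replaces the multiprocessing pool entirely. A minimal sketch with hypothetical S3 paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bulk-csv-read").getOrCreate()

    # Hypothetical bucket and file names; Spark distributes the read across executors.
    paths = [f"s3://my-bucket/data/file_{i}.csv" for i in range(100)]
    df = spark.read.csv(paths, header=True, inferSchema=True)

    # A glob pattern works as well and scales to very large file counts:
    # df = spark.read.csv("s3://my-bucket/data/*.csv", header=True)

    print(df.count())   # trivial action to materialize the read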

PySpark: org.apache.spark.sql.AnalysisException: Attribute name … contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it [duplicate]

烂漫一生 Submitted on 2019-12-31 01:55:11
Problem: This question already has answers here: Spark Dataframe validating column names for parquet writes (scala) (4 answers). Closed last year.

I'm trying to load Parquet data into PySpark, where a column has a space in its name:

    df = spark.read.parquet('my_parquet_dump')
    df.select(df['Foo Bar'].alias('foobar'))

Even though I have aliased the column, I'm still getting this error, with the error propagating from the JVM side of PySpark. I've attached the stack trace below. Is there a way I can load
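
The question is cut off before the linked answers; a commonly suggested workaround (stated here as an assumption, since those answers are not reproduced) is to rename the offending columns immediately after the read, before any select or write touches them:

    import re
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet('my_parquet_dump')   # path from the question

    # Replace every character Parquet rejects (" ,;{}()\n\t=") with an underscore.
    cleaned = df.toDF(*[re.sub(r'[ ,;{}()\n\t=]', '_', c) for c in df.columns])

    cleaned.select('Foo_Bar').show()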

pyspark: drop columns that have same values in all rows

那年仲夏 Submitted on 2019-12-31 01:45:19
Problem: Related question: How to drop columns which have same values in all rows via pandas or spark dataframe? So I have a PySpark dataframe, and I want to drop the columns where all values are the same in all rows while keeping the other columns intact. However, the answers in the above question are only for pandas. Is there a solution for a PySpark dataframe? Thanks.

Answer 1: You can apply the countDistinct() aggregation function on each column to get the count of distinct values per column. A column with count=1
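
A minimal sketch of the countDistinct() approach the answer describes, using a hypothetical sample dataframe:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "x", 10), (2, "x", 20), (3, "x", 30)],
        ["id", "constant_col", "value"],
    )

    # One aggregation pass: number of distinct values per column.
    counts = df.agg(*[F.countDistinct(c).alias(c) for c in df.columns]).first().asDict()

    # Columns with exactly one distinct value hold the same value in every row.
    to_drop = [c for c, n in counts.items() if n == 1]

    result = df.drop(*to_drop)   # only 'constant_col' is dropped here
    result.show()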

Spark ML Pipeline Causes java.lang.Exception: failed to compile … Code … grows beyond 64 KB

耗尽温柔 Submitted on 2019-12-30 18:55:34
Problem: Using Spark 2.0, I am trying to run a simple VectorAssembler in a PySpark ML pipeline, like so:

    feature_assembler = VectorAssembler(inputCols=['category_count', 'name_count'],
                                        outputCol="features")
    pipeline = Pipeline(stages=[feature_assembler])
    model = pipeline.fit(df_train)
    model_output = model.transform(df_train)

When I try to look at the output using

    model_output.select("features").show(1)

I get the error

    Py4JJavaError Traceback (most recent call last)
    <ipython-input-95-7a3e3d4f281c> in
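
The excerpt ends inside the stack trace. A workaround often suggested for the 64 KB generated-code limit (an assumption here, not taken from the post) is to disable whole-stage code generation; checkpointing the dataframe to cut a long lineage is another option not shown below. A minimal sketch with hypothetical training data:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler

    spark = (
        SparkSession.builder
        .config("spark.sql.codegen.wholeStage", "false")   # avoid one huge generated method
        .getOrCreate()
    )

    # Hypothetical stand-in for the poster's df_train.
    df_train = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)],
                                     ["category_count", "name_count"])

    feature_assembler = VectorAssembler(inputCols=["category_count", "name_count"],
                                        outputCol="features")
    pipeline = Pipeline(stages=[feature_assembler])
    model = pipeline.fit(df_train)
    model.transform(df_train).select("features").show(1)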