pyspark

pySpark withColumn with a function

隐身守侯 submitted on 2020-12-13 18:49:30
Question: I have a dataframe with two columns, account_id and email_address. I now want to add one more column, updated_email_address, by calling a function on email_address to get the updated value. Here is my code:

def update_email(email):
    print("== email to be updated: " + email)
    today = datetime.date.today()
    updated = substring(email, -8, 8) + str(today.strftime('%m')) + str(today.strftime('%d')) + "_updated"
    return updated

df.withColumn('updated_email_address', update_email(df
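The question is cut off above. The usual sticking point here is that withColumn expects a Column expression, not the result of calling a plain Python function on a Column. A minimal sketch of one common approach, wrapping the function in a UDF; the slicing email[-8:] is my assumption of what substring(email, -8, 8) was meant to do:

```python
import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def update_email(email):
    # Last 8 characters of the address plus today's month/day and a suffix
    today = datetime.date.today()
    return email[-8:] + today.strftime('%m') + today.strftime('%d') + "_updated"

# Wrap the plain Python function in a UDF so Spark can apply it row by row
update_email_udf = F.udf(update_email, StringType())

df = df.withColumn('updated_email_address', update_email_udf(F.col('email_address')))
```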

How to Determine The Partition Size in an Apache Spark Dataframe

浪尽此生 submitted on 2020-12-13 03:33:33
Question: I have been using an excellent answer to a question posted on SE here to determine the number of partitions, and the distribution of partitions across a dataframe: Need to Know Partitioning Details in Dataframe Spark. Can someone help me expand on the answers to determine the partition size of a dataframe? Thanks

Answer 1: Tuning the partition size is inevitably linked to tuning the number of partitions. There are at least 3 factors to consider in this scope:

Level of parallelism
A "good" high level of
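The answer is cut off above. Separately from it, a minimal sketch of how the partition count and the row distribution per partition are commonly inspected in PySpark (df stands for the dataframe in question):

```python
from pyspark.sql import functions as F

# Number of partitions the dataframe currently has
print(df.rdd.getNumPartitions())

# Row count per partition, via the built-in spark_partition_id() function;
# from these counts and an average row size you can estimate bytes per partition
(df.withColumn("partition_id", F.spark_partition_id())
   .groupBy("partition_id")
   .count()
   .orderBy("partition_id")
   .show())
```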

sc is not defined in SparkContext

假如想象 submitted on 2020-12-13 03:17:31
Question: My Spark package is spark-2.2.0-bin-hadoop2.7. I exported the Spark variables as

export SPARK_HOME=/home/harry/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH

I opened a Spark notebook with pyspark. I am able to load packages from Spark:

from pyspark import SparkContext, SQLContext
from pyspark.ml.regression import LinearRegression
print(SQLContext)

The output is <class 'pyspark.sql.context.SQLContext'>, but print(sc) fails with "sc is undefined". Can anyone please help me out?

Answer 1: In
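The answer is truncated above. As a hedged sketch of the usual fix: sc is only pre-created inside the pyspark shell, so in a notebook kernel that does not set it up you can create (or fetch) it explicitly:

```python
from pyspark import SparkContext, SQLContext

# sc is predefined only in the pyspark shell; in a plain notebook kernel
# it has to be created (or retrieved, if one already exists) manually.
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
print(sc)
```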

PySpark get related records from its array object values

巧了我就是萌 submitted on 2020-12-13 03:12:44
Question: I have a Spark dataframe that has an ID column and, along with other columns, an array column that contains the IDs of its related records as its value. An example dataframe would be:

ID  | NAME | RELATED_IDLIST
---------------------------
123 | mike | [345,456]
345 | alen | [789]
456 | sam  | [789,999]
789 | marc | [111]
555 | dan  | [333]

From the above, I need to append all the related child IDs to the array column of the parent ID. The resultant DF should be like ID | NAME | RELATED
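The expected output is cut off above, but the ask amounts to a transitive expansion of RELATED_IDLIST. A minimal sketch of one level of that expansion (explode the array, self-join to pick up the children's own related IDs, and re-collect), which could be repeated until the lists stop growing; column names follow the question:

```python
from pyspark.sql import functions as F

# One row per (parent ID, child ID) pair
edges = df.select("ID", F.explode("RELATED_IDLIST").alias("CHILD_ID"))

# Join the children's own related lists back in to pick up grandchildren
grandchildren = (
    edges.join(
        df.select(F.col("ID").alias("CHILD_ID"),
                  F.explode("RELATED_IDLIST").alias("GRANDCHILD_ID")),
        on="CHILD_ID")
    .select("ID", F.col("GRANDCHILD_ID").alias("CHILD_ID"))
)

# Merge direct children and grandchildren, then re-collect one list per parent
expanded = (
    edges.union(grandchildren)
         .distinct()
         .groupBy("ID")
         .agg(F.collect_list("CHILD_ID").alias("RELATED_IDLIST"))
)

result = df.select("ID", "NAME").join(expanded, on="ID", how="left")
```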

Pyspark / pyspark kernels not working in jupyter notebook

送分小仙女□ submitted on 2020-12-11 02:52:47
Question: Here are the installed kernels:

$ jupyter-kernelspec list
Available kernels:
  apache_toree_scala    /usr/local/share/jupyter/kernels/apache_toree_scala
  apache_toree_sql      /usr/local/share/jupyter/kernels/apache_toree_sql
  pyspark3kernel        /usr/local/share/jupyter/kernels/pyspark3kernel
  pysparkkernel         /usr/local/share/jupyter/kernels/pysparkkernel
  python3               /usr/local/share/jupyter/kernels/python3
  sparkkernel           /usr/local/share/jupyter/kernels/sparkkernel
  sparkrkernel          /usr/local/share/jupyter/kernels
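The rest of the question is cut off above. Not the answer to it, but a common workaround when dedicated PySpark kernels refuse to start: use the plain python3 kernel and bootstrap Spark with the findspark package (this assumes findspark is installed and SPARK_HOME is set or passed in):

```python
import findspark
findspark.init()  # or findspark.init("/path/to/spark") if SPARK_HOME is not set

from pyspark.sql import SparkSession

# Build a session from the plain python3 kernel instead of a dedicated Spark kernel
spark = SparkSession.builder.appName("notebook").getOrCreate()
print(spark.version)
```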

Item-item recommendation based on cosine similarity

☆樱花仙子☆ submitted on 2020-12-07 07:20:09
Question: As part of a recommender system that I am building, I want to implement item-item recommendation based on cosine similarity. Ideally, I would like to compute the cosine similarity on 1 million items, each represented by a DenseVector of 2048 features, in order to get the top-n most similar items to a given one. My problem is that the solutions I've come across perform poorly on my dataset. I've tried: Calculating the cosine similarity between all the rows of a dataframe in pyspark Using
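The list of attempted approaches is cut off above. At this scale (1M items x 2048 features) an exact all-pairs comparison is quadratic, so a common Spark ML approximation is to L2-normalize the vectors and use BucketedRandomProjectionLSH: on unit vectors, Euclidean distance ranks neighbours the same way cosine similarity does. A minimal sketch, assuming a dataframe named items with columns id and features:

```python
from pyspark.ml.feature import Normalizer, BucketedRandomProjectionLSH

# L2-normalize so Euclidean distance orders pairs the same way as cosine similarity
normalizer = Normalizer(inputCol="features", outputCol="norm_features", p=2.0)
normalized = normalizer.transform(items)

# Approximate nearest-neighbour index; bucketLength and numHashTables need tuning
lsh = BucketedRandomProjectionLSH(inputCol="norm_features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = lsh.fit(normalized)

# Top-n most similar items to one query vector
query = normalized.first()["norm_features"]
model.approxNearestNeighbors(normalized, query, numNearestNeighbors=10).show()
```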
