pyspark

pySpark withColumn with a function

隐身守侯 submitted on 2020-12-13 18:49:30
Question: I have a dataframe with two columns, account_id and email_address. I now want to add one more column, updated_email_address, by calling a function on email_address to get the updated value. Here is my code:

def update_email(email):
    print("== email to be updated: " + email)
    today = datetime.date.today()
    updated = substring(email, -8, 8) + str(today.strftime('%m')) + str(today.strftime('%d')) + "_updated"
    return updated

df.withColumn('updated_email_address', update_email(df
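The question is cut off above. The usual sticking point here is that withColumn expects a Column expression, not the result of calling a plain Python function on a Column. A minimal sketch of one common approach, wrapping the function in a UDF; the slicing email[-8:] is my assumption of what substring(email, -8, 8) was meant to do:

```python
import datetime
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def update_email(email):
    # Last 8 characters of the address plus today's month/day and a suffix
    today = datetime.date.today()
    return email[-8:] + today.strftime('%m') + today.strftime('%d') + "_updated"

# Wrap the plain Python function in a UDF so Spark can apply it row by row
update_email_udf = F.udf(update_email, StringType())

df = df.withColumn('updated_email_address', update_email_udf(F.col('email_address')))
```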

How to Determine The Partition Size in an Apache Spark Dataframe

浪尽此生 submitted on 2020-12-13 03:33:33
Question: I have been using an excellent answer to a question posted on SE here to determine the number of partitions, and the distribution of partitions across a dataframe: Need to Know Partitioning Details in Dataframe Spark. Can someone help me expand on the answers to determine the partition size of a dataframe? Thanks

Answer 1: Tuning the partition size is inevitably linked to tuning the number of partitions. There are at least 3 factors to consider in this scope:

Level of parallelism
A "good" high level of
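The answer is cut off above. Separately from it, a minimal sketch of how the partition count and the row distribution per partition are commonly inspected in PySpark (df stands for the dataframe in question):

```python
from pyspark.sql import functions as F

# Number of partitions the dataframe currently has
print(df.rdd.getNumPartitions())

# Row count per partition, via the built-in spark_partition_id() function;
# from these counts and an average row size you can estimate bytes per partition
(df.withColumn("partition_id", F.spark_partition_id())
   .groupBy("partition_id")
   .count()
   .orderBy("partition_id")
   .show())
```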

sc is not defined in SparkContext

假如想象 submitted on 2020-12-13 03:17:31
Question: My Spark package is spark-2.2.0-bin-hadoop2.7. I exported the Spark variables as

export SPARK_HOME=/home/harry/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH

I opened a Spark notebook with pyspark. I am able to load packages from Spark:

from pyspark import SparkContext, SQLContext
from pyspark.ml.regression import LinearRegression
print(SQLContext)

The output is <class 'pyspark.sql.context.SQLContext'>, but print(sc) fails with "sc is undefined". Can anyone please help me out?

Answer 1: In
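The answer is truncated above. As a hedged sketch of the usual fix: sc is only pre-created inside the pyspark shell, so in a notebook kernel that does not set it up you can create (or fetch) it explicitly:

```python
from pyspark import SparkContext, SQLContext

# sc is predefined only in the pyspark shell; in a plain notebook kernel
# it has to be created (or retrieved, if one already exists) manually.
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
print(sc)
```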

PySpark get related records from its array object values

巧了我就是萌 submitted on 2020-12-13 03:12:44
Question: I have a Spark dataframe that has an ID column and, along with other columns, an array column that contains the IDs of its related records as its value. An example dataframe would be:

ID  | NAME | RELATED_IDLIST
---------------------------
123 | mike | [345,456]
345 | alen | [789]
456 | sam  | [789,999]
789 | marc | [111]
555 | dan  | [333]

From the above, I need to append all the related child IDs to the array column of the parent ID. The resultant DF should be like ID | NAME | RELATED
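The expected output is cut off above, but the ask amounts to a transitive expansion of RELATED_IDLIST. A minimal sketch of one level of that expansion (explode the array, self-join to pick up the children's own related IDs, and re-collect), which could be repeated until the lists stop growing; column names follow the question:

```python
from pyspark.sql import functions as F

# One row per (parent ID, child ID) pair
edges = df.select("ID", F.explode("RELATED_IDLIST").alias("CHILD_ID"))

# Join the children's own related lists back in to pick up grandchildren
grandchildren = (
    edges.join(
        df.select(F.col("ID").alias("CHILD_ID"),
                  F.explode("RELATED_IDLIST").alias("GRANDCHILD_ID")),
        on="CHILD_ID")
    .select("ID", F.col("GRANDCHILD_ID").alias("CHILD_ID"))
)

# Merge direct children and grandchildren, then re-collect one list per parent
expanded = (
    edges.union(grandchildren)
         .distinct()
         .groupBy("ID")
         .agg(F.collect_list("CHILD_ID").alias("RELATED_IDLIST"))
)

result = df.select("ID", "NAME").join(expanded, on="ID", how="left")
```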

Pyspark / pyspark kernels not working in jupyter notebook

送分小仙女□ submitted on 2020-12-11 02:52:47
Question: Here are the installed kernels:

$ jupyter-kernelspec list
Available kernels:
  apache_toree_scala    /usr/local/share/jupyter/kernels/apache_toree_scala
  apache_toree_sql      /usr/local/share/jupyter/kernels/apache_toree_sql
  pyspark3kernel        /usr/local/share/jupyter/kernels/pyspark3kernel
  pysparkkernel         /usr/local/share/jupyter/kernels/pysparkkernel
  python3               /usr/local/share/jupyter/kernels/python3
  sparkkernel           /usr/local/share/jupyter/kernels/sparkkernel
  sparkrkernel          /usr/local/share/jupyter/kernels
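The rest of the question is cut off above. Not the answer to it, but a common workaround when dedicated PySpark kernels refuse to start: use the plain python3 kernel and bootstrap Spark with the findspark package (this assumes findspark is installed and SPARK_HOME is set or passed in):

```python
import findspark
findspark.init()  # or findspark.init("/path/to/spark") if SPARK_HOME is not set

from pyspark.sql import SparkSession

# Build a session from the plain python3 kernel instead of a dedicated Spark kernel
spark = SparkSession.builder.appName("notebook").getOrCreate()
print(spark.version)
```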

Item-item recommendation based on cosine similarity

☆樱花仙子☆ submitted on 2020-12-07 07:20:09
Question: As part of a recommender system that I am building, I want to implement item-item recommendation based on cosine similarity. Ideally, I would like to compute the cosine similarity on 1 million items, each represented by a DenseVector of 2048 features, in order to get the top-n most similar items to a given one. My problem is that the solutions I've come across perform poorly on my dataset. I've tried: Calculating the cosine similarity between all the rows of a dataframe in pyspark Using
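The list of attempted approaches is cut off above. At this scale (1M items x 2048 features) an exact all-pairs comparison is quadratic, so a common Spark ML approximation is to L2-normalize the vectors and use BucketedRandomProjectionLSH: on unit vectors, Euclidean distance ranks neighbours the same way cosine similarity does. A minimal sketch, assuming a dataframe named items with columns id and features:

```python
from pyspark.ml.feature import Normalizer, BucketedRandomProjectionLSH

# L2-normalize so Euclidean distance orders pairs the same way as cosine similarity
normalizer = Normalizer(inputCol="features", outputCol="norm_features", p=2.0)
normalized = normalizer.transform(items)

# Approximate nearest-neighbour index; bucketLength and numHashTables need tuning
lsh = BucketedRandomProjectionLSH(inputCol="norm_features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = lsh.fit(normalized)

# Top-n most similar items to one query vector
query = normalized.first()["norm_features"]
model.approxNearestNeighbors(normalized, query, numNearestNeighbors=10).show()
```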
