pyspark

Unable to import SparkContext

风流意气都作罢 submitted on 2020-05-14 02:24:51
Question: I'm working on CentOS. I've set up $SPARK_HOME and added its bin directory to $PATH, and I can run pyspark from anywhere. But when I create a Python file that uses the statement from pyspark import SparkConf, SparkContext, it throws the following error:

    python pysparktask.py
    Traceback (most recent call last):
      File "pysparktask.py", line 1, in <module>
        from pyspark import SparkConf, SparkContext
    ModuleNotFoundError: No module named 'pyspark'

I tried to install it again using pip. pip install
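One hedged sketch of a fix, assuming the optional findspark helper package is installed (pip install findspark) and $SPARK_HOME points at the Spark distribution, is to add the Spark Python libraries to sys.path before importing pyspark:

    # Minimal sketch, assuming findspark is installed and $SPARK_HOME is set.
    import findspark
    findspark.init()  # adds $SPARK_HOME/python and its bundled py4j to sys.path

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("pysparktask")
    sc = SparkContext.getOrCreate(conf=conf)
    print(sc.version)

Installing the pyspark package into the same interpreter that runs the script (pip install pyspark) is another way to make the import resolve without findspark.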

How to initialize a master in SparkConf in order to run distributed on a k8s cluster?

99封情书 submitted on 2020-05-14 00:29:22
Question: I have deployed a k8s cluster with 3 nodes and deployed HDFS. I've written a simple PySpark script and want to deploy it on the k8s cluster, but I don't know how to initialize the Spark context correctly: what do I need to pass as the master to SparkConf().setMaster()? (When I set the master to k8s://https://172.20.234.174:6443 I get errors.) The command I'm using to deploy on k8s:

    bin/spark-submit \
      --name spark_k8s_hello_world_0 \
      --master k8s://https://172.20.234.174:6443 \
      --deploy-mode cluster \
      --conf spark
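As a hedged note: with --deploy-mode cluster the master URL is normally taken from the spark-submit command line, so the script itself can usually just call SparkSession.builder.getOrCreate() without setting a master at all. Setting it in code looks roughly like the sketch below, which mainly applies when the driver runs outside the cluster; the API-server address is the one from the question and the container image name is a placeholder.

    # Minimal sketch, assuming Spark 2.3+ with Kubernetes support.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("spark_k8s_hello_world_0")
        .master("k8s://https://172.20.234.174:6443")   # "k8s://" prefix + API-server URL
        .config("spark.kubernetes.container.image",
                "my-registry/spark-py:latest")         # hypothetical image name
        .config("spark.executor.instances", "2")
        .getOrCreate()
    )
    print(spark.range(10).count())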

PySpark Write Parquet Binary Column with Stats (signed-min-max.enabled)

房东的猫 submitted on 2020-05-13 14:14:33
Question: I found this apache-parquet ticket, https://issues.apache.org/jira/browse/PARQUET-686, which is marked as resolved for parquet-mr 1.8.2. The feature I want is the calculated min/max in the Parquet metadata for a (string or BINARY) column. Referencing this is an email, https://lists.apache.org/thread.html/%3CCANPCBc2UPm+oZFfP9oT8gPKh_v0_BF0jVEuf=Q3d-5=ugxSFbQ@mail.gmail.com%3E, which uses Scala instead of PySpark as an example:

    Configuration conf = new Configuration();
    + conf.set("parquet
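A hedged PySpark sketch of setting the property named in PARQUET-686 on the Hadoop configuration that Spark hands to the Parquet writer; it relies on the internal _jsc handle and assumes the bundled parquet-mr version actually honours the flag:

    # Minimal sketch, assuming the underlying parquet-mr supports
    # parquet.strings.signed-min-max.enabled (see PARQUET-686).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-binary-stats").getOrCreate()
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "parquet.strings.signed-min-max.enabled", "true"
    )

    df = spark.createDataFrame([("aaa",), ("zzz",)], ["value"])
    df.write.mode("overwrite").parquet("/tmp/binary_stats_example")  # hypothetical output path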

Any way to access methods from individual stages in PySpark PipelineModel?

て烟熏妆下的殇ゞ submitted on 2020-05-13 05:15:04
Question: I've created a PipelineModel for doing LDA in Spark 2.0 (via the PySpark API):

    def create_lda_pipeline(minTokenLength=1, minDF=1, minTF=1, numTopics=10, seed=42, pattern='[\W]+'):
        """
        Create a pipeline for running an LDA model on a corpus. This function does not
        need data and will not actually do any fitting until invoked by the caller.
        Args:
            minTokenLength:
            minDF: minimum number of documents a word is present in across the corpus
            minTF: minimum number of times a word is found in a document
            numTopics:
            seed:
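A fitted PipelineModel keeps its fitted stages in the stages attribute, so methods of an individual stage can be reached by indexing into that list. A hedged sketch, assuming the pipeline built by create_lda_pipeline() ends with the LDA estimator and that corpus_df is a DataFrame with the expected text column:

    pipeline = create_lda_pipeline(numTopics=10)
    pipeline_model = pipeline.fit(corpus_df)

    lda_model = pipeline_model.stages[-1]          # the fitted LDAModel, if LDA is the last stage
    lda_model.describeTopics(maxTermsPerTopic=10).show(truncate=False)

    # If an earlier stage is a CountVectorizer, its fitted vocabulary is reachable the same way
    # (the stage index here is hypothetical):
    # vocab = pipeline_model.stages[1].vocabulary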

Pyspark Data Frame: Access to a Column

别来无恙 submitted on 2020-05-09 15:12:42
Question: I hope every one of you is OK and that Covid-19 is not affecting your life too much. I am struggling with some PySpark code; in particular, I'd like to call a function on a col object, which is not iterable.

    from pyspark.sql.functions import col, lower, regexp_replace, split
    from googletrans import Translator

    def clean_text(c):
        c = lower(c)
        c = regexp_replace(c, r"^rt ", "")
        c = regexp_replace(c, r"(https?\://)\S+", "")
        c = regexp_replace(c, "[^a-zA-Z0-9\\s]", "")  # removePunctuation
        c = regexp
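As a hedged note: lower and regexp_replace return Column expressions, so clean_text composes fine, but a plain Python library such as googletrans cannot be called on a Column; it typically has to be wrapped in a UDF so it receives ordinary strings row by row. A sketch, assuming a hypothetical text column named tweet:

    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType
    from googletrans import Translator

    @udf(returnType=StringType())
    def translate_to_en(text):
        # Executed per row on the workers; text is a plain Python string here.
        if text is None:
            return None
        return Translator().translate(text, dest="en").text

    df = df.withColumn("tweet_en", translate_to_en(clean_text(col("tweet"))))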

Pyspark - Calculate RMSE between actuals and predictions for a groupby - AssertionError: all exprs should be Column

心不动则不痛 submitted on 2020-05-09 07:10:28
Question: I have a function that calculates RMSE for the preds and actuals columns of an entire dataframe:

    def calculate_rmse(df, actual_column, prediction_column):
        RMSE = F.udf(lambda x, y: ((x - y) ** 2))
        df = df.withColumn(
            "RMSE", RMSE(F.col(actual_column), F.col(prediction_column))
        )
        rmse = df.select(F.avg("RMSE") ** 0.5).collect()
        rmse = rmse[0]["POWER(avg(RMSE), 0.5)"]
        return rmse

    test = calculate_rmse(my_df, 'actuals', 'preds')
    3690.4535

I would like to apply this to a groupby statement, but when I do,
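A per-group RMSE can usually be computed without the Python UDF: write the squared error as a Column expression and aggregate it inside agg, which also sidesteps the "all exprs should be Column" assertion. A hedged sketch, assuming a hypothetical grouping column named group_id and the same actuals/preds columns as above:

    import pyspark.sql.functions as F

    rmse_per_group = (
        my_df.groupBy("group_id")
             .agg(F.sqrt(F.avg((F.col("actuals") - F.col("preds")) ** 2)).alias("rmse"))
    )
    rmse_per_group.show()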

check if a row value is null in spark dataframe

泄露秘密 submitted on 2020-05-08 05:36:17
Question: I am using a custom function in PySpark to check a condition for each row of a Spark dataframe and add columns if the condition is true. The code is as below:

    from pyspark.sql.types import *
    from pyspark.sql.functions import *
    from pyspark.sql import Row

    def customFunction(row):
        if (row.prod.isNull()):
            prod_1 = "new prod"
            return (row + Row(prod_1))
        else:
            prod_1 = row.prod
            return (row + Row(prod_1))

    sdf = sdf_temp.map(customFunction)
    sdf.show()

I get the error mentioned below: AttributeError:
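As a hedged note, fields of a Row are plain Python values and have no isNull method, which is the usual source of this AttributeError; the null check is normally expressed as a Column expression with when/otherwise instead of mapping over rows. A sketch, assuming sdf_temp has a prod column as in the question:

    from pyspark.sql import functions as F

    sdf = sdf_temp.withColumn(
        "prod_1",
        F.when(F.col("prod").isNull(), F.lit("new prod")).otherwise(F.col("prod"))
    )
    sdf.show()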

How to get the correlation matrix of a pyspark data frame?

余生长醉 submitted on 2020-05-07 19:06:52
Question: I have a big PySpark data frame and I want to get its correlation matrix. I know how to get it with a pandas data frame, but my data is too big to convert to pandas, so I need to get the result with a PySpark data frame. I searched other similar questions, but the answers don't work for me. Can anybody help me? Thanks! Data example: data example

Answer 1: Welcome to SO! Example data: I prepared some dummy data for easier replication (perhaps next time you may supply some easy-to-copy data, too ;-)): data =
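One hedged way to get a correlation matrix directly from a PySpark data frame is to assemble the numeric columns into a single vector column and pass it to pyspark.ml.stat.Correlation; the column names below are placeholders.

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation

    # Minimal sketch, assuming df has numeric columns col1..col3.
    numeric_cols = ["col1", "col2", "col3"]
    assembler = VectorAssembler(inputCols=numeric_cols, outputCol="features")
    vector_df = assembler.transform(df).select("features")

    corr_matrix = Correlation.corr(vector_df, "features").head()[0]
    print(corr_matrix.toArray())  # dense matrix of pairwise Pearson correlations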