pyspark

how to retrieve a column from a pyspark dataframe and insert it as a new column within an existing pyspark dataframe?

[亡魂溺海] Submitted on 2019-12-31 03:56:09
Problem: I've got a PySpark dataframe df1 like this:

    +--------+
    |index   |
    +--------+
    |     121|
    |     122|
    |     123|
    |     124|
    |     125|
    |     121|
    |     121|
    |     126|
    |     127|
    |     120|
    |     121|
    |     121|
    |     121|
    |     127|
    |     129|
    |     132|
    |     122|
    |     121|
    |     121|
    |     121|
    +--------+

I want to retrieve the index column from df1 and insert it into the existing dataframe df2 (which has the same length). df2:

    +--------------------+--------------------+
    |               fact1|               fact2|
    +--------------------+--------------------+
    |  2.4899928731985597|-0
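
The excerpt stops before any answer. One common approach (an assumption here, not taken from the original post) is to give both dataframes a positional row id and join on it. A minimal sketch with hypothetical sample data:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical stand-ins for the df1/df2 from the question.
    df1 = spark.createDataFrame([(121,), (122,), (123,)], ["index"])
    df2 = spark.createDataFrame([(2.48, -0.5), (1.1, 0.3), (0.7, 2.2)], ["fact1", "fact2"])

    # Give each row a positional id; monotonically_increasing_id() is not
    # consecutive, so rank it with row_number() to get 1, 2, 3, ...
    w = Window.orderBy(F.monotonically_increasing_id())
    df1_idx = df1.withColumn("row_id", F.row_number().over(w))
    df2_idx = df2.withColumn("row_id", F.row_number().over(w))

    # Join on the positional id and drop it, leaving df2 plus the new column.
    result = df2_idx.join(df1_idx, on="row_id", how="inner").drop("row_id")
    result.show()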

Invalid status code '400' from .. error payload: "requirement failed: Session isn't active

て烟熏妆下的殇ゞ Submitted on 2019-12-31 03:54:12
Problem: I am running a PySpark script in a Jupyter notebook to write a dataframe to a CSV file, as below:

    df.coalesce(1).write.csv('Data1.csv', header='true')

After an hour of runtime I get the error below.

Error: Invalid status code from http://..... session isn't active.

My configuration looks like this:

    spark.conf.set("spark.dynamicAllocation.enabled","true")
    spark.conf.set("shuffle.service.enabled","true")
    spark.conf.set("spark.dynamicAllocation.minExecutors",6)
    spark.conf.set("spark.executor.heartbeatInterval",
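
The excerpt is cut off before any answer. A frequent cause of this "session isn't active" message on Livy-backed notebooks (an assumption, since the environment is not stated) is the interactive session timing out, and several of the properties above cannot be changed with spark.conf.set() once the session exists; note also that the shuffle-service key is spark.shuffle.service.enabled, not shuffle.service.enabled. A minimal sketch of setting these at session creation instead:

    # A minimal sketch, assuming the error comes from a timed-out interactive session
    # (e.g. Livy on EMR). Timeout-related settings are best supplied before the
    # session starts rather than via spark.conf.set() afterwards.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("csv-writer")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.shuffle.service.enabled", "true")   # correct property name
        .config("spark.dynamicAllocation.minExecutors", "6")
        .config("spark.executor.heartbeatInterval", "60s")
        .config("spark.network.timeout", "600s")
        .getOrCreate()
    )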

Pyspark Merge WrappedArrays Within a Dataframe

只谈情不闲聊 Submitted on 2019-12-31 03:06:05
Problem: The current PySpark dataframe has this structure (col2 is a list of WrappedArrays):

    +---+--------------------------------------------------+
    |id |col2                                              |
    +---+--------------------------------------------------+
    |a  |[WrappedArray(code2), WrappedArray(code1, code3)] |
    +---+--------------------------------------------------+
    |b  |[WrappedArray(code5), WrappedArray(code6, code8)] |
    +---+-------------------------------------------------
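
A minimal sketch of one way to merge the nested arrays (not taken from the original post): Spark 2.4+ ships a built-in flatten() function, and a small UDF covers older versions. The sample data below is hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("a", [["code2"], ["code1", "code3"]]),
         ("b", [["code5"], ["code6", "code8"]])],
        ["id", "col2"],
    )

    # Spark 2.4+: built-in flatten() merges the inner arrays into one flat array.
    merged = df.withColumn("col2_merged", F.flatten("col2"))

    # Pre-2.4 fallback: a small UDF that concatenates the inner lists.
    flatten_udf = F.udf(lambda arrs: [x for arr in arrs for x in arr],
                        ArrayType(StringType()))
    merged_udf = df.withColumn("col2_merged", flatten_udf("col2"))

    merged.show(truncate=False)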

Spark Python: How to calculate Jaccard Similarity between each line within an RDD?

时光怂恿深爱的人放手 Submitted on 2019-12-31 02:42:27
Problem: I have a table of around 50k distinct rows and 2 columns. You can think of each row as a movie and the columns as attributes of that movie: "ID", the id of the movie, and "Tags", some content tags of the movie in the form of a list of strings. The data looks something like this:

    movie_1, ['romantic', 'comedy', 'English']
    movie_2, ['action', 'kongfu', 'Chinese']

My goal is to first calculate the Jaccard similarity between each pair of movies based on their corresponding tags, and once that's
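
The excerpt ends mid-sentence; a minimal sketch of pairwise Jaccard similarity with RDD.cartesian() follows (an illustration only, not from the post: it is O(n^2) in the number of rows and would need blocking or MinHash-style approximation to scale to 50k movies).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Hypothetical sample of (movie id, tag list) pairs.
    movies = sc.parallelize([
        ("movie_1", ["romantic", "comedy", "English"]),
        ("movie_2", ["action", "kongfu", "Chinese"]),
        ("movie_3", ["romantic", "action", "English"]),
    ])

    def jaccard(a, b):
        # |intersection| / |union| of the two tag sets.
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    pairs = (
        movies.cartesian(movies)
              .filter(lambda p: p[0][0] < p[1][0])   # keep each unordered pair once
              .map(lambda p: (p[0][0], p[1][0], jaccard(p[0][1], p[1][1])))
    )
    print(pairs.collect())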

PySpark 2.1: Importing module with UDFs breaks Hive connectivity

那年仲夏 Submitted on 2019-12-31 02:37:49
Problem: I'm currently working with Spark 2.1 and have a main script that calls a helper module containing all my transformation methods. In other words:

    main.py
    helper.py

At the top of my helper.py file I have several custom UDFs defined in the following manner:

    def reformat(s):
        return reformat_logic(s)
    reformat_udf = udf(reformat, StringType())

Before I broke all the UDFs out into the helper file, I was able to connect to my Hive metastore through my SparkSession object using spark
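
The excerpt stops before the resolution. In Spark 2.1, calling udf() at module import time can eagerly create a SparkSession, which pre-empts the Hive-enabled session the main script builds later; deferring UDF construction until that session exists is one common workaround. The layout below is a hypothetical sketch, not the poster's code.

    # helper.py (sketch)
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def reformat(s):
        # Placeholder for the poster's reformat_logic(s).
        return s.strip().lower() if s else s

    def build_udfs():
        """Create UDFs lazily, after the Hive-enabled SparkSession is up."""
        return {"reformat_udf": udf(reformat, StringType())}

    # main.py (sketch)
    # from pyspark.sql import SparkSession
    # import helper
    # spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    # udfs = helper.build_udfs()
    # df = spark.table("my_table").withColumn("col2", udfs["reformat_udf"]("col1"))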

How to Distribute Multiprocessing Pool to Spark Workers

时光总嘲笑我的痴心妄想 Submitted on 2019-12-31 02:34:06
Problem: I am trying to use multiprocessing to read 100 CSV files in parallel (and subsequently process them separately in parallel). Here is my code, running in Jupyter hosted on my EMR master node in AWS. (Eventually it will be 100k CSV files, hence the need for distributed reading.)

    import findspark
    import boto3
    from multiprocessing.pool import ThreadPool
    import logging
    import sys
    findspark.init()
    from pyspark import SparkContext, SparkConf, sql
    conf = SparkConf().setMaster("local[*]")
    conf.set(
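
The code is cut off, but an alternative worth noting (not from the original post): Spark's own reader already parallelizes across files, so passing a list of paths or a glob to spark.read.csv usually replaces the multiprocessing pool entirely. A minimal sketch with hypothetical S3 paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bulk-csv-read").getOrCreate()

    # Hypothetical bucket and file names; Spark distributes the read across executors.
    paths = [f"s3://my-bucket/data/file_{i}.csv" for i in range(100)]
    df = spark.read.csv(paths, header=True, inferSchema=True)

    # A glob pattern works as well and scales to very large file counts:
    # df = spark.read.csv("s3://my-bucket/data/*.csv", header=True)

    print(df.count())   # trivial action to materialize the read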

PySpark: org.apache.spark.sql.AnalysisException: Attribute name … contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it [duplicate]

烂漫一生 Submitted on 2019-12-31 01:55:11
Problem: This question already has answers here: Spark Dataframe validating column names for parquet writes (scala) (4 answers). Closed last year.

I'm trying to load Parquet data into PySpark, where a column has a space in its name:

    df = spark.read.parquet('my_parquet_dump')
    df.select(df['Foo Bar'].alias('foobar'))

Even though I have aliased the column, I'm still getting this error, with the error propagating from the JVM side of PySpark. I've attached the stack trace below. Is there a way I can load
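
The question is cut off before the linked answers; a commonly suggested workaround (stated here as an assumption, since those answers are not reproduced) is to rename the offending columns immediately after the read, before any select or write touches them:

    import re
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet('my_parquet_dump')   # path from the question

    # Replace every character Parquet rejects (" ,;{}()\n\t=") with an underscore.
    cleaned = df.toDF(*[re.sub(r'[ ,;{}()\n\t=]', '_', c) for c in df.columns])

    cleaned.select('Foo_Bar').show()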

pyspark: drop columns that have same values in all rows

那年仲夏 Submitted on 2019-12-31 01:45:19
Problem: Related question: How to drop columns which have same values in all rows via pandas or spark dataframe? So I have a PySpark dataframe, and I want to drop the columns where all values are the same in all rows while keeping the other columns intact. However, the answers in the above question are only for pandas. Is there a solution for a PySpark dataframe? Thanks.

Answer 1: You can apply the countDistinct() aggregation function on each column to get the count of distinct values per column. A column with count=1
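
A minimal sketch of the countDistinct() approach the answer describes, using a hypothetical sample dataframe:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "x", 10), (2, "x", 20), (3, "x", 30)],
        ["id", "constant_col", "value"],
    )

    # One aggregation pass: number of distinct values per column.
    counts = df.agg(*[F.countDistinct(c).alias(c) for c in df.columns]).first().asDict()

    # Columns with exactly one distinct value hold the same value in every row.
    to_drop = [c for c, n in counts.items() if n == 1]

    result = df.drop(*to_drop)   # only 'constant_col' is dropped here
    result.show()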

Spark ML Pipeline Causes java.lang.Exception: failed to compile … Code … grows beyond 64 KB

耗尽温柔 Submitted on 2019-12-30 18:55:34
Problem: Using Spark 2.0, I am trying to run a simple VectorAssembler in a PySpark ML pipeline, like so:

    feature_assembler = VectorAssembler(inputCols=['category_count', 'name_count'],
                                        outputCol="features")
    pipeline = Pipeline(stages=[feature_assembler])
    model = pipeline.fit(df_train)
    model_output = model.transform(df_train)

When I try to look at the output using

    model_output.select("features").show(1)

I get the error

    Py4JJavaError Traceback (most recent call last)
    <ipython-input-95-7a3e3d4f281c> in
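
The excerpt ends inside the stack trace. A workaround often suggested for the 64 KB generated-code limit (an assumption here, not taken from the post) is to disable whole-stage code generation; checkpointing the dataframe to cut a long lineage is another option not shown below. A minimal sketch with hypothetical training data:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler

    spark = (
        SparkSession.builder
        .config("spark.sql.codegen.wholeStage", "false")   # avoid one huge generated method
        .getOrCreate()
    )

    # Hypothetical stand-in for the poster's df_train.
    df_train = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)],
                                     ["category_count", "name_count"])

    feature_assembler = VectorAssembler(inputCols=["category_count", "name_count"],
                                        outputCol="features")
    pipeline = Pipeline(stages=[feature_assembler])
    model = pipeline.fit(df_train)
    model.transform(df_train).select("features").show(1)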