pyspark

py4j.protocol.Py4JJavaError: An error occurred while calling o788.save. : com.mongodb.MongoTimeoutException, WritableServerSelector

偶尔善良 submitted on 2020-01-25 08:59:26
Question: PySpark version: 2.4.4; MongoDB version: 4.2.0; RAM: 64 GB; CPU cores: 32. Running script: spark-submit --executor-memory 8G --driver-memory 8G --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.1 demographic.py. When I run the code I get the error: "py4j.protocol.Py4JJavaError: An error occurred while calling o764.save. : com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type
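A minimal write sketch, assuming the Mongo Spark connector 2.x configuration keys; the URI, database, and collection names below are placeholders rather than values from the question, and the timeout itself usually means the driver or executors cannot reach a writable mongod (or the replica set has no primary), not that the write code is wrong.

from pyspark.sql import SparkSession

# Placeholder URI, database and collection; with a MongoTimeoutException the first
# thing to verify is that this host/port is reachable from every Spark node and
# that the replica set reports a writable primary.
spark = (
    SparkSession.builder
    .appName("demographic")
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/mydb.mycoll")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# 2.x data source name for the connector pulled in via --packages on spark-submit
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()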

How to zip two array columns in Spark SQL

徘徊边缘 submitted on 2020-01-25 08:10:27
Question: I have a Pandas dataframe. I first joined the two columns containing string values into a list, and then, using zip, joined each element of the list with '_'. My data set is like below: df['column_1']: 'abc, def, ghi' df['column_2']: '1.0, 2.0, 3.0' I want to join these two columns into a third column, like below, for each row of my dataframe: df['column_3']: [abc_1.0, def_2.0, ghi_3.0] I have successfully done so in Python using the code below, but the dataframe is quite large and it
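A sketch of a pure-Spark alternative to the Python-side zip, assuming Spark 2.4+ where the transform higher-order function (with an index lambda) is available via expr; the one-row DataFrame is a stand-in built from the values quoted in the question.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("abc, def, ghi", "1.0, 2.0, 3.0")],
    ["column_1", "column_2"],
)

result = (
    df.withColumn("a1", F.split("column_1", r",\s*"))
      .withColumn("a2", F.split("column_2", r",\s*"))
      # pair the i-th elements of the two arrays and glue them with '_'
      .withColumn("column_3",
                  F.expr("transform(a1, (x, i) -> concat(x, '_', a2[i]))"))
      .drop("a1", "a2")
)
result.show(truncate=False)   # column_3: [abc_1.0, def_2.0, ghi_3.0]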

'Column' object is not callable with Regex and Pyspark

限于喜欢 submitted on 2020-01-25 07:58:09
Question: I need to extract only the integers from URL strings in the column "Page URL" and append those extracted integers to a new column. I am using PySpark. My code is below: from pyspark.sql.functions import col, regexp_extract spark_df_url.withColumn("new_column", regexp_extract(col("Page URL"), "\d+", 1).show()) I get the following error: TypeError: 'Column' object is not callable. Answer 1: You may use spark_df_url.withColumn("new_column", regexp_extract("Page URL", "\d+", 0)) Specify the name of the
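A runnable version of the fix quoted in the answer; the sample URLs are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.getOrCreate()

spark_df_url = spark.createDataFrame(
    [("https://example.com/page/123",), ("https://example.com/page/456",)],
    ["Page URL"],
)

# .show() belongs on the DataFrame, not on the Column returned by regexp_extract,
# and group index 0 returns the whole match of \d+
result = spark_df_url.withColumn("new_column",
                                 regexp_extract("Page URL", r"\d+", 0))
result.show(truncate=False)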

Create another column for checking a different value in pyspark

十年热恋 submitted on 2020-01-25 06:50:48
Question: I wish to have the expected output below. My code: import numpy as np pd_dataframe = pd.DataFrame({'id': [i for i in range(10)], 'values': [10,5,3,-1,0,-10,-4,10,0,10]}) sp_dataframe = spark.createDataFrame(pd_dataframe) sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType()) sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values')) sp_dataframe.show() I want to create another column that adds 1 whenever the value differs from the previous row.
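A sketch of one reading of the requirement: the expected output referenced in the question is not included in the excerpt, so this assumes a running counter that increases by 1 each time the sign differs from the previous row's sign, ordered by id. It continues the setup from the question with a lag comparison and a cumulative sum over a window.

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession, Window, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

pd_dataframe = pd.DataFrame({'id': list(range(10)),
                             'values': [10, 5, 3, -1, 0, -10, -4, 10, 0, 10]})
sp_dataframe = spark.createDataFrame(pd_dataframe)

sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType())
sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values'))

w = Window.orderBy('id')

# flag rows whose sign differs from the previous row, then accumulate the flags
sp_dataframe = sp_dataframe.withColumn(
    'changed',
    F.when(F.lag('sign').over(w).isNull(), 0)    # first row has nothing to compare to
     .when(F.col('sign') != F.lag('sign').over(w), 1)
     .otherwise(0))
sp_dataframe = sp_dataframe.withColumn('diff_count', F.sum('changed').over(w))
sp_dataframe.show()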

Multiply two pyspark dataframe columns with different types (array[double] vs double) without breeze

懵懂的女人 submitted on 2020-01-25 06:48:25
Question: I have the same problem as asked here, but I need a solution in pyspark and without breeze. For example, if my pyspark dataframe looks like this: user | weight | vec "u1" | 0.1 | [2, 4, 6] "u1" | 0.5 | [4, 8, 12] "u2" | 0.5 | [20, 40, 60] where column weight has type double and column vec has type Array[Double], I would like to get the weighted sum of the vectors per user, so that I get a dataframe that looks like this: user | wsum "u1" | [2.2, 4.4, 6.6] "u2" | [10, 20, 30] To do this I have
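A sketch that stays inside Spark SQL, assuming Spark 2.4+ where the transform, aggregate, and zip_with higher-order functions are available (no breeze, no Python UDF); the sample rows mirror the ones quoted in the question.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("u1", 0.1, [2.0, 4.0, 6.0]),
     ("u1", 0.5, [4.0, 8.0, 12.0]),
     ("u2", 0.5, [20.0, 40.0, 60.0])],
    ["user", "weight", "vec"],
)

# scale each vector by its row's weight
scaled = df.withColumn("wvec", F.expr("transform(vec, x -> x * weight)"))

# collect the scaled vectors per user and sum them element-wise
wsum = (
    scaled.groupBy("user")
          .agg(F.collect_list("wvec").alias("wvecs"))
          .withColumn(
              "wsum",
              F.expr("aggregate(slice(wvecs, 2, size(wvecs)), wvecs[0],"
                     " (acc, v) -> zip_with(acc, v, (a, b) -> a + b))"))
          .drop("wvecs")
)
wsum.show(truncate=False)   # u1 -> [2.2, 4.4, 6.6], u2 -> [10.0, 20.0, 30.0]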

Show all pyspark columns after group and agg

陌路散爱 submitted on 2020-01-25 06:40:52
Question: I wish to group by a column and then find the max of another column. Lastly, show all the columns based on this condition. However, when I use my code, it only shows 2 columns and not all of them. # Normal way of creating dataframe in pyspark sdataframe_temp = spark.createDataFrame([ (2,2,'0-2'), (2,23,'22-24')], ['a', 'b', 'c'] ) sdataframe_temp2 = spark.createDataFrame([ (4,6,'4-6'), (5,7,'6-8')], ['a', 'b', 'c'] ) # Concat two different pyspark dataframe sdataframe_union_1_2 = sdataframe
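A sketch of the usual join-back pattern so every original column survives the aggregation; it continues the DataFrames built in the excerpt, with union standing in for the truncated concat step.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

sdataframe_temp = spark.createDataFrame([(2, 2, '0-2'), (2, 23, '22-24')], ['a', 'b', 'c'])
sdataframe_temp2 = spark.createDataFrame([(4, 6, '4-6'), (5, 7, '6-8')], ['a', 'b', 'c'])
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)

# keep only the rows holding the max of b within each group of a,
# then join back to the full DataFrame to recover every column
max_b = sdataframe_union_1_2.groupBy('a').agg(F.max('b').alias('b'))
result = sdataframe_union_1_2.join(max_b, on=['a', 'b'], how='inner')
result.show()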

Pyspark dataframe get all values of a column

筅森魡賤 submitted on 2020-01-25 00:25:49
Question: I want to get all values of a column in a pyspark dataframe. I did some searching, but I never found an efficient and short solution. Assuming I want to get the values in the column called "name", I have a solution: sum(dataframe.select("name").toPandas().values.tolist(),[]) It works, but it is not efficient since it converts to pandas and then flattens the list... Is there a better and shorter solution? Answer 1: The options below will give better performance than sum. Using collect_list: import pyspark.sql
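A sketch of the two options the answer is heading towards; "name" is the column from the question, and the two-row DataFrame is only there to make the snippet self-contained.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
dataframe = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# option 1: push the flattening into Spark with collect_list
names = dataframe.agg(F.collect_list("name")).first()[0]

# option 2: collect the single column and unpack the Row objects
names = [row["name"] for row in dataframe.select("name").collect()]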

How to extract data from asn1 data file and load it into a dataframe?

橙三吉。 submitted on 2020-01-24 22:11:06
Question: My ultimate goal is to load metadata received from PubMed into a pyspark dataframe. So far, I have managed to download the data I want from the PubMed database using a shell script. The downloaded data is in ASN.1 format. Here is an example of a data entry: Pubmed-entry ::= { pmid 31782536, medent { em std { year 2019, month 11, day 30, hour 6, minute 0 }, cit { title { name "Impact of CYP2C19 genotype and drug interactions on voriconazole plasma concentrations: a spain pharmacogenetic
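A rough sketch only, not a real ASN.1 parser: it assumes the downloaded files hold text-format Pubmed-entry records like the excerpt above and pulls just the pmid and article title out with regular expressions before building a Spark DataFrame; the glob path and field patterns are placeholders.

import re
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()

def parse_entry(text):
    # naive field extraction; a production job would use a proper ASN.1 library
    pmid = re.search(r"pmid\s+(\d+)", text)
    title = re.search(r'name\s+"([^"]+)"', text)
    return (int(pmid.group(1)) if pmid else None,
            title.group(1) if title else None)

schema = StructType([StructField("pmid", LongType(), True),
                     StructField("title", StringType(), True)])

# wholeTextFiles yields (path, file content) pairs, one per downloaded file
entries = spark.sparkContext.wholeTextFiles("pubmed_asn1/*.asn")
df = spark.createDataFrame(entries.map(lambda kv: parse_entry(kv[1])), schema)
df.show(truncate=False)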