pyspark

py4j.protocol.Py4JJavaError: An error occurred while calling o788.save. : com.mongodb.MongoTimeoutException, WritableServerSelector

偶尔善良 submitted on 2020-01-25 08:59:26
Question: PySpark version: 2.4.4; MongoDB version: 4.2.0; RAM: 64 GB; CPU cores: 32. Running script: spark-submit --executor-memory 8G --driver-memory 8G --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.1 demographic.py. When I run the code I get the error: "py4j.protocol.Py4JJavaError: An error occurred while calling o764.save. : com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type
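A minimal write sketch, assuming the Mongo Spark connector 2.x configuration keys; the URI, database, and collection names below are placeholders rather than values from the question, and the timeout itself usually means the driver or executors cannot reach a writable mongod (or the replica set has no primary), not that the write code is wrong.

from pyspark.sql import SparkSession

# Placeholder URI, database and collection; with a MongoTimeoutException the first
# thing to verify is that this host/port is reachable from every Spark node and
# that the replica set reports a writable primary.
spark = (
    SparkSession.builder
    .appName("demographic")
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/mydb.mycoll")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# 2.x data source name for the connector pulled in via --packages on spark-submit
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()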

How to zip two array columns in Spark SQL

徘徊边缘 submitted on 2020-01-25 08:10:27
Question: I have a Pandas dataframe. I first joined the two columns containing string values into a list, and then, using zip, joined each element of the list with '_'. My data set is like below: df['column_1']: 'abc, def, ghi' df['column_2']: '1.0, 2.0, 3.0' I want to join these two columns into a third column, like below, for each row of my dataframe: df['column_3']: [abc_1.0, def_2.0, ghi_3.0] I have successfully done so in Python using the code below, but the dataframe is quite large and it
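A sketch of a pure-Spark alternative to the Python-side zip, assuming Spark 2.4+ where the transform higher-order function (with an index lambda) is available via expr; the one-row DataFrame is a stand-in built from the values quoted in the question.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("abc, def, ghi", "1.0, 2.0, 3.0")],
    ["column_1", "column_2"],
)

result = (
    df.withColumn("a1", F.split("column_1", r",\s*"))
      .withColumn("a2", F.split("column_2", r",\s*"))
      # pair the i-th elements of the two arrays and glue them with '_'
      .withColumn("column_3",
                  F.expr("transform(a1, (x, i) -> concat(x, '_', a2[i]))"))
      .drop("a1", "a2")
)
result.show(truncate=False)   # column_3: [abc_1.0, def_2.0, ghi_3.0]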

'Column' object is not callable with Regex and Pyspark

限于喜欢 submitted on 2020-01-25 07:58:09
Question: I need to extract only the integers from URL strings in the column "Page URL" and append those extracted integers to a new column. I am using PySpark. My code is below: from pyspark.sql.functions import col, regexp_extract spark_df_url.withColumn("new_column", regexp_extract(col("Page URL"), "\d+", 1).show()) I get the following error: TypeError: 'Column' object is not callable. Answer 1: You may use spark_df_url.withColumn("new_column", regexp_extract("Page URL", "\d+", 0)) Specify the name of the
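A runnable version of the fix quoted in the answer; the sample URLs are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.getOrCreate()

spark_df_url = spark.createDataFrame(
    [("https://example.com/page/123",), ("https://example.com/page/456",)],
    ["Page URL"],
)

# .show() belongs on the DataFrame, not on the Column returned by regexp_extract,
# and group index 0 returns the whole match of \d+
result = spark_df_url.withColumn("new_column",
                                 regexp_extract("Page URL", r"\d+", 0))
result.show(truncate=False)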

Create another column for checking a different value in pyspark

十年热恋 submitted on 2020-01-25 06:50:48
Question: I wish to have the expected output below. My code: import numpy as np pd_dataframe = pd.DataFrame({'id': [i for i in range(10)], 'values': [10,5,3,-1,0,-10,-4,10,0,10]}) sp_dataframe = spark.createDataFrame(pd_dataframe) sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType()) sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values')) sp_dataframe.show() I want to create another column that adds 1 whenever the value differs from the previous row.
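A sketch of one reading of the requirement: the expected output referenced in the question is not included in the excerpt, so this assumes a running counter that increases by 1 each time the sign differs from the previous row's sign, ordered by id. It continues the setup from the question with a lag comparison and a cumulative sum over a window.

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession, Window, functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

pd_dataframe = pd.DataFrame({'id': list(range(10)),
                             'values': [10, 5, 3, -1, 0, -10, -4, 10, 0, 10]})
sp_dataframe = spark.createDataFrame(pd_dataframe)

sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType())
sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values'))

w = Window.orderBy('id')

# flag rows whose sign differs from the previous row, then accumulate the flags
sp_dataframe = sp_dataframe.withColumn(
    'changed',
    F.when(F.lag('sign').over(w).isNull(), 0)    # first row has nothing to compare to
     .when(F.col('sign') != F.lag('sign').over(w), 1)
     .otherwise(0))
sp_dataframe = sp_dataframe.withColumn('diff_count', F.sum('changed').over(w))
sp_dataframe.show()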

Multiply two pyspark dataframe columns with different types (array[double] vs double) without breeze

懵懂的女人 submitted on 2020-01-25 06:48:25
Question: I have the same problem as asked here, but I need a solution in pyspark and without breeze. For example, if my pyspark dataframe looks like this: user | weight | vec "u1" | 0.1 | [2, 4, 6] "u1" | 0.5 | [4, 8, 12] "u2" | 0.5 | [20, 40, 60] where column weight has type double and column vec has type Array[Double], I would like to get the weighted sum of the vectors per user, so that I get a dataframe that looks like this: user | wsum "u1" | [2.2, 4.4, 6.6] "u2" | [10, 20, 30] To do this I have
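A sketch that stays inside Spark SQL, assuming Spark 2.4+ where the transform, aggregate, and zip_with higher-order functions are available (no breeze, no Python UDF); the sample rows mirror the ones quoted in the question.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("u1", 0.1, [2.0, 4.0, 6.0]),
     ("u1", 0.5, [4.0, 8.0, 12.0]),
     ("u2", 0.5, [20.0, 40.0, 60.0])],
    ["user", "weight", "vec"],
)

# scale each vector by its row's weight
scaled = df.withColumn("wvec", F.expr("transform(vec, x -> x * weight)"))

# collect the scaled vectors per user and sum them element-wise
wsum = (
    scaled.groupBy("user")
          .agg(F.collect_list("wvec").alias("wvecs"))
          .withColumn(
              "wsum",
              F.expr("aggregate(slice(wvecs, 2, size(wvecs)), wvecs[0],"
                     " (acc, v) -> zip_with(acc, v, (a, b) -> a + b))"))
          .drop("wvecs")
)
wsum.show(truncate=False)   # u1 -> [2.2, 4.4, 6.6], u2 -> [10.0, 20.0, 30.0]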

Show all pyspark columns after group and agg

陌路散爱 submitted on 2020-01-25 06:40:52
Question: I wish to group by a column and then find the max of another column. Lastly, show all the columns based on this condition. However, when I use my code, it only shows 2 columns and not all of them. # Normal way of creating dataframe in pyspark sdataframe_temp = spark.createDataFrame([ (2,2,'0-2'), (2,23,'22-24')], ['a', 'b', 'c'] ) sdataframe_temp2 = spark.createDataFrame([ (4,6,'4-6'), (5,7,'6-8')], ['a', 'b', 'c'] ) # Concat two different pyspark dataframe sdataframe_union_1_2 = sdataframe
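A sketch of the usual join-back pattern so every original column survives the aggregation; it continues the DataFrames built in the excerpt, with union standing in for the truncated concat step.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

sdataframe_temp = spark.createDataFrame([(2, 2, '0-2'), (2, 23, '22-24')], ['a', 'b', 'c'])
sdataframe_temp2 = spark.createDataFrame([(4, 6, '4-6'), (5, 7, '6-8')], ['a', 'b', 'c'])
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)

# keep only the rows holding the max of b within each group of a,
# then join back to the full DataFrame to recover every column
max_b = sdataframe_union_1_2.groupBy('a').agg(F.max('b').alias('b'))
result = sdataframe_union_1_2.join(max_b, on=['a', 'b'], how='inner')
result.show()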

Pyspark dataframe get all values of a column

筅森魡賤 submitted on 2020-01-25 00:25:49
Question: I want to get all values of a column in a pyspark dataframe. I did some searching, but I never found an efficient and short solution. Assuming I want to get the values in the column called "name", I have a solution: sum(dataframe.select("name").toPandas().values.tolist(),[]) It works, but it is not efficient since it converts to pandas and then flattens the list... Is there a better and shorter solution? Answer 1: The options below will give better performance than sum. Using collect_list: import pyspark.sql
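A sketch of the two options the answer is heading towards; "name" is the column from the question, and the two-row DataFrame is only there to make the snippet self-contained.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
dataframe = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# option 1: push the flattening into Spark with collect_list
names = dataframe.agg(F.collect_list("name")).first()[0]

# option 2: collect the single column and unpack the Row objects
names = [row["name"] for row in dataframe.select("name").collect()]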

How to extract data from asn1 data file and load it into a dataframe?

橙三吉。 submitted on 2020-01-24 22:11:06
Question: My ultimate goal is to load metadata received from PubMed into a pyspark dataframe. So far, I have managed to download the data I want from the PubMed database using a shell script. The downloaded data is in ASN.1 format. Here is an example of a data entry: Pubmed-entry ::= { pmid 31782536, medent { em std { year 2019, month 11, day 30, hour 6, minute 0 }, cit { title { name "Impact of CYP2C19 genotype and drug interactions on voriconazole plasma concentrations: a spain pharmacogenetic
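A rough sketch only, not a real ASN.1 parser: it assumes the downloaded files hold text-format Pubmed-entry records like the excerpt above and pulls just the pmid and article title out with regular expressions before building a Spark DataFrame; the glob path and field patterns are placeholders.

import re
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.getOrCreate()

def parse_entry(text):
    # naive field extraction; a production job would use a proper ASN.1 library
    pmid = re.search(r"pmid\s+(\d+)", text)
    title = re.search(r'name\s+"([^"]+)"', text)
    return (int(pmid.group(1)) if pmid else None,
            title.group(1) if title else None)

schema = StructType([StructField("pmid", LongType(), True),
                     StructField("title", StringType(), True)])

# wholeTextFiles yields (path, file content) pairs, one per downloaded file
entries = spark.sparkContext.wholeTextFiles("pubmed_asn1/*.asn")
df = spark.createDataFrame(entries.map(lambda kv: parse_entry(kv[1])), schema)
df.show(truncate=False)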