pyspark

Pyspark multi groupby with different columns

痞子三分冷 submitted on 2021-02-04 16:28:26
Question: I have data like below:

    year  name     percent   sex
    1880  John     0.081541  boy
    1881  William  0.080511  boy
    1881  John     0.050057  boy

I need to group by and count using different columns:

    df_year = df.groupby('year').count()
    df_name = df.groupby('name').count()
    df_sex = df.groupby('sex').count()

Then I have to create a Window to get the top-3 rows for each column:

    window = Window.partitionBy('year').orderBy(col("count").desc())
    top4_res = df_year.withColumn('topn', func.row_number().over(window)).\
        filter(col(
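A minimal sketch of one way to get the top-3 counts for each grouping column, assuming a DataFrame with the rows shown above; the loop, the top3 name, and the ordering window are illustrative choices, not taken from the original post.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1880, 'John', 0.081541, 'boy'),
         (1881, 'William', 0.080511, 'boy'),
         (1881, 'John', 0.050057, 'boy')],
        ['year', 'name', 'percent', 'sex'])

    # For each grouping column, count rows per value and keep the 3 largest counts.
    for col_name in ['year', 'name', 'sex']:
        counts = df.groupBy(col_name).count()
        w = Window.orderBy(F.col('count').desc())   # rank over the whole count table
        top3 = (counts.withColumn('topn', F.row_number().over(w))
                      .filter(F.col('topn') <= 3))
        top3.show()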

How to match/extract a multi-line pattern from a file in pyspark

帅比萌擦擦* submitted on 2021-02-04 15:51:50
Question: I have a huge file of RDF triples (subject predicate object) as shown in the image below. The goal is to extract the bold items and produce the following output:

    Item_Id | quantityAmount | quantityUnit | rank
    -----------------------------------------------
    Q31     | 24954          | Meter        | BestRank
    Q25     | 582            | Kilometer    | NormalRank

I want to extract lines that follow this pattern:

    the subject is given a pointer  ( <Q31> <prop/P1082> <Pointer_Q31-87RF> . )
    the pointer has a ranking       ( <Pointer_Q31-87RF> <rank> <BestRank>
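One possible starting point, assuming each line of the file holds a single "<subject> <predicate> <object> ." triple. The file name, the regular expression, and the predicate filters below are illustrative assumptions, and since the excerpt is truncated and does not show where quantityAmount and quantityUnit come from, only the item-to-rank part is sketched.

    import re
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Parse each "<s> <p> <o> ." line into three columns.
    triple_re = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>')

    def parse(line):
        m = triple_re.match(line)
        return (m.group(1), m.group(2), m.group(3)) if m else None

    triples = (spark.sparkContext.textFile('triples.nt')   # hypothetical path
               .map(parse)
               .filter(lambda t: t is not None)
               .toDF(['subject', 'predicate', 'object']))

    # Item -> pointer, e.g. <Q31> <prop/P1082> <Pointer_Q31-87RF>
    pointers = (triples.filter(F.col('predicate').endswith('P1082'))
                .select(F.col('subject').alias('Item_Id'),
                        F.col('object').alias('pointer')))

    # Pointer -> rank, e.g. <Pointer_Q31-87RF> <rank> <BestRank>
    ranks = (triples.filter(F.col('predicate').endswith('rank'))
             .select(F.col('subject').alias('pointer'),
                     F.col('object').alias('rank')))

    pointers.join(ranks, 'pointer').select('Item_Id', 'rank').show()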

Subtract consecutive columns in a Pandas or Pyspark Dataframe

北战南征 submitted on 2021-02-04 15:51:45
Question: I would like to perform the following operation in a pandas or pyspark dataframe but I still haven't found a solution. I want to subtract the values of consecutive columns in a dataframe. The operation I am describing can be seen in the image below. Bear in mind that the output dataframe won't have any values in the first column, as the first column of the input table has no previous column to subtract from it.

Answer 1: diff has an axis param so you can just do this in one step: In
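A minimal sketch of the pandas diff(axis=1) approach the answer points to, together with one possible PySpark equivalent; the column names and sample values are made up for illustration.

    import pandas as pd

    pdf = pd.DataFrame({'c1': [1, 4], 'c2': [3, 6], 'c3': [7, 5]})

    # diff(axis=1) subtracts each column from the column to its right;
    # the first column becomes NaN because it has no predecessor.
    print(pdf.diff(axis=1))

    # A possible PySpark equivalent: subtract adjacent columns pair by pair.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    sdf = spark.createDataFrame(pdf)
    cols = sdf.columns
    diffed = sdf.select(
        F.lit(None).alias(cols[0]),        # no predecessor for the first column
        *[(F.col(cols[i]) - F.col(cols[i - 1])).alias(cols[i])
          for i in range(1, len(cols))]
    )
    diffed.show()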

How to speed up spark df.write jdbc to postgres database?

最后都变了- submitted on 2021-02-04 12:16:14
Question: I am new to Spark and am attempting to speed up appending the contents of a dataframe (which can have between 200k and 2M rows) to a Postgres database using df.write:

    df.write.format('jdbc').options(
        url=psql_url_spark,
        driver=spark_env['PSQL_DRIVER'],
        dbtable="{schema}.{table}".format(schema=schema, table=table),
        user=spark_env['PSQL_USER'],
        password=spark_env['PSQL_PASS'],
        batchsize=2000000,
        queryTimeout=690
    ).mode(mode).save()

I tried increasing the batchsize but that didn't help, as
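One commonly suggested direction (not taken from the original thread) is to write over several parallel JDBC connections and let the Postgres driver rewrite batched inserts; a rough sketch with placeholder connection values:

    # Sketch: parallelize the JDBC write and enable batched inserts in the Postgres driver.
    # The url, table, user and password values are placeholders.
    (df.repartition(8)                    # 8 concurrent connections; tune to what the database can absorb
       .write.format('jdbc')
       .option('url', 'jdbc:postgresql://host:5432/db?reWriteBatchedInserts=true')
       .option('driver', 'org.postgresql.Driver')
       .option('dbtable', 'schema.table')
       .option('user', 'user')
       .option('password', 'password')
       .option('batchsize', 10000)        # rows per INSERT batch, per partition
       .mode('append')
       .save())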

Convert timestamp to date in spark dataframe

喜欢而已 submitted on 2021-02-04 10:52:24
Question: I've seen here: How to convert Timestamp to Date format in DataFrame? the way to convert a timestamp into a date type, but at least for me, it doesn't work. Here is what I've tried:

    # Create dataframe
    df_test = spark.createDataFrame([('20170809',), ('20171007',)], ['date',])

    # Convert to timestamp
    df_test2 = df_test.withColumn('timestamp',
        func.when((df_test.date.isNull() | (df_test.date == '')), '0')\
            .otherwise(func.unix_timestamp(df_test.date, 'yyyyMMdd')))\

    # Convert timestamp to date again
    df
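A minimal sketch of one general way to turn a 'yyyyMMdd' string column into a date, either through a unix timestamp or directly; this is a common approach, not the answer quoted in the original thread.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df_test = spark.createDataFrame([('20170809',), ('20171007',)], ['date'])

    out = (df_test
           # via an epoch column, as in the question's attempt
           .withColumn('timestamp', F.unix_timestamp('date', 'yyyyMMdd'))
           .withColumn('as_date', F.to_date(F.from_unixtime('timestamp')))
           # or parse the string directly
           .withColumn('as_date_direct', F.to_date('date', 'yyyyMMdd')))
    out.show()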

How to read multiline CSV file in Pyspark

我们两清 submitted on 2021-02-04 08:26:12
Question: I'm using this tweets dataset with Pyspark in order to process it and get some trends according to the tweets' locations. But I'm having a problem when I try to create the dataframe. I'm using

    spark.read.options(header="True").csv("hashtag_donaldtrump.csv")

to create the dataframe, but if I look at the tweets column, this is the result I get (shown in an image in the original post). Do you know how I can clean the CSV file so it can be processed by Spark? Thank you in advance!

Answer 1: It looks like a multiline csv. Try doing df = spark
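A sketch of the multiline read the truncated answer appears to be heading toward; the escape setting is a common choice for tweet text with embedded quotes and newlines, not confirmed by the original answer.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # multiLine lets a quoted field span several physical lines;
    # escape='"' handles doubled quotes inside tweet text.
    df = (spark.read
          .option('header', True)
          .option('multiLine', True)
          .option('escape', '"')
          .csv('hashtag_donaldtrump.csv'))
    df.show(5, truncate=80)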

To run spark-submit programs from a different cluster (1**.1*.0.21) in Airflow (1**.1*.0.35). How to connect remotely to another cluster from Airflow

走远了吗. submitted on 2021-01-29 22:41:16
Question: I have been trying to spark-submit programs in Airflow, but the Spark files are on a different cluster (1**.1*.0.21) and Airflow is on (1**.1*.0.35). I am looking for a detailed explanation of this topic with examples. I can't copy or download any xml files or other files to my Airflow cluster. I also have many doubts about using the SSHOperator and BashOperator. When I try the SSH hook it says:

    Broken DAG: [/opt/airflow/dags/s.py] No module named paramiko

Answer 1: You can try using Livy In the following
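A rough sketch of what submitting a batch to the remote Spark cluster through Livy's REST API could look like from an Airflow task; it assumes a Livy server is running on the Spark cluster, and the host, port, and file path are placeholders.

    import json
    import requests   # assumed to be available in the Airflow environment

    LIVY_URL = 'http://1**.1*.0.21:8998'    # placeholder: Livy's default port on the Spark cluster

    def submit_spark_job():
        """Submit a batch job to the remote cluster through Livy's REST API."""
        payload = {
            'file': 'hdfs:///jobs/my_job.py',   # placeholder path, must be readable by the cluster
            'args': ['--run-date', '2021-01-29'],
            'conf': {'spark.submit.deployMode': 'cluster'},
        }
        resp = requests.post(LIVY_URL + '/batches',
                             data=json.dumps(payload),
                             headers={'Content-Type': 'application/json'})
        resp.raise_for_status()
        return resp.json()['id']    # batch id; its state can be polled at /batches/<id>/state

    # This function could be wrapped in a PythonOperator, or the LivyOperator from
    # apache-airflow-providers-apache-livy could be used instead of hand-rolled requests calls.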

Discard bad records and load only good records into a dataframe from a JSON file in pyspark

不羁的心 submitted on 2021-01-29 21:35:52
Question: The API-generated JSON file looks like below. The format of the JSON file is not correct. Can we discard the bad records and load only the good rows into a dataframe using pyspark?

    {
      "name": "PowerAmplifier",
      "Component": "12uF Capacitor\n1/21Resistor\n3 Inductor In Henry\PowerAmplifier\n ",
      "url": "https://www.onsemi.com/products/amplifiers-comparators/",
      "image": "https://www.onsemi.com/products/amplifiers-comparators/",
      "ThresholdTime": "48min",
      "MFRDate": "2019-05-08",
      "FallTime":
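A sketch of the reader options Spark provides for skipping records it cannot parse; this shows the generic mechanism rather than anything from the original thread, and the file name is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # mode='DROPMALFORMED' silently drops records that cannot be parsed;
    # the default mode, PERMISSIVE, would keep them in a _corrupt_record column instead.
    good = (spark.read
            .option('multiLine', True)          # each API record spans several physical lines
            .option('mode', 'DROPMALFORMED')
            .json('api_output.json'))
    good.show(truncate=False)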
