apache-spark

Pyspark multi groupby with different columns

Submitted by 痞子三分冷 on 2021-02-04 16:28:26
Question: I have data like below:

year  name     percent   sex
1880  John     0.081541  boy
1881  William  0.080511  boy
1881  John     0.050057  boy

I need to group by and count using different columns:

df_year = df.groupby('year').count()
df_name = df.groupby('name').count()
df_sex = df.groupby('sex').count()

Then I have to create a Window to get the top-3 data by each column:

window = Window.partitionBy('year').orderBy(col("count").desc())
top4_res = df_year.withColumn('topn', func.row_number().over(window)).\
    filter(col(
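
A minimal sketch of the top-N-per-group pattern this is reaching for, assuming a DataFrame df with the columns shown above; the window here ranks the counts globally (no partition column), and the variable names are illustrative:

from pyspark.sql import functions as func
from pyspark.sql.window import Window

# Count rows per year, rank the counts, and keep the three largest.
df_year = df.groupby('year').count()
window = Window.orderBy(func.col('count').desc())
top3_year = (df_year
             .withColumn('topn', func.row_number().over(window))
             .filter(func.col('topn') <= 3))

# The same pattern repeats for 'name' and 'sex' by swapping the groupby column.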

How to match/extract multi-line pattern from file in pyspark

Submitted by 帅比萌擦擦* on 2021-02-04 15:51:50
Question: I have a huge file of RDF triplets (subject predicate object) as shown in the image below. The goal is to extract the bold items and produce the following output:

Item_Id | quantityAmount | quantityUnit | rank
------- | -------------- | ------------ | ----------
Q31     | 24954          | Meter        | BestRank
Q25     | 582            | Kilometer    | NormalRank

I want to extract lines that follow this pattern:

subject is given a pointer ( <Q31> <prop/P1082> <Pointer_Q31-87RF> . )
Pointer has a ranking ( <Pointer_Q31-87RF> <rank> <BestRank>
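
One rough way to approach this in PySpark, assuming the triples sit one per line in a plain-text dump; the file name, regex, and predicate matching below are illustrative and would need to be adjusted to the real data:

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parse each "<subject> <predicate> <object> ." line into three fields.
triple_re = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>')

def parse(line):
    m = triple_re.search(line)
    return (m.group(1), m.group(2), m.group(3)) if m else None

triples = (spark.sparkContext.textFile("triples.nt")
           .map(parse)
           .filter(lambda t: t is not None)
           .toDF(["subject", "predicate", "object"]))

# Rows like <Q31> <prop/P1082> <Pointer_...> link an item to a pointer,
# and rows like <Pointer_...> <rank> <BestRank> attach the rank to that
# pointer, so a join on the pointer stitches the two back together.
items = triples.filter(triples.predicate.endswith("prop/P1082")) \
               .selectExpr("subject as Item_Id", "object as pointer")
ranks = triples.filter(triples.predicate.endswith("rank")) \
               .selectExpr("subject as pointer", "object as rank")

result = items.join(ranks, "pointer")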

Batched API call inside apache spark?

Submitted by 梦想的初衷 on 2021-02-04 15:00:50
Question: I am a beginner with Apache Spark and I have the following task: I am reading records from a datasource that, within the Spark transformations, need to be enhanced by data from a call to an external webservice before they can be processed any further. The webservice will accept parallel calls to a certain extent, but only allows a few hundred records to be sent at once. Also, it's quite slow, so batching up as much as possible and sending requests in parallel definitely help here. Is there are
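
A common way to batch external calls inside Spark is mapPartitions with manual chunking, sketched below; call_webservice and the batch size of 200 are hypothetical placeholders, not part of the original question:

from itertools import islice

BATCH_SIZE = 200  # illustrative; match the webservice's per-request limit

def enrich_partition(rows):
    rows = iter(rows)
    while True:
        # Pull up to BATCH_SIZE records, send them in one request,
        # and stream the enriched records back to Spark.
        batch = list(islice(rows, BATCH_SIZE))
        if not batch:
            break
        for enriched in call_webservice(batch):  # hypothetical helper
            yield enriched

enriched_rdd = records_rdd.mapPartitions(enrich_partition)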

How to speed up spark df.write jdbc to postgres database?

Submitted by 最后都变了- on 2021-02-04 12:16:14
Question: I am new to Spark and am attempting to speed up appending the contents of a dataframe (which can have between 200k and 2M rows) to a Postgres database using df.write:

df.write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()

I tried increasing the batchsize but that didn't help, as
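
A sketch of the usual first knob to turn: repartition the DataFrame so several executors write over parallel JDBC connections, with a more modest batchsize per connection; the specific numbers here are illustrative, not a recommendation:

df.repartition(8) \
  .write.format('jdbc').options(
      url=psql_url_spark,
      driver=spark_env['PSQL_DRIVER'],
      dbtable="{schema}.{table}".format(schema=schema, table=table),
      user=spark_env['PSQL_USER'],
      password=spark_env['PSQL_PASS'],
      batchsize=10000,       # rows per INSERT batch on each connection
      numPartitions=8        # upper bound on concurrent JDBC connections
  ).mode(mode).save()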

Calculate the running time for spark sql

Submitted by 99封情书 on 2021-02-04 11:39:07
Question: I'm trying to run a couple of Spark SQL statements and want to calculate their running time. One solution is to resort to the logs. I'm wondering whether there are any simpler methods to do it. Something like the following:

import time

startTimeQuery = time.clock()
df = sqlContext.sql(query)
df.show()
endTimeQuery = time.clock()
runTimeQuery = endTimeQuery - startTimeQuery

Answer 1: If you're using spark-shell (Scala) you could try defining a timing function like this: def show_timing[T](proc: => T
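
For the Python side, a minimal sketch of the same idea, assuming a SparkSession named spark; an action such as count() or show() is needed to force the lazy query to actually run, and time.perf_counter() replaces the long-deprecated time.clock():

import time

start = time.perf_counter()
df = spark.sql(query)
df.count()  # triggers execution; without an action only the plan is built
elapsed = time.perf_counter() - start
print(f"query took {elapsed:.2f}s")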

Convert timestamp to date in spark dataframe

Submitted by 喜欢而已 on 2021-02-04 10:52:24
Question: I've seen here: How to convert Timestamp to Date format in DataFrame? the way to convert a timestamp into a date type, but at least for me, it doesn't work. Here is what I've tried:

# Create dataframe
df_test = spark.createDataFrame([('20170809',), ('20171007',)], ['date',])

# Convert to timestamp
df_test2 = df_test.withColumn('timestamp',
    func.when((df_test.date.isNull() | (df_test.date == '')), '0')\
    .otherwise(func.unix_timestamp(df_test.date, 'yyyyMMdd')))\

# Convert timestamp to date again
df
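
For reference, a short sketch of going from the yyyyMMdd string straight to a date column with to_date, which avoids the unix_timestamp round trip; it assumes pyspark.sql.functions is imported as func:

# Parse the string column directly into a DateType column.
df_dates = df_test.withColumn('date_parsed', func.to_date(df_test.date, 'yyyyMMdd'))

# If a unix-timestamp column already exists, it can be turned back into a date:
# df.withColumn('date', func.to_date(func.from_unixtime(df.timestamp)))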

How do I check for equality using Spark Dataframe without SQL Query?

Submitted by 守給你的承諾、 on 2021-02-04 09:14:41
Question: I want to select rows where a column equals a certain value. I am doing this in Scala and having a little trouble. Here's my code:

df.select(df("state")==="TX").show()

This returns the state column with boolean values instead of just TX. I've also tried

df.select(df("state")=="TX").show()

but this doesn't work either.

Answer 1: I had the same issue, and the following syntax worked for me:

df.filter(df("state")==="TX").show()

I'm using Spark 1.6.

Answer 2: There is another simple sql like option. With Spark 1
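
For comparison, a PySpark rendering of the same fix; filter() keeps the matching rows, whereas select() only projects the boolean expression itself, which is exactly what the question observed:

# Keep only the rows whose state column equals "TX".
df.filter(df["state"] == "TX").show()

# Equivalent SQL-style condition string:
df.filter("state = 'TX'").show()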

How to read multiline CSV file in Pyspark

Submitted by 我们两清 on 2021-02-04 08:26:12
Question: I'm using this tweets dataset with Pyspark in order to process it and get some trends according to the tweet's location. But I'm having a problem when I try to create the dataframe. I'm using

spark.read.options(header="True").csv("hashtag_donaldtrump.csv")

to create the dataframe, but if I look at the tweets column, the values come out broken across rows. Do you know how I can clean the CSV file so it can be processed by Spark? Thank you in advance!

Answer 1: It looks like a multiline csv. Try doing df = spark
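
A sketch of the direction the answer points in: Spark's CSV reader has a multiLine option (plus an escape setting for embedded quotes) that usually handles records spanning several physical lines; the exact options needed depend on the file:

df = (spark.read
      .option("header", True)
      .option("multiLine", True)   # allow one record to span several physical lines
      .option("escape", '"')       # tweets often contain embedded quote characters
      .csv("hashtag_donaldtrump.csv"))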