apache-spark

Pyspark multi groupby with different columns

Submitted by 痞子三分冷 on 2021-02-04 16:28:26
Question: I have data like below:

year  name     percent   sex
1880  John     0.081541  boy
1881  William  0.080511  boy
1881  John     0.050057  boy

I need to group by and count using different columns:

df_year = df.groupby('year').count()
df_name = df.groupby('name').count()
df_sex = df.groupby('sex').count()

Then I have to create a Window to get the top-3 data by each column:

window = Window.partitionBy('year').orderBy(col("count").desc())
top4_res = df_year.withColumn('topn', func.row_number().over(window)).\
    filter(col(
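
A minimal sketch of the top-N-per-group pattern this is reaching for, assuming a DataFrame df with the columns shown above; the window here ranks the counts globally (no partition column), and the variable names are illustrative:

from pyspark.sql import functions as func
from pyspark.sql.window import Window

# Count rows per year, rank the counts, and keep the three largest.
df_year = df.groupby('year').count()
window = Window.orderBy(func.col('count').desc())
top3_year = (df_year
             .withColumn('topn', func.row_number().over(window))
             .filter(func.col('topn') <= 3))

# The same pattern repeats for 'name' and 'sex' by swapping the groupby column.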

How to match/extract multi-line pattern from file in pyspark

Submitted by 帅比萌擦擦* on 2021-02-04 15:51:50
Question: I have a huge file of RDF triplets (subject predicate object) as shown in the image below. The goal is to extract the bold items and produce the following output:

Item_Id | quantityAmount | quantityUnit | rank
------- | -------------- | ------------ | ----------
Q31     | 24954          | Meter        | BestRank
Q25     | 582            | Kilometer    | NormalRank

I want to extract lines that follow this pattern:

subject is given a pointer ( <Q31> <prop/P1082> <Pointer_Q31-87RF> . )
Pointer has a ranking ( <Pointer_Q31-87RF> <rank> <BestRank>
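
One rough way to approach this in PySpark, assuming the triples sit one per line in a plain-text dump; the file name, regex, and predicate matching below are illustrative and would need to be adjusted to the real data:

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parse each "<subject> <predicate> <object> ." line into three fields.
triple_re = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>')

def parse(line):
    m = triple_re.search(line)
    return (m.group(1), m.group(2), m.group(3)) if m else None

triples = (spark.sparkContext.textFile("triples.nt")
           .map(parse)
           .filter(lambda t: t is not None)
           .toDF(["subject", "predicate", "object"]))

# Rows like <Q31> <prop/P1082> <Pointer_...> link an item to a pointer,
# and rows like <Pointer_...> <rank> <BestRank> attach the rank to that
# pointer, so a join on the pointer stitches the two back together.
items = triples.filter(triples.predicate.endswith("prop/P1082")) \
               .selectExpr("subject as Item_Id", "object as pointer")
ranks = triples.filter(triples.predicate.endswith("rank")) \
               .selectExpr("subject as pointer", "object as rank")

result = items.join(ranks, "pointer")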

Batched API call inside apache spark?

Submitted by 梦想的初衷 on 2021-02-04 15:00:50
Question: I am a beginner with Apache Spark and I have the following task: I am reading records from a datasource that, within the Spark transformations, need to be enhanced by data from a call to an external webservice before they can be processed any further. The webservice will accept parallel calls to a certain extent, but only allows a few hundred records to be sent at once. Also, it's quite slow, so batching up as much as possible and sending requests in parallel definitely help here. Is there are
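
A common way to batch external calls inside Spark is mapPartitions with manual chunking, sketched below; call_webservice and the batch size of 200 are hypothetical placeholders, not part of the original question:

from itertools import islice

BATCH_SIZE = 200  # illustrative; match the webservice's per-request limit

def enrich_partition(rows):
    rows = iter(rows)
    while True:
        # Pull up to BATCH_SIZE records, send them in one request,
        # and stream the enriched records back to Spark.
        batch = list(islice(rows, BATCH_SIZE))
        if not batch:
            break
        for enriched in call_webservice(batch):  # hypothetical helper
            yield enriched

enriched_rdd = records_rdd.mapPartitions(enrich_partition)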

How to speed up spark df.write jdbc to postgres database?

Submitted by 最后都变了- on 2021-02-04 12:16:14
Question: I am new to Spark and am attempting to speed up appending the contents of a dataframe (which can have between 200k and 2M rows) to a Postgres database using df.write:

df.write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()

I tried increasing the batchsize but that didn't help, as
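
A sketch of the usual first knob to turn: repartition the DataFrame so several executors write over parallel JDBC connections, with a more modest batchsize per connection; the specific numbers here are illustrative, not a recommendation:

df.repartition(8) \
  .write.format('jdbc').options(
      url=psql_url_spark,
      driver=spark_env['PSQL_DRIVER'],
      dbtable="{schema}.{table}".format(schema=schema, table=table),
      user=spark_env['PSQL_USER'],
      password=spark_env['PSQL_PASS'],
      batchsize=10000,       # rows per INSERT batch on each connection
      numPartitions=8        # upper bound on concurrent JDBC connections
  ).mode(mode).save()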

Calculate the running time for spark sql

Submitted by 99封情书 on 2021-02-04 11:39:07
Question: I'm trying to run a couple of Spark SQL statements and want to calculate their running time. One solution is to resort to the logs. I'm wondering whether there are any simpler methods to do it. Something like the following:

import time

startTimeQuery = time.clock()
df = sqlContext.sql(query)
df.show()
endTimeQuery = time.clock()
runTimeQuery = endTimeQuery - startTimeQuery

Answer 1: If you're using spark-shell (Scala) you could try defining a timing function like this: def show_timing[T](proc: => T
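
For the Python side, a minimal sketch of the same idea, assuming a SparkSession named spark; an action such as count() or show() is needed to force the lazy query to actually run, and time.perf_counter() replaces the long-deprecated time.clock():

import time

start = time.perf_counter()
df = spark.sql(query)
df.count()  # triggers execution; without an action only the plan is built
elapsed = time.perf_counter() - start
print(f"query took {elapsed:.2f}s")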

Convert timestamp to date in spark dataframe

Submitted by 喜欢而已 on 2021-02-04 10:52:24
Question: I've seen here: How to convert Timestamp to Date format in DataFrame? the way to convert a timestamp into a date type, but at least for me, it doesn't work. Here is what I've tried:

# Create dataframe
df_test = spark.createDataFrame([('20170809',), ('20171007',)], ['date',])

# Convert to timestamp
df_test2 = df_test.withColumn('timestamp',
    func.when((df_test.date.isNull() | (df_test.date == '')), '0')\
    .otherwise(func.unix_timestamp(df_test.date, 'yyyyMMdd')))\

# Convert timestamp to date again
df
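
For reference, a short sketch of going from the yyyyMMdd string straight to a date column with to_date, which avoids the unix_timestamp round trip; it assumes pyspark.sql.functions is imported as func:

# Parse the string column directly into a DateType column.
df_dates = df_test.withColumn('date_parsed', func.to_date(df_test.date, 'yyyyMMdd'))

# If a unix-timestamp column already exists, it can be turned back into a date:
# df.withColumn('date', func.to_date(func.from_unixtime(df.timestamp)))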

How do I check for equality using Spark Dataframe without SQL Query?

Submitted by 守給你的承諾、 on 2021-02-04 09:14:41
Question: I want to select rows where a column equals a certain value. I am doing this in Scala and having a little trouble. Here's my code:

df.select(df("state")==="TX").show()

This returns the state column with boolean values instead of just TX. I've also tried

df.select(df("state")=="TX").show()

but this doesn't work either.

Answer 1: I had the same issue, and the following syntax worked for me:

df.filter(df("state")==="TX").show()

I'm using Spark 1.6.

Answer 2: There is another simple sql like option. With Spark 1
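
For comparison, a PySpark rendering of the same fix; filter() keeps the matching rows, whereas select() only projects the boolean expression itself, which is exactly what the question observed:

# Keep only the rows whose state column equals "TX".
df.filter(df["state"] == "TX").show()

# Equivalent SQL-style condition string:
df.filter("state = 'TX'").show()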

How to read multiline CSV file in Pyspark

Submitted by 我们两清 on 2021-02-04 08:26:12
Question: I'm using this tweets dataset with Pyspark in order to process it and get some trends according to the tweet's location. But I'm having a problem when I try to create the dataframe. I'm using

spark.read.options(header="True").csv("hashtag_donaldtrump.csv")

to create the dataframe, but if I look at the tweets column, the values come out broken across rows. Do you know how I can clean the CSV file so it can be processed by Spark? Thank you in advance!

Answer 1: It looks like a multiline csv. Try doing df = spark
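
A sketch of the direction the answer points in: Spark's CSV reader has a multiLine option (plus an escape setting for embedded quotes) that usually handles records spanning several physical lines; the exact options needed depend on the file:

df = (spark.read
      .option("header", True)
      .option("multiLine", True)   # allow one record to span several physical lines
      .option("escape", '"')       # tweets often contain embedded quote characters
      .csv("hashtag_donaldtrump.csv"))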