spark-dataframe

How to filter Spark dataframe by array column containing any of the values of some other dataframe/set

魔方 西西 submitted on 2019-12-06 08:51:13
I have a Dataframe A that contains a column of array string.

    ...
    |-- browse: array (nullable = true)
    |    |-- element: string (containsNull = true)
    ...

For example, three sample rows would be

    +---------+--------+---------+
    | column 1|  browse| column n|
    +---------+--------+---------+
    |     foo1| [X,Y,Z]|     bar1|
    |     foo2|   [K,L]|     bar2|
    |     foo3|     [M]|     bar3|
    +---------+--------+---------+

And another Dataframe B that contains a column of string

    |-- browsenodeid: string (nullable = true)

Some sample rows for it would be

    +------------+
    |browsenodeid|
    +------------+
    |           A|
    |           Z|
    |           M|
    +------------+

How can I filter A so that I keep all the rows whose browse …
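One common way to express this (a sketch, not taken from the question's answers, assuming the column names above and that B is small enough to collect to the driver) is to turn B's values into a set and keep the rows of A whose browse array intersects it:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import BooleanType

    spark = SparkSession.builder.getOrCreate()

    # Toy data mirroring the question (values are illustrative)
    A = spark.createDataFrame(
        [("foo1", ["X", "Y", "Z"], "bar1"),
         ("foo2", ["K", "L"], "bar2"),
         ("foo3", ["M"], "bar3")],
        ["column1", "browse", "columnN"])
    B = spark.createDataFrame([("A",), ("Z",), ("M",)], ["browsenodeid"])

    # Collect B's ids into a set (assumed small enough for the driver)
    browse_nodes = {row.browsenodeid for row in B.collect()}

    # Keep rows of A whose browse array shares at least one element with the set
    has_match = F.udf(lambda arr: bool(set(arr or []) & browse_nodes), BooleanType())
    A.filter(has_match("browse")).show()

On Spark 2.4+, F.arrays_overlap against a literal array (or an explode plus join when B is large) avoids the Python UDF.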

GroupBy operation on a DataFrame takes a lot of time in Spark 2.0

会有一股神秘感。 submitted on 2019-12-06 08:17:06
Question: In one of my Spark jobs (2.0 on EMR 5.0.0) I had about 5 GB of data that was cross joined with 30 rows (data size a few MBs), and I further needed to group by it. What I noticed was that it was taking a lot of time (approximately 4 hours with one m3.xlarge master and six m3.2xlarge core nodes). Of the total time, 2 hours were taken by processing and another 2 hours by writing the data to S3. The time taken was not very impressive to me. I tried searching over the net and found this link that says groupBy …
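This is not from the question itself, but a common mitigation for this shape of job is to broadcast the tiny 30-row side so the 5 GB side is never shuffled for the join, leaving a single shuffle for the aggregation. A rough PySpark sketch with hypothetical paths and column names:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    big_df = spark.read.parquet("s3://bucket/big_input/")      # ~5 GB side (hypothetical path)
    small_df = spark.read.parquet("s3://bucket/small_input/")  # ~30 rows

    # Broadcasting the 30-row side keeps the cross join map-side only.
    # (crossJoin needs Spark 2.1+; on 2.0, join() with no condition plus
    # spark.sql.crossJoin.enabled=true plays the same role.)
    joined = big_df.crossJoin(F.broadcast(small_df))

    # One shuffle for the aggregation; spark.sql.shuffle.partitions (default 200)
    # is worth tuning to the cluster size.
    result = joined.groupBy("some_key").agg(F.sum("some_value").alias("total"))
    result.write.mode("overwrite").parquet("s3://bucket/output/")

The remaining two hours spent writing to S3 is a separate concern, mostly governed by the output committer and the number of output partitions.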

AttributeError: module 'pandas' has no attribute 'to_csv'

随声附和 submitted on 2019-12-06 07:33:46
I took some rows from a csv file like this

    pd.DataFrame(CV_data.take(5), columns=CV_data.columns)

and performed some functions on it. Now I want to save it in csv again, but it is giving the error

    module 'pandas' has no attribute 'to_csv'

I am trying to save it like this

    pd.to_csv(CV_data, sep='\t', encoding='utf-8')

Here is my full code. How can I save my resulting data in csv or excel?

    # Disable warnings, set Matplotlib inline plotting and load Pandas package
    import warnings
    warnings.filterwarnings('ignore')
    %matplotlib inline
    import pandas as pd
    pd.options.display.mpl_style = 'default'
    CV_data = …
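The error itself is because to_csv is a method on a DataFrame object, not a function on the pandas module. A minimal sketch of the likely fix (the file name here is made up):

    import pandas as pd

    # Stand-in for the DataFrame built from CV_data.take(5) in the question
    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

    # Call to_csv on the DataFrame, not on pd
    df.to_csv('out.csv', sep='\t', encoding='utf-8', index=False)

In the question's terms, that would be something like pd.DataFrame(CV_data.take(5), columns=CV_data.columns).to_csv(...).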

Pyspark - how to backfill a DataFrame?

杀马特。学长 韩版系。学妹 submitted on 2019-12-06 07:32:15
How can you do the same thing as df.fillna(method='bfill') for a pandas dataframe with a pyspark.sql.DataFrame? The pyspark dataframe has the pyspark.sql.DataFrame.fillna method, however there is no support for a method parameter. In pandas you can use the following to backfill a time series:

Create data

    import pandas as pd
    index = pd.date_range('2017-01-01', '2017-01-05')
    data = [1, 2, 3, None, 5]
    df = pd.DataFrame({'data': data}, index=index)

Giving

    Out[1]:
                data
    2017-01-01   1.0
    2017-01-02   2.0
    2017-01-03   3.0
    2017-01-04   NaN
    2017-01-05   5.0

Backfill the dataframe

    df = df.fillna(method='bfill') …
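One way this is commonly done in PySpark (a sketch, not necessarily the original answer) is a window that looks from the current row to the end of the frame and takes the first non-null value:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("2017-01-01", 1.0), ("2017-01-02", 2.0), ("2017-01-03", 3.0),
         ("2017-01-04", None), ("2017-01-05", 5.0)],
        ["date", "data"])

    # For each row, take the first non-null value looking forward (backfill).
    # No partitionBy here, so all rows land in one partition; partition by a
    # key column for real data sets.
    w = Window.orderBy("date").rowsBetween(0, Window.unboundedFollowing)
    backfilled = df.withColumn("data", F.first("data", ignorenulls=True).over(w))
    backfilled.show()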

Avoid losing data type for the partitioned data when writing from Spark

会有一股神秘感。 submitted on 2019-12-06 07:02:14
I have a dataframe like below.

    itemName, itemCategory
    Name1, C0
    Name2, C1
    Name3, C0

I would like to save this dataframe as a partitioned parquet file:

    df.write.mode("overwrite").partitionBy("itemCategory").parquet(path)

For this dataframe, when I read the data back, it will have String as the data type for itemCategory. However, at times I have a dataframe from other tenants as below.

    itemName, itemCategory
    Name1, 0
    Name2, 1
    Name3, 0

In this case, after being written as partitions, when read back, the resulting dataframe will have Int as the data type for itemCategory. Parquet file has the metadata …
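The root cause is that the partition column's type is re-inferred from the directory names at read time. One way to keep it as a string (a sketch, not necessarily the original answer) is to turn that inference off, or to cast explicitly after reading:

    from pyspark.sql import SparkSession

    # With inference disabled, every partition column comes back as a string,
    # whatever the directory names look like.
    spark = (SparkSession.builder
             .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
             .getOrCreate())

    df = spark.read.parquet("/tmp/items")   # hypothetical path
    df.printSchema()                        # itemCategory comes back as string

    # Alternative: leave inference on and cast after reading
    # df = df.withColumn("itemCategory", df["itemCategory"].cast("string"))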

How to profile pyspark jobs

假装没事ソ submitted on 2019-12-06 06:21:50
Question: I want to understand profiling in PySpark code. Following this: https://github.com/apache/spark/pull/2351

    >>> sc._conf.set("spark.python.profile", "true")
    >>> rdd = sc.parallelize(range(100)).map(str)
    >>> rdd.count()
    100
    >>> sc.show_profiles()
    ============================================================
    Profile of RDD<id=1>
    ============================================================
            284 function calls (276 primitive calls) in 0.001 seconds

       Ordered by: internal time, cumulative time

       ncalls …
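One detail worth noting (an assumption about the usual pitfall, not quoted from an answer): spark.python.profile has to be set before the SparkContext is created; flipping sc._conf on a running context generally has no effect. A minimal sketch:

    from pyspark import SparkConf, SparkContext

    # Set the flag before the context exists
    conf = SparkConf().set("spark.python.profile", "true")
    sc = SparkContext(conf=conf)

    rdd = sc.parallelize(range(100)).map(str)
    rdd.count()

    sc.show_profiles()                  # print cProfile stats per RDD
    sc.dump_profiles("/tmp/profiles")   # or dump them as files for later inspection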

Should I Avoid groupby() in Dataset/Dataframe? [duplicate]

痞子三分冷 submitted on 2019-12-06 06:04:26
This question already has an answer here: DataFrame / Dataset groupBy behaviour/optimization (1 answer). Closed last year.

I know that with RDDs we were discouraged from using groupByKey and encouraged to use alternatives such as reduceByKey() and aggregateByKey(), since these other methods reduce first on each partition and only then shuffle and combine, which reduces the amount of data being shuffled. Now, my question is whether this still applies to Dataset/DataFrame. I was thinking that since the Catalyst engine does a lot of optimization, Catalyst will automatically know that it …
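The gist of the linked duplicate, as I understand it, is that groupBy().agg() on a DataFrame already performs a partial (map-side) aggregation before the shuffle. That is easy to check with explain(); a small sketch:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

    # The physical plan shows two HashAggregate steps: a partial aggregation
    # before the exchange (shuffle) and a final one after it, i.e. the same
    # map-side combine that reduceByKey provides for RDDs.
    df.groupBy("key").agg(F.sum("value")).explain()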

reuse the result of a select expression in the “GROUP BY” clause?

早过忘川 submitted on 2019-12-06 05:52:41
Question: In MySQL, I can have a query like this:

    select
        cast(from_unixtime(t.time, '%Y-%m-%d %H:00') as datetime) as timeHour
        , ...
    from some_table t
    group by timeHour, ...
    order by timeHour, ...

where timeHour in the GROUP BY is the result of a select expression. But I just tried a similar query in Spark SQL, and I got this error:

    Error: org.apache.spark.sql.AnalysisException: cannot resolve '`timeHour`' given input columns: ...

My query for Spark SQL was this:

    select cast(t.unixTime as …
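If your Spark version does not resolve select aliases in GROUP BY, the usual workarounds (a sketch, not quoted from the question's answers) are to repeat the full expression in the GROUP BY or to compute the alias in a subquery first. Note also that Spark's from_unixtime takes Java-style patterns ('yyyy-MM-dd HH:00') rather than MySQL's %-codes. Assuming a registered table named some_table:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Option 1: repeat the expression in GROUP BY
    q1 = """
    SELECT from_unixtime(t.time, 'yyyy-MM-dd HH:00') AS timeHour, count(*) AS cnt
    FROM some_table t
    GROUP BY from_unixtime(t.time, 'yyyy-MM-dd HH:00')
    ORDER BY timeHour
    """

    # Option 2: compute the alias in a subquery and group over it
    q2 = """
    SELECT timeHour, count(*) AS cnt
    FROM (SELECT from_unixtime(t.time, 'yyyy-MM-dd HH:00') AS timeHour FROM some_table t) x
    GROUP BY timeHour
    ORDER BY timeHour
    """

    # spark.sql(q1).show()   # needs a table/view named some_table to exist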

Define spark udf by reflection on a String

…衆ロ難τιáo~ submitted on 2019-12-06 05:48:35
Question: I am trying to define a udf in Spark (2.0) from a string containing a Scala function definition. Here is the snippet:

    val universe: scala.reflect.runtime.universe.type = scala.reflect.runtime.universe
    import universe._
    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox
    val toolbox = currentMirror.mkToolBox()
    val f = udf(toolbox.eval(toolbox.parse("(s:String) => 5")).asInstanceOf[String => Int])
    sc.parallelize(Seq("1","5")).toDF.select(f(col("value"))).show

This gives me …
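The question is specifically about Scala's runtime toolbox, but for comparison only, here is a rough PySpark analogue of the same idea (building a udf from a function source string at run time); it does not address the Scala reflection or serialization specifics:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()

    # Evaluate the function source at run time, then wrap it as a udf
    func_source = "lambda s: 5"
    f = udf(eval(func_source), IntegerType())

    spark.createDataFrame([("1",), ("5",)], ["value"]).select(f(col("value"))).show()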

Find mean and corr of 10,000 columns in a PySpark DataFrame

人盡茶涼 submitted on 2019-12-06 05:01:31
Question: I have a DF with 10K columns and 70 million rows. I want to calculate the mean and corr of the 10K columns. I wrote the code below, but it won't work due to the 64KB code size issue (https://issues.apache.org/jira/browse/SPARK-16845).

Data:

    region dept week sal  val1 val2 val3 ... val10000
    US     CS   1    1    2    1    1    ... 2
    US     CS   2    1.5  2    3    1    ... 2
    US     CS   3    1    2    2    2.1  ... 2
    US     ELE  1    1.1  2    2    2.1  ... 2
    US     ELE  2    2.1  2    2    2.1  ... 2
    US     ELE  3    1    2    1    2    ... 2
    UE     CS   1    2    2    1    2    ... 2

Code:

    aggList = [func.mean(col) for col in df.columns]  # exclude keys …
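One route around the 64KB codegen limit (a sketch under the assumption that an MLlib-based computation is acceptable) is to leave Catalyst out of it and feed the value columns to pyspark.mllib's statistics routines, which work on an RDD of vectors:

    from pyspark.sql import SparkSession
    from pyspark.mllib.stat import Statistics
    from pyspark.mllib.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()

    # Tiny stand-in for the real 10K-column DataFrame (names are illustrative)
    df = spark.createDataFrame(
        [("US", "CS", 1.0, 2.0, 1.0),
         ("US", "CS", 1.5, 2.0, 3.0),
         ("US", "ELE", 1.1, 2.0, 2.0)],
        ["region", "dept", "val1", "val2", "val3"])

    value_cols = [c for c in df.columns if c.startswith("val")]

    # One vector per row; colStats/corr then run without building a single
    # giant generated expression per column.
    vectors = df.select(value_cols).rdd.map(
        lambda row: Vectors.dense([float(x) for x in row]))

    means = Statistics.colStats(vectors).mean()          # per-column means
    corr_matrix = Statistics.corr(vectors, method="pearson")
    print(means)
    print(corr_matrix)

For 10K columns the correlation matrix itself is 10K x 10K, so it still has to fit in driver memory.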