spark-dataframe

Reading DataFrame from partitioned parquet file

心不动则不痛 submitted on 2019-12-17 15:39:11
Question: How do I read partitioned parquet with a condition as a dataframe? This works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")

Partitions exist for day=1 to day=30. Is it possible to read something like (day = 5 to 6) or day=5,day=6?

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")

If I put * it gives me all 30 days.
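One approach that is often suggested for this (a sketch, assuming the directory layout and partition columns shown above) is to read from the table root and filter on the partition columns, so Spark's partition pruning only scans the matching day directories:

import org.apache.spark.sql.functions.col

// Reading the base path lets Spark discover data/year/month/day as
// partition columns; the filter then prunes to days 5 and 6 only.
val dataframe = sqlContext.read
  .parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions")
  .filter(col("data") === "jDD" && col("year") === 2015 &&
    col("month") === 10 && col("day").between(5, 6))

Alternatively, since read.parquet accepts multiple paths, passing the day=5 and day=6 directories explicitly also works.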

PySpark: How to fillna values in dataframe for specific columns?

落爺英雄遲暮 submitted on 2019-12-17 10:59:10
Question: I have the following sample DataFrame:

a    | b    | c
1    | 2    | 4
0    | null | null
null | 3    | 4

and I want to replace the null values only in the first two columns, "a" and "b":

a | b | c
1 | 2 | 4
0 | 0 | null
0 | 3 | 4

Here is the code to create the sample dataframe:

rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])

I know how to replace all null values using:

df2 = df2.fillna(0)

And when I try this, I lose the third column
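In PySpark, fillna also accepts a subset of column names (df2.fillna(0, subset=["a", "b"])). A minimal sketch of the same idea through the Scala API, rebuilding the sample data from the question:

import sqlContext.implicits._

val df2 = sc.parallelize(Seq(
  (Option(1), Option(2), Option(4)),
  (Option(0), Option.empty[Int], Option.empty[Int]),
  (Option.empty[Int], Option(3), Option(4))
)).toDF("a", "b", "c")

// na.fill takes the replacement value plus the columns it should touch,
// so the nulls in column "c" are left alone.
val filled = df2.na.fill(0, Seq("a", "b"))
filled.show()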

Spark DataFrame: does groupBy after orderBy maintain that order?

不想你离开。 submitted on 2019-12-17 07:31:10
Question: I have a Spark 2.0 dataframe example with the following structure:

id, hour, count
id1, 0, 12
id1, 1, 55
..
id1, 23, 44
id2, 0, 12
id2, 1, 89
..
id2, 23, 34

etc. It contains 24 entries for each id (one for each hour of the day) and is ordered by id, hour using the orderBy function. I have created an Aggregator groupConcat:

def groupConcat(separator: String, columnToConcat: Int) = new Aggregator[Row, String, String] with Serializable {
  override def zero: String = ""
  override def reduce(b:
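Since the question hinges on whether groupBy preserves the earlier orderBy, one commonly suggested alternative (a sketch, assuming the example data is bound to a DataFrame named df) is to make the ordering explicit inside each group instead of relying on it surviving the shuffle:

import org.apache.spark.sql.functions.{col, collect_list, sort_array, struct}

// sort_array orders the collected structs by their first field (hour),
// so each id ends up with its 24 rows in hour order regardless of how
// the groupBy shuffle redistributes them.
val perId = df
  .groupBy("id")
  .agg(sort_array(collect_list(struct(col("hour"), col("count")))).as("hourly"))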

TypeError: Column is not iterable - How to iterate over ArrayType()?

眉间皱痕 submitted on 2019-12-17 04:07:09
Question: Consider the following DataFrame:

+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+

which can be created with the following code:

import pyspark.sql.functions as f

data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido'])
]

df = sqlCtx.createDataFrame(data, ["type", "names"])
df.show(truncate=False)

Is there a way to directly modify the
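The question is cut off, but "Column is not iterable" typically means an array column was looped over on the driver; per-element changes have to be expressed as a column transformation instead. A sketch via the Scala API (the same idea exists in PySpark with pyspark.sql.functions.udf), assuming a DataFrame df with the same type and names columns and using upper-casing as a stand-in transformation:

import org.apache.spark.sql.functions.{col, udf}

// A UDF that rewrites every element of the names array; replace the
// upper-casing with whatever per-element change is actually needed.
val transformNames = udf((names: Seq[String]) => names.map(_.toUpperCase))

val modified = df.withColumn("names", transformNames(col("names")))
modified.show(false)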

Scala String Variable Substitution

雨燕双飞 submitted on 2019-12-14 04:18:28
Question: I have Spark code written in Scala. Spark reads meta tables (already registered in Spark as temp tables) which store the SQL to be executed. The problem I am facing is that we have queries which use variables (defined in the Scala code). I tried different methods but I am not able to substitute a variable with its value.

var begindate= s"2017-01-01";
var enddate = s"2017-01-05";

Msg.print_info(s"begin processing from ${beginDate} to ${endDate}");

//Reading SQL from MetaData table stored in spark as meta_table (temp
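The snippet is truncated, but the underlying difficulty it describes is that s"..." interpolation only works on string literals known at compile time, not on SQL text read from a meta table at runtime. A sketch of one common workaround (the meta_table row shown here is an assumed example, and sqlContext is assumed available): substitute the placeholders explicitly before executing the query.

val begindate = "2017-01-01"
val enddate   = "2017-01-05"

// Assumed example of a query as it might be stored in meta_table.
val sqlFromMeta =
  "select * from sales where sale_date between '${begindate}' and '${enddate}'"

// Runtime substitution of the placeholders, since interpolation cannot
// be applied to a string that is only known at runtime.
val resolvedSql = sqlFromMeta
  .replace("${begindate}", begindate)
  .replace("${enddate}", enddate)

val result = sqlContext.sql(resolvedSql)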

How to refer broadcast variable in dataframes

蓝咒 submitted on 2019-12-14 04:17:00
Question: I use Spark 1.6. I tried to broadcast an RDD and am not sure how to access the broadcasted variable in the data frames. I have two dataframes, employee and department.

Employee Dataframe
-----------------------------
Emp Id | Emp Name | Emp_Age
-----------------------------
1      | john     | 25
2      | David    | 35

Department Dataframe
-----------------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1       | Admin     | 1
2       | HR        | 2

import scala.collection.Map
val df_emp = hiveContext.sql("select * from emp")
val df_dept =
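The code is cut off, but for attaching a small department table to a larger employee table, the DataFrame API exposes broadcasting directly through a join hint. A sketch (the join key is assumed to be a column named emp_id present in both tables):

import org.apache.spark.sql.functions.broadcast

// broadcast() hints Spark to ship the small department table to every
// executor instead of shuffling the employee table for the join.
val enriched = df_emp.join(broadcast(df_dept), "emp_id")
enriched.show()

If the lookup is needed inside a UDF rather than a join, the usual pattern is sc.broadcast over a collected Map and reading .value inside the UDF.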

Refresh Dataframe in Spark real-time Streaming without stopping process

大兔子大兔子 submitted on 2019-12-14 03:53:23
Question: In my application I get a stream of accounts from a Kafka queue (using Spark Streaming with Kafka), and I need to fetch attributes related to these accounts from S3, so I'm planning to cache the S3 resultant dataframe, since the S3 data will not be updated for at least a day for now (it might change to 1 hour or 10 minutes very soon in the future). So the question is: how can I refresh the cached dataframe periodically without stopping the process?

Update: I'm planning to publish an event into Kafka whenever there is an
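The update is truncated, but one pattern often used for this (a sketch in Scala; the S3 path and the refresh trigger are assumptions) is to hold the lookup DataFrame in a mutable reference and rebuild the cache whenever a refresh is triggered, e.g. from a timer or from the control event mentioned in the update:

// Reference that the streaming batches read from.
@volatile var s3Lookup =
  sqlContext.read.parquet("s3a://bucket/account-attributes/").cache()

def refreshS3Lookup(): Unit = {
  val fresh = sqlContext.read.parquet("s3a://bucket/account-attributes/").cache()
  fresh.count()            // materialize the new cache before swapping it in
  val stale = s3Lookup
  s3Lookup = fresh
  stale.unpersist()
}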

get the distinct elements of an ArrayType column in a spark dataframe

空扰寡人 submitted on 2019-12-14 01:10:38
Question: I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are in the form of arrays of strings:

Id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements inside each feature column, so the output will be:

distinct_feat1, distinct_feat2
------------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"], ["feat2_1","feat2_2","feat2_3"]

What is the best
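The question is cut off at "What is the best", but one straightforward way (a sketch, assuming the table above is a DataFrame named df) is to explode each array column and gather the unique values with collect_set; the same pattern is repeated for feat2:

import org.apache.spark.sql.functions.{col, collect_set, explode}

// explode turns each array element into its own row, and collect_set
// gathers the unique values back into a single array.
val distinctFeat1 = df
  .select(explode(col("feat1")).as("f"))
  .agg(collect_set(col("f")).as("distinct_feat1"))

distinctFeat1.show(false)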

Read a json file with 12 nested level into hive in AZURE hdinsights

一世执手 submitted on 2019-12-14 01:05:09
Question: I tried to create a schema for the JSON file manually and tried to create a Hive table, and I am getting "column type name length 10888 exceeds max allowed length 2000". I am guessing I have to change the metastore details, but I am not sure where that config is located in Azure HDInsight. The other way I tried: I got the schema from a Spark dataframe and tried to create the table from that view, but I still get the same error. These are the steps I tried in Spark:

val tne1 = sc.wholeTextFiles("wasb
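The snippet is truncated, and the error appears to come from the Hive metastore's limit on column type names rather than from Spark itself. One workaround that is sometimes used (a sketch; both wasb paths are assumptions) is to avoid declaring the full 12-level struct as a Hive column type at all: keep the data in Parquet, whose files carry their own schema, and query it from Spark as a temporary table.

// Read the nested JSON with Spark's inferred schema and persist it as
// Parquet, then expose it to SQL without registering the huge struct
// type in the Hive metastore.
val raw = sqlContext.read.json("wasb:///data/nested-json/")          // assumed input path
raw.write.mode("overwrite").parquet("wasb:///data/nested-parquet/")  // assumed output path

val nested = sqlContext.read.parquet("wasb:///data/nested-parquet/")
nested.registerTempTable("nested_json")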

saveAsTextFile hangs in spark java.io.IOException: Connection reset by peer in Data frame

丶灬走出姿态 submitted on 2019-12-13 19:04:01
Question: I am running an application in Spark which does a simple diff between two data frames. I execute it as a jar file in my cluster environment. My cluster environment is a 94-node cluster. There are two data sets, 2 GB and 4 GB, which are mapped to data frames. My job works fine for very small files... I personally think saveAsTextFile takes more time in my application. Below are my cluster config details:

Total Vmem allocated for Containers 394.80 GB
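The config listing is cut off, but given that the job is a diff between two DataFrames followed by saveAsTextFile, one change that is often suggested (a sketch; the variable names, partition count and output path are assumptions) is to stay in the DataFrame writer and control the output parallelism explicitly instead of converting to an RDD of strings:

// except() keeps the rows of df1 that do not appear in df2; repartition
// bounds the number of output files and spreads the write across executors.
val diff = df1.except(df2)

diff
  .repartition(200)                     // assumed value; tune to the cluster
  .write
  .mode("overwrite")
  .json("hdfs:///output/df-diff")       // assumed output path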