spark-dataframe

Reading DataFrame from partitioned parquet file

心不动则不痛 submitted on 2019-12-17 15:39:11
Question: How do I read partitioned parquet with a condition as a dataframe? This works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")

Partitions exist for day=1 to day=30. Is it possible to read something like (day = 5 to 6) or day=5,day=6?

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")

If I put * it gives me all 30 days.
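One approach that is often suggested for this (a sketch, assuming the directory layout and partition columns shown above) is to read from the table root and filter on the partition columns, so Spark's partition pruning only scans the matching day directories:

import org.apache.spark.sql.functions.col

// Reading the base path lets Spark discover data/year/month/day as
// partition columns; the filter then prunes to days 5 and 6 only.
val dataframe = sqlContext.read
  .parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions")
  .filter(col("data") === "jDD" && col("year") === 2015 &&
    col("month") === 10 && col("day").between(5, 6))

Alternatively, since read.parquet accepts multiple paths, passing the day=5 and day=6 directories explicitly also works.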

PySpark: How to fillna values in dataframe for specific columns?

落爺英雄遲暮 submitted on 2019-12-17 10:59:10
Question: I have the following sample DataFrame:

a    | b    | c
1    | 2    | 4
0    | null | null
null | 3    | 4

and I want to replace the null values only in the first two columns, "a" and "b":

a | b | c
1 | 2 | 4
0 | 0 | null
0 | 3 | 4

Here is the code to create the sample dataframe:

rdd = sc.parallelize([(1,2,4), (0,None,None), (None,3,4)])
df2 = sqlContext.createDataFrame(rdd, ["a", "b", "c"])

I know how to replace all null values using:

df2 = df2.fillna(0)

And when I try this, I lose the third column
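In PySpark, fillna also accepts a subset of column names (df2.fillna(0, subset=["a", "b"])). A minimal sketch of the same idea through the Scala API, rebuilding the sample data from the question:

import sqlContext.implicits._

val df2 = sc.parallelize(Seq(
  (Option(1), Option(2), Option(4)),
  (Option(0), Option.empty[Int], Option.empty[Int]),
  (Option.empty[Int], Option(3), Option(4))
)).toDF("a", "b", "c")

// na.fill takes the replacement value plus the columns it should touch,
// so the nulls in column "c" are left alone.
val filled = df2.na.fill(0, Seq("a", "b"))
filled.show()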

Spark DataFrame: does groupBy after orderBy maintain that order?

不想你离开。 submitted on 2019-12-17 07:31:10
Question: I have a Spark 2.0 dataframe example with the following structure:

id, hour, count
id1, 0, 12
id1, 1, 55
..
id1, 23, 44
id2, 0, 12
id2, 1, 89
..
id2, 23, 34

etc. It contains 24 entries for each id (one for each hour of the day) and is ordered by id, hour using the orderBy function. I have created an Aggregator groupConcat:

def groupConcat(separator: String, columnToConcat: Int) = new Aggregator[Row, String, String] with Serializable {
  override def zero: String = ""
  override def reduce(b:
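Since the question hinges on whether groupBy preserves the earlier orderBy, one commonly suggested alternative (a sketch, assuming the example data is bound to a DataFrame named df) is to make the ordering explicit inside each group instead of relying on it surviving the shuffle:

import org.apache.spark.sql.functions.{col, collect_list, sort_array, struct}

// sort_array orders the collected structs by their first field (hour),
// so each id ends up with its 24 rows in hour order regardless of how
// the groupBy shuffle redistributes them.
val perId = df
  .groupBy("id")
  .agg(sort_array(collect_list(struct(col("hour"), col("count")))).as("hourly"))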

TypeError: Column is not iterable - How to iterate over ArrayType()?

眉间皱痕 submitted on 2019-12-17 04:07:09
Question: Consider the following DataFrame:

+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+

which can be created with the following code:

import pyspark.sql.functions as f

data = [
    ('person', ['john', 'sam', 'jane']),
    ('pet', ['whiskers', 'rover', 'fido'])
]

df = sqlCtx.createDataFrame(data, ["type", "names"])
df.show(truncate=False)

Is there a way to directly modify the
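The question is cut off, but "Column is not iterable" typically means an array column was looped over on the driver; per-element changes have to be expressed as a column transformation instead. A sketch via the Scala API (the same idea exists in PySpark with pyspark.sql.functions.udf), assuming a DataFrame df with the same type and names columns and using upper-casing as a stand-in transformation:

import org.apache.spark.sql.functions.{col, udf}

// A UDF that rewrites every element of the names array; replace the
// upper-casing with whatever per-element change is actually needed.
val transformNames = udf((names: Seq[String]) => names.map(_.toUpperCase))

val modified = df.withColumn("names", transformNames(col("names")))
modified.show(false)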

Scala String Variable Substitution

雨燕双飞 submitted on 2019-12-14 04:18:28
Question: I have Spark code written in Scala. Spark reads meta tables (already registered in Spark as temp tables) which store the SQL to be executed. The problem I am facing is that we have queries which use variables (defined in the Scala code). I tried different methods but I am not able to substitute a variable with its value.

var begindate= s"2017-01-01";
var enddate = s"2017-01-05";

Msg.print_info(s"begin processing from ${beginDate} to ${endDate}");

//Reading SQL from MetaData table stored in spark as meta_table (temp
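The snippet is truncated, but the underlying difficulty it describes is that s"..." interpolation only works on string literals known at compile time, not on SQL text read from a meta table at runtime. A sketch of one common workaround (the meta_table row shown here is an assumed example, and sqlContext is assumed available): substitute the placeholders explicitly before executing the query.

val begindate = "2017-01-01"
val enddate   = "2017-01-05"

// Assumed example of a query as it might be stored in meta_table.
val sqlFromMeta =
  "select * from sales where sale_date between '${begindate}' and '${enddate}'"

// Runtime substitution of the placeholders, since interpolation cannot
// be applied to a string that is only known at runtime.
val resolvedSql = sqlFromMeta
  .replace("${begindate}", begindate)
  .replace("${enddate}", enddate)

val result = sqlContext.sql(resolvedSql)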

How to refer broadcast variable in dataframes

蓝咒 submitted on 2019-12-14 04:17:00
Question: I use Spark 1.6. I tried to broadcast an RDD and am not sure how to access the broadcasted variable in the data frames. I have two dataframes, employee and department.

Employee Dataframe
-----------------------------
Emp Id | Emp Name | Emp_Age
-----------------------------
1      | john     | 25
2      | David    | 35

Department Dataframe
-----------------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1       | Admin     | 1
2       | HR        | 2

import scala.collection.Map
val df_emp = hiveContext.sql("select * from emp")
val df_dept =
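The code is cut off, but for attaching a small department table to a larger employee table, the DataFrame API exposes broadcasting directly through a join hint. A sketch (the join key is assumed to be a column named emp_id present in both tables):

import org.apache.spark.sql.functions.broadcast

// broadcast() hints Spark to ship the small department table to every
// executor instead of shuffling the employee table for the join.
val enriched = df_emp.join(broadcast(df_dept), "emp_id")
enriched.show()

If the lookup is needed inside a UDF rather than a join, the usual pattern is sc.broadcast over a collected Map and reading .value inside the UDF.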

Refresh Dataframe in Spark real-time Streaming without stopping process

大兔子大兔子 submitted on 2019-12-14 03:53:23
Question: In my application I get a stream of accounts from a Kafka queue (using Spark Streaming with Kafka), and I need to fetch attributes related to these accounts from S3, so I'm planning to cache the S3 resultant dataframe, since the S3 data will not be updated for at least a day for now (it might change to 1 hour or 10 minutes very soon in the future). So the question is: how can I refresh the cached dataframe periodically without stopping the process?

Update: I'm planning to publish an event into Kafka whenever there is an
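The update is truncated, but one pattern often used for this (a sketch in Scala; the S3 path and the refresh trigger are assumptions) is to hold the lookup DataFrame in a mutable reference and rebuild the cache whenever a refresh is triggered, e.g. from a timer or from the control event mentioned in the update:

// Reference that the streaming batches read from.
@volatile var s3Lookup =
  sqlContext.read.parquet("s3a://bucket/account-attributes/").cache()

def refreshS3Lookup(): Unit = {
  val fresh = sqlContext.read.parquet("s3a://bucket/account-attributes/").cache()
  fresh.count()            // materialize the new cache before swapping it in
  val stale = s3Lookup
  s3Lookup = fresh
  stale.unpersist()
}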

get the distinct elements of an ArrayType column in a spark dataframe

空扰寡人 submitted on 2019-12-14 01:10:38
Question: I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are in the form of arrays of strings:

Id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements inside each feature column, so the output will be:

distinct_feat1, distinct_feat2
------------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"], ["feat2_1","feat2_2","feat2_3"]

What is the best
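The question is cut off at "What is the best", but one straightforward way (a sketch, assuming the table above is a DataFrame named df) is to explode each array column and gather the unique values with collect_set; the same pattern is repeated for feat2:

import org.apache.spark.sql.functions.{col, collect_set, explode}

// explode turns each array element into its own row, and collect_set
// gathers the unique values back into a single array.
val distinctFeat1 = df
  .select(explode(col("feat1")).as("f"))
  .agg(collect_set(col("f")).as("distinct_feat1"))

distinctFeat1.show(false)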

Read a json file with 12 nested level into hive in AZURE hdinsights

一世执手 submitted on 2019-12-14 01:05:09
Question: I tried to create a schema for the JSON file manually and tried to create a Hive table, and I am getting "column type name length 10888 exceeds max allowed length 2000". I am guessing I have to change the metastore details, but I am not sure where that config is located in Azure HDInsight. The other way I tried: I got the schema from a Spark dataframe and tried to create the table from that view, but I still get the same error. These are the steps I tried in Spark:

val tne1 = sc.wholeTextFiles("wasb
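The snippet is truncated, and the error appears to come from the Hive metastore's limit on column type names rather than from Spark itself. One workaround that is sometimes used (a sketch; both wasb paths are assumptions) is to avoid declaring the full 12-level struct as a Hive column type at all: keep the data in Parquet, whose files carry their own schema, and query it from Spark as a temporary table.

// Read the nested JSON with Spark's inferred schema and persist it as
// Parquet, then expose it to SQL without registering the huge struct
// type in the Hive metastore.
val raw = sqlContext.read.json("wasb:///data/nested-json/")          // assumed input path
raw.write.mode("overwrite").parquet("wasb:///data/nested-parquet/")  // assumed output path

val nested = sqlContext.read.parquet("wasb:///data/nested-parquet/")
nested.registerTempTable("nested_json")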

saveAsTextFile hangs in spark java.io.IOException: Connection reset by peer in Data frame

丶灬走出姿态 submitted on 2019-12-13 19:04:01
Question: I am running an application in Spark which does a simple diff between two data frames. I execute it as a jar file in my cluster environment. My cluster environment is a 94-node cluster. There are two data sets, 2 GB and 4 GB, which are mapped to data frames. My job works fine for very small files... I personally think saveAsTextFile takes more time in my application. Below are my cluster config details:

Total Vmem allocated for Containers 394.80 GB
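The config listing is cut off, but given that the job is a diff between two DataFrames followed by saveAsTextFile, one change that is often suggested (a sketch; the variable names, partition count and output path are assumptions) is to stay in the DataFrame writer and control the output parallelism explicitly instead of converting to an RDD of strings:

// except() keeps the rows of df1 that do not appear in df2; repartition
// bounds the number of output files and spreads the write across executors.
val diff = df1.except(df2)

diff
  .repartition(200)                     // assumed value; tune to the cluster
  .write
  .mode("overwrite")
  .json("hdfs:///output/df-diff")       // assumed output path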