apache-spark-sql

How to execute multiple queries in parallel and in a distributed way?

人盡茶涼 submitted on 2020-05-24 05:28:31
Question: I am using Spark 2.4.1 and Java 8. I have a scenario like this: a list of classifiers is provided from a property file to process, and these classifiers determine what data to pull and process. Something like the below:

val classifiers = Seq("classifierOne", "classifierTwo", "classifierThree")
for (classifier <- classifiers) {
  // read from a Cassandra DB table
  val actualData = spark.read(.....).where(<classifier condition>)
  // the data varies depending on the classifier passed in
  // this …
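One way to run the per-classifier work concurrently, offered here as a minimal sketch rather than the thread's answer: wrap each read/process in a Scala Future so the driver submits the jobs in parallel and Spark schedules them across the cluster. The Cassandra keyspace/table names, the classifier column and the output path below are placeholder assumptions.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.functions.col

// One Future per classifier: each body runs an ordinary Spark job, and because the jobs are
// submitted from separate threads, Spark can schedule them concurrently.
val jobs = classifiers.map { classifier =>
  Future {
    val actualData = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table")) // placeholder names
      .load()
      .where(col("classifier_col") === classifier)                      // placeholder condition
    actualData.write.mode("append").parquet(s"/tmp/out/$classifier")    // placeholder sink
  }
}
Await.result(Future.sequence(jobs), Duration.Inf) // block until every classifier job finishes

The global ExecutionContext is enough for a handful of classifiers; with many of them, a dedicated fixed-size thread pool plus Spark's FAIR scheduler gives more control over concurrency.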

How to convert date to the first day of month in a PySpark Dataframe column?

此生再无相见时 submitted on 2020-05-23 12:50:26
Question: I have the following DataFrame:

+----------+
|      date|
+----------+
|2017-01-25|
|2017-01-21|
|2017-01-12|
+----------+

Here is the code that creates the above DataFrame:

import pyspark.sql.functions as f
rdd = sc.parallelize([("2017/11/25",), ("2017/12/21",), ("2017/09/12",)])
df = sqlContext.createDataFrame(rdd, ["date"]).withColumn("date", f.to_date(f.col("date"), "yyyy/MM/dd"))
df.show()

I want a new column with the first date of the month for each row; just replace the day with "01" in all the dates …
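A minimal sketch of one way to do this with Spark's built-in trunc function (written in Scala here; pyspark.sql.functions.trunc takes the same (column, format) arguments):

import org.apache.spark.sql.functions.{col, trunc}

// trunc(date, "month") replaces the day-of-month with 01.
val withFirstDay = df.withColumn("first_day_of_month", trunc(col("date"), "month"))
withFirstDay.show()
// For the sample rows created above this yields 2017-11-01, 2017-12-01 and 2017-09-01.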

Spark Window Functions - rangeBetween dates

岁酱吖の submitted on 2020-05-21 01:56:08
Question: I have a Spark SQL DataFrame with data, and I am trying to get all the rows preceding the current row within a given date range. For example, I want all the rows from the 7 days preceding a given row. I figured out that I need to use a window function like:

Window \
    .partitionBy('id') \
    .orderBy('start')

and here comes the problem: I want a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just …
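A commonly used workaround, sketched in Scala under the assumption that start is a date or timestamp column and that there is some value column to aggregate: order the window by the epoch seconds of start and express the range in seconds.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val days = 7
val w = Window
  .partitionBy("id")
  .orderBy(col("start").cast("timestamp").cast("long")) // order by epoch seconds
  .rangeBetween(-days * 86400L, 0L)                      // the 7 days up to and including the current row

// "value" is a placeholder column to aggregate within the window.
val result = df.withColumn("sum_last_7_days", sum(col("value")).over(w))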

How to store a JSON dataframe with comma-separated records

心已入冬 submitted on 2020-05-17 08:46:46
Question: I need to write the records of a dataframe to a JSON file. If I write the dataframe to the file it is stored like {"a":1} {"b":2}, but I want to write the dataframe like this: [{"a":1},{"b":2}]. Can you please help me? Thanks in advance.

Answer 1: Use the to_json function to create an array of JSON objects, then use .saveAsTextFile to save the JSON object. Example:

# sample dataframe
df = spark.createDataFrame([("a",1),("b",2)],["id","name"])
from pyspark.sql.functions import *
df.groupBy(lit("1")).\
    agg(collect_list …
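A sketch of how that idea can be completed (written in Scala rather than PySpark, and not the answerer's exact code): collect every row into a single array of structs, turn the array into one JSON string with to_json, and write that string out as text.

import org.apache.spark.sql.functions._

val df = spark.createDataFrame(Seq(("a", 1), ("b", 2))).toDF("id", "name")

val jsonArray = df
  .agg(collect_list(struct(col("id"), col("name"))).as("records")) // one row holding all records
  .select(to_json(col("records")).as("json"))                      // [{"id":"a","name":1},{"id":"b","name":2}]

jsonArray.write.text("/tmp/json_array_output") // placeholder output path; produces a single-line JSON array

Note that this funnels every record into a single row (and a single output file), so it is only sensible for data that comfortably fits in one task.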

How to initialize the spark shell with a specific user to save data to HDFS with Apache Spark

…衆ロ難τιáo~ submitted on 2020-05-17 07:10:14
Question: I'm using Ubuntu, and I'm using the Spark dependency in IntelliJ (when I enter spark in the shell I get: Command 'spark' not found, but can be installed with: ..). I have two users, amine and hadoop_amine (where the Hadoop HDFS is set up). When I try to save a dataframe to HDFS (Spark Scala):

procesed.write.format("json").save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")

I get this error:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: …
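A possible fix, sketched under the assumption that the cluster uses simple (non-Kerberos) authentication: tell the HDFS client to act as the hadoop_amine user, either by exporting HADOOP_USER_NAME=hadoop_amine before launching spark-shell / IntelliJ, or by setting it programmatically before the first HDFS access.

// Hedged sketch: with simple authentication the HDFS client reads the user name from the
// HADOOP_USER_NAME environment variable or system property, so setting it before Spark
// touches HDFS makes the write happen as hadoop_amine instead of the OS user amine.
System.setProperty("HADOOP_USER_NAME", "hadoop_amine")

procesed.write
  .format("json")
  .save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")

Another option is to grant the amine user access on the HDFS side, for example with hdfs dfs -chown -R or hdfs dfs -chmod on /mydata/enedis/POC while logged in as hadoop_amine.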

Is it possible to partition data based on a column which may sometimes contain an empty value? How should it be handled?

回眸只為那壹抹淺笑 submitted on 2020-05-17 07:02:09
Question: I am using spark-sql-2.4.1v. I need to join two datasets ds1 and ds2 (which supplies "new-column") based on some field:

val resultDs = ds1.join(ds2, , "inner");

resultDs now contains "new-column", but the records which did not match do not have this "new-column", so for those records I need to set "new-column" to null/empty. As per my business requirement, I need to partition resultDs on "new-column". How are these kinds of scenarios generally handled? Please advise. Source: https://stackoverflow.com
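A common way to handle this, sketched below with assumed column names (an inner join would drop the non-matching rows, so a left join is used so that they survive with a null "new-column"): replace the nulls with an explicit sentinel before writing, otherwise Spark places those rows under the __HIVE_DEFAULT_PARTITION__ directory.

import org.apache.spark.sql.functions._

// Assumed join key "id"; a left join keeps the ds1 rows that have no match in ds2.
val resultDs = ds1.join(ds2, Seq("id"), "left")
  .withColumn("new-column", coalesce(col("new-column"), lit("UNKNOWN"))) // sentinel for missing values

resultDs.write
  .mode("overwrite")
  .partitionBy("new-column")
  .parquet("/tmp/partitioned_output") // placeholder output path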

How to resolve com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast… Java Spark

雨燕双飞 submitted on 2020-05-17 06:31:05
Question: Hi, I am new to Java Spark and have been looking for solutions for a couple of days. I am loading MongoDB data into a Hive table, but saveAsTable fails with this error:

com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType(StructField(oid,StringType,true)) (value: BsonString{value='54d3e8aeda556106feba7fa2'})

I've tried increasing the sampleSize, different mongo-spark-connector versions, ... but none of them worked …
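This exception typically means the connector's schema sampling inferred one type for a field (here a struct with an oid sub-field) while some documents store a plain string. One possible workaround, sketched here with assumed field, collection and table names, is to skip inference and supply an explicit schema that declares the conflicting field as a string:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()

// Declare the conflicting field as a plain string instead of letting sampling infer a struct.
val schema = StructType(Seq(
  StructField("_id", StringType, nullable = true),   // assumed conflicting field
  StructField("name", StringType, nullable = true),  // placeholder fields
  StructField("amount", DoubleType, nullable = true)
))

val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://localhost:27017/mydb.mycollection") // placeholder URI
  .schema(schema)
  .load()

df.write.mode("overwrite").saveAsTable("mydb.my_hive_table") // placeholder Hive table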

SaveAsTable in Spark Scala: HDP3.x

不羁岁月 submitted on 2020-05-17 06:08:08
Question: I have a dataframe in Spark that I'm saving to Hive as a table, but I am getting the error message below:

java.lang.RuntimeException: com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector does not allow create table as select. at scala.sys.package$.error(package.scala:27)

Can anyone please help me with how I should save this as a table in Hive?

val df3 = df1.join(df2, df1("inv_num") === df2("inv_num") // join both dataframes on the inv_num column
).withColumn("finalSalary", when(df1("salary") < df2("salary"), …
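As the error says, the HDP 3 Hive Warehouse Connector rejects create-table-as-select, which is what saveAsTable does. A hedged sketch of the usual alternative, with a placeholder database and table name, is to write through the connector's data source and name the target table explicitly:

// Sketch assuming the Hive Warehouse Connector jar and its configuration are already in place;
// "mydb.final_salary" is a placeholder table name.
df3.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .mode("append")
  .option("table", "mydb.final_salary")
  .save()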