apache-spark-sql

How to execute multiple queries in parallel and in a distributed way?

人盡茶涼 submitted on 2020-05-24 05:28:31
Question: I am using Spark 2.4.1 and Java 8. I have a scenario like this: a list of classifiers is provided from a property file to process, and these classifiers determine what data to pull and process. Something like the below:

val classifiers = Seq("classifierOne", "classifierTwo", "classifierThree")
for (classifier <- classifiers) {
  // read from a Cassandra DB table
  val actualData = spark.read(.....).where(<classifier condition>)
  // the data varies depending on the classifier passed in
  // this …
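One way to run the per-classifier work concurrently, offered here as a minimal sketch rather than the thread's answer: wrap each read/process in a Scala Future so the driver submits the jobs in parallel and Spark schedules them across the cluster. The Cassandra keyspace/table names, the classifier column and the output path below are placeholder assumptions.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.functions.col

// One Future per classifier: each body runs an ordinary Spark job, and because the jobs are
// submitted from separate threads, Spark can schedule them concurrently.
val jobs = classifiers.map { classifier =>
  Future {
    val actualData = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table")) // placeholder names
      .load()
      .where(col("classifier_col") === classifier)                      // placeholder condition
    actualData.write.mode("append").parquet(s"/tmp/out/$classifier")    // placeholder sink
  }
}
Await.result(Future.sequence(jobs), Duration.Inf) // block until every classifier job finishes

The global ExecutionContext is enough for a handful of classifiers; with many of them, a dedicated fixed-size thread pool plus Spark's FAIR scheduler gives more control over concurrency.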

How to convert date to the first day of month in a PySpark Dataframe column?

此生再无相见时 submitted on 2020-05-23 12:50:26
Question: I have the following DataFrame:

+----------+
|      date|
+----------+
|2017-01-25|
|2017-01-21|
|2017-01-12|
+----------+

Here is the code that creates the above DataFrame:

import pyspark.sql.functions as f
rdd = sc.parallelize([("2017/11/25",), ("2017/12/21",), ("2017/09/12",)])
df = sqlContext.createDataFrame(rdd, ["date"]).withColumn("date", f.to_date(f.col("date"), "yyyy/MM/dd"))
df.show()

I want a new column with the first date of the month for each row; just replace the day with "01" in all the dates …
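A minimal sketch of one way to do this with Spark's built-in trunc function (written in Scala here; pyspark.sql.functions.trunc takes the same (column, format) arguments):

import org.apache.spark.sql.functions.{col, trunc}

// trunc(date, "month") replaces the day-of-month with 01.
val withFirstDay = df.withColumn("first_day_of_month", trunc(col("date"), "month"))
withFirstDay.show()
// For the sample rows created above this yields 2017-11-01, 2017-12-01 and 2017-09-01.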

Spark Window Functions - rangeBetween dates

岁酱吖の submitted on 2020-05-21 01:56:08
Question: I have a Spark SQL DataFrame with data, and I am trying to get all the rows preceding the current row within a given date range. For example, I want all the rows from the 7 days preceding a given row. I figured out that I need to use a window function like:

Window \
    .partitionBy('id') \
    .orderBy('start')

and here comes the problem: I want a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just …
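A commonly used workaround, sketched in Scala under the assumption that start is a date or timestamp column and that there is some value column to aggregate: order the window by the epoch seconds of start and express the range in seconds.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val days = 7
val w = Window
  .partitionBy("id")
  .orderBy(col("start").cast("timestamp").cast("long")) // order by epoch seconds
  .rangeBetween(-days * 86400L, 0L)                      // the 7 days up to and including the current row

// "value" is a placeholder column to aggregate within the window.
val result = df.withColumn("sum_last_7_days", sum(col("value")).over(w))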

How to store a JSON dataframe with comma-separated records

心已入冬 submitted on 2020-05-17 08:46:46
Question: I need to write the records of a dataframe to a JSON file. If I write the dataframe to the file it is stored like {"a":1} {"b":2}, but I want to write the dataframe like this: [{"a":1},{"b":2}]. Can you please help me? Thanks in advance.

Answer 1: Use the to_json function to create an array of JSON objects, then use .saveAsTextFile to save the JSON object. Example:

# sample dataframe
df = spark.createDataFrame([("a",1),("b",2)],["id","name"])
from pyspark.sql.functions import *
df.groupBy(lit("1")).\
    agg(collect_list …
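A sketch of how that idea can be completed (written in Scala rather than PySpark, and not the answerer's exact code): collect every row into a single array of structs, turn the array into one JSON string with to_json, and write that string out as text.

import org.apache.spark.sql.functions._

val df = spark.createDataFrame(Seq(("a", 1), ("b", 2))).toDF("id", "name")

val jsonArray = df
  .agg(collect_list(struct(col("id"), col("name"))).as("records")) // one row holding all records
  .select(to_json(col("records")).as("json"))                      // [{"id":"a","name":1},{"id":"b","name":2}]

jsonArray.write.text("/tmp/json_array_output") // placeholder output path; produces a single-line JSON array

Note that this funnels every record into a single row (and a single output file), so it is only sensible for data that comfortably fits in one task.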

How to initialize the spark shell with a specific user to save data to HDFS with Apache Spark

…衆ロ難τιáo~ submitted on 2020-05-17 07:10:14
Question: I'm using Ubuntu, and I'm using the Spark dependency in IntelliJ (when I enter spark in the shell I get: Command 'spark' not found, but can be installed with: ..). I have two users, amine and hadoop_amine (where the Hadoop HDFS is set up). When I try to save a dataframe to HDFS (Spark Scala):

procesed.write.format("json").save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")

I get this error:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: …
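A possible fix, sketched under the assumption that the cluster uses simple (non-Kerberos) authentication: tell the HDFS client to act as the hadoop_amine user, either by exporting HADOOP_USER_NAME=hadoop_amine before launching spark-shell / IntelliJ, or by setting it programmatically before the first HDFS access.

// Hedged sketch: with simple authentication the HDFS client reads the user name from the
// HADOOP_USER_NAME environment variable or system property, so setting it before Spark
// touches HDFS makes the write happen as hadoop_amine instead of the OS user amine.
System.setProperty("HADOOP_USER_NAME", "hadoop_amine")

procesed.write
  .format("json")
  .save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")

Another option is to grant the amine user access on the HDFS side, for example with hdfs dfs -chown -R or hdfs dfs -chmod on /mydata/enedis/POC while logged in as hadoop_amine.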

Is it possible to partition data based on a column which may sometimes contain an empty value? How should it be handled?

回眸只為那壹抹淺笑 submitted on 2020-05-17 07:02:09
Question: I am using spark-sql-2.4.1v. I need to join two datasets ds1 and ds2 (which supplies "new-column") based on some field:

val resultDs = ds1.join(ds2, , "inner");

resultDs now contains "new-column", but the records which did not match do not have this "new-column", so for those records I need to set "new-column" to null/empty. As per my business requirement, I need to partition resultDs on "new-column". How are these kinds of scenarios generally handled? Please advise. Source: https://stackoverflow.com
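A common way to handle this, sketched below with assumed column names (an inner join would drop the non-matching rows, so a left join is used so that they survive with a null "new-column"): replace the nulls with an explicit sentinel before writing, otherwise Spark places those rows under the __HIVE_DEFAULT_PARTITION__ directory.

import org.apache.spark.sql.functions._

// Assumed join key "id"; a left join keeps the ds1 rows that have no match in ds2.
val resultDs = ds1.join(ds2, Seq("id"), "left")
  .withColumn("new-column", coalesce(col("new-column"), lit("UNKNOWN"))) // sentinel for missing values

resultDs.write
  .mode("overwrite")
  .partitionBy("new-column")
  .parquet("/tmp/partitioned_output") // placeholder output path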

How to resolve com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast… Java Spark

雨燕双飞 submitted on 2020-05-17 06:31:05
Question: Hi, I am new to Java Spark and have been looking for solutions for a couple of days. I am loading MongoDB data into a Hive table, but saveAsTable fails with this error:

com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType(StructField(oid,StringType,true)) (value: BsonString{value='54d3e8aeda556106feba7fa2'})

I've tried increasing the sampleSize, different mongo-spark-connector versions, ... but none of them worked …
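This exception typically means the connector's schema sampling inferred one type for a field (here a struct with an oid sub-field) while some documents store a plain string. One possible workaround, sketched here with assumed field, collection and table names, is to skip inference and supply an explicit schema that declares the conflicting field as a string:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().getOrCreate()

// Declare the conflicting field as a plain string instead of letting sampling infer a struct.
val schema = StructType(Seq(
  StructField("_id", StringType, nullable = true),   // assumed conflicting field
  StructField("name", StringType, nullable = true),  // placeholder fields
  StructField("amount", DoubleType, nullable = true)
))

val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://localhost:27017/mydb.mycollection") // placeholder URI
  .schema(schema)
  .load()

df.write.mode("overwrite").saveAsTable("mydb.my_hive_table") // placeholder Hive table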

SaveAsTable in Spark Scala: HDP3.x

不羁岁月 submitted on 2020-05-17 06:08:08
Question: I have a dataframe in Spark that I'm saving to Hive as a table, but I am getting the error message below:

java.lang.RuntimeException: com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector does not allow create table as select. at scala.sys.package$.error(package.scala:27)

Can anyone please help me with how I should save this as a table in Hive?

val df3 = df1.join(df2, df1("inv_num") === df2("inv_num") // join both dataframes on the inv_num column
).withColumn("finalSalary", when(df1("salary") < df2("salary"), …
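As the error says, the HDP 3 Hive Warehouse Connector rejects create-table-as-select, which is what saveAsTable does. A hedged sketch of the usual alternative, with a placeholder database and table name, is to write through the connector's data source and name the target table explicitly:

// Sketch assuming the Hive Warehouse Connector jar and its configuration are already in place;
// "mydb.final_salary" is a placeholder table name.
df3.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .mode("append")
  .option("table", "mydb.final_salary")
  .save()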