pyspark

How to correctly set the Python version in Spark?

a 夏天 submitted on 2020-05-17 07:52:17

Question: My Spark version is 2.4.0, and the cluster has both Python 2.7 and Python 3.7; the default is Python 2.7. I want to submit a PySpark program that uses Python 3.7. I tried two ways, but neither of them works.

spark2-submit --master yarn \
  --conf "spark.pyspark.python=/usr/bin/python3" \
  --conf "spark.pyspark.driver.python=/usr/bin/python3" \
  pi.py

It does not work and says: Cannot run program "/usr/bin/python3": error=13, Permission denied. But I actually do have permission; for example, I can use …
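
As a point of reference, here is a small hypothetical check script that could be submitted with the same spark2-submit flags shown above; it only confirms which interpreter the driver and the executors actually run, and does not by itself fix the error=13 (which often indicates that /usr/bin/python3 is not executable by the user the YARN containers run as).

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("python3-check").getOrCreate()
sc = spark.sparkContext

# Interpreter used on the driver side.
print("driver:", sys.version.split()[0], sys.executable)

# Interpreter used on the executor side: a tiny job that reports sys.version
# from each partition and collects the distinct values.
print("executors:", sc.parallelize(range(4), 2)
      .map(lambda _: sys.version.split()[0])
      .distinct()
      .collect())

# The Python version PySpark itself recorded at startup, e.g. '3.7'.
print("pythonVer:", sc.pythonVer)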

How to initialize the Spark shell with a specific user to save data to HDFS with Apache Spark

…衆ロ難τιáo~ submitted on 2020-05-17 07:10:14

Question: I'm using Ubuntu, with the Spark dependency through IntelliJ (entering spark in a shell gives "Command 'spark' not found, but can be installed with: .."). I have two users, amine and hadoop_amine (the one Hadoop HDFS is set up under). When I try to save a DataFrame to HDFS (Spark Scala):

procesed.write.format("json").save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")

I get this error:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: …
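
The question uses the Scala shell, but to stay in one language for this listing, here is a minimal PySpark sketch of one commonly used approach: setting HADOOP_USER_NAME so that, under simple (non-Kerberos) authentication, the HDFS client identifies as the user that owns the target directory. The user name hadoop_amine and the hdfs:// URI are taken from the question; adjusting HDFS permissions with hdfs dfs -chown/-chmod is the other usual route, and this sketch is an assumption, not the accepted fix.

import os
from pyspark.sql import SparkSession

# Must be set before the SparkSession (and its JVM) starts, because the
# Hadoop client reads HADOOP_USER_NAME when it initializes.
os.environ["HADOOP_USER_NAME"] = "hadoop_amine"

spark = SparkSession.builder.appName("write-as-hdfs-user").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Same style of call as in the question, now issued as hadoop_amine.
df.write.format("json").save("hdfs://localhost:54310/mydata/enedis/POC/processed.json")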

Drop partition columns when writing parquet in pyspark

≯℡__Kan透↙ submitted on 2020-05-17 07:07:14

Question: I have a DataFrame with a date column, which I have parsed into year, month, and day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files. Here is my approach to partitioning and writing the data:

df = df.withColumn('year', f.year(f.col('date_col'))) \
       .withColumn('month', f.month(f.col('date_col'))) \
       .withColumn('day', f.dayofmonth(f.col('date_col')))
df.write.partitionBy('year', 'month', 'day').parquet('/mnt/test/test.parquet')

This properly creates …
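
For reference, a self-contained sketch of the same write pattern, with a small in-memory DataFrame and a /tmp path standing in for the /mnt/test mount from the question. With partitionBy, Spark encodes year/month/day in the directory names (year=2020/month=5/day=17) rather than inside the individual parquet data files; whether the columns show up again depends on whether you read back the root path or a single leaf directory.

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.appName("partition-write-sketch").getOrCreate()

df = spark.createDataFrame([("2020-05-17", 1.0), ("2020-06-01", 2.0)],
                           ["date_col", "value"])
df = (df.withColumn("date_col", f.to_date("date_col"))
        .withColumn("year", f.year("date_col"))
        .withColumn("month", f.month("date_col"))
        .withColumn("day", f.dayofmonth("date_col")))

# Partition columns become directory names, not columns in the data files.
df.write.mode("overwrite").partitionBy("year", "month", "day") \
  .parquet("/tmp/test_partitioned.parquet")

# Reading the root path re-derives year/month/day from the directory names ...
spark.read.parquet("/tmp/test_partitioned.parquet").printSchema()
# ... while reading a single leaf directory does not include them.
spark.read.parquet("/tmp/test_partitioned.parquet/year=2020/month=5/day=17").printSchema()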

PySpark Dataframe Performance Tuning

北战南征 submitted on 2020-05-17 06:25:07

Question: I am trying to consolidate some scripts so that we do one read of the database rather than every script reading the same data from Hive, i.e. moving to a read-once, process-many model. I've persisted the DataFrames and repartitioned the output after each aggregation, but I need it to be faster; if anything, those changes have slowed things down. We have 20 TB+ of data per day, so I had assumed that persisting data that is going to be read many times would make things faster, but it hasn't. Also, I have …
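
As a reference point for the read-once, process-many pattern being described, here is a minimal hypothetical sketch; the in-memory DataFrame stands in for the expensive Hive read, and the column names and output paths are illustrative only. The choices that usually matter are the storage level (MEMORY_AND_DISK spills instead of recomputing when the cache does not fit) and materializing the cache with one action before the downstream jobs fan out, then unpersisting when they finish.

from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.appName("read-once-sketch").getOrCreate()

# Stand-in for the single expensive read, e.g. spark.table("db.events").
base = spark.createDataFrame(
    [("2020-05-17", "u1", 10.0), ("2020-05-17", "u2", 5.0), ("2020-05-18", "u1", 7.0)],
    ["event_date", "user_id", "amount"],
)

# Cache once, spilling to disk if it does not fit in memory, and materialize
# with an action before the downstream aggregations run.
base = base.persist(StorageLevel.MEMORY_AND_DISK)
base.count()

daily = base.groupBy("event_date").agg(f.count("*").alias("rows"))
by_user = base.groupBy("user_id").agg(f.sum("amount").alias("total"))

daily.write.mode("overwrite").parquet("/tmp/daily_counts")
by_user.write.mode("overwrite").parquet("/tmp/user_totals")

base.unpersist()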

What's the difference between RDD and Dataframe in Spark? [duplicate]

我是研究僧i submitted on 2020-05-17 06:09:38

Question: This question already has answers here: Difference between DataFrame, Dataset, and RDD in Spark (15 answers). Closed 9 months ago. Hi, I am relatively new to Apache Spark. I want to understand the difference between RDD, DataFrame, and Dataset. For example, I am pulling data from an S3 bucket:

df = spark.read.parquet("s3://output/unattributedunattributed*")

In this case, when I am loading data from S3, what would the RDD be? Also, since an RDD is immutable but I can change the value of df, does that mean df can't be an RDD? …
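
A small sketch of the relationship, with a local DataFrame standing in for the S3 read so it runs anywhere: a DataFrame is backed by an RDD of Row objects reachable through df.rdd, and "changing" df only rebinds the Python name to a new, equally immutable DataFrame; no existing data is mutated.

from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.appName("rdd-vs-dataframe-sketch").getOrCreate()

# Stand-in for spark.read.parquet("s3://output/unattributedunattributed*").
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# The DataFrame sits on top of an RDD of Row objects.
rows_rdd = df.rdd
print(type(rows_rdd))      # <class 'pyspark.rdd.RDD'>
print(rows_rdd.take(1))    # [Row(id=1, value='a')]

# withColumn returns a *new* DataFrame; reassigning df would just point the
# name at that new object, leaving the original untouched.
df2 = df.withColumn("id_plus_one", f.col("id") + 1)
print(df is df2)           # False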

pyspark: text file is read but data frame is showing an error

萝らか妹 submitted on 2020-05-17 05:57:34

Question: I am trying to read a local text file into a PySpark DataFrame with:

df = spark.read.text("file:///<path>")

This runs successfully and gives back a DataFrame. df.printSchema() outputs:

root
 |-- value: string (nullable = true)

But when I try to access df, it gives the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 350, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr …
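
For comparison, here is a self-contained sketch of the same read in local mode, where the file:// path is visible to both the driver and the executors. On a multi-node cluster the file must exist at the same path on every worker node, which is one common reason printSchema() succeeds while show() fails; that is offered as a hedge, not a diagnosis of the exact traceback above.

import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("read-text-sketch").getOrCreate()

# Create a small local file so the example is self-contained.
path = os.path.join(tempfile.gettempdir(), "sample.txt")
with open(path, "w") as fh:
    fh.write("first line\nsecond line\n")

df = spark.read.text("file://" + path)
df.printSchema()            # root |-- value: string (nullable = true)
df.show(truncate=False)     # the action is where a bad path would fail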

In Spark, how to write a header to a file if there are no rows in a DataFrame?

懵懂的女人 submitted on 2020-05-16 04:36:16

Question: I want to write a header to a file if there are no rows in the DataFrame. Currently, when I write an empty DataFrame to a file, the file is created but it does not have a header in it. I am writing the DataFrame using these settings and this command:

Dataframe.repartition(1) \
    .write \
    .format("com.databricks.spark.csv") \
    .option("ignoreLeadingWhiteSpace", False) \
    .option("ignoreTrailingWhiteSpace", False) \
    .option("header", "true") \
    .save('/mnt/Bilal/Dataframe');

I want the header row in the file, even if …
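
One workaround that is sometimes used for this case (sketched here as an assumption, not as the accepted answer): detect that the DataFrame is empty and write the header line yourself from the driver with plain file I/O. The output directory below is a hypothetical local stand-in for the /mnt/Bilal/Dataframe mount, which may not be writable this way if it is a DBFS path.

import os
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("empty-csv-header-sketch").getOrCreate()

schema = StructType([StructField("name", StringType()), StructField("age", IntegerType())])
df = spark.createDataFrame([], schema)   # empty, but with a known schema
out_dir = "/tmp/Bilal_Dataframe"         # hypothetical local output directory

if df.rdd.isEmpty():
    # The CSV writer may produce no header at all for an empty DataFrame,
    # so emit a single header-only file manually from the driver.
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "part-00000.csv"), "w") as fh:
        fh.write(",".join(df.columns) + "\n")
else:
    df.repartition(1).write.option("header", "true").csv(out_dir)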

PySpark MLLib Random Forest classifier repeatability issue

℡╲_俬逩灬. submitted on 2020-05-16 01:31:21

Question: I am running into a situation where I have no clue what is going on with the PySpark Random Forest classifier. I want the model to be reproducible given the same training data. To do so, I set the seed parameter to an integer value, as recommended on this page: https://spark.apache.org/docs/2.4.1/api/java/org/apache/spark/mllib/tree/RandomForest.html. This seed parameter is the random seed for bootstrapping and choosing feature subsets. Now, I verified the models and they are absolutely …
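
For reference, a minimal sketch of where that seed goes in the pyspark.mllib API; the tiny hand-made training set is purely illustrative. With a fixed seed and identical, identically partitioned input, two trainClassifier calls are expected to produce the same forest; how the input RDD is partitioned can still influence the sampling, which is a hedge worth keeping in mind.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import RandomForest

sc = SparkContext.getOrCreate()

# Tiny illustrative training set: label 1.0 when the features are "large".
data = sc.parallelize([
    LabeledPoint(0.0, [0.1, 0.2]),
    LabeledPoint(0.0, [0.2, 0.1]),
    LabeledPoint(1.0, [0.9, 0.8]),
    LabeledPoint(1.0, [0.8, 0.9]),
])

# seed fixes the randomness used for bootstrapping and feature-subset choice.
model_a = RandomForest.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
                                       numTrees=3, seed=42)
model_b = RandomForest.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
                                       numTrees=3, seed=42)

# Comparing the debug strings is a crude but simple reproducibility check.
print(model_a.toDebugString() == model_b.toDebugString())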
