pyspark

PySpark SparkSession Builder with Kubernetes Master

怎甘沉沦 submitted on 2020-02-19 04:03:41
Question: I recently saw a pull request that was merged into the apache/spark repository that apparently adds initial Python bindings for PySpark on K8s. I posted a comment on the PR asking how to use spark-on-k8s from a Python Jupyter notebook, and was told to ask my question here. My question is: is there a way to create a SparkContext using PySpark's SparkSession.Builder with master set to k8s://<...>:<...>, and have the resulting jobs run on spark-on-k8s instead of locally? E.g.:
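A minimal sketch of what such a builder call might look like, assuming a reachable Kubernetes API server and a Spark container image (the master URL, image name and service account below are placeholders, not values from the post):

    from pyspark.sql import SparkSession

    # Hypothetical cluster endpoint and image; replace with your own values.
    spark = (
        SparkSession.builder
        .master("k8s://https://kubernetes.example.com:6443")
        .appName("pyspark-on-k8s")
        .config("spark.kubernetes.container.image", "my-registry/spark-py:latest")
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
        .config("spark.executor.instances", "2")
        .getOrCreate()
    )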

reading a file in hdfs from pyspark

余生长醉 submitted on 2020-02-17 13:33:54
Question: I'm trying to read a file in my HDFS. Here's a listing of my Hadoop file structure:

    hduser@GVM:/usr/local/spark/bin$ hadoop fs -ls -R /
    drwxr-xr-x   - hduser supergroup          0 2016-03-06 17:28 /inputFiles
    drwxr-xr-x   - hduser supergroup          0 2016-03-06 17:31 /inputFiles/CountOfMonteCristo
    -rw-r--r--   1 hduser supergroup    2685300 2016-03-06 17:31 /inputFiles/CountOfMonteCristo/BookText.txt

Here's my pyspark code:

    from pyspark import SparkContext, SparkConf
    conf = SparkConf().setAppName("myFirstApp")
    …
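A short sketch of one way to read that file, assuming the NameNode listens on localhost:9000 (the host and port are assumptions; use the fs.defaultFS value from your core-site.xml):

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("myFirstApp")
    sc = SparkContext(conf=conf)

    # Fully qualified HDFS URI; adjust host/port to your cluster's NameNode.
    lines = sc.textFile("hdfs://localhost:9000/inputFiles/CountOfMonteCristo/BookText.txt")
    print(lines.count())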

How to build a sparkSession in Spark 2.0 using pyspark?

送分小仙女□ submitted on 2020-02-17 05:57:40
Question: I just got access to Spark 2.0; I have been using Spark 1.6.1 up until this point. Can someone please help me set up a SparkSession using pyspark (Python)? I know the Scala examples available online are similar (here), but I was hoping for a walkthrough directly in Python. My specific case: I am loading Avro files from S3 in a Zeppelin Spark notebook, then building DataFrames and running various pyspark and SQL queries off of them. All of my old queries use sqlContext. I know this is …
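A minimal sketch of the Spark 2.0 entry point; the Avro reader format and the S3 path below are illustrative (Spark 2.0 needs the external spark-avro package), not values taken from the question:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("avro-from-s3")
        .getOrCreate()
    )

    # In Spark 2.x the old sqlContext functionality hangs off the session itself.
    df = spark.read.format("com.databricks.spark.avro").load("s3a://my-bucket/my-prefix/")
    df.createOrReplaceTempView("events")
    spark.sql("SELECT COUNT(*) FROM events").show()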

get datatype of column using pyspark

允我心安 submitted on 2020-02-17 05:51:08
Question: We are reading data from a MongoDB collection. A collection column holds two different value types (e.g. (bson.Int64, int), (int, float)). I am trying to get the datatype using pyspark. My problem is that some columns have a different datatype. Assume quantity and weight are the columns:

    quantity           weight
    ---------          --------
    12300              656
    123566000000       789.6767
    1238               56.22
    345                23
    345566677777789    21

Actually, we didn't define a data type for any column of the Mongo collection. When I query the count from the pyspark dataframe …
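A small sketch of how a column's type can be inspected on a DataFrame; df, quantity and weight are simply the names from the example above:

    # dtypes returns (column name, type name) pairs for the whole DataFrame
    print(df.dtypes)                       # e.g. [('quantity', 'bigint'), ('weight', 'double')]

    # or look a single column up through the schema
    print(df.schema["quantity"].dataType)  # e.g. LongType
    print(dict(df.dtypes)["weight"])       # e.g. 'double'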

pyspark dataframe aggregate a column by sliding time window

和自甴很熟 submitted on 2020-02-16 13:16:07
Question: I would like to transform a column into multiple columns in a pyspark dataframe. The original dataframe:

    client_id   value1   name1   a_date
    dhd         589      ecdu    2020-1-1
    dhd         575      tygp    2020-1-1
    dhd         821      rdsr    2020-1-1
    dhd         872      rgvd    2019-12-10
    dhd         619      bhnd    2019-12-10
    dhd         781      prti    2019-12-10

UPDATE: the gap between the dates of two consecutive months may be less than 30 days; it is not fixed and can fall anywhere between 15 and 30 days, e.g. 2020-1-1 and 2019-12-18. There are …
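The excerpt is cut off above, but one common way to aggregate a column over time windows in pyspark is functions.window; here is a rough sketch that sums value1 per client over 30-day windows (the window length, the sum aggregate, and the yyyy-M-d date format are assumptions, not requirements stated in the question):

    from pyspark.sql import functions as F

    windowed = (
        df.withColumn("ts", F.to_timestamp("a_date", "yyyy-M-d"))
          .groupBy("client_id", F.window("ts", "30 days"))
          .agg(F.sum("value1").alias("value1_sum"))
    )
    windowed.show(truncate=False)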

While submitting a job with pyspark, how to access static files uploaded with the --files argument?

眉间皱痕 submitted on 2020-02-16 11:32:39
Question: For example, I have a folder:

    /
    - test.py
    - test.yml

and the job is submitted to the Spark cluster with:

    gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py"

In test.py, I want to access the static file I uploaded:

    with open('test.yml') as test_file:
        logging.info(test_file.read())

but I got the following exception:

    IOError: [Errno 2] No such file or directory: 'test.yml'

How can I access the file I uploaded?

Answer 1: Files distributed using SparkContext.addFile (and --files) can be accessed through SparkFiles.
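A small sketch of that SparkFiles approach (the file name matches the test.yml from the question; everything else here is illustrative):

    from pyspark import SparkFiles
    import logging

    # Resolve the local path of a file shipped with --files (or SparkContext.addFile)
    yml_path = SparkFiles.get("test.yml")
    with open(yml_path) as test_file:
        logging.info(test_file.read())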