pyspark

pyspark: how to show current directory?

吃可爱长大的小学妹 submitted on 2019-12-21 20:26:56
Question: Hi, I'm using pyspark interactively. I think I'm failing to load a LOCAL file correctly. How do I check the current directory, so that I can open a browser and take a look at the actual file? Or is the default directory the one pyspark was started from? Thanks.

Answer 1: You can't load a local file unless you have the same file on all workers under the same path. For example, if you want to read a data.csv file in Spark, copy this file to all workers under the same path (say /tmp/data.csv). Now you can use sc.textFile("file:///tmp/data.csv").
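Below is a minimal sketch of the answer, assuming an interactive pyspark shell where the SparkContext is already available as sc; /tmp/data.csv is the illustrative path from the answer:

    import os

    # The driver's current working directory -- normally the directory
    # from which the pyspark shell was started.
    print(os.getcwd())

    # Read a local file with an explicit file:// URI; the file must exist
    # at this path on every worker (or you must be running in local mode).
    rdd = sc.textFile("file:///tmp/data.csv")
    print(rdd.take(5))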

Can I convert pandas dataframe to spark rdd?

心不动则不痛 submitted on 2019-12-21 20:24:50
Question: Problem: (a) read a local file into a pandas DataFrame, say PD_DF; (b) manipulate/massage PD_DF and add columns to the dataframe; (c) write PD_DF to HDFS using Spark. How do I do it?

Answer 1: You can use the SQLContext object to invoke the createDataFrame method, which takes input data that can optionally be a pandas DataFrame object.

Answer 2: Let's say the dataframe is of type pandas.core.frame.DataFrame; then in Spark 2.1 / PySpark I did this:

    rdd_data = spark.createDataFrame(dataframe) \
        .rdd

In case, if…
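A hedged sketch combining the two answers, assuming Spark 2.x with a SparkSession named spark; the column names and the HDFS output path are made up for illustration:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

    # (a) + (b): build and massage a pandas DataFrame locally.
    pd_df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
    pd_df["doubled"] = pd_df["value"] * 2

    # Convert to a Spark DataFrame; .rdd exposes it as an RDD of Row objects.
    spark_df = spark.createDataFrame(pd_df)
    rdd_data = spark_df.rdd

    # (c): write to HDFS (Parquet here; the path is illustrative).
    spark_df.write.mode("overwrite").parquet("hdfs:///tmp/pd_df_parquet")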

How to pass params to an ML Pipeline.fit method?

帅比萌擦擦* submitted on 2019-12-21 20:20:07
Question: I am trying to build a clustering mechanism using Google Dataproc + Spark, Google BigQuery, and a job using Spark ML KMeans + Pipeline, as follows:

1. Create a user-level feature table in BigQuery. Example of how the feature table looks:

    userid | x1   | x2 | x3 | x4 | x5 | x6 | x7 | x8   | x9   | x10
    00013  | 0.01 | 0  | 0  | 0  | 0  | 0  | 0  | 0.06 | 0.09 | 0.001

2. Spin up a cluster with default settings; I am using the gcloud command line interface to create the cluster and run jobs as shown here.

3. Using the starter code provided, I…
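A minimal sketch of the Spark ML part only (the BigQuery read and Dataproc setup are omitted), assuming the feature table has already been loaded into a DataFrame named features_df with columns x1..x10; k and the seed are illustrative:

    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    feature_cols = ["x{}".format(i) for i in range(1, 11)]

    # Assemble the per-user feature columns into one vector column,
    # then cluster with KMeans inside a Pipeline.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    kmeans = KMeans(featuresCol="features", predictionCol="cluster", k=4, seed=42)
    pipeline = Pipeline(stages=[assembler, kmeans])

    model = pipeline.fit(features_df)

    # Params can also be passed to fit() as a param map that overrides the
    # values embedded in the stages, e.g. a different k:
    model_k8 = pipeline.fit(features_df, {kmeans.k: 8})

    clustered = model.transform(features_df)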

How to overwrite data with PySpark's JDBC without losing schema?

不羁岁月 submitted on 2019-12-21 20:17:10
Question: I have a DataFrame that I want to write to a PostgreSQL database. If I simply use the "overwrite" mode, like:

    df.write.jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)

the table is recreated and the data is saved. But the problem is that I'd like to keep the PRIMARY KEY and indexes on the table. So, I'd like to either overwrite only the data, keeping the table schema, or to add the primary key constraint and indexes afterward. Can either…
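One common way to keep the table definition (and therefore the primary key and indexes) is the JDBC truncate option, which makes overwrite mode issue TRUNCATE TABLE instead of dropping and recreating the table; a sketch reusing the placeholders from the question:

    # With truncate=true, "overwrite" truncates the existing table instead of
    # dropping it, so constraints and indexes defined in PostgreSQL survive.
    (df.write
       .option("truncate", "true")
       .jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite",
             properties=DATABASE_PROPERTIES))

Note that this assumes the incoming DataFrame still matches the existing table's columns, since the table itself is no longer recreated.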

How to create new DataFrame with dict

坚强是说给别人听的谎言 submitted on 2019-12-21 18:57:13
Question: I had one dict, like:

    cMap = {"k1": "v1", "k2": "v1", "k3": "v2", "k4": "v2"}

and one DataFrame A, like:

    +---+
    |key|
    +---+
    | k1|
    | k2|
    | k3|
    | k4|
    +---+

created with this code:

    data = [('k1',), ('k2',), ('k3',), ('k4',)]
    A = spark.createDataFrame(data, ['key'])

I want to get a new DataFrame, like:

    +---+----------+----------+
    |key|    v1    |    v2    |
    +---+----------+----------+
    | k1|true      |false     |
    | k2|true      |false     |
    | k3|false     |true      |
    | k4|false     |true      |
    +---+----------+----------+

I…
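A minimal sketch of one way to build that result, adding one boolean column per distinct value in cMap with isin; names follow the question, everything else is illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dict-to-columns").getOrCreate()

    cMap = {"k1": "v1", "k2": "v1", "k3": "v2", "k4": "v2"}
    A = spark.createDataFrame([(k,) for k in ["k1", "k2", "k3", "k4"]], ["key"])

    # One boolean column per distinct dict value: true where the key maps to it.
    result = A
    for v in sorted(set(cMap.values())):
        keys_for_v = [k for k, val in cMap.items() if val == v]
        result = result.withColumn(v, F.col("key").isin(keys_for_v))

    result.show()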

How to reference a dataframe when in a UDF on another dataframe?

会有一股神秘感。 submitted on 2019-12-21 18:10:12
Question: How do you reference a pyspark dataframe during the execution of a UDF on another dataframe? Here's a dummy example. I am creating two dataframes, scores and lastnames, and within each lies a column that is the same across the two dataframes. In the UDF applied on scores, I want to filter on lastnames and return a string found in lastname.

    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *

    sc = SparkContext(…
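Since the question's code is cut off, here is a hedged sketch of the usual workaround: a UDF running on the workers cannot query a second DataFrame, so the lookup is expressed as a (broadcast) join instead; the player_id column and the sample rows are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("udf-lookup").getOrCreate()

    # Both DataFrames share the "player_id" column (the column the question
    # says is the same across the two dataframes).
    scores = spark.createDataFrame([(1, 42.0), (2, 37.5)], ["player_id", "score"])
    lastnames = spark.createDataFrame([(1, "Smith"), (2, "Jones")],
                                      ["player_id", "lastname"])

    # Instead of filtering `lastnames` inside a UDF applied to `scores`,
    # broadcast the small DataFrame and join on the shared column.
    result = scores.join(F.broadcast(lastnames), on="player_id", how="left")
    result.show()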

SparkContext.getOrCreate() purpose

落花浮王杯 submitted on 2019-12-21 17:56:25
Question: What is the purpose of the getOrCreate method of the SparkContext class? I don't understand when we should use this method. If I have two Spark applications that are run with spark-submit, and in the main method I instantiate the Spark context with SparkContext.getOrCreate, will both apps have the same context? Or is the purpose simpler: when I create a Spark app and don't want to pass the Spark context as a parameter to a method, I can just get it as a singleton…
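A small sketch of the singleton behaviour within a single application (two applications launched with spark-submit run in separate processes and never share a context); the app name and helper function are illustrative:

    from pyspark import SparkConf, SparkContext

    def get_context():
        # Returns the SparkContext already running in this process, or
        # creates one from the given configuration if none exists yet.
        conf = SparkConf().setAppName("getOrCreate-demo")
        return SparkContext.getOrCreate(conf)

    sc1 = get_context()
    sc2 = get_context()   # no second context is created
    print(sc1 is sc2)     # True: the same singleton within this application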

E-num / get Dummies in pyspark

雨燕双飞 submitted on 2019-12-21 17:52:09
Question: I would like to create a function in PySpark that gets a DataFrame and a list of parameters (codes/categorical features) and returns the data frame with additional dummy columns for the categories of the features in the list. PFA the before-and-after DFs (image: "before and After data frame - Example"). The code in Python (pandas) looks like this:

    enum = ['column1', 'column2']
    for e in enum:
        print(e)
        temp = pd.get_dummies(data[e], drop_first=True, prefix=e)
        data = pd.concat([data, temp], axis=1)
        data.drop(e, axis=1, inplace=True)
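A rough PySpark equivalent of the pandas loop, sketched with when/otherwise to create one 0/1 indicator column per category (the first category is skipped to mimic drop_first=True); the sample data and column names are invented:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("get-dummies").getOrCreate()

    data = spark.createDataFrame(
        [("a", "x", 1.0), ("b", "y", 2.0), ("a", "y", 3.0)],
        ["column1", "column2", "value"])

    enum = ["column1", "column2"]
    for e in enum:
        # One indicator column per category, prefixed like pd.get_dummies;
        # skip the first sorted category to mirror drop_first=True.
        categories = sorted(r[0] for r in data.select(e).distinct().collect())
        for cat in categories[1:]:
            data = data.withColumn("{}_{}".format(e, cat),
                                   F.when(F.col(e) == cat, 1).otherwise(0))
        data = data.drop(e)

    data.show()

Spark ML's StringIndexer plus OneHotEncoder is the more scalable route when the dummies feed a model, but the loop above keeps plain, named columns the way get_dummies does.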