pyspark

pyspark: how to show current directory?

吃可爱长大的小学妹 submitted on 2019-12-21 20:26:56
Question: Hi, I'm using pyspark interactively. I think I'm failing to load a LOCAL file correctly. How do I check the current directory, so that I can open a browser and take a look at the actual file? Or is the default directory the one pyspark was started from? Thanks.

Answer 1: You can't load a local file unless you have the same file on all workers under the same path. For example, if you want to read a data.csv file in Spark, copy this file to all workers under the same path (say /tmp/data.csv). Now you can use sc.textFile("file:///tmp/data.csv").
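Below is a minimal sketch of the answer, assuming an interactive pyspark shell where the SparkContext is already available as sc; /tmp/data.csv is the illustrative path from the answer:

    import os

    # The driver's current working directory -- normally the directory
    # from which the pyspark shell was started.
    print(os.getcwd())

    # Read a local file with an explicit file:// URI; the file must exist
    # at this path on every worker (or you must be running in local mode).
    rdd = sc.textFile("file:///tmp/data.csv")
    print(rdd.take(5))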

Can I convert pandas dataframe to spark rdd?

心不动则不痛 submitted on 2019-12-21 20:24:50
Question: Problem: (a) read a local file into a pandas DataFrame, say PD_DF; (b) manipulate/massage PD_DF and add columns to the dataframe; (c) write PD_DF to HDFS using Spark. How do I do it?

Answer 1: You can use the SQLContext object to invoke the createDataFrame method, which takes input data that can optionally be a pandas DataFrame object.

Answer 2: Let's say the dataframe is of type pandas.core.frame.DataFrame; then in Spark 2.1 / PySpark I did this:

    rdd_data = spark.createDataFrame(dataframe) \
        .rdd

In case, if…
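A hedged sketch combining the two answers, assuming Spark 2.x with a SparkSession named spark; the column names and the HDFS output path are made up for illustration:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

    # (a) + (b): build and massage a pandas DataFrame locally.
    pd_df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
    pd_df["doubled"] = pd_df["value"] * 2

    # Convert to a Spark DataFrame; .rdd exposes it as an RDD of Row objects.
    spark_df = spark.createDataFrame(pd_df)
    rdd_data = spark_df.rdd

    # (c): write to HDFS (Parquet here; the path is illustrative).
    spark_df.write.mode("overwrite").parquet("hdfs:///tmp/pd_df_parquet")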

How to pass params to an ML Pipeline.fit method?

帅比萌擦擦* submitted on 2019-12-21 20:20:07
Question: I am trying to build a clustering mechanism using Google Dataproc + Spark, Google BigQuery, and a job using Spark ML KMeans + Pipeline, as follows:

1. Create a user-level feature table in BigQuery. Example of how the feature table looks:

    userid | x1   | x2 | x3 | x4 | x5 | x6 | x7 | x8   | x9   | x10
    00013  | 0.01 | 0  | 0  | 0  | 0  | 0  | 0  | 0.06 | 0.09 | 0.001

2. Spin up a cluster with default settings; I am using the gcloud command line interface to create the cluster and run jobs as shown here.

3. Using the starter code provided, I…
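A minimal sketch of the Spark ML part only (the BigQuery read and Dataproc setup are omitted), assuming the feature table has already been loaded into a DataFrame named features_df with columns x1..x10; k and the seed are illustrative:

    from pyspark.ml import Pipeline
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    feature_cols = ["x{}".format(i) for i in range(1, 11)]

    # Assemble the per-user feature columns into one vector column,
    # then cluster with KMeans inside a Pipeline.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    kmeans = KMeans(featuresCol="features", predictionCol="cluster", k=4, seed=42)
    pipeline = Pipeline(stages=[assembler, kmeans])

    model = pipeline.fit(features_df)

    # Params can also be passed to fit() as a param map that overrides the
    # values embedded in the stages, e.g. a different k:
    model_k8 = pipeline.fit(features_df, {kmeans.k: 8})

    clustered = model.transform(features_df)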

How to overwrite data with PySpark's JDBC without losing schema?

不羁岁月 submitted on 2019-12-21 20:17:10
Question: I have a DataFrame that I want to write to a PostgreSQL database. If I simply use the "overwrite" mode, like:

    df.write.jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)

the table is recreated and the data is saved. But the problem is that I'd like to keep the PRIMARY KEY and indexes on the table. So, I'd like to either overwrite only the data, keeping the table schema, or to add the primary key constraint and indexes afterward. Can either…
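One common way to keep the table definition (and therefore the primary key and indexes) is the JDBC truncate option, which makes overwrite mode issue TRUNCATE TABLE instead of dropping and recreating the table; a sketch reusing the placeholders from the question:

    # With truncate=true, "overwrite" truncates the existing table instead of
    # dropping it, so constraints and indexes defined in PostgreSQL survive.
    (df.write
       .option("truncate", "true")
       .jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite",
             properties=DATABASE_PROPERTIES))

Note that this assumes the incoming DataFrame still matches the existing table's columns, since the table itself is no longer recreated.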

How to create new DataFrame with dict

坚强是说给别人听的谎言 submitted on 2019-12-21 18:57:13
Question: I had one dict, like:

    cMap = {"k1": "v1", "k2": "v1", "k3": "v2", "k4": "v2"}

and one DataFrame A, like:

    +---+
    |key|
    +---+
    | k1|
    | k2|
    | k3|
    | k4|
    +---+

created with this code:

    data = [('k1',), ('k2',), ('k3',), ('k4',)]
    A = spark.createDataFrame(data, ['key'])

I want to get a new DataFrame, like:

    +---+----------+----------+
    |key|    v1    |    v2    |
    +---+----------+----------+
    | k1|true      |false     |
    | k2|true      |false     |
    | k3|false     |true      |
    | k4|false     |true      |
    +---+----------+----------+

I…
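A minimal sketch of one way to build that result, adding one boolean column per distinct value in cMap with isin; names follow the question, everything else is illustrative:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dict-to-columns").getOrCreate()

    cMap = {"k1": "v1", "k2": "v1", "k3": "v2", "k4": "v2"}
    A = spark.createDataFrame([(k,) for k in ["k1", "k2", "k3", "k4"]], ["key"])

    # One boolean column per distinct dict value: true where the key maps to it.
    result = A
    for v in sorted(set(cMap.values())):
        keys_for_v = [k for k, val in cMap.items() if val == v]
        result = result.withColumn(v, F.col("key").isin(keys_for_v))

    result.show()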

How to reference a dataframe when in a UDF on another dataframe?

会有一股神秘感。 submitted on 2019-12-21 18:10:12
Question: How do you reference a pyspark dataframe during the execution of a UDF on another dataframe? Here's a dummy example. I am creating two dataframes, scores and lastnames, and within each lies a column that is the same across the two dataframes. In the UDF applied on scores, I want to filter on lastnames and return a string found in lastname.

    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *

    sc = SparkContext(…
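Since the question's code is cut off, here is a hedged sketch of the usual workaround: a UDF running on the workers cannot query a second DataFrame, so the lookup is expressed as a (broadcast) join instead; the player_id column and the sample rows are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("udf-lookup").getOrCreate()

    # Both DataFrames share the "player_id" column (the column the question
    # says is the same across the two dataframes).
    scores = spark.createDataFrame([(1, 42.0), (2, 37.5)], ["player_id", "score"])
    lastnames = spark.createDataFrame([(1, "Smith"), (2, "Jones")],
                                      ["player_id", "lastname"])

    # Instead of filtering `lastnames` inside a UDF applied to `scores`,
    # broadcast the small DataFrame and join on the shared column.
    result = scores.join(F.broadcast(lastnames), on="player_id", how="left")
    result.show()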

SparkContext.getOrCreate() purpose

落花浮王杯 submitted on 2019-12-21 17:56:25
Question: What is the purpose of the getOrCreate method of the SparkContext class? I don't understand when we should use this method. If I have two Spark applications that are run with spark-submit, and in the main method I instantiate the Spark context with SparkContext.getOrCreate, will both apps have the same context? Or is the purpose simpler: when I create a Spark app and don't want to pass the Spark context as a parameter to a method, I can just get it as a singleton…
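A small sketch of the singleton behaviour within a single application (two applications launched with spark-submit run in separate processes and never share a context); the app name and helper function are illustrative:

    from pyspark import SparkConf, SparkContext

    def get_context():
        # Returns the SparkContext already running in this process, or
        # creates one from the given configuration if none exists yet.
        conf = SparkConf().setAppName("getOrCreate-demo")
        return SparkContext.getOrCreate(conf)

    sc1 = get_context()
    sc2 = get_context()   # no second context is created
    print(sc1 is sc2)     # True: the same singleton within this application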

E-num / get Dummies in pyspark

雨燕双飞 submitted on 2019-12-21 17:52:09
Question: I would like to create a function in PySpark that gets a DataFrame and a list of parameters (codes/categorical features) and returns the data frame with additional dummy columns for the categories of the features in the list. PFA the before-and-after DFs (image: "before and After data frame - Example"). The code in Python (pandas) looks like this:

    enum = ['column1', 'column2']
    for e in enum:
        print(e)
        temp = pd.get_dummies(data[e], drop_first=True, prefix=e)
        data = pd.concat([data, temp], axis=1)
        data.drop(e, axis=1, inplace=True)
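A rough PySpark equivalent of the pandas loop, sketched with when/otherwise to create one 0/1 indicator column per category (the first category is skipped to mimic drop_first=True); the sample data and column names are invented:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("get-dummies").getOrCreate()

    data = spark.createDataFrame(
        [("a", "x", 1.0), ("b", "y", 2.0), ("a", "y", 3.0)],
        ["column1", "column2", "value"])

    enum = ["column1", "column2"]
    for e in enum:
        # One indicator column per category, prefixed like pd.get_dummies;
        # skip the first sorted category to mirror drop_first=True.
        categories = sorted(r[0] for r in data.select(e).distinct().collect())
        for cat in categories[1:]:
            data = data.withColumn("{}_{}".format(e, cat),
                                   F.when(F.col(e) == cat, 1).otherwise(0))
        data = data.drop(e)

    data.show()

Spark ML's StringIndexer plus OneHotEncoder is the more scalable route when the dummies feed a model, but the loop above keeps plain, named columns the way get_dummies does.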